In this post we will discuss two prevailing expert views on Data Science that may interest newcomers to the field. According to the first approach, the more data is obtained and processed, the more successful the extraction and analytics results; therefore primary raw data has the greatest value. The second approach holds that collecting too much data makes it harder to extract what is actually needed, so only data that fits the processing requirements should be selected. Consequently, a strict filtering and cleaning process should be organized as soon as data is received. This process is known as ETL: Extract, Transform, Load.
The followers of the second approach clearly outnumber the supporters of the first, who are often dismissed as Big Data geeks. In such disputes, Occam's razor is frequently invoked against Big Data: entities should not be multiplied beyond necessity. However, let us consider an example of a Big Data project. Smart grids have already become common in many countries, and tens of millions of smart meters send their current readings to analytics and energy-management centers. Data from each meter travels over packet networks built from numerous channel types, such as power-line modems, WiFi, and optical lines. Several protocols are used for collection and transfer, and the data is eventually converted to a format suitable for loading raw data into the storage system, often called a "Data Lake" (or even a "Data Sea"), which various processing systems can access. Each meter generates data at fixed intervals and transmits it over the network, creating a data flow from each source with a distinct address.
However, some flows can be non-equidistant: part of the data transmitted within a given period may be corrupted or lost. When designing the data collection system, the developer decided to detect such out-of-range data and replace it on the fly with the value of the previously received packet. This approach corresponds to an accepted data-cleaning method for time series, essentially a forward fill (often discussed alongside interpolation). Moreover, the billing system worked well with such a "lake", and the network load-management system also operated correctly on the cleaned data. Corrupted or missing data, which the initial design marked with the special value NaN, was excluded entirely as redundant.
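A minimal sketch of this on-the-fly cleaning, assuming a lost or corrupted reading is represented as None (the function name and sample values are illustrative, not from the actual system):

```python
def forward_fill(readings, default=0.0):
    """Replace each missing reading with the last valid value received."""
    cleaned = []
    last = default
    for value in readings:
        if value is None:          # packet lost or corrupted
            cleaned.append(last)   # repeat the previous reading
        else:
            cleaned.append(value)
            last = value
    return cleaned

raw = [10.2, 10.4, None, None, 10.9, None, 11.1]
print(forward_fill(raw))  # gaps filled with the prior reading
```

Note that after this step the flow looks perfectly continuous, which is exactly why the downstream systems in the story could no longer see the gaps.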
Over time, however, some meters started transmitting strange flows with unchanging values over long intervals. A thorough study showed that this was caused by communication failures at those meters, so channel-monitoring tools were added to the system. When the design of this subsystem was nearly complete, an analyst asked: why build a separate technical system at all? Remove the primary flow-cleaning stage and restore the NaN values from the original design. Then the frequency of NaN values in each flow becomes a quality metric for the communication channel to the corresponding device, and no additional subsystem is required. Thus the technical-monitoring departments started using data that had been considered unnecessary and destined for filtering. The key lesson from this case supports the opinion of the Big Data evangelists: there is no such thing as too much data. I would add: there is only such a thing as untimely data.
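The analyst's idea can be sketched as follows: keep the NaN markers and compute their frequency per flow as a channel-quality indicator (the meter IDs and readings are made-up examples):

```python
import math

def nan_rate(flow):
    """Fraction of missing (NaN) readings in a flow; higher means a worse channel."""
    missing = sum(1 for v in flow if isinstance(v, float) and math.isnan(v))
    return missing / len(flow)

# Illustrative flows with gaps preserved as NaN instead of forward-filled:
flows = {
    "meter_001": [10.2, float("nan"), 10.5, 10.6],
    "meter_002": [9.8, float("nan"), float("nan"), float("nan")],
}
for meter_id, flow in flows.items():
    print(meter_id, nan_rate(flow))
```

No separate monitoring subsystem is needed: the quality metric falls out of the raw data that the original design had filtered away.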
At this point I am usually interrupted with a question about the cost of collecting and storing data: if data is untimely, shouldn't it be discarded as a source of extra expense? The argument seems strong, since any data-processing project has a cost constraint. However, when designing Big Data systems we should not focus on the maximum size of the warehouse. At initial deployment the volume may be quite small, but during operation it will grow far beyond the initial figure. Therefore the project should be based not on absolute storage costs but on the initial cost and the cost-scaling coefficient, in $/GB. This coefficient is normally lowest for cloud storage, compared with purchasing equipment to extend the initial SAN (Storage Area Network). Only solutions based on distributed systems built from commodity computers are competitive with it. This is probably why cheap clusters running Apache Hadoop with the Hadoop Distributed File System (HDFS) have become so popular as a basis for building Big Data systems.
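A toy comparison of the two scaling models, with entirely made-up prices (the point is the shape of the cost curve, not the numbers):

```python
def san_cost(capacity_gb, price_per_gb_upfront=0.10):
    """Up-front purchase: you pay for the full planned capacity, used or not."""
    return capacity_gb * price_per_gb_upfront

def cloud_cost(used_gb, price_per_gb_month=0.02, months=1):
    """Pay-as-you-go: you pay only for what is actually stored each month."""
    return used_gb * price_per_gb_month * months

# Early in a project only 1 TB of a planned 100 TB is actually used:
print(san_cost(100_000))   # pay for the full planned capacity at once
print(cloud_cost(1_000))   # pay only for the current month's volume
```

The absolute figures are irrelevant; what matters is that the pay-as-you-go cost tracks the actual data volume, which is exactly the $/GB scaling coefficient argument above.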
Returning to the storage cost of untimely data to finish this post, I would add the following: the feasibility of storing data that seems unnecessary at the start of a project should be assessed against a forecast of its non-use period. The expected benefit of such data should be compared with the product of three values: the cost-scaling coefficient, the volume of unused data arriving per unit of time, and the length of the non-use period. In many cases the potential benefit will turn out to far exceed this expense. So it is not just a matter of whether you are a Big Data geek or not.
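The rule of thumb above can be written down directly; all the input figures here are illustrative assumptions:

```python
def untimely_data_cost(cost_per_gb_month, gb_per_month, months_unused):
    """The post's estimate: scaling coefficient x inflow rate x non-use period."""
    return cost_per_gb_month * gb_per_month * months_unused

# Assumed inputs: $0.02 per GB-month, 500 GB of unused data per month,
# forecast non-use period of two years.
cost = untimely_data_cost(0.02, 500, 24)
expected_benefit = 50_000.0  # an illustrative estimate of the eventual payoff
print(cost, expected_benefit > cost)
```

If the forecast benefit exceeds this product, keeping the "untimely" data pays for itself, which is the post's closing argument.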