TalkBI: Big Unstructured Data vs. Structured Relational Data

What is structured data?

Structured data is data organized according to a predefined model, which makes it easy to manage. Because of this organization, it fits naturally into a relational database or a well-structured file format such as XML. It can be searched with simple, straightforward algorithms and is mostly clean and ready for analysis. Structured data is estimated to account for just 20% of all the data available.
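To make the "simple search" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customers table and its rows are invented for illustration; they are assumptions, not data from any real system.

```python
import sqlite3

# Hypothetical table and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Pune"), (2, "Ben", "Leeds"), (3, "Chen", "Pune")],
)

# Every row has the same named columns, so a search is a plain filter.
rows = conn.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Pune",)
).fetchall()
names_in_pune = [name for (name,) in rows]
print(names_in_pune)
conn.close()
```

The query needs no text parsing or interpretation: the column names already say what each value means.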

What is unstructured data?

In contrast, unstructured data is raw and unorganized. It has no predefined data model: it is any electronic information that is not stored in the tables, rows and columns of databases and enterprise applications. Unstructured data is everywhere. Much of it is human-generated: email messages, social media posts, instant messages and other communications, documents, images, audio and video. Consider an e-mail. It is indexed by date, time, sender, recipient and subject, but the most important part, the body, remains unstructured. An estimated 80% of all available data is unstructured. Until recently, little could be done with it beyond storing it or analyzing it manually. With the advent of Big Data tools, things look promising.
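The e-mail example can be sketched with Python's standard email package. The message below is made up for illustration; the point is only that the headers come out as named fields while the body stays free text.

```python
from email import message_from_string

# A made-up message, used only to illustrate the split.
raw_message = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly numbers
Date: Mon, 01 Jul 2013 09:30:00 +0000

Hi Bob, the numbers look better than last quarter, but shipping is still slow.
"""

msg = message_from_string(raw_message)

# Structured part: well-defined header fields that can be indexed.
structured_fields = {
    "sender": msg["From"],
    "recipient": msg["To"],
    "subject": msg["Subject"],
    "date": msg["Date"],
}

# Unstructured part: free text with no predefined model.
body = msg.get_payload()

print(structured_fields)
print(body)
```

Sorting or filtering by sender or date is trivial here, whereas answering "is Alice happy with shipping?" requires interpreting the body.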

How are they different?

The following table briefly describes the difference between structured and unstructured data

Types of data in a data warehouse

A data warehouse is a large store of data accumulated from a wide range of sources within a company and used to guide management decisions. Unlike an OLTP database, a data warehouse typically contains aggregated historical data and is optimized for particular types of analysis, depending on the client applications; that is, it is used for business analysis (OLAP). Traditionally, data warehouses store the following three types of data:

Historical data

A data warehouse typically contains several years of historical data. For example, one can retrieve data from 3, 6 or 12 months ago, or even older, from a data warehouse. The amount of data stored depends on the available infrastructure, the cost and the type of analysis required.

Derived Data

Derived data is generated from existing data using a mathematical operation or a data transformation. It is usually stored to improve query performance, since results can be read directly instead of being recomputed. Such data is often also used for data maintenance operations.
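As a rough illustration, monthly totals are a typical piece of derived data: they are computed once from detail rows and then reused by queries. The sales rows below are hypothetical.

```python
from collections import defaultdict

# Hypothetical detail rows: (month, region, sale_amount).
sales = [
    ("2013-01", "North", 120.0),
    ("2013-01", "South", 80.0),
    ("2013-02", "North", 200.0),
    ("2013-02", "South", 50.0),
]

# Derived data: one total per month, computed once from the detail rows.
totals = defaultdict(float)
for month, _region, amount in sales:
    totals[month] += amount
monthly_totals = dict(totals)

# A report can now look the total up instead of rescanning every sale.
print(monthly_totals["2013-02"])
```

The trade-off is that the derived totals must be refreshed whenever the underlying detail rows change.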

Metadata

Put simply, metadata is data about data. Information such as the format of a data item, the last operation performed on it, when and by whom it was collected, and its timestamp can all be categorized as metadata.
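A small sketch of the idea, using a temporary CSV file as a stand-in for a data set: the file contents are the data, while its format, size and modification time are metadata. The file and the dictionary keys are assumptions chosen for illustration.

```python
import os
import tempfile
from datetime import datetime, timezone

# Write a tiny stand-in data file (newline="" keeps the byte count the same on every OS).
with tempfile.NamedTemporaryFile("w", suffix=".csv", newline="", delete=False) as f:
    f.write("id,amount\n1,10\n2,20\n")
    path = f.name

# Collect data about the data, without reading the rows themselves.
info = os.stat(path)
metadata = {
    "format": os.path.splitext(path)[1].lstrip("."),
    "size_bytes": info.st_size,
    "last_modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
}
os.remove(path)

print(metadata)
```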

Limitations of data warehouse

Analysis of unstructured data: As mentioned earlier, almost 80% of the data available today is unstructured. Analyzing such data within a data warehouse is expensive, time-consuming and complex. Moreover, unstructured data cannot be analyzed directly: structured information first has to be extracted from the unstructured source, and analysis is then performed on that extract. Fingerprint matching is a good example. A fingerprint scan is essentially unstructured data. When two fingerprints need to be matched, what is actually compared is the map or polygon derived from each print, not the prints in their raw form. Another popular example is sentiment analysis. In a simple approach, social media comments are parsed into words and each word is flagged as good or bad. The sentiment of the comment or tweet is then determined from the number of good and bad words.

Complexity: Data warehousing is complex to implement and needs multiple tools. The extraction, transformation and loading (ETL) process can take significant time and effort, which can reduce the value of the results by the time they are produced.

Data security: Security is another sensitive issue for a data warehouse, since it concentrates data from many sources in one place, and it deserves the highest priority.

Data consistency: There is a good chance that data in the warehouse is not consistent with its original source, especially when the warehouse is updated only periodically.

Cost: Building and housing a data warehouse requires a huge infrastructure in terms of storage capacity. Maintaining it is even more expensive.
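The word-counting approach to sentiment analysis described under "Analysis of unstructured data" can be sketched as follows. The two word lists are tiny, hypothetical stand-ins for a real sentiment lexicon, and the function name is an assumption; this is a toy illustration of extracting a structured value from free text, not a production method.

```python
import re

# Hypothetical, deliberately tiny word lists; a real lexicon would be far larger.
GOOD_WORDS = {"good", "great", "love", "fast"}
BAD_WORDS = {"bad", "slow", "hate", "broken"}


def sentiment(comment: str) -> str:
    """Classify a comment by counting flagged good and bad words."""
    # Step 1: parse the unstructured text into lowercase words.
    words = re.findall(r"[a-z']+", comment.lower())
    # Step 2: flag each word and count the flags.
    good = sum(word in GOOD_WORDS for word in words)
    bad = sum(word in BAD_WORDS for word in words)
    # Step 3: the larger count decides the overall sentiment.
    if good > bad:
        return "positive"
    if bad > good:
        return "negative"
    return "neutral"


print(sentiment("Love the new app, great and fast!"))
print(sentiment("Delivery was slow and the box was broken"))
```

The output label is a structured value that could be stored in a warehouse column. Note that counting words ignores negation and sarcasm ("not good" still counts as good), which is one reason such analysis is hard to do well.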

Role of data warehouse in the future

Data warehouses will evolve into analytics warehouses. A distributed storage and processing framework (like Hadoop) will sit between the source data systems and the data warehouse. It will process unstructured data and load the results into the data warehouse. Operational data warehouses will become the norm, as they will be able to combine data from multiple sources seamlessly and go beyond dashboards and reports to use data in daily operations. Processing data and running analytics in the cloud will become a necessity, offering simple, convenient and cost-effective ways to manage data. Advances in data compression will help reduce the cost of the infrastructure needed to house a data warehouse. In-memory technologies will become standard in data warehouses, processing large datasets in system memory and greatly improving performance.
