Thursday, February 19, 2015

Big Unstructured Data v/s Structured Relational Data

What is structured data?



Structured data refers to data that is organized and manageable. Because of this organized structure, it fits seamlessly into a relational database or a well-structured file format such as XML. It is searchable by simple, straightforward search algorithms and is mostly clean and ready for analysis. Structured data accounts for only about 20% of all the data available.

What is unstructured data?



In contrast, unstructured data is raw and unorganized. It is data which does not have a predefined data model: any electronic information that is not stored in the tables, rows and columns of databases and enterprise applications. Unstructured data is everywhere. Much of it is human-generated data such as email messages, social media posts, instant messages and other communications, documents, images, audio and video. Consider the example of an e-mail. It is indexed by date, time, sender, recipient and subject, but the most important part, i.e. the body, remains unstructured. Of all the data available, about 80% is unstructured. Until recently, no technology was available to do much with it except store it or analyze it manually. But with the advent of Big Data tools, things look promising.
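The e-mail example can be sketched in a few lines of Python using the standard library's email module. This is only an illustration: the message, the addresses and the variable names are made up. The header fields come out as neat key/value pairs, while the body is just a block of free text.

```python
from email import message_from_string

# A made-up message used only for illustration.
raw_email = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly numbers
Date: Thu, 19 Feb 2015 10:00:00 +0000

Hi Bob, the numbers look better than last quarter. Can we talk tomorrow?
"""

msg = message_from_string(raw_email)

# Structured part: fixed, named fields that map directly to table columns.
structured = {field: msg[field] for field in ("From", "To", "Subject", "Date")}

# Unstructured part: free text with no predefined data model.
body = msg.get_payload()

print(structured)
print(body)
```

The header fields could be loaded into a relational table as they are, but the body would need further processing before it could be analyzed.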
     

How are they different?

The following table briefly describes the difference between structured and unstructured data



Types of data in a data warehouse





A data warehouse is a large store of data accumulated from a wide range of sources within a company and used to guide management decisions. Unlike an OLTP database, a data warehouse typically contains aggregated historical data and is optimized for particular types of analysis, depending upon the client applications, i.e. it is used for business analysis (OLAP). Traditionally, data warehouses store the following three types of data:

Historical data
A data warehouse typically contains several years of historical data. For example, one can retrieve data from 3, 6 or 12 months ago, or even older data, from a data warehouse. The amount of data to be stored depends on the available infrastructure, the cost and the type of analysis required.

Derived Data
Derived data is generated from existing data using a mathematical operation or a data transformation. This type of data is usually used to improve query performance. Such data is often also used for data maintenance operations.
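As a small illustration of derived data, the sketch below pre-computes monthly totals from daily sales records so that a monthly report does not have to scan every daily row. The records and names are hypothetical.

```python
from collections import defaultdict

# Hypothetical base data: (date, sale amount) pairs.
daily_sales = [
    ("2015-01-05", 120.0),
    ("2015-01-20", 80.0),
    ("2015-02-03", 200.0),
]

# Derived data: one total per month, computed once and stored.
monthly_totals = defaultdict(float)
for day, amount in daily_sales:
    month = day[:7]  # "YYYY-MM" prefix of the date string
    monthly_totals[month] += amount

print(dict(monthly_totals))
```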

Metadata
In simple words, metadata is data about data. Information such as the format of a data item, the last operation performed on it, when and by whom it was collected, and its timestamp can be categorized as metadata.

Limitations of data warehouse


Analysis of unstructured data: As mentioned earlier, almost 80% of the data available today is unstructured. Analyzing such data within a data warehouse is not only expensive and time consuming but also complex. Moreover, unstructured data cannot be analyzed directly. Structured information has to be extracted from the unstructured source, and analysis can only be performed on this structured extract. A good example is fingerprint matching. A fingerprint scan is basically unstructured data. Whenever two fingerprints need to be matched, it is the map or polygon created from the prints that is actually compared, not the fingerprints in their raw form. Another popular example is sentiment analysis. Social media comments are parsed into words and these words are flagged as good or bad. Finally, based on the number of good and bad words, the sentiment of the comment or tweet is determined.
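The word-counting approach to sentiment analysis described above can be sketched as follows. This is a deliberately minimal illustration: the two word lists and the function name are assumptions made for the example, and real sentiment tools use far larger lexicons and more refined techniques.

```python
import re

# Tiny, made-up word lists used only for illustration.
GOOD_WORDS = {"good", "great", "love", "excellent"}
BAD_WORDS = {"bad", "terrible", "hate", "poor"}


def sentiment(comment):
    """Classify a comment by counting flagged good and bad words."""
    # Step 1: parse the unstructured text into lowercase words.
    words = re.findall(r"[a-z']+", comment.lower())
    # Step 2: flag each word as good or bad and count the flags.
    good = sum(word in GOOD_WORDS for word in words)
    bad = sum(word in BAD_WORDS for word in words)
    # Step 3: decide the sentiment from the two counts.
    if good > bad:
        return "positive"
    if bad > good:
        return "negative"
    return "neutral"


print(sentiment("I love this great phone"))
print(sentiment("Terrible battery, poor screen"))
```

The counts of good and bad words are the structured extract here; the analysis runs on those counts, not on the raw text.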

Complex: Data warehousing is complex to implement and needs multiple tools. The extraction, transformation and loading process can also take significant time and effort, which can decrease the value of the results produced.

Data security: This is another sensitive issue with a data warehouse and is the one that needs the most care.

Data Consistency: There is a good chance that data may not be consistent with its original source especially in cases where data is updated periodically.

Cost: Huge infrastructure in terms of storage capacity is required to build and house a data warehouse. Moreover, maintaining a data warehouse is even more expensive.

Role of data warehouse in the future


Data warehouses will evolve into analytics warehouses. A distributed storage and processing framework (like Hadoop) will sit between the source data systems and the data warehouse. It will process unstructured data and load it into the data warehouse. Operational data warehouses will become the norm, as they will be able to combine data from multiple sources seamlessly and go beyond dashboards and reports to use data for daily operations. Processing data and analytics in the cloud will become a necessity, making simple, convenient and cost-effective data management possible. Advances in data compression will help reduce the cost of the infrastructure needed to house a data warehouse. In-memory technologies will become standard in data warehouses, processing large datasets in system memory and leading to a rapid increase in performance.



Tuesday, February 3, 2015

Business Intelligence & Analysis Products Scan & Evaluation

Business Intelligence (BI) tools come in all shapes and sizes. Some emphasize one feature over another. Some are very expensive and some are affordable. With all of these options, how do you separate the good from the bad? Here are 5 well-known BI tools currently on the market. I have tried to find the best BI solution by scoring each tool against a set of criteria and doing a weighted score analysis.

First, which 5 tools are compared?


Yellowfin

Yellowfin BI offers easy-to-use BI tools that make it simple to assess, monitor and understand any bit of data related to a business. It strikes a fine balance between ease of use and the governance needs of enterprise IT.

What’s unique?
Heavy focus on mobile and collaboration functionality along with location intelligence

Tableau

Marketed as the future of enterprise business intelligence, Tableau enables businesses to share, collaborate and make data-driven decisions on a variety of platforms. It has a secure environment with excellent data governance and many enterprise-ready features.

What’s unique?
Easy to learn with swift data access. Create dashboards and visualizations within a few clicks.

IBM Cognos

With its wide product portfolio for the individual, workgroup, department, midsize business and large enterprise, Cognos software is designed to help everyone in an organization make the decisions that achieve better business outcomes. Its software can provide even small businesses with the same level of analytical insight as corporate giants.
What’s unique?
Usable for any size company.

Spotfire

Tibco Spotfire offers its users a versatile, feature-rich business intelligence platform that can deliver fast answers to important business questions without having to rely on IT.

What’s unique?
With its unique data connectivity, Spotfire Desktop lets you quickly mash up multiple datasets into a single, visual experience.

Business Objects

SAP’s BusinessObjects BI solution gives all users, regardless of job function, information that drives smarter processes, and lets them customize and analyze BI data with little to no involvement from their own IT department.
What’s unique?
Self-service access with very little dependency on the IT department.


Criteria Used


To evaluate the above 5 vendors, certain criteria are used. These criteria are weighted according to their importance. The following are the criteria:

Data Source

It is very important for a tool to connect to a wide set of data sources. Data sources include relational and multi-dimensional data sources, spreadsheets, text files etc.
This criterion carries a weight of 20%.

Platform

This criterion carries a weight of 20%. It describes the platforms on which the product is available. A product can be available on any or all of the following platforms:

  • Mobile
  • Cloud
  • On-premise

      

Reporting features

This is an important criterion and carries a weight of 25%.
The main purpose of a business intelligence (BI) tool is to enable decision makers to make informed decisions. Reporting tools help to gather that information. Reports can have various features such as:

  • Ad Hoc Reporting
  • Automatic Scheduled Reporting
  • Customizable Dashboard
  • Customizable Features
  • Ranking Reports
  • Financial Forecast/Budget
  • Graphic Benchmark Tools
  • Performance Measurement


Analytical features

BI tools can have analytical features such as Pivot table/OLAP, What-if Analysis, Ad Hoc Analysis, Predictive Analysis, Trend Indicators etc. This is an important criterion and carries a weight of 25%.

Support

Nowadays, it is essential to have a dedicated support team from the vendor to work on production issues, product management and user training after go-live. The weight for this criterion is 5%. The following are some of the ways in which a vendor can support its BI tool:

  • Email
  • In person training
  • Chat
  • Online Support

Free Trial

It is nice to have a ‘try it out’ feature before licensing or buying the product, to test how well the tool helps the business decision makers. Some vendors offer a free trial and some do not. The weight for this criterion is 5%.


How did the 5 tools fare?

It is time to analyze the 5 BI tools against the above criteria and crown the BI tool with the highest weighted score as the winner. For each criterion, a tool can receive a maximum of 10 points and a minimum of 6 points. The winner of each category scores a full 10, and the other 4 tools are graded by comparison with the winner. The tool which fares the worst when compared to the other tools receives 6 points.

The following section gives the winner for each criterion.

Data Sources  > Tableau
Tableau stands out as the clear winner. It not only supports data sources from Microsoft, IBM, HP, Teradata, SAP, Oracle, Salesforce etc., which are supported by other vendors, but also supports a host of big data sources and data sources from Google and Amazon, such as Google BigQuery, Google Analytics, Amazon Redshift and Amazon Elastic MapReduce. Spotfire and SAP BO are tied at rank 2, while Cognos and Yellowfin score comparatively less.

Platform  > Spotfire
Tibco Spotfire beats the other vendors on the number of platforms the tool is available on: it is available online, on premise and on mobile. Yellowfin and Tableau come in at rank 2 since they support online and mobile platforms. Cognos edges past SAP BO because, in addition to being available online, it is also available on premise.

Reporting features > Cognos
IBM Cognos stands out as the winner with a full 10 points. Some of the reporting features provided by Cognos are:

  • Ad Hoc Reporting
  • Automatic Scheduled Reporting
  • Customizable Dashboard
  • Customizable Features
  • Dashboard
  • Financial Forecast/Budget
  • Graphic Benchmark Tools
  • Performance Measurements

Analytics features >  Tableau
Tableau marginally pips SAP BO and stands out as the winner, with Ad Hoc Analysis being the differentiator between them. Some of the analytics features provided by Tableau are Ad Hoc Analysis, OLAP, Predictive Analysis, Trend Indicators etc.

Support >  Tableau
In terms of support, online chat, tutorials and in-person training are essential, so a vendor that offers all three receives extra points. Tableau once again pips its competition, as it offers support via email, phone, in-person training, online chat and tutorials. Spotfire fares poorly as it has only email and phone support. Cognos comes in at No. 2 since, compared to SAP BO, it additionally offers in-person training.

Free Trial > Tableau, Spotfire, Cognos
Yellowfin and SAP BO do not provide a free trial and hence receive 6 points. All the others score 10 points since they provide a free trial.


Weighted Score Analysis


Features             Weight   Tableau   Spotfire   Yellowfin   Cognos   SAP BO
Data Sources         20%      10        8          6           7        8
Platform             20%      8         10         8           7        6
Reporting Features   25%      9         6          7           10       8
Analytics Features   25%      10        7          6           8        9
Support              5%       10        6          7           9        8
Free Trial           5%       10        10         6           10       6
Points                        9.35      7.65       6.7         8.25     7.75
Rank                          1         4          5           2        3


Rank


  1. Tableau
  2. IBM Cognos
  3. SAP BO
  4. Tibco Spotfire
  5. Yellowfin
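
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the weighted score calculation. The weights and scores are the ones from the table above; the variable and function names are my own choices for illustration. Each tool's points are the sum of (weight x score) over the six criteria.

```python
# Weights in percent, taken from the criteria section (they add up to 100).
weights = {
    "Data Sources": 20,
    "Platform": 20,
    "Reporting Features": 25,
    "Analytics Features": 25,
    "Support": 5,
    "Free Trial": 5,
}

# Scores per tool, listed in the same order as the criteria above.
scores = {
    "Tableau": [10, 8, 9, 10, 10, 10],
    "Spotfire": [8, 10, 6, 7, 6, 10],
    "Yellowfin": [6, 8, 7, 6, 7, 6],
    "Cognos": [7, 7, 10, 8, 9, 10],
    "SAP BO": [8, 6, 8, 9, 8, 6],
}


def weighted_score(tool_scores):
    """Sum of weight (in percent) times score, scaled back to a 10-point range."""
    total = sum(w * s for w, s in zip(weights.values(), tool_scores))
    return total / 100


points = {tool: weighted_score(s) for tool, s in scores.items()}

# Highest weighted score first.
ranking = sorted(points, key=points.get, reverse=True)

for rank, tool in enumerate(ranking, start=1):
    print(rank, tool, points[tool])
```

For example, Tableau works out to (20 x 10 + 20 x 8 + 25 x 9 + 25 x 10 + 5 x 10 + 5 x 10) / 100 = 9.35, which matches the Points row in the table.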



Winner