Big Data FAQs

What is Big Data?

Big Data refers to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a reasonable time.

Who are some of the Big Data users?

From cloud companies like Amazon to health care companies to financial firms, it seems as if everyone is developing a strategy to use big data. For example, every mobile phone user has a monthly bill that catalogs every call and every text; processing the sheer volume of that data can be challenging. Software logs, remote sensing technologies, and information-sensing mobile devices all pose a challenge in terms of the volume of data they create. The size of Big Data is relative to the size of the enterprise: for some, hundreds of gigabytes is enough to warrant attention; for others, it takes tens or hundreds of terabytes.

What is Hadoop?

The Apache Hadoop software library allows for the distributed processing of large data sets across clusters of computers using a simple programming model. The library is designed to scale from single servers to thousands of machines, each offering local computation and storage. Instead of relying on hardware to deliver high availability, the library itself detects and handles failures at the application layer. As a result, it delivers a highly available service on top of a cluster of computers, each of which may be prone to failure.
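
The programming model most commonly associated with Hadoop is MapReduce. The sketch below is plain Python, not Hadoop's own API, and it runs on a single machine; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration. It only shows the shape of the model: a map step emits key/value pairs, the pairs are grouped by key, and a reduce step combines each group.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, mimicking the step between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data about data"]))
```

On a real cluster the map and reduce steps would run in parallel on many machines, with the framework handling data movement and failed tasks; none of that is modeled here.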

What exactly is Big Data Analytics?

It’s two things: big data and the kind of analytics users want to do with big data. Let’s start with big data, then come back to analytics.

Some user organizations have cached away hundreds of terabytes just for analytics. The size of big data is relative; hundreds of terabytes isn’t new, but hundreds of terabytes just for analytics is, at least for most user organizations.

Big Data is all about multi-terabyte datasets, right?

No, there’s more to it than that. Size aside, there are other ways to define big data. In particular, big data tends to be diverse, and it’s the diversity that drives up the data volume. For example, analytic methods that are on the rise need to correlate data points drawn from many sources, both in the enterprise and outside it. Furthermore, one of the new things about analytics is that it’s based not just on structured data, but also on unstructured data (like human language text), semi-structured data (like XML files and RSS feeds), and data derived from audio and video. Again, the diversity of data types drives up data volume.

Finally, big data can be defined by its velocity or speed. This may also be defined by the frequency of data generation. For example, think of the stream of data coming off of any kind of sensor, say thermometers sensing temperature, microphones listening for movement in a secure area, or video cameras scanning for a specific face in a crowd. With sensor data flying at you relentlessly in real time, data volumes get big in a hurry. Even more challenging, the analytics that go with streaming data have to make sense of the data and possibly take action—all in real time.

Hence, big data is more than large datasets. It’s also about diverse data sources or data types (and these may be arriving at various speeds), plus the challenges of analyzing data in these demanding circumstances.
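
To make the streaming case above a little more concrete, here is a minimal sketch of analyzing sensor readings as they arrive instead of storing them all first. The function name rolling_alerts, the window size, and the threshold are all assumptions chosen for illustration; a production system would normally use a dedicated stream-processing tool.

```python
from collections import deque


def rolling_alerts(readings, window=3, threshold=30.0):
    """Yield (index, average) whenever the rolling mean exceeds the threshold.

    Only the last `window` readings are kept in memory, so the function
    works on an unbounded stream of values.
    """
    recent = deque(maxlen=window)  # the oldest reading drops out automatically
    for index, value in enumerate(readings):
        recent.append(value)
        if len(recent) == window:
            average = sum(recent) / window
            if average > threshold:
                yield index, average


if __name__ == "__main__":
    # Hypothetical thermometer readings arriving one at a time.
    stream = iter([20, 25, 30, 35, 40, 22])
    for index, average in rolling_alerts(stream):
        print(f"reading {index}: rolling average {average:.1f} is above the threshold")
```

Because the function is a generator, it can take action on each alert as soon as the triggering reading arrives, which is the "all in real time" requirement in miniature.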

What kinds of analytics go with big data?

The kind of analytics applied to big data is often called “advanced analytics.” A better term would be “discovery analytics” because that’s what users are trying to accomplish. In other words, with big data analytics, the user is typically a business analyst who is trying to discover new business facts that no one in the enterprise knew before. To do that, you need large volumes of data with a lot of detail, and this is usually data that the enterprise has not yet tapped for analytics. For example, in the middle of the recent economic recession, companies were constantly being hit by new forms of customer churn. To discover the root cause of the newest form of churn, a business analyst grabs several terabytes of detailed data drawn from operational applications to get a view of recent customer behaviors. He may mix that data with historic data from a data warehouse. Dozens of queries later, he’s discovered a new churn behavior in a subset of the customer base. With any luck, he’ll turn that information into an analytic model, with which the company can track and predict the new form of churn.
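
One of the "dozens of queries" in the churn example might look roughly like the sketch below, which joins recent operational data with historic warehouse data and keeps only the customer segments whose recent churn rate exceeds the historic rate. Everything here is hypothetical: the table names, columns, segments, and figures are invented, and an in-memory SQLite database stands in for the operational systems and the data warehouse.

```python
import sqlite3

# Hypothetical schema and sample rows, invented for this illustration.
SETUP_SQL = """
CREATE TABLE recent_activity (customer_id INTEGER, segment TEXT, cancelled INTEGER);
CREATE TABLE warehouse_history (segment TEXT, historic_churn_rate REAL);
INSERT INTO recent_activity VALUES
    (1, 'prepaid', 1), (2, 'prepaid', 1), (3, 'prepaid', 0), (4, 'prepaid', 1),
    (5, 'contract', 0), (6, 'contract', 0), (7, 'contract', 1), (8, 'contract', 0);
INSERT INTO warehouse_history VALUES ('prepaid', 0.25), ('contract', 0.25);
"""

# Compare each segment's recent churn rate with its historic rate.
CHURN_QUERY = """
SELECT r.segment,
       AVG(r.cancelled) AS recent_rate,
       h.historic_churn_rate
FROM recent_activity AS r
JOIN warehouse_history AS h ON h.segment = r.segment
GROUP BY r.segment, h.historic_churn_rate
HAVING AVG(r.cancelled) > h.historic_churn_rate
ORDER BY r.segment
"""


def rising_churn_segments():
    """Return (segment, recent_rate, historic_rate) rows where churn has risen."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SETUP_SQL)
        return conn.execute(CHURN_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for segment, recent, historic in rising_churn_segments():
        print(f"{segment}: recent churn {recent:.2f} vs historic {historic:.2f}")
```

In practice the analyst would run many variations of such a query, slicing by different attributes, before a pattern worth modeling emerges.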

What kind of analytic tool does a business analyst need for the “discovery analytics” that’s common with big data?

Discovery analytics against big data can be enabled by different types of analytic tools, including those based on SQL queries, data mining, statistical analysis, fact clustering, data visualization, natural language processing, text analytics, artificial intelligence, and so on. It’s quite an arsenal of tool types, and savvy users get to know their analytic requirements first before deciding which tool type is appropriate to their needs.