What is Hadoop?
Hadoop
is an open-source software framework for storing and processing big
data in a distributed fashion on large clusters of commodity hardware.
Essentially, it accomplishes two tasks: massive data storage and faster
processing.
Let's look at some of the terms first!
Big data
Big data is a marketing term, not a technical term. Everything is big
data these days. It is a totally unspecific term that is largely
defined by what the marketing departments of various very optimistic
companies can sell - and what the C*Os of major companies buy, in order to
make magic happen.
Data mining
Actually, data mining was just as overused... it could mean almost anything, such as:
- collecting data (think NSA)
- storing data
- machine learning / AI (which predates the term data mining)
- non-ML data mining (as in "knowledge discovery", where the term
data mining was actually coined; but where the focus is on new
knowledge, not on learning of existing knowledge)
- business rules and analytics
- visualization
- anything involving data you want to sell for truckloads of money
It's just that marketing needed a new term. "Business intelligence",
"business analytics", ... they still keep on selling the same stuff,
it's just rebranded as "big data" now.
Most "big" data mining isn't big
Now "big data" is real. Google has big data, and CERN also has big
data. Most others probably don't. Data starts being big when you need
1000 computers just to store it.
What does Hadoop do?
Big
data technologies such as Hadoop are also real. They aren't always used
sensibly (don't bother running Hadoop clusters of fewer than 100 nodes - at
this point you can probably get much better performance from well-chosen
non-clustered machines), but of course people write such software.
But most of what is being done isn't data mining. It's Extract,
Transform, Load (ETL), so it is replacing data warehousing. Instead of
using a database with structure, indexes and accelerated queries, the
data is just dumped into Hadoop, and once you have figured out what to
do, you re-read all your data, extract the information you really
need, transform it, and load it into your Excel spreadsheet. Because
after selection, extraction and transformation, it usually isn't "big"
anymore.
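That ETL step can be sketched in a few lines of Python. This is a toy sketch with made-up record fields (`ts`, `country`, `amount`); a real job would run over files in HDFS rather than an in-memory list:

```python
import csv
import io
import json

# Hypothetical raw records as they might sit in HDFS: one schemaless
# JSON blob per line, dumped without indexes or structure.
raw_lines = [
    '{"ts": "2015-03-01", "country": "DE", "amount": 19.99, "extra": "..."}',
    '{"ts": "2015-03-01", "country": "US", "amount": 5.00, "extra": "..."}',
    '{"ts": "2015-03-02", "country": "DE", "amount": 7.50, "extra": "..."}',
]

def etl(lines):
    """Extract the fields we care about, transform them, load into CSV."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["date", "country", "amount_cents"])
    for line in lines:
        record = json.loads(line)                    # extract
        if record["country"] != "DE":                # select only what we need
            continue
        cents = int(round(record["amount"] * 100))   # transform units
        writer.writerow([record["ts"], record["country"], cents])
    return out.getvalue()

print(etl(raw_lines))
```

The output of this step is exactly the kind of small, structured table that fits in a spreadsheet - which is the point: after selection and transformation, the data is no longer "big".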
Data quality suffers with size
Many of the marketing
promises of big data will not hold. Twitter yields far fewer insights
for most companies than advertised (unless you are a teenie rockstar,
that is), and the Twitter user base is heavily biased. Correcting for
such a bias is hard, and needs highly experienced statisticians.
Bias in the data is one problem - if you just collect some random data
from the internet or from an application, it will usually not be
representative, in particular not of potential users. Instead, you will
be overfitting to the existing heavy users if you don't manage to cancel
out these effects.
The other big problem is plain noise. You have
spam bots, but also other tools (think of Twitter "trending topics", which
cause reinforcement of "trends") that make the data much noisier than
other sources. Cleaning this data is hard, and not a matter of
technology but of statistical domain expertise. For example, Google Flu
Trends was repeatedly found to be rather inaccurate. It worked in some
of the earlier years (maybe because of overfitting?) but is no longer
of good quality.
Unfortunately, a lot of big data users pay too
little attention to this, which is probably one of the many reasons why
most big data projects seem to fail (the others being incompetent
management, inflated and unrealistic expectations, and a lack of company
culture and skilled people).
Hadoop != data mining
Hadoop
doesn't do data mining. Hadoop manages data storage (via HDFS, a very
primitive kind of distributed database) and it schedules computation
tasks, allowing you to run the computation on the same machines that
store the data. It does not do any complex analysis.
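What Hadoop does schedule is simple map/reduce computation. A minimal word count in that style is sketched below - plain Python functions rather than actual Hadoop Streaming scripts (which would each read from stdin on a cluster node); the names and sample lines are just for illustration:

```python
import itertools

def mapper(lines):
    # map phase: emit (word, 1) for every word;
    # Hadoop runs this on the nodes that store each data block
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    # between map and reduce, Hadoop shuffles and sorts by key;
    # here we sort explicitly, then sum the counts per word
    for word, group in itertools.groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

lines = ["hadoop stores data", "hadoop schedules tasks"]
print(dict(reducer(mapper(lines))))
```

Note how little "analysis" is going on here: the framework moves computation to the data and aggregates key/value pairs, and that is all. Anything smarter has to be built on top.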
There are
some tools that try to bring data mining to Hadoop. In particular,
Apache Mahout can be called the official Apache attempt to do data
mining on Hadoop. Except that it is mostly a machine learning tool
(machine learning != data mining; data mining sometimes uses methods
from machine learning). Some parts of Mahout (such as clustering) are
far from advanced. The problem is that Hadoop is good for linear
problems, but most data mining isn't linear. And non-linear algorithms
don't just scale up to large data; you need to carefully develop
linear-time approximations and live with losses in accuracy - losses
that must be smaller than what you would lose by simply working on
smaller data.
Sources: a compilation of answers given on Stack Overflow and from various other sources.