What is Hadoop?
Hadoop
is an open-source software framework for storing and processing big
data in a distributed fashion on large clusters of commodity hardware.
Essentially, it accomplishes two tasks: massive data storage and faster
processing.
Let's look at some of the terms first!
Big data
Big data is a marketing term, not a technical term. Everything is big
data these days. It is a totally unspecific term that is largely
defined by what the marketing departments of various very optimistic
companies can sell - and what the C*Os of major companies buy, in order to
make magic happen.
Data mining
Actually, data mining was just as overused... it could mean almost anything, such as:
- collecting data (think NSA)
- storing data
- machine learning / AI (which predates the term data mining)
- non-ML data mining (as in "knowledge discovery", where the term
data mining was actually coined; but where the focus is on new
knowledge, not on learning of existing knowledge)
- business rules and analytics
- visualization
- anything involving data you want to sell for truckloads of money
It's just that marketing needed a new term. "Business intelligence",
"business analytics", ... they still keep on selling the same stuff,
it's just rebranded as "big data" now.
Most "big" data mining isn't big
Now "big data" is real. Google has big data, and CERN also has big
data. Most others probably don't. Data starts being big when you need
1000 computers just to store it.
What does Hadoop do?
Big
data technologies such as Hadoop are also real. They aren't always used
sensibly (don't bother running Hadoop clusters of fewer than 100 nodes - at
this point you can probably get much better performance from well-chosen
non-clustered machines), but of course people write such software.
But most of what is being done isn't data mining. It's Extract,
Transform, Load (ETL), so it is replacing data warehousing. Instead of
using a database with structure, indexes and accelerated queries, the
data is just dumped into Hadoop, and once you have figured out what to
do, you re-read all your data, extract the information you really
need, transform it, and load it into your Excel spreadsheet. Because
after selection, extraction and transformation, it usually isn't "big"
anymore.
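That ETL step can be sketched in a few lines of Python. This is a toy sketch with made-up record fields (`ts`, `country`, `amount`); a real job would run over files in HDFS rather than an in-memory list:

```python
import csv
import io
import json

# Hypothetical raw records as they might sit in HDFS: one schemaless
# JSON blob per line, dumped without indexes or structure.
raw_lines = [
    '{"ts": "2015-03-01", "country": "DE", "amount": 19.99, "extra": "..."}',
    '{"ts": "2015-03-01", "country": "US", "amount": 5.00, "extra": "..."}',
    '{"ts": "2015-03-02", "country": "DE", "amount": 7.50, "extra": "..."}',
]

def etl(lines):
    """Extract the fields we care about, transform them, load into CSV."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["date", "country", "amount_cents"])
    for line in lines:
        record = json.loads(line)                    # extract
        if record["country"] != "DE":                # select only what we need
            continue
        cents = int(round(record["amount"] * 100))   # transform units
        writer.writerow([record["ts"], record["country"], cents])
    return out.getvalue()

print(etl(raw_lines))
```

The output of this step is exactly the kind of small, structured table that fits in a spreadsheet - which is the point: after selection and transformation, the data is no longer "big".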
Data quality suffers with size
Many of the marketing
promises of big data will not hold. Twitter yields far fewer insights
for most companies than advertised (unless you are a teenie rockstar,
that is), and the Twitter user base is heavily biased. Correcting for
such a bias is hard, and needs highly experienced statisticians.
Bias in the data is one problem - if you just collect some random data
from the internet or from an application, it will usually not be
representative, in particular not of potential users. Instead, you will
be overfitting to the existing heavy users if you don't manage to cancel
out these effects.
The other big problem is plain noise. You have
spam bots, but also other tools (think of Twitter "trending topics", which
cause reinforcement of "trends") that make the data much noisier than
other sources. Cleaning this data is hard, and not a matter of
technology but of statistical domain expertise. For example, Google Flu
Trends was repeatedly found to be rather inaccurate. It worked in some
of the earlier years (maybe because of overfitting?) but is no longer
of good quality.
Unfortunately, a lot of big data users pay too
little attention to this, which is probably one of the many reasons why
most big data projects seem to fail (the others being incompetent
management, inflated and unrealistic expectations, and a lack of company
culture and skilled people).
Hadoop != data mining
Hadoop
doesn't do data mining. Hadoop manages data storage (via HDFS, a very
primitive kind of distributed database) and it schedules computation
tasks, allowing you to run the computation on the same machines that
store the data. It does not do any complex analysis.
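What Hadoop does schedule is simple map/reduce computation. A minimal word count in that style is sketched below - plain Python functions rather than actual Hadoop Streaming scripts (which would each read from stdin on a cluster node); the names and sample lines are just for illustration:

```python
import itertools

def mapper(lines):
    # map phase: emit (word, 1) for every word;
    # Hadoop runs this on the nodes that store each data block
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    # between map and reduce, Hadoop shuffles and sorts by key;
    # here we sort explicitly, then sum the counts per word
    for word, group in itertools.groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

lines = ["hadoop stores data", "hadoop schedules tasks"]
print(dict(reducer(mapper(lines))))
```

Note how little "analysis" is going on here: the framework moves computation to the data and aggregates key/value pairs, and that is all. Anything smarter has to be built on top.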
There are
some tools that try to bring data mining to Hadoop. In particular,
Apache Mahout can be called the official Apache attempt to do data
mining on Hadoop. Except that it is mostly a machine learning tool
(machine learning != data mining; data mining sometimes uses methods
from machine learning). Some parts of Mahout (such as clustering) are
far from advanced. The problem is that Hadoop is good for linear
problems, but most data mining isn't linear. And non-linear algorithms
don't just scale up to large data; you need to carefully develop
linear-time approximations and live with losses in accuracy - losses
that must be smaller than what you would lose by simply working on
smaller data.
Sources: a compilation of answers given on Stack Overflow and from various other sources.