Hadoop: What is it?
Courtesy Tom White, Dr. Dobbs
The core map-reduce framework for big data consists of several interlocking technologies. This first installment of our tutorial explains what Hadoop does and how the pieces fit together.
Big Data is in the news these days, and Apache Hadoop is one of the most popular platforms for working with Big Data. Hadoop itself is undergoing tremendous growth as new features and components are added, and for this reason alone, it can be difficult to know how to start working with it. In this three-part series, I explain what Hadoop is and how to use it, presenting a simple, hands-on examples that you can try yourself. First, though, let’s look at the problem that Hadoop was designed to solve.
In 2002, Doug Cutting and Mike Cafarella started building a Web crawler and searcher, called Nutch, that they wanted to scale up to crawl and search the entire Web (several billion pages at the time). The difficult part, of course, was designing something that could cope with this amount of data; and it wasn’t until 2003, when Google published papers on its data processing infrastructure that solved the same problems, that "the route became clear," as Cutting put it.
The route was to build a distributed file system and a distributed processing engine that could scale to thousands of nodes. That project became Hadoop, named after the toy elephant of one of Cutting’s children. Hadoop had two principal parts: the Hadoop Distributed File System (HDFS) and MapReduce. The latter is the name Google engineers gave the distributed processing model and its implementation.
HDFS makes it easy to store large files. It optimizes for transfer speed over latency, which is what you want when you are writing files with billions of records (Web pages, in the original use case). You don’t need to alter the files once they are written, because the next crawl will rewrite the whole file anew.
MapReduce enables you to operate on the contents of a file in parallel by having an independent process read each chunk (or "block") of the file. With HDFS and MapReduce, the speed of analysis scales to the size of the cluster: A well-written job can run twice as fast on a cluster that is twice as big.
While the initial application of Web search is not one that many organizations need to solve, it turns out that HDFS and MapReduce are general enough to support a very broad range of applications. Like any computer system, they both have limitations — the main one being that MapReduce is batch-oriented; that is, operations take minutes or hours, not seconds, to complete. But even with these limitations, the number of problems the technologies can solve is still impressive (see MapReduce Design Patterns for a comprehensive catalog). And over time, we have seen the limitations being relaxed both incrementally (as with the performance improvements that are going into HDFS reads) and by major platform changes (as with the appearance of new processing frameworks running on data in HDFS that eschew MapReduce).
So how do you know if you need Hadoop? Simply put, Hadoop is useful when one machine is not big enough to process the data you need to process in a reasonable amount of time. Of course, what counts as reasonable depends on the task in hand, but if you find it’s taking 10 minutes, say, to grep a bunch of log files, or if generating your SQL reports is taking hours, then that is a sign that Hadoop is worth investigating.