Years ago, I naively thought that Google somehow had amazing machines and software that managed to do most everything in real-time even though the huge amounts of data they process pretty much preclude doing any such thing if I had bothered to think about it rationally. I imagined that they were processing each site they crawled as soon as they found it and into the search engine it went. Each news item from RSS was similarly fed straight into an index and made available immediately and no batch processing of reams of data was done.
Fortunately, such magical thinking has not persisted. Google does not use elves in a hollow tree to produce their results, they use intelligent engineers and many of the same tools available to you and me. They have developed all kinds of innovative solutions in order to be dealt with the huge amounts of data they have. Those solutions include:
- Building a truly enormous array of commodity PCs on which they run Linux to handle the computing needs for all of Google. When individual computers fail, their software simply shifts the workload to other functional machines. Supposedly, they buy large quantities of parts in bulk and make their purchases in a variety of ways to avoid being gouged by vendors.
- They created a distributed filesystem that spreads all files across hard drives on three separate machines in order to reduce the chance of failure causing loss of data.
- Built software that makes it easy to handle machine failures, distribute computing tasks across a large number of CPUs, etc.
The best thing about all of this is that they haven’t been particularly quiet about how they do a lot of it. For example, if you go to their Research Publications site you’ll see papers about The Google File System and Web Search for a Planet: The Google Cluster Architecture.
Now, I’m not going to snow you on this, if you aren’t of a technical bent, this stuff is going to be a hard boring slog. Michael Chabon it’s not. But, if data analysis of truly ginormous data sets interests you, then you want to read their paper on MapReduce: Simplified Data Processing on Large Clusters [PDF].
It’s all about how they split up many data analysis processing in such a way that it is easy to write the algorithm to process the data and not spend time worrying about hardware failures, how many machines you might be allocated to run your software, or how to optimally use those machines to get the data processed in the least amount of time. Instead, it forms a kind of support system that reminded me of using the genetic programming package JGAP. I’ll talk in a future entry about how JGAP can make it easy to find optimal or near optimal solutions for problems that would be tedious or impossible for humans. But the important thing it did was to make it easy for me to focus on the specifics of my problem and not on the mechanics of a framework. MapReduce is one of Google’s means to achieve that same kind of focus and I think it makes for a really interesting read.
The Java Nutch project includes a Java version of MapReduce and a distributed file system that you could use as part of your own huge data set processing so reading these articles isn’t just an academic exercise. You can actually put this to use if you have a project that needs it. Be sure to check out the wiki for the Nutch project for more helpful information.