Tuesday, January 8, 2008

MapReducing 20 petabytes per day

Googlers Jeff Dean and Sanjay Ghemawat have an article, " MapReduce: Simplified Data Processing on Large Clusters", in the January 2008 issue of Communications of the ACM.

It is a great introduction to MapReduce, but what I found most interesting was the numbers they cite on usage of MapReduce at Google.

Jeff and Sanjay report that, on average, 100k MapReduce jobs are executed every day, processing more than 20 petabytes of data per day.

More than 10k distinct MapReduce programs have been implemented. In the month of September 2007, 11,081 machine years of computation were used for 2.2M MapReduce jobs. On average, these jobs used 400 machines and completed their tasks in 395 seconds.

What is so remarkable about this is how casual it makes large scale data processing. Anyone at Google can write a MapReduce program that uses hundreds or thousands of machines from their cluster. Anyone at Google can process terabytes of data. And they can get their results back in about 10 minutes, so they can iterate on it and try something else if they didn't get what they wanted the first time.

It is an amazing tool for Google. Google's massive cluster and the tools built on top of it rightly have been called a " competitive advantage", the "secret source of Google's power", and a " major force multiplier".

By the way, there is another little tidbit in the paper about Google's machine configuration that might be of interest. They describe the machines as dual processor boxes with gigabit ethernet and 4-8G of memory. The question of how much memory Google has per box in its cluster has come up a few times, including in my previous posts, "Four petabytes in memory?" and " Power, performance, and Google".

No comments:


Blog Widget by LinkWithin