The Apache Spark cluster computing engine is widely expected to become the standard paradigm for massive data processing. Hard to disagree.
On November 5, 2014, Databricks, the company responsible for the development and stewardship of the Spark ecosystem, announced the results of an impressive benchmarking contest.
We are proud to announce that Spark won the 2014 Gray Sort Benchmark (Daytona 100TB category). A team from Databricks including Spark committers Reynold Xin, Xiangrui Meng, and Matei Zaharia entered the benchmark using Spark. Spark tied with the Themis team from UCSD, and the two jointly set a new world record in sorting.
Sorting is a foundational task in computing. Spark's entry used TimSort for its in-memory sorts. Spark beat the legacy "Big Data" Hadoop MapReduce record handily: it sorted the same scale of data 3X faster using 10X fewer machines, roughly 30X better throughput per node.
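The "30X" figure is just the product of the two stated ratios. A minimal sketch of that arithmetic (the per-node throughput framing is an inference from the 3X and 10X figures above, not an official benchmark metric):

```python
# Back-of-the-envelope: combine the stated speedup and machine-count
# ratios into a per-node throughput comparison.
speedup = 3         # Spark sorted the data ~3X faster...
machine_ratio = 10  # ...using ~10X fewer machines

# Per-node throughput scales with both factors.
per_node_gain = speedup * machine_ratio
print(per_node_gain)  # -> 30
```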
Their official entry can be found here.
A 100TB sort can be thought of, loosely, as taking the roughly 80 trillion dollars of global GDP in 2014, printed as $10 notes with unique serial numbers, and sorting them in order. The Spark team did this in under 30 minutes using 207 nodes (renting each node would have cost about $0.25/hour at Elastic MapReduce pricing), so the whole run would have cost about $25. That's uniquely arranging every $10 produced on this planet in an entire year, in under 30 minutes. If that doesn't get you thinking about tracking consumption and production at a granular level, I'm not sure what will.
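The "$25" figure can be reproduced with simple arithmetic. A minimal sketch, assuming the 207 nodes, under-30-minute runtime, and $0.25/node-hour rate quoted above:

```python
# Rough cost estimate for the 100TB sort, using the figures quoted above.
nodes = 207
hours = 0.5   # "less than 30 minutes", rounded up to half an hour
rate = 0.25   # assumed $/node-hour at the quoted EMR-style pricing

cost = nodes * hours * rate
print(cost)  # -> 25.875, i.e. about $25
```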
For a nice visual representation of why sorting 100TB may not be as easy as it sounds, click the lovely illustration by Mike Bostock below.