GraphChi: How a Mac Mini outperformed a 1,636 node Hadoop cluster

Last year GraphChi, a spin-off of GraphLab, a distributed graph-based high performance computation framework, did something remarkable. GraphChi outperformed a 1,636 node Hadoop cluster processing a Twitter graph (dataset from 2010) with 1.5 billion edges – using a single Mac Mini. The task was triangle counting and the Hadoop cluster required over 7 hours while GraphChi on the Mac Mini did it in 1 hour! The answer lies in ruthless optimisation on the algorithm and towards the benefits of single machine versus the overhead of a generic cluster setup.

Context

The first advantage of GraphChi is that it can simplify a lot of assumptions and subsequent algorithms by not dealing with distributed processing. Armed with this knowledge and understanding of the general benefits and disadvantages of single machine performance the processing steps can be designed. A single computer has usually two characteristics, firstly a large graph problem does not fit into the fast RAM (Random Access Memory), and secondly it has a large disk, which is large enough to hold the data. The traditional disks are only performant on sequential and not on random reads. Modern computers may come with solid state disks for faster random read and write though they are still significantly slower than RAM. Consequently, any algorithm aiming to solve graph problems on single machine commodity hardware has to utilise the disk and minimise the random access to data.

Divide and Conquer

Aapo Kyrola, a PhD candidate at Carnegie Melon University, took this knowledge to improve GraphLab, a distributed graph computation framework. His idea was to divide the Graph into disjoint shards, each of which fits into the machine’s RAM. The shards then can be processed in parallel in memory. Updates that have to be written to other shards are subsequently done in a sequential update. This minimises random operations on disk and exploits the machine’s RAM and parallel processing abilities.

Parallel Sliding Window processing of split graph

Parallel Sliding Window processing of split graph

Aapo invented the PSW (Parallel Sliding Window) algorithm to achieve the key performance improvement, the (nearly exclusive) sequential disk behaviour. The PSW sorts all vertices in a shard by source shards. This means that each shard in itself is split into blocks of vertices relevant to the remaining shards.

PSW toy example

PSW toy example

For example, at interval 1 (see figures above) shard 1 is processed in memory. It contains a subset of the graph vertices’ destination values. These destinations are a continuous block of the sorted source values in the remaining shards and can be read sequentially. All updates are computed and stored in place in memory for shard 1 and sequentially written back to the remaining shards updating the blocks previously read. At the end the in memory updated version of shard 1 is written sequentially to disk. At interval 2 shard 2 is loaded and the same process applied to the remaining shards again.

This approach utilises the architecture of modern commodity computers very well as some performance tests in the original paper illustrate. For example, striping data on several disks, using SSD instead of rotational disks all improve the performance but hardly ever more than two fold because the algorithm optimised away the need for high permanent storage performance. Even the increase of the number of shards has little influence on GraphChi’s throughput promising reliant performance with even larger graphs. Impressively, another sign for the efficiency of the algorithm is that moving the computation from disk to complete in memory does only improve computation time by a factor of 1.1 to 2.5 over SSD.

Comparison

GraphChi demonstrates paradigm shifting performance gains compared to alternative generic solutions like Hadoop, Spark, or even the for graph computing highly optimised GraphLab and PowerGraph. The latter is an optimised parallel distributed approach. It can solve the Twitter triangle count problem in 1.5 minutes. However, it employs 64 nodes with each 8 cores for a total of 512 cores. Roughly a performance improvement of factor 40 thus required 256 times the compute power (in cores). While there are various ways of comparing these two very different approaches and architectural requirements the take away is that a one magnitude performance improvement required a two magnitude increase in computing resources. As mentioned in the beginning of the article, Hadoop as a generic framework performs very poorly on this task.

GraphChi performance comparison

GraphChi performance comparison

Future of GraphChi

GraphLab Inc, a spin-off of this research, received $6.75 million in venture funding to develop products from the GraphLab algorithms. I had a chance of trying out their beta program focused on a cloud and web based service solution on large scale machine learning problems and I was impressed (watch this space for a future article on this). In the meantime, you can download and compile GraphChi and try it out.

Summary
Article Name
GraphChi: How a Mac Mini outperformed a 1,636 node Hadoop cluster
Description
Big Data processing can come in small packages. GraphChi (GraphLab) on a Mac Mini outperformed a huge Hadoop cluster on graph computing tasks.
Author

Leave a Reply