Let’s benchmark our Hadoop clusters (join in!)

Introduction

For our evaluations within SCAPE it would be useful to measure quantitatively the performance of the Hadoop clusters available to us, so that results from each cluster can be compared.

Fortunately, the standard Hadoop distribution includes some examples that can be run as tests. Intel has produced a benchmarking suite – HiBench – that uses those included Hadoop examples to produce a set of results.

There are various aspects of performance that can be assessed. The main ones are:

CPU-loaded workflows (e.g. file format migration), where the workflow speed is limited by the CPU processing available
I/O-loaded workflows (e.g. identification/characterisation), where the workflow speed is limited by the I/O bandwidth available

For the testing of our cluster I used HiBench 2.2.1. I made some notes about getting it to run that should be useful (see below). Apart from the one change described below in the notes, there was no need to edit or change the code.

In SCAPE testbeds we are running various workflows on various clusters. However, individual workflows tend to be run on only one cluster. Running a standard benchmark on each Hadoop installation may allow us to better compare and extrapolate results from the different testbed workflows.

Notes – These steps are only needed on the node from which HiBench is run.

JAVA_HOME is needed by some tests – I set this using “export JAVA_HOME=/usr/lib/jvm/j2sdk1.6-oracle/”.
For the kmeans test I changed the HADOOP_CLASSPATH line in “kmeans/bin/prepare.sh” to “export HADOOP_CLASSPATH=`mahout classpath | tail -1`”, as the test would not run without that change (mahout was already on the path).
The nutchindexing and bayes tests required a dictionary to be installed on the node that HiBench was started from – I installed the “wbritish-insane” package.
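
The notes above can be turned into a quick pre-flight check to run on the HiBench node before starting the tests. This is only a sketch: the check_prereq helper is hypothetical (it is not part of HiBench), the JAVA_HOME path is the one from my setup and will differ on yours, and /usr/share/dict/words as the dictionary location is an assumption about where the dictionary package puts its word list.

```shell
#!/bin/sh
# Pre-flight check for the HiBench node (a sketch, not part of HiBench).

# JAVA_HOME is needed by some tests. The default below is the path from
# the notes above; it is kept only if JAVA_HOME is not already set.
export JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/j2sdk1.6-oracle/}"

# check_prereq NAME: report whether command NAME is on the PATH.
# Hypothetical helper introduced for this example.
check_prereq() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1"
    fi
}

check_prereq hadoop
# The kmeans change relies on mahout already being on the path.
check_prereq mahout

# nutchindexing and bayes need a dictionary on this node.
# /usr/share/dict/words is an assumed location; adjust if yours differs.
if [ -r /usr/share/dict/words ]; then
    echo "ok: dictionary"
else
    echo "missing: dictionary"
fi
```

Anything reported as missing should be fixed before starting the benchmarks.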

Caveats

Some tests use fewer map/reduce slots than are available and are therefore not that useful for comparison, as we want to max out the cluster. For example, the kmeans tests only used 5 map slots.
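
To put a number on this caveat, slot usage can be expressed as a percentage of what the cluster offers. A minimal sketch follows: the slot_utilisation helper is hypothetical, the 5 used slots come from the kmeans run mentioned above, and the figure of 40 available map slots is invented purely for illustration (substitute your own cluster's total).

```shell
# slot_utilisation USED AVAILABLE: print the percentage of slots in use,
# using integer arithmetic (rounded down). Hypothetical helper, not part of HiBench.
slot_utilisation() {
    echo $(( $1 * 100 / $2 ))
}

# 5 map slots were used by kmeans; 40 available slots is an assumed figure.
slot_utilisation 5 40
```

A test that reports a low percentage here is exercising only a fraction of the cluster, so its timings say little about how the cluster performs when fully loaded.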

Results

I have created a page on the SCAPE wiki where I have put the results from our cluster: “Benchmarking Hadoop installations”. I invite and encourage you to run the same tests and add your results to the wiki page. Running the tests was much quicker than I thought it might be – it took less than a morning to set up and execute.

To get a better understanding of which benchmarks are more or less appropriate, I propose we first get some metrics from all the HiBench tests across different clusters. In future we may choose to refine or change the tests to be run, but this is the start of a process to better understand how our Hadoop clusters perform. It’s only through your participation that we will get useful results, so please join in!
