Experimenting With Basis In Hadoop Performance

I've been mentioning performance in Hadoop here previously, but I'd like to dedicate some more space to the subject since I am beginning to see that it is quite essential.

When doing experiments using Hadoop (or any cluster) as your system, there are some basic data that is needed in the process to be able to benchmark your specific system. In this post I am going to elaborate on the experiments that I have been doing on Hadoop, and my current testing scheme.

The way I see it there are two ways to measure your cluster job performance: The qualitative and quantitative approach. While the qualitative measure will consider descriptive, observed and other, kind of subjective, aspects - the quantitative methodology typically deals with numbers. While the qualitative method is certainly appreciated in Hadoop as well, quantitative measurements are often the basis of comments on a clusters performance.

Quantitative Aspects In Hadoop Performance

Cluster performance isn't really that different from measuring performance on a single node system. In my tests so far, I have been using four fundamental hardware dimensions to see how well my clusters perform: Memory, CPU I/O wait, processing and network. In Hadoop there are many other metrics that can be measured - but I've found these basic and understandable for both me and my peers.

I've find memory utilization to be important for performance in that it is possible to cache data for the processing from disk. In addition the running jobs will typically run in e.g. 2GB Java VM containers. This needs to be balanced with the total memory dedicated to MapReduce and the total number of processor cores (as mentioned in my tuning HDP Ambari post).

The CPU I/O wait is always important. In Hadoop I have the impression that this is a tiny bit less important than in single-node systems though. This rule, says that your WIO-rate shouldn't be above 1/# of cores. E.g. in a 12 core system, the %wio shouldn't be above 8% for instance. It seems to make sense in my experiments at least.

Especially when having a number of reducers (where shuffling occurs) the network becomes increasingly important to handle how Hadoop combines your records. In network packet parsing, you could be assembling TCP sessions - this will create a lot of switching records between nodes. With only maps I have been doing very well with a bandwidth of about 1Gbps, but 10-20Gbps can be an advantage. There are lots of discussions on whether to invest in 10-20Gbps infrastructure when it comes to Hadoop.

Also, balancing the four dimensions may be well as important to get the most out of your money. In my tests, the CPU utilization has been the bottleneck - and that is where I'd like it to hit. It's a question of what dimension you let control the performance.

In addition to the four hardware dimensions, I have been considering other quantifiable numbers and general runtime information as well:

  • General cluster throughput
  • Throughput per node
  • Throughput in packets per second (specialized for my case)
  • Block size
  • Run time
  • Which script was used
  • What UDF code commit was used during the experiment

Some of the above numbers are used for later being able to track down where the improvements have been made, and what changed the performance in a positive or negative way. Other numbers were used to calculate information that can be used to explain what the performance in the cluster are compared to more traditional systems. A good example is the throughput in packets per second. In this particular experiment I were using packet captures from a network. That again is often described in packets. It isn't necessarily a measure that is correct in terms of comparing single-node real time systems to Hadoop - but it will help explain the problems and how Hadoop works to the peers.

Another reason for using such descriptive information, as for instance the Git commit, is that it makes the experiment re-creatable for others building and verifying the work.

Here is an example of a performance measurement calculation:

# of maps2000
Block size5,12E+08 B
Current job size1,0E+12 B
Run time1000s
# of nodes10
Total cluster throughput1024MB/s
Throughput per node102MB/s

I use the above as a generic summary of each test. In addition I have been using metrics images from Ganglia that corresponds to the four hardware dimensions, and HDPSlaves category (makes sense if you use Ambari which comes with Ganglia preloaded).

Summarized I'd say that the above tests is not much more than what is required for a scientific foundation, and rationale, in proving if Hadoop works sufficiently for your usage. What I notice when starting to optimize a cluster is that I hit these specific road bumps that may be specific to my UDF code, or implementation. The metrics may because of that vary from cluster to cluster - and that also makes it hard to say definitely how the tests should be designed. It's a question of building the performance metrics over time, and comparing them with the evolving baseline.


Tommy is an analyst and incident handler with more than seven years of experience from the government and private industry. He holds an M.Sc. in Digital Forensics and a B.Tech. in information security