Currently preparing some internal benchmarking tests have been incredibly illuminating. It's at this point I have begun to see the shortcomings of my Hadoop configuration and how to patch it up to find the optimums in performance gain.
I noticed that some of the default values in Ambari was too low as well. I'll get back to that at the end of this post.
One of the most significant issues I've had so far is having 40 cores in the cluster available while doing some of the tests. It turned out that YARN was only allocating 23 simultaneous mappers. In theory that should be at least 40 mappers (if not having any reducers). As far as I've read you can't control the number of map/reduce slots in YARN (like you could in Hadoop 1). At the same time as only 23 mappers was initialised, I saw that all the resources were under-utilised. It was frustrating.
You might have heard many people say that you really need a lot of data to take advantage of Hadoop. Let me explain why and how it works. Hadoop works with blocks, much like the ones on your hard drive. Each block can be for instance 512MB (instead of 4kB as is normal in NTFS). In fact, if choosing a block size of 512MB, and having 10 nodes each having 8 cores, you will effectively have 80 cores available for processing your 512MB blocks. You can in theory process at least 40GB of data simultaneously then.
I noticed that there are three ways to get the interesting statistics:
- First of all, review you currently running job. First, give the cluster something to work on to see how it functions, e.g. 200GB of data. To get an overview of all jobs go to the URL specified by the node. You can also find an overview of the running jobs at http://
- Check out single datanodes to see how they work with Hadoop on a low level. Good file and tools I found were: /proc/meminfo, atop, top and iotop
- The Ambari and Ganglia monitoring dashboard. You can find Ganglia's detailed statistics at http://
As you notice in point two there, much of what you already know, such as top, can be used to track down bottlenecks. As a result of one of tips I got, I found this thread on IO-performance on Reddit. IO is often important - even though not that important in Hadoop.
Back to my initial problem, being under-utilisation of the cluster. I started speculating in what could cause Hadoop to not to use all its resources. Typically that will involve one of three: Network throughput, number of cores, IO and memory. In my search I came across this post on tuning HDP2 - YARN, at the Hortonworks blog. Being so fortunate that my cluster were having 48GB memory nodes as well I could double check my auto-configured values in Ambari directly with the post. Go figure, the
yarn.nodemanager.resource.memory-mb was only set to 5096MB. Increasing that to 40960MB spun up a number of tasks resulting in a more or less full-utilisation.
Edit 11/02/2014: One other thing that I needed to wrap my head around before I saw later that the article from Hortonworks actually mentioned it, was the
yarn.scheduler.minimum-allocation-mb. When I first started a job with the settings above, 200 jobs was spun up simultaneously. 200 jobs, equals to 40GB of memory and 1GB per job (which was the default in my Ambari configuration). That again results in that each node takes on the daunting task of processing 40 jobs. No secret that is going to be slow if running with 8 cores, each handling 5 maps at once. According to Hortonworks that number should be 1-2 containers (or jobs) per disk and core. So if having 12/12, you should set
yarn.scheduler.minimum-allocation-mb to a number around
yarn.scheduler.minimum-allocation-mb=memory/number of jobs => 40960MB/20=2048≈2GB
That should give you more like 100 simultaneous jobs if running with 5 data nodes.
The perfect scenario is when network links, memory, IO and processors are balanced and running at full utilisation. That will probably never happen though.
I also increased the HBase RegionServer's maximum Java heap size from 1024 to 1568MB. Since my Pig jobs were inserting a pretty high quantity of records it was running out of memory all the time. Speculating a bit now, it was probably not being able to insert the records fast enough.
Another thing I am looking forward to is Tez to go production stable for Pig, which should speed up things a lot. Follow that one here.
Continuing to profile performance in my cluster, I'd like to underscore that this is where you really see what you get for the money - and how it all pans out.
For later record (some are old but good):