Working with Hadoop: My first MapReduce App

Update: I removed the source code from the blog post. You can now find it on my github account.

Most Hadoop tutorials use the wordcount application as a demo application. And while this might be a good demo application, it is not particularly helpful. So i wanted to think of an idea for a more useful application to use on a cluster.

My first thought was trying to implement the famous Sieve of Eratosthenes. But it turned out that this might not be a good first application because i think there needs to be a shared state between map tasks. So its better left for another time.

So instead, i opted for the methods mentioned in Primality Test. And went with the first naive approach as an example. So i tested numbers by checking if any number preceeding them divides them.(again,no optimization)

This might seem like a simple application to do, since you can basically change a couple of things in the wordcount demo application. But you see, the first thing that i noticed while reading up on Hadoop is that most apps focus on the same file-based approaches. Which makes sense when you consider the project’s roots in indexing. And that is a limitation if you consider the possible projects that can and do make use of distributed computing such as medical research, cryptography work, and so on…

Well, while Hadoop does have built-in classes to deal easily with file based input, you can easily make it deal with any other kind of input you want. So while most apps don’t actually go deep enough to tell you how to do this, I will. Although i am still scratching the surface, i hope this would be useful to somebody out there.

So, if you just finished reading about the wordcount application, you’d think that you can the prime test by writing all the numbers that you want to check into a file and then just use that file as input to your MapReduce app. While this would work, why do the extra step of writing all the numbers to a file and then reading from that file again. We’re already have to generate the numbers, so just hand out those numbers to the Map tasks.

From this point on, I assume that you’ve already read the tutorial that comes with Hadoop, so I won’t go into details on anything basic.

In a nutshell, to make the mapping tasks acccept generated input, you have to define a new type of input format. You do that by develping a class that implements InputFormat. With that, you also have to implement a RecordReader and an InputSplit.

In the end, you will have the following files:

PrimeTest.java: Contains the MapReduce procedures.
PrimeTestInputFormat.java: Defines the input format and provides the input splits for the map tasks and also defines the record reader.
PrimeInputSplit.java: Defines a split in the input data for each map task.
PrimeRecordReader.java: Responsible for returning <Key, Value> records from bye-based input splits.

This application will use a record <K,V> of <long, long> type with both the key and value being the prime number to test. And the intermediate record <K', V'> from the map tasks will be of types <boolean, <Long….» with the key being a true/false primality result. The result record <K'', V''> from the reduce tasks will be of type <boolean, Long> and is again the number keyed by primality result. Not really the best of use for records, but for out specific example, that’s all we need.

Starting with the main file, In PrimeTest.java, we define the map/reduce procedures.

The map procedure is a direct implementation of the prime test algorithm. And so is quite straightforward. So is our reduce procedure; just placing the values in the output collector.

You’ll notice that we define our own input format and that we do not need to define an input path since our input is generated.

Now we need to implement the input format class. And it would look like: The only interesting part in this class is the getSplits() method which splits the input data to feed the map tasks. One thing to note here is that the “from “and “to” numbers are hardcoded while it would have been better to place them as properties and retrieve them via the JobConf. Also, the input generation is a naive one and has many assumptions.

Looking at the PrimeTestInputFormat class, you’ll also notice that we’re referncing a PrimeInputSplit class and a PrimeRecordReader. So we also need to implement those.

As mentioned earlier, the PrimeInputSplit class represents a byte view to a particular data split. In our case, this class will hold the starting and ending numbers for each split. So it could like this:

Next, we implement the record reader which translates the byte view input split into a record for the map task.

And that’s about it. Just jar up the application and deploy it on Hadoop.

There are a couple of interesting things to try here. First up would be to implement some optimizations to the prime testing function. Another thing would be to implement a better way to generate input splits to accommodate sparse data for example. Another interesting thing that I’d like to try sometime is do a multi-threaded application with the same functionality and then compare its performance to a small non-uniform cluster. Ofcourse the overhead of Hadoop on a single node cluster would mean that the local app would be faster so i am interested in observing how much this overhead affects performance so spreading it out over a couple of machines should balance things out. At least in theory…so we’ll see.

Comments, feedback are welcome.