Working with Hadoop: My first MapReduce App

Update: I removed the source code from the blog post. You can now find it on my github account.

Most Hadoop tutorials use the wordcount application as a demo application. And while this might be a good demo application, it is not particularly helpful. So i wanted to think of an idea for a more useful application to use on a cluster.

My first thought was trying to implement the famous Sieve of Eratosthenes. But it turned out that this might not be a good first application because i think there needs to be a shared state between map tasks. So its better left for another time.

So instead, i opted for the methods mentioned in Primality Test. And went with the first naive approach as an example. So i tested numbers by checking if any number preceeding them divides them.(again,no optimization)

This might seem like a simple application to do, since you can basically change a couple of things in the wordcount demo application. But you see, the first thing that i noticed while reading up on Hadoop is that most apps focus on the same file-based approaches. Which makes sense when you consider the project’s roots in indexing. And that is a limitation if you consider the possible projects that can and do make use of distributed computing such as medical research, cryptography work, and so on…

Well, while Hadoop does have built-in classes to deal easily with file based input, you can easily make it deal with any other kind of input you want. So while most apps don’t actually go deep enough to tell you how to do this, I will. Although i am still scratching the surface, i hope this would be useful to somebody out there.

So, if you just finished reading about the wordcount application, you’d think that you can the prime test by writing all the numbers that you want to check into a file and then just use that file as input to your MapReduce app. While this would work, why do the extra step of writing all the numbers to a file and then reading from that file again. We’re already have to generate the numbers, so just hand out those numbers to the Map tasks.

From this point on, I assume that you’ve already read the tutorial that comes with Hadoop, so I won’t go into details on anything basic.

In a nutshell, to make the mapping tasks acccept generated input, you have to define a new type of input format. You do that by develping a class that implements InputFormat. With that, you also have to implement a RecordReader and an InputSplit.

In the end, you will have the following files:

  • Contains the MapReduce procedures.
  • Defines the input format and provides the input splits for the map tasks and also defines the record reader.
  • Defines a split in the input data for each map task.
  • Responsible for returning <Key, Value> records from bye-based input splits.

This application will use a record <K,V> of <long, long> type with both the key and value being the prime number to test. And the intermediate record <K’, V’> from the map tasks will be of types <boolean, <Long….>> with the key being a true/false primality result. The result record <K”, V”> from the reduce tasks will be of type <boolean, Long> and is again the number keyed by primality result. Not really the best of use for records, but for out specific example, that’s all we need.

Starting with the main file, In, we define the map/reduce procedures.

The map procedure is a direct implementation of the prime test algorithm. And so is quite straightforward. So is our reduce procedure; just placing the values in the output collector.

You’ll notice that we define our own input format and that we do not need to define an input path since our input is generated.

Now we need to implement the input format class. And it would look like:
The only interesting part in this class is the getSplits() method which splits the input data to feed the map tasks. One thing to note here is that the “from “and “to” numbers are hardcoded while it would have been better to place them as properties and retrieve them via the JobConf. Also, the input generation is a naive one and has many assumptions.

Looking at the PrimeTestInputFormat class, you’ll also notice that we’re referncing a PrimeInputSplit class and a PrimeRecordReader. So we also need to implement those.

As mentioned earlier, the PrimeInputSplit class represents a byte view to a particular data split. In our case, this class will hold the starting and ending numbers for each split. So it could like this:

Next, we implement the record reader which translates the byte view input split into a record for the map task.

And that’s about it. Just jar up the application and deploy it on Hadoop.

There are a couple of interesting things to try here. First up would be to implement some optimizations to the prime testing function. Another thing would be to implement a better way to generate input splits to accommodate sparse data for example. Another interesting thing that I’d like to try sometime is do a multi-threaded application with the same functionality and then compare its performance to a small non-uniform cluster. Ofcourse the overhead of Hadoop on a single node cluster would mean that the local app would be faster so i am interested in observing how much this overhead affects performance so spreading it out over a couple of machines should balance things out. At least in theory…so we’ll see.

Comments, feedback are welcome.

12 replies on “Working with Hadoop: My first MapReduce App”

  1. Thanks a ton. I am a newbe to Hadoop MapReduce and was looking for an example for custom InputFormat.


  2. Hi,
    it did help. It was not easy to find an example for a custom format.
    I actually was looking for a simple LongWritable,LongWritable instead of the preset LongWritable, Text that the KeyValueTextInputFormat forces to use, so I got into hadoop code and copied/adapted the KeyValueTextInputFormat and KeyValueLineRecordReader.
    It was easy, but that allowed to make simpler the specification of formats. Othewise I had to convert from Text to LongWritable all the time in the hadoop code which is ugly and confusing for the reader (as it happened to me first! Didnt understand why Longs were specified as Text, so I though it was another thing!)

  3. THANK YOU !!!!
    I need to use Hadoop MapReduce for a school project and I’m glad I’ve finally found a normal example. The official documentation blabbers on and on for hundreds of pages and countless video hours about how MapReduce works and how endless is the choice of programming languages and scripts that can be used. Yet there isn’t a single piece of useful info other than that word counter which lacks too many features.

  4. Hi.. Nice article.. I have a few doubts, in, there is a int numSplits argument. what factor decides the value of the argument ?

  5. Thanks a lot 🙂 Although, I’m a little suspicious about:

    //add last split, with remainder if any // Is there an error?? start+remainder
    splits.add(new PrimeInputSplit(startingNumberInSplit,
    endingNumberInSplit + remainderInLastSplit));

    because endingNumerInSplit is already updated in the loop above it to be starting+NumInSplit … So that line should be:

    splits.add(new PrimeInputSplit(startingNumberInSplit,
    startingNumberInSplit + remainderInLastSplit));


    Anyways, thanks again, I’ve been looking for this for days !!

  6. another question …

    What happens if you write: return splits.toArray() ;
    instead of : return splits.toArray(new PrimeInputSplit[splits.size()]);

    in the getSplits of the PrimeTextInputSplit class ?

  7. hi ,
    I am using hadoop 0.20 version and java 1.6.
    I need to process xml via mapreduce program. But i am having issues.
    I have put JDOM 1.0 jar but still ClassNotFound Exception of JDOM Exception and SAX Builder coming pl help to fix the issue.

  8. is it possible to edit the above program to read from a file,instead of generating a numbers

Comments are closed.