Monday, September 16, 2013

How to write a MapReduce program in Java

In this blog post I am going to show you how to write a MapReduce job, and how we can apply 3 different approaches to get the same output.

For the sake of running the example, I didn't have data to run a MapReduce job on, so I wrote a program to create the sample data. The benefit of this approach is that I will be able to verify the output after running the MapReduce job.

What does my program do?
In my program I have taken the following 9 different strings:
                    "this is first string",
                    "got it.......................... yeah",
                    "sorry...........................not found",
                    "find me if you can......................",
                    "still not found......................",
                    "okay................try moreeeeee",
                    "work harder...........................",
                    "done..... not found",
                    "try and get it.................."

My program writes random combinations of these strings, together with a timestamp and the name of the class, until the total log size reaches the number of bytes you provide as input.

While writing these combinations of strings to the log files, I count the occurrences of the 2nd string ("got it.......................... yeah") and the last string ("try and get it..................").
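
The real ProduceData source is available on GitHub (as mentioned below); purely as an illustration of the idea, and not the actual implementation, the generator does something along these lines (the class name, the use of java.util.logging, and the byte accounting are my assumptions; the real jar also takes the target directory and number of files as arguments and writes the records into log files there):

import java.util.Random;
import java.util.logging.Logger;

// Rough, illustrative sketch only: picks random sample strings and writes them
// through a logger (which prefixes each record with a timestamp and the
// class/method name) until the requested size is reached, while counting how
// often the 2nd and the last sample string were written.
public class ProduceDataSketch {

    private static final Logger LOGGER = Logger.getLogger(ProduceDataSketch.class.getName());

    private static final String[] SAMPLES = {
            "this is first string",
            "got it.......................... yeah",
            "sorry...........................not found",
            "find me if you can......................",
            "still not found......................",
            "okay................try moreeeeee",
            "work harder...........................",
            "done..... not found",
            "try and get it.................."
    };

    public static void main(String[] args) {
        long targetBytes = Long.parseLong(args[0]); // desired log size in bytes (the real jar takes directory, size and file count)
        Random random = new Random();
        long written = 0;
        long hits = 0;        // occurrences of the 2nd string
        long almostHits = 0;  // occurrences of the last string

        while (written < targetBytes) {
            String line = SAMPLES[random.nextInt(SAMPLES.length)];
            if (line.equals(SAMPLES[1])) {
                hits++;
            } else if (line.equals(SAMPLES[SAMPLES.length - 1])) {
                almostHits++;
            }
            LOGGER.info(line); // the timestamp and class/method name are added by the logger
            written += line.length();
        }

        System.out.println("Total number of hits: " + hits);
        System.out.println("Total number of almost hits: " + almostHits);
    }
}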

To create the sample data, open a command prompt on your Linux machine and follow the steps below:

1. Run the command: "wget https://www.dropbox.com/s/ifm9xcw6rh855ps/ProduceData.jar"

Now you should have "ProduceData.jar" in the current directory. Change the permissions of the jar to make it executable if it is not already (command: chmod +x ProduceData.jar).

Note: alternatively, you can go to this URL and download the jar manually.

2. Now, to produce the data, run the command below:
./ProduceData.jar <directory in which to create the data> <size of each file in bytes> <number of files to create>

for example: 
./ProduceData.jar /home/deepak/Desktop/test 10000000 4

At the end of the command's execution you will see a summary printed on the console.

Look at this output carefully: the last 2 lines are exactly the numbers we expect to get back after running the MapReduce job on the data.

In my example it is : 

Total number of hits: 89044, i.e., the total count of the string "got it.......................... yeah" in all log files
Total number of almost hits: 88690, i.e., the total count of the string "try and get it.................." in all log files

So our job is to find the total count of the string "got it.......................... yeah" in all the log files, which is 89044 in my case.

You can also download the source code of my program from GitHub.

Now we have the data, and we need to write MapReduce jobs to find the count of the string mentioned above.

If you don't have a development environment, you can set one up by following my other blog post. Once your development environment is up, we are ready to write the MapReduce job.

To solve this particular problem I have thought of 3 different approaches, for the sake of showing Hadoop's capabilities. I will explain each approach below.

1st approach: 
This is a very simple approach in which each line of the log files is fed to the mappers as a separate record. The map function reads each line and checks whether it matches the required string. This approach takes more time, because the number of records is very large and each record is very small. For those who are not familiar with input splits and record readers, see the explanation below.

Hadoop relies on the input format of the job to do three things:
1. Validate the input configuration for the job (i.e., checking that the data is there).
2. Split the input blocks and files into logical chunks of type InputSplit, each of which is assigned to a map task for processing.
3. Create the RecordReader implementation to be used to create key/value pairs from the raw InputSplit. These pairs are sent one by one to their mapper.


A RecordReader uses the data within the boundaries created by the input split to generate key/value pairs. In the context of file-based input, the "start" is the byte position in the file where the RecordReader should start generating key/value pairs. The "end" is where it should stop reading records. These are not hard boundaries as far as the API is concerned; there is nothing stopping a developer from reading the entire file for each map task. While reading the entire file is not advised, reading outside of the boundaries is often necessary to ensure that a complete record is generated.

2nd approach:
If you look closely, you will see that each block written to the log starts with a timestamp that follows a fixed pattern. We can define our input splits in such a way that each record starts with this date format. In this case we need to define a custom RecordReader that uses the pattern as the record delimiter.


3rd approach:
In this approach I will use a fixed delimiter. If you look at the logs, each block in the log file contains the string "com.deepak.utils.ProduceData main". We can use this string as the record delimiter, so that it acts as the starting point of each record in our input split.

There can be other approaches as well, but I have explained only these 3 here, because these 3 examples will give you a fair idea of how we can write MapReduce jobs.

I will explain the 1st approach here; the 2nd and 3rd approaches will be covered in a separate blog post.

1st approach explained with solution:

Please follow my other blog post to create the MapReduce project in the Eclipse IDE together with the Mapper and Reducer classes.

Your Mapper class will look like the example below:
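
Here is a minimal sketch of such a Mapper, assuming the old org.apache.hadoop.mapred API (which matches the mapred.JobClient lines in the console output later in this post); the class name HitCountMapper is only illustrative. It emits a ("got it.......................... yeah", 1) pair for every log line that contains the string we are counting:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits (TARGET, 1) for every log line that contains the string we are counting.
public class HitCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final String TARGET = "got it.......................... yeah";
    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text(TARGET);

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // key is the byte offset of the line in the file, value is the line itself
        if (value.toString().contains(TARGET)) {
            output.collect(outKey, ONE);
        }
    }
}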

Your Reducer class will look like the example below:
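
Again a minimal sketch under the same assumptions (old mapred API, illustrative class name). It simply sums up the 1s emitted by the mappers for each key:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sums up all the 1s emitted by the mappers for each key, giving the total count.
public class HitCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}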

Your Test class (the job driver) will look like the example below:
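
A minimal driver sketch under the same assumptions (old mapred API; the class name HitCountJob is illustrative). It takes the input directory as the 1st argument and the output directory as the 2nd argument, wires the Mapper and Reducer together, and submits the job:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

// Configures and submits the job: args[0] = input directory, args[1] = output directory.
public class HitCountJob {

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(HitCountJob.class);
        conf.setJobName("count-got-it-hits");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(HitCountMapper.class);
        conf.setReducerClass(HitCountReducer.class);

        // Default line-oriented input: each line of the logs becomes one record for the mapper.
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}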

Once you have created the above classes and set up the Hadoop environment as suggested, you will be ready to run your MapReduce job.

To run the job, click on Run -> Run Configurations...
In the Arguments tab, provide the input directory (where you generated the log files) followed by the output directory, like this:
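
For example (the input path is the one used with ProduceData.jar above; the output path is only an illustration, any new directory will do):

/home/deepak/Desktop/test /home/deepak/Desktop/output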

Now click on Run.

If everything is fine, your MapReduce program will run and print the job progress and counters on the Eclipse console.

Look at the last line of the output on the console, which says:
"13/09/16 16:40:16 INFO mapred.JobClient:     Map output records=89044" 

You can also verify it from the file "part-00000" in the output directory that you passed as the 2nd argument to your program, which says:
"got it.......................... yeah 89044"

As you can see, this is consistent with the output produced by the ProduceData jar.

Thank you for following this post.

In the next blog post I will explain the 2nd and 3rd approaches for solving the same problem.

If you face any difficulty or want to suggest something to improve this post, please comment or send me an email.

