Hadoop Beginner's Guide

Time for action – fixing WordCount to work with a combiner

Let's make the necessary modifications to WordCount to correctly use a combiner.

Copy WordCount2.java to a new file called WordCount3.java and change the reduce method as follows:

public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException
{
    int total = 0;
    for (IntWritable val : values)
    {
        total += val.get();
    }
    context.write(key, new IntWritable(total));
}

Remember to also change the class name to WordCount3 and then compile, create the JAR file, and run the job as before.
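As a reminder, the driver needs no changes beyond the class name; a minimal sketch of what it might look like follows. The mapper and reducer class names here are illustrative placeholders, and the sketch assumes the combiner is registered with Job.setCombinerClass() as in the earlier examples; it requires a Hadoop installation to compile and run.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount3
{
    // Mapper and reducer inner classes omitted for brevity;
    // the reduce method is the corrected version shown above.

    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count 3");
        job.setJarByClass(WordCount3.class);
        job.setMapperClass(WordCountMapper.class);
        // The fixed reduce implementation is now safe to use in both roles.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that the same class serves as both combiner and reducer; this is only valid because the reduce logic now produces output of the same type and meaning as its input.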

What just happened?

The output is now as expected. Any map-side invocations of the combiner now execute successfully, and the reducer correctly produces the overall output value.

Tip

Would this have worked if the original reducer were used as the combiner and the new reduce implementation as the reducer? The answer is no, though our test example would not have demonstrated it. Because the combiner may be invoked multiple times on the map output data, the original reducer's error would have corrupted the map output on a sufficiently large dataset; it simply never surfaced here because of the small input size. Fundamentally, the original reducer was incorrect, but this wasn't immediately obvious; watch out for such subtle logic flaws. This sort of issue can be very hard to debug, as the code will reliably work on a development box with a subset of the dataset and then fail on the much larger operational cluster. Craft your combiner classes carefully and never rely on testing that processes only a small sample of the data.
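The reason the summing implementation is safe as a combiner is that applying it to partial groups and then re-applying it to the partial results gives the same answer as a single pass over all the values. A plain-Java sketch of that property (outside Hadoop, with a hypothetical helper standing in for the reduce logic):

```java
import java.util.Arrays;

public class CombinerSafety
{
    // Hypothetical stand-in for the reduce logic: sum a group of counts.
    public static int sum(int[] values)
    {
        int total = 0;
        for (int val : values)
        {
            total += val;
        }
        return total;
    }

    public static void main(String[] args)
    {
        int[] counts = {3, 1, 4, 1, 5};

        // A single reduce pass over all the values for a key.
        int direct = sum(counts);

        // The combiner runs once per map task on a partial group,
        // then the reducer sums the partial results.
        int partial1 = sum(Arrays.copyOfRange(counts, 0, 2)); // 3 + 1
        int partial2 = sum(Arrays.copyOfRange(counts, 2, 5)); // 4 + 1 + 5
        int combined = sum(new int[] {partial1, partial2});

        // Summation commutes with grouping, so both paths agree.
        System.out.println(direct == combined);
    }
}
```

A reduce operation that lacks this property, such as one whose output is not the same kind of value as its input, would give different answers depending on how many times the combiner happened to run.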

Reuse is your friend

In the previous section we took the existing job class file and made changes to it. This is a small example of a very common Hadoop development workflow: use an existing job file as the starting point for a new one. Even if the actual mapper and reducer logic is very different, it's often a timesaver to start from an existing working job, as this helps you remember all the required elements of the mapper, reducer, and driver implementations.

Pop quiz – MapReduce mechanics

Q1. What do you always have to specify for a MapReduce job?

  1. The classes for the mapper and reducer.
  2. The classes for the mapper, reducer, and combiner.
  3. The classes for the mapper, reducer, partitioner, and combiner.
  4. None; all classes have default implementations.

Q2. How many times will a combiner be executed?

  1. At least once.
  2. Zero or one times.
  3. Zero, one, or many times.
  4. It's configurable.

Q3. You have a mapper that for each key produces an integer value and the following set of reduce operations:

  • Reducer A: outputs the sum of the set of integer values.
  • Reducer B: outputs the maximum of the set of values.
  • Reducer C: outputs the mean of the set of values.
  • Reducer D: outputs the difference between the largest and smallest values in the set.

Which of these reduce operations could safely be used as a combiner?

  1. All of them.
  2. A and B.
  3. A, B, and D.
  4. C and D.
  5. None of them.