Building an inverted index with a Hadoop MapReduce job: the Reducer does not work as expected

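I am running a Hadoop MapReduce job on Google Cloud to build an inverted index.
There are 5 input files. In each line the first token is the document ID and the rest is the document content. The ID and the content are separated by a tab, and the words in the content are separated by spaces, like this:

[Image: sample input]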

What I want is to find, for each word, which documents it appears in and how often. After the map phase the output should look like this:

[Image: expected map output]

After the reduce phase I should get this result:

[Image: expected reduce output]

My code is as follows:

import java.io.IOException;
import java.util.StringTokenizer;
import java.util.Map;
import java.util.HashMap;
import java.lang.StringBuilder;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BuildIndex {

  public static class IndexMapper extends Mapper<Object, Text, Text, Text>{

    private Text word = new Text();
    private Text docID = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      String[] pair = value.toString().split("\t");
      docID.set(pair[0]);
      StringTokenizer itr = new StringTokenizer(pair[1]);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, docID);
      }
    }
  }

  public static class IndexReducer extends Reducer<Text,Text,Text,Text> {
    private Text result = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
      Map<String, Integer> map = new HashMap<>();
      for (Text val : values) {
        Integer counter = map.get(val.toString());
        if (counter == null) {
          counter = 1;
        } else {
          counter += 1;
        }
        map.put(val.toString(), counter);
      }
      StringBuilder sb = new StringBuilder();
      for (String s : map.keySet()) {
        sb.append(s + ":" + map.get(s) + " ");
      }
      result.set(sb.toString());
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "build inverted index");
    job.setJarByClass(BuildIndex.class);
    job.setMapperClass(IndexMapper.class);
    job.setReducerClass(IndexReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

But the output I actually get is this:

[Image: actual output]

It looks as if the reducer did not run at all. Why is that?

Answer

憶當年

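A single map and reduce cannot do this. You need to add a combiner in between.
Let's walk through the idea.
Suppose the file a.txt contains {tom jack tom jack rose tom}.
Then the map output is
<tom,1> <jack,1> <tom,1> <jack,1> <rose,1> <tom,1>
The aggregation is then done in reduce. The reducer receives this data:
<tom,[1,1,1]><jack,[1,1]><rose,[1]>
The reducer outputs <tom,3><jack,2><rose,1>, and the counting is done. This is of course the ordinary word-count approach.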
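The ordinary word-count flow above can be sketched in plain Java without any Hadoop dependency. This is only an illustration of the data flow, not Hadoop code: the class name WordCountSketch and its two methods are hypothetical, and the grouping that Hadoop performs during the shuffle is folded into the reduce step here.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch of the word-count data flow (not Hadoop API).
public class WordCountSketch {

    // "Map" phase: emit one (word, 1) pair per token in the line.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // "Shuffle" + "reduce" phases: group the pairs by word and sum the ones.
    public static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Same sample data as a.txt in the answer.
        System.out.println(reduce(map("tom jack tom jack rose tom")));
    }
}
```

Running main on the a.txt sample yields the counts tom=3, jack=2, rose=1, matching the reducer output described above.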

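What you want now is to count how many times each word occurs in file a, file b, and so on.
Suppose there are two files, a.txt and b.txt.
a.txt contains {tom jack tom jack rose tom}
b.txt contains {google apple tom google rose}
(I have replaced the document IDs with file names.)
The map output then looks like this: <tom:a.txt,1> <jack:a.txt,1> <tom:a.txt,1><jack:a.txt,1> <rose:a.txt,1>

           <google:b.txt,1><apple:b.txt,1><tom:b.txt,1>....

If this data goes straight to the reducer, the reducer cannot aggregate per word, because the keys differ: the same word shows up under keys such as tom:a.txt and tom:b.txt.
To solve this, do not send the map output directly to the reducer. Add a combiner layer in between to aggregate the data.
The combiner receives <tom:a.txt,[1,1,1]> and <tom:b.txt,[1]>.
The combiner sums the values and splits the key, producing <tom, a.txt:3> and <tom, b.txt:1>. Now the keys are the same, so the reducer can collect them per word.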
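The steps above (composite key in the mapper, summing and re-keying in the combiner, joining in the reducer) can be simulated in plain Java with the same a.txt and b.txt sample data. This is a sketch of the data flow only and does not use the Hadoop API. The class name InvertedIndexSketch and its methods are hypothetical, and splitting on the last ':' assumes file names contain no colon.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical stand-alone simulation of the mapper -> combiner -> reducer flow.
public class InvertedIndexSketch {

    // Mapper + combiner summing: count each composite "word:file" key.
    public static Map<String, Integer> countPerFile(Map<String, String> files) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            for (String word : file.getValue().trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                counts.merge(word + ":" + file.getKey(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Combiner re-keying + reducer: split "word:file" and join "file:count" per word.
    public static Map<String, String> buildIndex(Map<String, Integer> counts) {
        Map<String, StringJoiner> postings = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int idx = e.getKey().lastIndexOf(':'); // assumes no ':' in file names
            String word = e.getKey().substring(0, idx);
            String file = e.getKey().substring(idx + 1);
            postings.computeIfAbsent(word, w -> new StringJoiner(" "))
                    .add(file + ":" + e.getValue());
        }
        Map<String, String> index = new LinkedHashMap<>();
        postings.forEach((word, joiner) -> index.put(word, joiner.toString()));
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("a.txt", "tom jack tom jack rose tom");
        files.put("b.txt", "google apple tom google rose");
        buildIndex(countPerFile(files))
                .forEach((word, posting) -> System.out.println(word + "\t" + posting));
    }
}
```

With this data the entry for tom is "a.txt:3 b.txt:1". In a real job the order of values reaching a reducer is not guaranteed, and Hadoop treats a combiner as an optional optimization that may run zero or more times, so a job that relies on the combiner to re-key its input needs care.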
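The code is below. Note: [I used the file name, not the ID at the start of each line. To use the ID you would have to swap the file name for it. This approach has some problems, which you can try out yourself. I will add to this answer when I find a fix.]
Mapper class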

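import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WCMapper extends Mapper<LongWritable,Text,Text,Text>{

    Text text = new Text();
    Text val = new Text( "1" );

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        String line = value.toString();
        String [] strings = line.split( " " ); // split the line on spaces

        // The input split this line belongs to (assumes a file-based input format)
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        String fileName = fileSplit.getPath().getName(); // file name taken from the split

        for (String s : strings){
            if (s.isEmpty()) continue; // consecutive spaces produce empty tokens
            text.set(s + ":" + fileName); // composite key, e.g. tom:a.txt
            context.write(text,val);
        }
    }
}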

Combiner class

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WCCombiner extends Reducer<Text,Text,Text,Text> {
    Text text = new Text( );
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {

        // The mapper passes in pairs such as <apple:2, 1> <google:2, 1>

        int sum = 0; // word frequency for this word:file key
        for (Text val : values){
            sum += Integer.parseInt(val.toString());
        }

        // Split the key at the last ':' so a word that itself contains ':' stays intact
        int index = key.toString().lastIndexOf( ":" );
        text.set(key.toString().substring( index + 1 ) + ":" + sum); // value ---> 2:1
        key.set( key.toString().substring( 0,index )); // key --> apple
        context.write( key,text );
    }
}

Reducer class:

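import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WCReduce  extends Reducer<Text,Text,Text,Text>{

    Text result = new Text(  );

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {

        // Join all "file:count" values for this word, separated by spaces
        StringBuilder file = new StringBuilder();

        for (Text t : values){
            if (file.length() > 0) file.append( " " );
            file.append( t.toString() );
        }

        result.set( file.toString() );

        context.write( key,result);
    }
}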

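Note [when connecting from a local machine to Hadoop running on Linux]: set HADOOP_USER_NAME on the local machine, for example by adding export HADOOP_USER_NAME=hdfs to a shell profile such as /etc/profile (an export line has no effect in /etc/hosts).
a.txt and b.txt contain only words separated by spaces. You can make up some test data and try it.
Here is the test result.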

August 14, 2018, 01:24