#!/usr/bin/python
import sys

# Read input from standard input
for myline in sys.stdin:
    # Remove whitespace on either side
    myline = myline.strip()
    # Break the line into words
    words = myline.split()
    # Iterate over the words list
    for myword in words:
        # Write the results to standard output
        print('%s\t%s' % (myword, 1))

Make sure this file has execute permission (use chmod +x /home/expert/hadoop-1.2.1/mapper.py).

Reducer Phase Code

#!/usr/bin/python
import sys

current_word = ""
current_count = 0
word = ""

# Read input from standard input
for myline in sys.stdin:
    # Remove whitespace on either side
    myline = myline.strip()
    # Split the input we got from mapper.py
    word, count = myline.split('\t', 1)
    # Convert count variable to integer
    try:
        count = int(count)
    except ValueError:
        # Count was not a number, so silently ignore this line
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # Write result to standard output
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Do not forget to output the last word if there is one!
if current_word:
    print('%s\t%s' % (current_word, current_count))

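Save the mapper and reducer code as mapper.py and reducer.py in the Hadoop home directory. Make sure both files have execute permission (use chmod +x mapper.py and chmod +x reducer.py). Because Python is sensitive to case and indentation, the same code is also offered for download on the source page.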
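Before submitting anything to a cluster, the word-count logic can be checked locally. The following is a minimal sketch, assuming nothing beyond the Python standard library: `map_words` and `reduce_counts` are hypothetical helper names that mirror the mapper and reducer logic above, and `sorted()` stands in for the sort by key that Hadoop performs between the two phases.

```python
# Local dry run of the word-count flow; no Hadoop is involved.
# map_words and reduce_counts are hypothetical names for illustration.

def map_words(lines):
    """Emit one 'word<TAB>1' line per word, like mapper.py."""
    for line in lines:
        for word in line.strip().split():
            yield '%s\t%s' % (word, 1)

def reduce_counts(sorted_lines):
    """Sum the counts of consecutive identical words, like reducer.py."""
    current_word, current_count = None, 0
    for line in sorted_lines:
        word, count = line.split('\t', 1)
        count = int(count)
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield '%s\t%s' % (current_word, current_count)
            current_word, current_count = word, count
    # Flush the last word, if any input was seen
    if current_word is not None:
        yield '%s\t%s' % (current_word, current_count)

sample = ["the quick brown fox", "the lazy dog the end"]
mapped = list(map_words(sample))
shuffled = sorted(mapped)  # stand-in for Hadoop's sort by key
result = list(reduce_counts(shuffled))
for line in result:
    print(line)
```

If the counts look right on a small sample like this, the same logic in mapper.py and reducer.py is more likely to behave as expected on the cluster.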

Execution of the WordCount Program

$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
   -input input_dirs \
   -output output_dir \
   -mapper <path>/mapper.py \
   -reducer <path>/reducer.py

Here "\" is used for line continuation to keep the command readable.

For example,

./bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -input myinput -output myoutput -mapper /home/expert/hadoop-1.2.1/mapper.py -reducer /home/expert/hadoop-1.2.1/reducer.py

How Streaming Works

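In the example above, both the mapper and the reducer are Python scripts that read their input from standard input and write their output to standard output. The streaming utility creates a Map/Reduce job, submits it to an appropriate cluster, and monitors its progress until it completes.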

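When a script is specified for the mapper, each mapper task launches the script as a separate process when the mapper is initialized. As the mapper task runs, it converts its input into lines and feeds those lines to the standard input (STDIN) of the process. Meanwhile, the mapper collects the line-oriented output from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key, and the rest of the line (excluding the tab character) is the value. If the line contains no tab character, the entire line is treated as the key and the value is null. This behavior can be customized as needed.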
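The default key/value rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not Hadoop's own code: `split_key_value` is a hypothetical helper name, and the null value is represented here as `None`.

```python
def split_key_value(line):
    """Split one line of streaming output at the first tab.

    Everything before the first tab is the key and the rest is the value.
    A line without a tab becomes (whole line, None).
    """
    line = line.rstrip('\n')
    key, sep, value = line.partition('\t')
    if not sep:
        # No tab found: the entire line is the key, the value is null
        return line, None
    return key, value

print(split_key_value("hadoop\t1"))
print(split_key_value("a\tb\tc"))
print(split_key_value("no tab here"))
```

Only the first tab separates key from value, so any later tabs stay inside the value.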

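When a script is specified for the reducer, each reducer task launches the script as a separate process when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds those lines to the standard input (STDIN) of the process. Meanwhile, the reducer collects the line-oriented output from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key, and the rest of the line (excluding the tab character) is the value. This behavior can be customized as needed.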
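Because the reducer process receives its key/value pairs as tab-separated lines that are already sorted by key, all lines for one key arrive consecutively. One way to exploit this, shown here as an alternative sketch rather than the reducer used in this tutorial, is `itertools.groupby`. The names `parse` and `reduce_sorted` are hypothetical, and the input list stands in for `sys.stdin`.

```python
from itertools import groupby

def parse(lines):
    """Turn 'key<TAB>value' lines into (key, value) tuples."""
    for line in lines:
        key, _, value = line.rstrip('\n').partition('\t')
        yield key, value

def reduce_sorted(lines):
    """Sum the values of each run of identical keys.

    Relies on the input being sorted by key, as the reducer's input is.
    """
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield '%s\t%d' % (key, total)

# Stand-in for the lines a reducer process would read from STDIN
reducer_input = ["apple\t1\n", "apple\t1\n", "pear\t1\n"]
output = list(reduce_sorted(reducer_input))
for line in output:
    print(line)
```

In a real reducer script, `sys.stdin` would be passed in place of `reducer_input`. Unlike reducer.py above, this sketch does not skip lines whose count is not a number.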

Important Commands

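Parameter	Description
-input directory/file-name	Input location for the mapper. (Required)
-output directory-name	Output location for the reducer. (Required)
-mapper executable or script or JavaClassName	Mapper executable. (Required)
-reducer executable or script or JavaClassName	Reducer executable. (Required)
-file file-name	Makes the mapper, reducer, or combiner executable available locally on the compute nodes.
-inputformat JavaClassName	The class you supply should return key/value pairs of the Text class. If not specified, TextInputFormat is used as the default.
-outputformat JavaClassName	The class you supply should take key/value pairs of the Text class. If not specified, TextOutputFormat is used as the default.
-partitioner JavaClassName	Class that determines which reducer a key is sent to.
-combiner streamingCommand or JavaClassName	Combiner executable for the map output.
-cmdenv name=value	Passes an environment variable to the streaming commands.
-inputreader	For backward compatibility: specifies a record reader class (instead of an input format class).
-verbose	Verbose output.
-lazyOutput	Creates output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write).
-numReduceTasks	Specifies the number of reducers.
-mapdebug	Script to call when a map task fails.
-reducedebug	Script to call when a reduce task fails.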
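As a rough illustration of how these options combine, the sketch below assembles a streaming command line as a Python argument list. `build_streaming_command` is a hypothetical helper written for this example, not part of Hadoop. It only builds the list and does not run anything. The paths are the ones used in the example earlier on this page, and the reducer count of 2 is an arbitrary assumption.

```python
def build_streaming_command(jar, input_path, output_path, mapper, reducer,
                            files=(), num_reduce_tasks=None):
    """Assemble the argument list for a Hadoop Streaming job.

    Hypothetical helper: uses only options listed in the table above.
    """
    argv = ['bin/hadoop', 'jar', jar,
            '-input', input_path,
            '-output', output_path,
            '-mapper', mapper,
            '-reducer', reducer]
    # Optional: ship extra files to the compute nodes
    for path in files:
        argv += ['-file', path]
    # Optional: set the number of reducers
    if num_reduce_tasks is not None:
        argv += ['-numReduceTasks', str(num_reduce_tasks)]
    return argv

cmd = build_streaming_command(
    'contrib/streaming/hadoop-streaming-1.2.1.jar',
    'myinput', 'myoutput',
    '/home/expert/hadoop-1.2.1/mapper.py',
    '/home/expert/hadoop-1.2.1/reducer.py',
    num_reduce_tasks=2)  # 2 is an arbitrary value for illustration
print(' '.join(cmd))
```

Keeping the command as a list makes it straightforward to pass to `subprocess` later without shell quoting issues.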
