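I'm working on a MapReduce project and want to improve its output. My input is a CSV file of issued parking tickets with dates, and I need to find which vehicle color receives the most tickets. Column 33 holds the vehicle color under the header "Vehicle Color". My MapReduce job works, but it could be better: column 33 has blank values, and many values are spelled differently but mean the same thing. Examples: WH and WHITE; BK, BLACK and BLA. My MapReduce job treats these as different colors. What is the best way to combine them into a single key?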
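# Mapper: emit "<color>\t1" for every ticket that has a vehicle color
sys_stdin = open("Parking_Violations.csv", "r")
for line in sys_stdin:
    vehiclecolor = line.split(",")[33].strip()
    # Skip blanks and the header row. (str.strip("Vehicle Color") would strip
    # those characters from both ends of every value, not just drop the header.)
    if vehiclecolor and vehiclecolor != "Vehicle Color":
        print("%s\t%s" % (vehiclecolor, 1))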
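# Reducer: sum the counts per color and print them, highest count first
import sys
from operator import itemgetter

sys_stdin = sys.stdin  # assumption: the mapper output arrives on standard input
dict_color_count = {}
for line in sys_stdin:
    line = line.strip()
    color, num = line.split('\t')
    try:
        num = int(num)
        dict_color_count[color] = dict_color_count.get(color, 0) + num
    except ValueError:
        pass
sorted_dict_color_count = sorted(dict_color_count.items(), key=itemgetter(1), reverse=True)
for color, count in sorted_dict_color_count:
    print('%s\t%s' % (color, count))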
My result after MapReduce:
BLK 35
WH 21
WHITE 20
BK 16
GRAY 14
WHT 8
BLACK 6
BLA 1
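One approach is to add a dictionary that lists every variant of each color you have found so far, and to replace each variant with its canonical key before counting. For example: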
# Dictionary with all the color variants that you have identified so far
color_dict = {
    "BLK": ["BLK", "BLACK", "BLA"],
    "WHT": ["WHITE", "WHT", "WHIT"],
}
for line in sys_stdin:
    vehiclecolor = line.split(",")[33].strip()
    # Skip blanks and the header row
    if vehiclecolor and vehiclecolor != "Vehicle Color":
        testcolor = vehiclecolor.upper()
        issuecolor = testcolor
        # Replace a known variant with its canonical key
        for k, v in color_dict.items():
            if testcolor in v:
                issuecolor = k
        print("%s\t%s" % (issuecolor, 1))
This way, every variant you already know about is replaced by a single key, which makes your color counts more accurate.
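If the list of variants grows, it may be worth inverting the dictionary once so that each lookup is a single dictionary access instead of a loop over every color. The following is only a sketch under a few assumptions: the names COLOR_VARIANTS, ALIAS_TO_COLOR, normalize_color and map_colors are made up for illustration, the spellings are the ones mentioned in the question, and csv.reader is swapped in for line.split(",") so that quoted fields containing commas do not shift the columns.

```python
import csv

# Canonical key -> spellings seen so far (taken from the question; extend as needed)
COLOR_VARIANTS = {
    "BLK": ["BLK", "BLACK", "BLA", "BK"],
    "WHT": ["WHT", "WHITE", "WHIT", "WH"],
}

# Invert once: each spelling points straight at its canonical key
ALIAS_TO_COLOR = {
    alias: canonical
    for canonical, aliases in COLOR_VARIANTS.items()
    for alias in aliases
}


def normalize_color(raw):
    """Return the canonical color for a raw field, or None for blanks and the header."""
    value = raw.strip().upper()
    if not value or value == "VEHICLE COLOR":
        return None
    # Unknown spellings pass through unchanged so they still get counted
    return ALIAS_TO_COLOR.get(value, value)


def map_colors(lines, column=33):
    """Yield (color, 1) pairs from CSV lines, skipping short rows and blank colors."""
    for row in csv.reader(lines):
        if len(row) <= column:
            continue
        color = normalize_color(row[column])
        if color is not None:
            yield color, 1
```

In the mapper, you would loop over map_colors(sys_stdin) and print each pair as "%s\t%s", exactly as before. The reducer does not need to change, because the variants are already merged by the time it sees the keys.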
Let me know if this helps! :D