Word count MapReduce for a specific column in a txt file


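I have mapper and reducer code that finds the most frequent word in a text file. I want to output the most frequent word in one specific column of my text file instead. The column is named "genres" and holds several strings separated by commas. Here is a sample of my txt file: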

tconst  averageRating   numVotes    titleType   primaryTitle    startYear   genres
tt0002020   5.2 85  short   Queen Elizabeth 1912    Biography,Drama,History
tt0002026   4   7   movie   Anny - Story of a Prostitute    1912    Drama,Romance
tt0002029   6.1 33  short   Poor Jenny  1912    Short
tt0002031   4.6 8   movie   As You Like It  1912    \N
tt0002033   5.6 26  short   Asesinato y entierro de Don José Canalejas  1912    Short
tt0002034   4.9 17  short   At Coney Island 1912    Comedy,Short
tt0002041   3.9 14  short   The Baby and the Stork  1912    Crime,Drama,Short
tt0002045   4.2 71  short   The Ball Player and the Bandit  1912    Drama,Romance,Short

    # Mapper code
    import sys

    def read_input(file):
        for line in file:
            # split the line into words (on any whitespace)
            yield line.split()

    def main(separator='\t'):
        # input comes from STDIN (standard input)
        data = read_input(sys.stdin)
        for words in data:
            # write the results to STDOUT (standard output);
            # what we output here will be the input for the
            # Reduce step, i.e. the input for reducer.py
            #
            # tab-delimited; the trivial word count is 1
            for word in words:
                # parenthesized form works in both Python 2 and 3
                print('%s%s%d' % (word, separator, 1))

    if __name__ == "__main__":
        main()



    # Reducer code
    import sys

    current_word = None
    current_count = 0
    word = None
    max_count = 0
    max_word = None

    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()

        try:
            # parse the input we got from mapper.py
            word, count = line.split('\t', 1)
            # convert count (currently a string) to int
            count = int(count)
        except ValueError:
            # the line had no tab or count was not a number,
            # so silently ignore/discard this line
            continue

        # this IF-switch only works because Hadoop sorts map output
        # by key (here: word) before it is passed to the reducer
        if current_word == word:
            current_count += count
        else:
            # check whether the word just finished beats the current maximum
            if current_count > max_count:
                max_count = current_count
                max_word = current_word
            current_count = count
            current_word = word

    # do not forget to check the last word
    if current_count > max_count:
        max_count = current_count
        max_word = current_word

    print('%s\t%s' % (max_word, max_count))

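Could you please guide me on how to change this code so that it prints the most frequent word in the "genres" column? I would also like to output the word count for every word in that column. Please let me know if any other information is needed.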

python mapreduce word-count
1 Answer

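Try splitting the line variable into fields and indexing into the result. Use the index to select the specific column in which you want to find the most frequent word.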
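A minimal sketch of that idea follows. It assumes the file is tab-separated (titles contain spaces, so splitting on whitespace would break the columns), that "genres" is the seventh column as in the sample, and that `\N` marks a missing value. The names `GENRES_INDEX`, `map_genres` and `reduce_counts` are made up for illustration. Note that the reducer part swaps the sorted-input running count from the question for a `collections.Counter`, which should be fine when the number of distinct genres is small.

```python
from collections import Counter

# Assumption: "genres" is the 7th tab-separated column (index 6), as in the sample.
GENRES_INDEX = 6


def map_genres(lines, column=GENRES_INDEX):
    """Yield (genre, 1) for each comma-separated genre in the chosen column."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= column:
            continue  # short or malformed line
        value = fields[column].strip()
        if value in ("genres", r"\N", ""):
            continue  # header row or missing value
        for genre in value.split(","):
            genre = genre.strip()
            if genre:
                yield genre, 1


def reduce_counts(pairs):
    """Sum the counts per key; return (counts, most_frequent_key, its_count)."""
    counts = Counter()
    for key, count in pairs:
        counts[key] += count
    if not counts:
        return counts, None, 0
    top_key, top_count = counts.most_common(1)[0]
    return counts, top_key, top_count
```

In mapper.py you would loop over `map_genres(sys.stdin)` and print each pair as `genre<TAB>1`. In reducer.py you would parse those lines back into `(genre, int(count))` pairs, pass them to `reduce_counts`, print every entry of `counts` to get the per-genre word count, and print `top_key` with `top_count` for the most frequent genre.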
