Can't get the correct word count values in Python

Question  Votes: 0  Answers: 1

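We are given a text file containing many entries, and the main goal is to output the 20 words with the highest counts. We are given a delimiters variable that is used to split sentences into words. We are also given a list of common words that must not be counted. One more thing to watch for: if two words have the same count, the tie must be broken lexicographically. When I run my program, only 8 of the 20 most common words are correct. I need help fixing my counting error. Here is my code: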

import random 
import os
import string
import sys

stopWordsList = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours",
            "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its",
            "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that",
            "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having",
            "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while",
            "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before",
            "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again",
            "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each",
            "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than",
            "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]

delimiters = " \t,;.?!-:@[](){}_*/"

def getIndexes(seed):
    random.seed(seed)
    n = 10000
    number_of_lines = 50000
    ret = []
    for i in range(0,n):
        ret.append(random.randint(0, 50000-1))
    return ret

def split_sentence(sentence):
    words_list = []
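    for a in sentence:
        # Turn every delimiter into a space so one split handles them all.
        for b in delimiters:
            a = a.replace(b, ' ')
        for c in a.split(' '):
            # Skip empty strings and stop words; compare in lower case.
            if c and c.lower() not in stopWordsList:
                words_list.append(c.lower())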
    return words_list

def process(userID):
    indexes = getIndexes(userID)
    ret = []
    # TODO
    d = [
        e.strip('\n')
        for index, e in enumerate(sys.stdin.readlines())
        if index in indexes
    ]

    words_list = split_sentence(d)

    words_count = dict()
    for f in words_list:
        words_count[f] = words_count.get(f, 0) + 1

    words_sorted = sorted(words_count.items(), key = lambda x: (-x[1], x[0]))[0:20]

    ret = [word[0] for word in words_sorted]    

    for word in ret:
        print(word)

process(sys.argv[1])
python python-2.7 python-2.x
1 Answer
0 votes

Can you provide some sample data? Also, what is getIndexes supposed to do? Either way, I think collections.Counter could simplify your code considerably.
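For illustration, here is a minimal sketch of what the Counter suggestion might look like. The names tokenize and top_words are hypothetical, and STOP_WORDS is a shortened stand-in for the question's stopWordsList. It also rests on an assumption that the question does not confirm: that an index returned more than once by getIndexes should count its line once per occurrence. The `if index in indexes` filter in the question counts each matching line only once, so if the task expects repeats to count, that could explain the mismatched results.

```python
from collections import Counter

# Shortened stand-in for the question's stopWordsList (assumption).
STOP_WORDS = {"the", "a", "of", "and", "in"}
DELIMITERS = " \t,;.?!-:@[](){}_*/"


def tokenize(line):
    """Split a line on the delimiters and drop stop words (lower-cased)."""
    for ch in DELIMITERS:
        line = line.replace(ch, " ")
    # split() with no argument also discards the empty strings.
    return [w for w in line.lower().split() if w not in STOP_WORDS]


def top_words(lines, indexes, n=20):
    """Return the n most frequent words over lines[i] for each i in indexes.

    Assumption: a repeated index counts its line once per repetition.
    """
    counts = Counter()
    for i in indexes:
        counts.update(tokenize(lines[i]))
    # Highest count first; ties broken alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


lines = ["The cat and the dog", "A dog in the fog", "Cat-nap; dog_nap"]
indexes = [0, 1, 1, 2]  # index 1 is drawn twice

print(top_words(lines, indexes, n=3))               # repeats counted
print(top_words(lines, sorted(set(indexes)), n=3))  # repeats ignored
```

On this toy input the two calls give different top-3 lists, which shows how the handling of repeated indexes alone can change the ranking. The explicit sort is kept instead of Counter.most_common because the alphabetical tie-break has to be stated in the sort key.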
