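We are given a text file containing multiple entries, and the main goal is to output the 20 words with the highest counts. We are given a delimiters variable that is used to split sentences into words. We are also given a list of common words that should not be counted. One more thing to note: if two words have the same count, the tie must be broken in lexicographic order. When I run the program, only 8 of the 20 most common words are correct. I need help fixing my counting error. Here is my code: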
import random
import os
import string
import sys
stopWordsList = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours",
"yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its",
"itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that",
"these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having",
"do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while",
"of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before",
"after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again",
"further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each",
"few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than",
"too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]
delimiters = " \t,;.?!-:@[](){}_*/"

def getIndexes(seed):
    random.seed(seed)
    n = 10000
    number_of_lines = 50000
    ret = []
    for i in range(0,n):
        ret.append(random.randint(0, 50000-1))
    return ret

def split_sentence(sentence):
    words_list = []
    for a in sentence:
        for b in delimiters:
            a = a.replace(b, ' ')
        for c in a.split(' '):
            if c and c.lower() not in stopWordsList:
                words_list.append(c.lower())
    return words_list

def process(userID):
    indexes = getIndexes(userID)
    ret = []
    # TODO
    d = [
        e.strip('\n')
        for index, e in enumerate(sys.stdin.readlines())
        if index in indexes
    ]
    words_list = split_sentence(d)
    words_count = dict()
    for f in words_list:
        words_count[f] = words_count.get(f, 0) + 1
    words_sorted = sorted(words_count.items(), key = lambda x: (-x[1], x[0]))[0:20]
    ret = [word[0] for word in words_sorted]
    for word in ret:
        print(word)

process(sys.argv[1])
Can you provide sample data? Also, what is getIndexes supposed to do? In any case, I think collections.Counter could simplify your code considerably.
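For what it's worth, here is a minimal sketch of what the counting might look like with collections.Counter. The helper names `tokenize` and `top_words`, the abbreviated stop-word set, and the sample lines are hypothetical and only there for illustration. The sort key is the same one the question already uses (count descending, then alphabetical). One assumption to flag: the sketch looks lines up by index instead of filtering with `if index in indexes`. random.randint can return the same index more than once, so if the task expects a line that is drawn twice to be counted twice, the filter in the question would undercount it. Whether that is actually the intended behavior is not clear from the post.

```python
from collections import Counter

# Abbreviated, hypothetical stop-word set; the real program would use the
# full stopWordsList from the question (a set makes membership tests faster).
STOP_WORDS = {"the", "a", "is", "and", "of"}
DELIMITERS = " \t,;.?!-:@[](){}_*/"


def tokenize(line):
    """Lowercase a line, split it on the delimiters, and drop stop words."""
    for d in DELIMITERS:
        line = line.replace(d, " ")
    # split() with no argument also discards empty strings.
    return [w for w in line.lower().split() if w not in STOP_WORDS]


def top_words(lines, indexes, n=20):
    """Return the n most frequent words over lines[i] for each i in indexes.

    ASSUMPTION: an index that appears k times in `indexes` contributes its
    line k times. This differs from `if index in indexes`, which uses each
    selected line only once.
    """
    counts = Counter()
    for i in indexes:
        counts.update(tokenize(lines[i]))
    # Count descending, then alphabetical order to break ties.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


# Made-up sample data, not taken from the assignment.
sample = ["The cat and the dog", "A dog is a dog", "cat-nap time"]
print(top_words(sample, [1, 1, 0]))  # ['dog', 'cat']
```

Note that Counter.most_common alone would not be enough here, since it does not promise alphabetical order among words with equal counts; that is why the explicit sorted call is kept.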