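I have a small Python script that reports the 10 most frequently used words in a .txt document, the 10 least frequently used words, and the total number of words. Per the assignment, a word is defined as 2 letters or more. I can print the 10 most frequent and the 10 least frequent words, but when I try to print the total number of words in the document, it prints the total of all words, including single-letter words (such as "a"). How can I get the total to count only words of two or more letters?
Here is my script: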
from string import *
from collections import defaultdict
from operator import itemgetter
import re

number = 10
words = {}
total_words = 0
words_only = re.compile(r'^[a-z]{2,}$')
counter = defaultdict(int)

"""Define function to count the total number of words"""
def count_words(s):
    unique_words = split(s)
    return len(unique_words)

"""Define words as 2 letters or more -- no single letter words such as "a" """
for word in words:
    if len(word) >= 2:
        counter[word] += 1

"""Open text document, strip it, then filter it"""
txt_file = open('charactermask.txt', 'r')

for line in txt_file:
    total_words = total_words + count_words(line)
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):
            counter[word] += 1

# Most Frequent Words
top_words = sorted(counter.iteritems(),
                   key=lambda (word, count): (-count, word))[:number]

print "Most Frequent Words: "
for word, frequency in top_words:
    print "%s: %d" % (word, frequency)

# Least Frequent Words:
least_words = sorted(counter.iteritems(),
                     key=lambda (word, count): (count, word))[:number]

print " "
print "Least Frequent Words: "
for word, frequency in least_words:
    print "%s: %d" % (word, frequency)

# Total Unique Words:
print " "
print "Total Number of Words: %s" % total_words
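I'm no Python expert; this is for a Python class I'm currently taking. It matters to me that my code is neat and properly formatted, so if possible, could someone also tell me whether the formatting of this code is considered "good practice"?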
List comprehension approach:
def countWords(s):
    words = s.split()
    return len([word for word in words if len(word) >= 2])
Verbose approach:
def countWords(s):
    words = s.split()
    count = 0
    for word in words:
        if len(word) >= 2:
            count += 1
    return count
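One caveat with a plain length check: split() leaves punctuation attached, so a token such as "a," has length 2 and would still be counted. Below is a minimal Python 3 sketch that assumes the same cleanup the question's loop applies to each token (strip punctuation, lowercase, then match the regex). The name count_long_words and the sample sentence are made up for illustration.

```python
import re
from string import punctuation

# Same pattern as in the question: two or more lowercase letters.
WORDS_ONLY = re.compile(r'^[a-z]{2,}$')

def count_long_words(line):
    """Count tokens that are words of two or more letters after cleanup."""
    count = 0
    for token in line.split():
        # Strip surrounding punctuation and lowercase before testing.
        word = token.strip(punctuation).lower()
        if WORDS_ONLY.match(word):
            count += 1
    return count

sample = "A cat, a dog and I."
print(count_long_words(sample))  # counts "cat", "dog" and "and"
```

With this sample, a bare len(word) >= 2 test on the raw tokens would also count "I.", because the trailing period pads it to two characters.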
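By the way, kudos for using defaultdict, but I would go with collections.Counter:
import collections
# Split each line into words (iterating over the line itself would yield single characters)
words = collections.Counter(word for line in open(filepath) for word in line.split())
# Keep only words of two or more letters; rebuild a Counter so most_common() is still available
words = collections.Counter(dict((k, v) for k, v in words.items() if len(k) >= 2))
mostFrequent = [w[0] for w in words.most_common(10)]
leastFrequent = [w[0] for w in words.most_common()[-10:]]
Hope this helps. Your style looks great :)
Your count_words uses only split(). You should use your word-matching regex (words_only) there as well:
def count_words(s):
    unique_words = split(s)
    return len(filter(lambda x: words_only.match(x), unique_words))
Sorry, but I seem to have gone a little overboard with this solution. Here is the rewritten script, followed by a summary of the changes I made, and why: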
from collections import defaultdict
from operator import itemgetter
from heapq import nlargest, nsmallest
from itertools import starmap
from textwrap import dedent
import re

class WordCounter(object):
    """
    Count the number of words consisting of two letters or more.
    """
    words_only = re.compile(r'[a-z]{2,}', re.IGNORECASE)

    def __init__(self, filename, number=10):
        self.counter = defaultdict(int)

        # Open text document and find all words
        with open(filename, 'r') as txt_file:
            for word in self.words_only.findall(txt_file.read()):
                self.counter[word.lower()] += 1

        # Get total count
        self.total_words = sum(self.counter.values())

        # Most Frequent Words
        self.top_words = nlargest(
            number, self.counter.items(), itemgetter(1))

        # Least Frequent Words
        self.least_words = nsmallest(
            number, self.counter.items(), itemgetter(1))

    def __str__(self):
        """
        Summary of least and most used words, and total word count.
        """
        template = dedent("""
            Most Frequent Words:
            {0}

            Least Frequent Words:
            {1}

            Total Number of Words: {2}
            """)
        line_template = "{0}: {1}".format
        top_words = "\n".join(starmap(line_template, self.top_words))
        least_words = "\n".join(starmap(line_template, self.least_words))
        return template.format(top_words, least_words, self.total_words)

print(WordCounter("charactermask.txt"))
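Don't do from x import *; something like import string as st is fine. This will reduce erroneous code.
Make it a class. Then a from wordcounter import WordCounter is all it takes to use it elsewhere.
Docstrings are moved inside the code blocks, where help(my_class_or_function) can find them.
Comments are normally prefixed with #, rather than written as throwaway strings.
Use the with statement when opening files.
The strip() in .strip().split() is redundant; use plain .split().
Use re.findall to pick the words out of the text. We have to modify the regex slightly for this: the ^ and $ anchors are dropped and re.IGNORECASE is added.
The words dict is unused.
Use sum to calculate the total number of words. This avoids running the words_only pattern twice, once for the total and once for the per-word counts, so the results are consistent.
Use heapq.nlargest and heapq.nsmallest to calculate the "top" and "least" words. It's also faster.
Make the function return a string that you may or may not want to print.
For new code, use the format string method rather than the % operator.
Use a multi-line string instead of multiple consecutive prints; the dedent and starmap functions help with this.
Most people always prefer list comprehensions, and I usually agree with them, but here I like the terseness of the starmap approach. That said, I agree with user1552512: your style looks great! Nice, readable code, well commented, and very much in keeping with good Python style. You'll go far. :)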
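Several of the points above (re.findall, sum, heapq.nlargest and heapq.nsmallest) can be seen together in a short Python 3 sketch. It works on an in-memory string instead of a file and uses collections.Counter, as suggested in the earlier answer. The function name summarize and the sample text are assumptions for illustration, not part of the answer's code.

```python
import re
from collections import Counter
from heapq import nlargest, nsmallest
from operator import itemgetter

# Unanchored pattern so findall can scan a whole text at once.
WORDS_ONLY = re.compile(r'[a-z]{2,}', re.IGNORECASE)

def summarize(text, number=10):
    """Return (total, most frequent, least frequent) for words of 2+ letters."""
    counter = Counter(word.lower() for word in WORDS_ONLY.findall(text))
    # The total comes from the same matches as the per-word counts,
    # so single-letter words are excluded from both.
    total = sum(counter.values())
    top = nlargest(number, counter.items(), key=itemgetter(1))
    least = nsmallest(number, counter.items(), key=itemgetter(1))
    return total, top, least

total, top, least = summarize("A cat saw a cat. The dog saw the cat!", number=2)
print(total)
print(top)
print(least)
```

Because the total is computed from the counter, it cannot drift away from the word list the way a separate split()-based count can.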
The code doesn't like the parentheses I marked in the code.
Does anyone know a solution?
I'm using the latest Python 3 version.
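Assuming the marked parentheses are the ones in lambda (word, count), the likely cause is that Python 3 no longer allows tuple unpacking in a function's parameter list. dict.iteritems() and the print statement are gone in Python 3 as well. Here is a minimal sketch of the sorting and printing part for Python 3, with a small hand-made counter standing in for the counts read from the file:

```python
from collections import defaultdict

number = 10
counter = defaultdict(int)
# Hypothetical sample data in place of the counts read from the file.
for word in ["the", "cat", "the", "dog", "cat", "the"]:
    counter[word] += 1

def by_most_frequent(item):
    word, count = item  # unpack inside the function body instead
    return (-count, word)

def by_least_frequent(item):
    word, count = item
    return (count, word)

# items() replaces iteritems() in Python 3.
top_words = sorted(counter.items(), key=by_most_frequent)[:number]
least_words = sorted(counter.items(), key=by_least_frequent)[:number]

print("Most Frequent Words: ")
for word, frequency in top_words:
    print("%s: %d" % (word, frequency))

print(" ")
print("Least Frequent Words: ")
for word, frequency in least_words:
    print("%s: %d" % (word, frequency))
```

A one-line alternative is key=lambda item: (-item[1], item[0]); the named functions are used here only to keep the unpacking visible.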