Counting the number of words and the average word length in a sentence in Python

Question  Votes: -1  Answers: 2

I have been trying to get my code to count the number of words in a sentence (before testing it on a .txt file), but it gives me this result:

Mr. Blah has a lot of Sr?
 and Mrs. blah does not care.
 lol
[[1, 1.0], [1, 1.0], [1, 1.0]]

instead of this result:

Mr. Blah has a lot of Sr?
 and Mrs. blah does not care.
 lol
[[7, 2.4], [6, 3.5], [1, 3.0]]

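The code below is what I have been working on so far. It should count the number of words in a sentence, then count the number of letters used in the sentence, and finally compute the average number of letters per word in the sentence.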

terminators = ["?", "!"] #Characters that always end a sentence other than a period
abrevs = ["Mrs", "Mr", "Dr", "Fr", "Jr", "Sr"] #Abbreviations that prevent a period from ending a sentence

#Replaced the word_length_list function from 1a. with this new one
def word_length_list(sentence):
    print(sentence)
    return [1]

#Once a sentence is found, this will calculate statistics for it
def collect_statistics(sentence):
    word_lengths = word_length_list(sentence)
    words_in_sentence = len(word_lengths)  #Get word count

    #Average word length
    sum_of_word_lengths = 0
    for length in word_lengths:
        sum_of_word_lengths = sum_of_word_lengths + length
    average_word_length = sum_of_word_lengths/words_in_sentence

    return [words_in_sentence, average_word_length]
# Replaced given text with this to test if it does work for the abbreviations and ellipses
story_text = "Mr. Blah has a lot of Sr? and Mrs. blah does not care. lol"

story_length = len(story_text)

statistics = []

sentence = ""

for i in range(story_length):
    sentence_over = False # Assumption that this sentence will continue after the next character
    nextchar = story_text[i] # Look at the next character in the story

    if nextchar in terminators:
        sentence_over = True  # "?" and "!" always end the sentence.
    elif nextchar == ".": # A period needs special handling:
                          # it may belong to an ellipsis or an abbreviation.

        #If the period is followed by another period, probably an ellipsis & want to include in the sentence.
        is_part_of_elipse = i+1 < story_length and story_text[i+1] == "."

        is_part_of_abbrev = False  # Assume the period does not follow an abbreviation

        for ab in abrevs: #Then check for abbreviation
            if sentence.endswith(ab):
                is_part_of_abbrev = True

        if not (is_part_of_elipse or is_part_of_abbrev): # If not part of abbreviation and not part of ellipsis, 
            sentence_over = True                         # end of sentence by (period)

    sentence = sentence + nextchar

    # Calculate the sentence statistics
    if sentence_over:
        statistics.append(collect_statistics(sentence))
        # Clear the sentence variable to make room for the next
        sentence = ""

#In case the last sentence was not terminated, add it to the stats
if len(sentence)>0:
    statistics.append(collect_statistics(sentence))

print(statistics)
python string function
2 Answers
0
votes

This function always returns the same result:

def word_length_list(sentence):
    print(sentence)
    return [1]

You may want to look at how you count the words in the sentence.
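One possible way to do that, as a minimal sketch rather than the asker's intended implementation: split the sentence on whitespace and count only the alphabetic characters in each token. It keeps the `word_length_list` name so it can replace the stub directly. The choice to ignore punctuation via `str.isalpha` is an assumption based on the expected output in the question.

```python
def word_length_list(sentence):
    """Return the number of letters in each word of `sentence`."""
    lengths = []
    for word in sentence.split():
        # Count only alphabetic characters, so "Mr." and "Sr?" both count as 2.
        letters = sum(1 for ch in word if ch.isalpha())
        if letters > 0:  # skip tokens made only of punctuation, e.g. "..."
            lengths.append(letters)
    return lengths


print(word_length_list("Mr. Blah has a lot of Sr?"))  # [2, 4, 3, 1, 3, 2, 2]
```

With this version the original `collect_statistics` works unchanged, although the average is not rounded. If a sentence contains no letters at all, the list is empty, so `collect_statistics` would need a guard against dividing by zero.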


0
votes

You need to fix a few things.

First, word_length_list returns [1] and nothing else.

Change the function to:

def word_length_list(sentence):
    return sentence.split()

Next, we need to make some changes in collect_statistics to get the desired result:

Change the function to:

def collect_statistics(sentence):
    words = word_length_list(sentence)  # now a list of words, not lengths
    words_in_sentence = len(words)
    sum_of_word_lengths = 0
    for word in words:
        sum_of_word_lengths += len(word)  # still counts punctuation such as "." and "?"
    average_word_length = sum_of_word_lengths / words_in_sentence
    return [words_in_sentence, average_word_length]

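That said, the arithmetic produces some long decimal values, so you will need to compensate for that, for example by rounding. The numbers I get are also slightly higher because the code still counts the "." in "Mr." and the "?" in "Sr?", so the 2.4 you expect actually comes out as about 2.7.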
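Both points can be handled in one place. The following is a sketch under stated assumptions, not the answerer's code: it strips leading and trailing punctuation from each word with `str.strip(string.punctuation)` and rounds the average with the built-in `round`. The `ndigits` parameter and the `[0, 0.0]` return value for a sentence without words are hypothetical choices made for illustration.

```python
import string


def collect_statistics(sentence, ndigits=1):
    """Return [word count, average word length], ignoring edge punctuation."""
    # Strip punctuation from both ends of each word: "Mr." -> "Mr", "Sr?" -> "Sr".
    words = [w.strip(string.punctuation) for w in sentence.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    if not words:
        return [0, 0.0]  # assumption: report zeros rather than divide by zero
    average = sum(len(w) for w in words) / len(words)
    return [len(words), round(average, ndigits)]


for s in ["Mr. Blah has a lot of Sr?", " and Mrs. blah does not care.", " lol"]:
    print(collect_statistics(s))
# [7, 2.4]
# [6, 3.5]
# [1, 3.0]
```

For the first sentence, the stripped words contain 17 letters across 7 words (17/7 ≈ 2.43), whereas counting the "." and "?" gives 19/7 ≈ 2.71, which accounts for the 2.7 versus 2.4 difference.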
