Tokenize a paragraph into sentences and then into words in NLTK

Question

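I am trying to feed a whole paragraph into my text processor, splitting it first into sentences and then into words.

I tried the following code, but it does not work: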

    #text is the paragraph input
    sent_text = sent_tokenize(text)
    tokenized_text = word_tokenize(sent_text.split)
    tagged = nltk.pos_tag(tokenized_text)
    print(tagged)

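This fails with an error. So how do I tokenize a paragraph into sentences and then into words?

An example paragraph:

This thing seemed to overpower and astonish the little dark-brown dog, and wounded him to the heart. He sank down in despair at the child's feet. When the blow was repeated, together with an admonition in childish sentences, he turned over upon his back, and held his paws in a peculiar manner. At the same time with his ears and his eyes he offered a small prayer to the child.

**Warning:** This is just random text from the internet; I do not own the above content.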

Answer 1

You probably intended to loop over sent_text:

    import nltk

    # text is the paragraph input, as in the question
    sent_text = nltk.sent_tokenize(text) # this gives us a list of sentences
    # now loop over each sentence and tokenize it separately
    for sentence in sent_text:
        tokenized_text = nltk.word_tokenize(sentence)
        tagged = nltk.pos_tag(tokenized_text)
        print(tagged)
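
As a side note on why the original attempt fails: `sent_tokenize` returns a list of sentence strings, and a Python list has no `split` attribute, so `sent_text.split` raises an `AttributeError` before `word_tokenize` is ever called. The sketch below reproduces that with a hand-written list standing in for the tokenizer output. The two sentences are taken from the example paragraph, and `str.split` is only a crude placeholder for `nltk.word_tokenize`, used here so the snippet runs without NLTK or its data files.

```python
# Hand-written stand-in for what nltk.sent_tokenize(text) returns:
# a list of sentence strings (sample data, not real tokenizer output).
sent_text = [
    "He sank down in despair at the child's feet.",
    "At the same time with his ears and his eyes he offered a small prayer to the child.",
]

# The question's code accessed .split on the list itself, which fails.
try:
    sent_text.split
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)

# Working pattern: handle one sentence at a time.
# str.split is a placeholder here; in real code call nltk.word_tokenize(sentence).
words_per_sentence = [sentence.split() for sentence in sent_text]
print(words_per_sentence)
```

The same per-sentence pattern is what the loop above does with `nltk.word_tokenize` and `nltk.pos_tag`.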

Answer 2

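Here is a shorter version. It gives you a data structure containing each individual sentence, and each token within the sentence. I prefer TweetTokenizer for messy, real-world language. The sentence tokenizer is considered decent, but be careful not to lowercase your words until after this step, as doing so may reduce the accuracy of boundary detection in messy text.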

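    from nltk.tokenize import TweetTokenizer, sent_tokenize

    # input_text is the paragraph to process
    tokenizer_words = TweetTokenizer()
    # split into sentences first, then tokenize each sentence into words
    tokens_sentences = [tokenizer_words.tokenize(t) for t in sent_tokenize(input_text)]
    print(tokens_sentences)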

Here is what the output looks like; I cleaned it up so the structure stands out:

[
['This', 'thing', 'seemed', 'to', 'overpower', 'and', 'astonish', 'the', 'little', 'dark-brown', 'dog', ',', 'and', 'wounded', 'him', 'to', 'the', 'heart', '.'], 
['He', 'sank', 'down', 'in', 'despair', 'at', 'the', "child's", 'feet', '.'], 
['When', 'the', 'blow', 'was', 'repeated', ',', 'together', 'with', 'an', 'admonition', 'in', 'childish', 'sentences', ',', 'he', 'turned', 'over', 'upon', 'his', 'back', ',', 'and', 'held', 'his', 'paws', 'in', 'a', 'peculiar', 'manner', '.'], 
['At', 'the', 'same', 'time', 'with', 'his', 'ears', 'and', 'his', 'eyes', 'he', 'offered', 'a', 'small', 'prayer', 'to', 'the', 'child', '.']
]
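
If lowercasing is needed, or a single flat list of words is more convenient, both can be done after tokenization. A minimal sketch, assuming the nested list has the shape shown in the output above; two of the sentences are copied here by hand, and the names `lowered` and `all_tokens` are purely illustrative:

```python
from itertools import chain

# Two sentences copied from the output above (hand-copied sample data).
tokens_sentences = [
    ['He', 'sank', 'down', 'in', 'despair', 'at', 'the', "child's", 'feet', '.'],
    ['At', 'the', 'same', 'time', 'with', 'his', 'ears', 'and', 'his', 'eyes',
     'he', 'offered', 'a', 'small', 'prayer', 'to', 'the', 'child', '.'],
]

# Lowercase only now, after sentence and word boundaries have been found.
lowered = [[token.lower() for token in sentence] for sentence in tokens_sentences]

# Flatten the per-sentence lists into one list of tokens.
all_tokens = list(chain.from_iterable(lowered))

print(lowered)
print(all_tokens)
```

Keeping `tokens_sentences` untouched and building `lowered` as a separate list means the original casing is still available if a later step needs it.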
