word_tokenize in nltk does not accept a list of strings as an argument

from nltk.tokenize import word_tokenize

music_comments = [['So cant you just run the bot outside of the US? ', ''], ["Just because it's illegal doesn't mean it will stop. I hope it actually gets enforced. ", ''], ['Can they do something about all the fucking bots on Tinder next?   \n\nEdit: Holy crap my inbox just blew up ', '']]

print(word_tokenize(music_comments[1]))

I found this other question, which says that word_tokenize can be passed a list of strings, but in my case running the code above gives the following output:

Traceback (most recent call last):
  File "testing.py", line 5, in <module>
    print(word_tokenize(music_comments[1]))
  File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 109, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 94, in sent_tokenize
    return tokenizer.tokenize(text)
  File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1237, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1285, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in <listcomp>
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1316, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "C:\Users\Shraddheya Shendre\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1289, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or bytes-like object

What is the problem? What am I missing?

python nltk tokenize
2 answers

4 votes

You are passing word_tokenize() a list containing two items rather than a single string:

["Just because it's illegal doesn't mean it will stop. I hope it actually gets enforced. ", '']

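That is, a sentence and an empty string. music_comments is a list of lists, so music_comments[1] is still a list.

Indexing one level deeper, so that a string is passed, should fix the problem: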

print(word_tokenize(music_comments[1][0]))
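
If every comment needs tokenizing, not just one, the same idea can be applied in a loop. The following is only a sketch: `tokenize_comments` is a hypothetical helper name, and `str.split` stands in for the tokenizer so the snippet runs without NLTK or its Punkt data installed. With NLTK available, `word_tokenize` can be passed in its place.

```python
def tokenize_comments(comments, tokenize):
    """Apply `tokenize` to every non-empty string in a list of lists of strings.

    `tokenize` is any callable that maps one string to a list of tokens,
    for example nltk.tokenize.word_tokenize.
    """
    tokenized = []
    for comment in comments:          # each comment is itself a list of strings
        for text in comment:
            if text.strip():          # skip the empty '' entries
                tokenized.append(tokenize(text))
    return tokenized


music_comments = [
    ['So cant you just run the bot outside of the US? ', ''],
    ["Just because it's illegal doesn't mean it will stop. ", ''],
]

# str.split is a stand-in tokenizer (an assumption made for this sketch).
# With NLTK installed: tokenize_comments(music_comments, word_tokenize)
tokens = tokenize_comments(music_comments, str.split)
print(tokens)
```

The tokenizer is always called with a single string here, which is what avoids the TypeError from the question.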

1 vote
def word_tokenize(self, s):
    """Tokenize a string to split off punctuation other than periods"""
    return self._word_tokenizer_re().findall(s)

This is part of the 'Source code for nltk.tokenize.punkt'.

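The input to word_tokenize() should be a string, not a list: the tokenizer applies regular expressions to its argument, and those only work on strings.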
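
The last frame of the traceback in the question is a call to a compiled regular expression's finditer(text). The same error can be reproduced with Python's `re` module alone. In this sketch the pattern `\S+` is an arbitrary stand-in, not the pattern NLTK uses:

```python
import re

# Arbitrary stand-in pattern; NLTK's Punkt tokenizer uses its own regexes.
pattern = re.compile(r"\S+")

# Passing a list, as in word_tokenize(music_comments[1]), fails.
try:
    pattern.finditer(["a sentence. ", ""])
    message = None
except TypeError as exc:
    message = str(exc)
print(message)

# Passing a string, as in word_tokenize(music_comments[1][0]), works.
matches = [m.group() for m in pattern.finditer("a sentence. ")]
print(matches)
```

The exact wording of the message can vary slightly between Python versions, but it begins with the same "expected string or bytes-like object" text seen in the traceback.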
