Python text tokenization: how do I convert a tokenized list to a string?

Question

I am trying to tokenize some text:

from nltk.tokenize import sent_tokenize, word_tokenize 

text = '''The team used archive "data" from 2016...and 2017 
captured by the ESA/NASA Hubble Space Telescope and developed 
open-source algorithms to analyse the starlight filtered through 
K2-18b’s atmosphere. The results revealed the molecular 
signature of water vapour, also indicating the presence of 
hydrogen and helium in the planet’s atmosphere.'''

token = (sent_tokenize(text))
token

This gives me:

['The team used archive "data" from 2016...and 2017 captured by the ESA/NASA Hubble Space Telescope and developed open-source algorithms to analyse the starlight filtered through K2-18b’s atmosphere.',
 'The results revealed the molecular signature of water vapour, also indicating the presence of hydrogen and helium in the planet’s atmosphere.']

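How can I convert this to a string while keeping the '' around each sentence?

Everything I have found joins the elements of the list and throws away the tokenization.

Essentially, I want: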

text = ('This is sentence one.' 
'This is sentence two.')

Thanks.

Answer 1

Based on what you currently have in the OP, you could try the following:
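One possible way to get that result, offered as a minimal sketch and not necessarily the answerer's original code: wrap each sentence with `repr()` so it keeps its quotes, then join the pieces with newlines. The hard-coded `token` list below is an assumption standing in for the `sent_tokenize(text)` output from the question, and the name `quoted` is only illustrative.

```python
# Stand-in for the list returned by sent_tokenize(text) in the question
# (assumption: shortened sentences so the example runs without NLTK).
token = [
    'This is sentence one.',
    'This is sentence two.',
]

# repr() returns each sentence with its surrounding quotes;
# joining with newlines puts one quoted sentence per line.
quoted = "(" + "\n".join(repr(sentence) for sentence in token) + ")"

print(quoted)
# ('This is sentence one.'
# 'This is sentence two.')
```

Note that `repr()` switches to double quotes for a sentence that contains a plain `'` character, so if single quotes are required everywhere, formatting each sentence as `"'" + sentence + "'"` may be a better fit.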

Answer 2

If you have a file that you want to tokenize, here is a nice CLI trick in NLTK: https://github.com/nltk/nltk/pull/2337#issue-297882069
