Python tokenized text: how do I convert a tokenized list to a string?

Question  Votes: 0  Answers: 2

I am trying to tokenize a text into sentences:

from nltk.tokenize import sent_tokenize, word_tokenize 

text = '''The team used archive "data" from 2016...and 2017 
captured by the ESA/NASA Hubble Space Telescope and developed 
open-source algorithms to analyse the starlight filtered through 
K2-18b’s atmosphere. The results revealed the molecular 
signature of water vapour, also indicating the presence of 
hydrogen and helium in the planet’s atmosphere.'''

token = (sent_tokenize(text))
token

This gives me

['The team used archive "data" from 2016...and 2017 captured by the ESA/NASA Hubble Space Telescope and developed open-source algorithms to analyse the starlight filtered through K2-18b’s atmosphere.',
 'The results revealed the molecular signature of water vapour, also indicating the presence of hydrogen and helium in the planet’s atmosphere.']

How can I convert this to a string while keeping the quotes ('') around each sentence?

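Everything I have found joins the elements of the list into one string and loses the sentence boundaries produced by the tokenizer.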

Essentially, I want

text = ('This is sentence one.' 
'This is sentence two.')

Thanks

python nltk tokenize
2 Answers
0
votes

Based on the information currently in your question, you can try the following:
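One possible approach, offered only as a minimal sketch and not necessarily the code this answer originally contained: apply repr() to each sentence so the quotes are kept, then join the results with a newline. The helper name quote_sentences and the sample list are hypothetical; in practice the list would come from sent_tokenize(text).

```python
def quote_sentences(sentences, sep="\n"):
    """Join sentences into one string, keeping quotes around each one.

    repr() returns the sentence as Python would display it, including
    the surrounding quote characters.
    """
    return sep.join(repr(s) for s in sentences)


# Stand-in for the list returned by sent_tokenize(text).
sentences = ['This is sentence one.', 'This is sentence two.']

result = quote_sentences(sentences)
print(result)
# 'This is sentence one.'
# 'This is sentence two.'
```

Note that repr() switches to double quotes when a sentence contains a plain ASCII apostrophe ('), so the quote style can differ from one sentence to the next.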


0
votes

If you have a file that you want to tokenize, there is a nice CLI trick in NLTK: https://github.com/nltk/nltk/pull/2337#issue-297882069
