How to use a column of tokenized sentences to further tokenize into words


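I have tokenized the text in one column into sentences, stored in a new column "token_sentences". I want to use the "token_sentences" column to create another new column, "token_words", containing the tokenized words.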

The df I am working with:

article_id      article_text                                       
1           Maria Sharapova has basically no friends as te...   
2           Roger Federer advance...    
3           Roger Federer has revealed that organisers of ...   
4           Kei Nishikori will try to end his long losing ...

After adding token_sentences:

article_id      article_text                                      token_sentences                          
1           Maria Sharapova has basically no friends as te...    [Maria Sharapova has basically no friends as te    
2           Roger Federer advance...                             [Roger Federer advance...
3           Roger Federer has revealed that organisers of ...    [Roger Federer has revealed that organisers of...
4           Kei Nishikori will try to end his long losing ...    [Kei Nishikori will try to end his long losing...

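Each row holds a list of sentences. I have not been able to flatten the lists in the token_sentences column, so I cannot use them in the next step.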

I want to use the token_sentences column to make the df look like this:

article_id  article_text    token_sentences                         token_words                       
1           Maria...        ["Maria Sharapova..",["..."]]           [Maria, Sharapova, has, basically, no, friends,...]       
2           Roger...        ["Roger Federer advanced  ...",["..."]] [Roger,Federer,...]
3           Roger...        ["Roger Federer...",["..."]]            [Roger ,Federer,...]
4           Kei ...         ["Kei Nishikori will try...",["..."]]   [Kei,Nishikori,will,try,...]

python pandas nlp
1 Answer
Votes: 0
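from nltk.tokenize import word_tokenize
# Each cell holds a list of sentences, and word_tokenize expects a single string,
# so tokenize every sentence and flatten the results into one list per row.
new_df = df['token_sentences'].apply(lambda sents: [w for s in sents for w in word_tokenize(s)])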

new_df will hold the tokenized words for each row. Then add it to your df as a new column, for example:

df['token_words'] = new_df
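
For anyone who wants to try the flattening step without the original data, here is a minimal sketch. The toy DataFrame, the helper name sentences_to_words, and the regex-based simple_tokenize are all assumptions made for illustration. simple_tokenize is only a stand-in so the example runs without NLTK's tokenizer data, and it is not equivalent to word_tokenize.

```python
import re

import pandas as pd


def simple_tokenize(sentence):
    # Stand-in tokenizer (an assumption, not NLTK): returns runs of word
    # characters and individual punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)


def sentences_to_words(sentences, tokenize=simple_tokenize):
    # Tokenize every sentence in the list, then flatten into one list of words.
    return [word for sentence in sentences for word in tokenize(sentence)]


# Toy data invented for illustration; each cell is a list of sentences.
df = pd.DataFrame({
    "article_id": [1, 2],
    "token_sentences": [
        ["Maria has no friends.", "She said so."],
        ["Roger advanced."],
    ],
})

# One flat list of words per row, stored next to the sentence lists.
df["token_words"] = df["token_sentences"].apply(sentences_to_words)
```

To use NLTK instead, pass its tokenizer to the helper, e.g. df["token_sentences"].apply(lambda s: sentences_to_words(s, word_tokenize)). The results will differ from the stand-in on contractions and some punctuation.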

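Install nltk if you do not have it yet (word_tokenize also needs NLTK's Punkt tokenizer data, fetched with nltk.download('punkt') or, on newer NLTK versions, nltk.download('punkt_tab')):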

pip install nltk