I have tokenized the text in one column into sentences and stored the result in a new column 'token_sentences'. I now want to use the 'token_sentences' column to create a new column 'token_words' that contains the tokenized words.
The df I am working with:
article_id article_text
1 Maria Sharapova has basically no friends as te...
2 Roger Federer advance...
3 Roger Federer has revealed that organisers of ...
4 Kei Nishikori will try to end his long losing ...
After adding the token_sentences column:
article_id article_text token_sentences
1 Maria Sharapova has basically no friends as te... [Maria Sharapova has basically no friends as te
2 Roger Federer advance... [Roger Federer advance...
3 Roger Federer has revealed that organisers of ... [Roger Federer has revealed that organisers of...
4 Kei Nishikori will try to end his long losing ... [Kei Nishikori will try to end his long losing...
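For context, the token_sentences column was presumably built by sentence-tokenizing article_text row by row; a minimal sketch, assuming nltk's sent_tokenize was used:

from nltk.tokenize import sent_tokenize
# split each article into a list of sentences (one list per row)
df['token_sentences'] = df['article_text'].apply(sent_tokenize)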
This gives a list of sentences per row. I am unable to flatten the lists in the token_sentences column so that I can use them in the next step.
I want to use the token_sentences column to make the df look like:
article_id article_text token_sentences token_words
1 Maria... ["Maria Sharapova..",["..."]] [Maria, Sharapova, has, basically, no, friends, ...]
2 Roger... ["Roger Federer advanced ...",["..."]] [Roger, Federer, ...]
3 Roger... ["Roger Federer...",["..."]] [Roger, Federer, ...]
4 Kei ... ["Kei Nishikori will try...",["..."]] [Kei, Nishikori, will, try, ...]
from nltk.tokenize import word_tokenize
# each row of token_sentences is a list of sentences, so tokenize every sentence and flatten the word lists into one list per row
new_df = df['token_sentences'].apply(lambda sents: [w for s in sents for w in word_tokenize(s)])
new_df will hold the tokenized words for each row; then add it back to your df like
df['token_words'] = new_df
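Equivalently, if you prefer a named helper over a nested comprehension, here is a sketch using itertools.chain (tokenize_words_in_row is a name introduced here just for illustration):

from itertools import chain
from nltk.tokenize import word_tokenize

def tokenize_words_in_row(sentences):
    # tokenize each sentence, then chain the word lists into one flat list
    return list(chain.from_iterable(word_tokenize(s) for s in sentences))

df['token_words'] = df['token_sentences'].apply(tokenize_words_in_row)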
Install nltk if you don't have it:
pip install nltk
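word_tokenize also relies on NLTK's punkt tokenizer models, which are downloaded separately; if you hit a LookupError, something like this should fetch them (recent NLTK releases may additionally ask for 'punkt_tab'):

import nltk
nltk.download('punkt')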