NLTK tokenize - create one single list of words from a pandas Series

Question

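Good afternoon, folks,

I am looking for help with NLTK, or with any other library that could help me solve the problem I am facing.

I am not a Python expert (I actually started learning Python only 4 months ago, and my main language is SQL), but I did a lot of research before asking you folks for help.

For example: "Tokenizing words into a new column in a pandas dataframe", "Passing a pandas dataframe column to an NLTK tokenizer", and so on.

Here is what I have: a dataframe containing a lot of information about what our students look for when they search for information on our website (it is a campus website).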

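It looks a bit like this:

What I would like to have is one big list that looks like this:

['query', 'exams', 'session', 'june', '2020', 'when', 'are', 'the', 'exams', 'exams', 'what', 's', 'my', 'teacher', 's', 'email', 'address']

===> One flat list with all the words (no sentences) and no punctuation.

I have tried: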

tokens = df['query'].apply(word_tokenize)
text = nltk.Text(tokens)
===>
This gives me a separate string for each row.

sentences = pd.Series(df.Name)
sentences = sentences.str.replace('[^A-z ]','').str.replace(' +',' ').str.strip()
splitwords = [ nltk.word_tokenize( str(sentence) ) for sentence in sentences ]
print(splitwords)
===>
A bit better, but still not what I want.

I hope my post is formatted correctly, and I apologize if I am wasting anyone's time.

Thanks for reading my post, Julien


Answer 1

You can do this:

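# Replace ?, . and ' with spaces; regex=True is required in pandas 2.0+
df['student_query'] = df['student_query'].str.replace(r"[?.']", ' ', regex=True)
# Join all rows into one string, then split on whitespace into a flat list
list_of_words = ' '.join(df['student_query']).split()
print(list_of_words)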

['exams', 'session', 'june', '2020', 'when', 'are', 'the', 'exams', 'exams', 'what', 's', 'my', 'teacher', 's', 'email', 'address']
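
If you would rather keep a tokenizer step and flatten afterwards (closer to what the question tried), something along these lines might work. This is only a sketch: the sample rows are made up to match the output above, and `regex_tokenize` and `flatten_tokens` are hypothetical helper names, not part of pandas or NLTK.

```python
import re
from itertools import chain

import pandas as pd

# Hypothetical sample data standing in for the real search log.
df = pd.DataFrame({
    "student_query": [
        "exams session june 2020",
        "when are the exams",
        "exams",
        "what's my teacher's email address?",
    ]
})


def regex_tokenize(text):
    """Split text into lowercase runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def flatten_tokens(series, tokenize=regex_tokenize):
    """Tokenize every row, then chain the per-row lists into one flat list."""
    per_row = series.dropna().astype(str).map(tokenize)
    return list(chain.from_iterable(per_row))


list_of_words = flatten_tokens(df["student_query"])
print(list_of_words)
```

Applying a tokenizer row by row always yields one list per row, which is why the attempts in the question ended up nested; `chain.from_iterable` is what merges them. You could pass `nltk.word_tokenize` as `tokenize` instead, but it keeps punctuation as separate tokens, so those would need to be filtered out afterwards.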
