How to get unique words from a DataFrame column of strings?

Question

I'm looking for a way to get a list of the unique words in a DataFrame column of strings.

import pandas as pd
import numpy as np

df = pd.read_csv('FinalStemmedSentimentAnalysisDataset.csv', sep=';',dtype= 
       {'tweetId':int,'tweetText':str,'tweetDate':str,'sentimentLabel':int})

tweets = {}
tweets[0] = df[df['sentimentLabel'] == 0]
tweets[1] = df[df['sentimentLabel'] == 1]

The dataset I'm using comes from this link: http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/

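The strings in this column have variable length, and I want a list of every unique word in the column together with its count. How can I get that? I'm using Pandas in Python, and the original dataset has more than 1M rows, so I also need an efficient approach that handles it fast enough and doesn't keep the code running for too long.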

An example of the column could be:

is so sad for my apl friend.
omg this is terrible.
what is this new song?

And the list could be something like:

[is,so,sad,for,my,apl,friend,omg,this,terrible,what,new,song]

Answer 1

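First you have to clean the sentences to remove characters such as ., ? and so on. I use a regex to keep only letters and spaces. Finally, you have to convert all words to lower case (or upper case) so that the same word is not counted twice.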

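import pandas as pd

df = pd.DataFrame({
    'sentences': [
        'is so sad for my apl friend.',
        'omg this is terrible.',
        'what is this new song?',
    ]
})

# keep only letters and spaces, lowercase, split into words,
# then concatenate the per-row lists with sum()
# (regex=True is required in recent pandas, where str.replace() is literal by default)
unique = set(df['sentences'].str.replace('[^a-zA-Z ]', '', regex=True).str.lower().str.split(' ').sum())

print(list(sorted(unique)))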

Result

['apl', 'for', 'friend', 'is', 'my', 'new', 'omg', 'sad', 'so', 'song', 'terrible', 'this', 'what']


EDIT: as @HenryYik mentioned in the comments, findall(r'\w+') can be used instead of split(), and it also makes replace() unnecessary:

unique = set(df['sentences'].str.lower().str.findall(r"\w+").sum())
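The two tokenizations are not quite equivalent, which may matter for tweets. Here is a small sketch using the standard `re` module rather than pandas; the sample sentence is made up for illustration and is not from the dataset:

```python
import re

# Hypothetical sample text, chosen to contain an apostrophe and a digit.
text = "Don't stop 2day"

# Approach 1: keep only letters and spaces, then split on single spaces.
letters_only = re.sub(r'[^a-zA-Z ]', '', text).lower().split(' ')

# Approach 2: extract runs of word characters (letters, digits, underscore).
word_chars = re.findall(r'\w+', text.lower())

print(letters_only)  # ['dont', 'stop', 'day']
print(word_chars)    # ['don', 't', 'stop', '2day']
```

So `\w+` keeps digits and splits on apostrophes, while the letters-only regex drops digits entirely. Which behaviour is preferable depends on what should count as a word.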

EDIT:
I tested it with the data from

http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/

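Every method works fast except column.sum() or sum(column). I measured the time for 1000 rows and extrapolated to 1.5 million rows: it would need about 35 minutes.
Using itertools.chain() is much faster; it needs about 8 seconds.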
import itertools

words = df['sentences'].str.lower().str.findall(r"\w+")
# flatten the per-row lists of words into one list
words = list(itertools.chain.from_iterable(words))
unique = set(words)
But the words can also be put into a set() directly.
words = df['sentences'].str.lower().str.findall(r"\w+")

unique = set()

# add the words of each row to the set in place
for x in words:
    unique.update(x)
This needs about 5 seconds.
Full code:
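The gap in timings has a simple explanation: adding lists with sum() creates a new, longer list at every step, so the total copying grows roughly quadratically with the number of rows, while chain and set.update() touch each word once. A minimal sketch with made-up token lists (standing in for the output of findall) shows that the three approaches give the same set:

```python
import itertools

# Hypothetical token lists, standing in for the result of .str.findall(r"\w+").
rows = [['is', 'so', 'sad'], ['omg', 'this', 'is'], ['what', 'is', 'this']]

# sum() builds a brand-new list on every addition, so the total work
# grows roughly quadratically with the number of rows.
via_sum = set(sum(rows, []))

# chain.from_iterable() walks each inner list once without copying.
via_chain = set(itertools.chain.from_iterable(rows))

# set.update() adds the tokens of each row in place.
via_update = set()
for tokens in rows:
    via_update.update(tokens)

print(via_sum == via_chain == via_update)  # True
print(sorted(via_update))  # ['is', 'omg', 'sad', 'so', 'this', 'what']
```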
import pandas as pd
import time

print(time.strftime('%H:%M:%S'), 'start')
print('-----')

#------------------------------------------------------------------------------

start = time.time()

# `read_csv()` can read directly from a URL, including a zip-compressed file
#url = 'http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip'
url = 'SentimentAnalysisDataset.csv'

# need to skip two rows which are incorrect
df = pd.read_csv(url, sep=',', dtype={'ItemID':int, 'Sentiment':int, 'SentimentSource':str, 'SentimentText':str}, skiprows=[8835, 535881])

end = time.time()
print(time.strftime('%H:%M:%S'), 'load:', end-start, 's')
print('-----')

#------------------------------------------------------------------------------

start = end

words = df['SentimentText'].str.lower().str.findall(r"\w+")
#df['words'] = words

end = time.time()
print(time.strftime('%H:%M:%S'), 'words:', end-start, 's')
print('-----')

#------------------------------------------------------------------------------

start = end

unique = set()
for x in words:
    unique.update(x)

end = time.time()
print(time.strftime('%H:%M:%S'), 'set:', end-start, 's')
print('-----')

#------------------------------------------------------------------------------

print(list(sorted(unique))[:10])
Result
00:27:04 start
-----
00:27:08 load: 4.10780930519104 s
-----
00:27:23 words: 14.803470849990845 s
-----
00:27:27 set: 4.338541269302368 s
-----
['0', '00', '000', '0000', '00000', '000000000000', '0000001', '000001', '000014', '00004873337e0033fea60']
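The question also asks for the count of each word, which a set alone does not give. One possible sketch, assuming the same small sample DataFrame as above, swaps the set for `collections.Counter` from the standard library. The `dropna()` call is an addition of mine to skip missing values, which would otherwise break the loop:

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    'sentences': [
        'is so sad for my apl friend.',
        'omg this is terrible.',
        'what is this new song?',
    ]
})

# One list of lowercase tokens per row; dropna() skips rows with missing text.
words = df['sentences'].str.lower().str.findall(r'\w+').dropna()

# Counter.update() adds one to the count of every token it is given.
counts = Counter()
for tokens in words:
    counts.update(tokens)

print(len(counts))            # number of unique words: 13
print(counts.most_common(2))  # [('is', 3), ('this', 2)]
```

The keys of `counts` are the unique words, so `sorted(counts)` gives the same list as the set-based version.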
