I have a question about how to extract the most frequently used words from text grouped by date.
I have a dataset:
Date User Text
1 J.C. “The story of my life can be summerised as follows...”
1 M.J. I will go to a concert next week
1 M.J. Do you think I will play the guitar during the concert?
2 K.M. No one understands me
2 X.Z. We are alone in this world. I have no one on my side
3 L.B. I love you so much
3 M.B. I really love my kids
3 G.B. Stop to think about, and love me more
3 E.G. today is the best day of my life
4 V.B. Look! There is a cute little dog here
5 F.A. Stop killing animals
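...and so on. What I would like to do is extract the most frequently used words for each date, to see whether there is a pattern. So for each date, I should have a new column listing those words.
Any suggestions on how to do this?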
Something like this?
(df.set_index('Date').Text
   .str.lower()                       # lowercase the text
   .str.extractall(r'(\w+)')[0]       # extract every word, one per row
   .groupby('Date').value_counts()    # word frequency within each date
   .groupby('Date').head(5)           # keep the top 5 words per date
)
Output:
Date  0
1     the        3
      concert    2
      i          2
      will       2
      a          1
2     no         2
      one        2
      alone      1
      are        1
      have       1
3     love       3
      i          2
      my         2
      about      1
      and        1
4     a          1
      cute       1
      dog        1
      here       1
      is         1
5     animals    1
      killing    1
      stop       1
Name: 0, dtype: int64
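The result above is a Series indexed by date and word, not the new column the question asks for. One possible way to get that column is to collapse the top words for each date into a list and map it back onto the original frame. This is only a sketch: the helper `top_words`, its parameter `n`, and the column name `TopWords` are names I made up for illustration, and the frame below rebuilds only the first five rows of the sample data.

```python
import pandas as pd

# First five rows of the sample data from the question.
df = pd.DataFrame({
    "Date": [1, 1, 1, 2, 2],
    "User": ["J.C.", "M.J.", "M.J.", "K.M.", "X.Z."],
    "Text": [
        "The story of my life can be summerised as follows...",
        "I will go to a concert next week",
        "Do you think I will play the guitar during the concert?",
        "No one understands me",
        "We are alone in this world. I have no one on my side",
    ],
})

# Same word extraction as above: one lowercase word per row, indexed by Date.
words = (
    df.set_index("Date")["Text"]
      .str.lower()
      .str.extractall(r"(\w+)")[0]
)

def top_words(s, n=5):
    """Hypothetical helper: the n most frequent words in s, as a list."""
    return s.value_counts().head(n).index.tolist()

# One list of words per date.
top_by_date = words.groupby(level="Date").apply(top_words)

# Look up each row's date to fill the new column (name is an assumption).
df["TopWords"] = df["Date"].map(top_by_date)
print(df[["Date", "User", "TopWords"]])
```

Every row that shares a date gets the same list. Words with equal counts may come out in a different order depending on the pandas version, so it is safer not to rely on the order among ties.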