Extract top words by date


I have a question about how to extract the most frequently used words from text, grouped by date.

I have a dataset:

Date  User  Text
1     J.C.  “The story of my life can be summerised as follows...”
1     M.J.  I will go to a concert next week
1     M.J.  Do you think I will play the guitar during the concert?
2     K.M.  No one understands me
2     X.Z.  We are alone in this world. I have no one on my side
3     L.B.  I love you so much
3     M.B.  I really love my kids
3     G.B.  Stop to think about, and love me more
3     E.G.  today is the best day of my life
4     V.B.  Look! There is a cute little dog here
5     F.A.  Stop killing animals

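...and so on. What I would like to do is extract the most frequently used words for each date, to see whether there is any pattern. So, for each date, I should have a new column listing those words.

Any suggestions on how to do this?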

python pandas nltk
1 Answer

Something like this?

(df.set_index('Date').Text
   .str.lower()                       # lower case
   .str.extractall(r'(\w+)')[0]       # extract the words
   .groupby('Date').value_counts()    # frequency count by date
   .groupby('Date').head(5)           # top 5 by dates
)

Output:

Date  0      
1     the        3
      concert    2
      i          2
      will       2
      a          1
2     no         2
      one        2
      alone      1
      are        1
      have       1
3     love       3
      i          2
      my         2
      about      1
      and        1
4     a          1
      cute       1
      dog        1
      here       1
      is         1
5     animals    1
      killing    1
      stop       1
Name: 0, dtype: int64