How to create a new DataFrame containing each word and its count, using another column

Question (votes: 0, answers: 2)

Let me explain. My df looks like this:

id    text                              c1
1     Hello world how are you people    1
2     Hello people I am fine  people    1
3     Good Morning people               -1
4     Good Evening                      -1

c1 contains only two values: 1 or -1.

Now I want a DataFrame like this (output):

Word      Totalcount     Points      PercentageOfPointAndTotalCount

hello        2             2              100
world        1             1              100
how          1             1              100
are          1             1              100
you          1             1              100
people       3             1              33.33
I            1             1              100
am           1             1              100
fine         1             1              100
Good         2             -2            -100
Morning      1             -1            -100
Evening      1             -1            -100

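Here, Totalcount is the total number of times each word appears in the text column.

Points is the sum of c1 for each word. Example: the word people appears in two rows where c1 is 1 and in one row where c1 is -1, so Points is 1 (2 - 1 = 1).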

PercentageOfPointAndTotalCount = Points / TotalCount * 100

print(df)

      id comment_text  target
0  59848  Hello world    -1.0
1  59849  Hello world    -1.0
python pandas
2 Answers
3
votes

I use unnesting after str.split; after that we only need groupby + agg.

df['text'] = df['text'].str.split()  # split each text into a list of words first
unnesting(df, ['text']).groupby('text').c1.agg(['count', 'sum'])
Out[873]: 
         count  sum
text               
Evening      1   -1
Good         2   -2
Hello        2    2
I            1    1
Morning      1   -1
am           1    1
are          1    1
fine         1    1
how          1    1
people       4    2
world        1    1
you          1    1

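import numpy as np
import pandas as pd

def unnesting(df, explode):
    # Repeat each row's index label once per element of its list
    idx = df.index.repeat(df[explode[0]].str.len())
    # Flatten every list column into one long column
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    # Re-attach the remaining columns (id, c1) by index
    return df1.join(df.drop(columns=explode), how='left')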
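On pandas 0.25 or later, the built-in DataFrame.explode can stand in for the unnesting helper. The following is only a sketch under that assumption, not part of the original answer: it rebuilds the sample frame from the question and also adds the percentage column the question asks for. The output column names are taken from the question.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": [
        "Hello world how are you people",
        "Hello people I am fine  people",
        "Good Morning people",
        "Good Evening",
    ],
    "c1": [1, 1, -1, -1],
})

# Split each text into a list of words, then give every word its own row
words = df.assign(Word=df["text"].str.split()).explode("Word")

# Count occurrences and sum c1 for each word
result = words.groupby("Word")["c1"].agg(Totalcount="count", Points="sum")

# Points as a percentage of the total count
result["PercentageOfPointAndTotalCount"] = (
    result["Points"] / result["Totalcount"] * 100
).round(2)

print(result)
```

Like the groupby output above, this counts every occurrence, so people gets a count of 4 and a sum of 2 (50 percent), because it appears twice in the second row.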

1
vote

Here is a self-contained version:

new_df = (df.set_index('c1').text.str.split().apply(pd.Series)
      .stack().reset_index().drop('level_1', axis=1))

new_df.groupby(0).c1.agg(['sum','count'])

Output:

+---------+-----+-------+
|         | sum | count |
+---------+-----+-------+
|    0    |     |       |
+---------+-----+-------+
| Evening |  -1 |     1 |
| Good    |  -2 |     2 |
| Hello   |   2 |     2 |
| I       |   1 |     1 |
| Morning |  -1 |     1 |
| am      |   1 |     1 |
| are     |   1 |     1 |
| fine    |   1 |     1 |
| how     |   1 |     1 |
| people  |   2 |     4 |
| world   |   1 |     1 |
| you     |   1 |     1 |
+---------+-----+-------+
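Note that the outputs above count every occurrence, so people gets count 4 and sum 2, while the expected table in the question lists 3 and 1. One possible reading is that the asker wants each word counted at most once per row. The sketch below assumes that reading (it is a guess at the intent, not something the question states) and assumes pandas 0.25 or later for explode and named aggregation.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": [
        "Hello world how are you people",
        "Hello people I am fine  people",
        "Good Morning people",
        "Good Evening",
    ],
    "c1": [1, 1, -1, -1],
})

# One row per word occurrence
words = df.assign(Word=df["text"].str.split()).explode("Word")

# Assumption: a word repeated within the same row counts only once
words = words.drop_duplicates(subset=["id", "Word"])

# Count rows containing each word and sum c1 over those rows
result = words.groupby("Word")["c1"].agg(Totalcount="count", Points="sum")
result["PercentageOfPointAndTotalCount"] = (
    result["Points"] / result["Totalcount"] * 100
).round(2)

print(result.loc["people"])
```

With this per-row counting, people ends up with Totalcount 3, Points 1 and a percentage of 33.33, which matches the row shown in the question.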