pandas - Merge nearly duplicate rows based on column value

Question

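I have a pandas dataframe with several rows that are near duplicates of each other, except for one value. My goal is to merge or "coalesce" these rows into a single row, without summing the numerical values.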

Here is an example of what I'm working with:

Name   Sid   Use_Case  Revenue
A      xx01  Voice     $10.00
A      xx01  SMS       $10.00
B      xx02  Voice     $5.00
C      xx03  Voice     $15.00
C      xx03  SMS       $15.00
C      xx03  Video     $15.00

Here is what I would like:

Name   Sid   Use_Case            Revenue
A      xx01  Voice, SMS          $10.00
B      xx02  Voice               $5.00
C      xx03  Voice, SMS, Video   $15.00

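The reason I don't want to sum the "Revenue" column is that my table is the result of pivoting over several time periods, so "Revenue" simply ends up being listed multiple times instead of having a different value per "Use_Case".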

What would be the best way to tackle this? I have looked into the groupby() function, but I still don't understand it very well.

Answer 1

I think you can use groupby with aggregate, passing first for the columns that repeat and the custom function ', '.join for the column that differs:

df = df.groupby('Name').agg({'Sid':'first', 
                             'Use_Case': ', '.join, 
                             'Revenue':'first' }).reset_index()

# change column order
print(df[['Name','Sid','Use_Case','Revenue']])
  Name   Sid           Use_Case Revenue
0    A  xx01         Voice, SMS  $10.00
1    B  xx02              Voice   $5.00
2    C  xx03  Voice, SMS, Video  $15.00

A nice idea from the comments, thanks Goyo:

df = df.groupby(['Name','Sid','Revenue'])['Use_Case'].apply(', '.join).reset_index()

# change column order
print(df[['Name','Sid','Use_Case','Revenue']])
  Name   Sid           Use_Case Revenue
0    A  xx01         Voice, SMS  $10.00
1    B  xx02              Voice   $5.00
2    C  xx03  Voice, SMS, Video  $15.00
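For anyone who wants to try this without an existing dataframe, here is a minimal self-contained sketch. It rebuilds the sample table from the question as strings (the construction of df and the name merged are my own assumptions, not part of the answer) and groups on every column that repeats, so only Use_Case is joined. sort=False is used only to keep the groups in their original order.

```python
import pandas as pd

# Sample data copied from the question (Revenue kept as strings, as shown there)
df = pd.DataFrame({
    'Name': ['A', 'A', 'B', 'C', 'C', 'C'],
    'Sid': ['xx01', 'xx01', 'xx02', 'xx03', 'xx03', 'xx03'],
    'Use_Case': ['Voice', 'SMS', 'Voice', 'Voice', 'SMS', 'Video'],
    'Revenue': ['$10.00', '$10.00', '$5.00', '$15.00', '$15.00', '$15.00'],
})

# Group on the columns that are identical within each near-duplicate set,
# then join the one column that differs into a single string per group.
merged = (
    df.groupby(['Name', 'Sid', 'Revenue'], sort=False)['Use_Case']
      .agg(', '.join)
      .reset_index()
)

# Restore the original column order
merged = merged[['Name', 'Sid', 'Use_Case', 'Revenue']]
print(merged)
```

Nothing is summed here: Revenue is part of the group key, so each group keeps its single repeated value.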

Answer 2

You can groupby and apply the list function:

>>> df['Use_Case'].groupby([df.Name, df.Sid, df.Revenue]).apply(list).reset_index()
    Name    Sid     Revenue     0
0   A   xx01    $10.00  [Voice, SMS]
1   B   xx02    $5.00   [Voice]
2   C   xx03    $15.00  [Voice, SMS, Video]

(If you are worried about duplicates, use set instead of list.)
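One thing to keep in mind with set is that it does not preserve the order of the values. The sketch below, on a small made-up group in which 'Voice' appears twice (the data and variable names are hypothetical), shows the set variant next to an order-preserving alternative based on dict.fromkeys. The alternative is my own substitution, not something the answer proposes.

```python
import pandas as pd

# Hypothetical data: the same Use_Case is listed twice for one customer
df = pd.DataFrame({
    'Name': ['A', 'A', 'A'],
    'Sid': ['xx01', 'xx01', 'xx01'],
    'Revenue': ['$10.00', '$10.00', '$10.00'],
    'Use_Case': ['Voice', 'SMS', 'Voice'],
})

grouped = df.groupby(['Name', 'Sid', 'Revenue'])['Use_Case']

# Variant from the answer: a set drops duplicates but has no defined order
as_sets = grouped.apply(set)

# Assumed alternative: dict.fromkeys drops duplicates and keeps first-seen order
as_text = grouped.agg(lambda s: ', '.join(dict.fromkeys(s)))

print(as_sets)
print(as_text)
```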

Answer 3

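I was using some code that I didn't think was optimal and eventually found jezrael's answer. But after using it and running a timeit test, I actually went back to what I was doing, which was: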

cmnts = {}
for i, row in df.iterrows():
    while True:
        try:
            if row['Use_Case']:
                cmnts[row['Name']].append(row['Use_Case'])

            else:
                cmnts[row['Name']].append('n/a')

            break

        except KeyError:
            cmnts[row['Name']] = []

df.drop_duplicates('Name', inplace=True)
df['Use_Case'] = ['; '.join(v) for v in cmnts.values()]

According to my 100-run timeit test, the iterate-and-replace method is roughly an order of magnitude faster than the groupby method.

import pandas as pd
from my_stuff import time_something

df = pd.DataFrame({'a': [i / (i % 4 + 1) for i in range(1, 10001)],
                   'b': [i for i in range(1, 10001)]})

runs = 100

interim_dict = 'txt = {}\n' \
               'for i, row in df.iterrows():\n' \
               '    try:\n' \
               "        txt[row['a']].append(row['b'])\n\n" \
               '    except KeyError:\n' \
               "        txt[row['a']] = []\n" \
               "df.drop_duplicates('a', inplace=True)\n" \
               "df['b'] = ['; '.join(v) for v in txt.values()]"

grouping = "new_df = df.groupby('a')['b'].apply(str).apply('; '.join).reset_index()"

print(time_something(interim_dict, runs, beg_string='Interim Dict', glbls=globals()))
print(time_something(grouping, runs, beg_string='Group By', glbls=globals()))

Yields:

Interim Dict
  Total: 59.1164s
  Avg: 591163748.5887ns

Group By
  Total: 430.6203s
  Avg: 4306203366.1827ns

where time_something is a function that times a snippet with timeit and returns the results in the format shown above.
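The answer does not show time_something itself. A helper along these lines would produce output in that shape. This is only a guess at what it might look like, inferred from the call signature and the printed format; the body is hypothetical.

```python
import timeit


def time_something(stmt, runs, beg_string='', glbls=None):
    """Hypothetical stand-in for the answer's helper (its real body is not shown).

    Runs `stmt` `runs` times with timeit and reports the total time in
    seconds and the average time per run in nanoseconds.
    """
    total = timeit.timeit(stmt, number=runs, globals=glbls)  # seconds for all runs
    avg_ns = total / runs * 1e9                               # nanoseconds per run
    return f'{beg_string}\n  Total: {total:.4f}s\n  Avg: {avg_ns:.4f}ns\n'


# Usage on a trivial snippet, just to show the report format
report = time_something('sum(range(1000))', 100, beg_string='Sum')
print(report)
```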

Answer 4

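Building on @jezrael's and @leoschet's answers, I would like to provide a more general example for the case where the dataframe has many more columns, which is something I had to do recently.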

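Specifically, my dataframe had a total of 184 columns. The REF column is the one that should be used as the reference for the groupby. Of the 182 remaining columns, only one other, called IDS, differed, and I wanted to collapse its elements into a list id1, id2, id3 ... So: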

# Create a dictionary {df_all_columns_name : 'first', 'IDS': join} for agg
# Also avoid REF column in dictionary (inserted after aggregation)
columns_collapse = {c: 'first' if c != 'IDS' else ', '.join for c in my_df.columns.tolist() if c != 'REF'}
my_df = my_df.groupby('REF').agg(columns_collapse).reset_index()
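To see the dictionary approach on something small enough to print, here is a sketch on a made-up four-column frame. The column names col_a and col_b, the data, and the name collapsed are placeholders of mine, standing in for the other columns of the real table.

```python
import pandas as pd

# Hypothetical miniature of the wide table: REF is the key, IDS is the
# only column whose values differ within a REF group.
my_df = pd.DataFrame({
    'REF': ['r1', 'r1', 'r2'],
    'IDS': ['id1', 'id2', 'id3'],
    'col_a': [1, 1, 2],
    'col_b': ['x', 'x', 'y'],
})

# Map every non-key column to 'first', except IDS, which is joined
columns_collapse = {
    c: ('first' if c != 'IDS' else ', '.join)
    for c in my_df.columns
    if c != 'REF'
}

collapsed = my_df.groupby('REF').agg(columns_collapse).reset_index()
print(collapsed)
```

Because the dictionary is built from my_df.columns, the same lines work unchanged whether there are 4 columns or 184.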

I hope this is useful to someone as well!

Regards!

Answer 5

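Here is an approach based on @jezrael's answer that adapts to whether a column contains duplicates within each group, so you don't need to know in advance which column to merge.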

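def aggfun(col):
    # agg passes each column of a group as a Series
    dup = col.duplicated()
    if dup.any():
        # the column repeats a value within the group: keep that one value
        return col[dup].iloc[0]
    else:
        # all values differ: join them into one string
        return ','.join(col.apply(str))
df = df.groupby('Name').agg(aggfun).reset_index()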
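A related way to get the same "detect the varying column automatically" behaviour is to drop the duplicates in each column and join only if more than one distinct value remains. The sketch below is my own variant rather than the answer's function, run on the sample data from the question; the names collapse and merged are assumptions.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'Name': ['A', 'A', 'B', 'C', 'C', 'C'],
    'Sid': ['xx01', 'xx01', 'xx02', 'xx03', 'xx03', 'xx03'],
    'Use_Case': ['Voice', 'SMS', 'Voice', 'Voice', 'SMS', 'Video'],
    'Revenue': ['$10.00', '$10.00', '$5.00', '$15.00', '$15.00', '$15.00'],
})


def collapse(col):
    """Return the single repeated value, or join the distinct values."""
    values = col.drop_duplicates()
    if len(values) == 1:
        return values.iloc[0]
    return ', '.join(values.astype(str))


merged = df.groupby('Name').agg(collapse).reset_index()
print(merged)
```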
