How do I add a new column based on a group by and apply a condition to it?


I have the following code:

import pandas as pd
import numpy as np

data = {
    'id': [1, 2, 3, 4, 5, 6, 7],
    'date': ['2019-02-01', '2019-02-10', '2019-02-25', '2019-03-05', '2019-03-16', '2019-04-05', '2019-05-15'],
    'date_difference': [None, 9, 15, 11, 10, 19, 40],
    'number': [1, 0, 1, 0, 0, 0, 0],
    'text': ['A', 'A', 'A', 'A', 'A', 'B', 'B']
}

df = pd.DataFrame(data)

id  date        date_difference  number  text
1   2019-02-01  NaN              1       A
2   2019-02-10  9                0       A
3   2019-02-25  15               1       A
4   2019-03-05  11               0       A
5   2019-03-16  10               0       A
6   2019-04-05  19               0       B
7   2019-05-15  40               0       B

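Based on the `text` and `number` columns, I want to generate a new column named `test`. Within each `text` group, the rows are ordered by date in descending order. The step size starts at 1 while `number == 0`. When a 1 is found within the group, the step size increases by 1 for the rows that follow it. If the `number` column contains no 1 within a group, the step size stays at 1 for that whole group.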

I have the following code, but it does not produce the desired result:

df['test'] = df.groupby(['text', 'number'])['number'].transform(lambda x, step_size=1: step_size if x.iloc[0] == 0 else None) 
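One likely reason the attempt above falls short is the grouping key: grouping on both `text` and `number` puts the 1s and the 0s of the same `text` value into separate groups, so the function never sees them side by side. A minimal sketch to inspect this, using only the two relevant columns (the variable name `sizes` is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "number": [1, 0, 1, 0, 0, 0, 0],
    "text": ["A", "A", "A", "A", "A", "B", "B"],
})

# Each (text, number) pair becomes its own group, so the rows with
# number == 1 and number == 0 inside "A" end up in different groups.
sizes = df.groupby(["text", "number"]).size()
print(sizes)
```

This yields three groups, ("A", 0), ("A", 1) and ("B", 0), none of which preserves the date order across the 0s and 1s of one `text` value.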

The final table should look like this:

id  date        date_difference  number  text  test
1   2019-02-01  NaN              1       A     2
2   2019-02-10  9                0       A     2
3   2019-02-25  15               1       A     1
4   2019-03-05  11               0       A     1
5   2019-03-16  10               0       A     1
6   2019-04-05  19               0       B     1
7   2019-05-15  40               0       B     1
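
To pin the rule down, here is one reading of it written as a plain loop. This is only a sketch of how the expected table can be reproduced, not an efficient solution; the helper name `step_sizes` and the variable `expected` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2019-02-01", "2019-02-10", "2019-02-25", "2019-03-05",
             "2019-03-16", "2019-04-05", "2019-05-15"],
    "number": [1, 0, 1, 0, 0, 0, 0],
    "text": ["A", "A", "A", "A", "A", "B", "B"],
})


def step_sizes(group):
    """Walk one `text` group from the newest to the oldest date."""
    step = 1
    result = {}
    newest_first = group.sort_values("date", ascending=False)["number"]
    for idx, number in newest_first.items():
        result[idx] = step    # the current row gets the current step
        step += int(number)   # a 1 raises the step for the older rows
    return pd.Series(result)


# Collect the per-group results and restore the original row order.
expected = pd.concat([step_sizes(g) for _, g in df.groupby("text")]).sort_index()
print(expected.tolist())  # [2, 2, 1, 1, 1, 1, 1]
```

Note that the row holding the 1 itself still gets the old step; only the rows with older dates get the increased value, which matches ids 3 and 2 in the table.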
python pandas dataframe group-by analytics
1 Answer

My attempt:

import pandas as pd


data = {
    'id': [1, 2, 3, 4, 5, 6, 7],
    'date': ['2019-02-01', '2019-02-10', '2019-02-25', '2019-03-05', '2019-03-16', '2019-04-05', '2019-05-15'],
    'date_difference': [None, 9, 15, 11, 10, 19, 40],
    'number': [1, 0, 1, 0, 0, 0, 0],
    'text': ['A', 'A', 'A', 'A', 'A', 'B', 'B']
}

df = pd.DataFrame(data)

out = df.assign(
    # We assign the following values to the series name "test"
    test=df
    # Group on "text" -- if we grouped on ["text", "number"] we wouldn't see different numbers within the groups.
    .groupby("text")
    # Apply a chain of methods to the group (a pd.DataFrame).
    .apply(
        lambda g: (
            # We sort "date" in descending order as you mention this partially controls the step size.
            g.sort_values(by="date", ascending=False)
            # We shift "number" forward one period with a fill_value of 1 for any newly introduced nulls.
            .number.shift(periods=1, fill_value=1)
            # Cumulatively sum the shifted "number" values
            .cumsum()
        )
        # This will result in the new series, albeit sorted by descending "date".
    )
    # Drop the "text" level of the new multi-index.
    .droplevel("text")
    # The assign method acts as join, rearranging the newly created series to match the index of `df`.
)
print(out)
   id        date  date_difference  number text  test
0   1  2019-02-01              NaN       1    A     2
1   2  2019-02-10              9.0       0    A     2
2   3  2019-02-25             15.0       1    A     1
3   4  2019-03-05             11.0       0    A     1
4   5  2019-03-16             10.0       0    A     1
5   6  2019-04-05             19.0       0    B     1
6   7  2019-05-15             40.0       0    B     1
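
The same shift-then-cumsum idea can probably also be written without `apply`, which sidesteps the multi-index handling. The following is a sketch under the assumption that the ISO-formatted date strings sort correctly as text; the intermediate names `ordered` and `shifted` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "date": ["2019-02-01", "2019-02-10", "2019-02-25", "2019-03-05",
             "2019-03-16", "2019-04-05", "2019-05-15"],
    "number": [1, 0, 1, 0, 0, 0, 0],
    "text": ["A", "A", "A", "A", "A", "B", "B"],
})

# Newest rows first; the original index labels are kept.
ordered = df.sort_values("date", ascending=False)

# Within each "text" group, shift "number" by one row so that a 1 only
# affects the rows after it; the first row of each group is filled with 1.
shifted = ordered.groupby("text")["number"].shift(1, fill_value=1)

# Running total per group; assignment aligns on the index, so the
# descending sort does not change the row order of `df`.
df["test"] = shifted.groupby(ordered["text"]).cumsum()
print(df)
```

Because both intermediate series keep the original index, no `droplevel` step is needed here.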