I have the following code:
import pandas as pd
import numpy as np
data = {
    'id': [1, 2, 3, 4, 5, 6, 7],
    'date': ['2019-02-01', '2019-02-10', '2019-02-25', '2019-03-05', '2019-03-16', '2019-04-05', '2019-05-15'],
    'date_difference': [None, 9, 15, 11, 10, 19, 40],
    'number': [1, 0, 1, 0, 0, 0, 0],
    'text': ['A', 'A', 'A', 'A', 'A', 'B', 'B']
}
df = pd.DataFrame(data)
id | date | date_difference | number | text |
---|---|---|---|---|
1 | 2019-02-01 | None | 1 | A |
2 | 2019-02-10 | 9 | 0 | A |
3 | 2019-02-25 | 15 | 1 | A |
4 | 2019-03-05 | 11 | 0 | A |
5 | 2019-03-16 | 10 | 0 | A |
6 | 2019-04-05 | 19 | 0 | B |
7 | 2019-05-15 | 40 | 0 | B |
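Based on the `text` and `number` columns, I want to generate a new column named `test`.
Within each `text` group, the rows are processed in descending order of `date`.
The step starts at 1. Each time a row with `number == 1` is found within the group, the step increases by 1 for the rows that follow it.
If a group has no 1 in the `number` column, the step stays at 1 for the whole group.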
I have the following code, but it does not produce the desired result.
df['test'] = df.groupby(['text', 'number'])['number'].transform(lambda x, step_size=1: step_size if x.iloc[0] == 0 else None)
The final table should look like this:
id | date | date_difference | number | text | test |
---|---|---|---|---|---|
1 | 2019-02-01 | None | 1 | A | 2 |
2 | 2019-02-10 | 9 | 0 | A | 2 |
3 | 2019-02-25 | 15 | 1 | A | 1 |
4 | 2019-03-05 | 11 | 0 | A | 1 |
5 | 2019-03-16 | 10 | 0 | A | 1 |
6 | 2019-04-05 | 19 | 0 | B | 1 |
7 | 2019-05-15 | 40 | 0 | B | 1 |
My attempt:
import pandas as pd

data = {
    'id': [1, 2, 3, 4, 5, 6, 7],
    'date': ['2019-02-01', '2019-02-10', '2019-02-25', '2019-03-05', '2019-03-16', '2019-04-05', '2019-05-15'],
    'date_difference': [None, 9, 15, 11, 10, 19, 40],
    'number': [1, 0, 1, 0, 0, 0, 0],
    'text': ['A', 'A', 'A', 'A', 'A', 'B', 'B']
}
df = pd.DataFrame(data)

out = df.assign(
    # We assign the following values to the series named "test".
    test=df
    # Group on "text" -- if we grouped on ["text", "number"] we wouldn't see different numbers within the groups.
    .groupby("text")
    # Apply a chain of methods to each group (a pd.DataFrame).
    .apply(
        lambda g: (
            # Sort "date" in descending order, since you mention this partially controls the step size.
            g.sort_values(by="date", ascending=False)
            # Shift "number" forward one period, with a fill_value of 1 for the newly introduced null.
            .number.shift(periods=1, fill_value=1)
            # Cumulatively sum the shifted "number" values.
            .cumsum()
        )
        # This results in the new series, albeit sorted by descending "date".
    )
    # Drop the "text" level of the new multi-index.
    .droplevel("text")
    # The assign method acts as a join, realigning the newly created series to the index of `df`.
)
print(out)
   id        date  date_difference  number  text  test
0   1  2019-02-01              NaN       1     A     2
1   2  2019-02-10              9.0       0     A     2
2   3  2019-02-25             15.0       0     A     1
3   4  2019-03-05             11.0       0     A     1
4   5  2019-03-16             10.0       0     A     1
5   6  2019-04-05             19.0       0     B     1
6   7  2019-05-15             40.0       0     B     1
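For comparison, here is a minimal sketch of the same idea without `apply`, assuming the rule is "1 plus the number of 1s seen on earlier rows of the group, newest date first". That reading is inferred from the expected table, not stated outright in the question. The names `ordered` and `step` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "date": ["2019-02-01", "2019-02-10", "2019-02-25", "2019-03-05",
             "2019-03-16", "2019-04-05", "2019-05-15"],
    "number": [1, 0, 1, 0, 0, 0, 0],
    "text": ["A", "A", "A", "A", "A", "B", "B"],
})

# Sort newest-first. ISO-formatted date strings sort correctly as text;
# a stable sort keeps rows with equal dates in their original order.
ordered = df.sort_values("date", ascending=False, kind="stable")

# Inclusive running count of 1s per "text" group, in newest-first order.
running = ordered.groupby("text")["number"].cumsum()

# Subtract the current row's own value to count only *earlier* 1s,
# then add 1 so every group starts at step 1.
step = running - ordered["number"] + 1

# Assignment aligns on the index, so the original row order is preserved.
df["test"] = step
print(df)
```

This should be equivalent to the `shift(fill_value=1).cumsum()` chain above, since both count the 1s that precede each row within its group. It also avoids passing the grouping column into `apply`, which recent pandas versions warn about.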