Here is an interesting problem. Given a pandas DataFrame (or even a plain Python list), how can I find the subtotals that may be embedded in it? For example:
    running  value
0     False  50709
1     False  26715
2     False   1715
3     False  79139
4     False  34447
5     False   7256
6     False   1210
7     False  42913
8      True  36227
9     False    999
10    False  20107
11    False   5787
12    False  -1466
13    False   -216
14    False    615
15    False  24827
16     True  11400
17    False   5642
18     True   5758
19    False     -5
20     True   5753
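Observations about the data: rows [3, 7, 15] are subtotals and rows [8, 16, 18, 20] are running totals. The subtotals at [3, 7, 15] cover rows [0, 1, 2], [4, 5, 6], and [10, 11, 12, 13, 14] respectively. I need to identify each subtotal and the rows it covers.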
What I have so far:
import pandas as pd


def gen_subtotal_indices(df):
    targets = set()  # every sum reachable from the rows seen since the last subtotal
    indices = []     # indices of those rows
    for i, r in df.iterrows():
        if r['running']:
            continue  # running totals are neither candidates nor components
        v = r['value']
        if v in targets:
            # v matches some combination of the pending rows: report it and reset
            yield i, indices
            targets = set()
            indices = []
            continue
        if len(targets) == 0:
            targets = {0, v, -v}
        else:
            # each row may be skipped (0), added (v), or subtracted (-v)
            targets |= {t + x for t in targets for x in (0, v, -v)}
        indices.append(i)


df = pd.DataFrame({
    'running': [False, False, False, False, False, False, False, False, True, False, False,
                False, False, False, False, False, True, False, True, False, True],
    'value': [50709, 26715, 1715, 79139, 34447, 7256, 1210, 42913, 36227, 999, 20107, 5787, -1466,
              -216, 615, 24827, 11400, 5642, 5758, -5, 5753]})
print(df)
result = list(gen_subtotal_indices(df))
print(result)
This produces:
[(3, [0, 1, 2]), (7, [4, 5, 6]), (15, [9, 10, 11, 12, 13, 14])]
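This identifies the subtotals correctly. However, you can see that row 9 is wrongly included in the list for the last subtotal.
Also, I include 0 in the comprehension to allow for leading rows that are not part of the subtotal. But that also lets the search pick up non-contiguous sublists that happen to sum to the subtotal, which is incorrect.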
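One possible way to rule out non-contiguous matches is to test only contiguous windows that end at the row just above the candidate. The sketch below works on plain lists rather than a DataFrame, and the name `find_contiguous_subtotals` is made up for illustration. It assumes a subtotal is the plain sum (no sign flips) of at least two consecutive non-running rows, and it accepts the shortest window that matches.

```python
def find_contiguous_subtotals(running, values):
    """Return (subtotal_index, component_indices) pairs.

    Assumption: a subtotal equals the plain sum of at least two
    consecutive non-running rows directly above it.
    """
    results = []
    pending = []  # indices of non-running rows since the last subtotal
    for i, (is_running, v) in enumerate(zip(running, values)):
        if is_running:
            continue
        total = 0
        start = None
        # Grow the window backwards from the row just above i.
        for k in range(len(pending) - 1, -1, -1):
            total += values[pending[k]]
            if total == v and len(pending) - k >= 2:
                start = k
                break
        if start is None:
            pending.append(i)
        else:
            results.append((i, pending[start:]))
            pending = []
    return results


running = [False] * 21
for idx in (8, 16, 18, 20):
    running[idx] = True
values = [50709, 26715, 1715, 79139, 34447, 7256, 1210, 42913, 36227, 999,
          20107, 5787, -1466, -216, 615, 24827, 11400, 5642, 5758, -5, 5753]

print(find_contiguous_subtotals(running, values))
# [(3, [0, 1, 2]), (7, [4, 5, 6]), (15, [10, 11, 12, 13, 14])]
```

Because the window always ends directly above the candidate row, leading rows such as row 9 drop out without needing a 0 multiplier. The trade-off is that this version does not handle components whose sign is flipped.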
I have come up with an answer:
import pandas as pd


def gen_subtotal_indices(df):
    targets = set()   # used for fast test of inclusion
    targets_lst = []  # the same sums, ordered so they line up with signs
    signs = []        # signs[k] holds the multipliers that produce targets_lst[k]
    indices = []
    for i, r in df.iterrows():
        if r['running']:
            continue
        v = r['value']
        if v in targets:
            # report the first multiplier vector whose sum equals v, then reset
            yield i, indices, signs[targets_lst.index(v)]
            targets = set()
            targets_lst = []
            signs = []
            indices = []
            continue
        if len(targets) == 0:
            targets = {0, v, -v}
            targets_lst = [0, v, -v]
            signs = [[0], [1], [-1]]
        else:
            targets |= {t + x for t in targets for x in (0, v, -v)}
            targets_lst = [t + x for t in targets_lst for x in (0, v, -v)]
            signs = [t + [x] for t in signs for x in (0, 1, -1)]
        indices.append(i)


df = pd.DataFrame({
    'running': [False, False, False, False, False, False, False, False, True, False, False,
                False, False, False, False, False, True, False, True, False, True],
    'value': [50709, 26715, 1715, 79139, 34447, -7256, 1210, 42913, 36227, 999, 20107, 5787, -1466,
              -216, 615, 24827, 11400, 5642, 5758, -5, 5753]})
print(df)
result = list(gen_subtotal_indices(df))
print(result)
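In the result, you can see the index of each subtotal, followed by the indices of the candidate rows, followed by a vector of 0, 1, or -1 giving the multiplier applied to each of those rows' original values. It could be better, but this is the general idea.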
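To make the multiplier vector concrete, here is a small check that recombines the original values with a given vector. The helper `apply_multipliers` is a hypothetical name, and the two vectors below are hand-picked for illustration, not captured output from the code above.

```python
def apply_multipliers(values, indices, signs):
    """Sum values[i] * s over the paired indices and multipliers."""
    return sum(s * values[i] for i, s in zip(indices, signs))


# Same values as the second DataFrame (row 5 is -7256 here).
values = [50709, 26715, 1715, 79139, 34447, -7256, 1210, 42913, 36227, 999,
          20107, 5787, -1466, -216, 615, 24827, 11400, 5642, 5758, -5, 5753]

# Row 5 is stored with the opposite sign, so it needs a multiplier of -1.
print(apply_multipliers(values, [4, 5, 6], [1, -1, 1]))  # 42913

# A multiplier of 0 drops row 9 from the subtotal at row 15.
print(apply_multipliers(values, [9, 10, 11, 12, 13, 14], [0, 1, 1, 1, 1, 1]))  # 24827
```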
Any ideas or improvements are appreciated!