Finding existing subtotals in a pandas DataFrame or a list of numbers

Question

Here is an interesting problem. Given a pandas DataFrame (or even a plain Python list), how can I find the subtotals that may be present in it? For example:

    running  value
0     False  50709
1     False  26715
2     False   1715
3     False  79139
4     False  34447
5     False   7256
6     False   1210
7     False  42913
8      True  36227
9     False    999
10    False  20107
11    False   5787
12    False  -1466
13    False   -216
14    False    615
15    False  24827
16     True  11400
17    False   5642
18     True   5758
19    False     -5
20     True   5753

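Observations about the data:

The flag may not be correct.
The data contains both subtotals and running totals. Rows [3, 7, 15] are subtotals, and rows [8, 16, 18, 20] are running totals.
Subtotal 3 can be treated as a special case, since it is both a subtotal and a running total.
I can identify the running totals by other means, which is why they are marked True in the sample data.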

The subtotals [3, 7, 15] represent rows [0, 1, 2], [4, 5, 6] and [10, 11, 12, 13, 14] respectively.
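The arithmetic behind that mapping can be checked directly. The sketch below uses a plain Python list with the sample values; the `claimed` dictionary is only an illustrative name for the mapping described above.

```python
# Sample values from the question, in row order.
values = [50709, 26715, 1715, 79139, 34447, 7256, 1210, 42913, 36227, 999,
          20107, 5787, -1466, -216, 615, 24827, 11400, 5642, 5758, -5, 5753]

# Subtotal row -> rows it is claimed to summarise (illustrative name).
claimed = {3: [0, 1, 2], 7: [4, 5, 6], 15: [10, 11, 12, 13, 14]}

for subtotal_row, members in claimed.items():
    total = sum(values[i] for i in members)
    # Print the subtotal row, the computed sum, and whether they agree.
    print(subtotal_row, total, total == values[subtotal_row])
```

Each of the three lines printed ends in True, so every claimed subtotal equals the sum of its member rows.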

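It is fair to assume that a subtotal follows a contiguous subset of numbers.
There may not be any subtotals at all.
I don't know whether there are cases where one subtotal's set of rows contains another, smaller subtotal's set. An answer that ignores this case would still be helpful.
The number of rows will be small, fewer than 100.

I need to identify the subtotals and the rows each subtotal represents.

What I have so far: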

import pandas as pd


def gen_subtotal_indices(df):
    targets = set()
    indices = []
    for i, r in df.iterrows():
        if r['running']:
            continue
        v = r['value']
        if v in targets:
            yield i, indices
            targets = set()
            indices = []
            continue
        if len(targets) == 0:
            targets = {x for x in (0, v, -v)}
            indices.append(i)
        else:
            targets |= {t + x for t in targets for x in (0, v, -v)}
            indices.append(i)


df = pd.DataFrame({'running': [False, False, False, False, False, False, False, False, True, False, False,
                               False, False, False, False, False, True, False, True, False, True],
                   'value': [50709, 26715, 1715, 79139, 34447, 7256, 1210, 42913, 36227, 999, 20107, 5787, -1466,
                             -216, 615, 24827, 11400, 5642, 5758, -5, 5753]})

print(df)
result = list(gen_subtotal_indices(df))
print(result)

This produces:

[(3, [0, 1, 2]), (7, [4, 5, 6]), (15, [9, 10, 11, 12, 13, 14])]

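The subtotals are identified correctly. However, you can see that item 9 is wrongly included in the list for the last subtotal.

Also, I use 0 in the comprehension to allow for leading rows that are not part of the subtotal. However, this means it may also pick up non-contiguous sublists that happen to sum to the subtotal, which is incorrect.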
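To make the non-contiguous problem concrete, here is a small sketch with made-up numbers (the values 10, 3, 5 and 15 are hypothetical, not taken from the sample data). It repeats the same set expansion as the generator above and shows a match that skips a middle row.

```python
def reachable_sums(values):
    """All sums obtainable when each value is multiplied by 0, 1 or -1."""
    targets = {0}
    for v in values:
        targets = {t + x for t in targets for x in (0, v, -v)}
    return targets


rows = [10, 3, 5]   # hypothetical detail rows
candidate = 15      # hypothetical row that follows them

# The set expansion accepts 15 because 10 + 0*3 + 5 == 15,
# even though rows 0 and 2 are not adjacent.
print(candidate in reachable_sums(rows))

# No contiguous run ending just above the candidate sums to 15
# (the possible sums are 5, 8 and 18).
print(any(sum(rows[start:]) == candidate for start in range(len(rows))))
```

The first check succeeds and the second fails, which is exactly the false positive described above.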

Answer 1

I have an answer:

import pandas as pd


def gen_subtotal_indices(df):
    targets = set()  # used for fast membership tests
    targets_lst = []
    signs = []
    indices = []
    for i, r in df.iterrows():
        if r['running']:
            continue
        v = r['value']
        if v in targets:
            yield i, indices, signs[targets_lst.index(v)]
            targets = set()
            targets_lst = []
            signs = []
            indices = []
            continue
        if len(targets) == 0:
            targets = {x for x in (0, v, -v)}
            targets_lst = [x for x in (0, v, -v)]
            signs = [[x] for x in (0, 1, -1)]
            indices.append(i)
        else:
            targets |= {t + x for t in targets for x in (0, v, -v)}
            targets_lst = [t + x for t in targets_lst for x in (0, v, -v)]
            signs = [t + [x] for t in signs for x in (0, 1, -1)]
            indices.append(i)


df = pd.DataFrame({'running': [False, False, False, False, False, False, False, False, True, False, False,
                               False, False, False, False, False, True, False, True, False, True],
                   'value': [50709, 26715, 1715, 79139, 34447, -7256, 1210, 42913, 36227, 999, 20107, 5787, -1466,
                             -216, 615, 24827, 11400, 5642, 5758, -5, 5753]})

print(df)
result = list(gen_subtotal_indices(df))
print(result)

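Each item in the result gives the index of the subtotal, then the indices of the included items, then a vector of 0, 1 or -1 values giving the multiplier applied to each original value. It could be tidier, but this is the general idea.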
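One possible refinement, offered only as a sketch: if subtotals always cover a contiguous run of rows directly above them, all with multiplier +1, and never nest, then a backwards walk from each row avoids both the stray leading row and non-contiguous matches. The function name `find_contiguous_subtotals` and the rule that a subtotal needs at least two member rows are my assumptions, not part of the question. It uses the original sample data from the question and does not handle the sign-flipped variant above.

```python
import pandas as pd


def find_contiguous_subtotals(df):
    """Yield (subtotal_label, member_labels) for contiguous, unsigned subtotals."""
    # Ignore rows already flagged as running totals.
    candidates = df.loc[~df['running'], 'value']
    labels = candidates.index.tolist()
    values = candidates.tolist()
    start = 0  # first position not yet consumed by an earlier subtotal
    for pos in range(len(values)):
        acc = 0
        members = []
        # Walk backwards from the row just above, accumulating a running sum.
        for back in range(pos - 1, start - 1, -1):
            acc += values[back]
            members.append(labels[back])
            # Assumption: a subtotal summarises at least two rows.
            if len(members) >= 2 and acc == values[pos]:
                yield labels[pos], members[::-1]
                start = pos + 1  # assumption: subtotals do not nest
                break


df = pd.DataFrame({'running': [False, False, False, False, False, False, False, False, True, False, False,
                               False, False, False, False, False, True, False, True, False, True],
                   'value': [50709, 26715, 1715, 79139, 34447, 7256, 1210, 42913, 36227, 999, 20107, 5787, -1466,
                             -216, 615, 24827, 11400, 5642, 5758, -5, 5753]})

result = list(find_contiguous_subtotals(df))
print(result)
# [(3, [0, 1, 2]), (7, [4, 5, 6]), (15, [10, 11, 12, 13, 14])]
```

With fewer than 100 rows the quadratic backwards scan is cheap. Row 9 is no longer included in the last subtotal, because the walk stops at the shortest run whose sum matches.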

Any ideas or improvements are appreciated!
