Filter a pandas MultiIndex DataFrame based on index column values

Question

I have the following multi-index DataFrame:

|  shape  |  colour  |        data        |
|                    |  d1  |  d2  |  d3  |
-------------------------------------------
| circle  |  green   |  2   |  4   |  9   |
| circle  |  red     |  -9  |  3   |  1   |
| square  |  orange  |  5   |  -6  |  2   |
| square  |  yellow  |  9   |  8   |  2   |

It is created with the following:

import pandas as pd

header = ["shape", "colour", "label", "data"]
data = [
[ "circle",  "green",  "d1", 2  ],
[ "circle",  "green",  "d2", 4  ],
[ "circle",  "green",  "d3", 9  ],
[ "square",  "orange", "d1", 5  ],
[ "square",  "orange", "d2", -6 ],
[ "square",  "orange", "d3", 2  ],
[ "circle",  "red",    "d1", -9 ],
[ "circle",  "red",    "d2", 3  ],
[ "circle",  "red",    "d3", 1  ],
[ "square",  "yellow", "d1", 9  ],
[ "square",  "yellow", "d2", 8  ],
[ "square",  "yellow", "d3", 2  ],
]

raw = pd.DataFrame(data, columns=header)
df = raw.pivot(index=["shape", "colour"], columns=["label"], values=["data"])

I want to filter the data values based on colour. With a fixed threshold this is easy; I can do

df[(abs(df["data"]) > 2)]

but I don't know, and can't find, how to do this when the threshold depends on the colour. With the following filter,

filter = {
    "red": {"threshold": 5},
    "green": {"threshold": 5},
    "yellow": {"threshold": 7},
    "orange": {"threshold": 1},
}

the desired output is:

|  shape  |  colour  |        data        |
|                    |  d1  |  d2  |  d3  |
-------------------------------------------
| circle  |  green   | NaN  | NaN  |  9   |
| circle  |  red     |  -9  | NaN  | NaN  |
| square  |  orange  |  5   |  -6  |  2   |
| square  |  yellow  |  9   |  8   | NaN  |

I tried

df.apply(lambda r: r["data"] if abs(r["data"]) > filter[r.index.get_level_values(1)].get("threshold", 2) else 0)

and

df[(abs(df["data"]) > filter[df.index.get_level_values(1)].get("threshold", 0))]

to no avail.

Answer 1

Starting from your raw dataset (before pivoting), we can use numpy's select function, as follows:

import numpy as np


filter = {
    "red": {"threshold": 5},
    "green": {"threshold": 5},
    "yellow": {"threshold": 7},
    "orange": {"threshold": 1},
}

condlist = [(raw['colour'] == i) & (raw['data'].abs() > filter[i]['threshold']) for i in filter]
choicelist = [raw['data'] for i in filter]            

raw['vals'] = np.select(condlist, choicelist, default=np.nan)

Output:

>>> raw
    shape   colour  label   data    vals
0   circle  green   d1          2   NaN
1   circle  green   d2          4   NaN
2   circle  green   d3          9   9.0
3   square  orange  d1          5   5.0
4   square  orange  d2         -6  -6.0
5   square  orange  d3          2   2.0
6   circle  red     d1         -9  -9.0
7   circle  red     d2          3   NaN
8   circle  red     d3          1   NaN
9   square  yellow  d1          9   9.0
10  square  yellow  d2          8   8.0
11  square  yellow  d3          2   NaN

We can then pivot the data to get the expected result:

df = raw.pivot(index=["shape", "colour"], columns=["label"], values=["vals"])

Output:

>>> df
                vals
        label   d1      d2      d3
shape   colour          
circle  green   NaN     NaN     9.0
        red    -9.0     NaN     NaN
square  orange  5.0    -6.0     2.0
        yellow  9.0     8.0     NaN
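The same per-colour masking can also be written without building one condition per colour. The following is a minimal sketch, not part of the original answer: it assumes the same raw frame as in the question, flattens the filter into a plain colour-to-threshold dict (named `thresholds` here, a hypothetical name chosen to avoid shadowing the built-in `filter`), and uses `Series.map` plus `Series.where`.

```python
import numpy as np
import pandas as pd

header = ["shape", "colour", "label", "data"]
data = [
    ["circle", "green", "d1", 2], ["circle", "green", "d2", 4], ["circle", "green", "d3", 9],
    ["square", "orange", "d1", 5], ["square", "orange", "d2", -6], ["square", "orange", "d3", 2],
    ["circle", "red", "d1", -9], ["circle", "red", "d2", 3], ["circle", "red", "d3", 1],
    ["square", "yellow", "d1", 9], ["square", "yellow", "d2", 8], ["square", "yellow", "d3", 2],
]
raw = pd.DataFrame(data, columns=header)

# Assumption: a flat colour -> threshold mapping instead of the nested dict.
thresholds = {"red": 5, "green": 5, "yellow": 7, "orange": 1}

# Look up each row's threshold from its colour.
limit = raw["colour"].map(thresholds)

# Keep values whose absolute value exceeds the row's threshold; others become NaN.
raw["vals"] = raw["data"].where(raw["data"].abs() > limit)

result = raw.pivot(index=["shape", "colour"], columns="label", values="vals")
print(result)
```

Passing scalars for `columns` and `values` gives a single-level column index (d1, d2, d3) rather than the two-level one shown above; which shape is preferable depends on how the result is used afterwards.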

Answer 2

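Create a DataFrame from the filter dictionary, indexed by colour. Then reindex it by the 'colour' level of df's index, which yields one threshold per row of df. Finally, compare df with the resulting values (each row gets its own threshold, depending on its colour):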

threshold = pd.DataFrame.from_dict(filter, orient='index') 
mask = df.abs() > threshold.reindex(df.index.get_level_values('colour')).values
print(df[mask])

Output:

              data          
label           d1   d2   d3
shape  colour               
circle green   NaN  NaN  9.0
       red    -9.0  NaN  NaN
square orange  5.0 -6.0  2.0
       yellow  9.0  8.0  NaN
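To make the intermediate step visible, here is a self-contained sketch of the same approach that also exposes the per-row threshold array. The names `colour_filter`, `per_row` and `filtered` are illustrative and not from the original answer; the sketch assumes a pandas version in which `pivot` accepts lists for `index`, `columns` and `values`.

```python
import pandas as pd

header = ["shape", "colour", "label", "data"]
data = [
    ["circle", "green", "d1", 2], ["circle", "green", "d2", 4], ["circle", "green", "d3", 9],
    ["square", "orange", "d1", 5], ["square", "orange", "d2", -6], ["square", "orange", "d3", 2],
    ["circle", "red", "d1", -9], ["circle", "red", "d2", 3], ["circle", "red", "d3", 1],
    ["square", "yellow", "d1", 9], ["square", "yellow", "d2", 8], ["square", "yellow", "d3", 2],
]
raw = pd.DataFrame(data, columns=header)
df = raw.pivot(index=["shape", "colour"], columns=["label"], values=["data"])

colour_filter = {
    "red": {"threshold": 5},
    "green": {"threshold": 5},
    "yellow": {"threshold": 7},
    "orange": {"threshold": 1},
}

# One row per colour, with a single "threshold" column.
threshold = pd.DataFrame.from_dict(colour_filter, orient="index")

# Repeat the thresholds in the order of df's rows; the result has shape (n_rows, 1).
per_row = threshold.reindex(df.index.get_level_values("colour")).to_numpy()

# The (n_rows, 1) array is broadcast across the d1, d2, d3 columns.
mask = df.abs() > per_row
filtered = df.where(mask)
print(filtered)
```

Converting to a NumPy array matters here: comparing against the reindexed DataFrame directly would try to align on labels, whereas the plain array is compared by position, row by row.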
