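I have a dataframe in which some columns come in name-matched pairs (for every column ending in _x there is a corresponding column ending in _y), while other columns do not. For example: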
import pandas as pd
import numpy as np
colnames = [
'foo', 'bar', 'baz',
'a_x', 'b_x', 'c_x',
'a_y', 'b_y', 'c_y',
]
rng = np.random.default_rng(0)
data = rng.random((20, len(colnames)))
df = pd.DataFrame(data, columns=colnames)
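Suppose I have two lists, one holding all the column names that end in _x and one holding all the column names that end in _y (such lists are easy to build), both of the same length m (remember that each _x column has exactly one corresponding _y column). I want to create m new columns with a simple formula: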
df['a_err'] = (df['a_x'] - df['a_y']) / df['a_y']
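without hard-coding the column names, of course. This is easy to do with a for loop, but I would like to know whether the same thing can be done without a loop, in the hope that it would be faster (the real dataframe is much larger than this small example).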
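For reference, the loop-based baseline mentioned above might look like the sketch below. The names x_cols and y_cols are assumptions introduced here for illustration; they are not part of the original question.

```python
import numpy as np
import pandas as pd

colnames = ['foo', 'bar', 'baz', 'a_x', 'b_x', 'c_x', 'a_y', 'b_y', 'c_y']
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((20, len(colnames))), columns=colnames)

# Build the list of _x column names from the frame itself
x_cols = [c for c in df.columns if c.endswith('_x')]
# Derive each _y name from its _x partner so the two lists are paired by construction
y_cols = [c[:-2] + '_y' for c in x_cols]

# Loop baseline: one iteration per pair, each vectorised over all rows
for cx, cy in zip(x_cols, y_cols):
    df[cx[:-2] + '_err'] = (df[cx] - df[cy]) / df[cy]
```

The loop runs m times over columns, not over rows, so it is worth timing it against any loop-free alternative on the real data before assuming it is the bottleneck.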
You can use groupby and apply with a custom function:
# Each group holds one (_x, _y) pair of columns, in that order
func = lambda sr: (sr.iloc[:, 0] - sr.iloc[:, 1]) / sr.iloc[:, 1]
# r'' marks a raw string, just as f'' marks a formatted string
# Keep only the columns that end ($) with '_x' or '_y' (_[xy])
# Group the columns by prefix (a_x -> a, b_x -> b, ..., c_y -> c)
# Apply the formula to each group (a_x, a_y), (b_x, b_y), (c_x, c_y)
# Note: groupby(axis=1) is deprecated since pandas 2.1
err = (df.filter(regex=r'_[xy]$')
         .groupby(lambda x: x.split('_')[0], axis=1)
         .apply(func).add_suffix('_err'))
# Append the error columns to the original dataframe
df = pd.concat([df, err], axis=1)
Output:
>>> df
foo bar baz a_x b_x c_x a_y b_y c_y a_err b_err c_err
0 0.636962 0.269787 0.040974 0.016528 0.813270 0.912756 0.606636 0.729497 0.543625 -0.972755 0.114838 0.679017
1 0.935072 0.815854 0.002739 0.857404 0.033586 0.729655 0.175656 0.863179 0.541461 3.881166 -0.961091 0.347567
2 0.299712 0.422687 0.028320 0.124283 0.670624 0.647190 0.615385 0.383678 0.997210 -0.798040 0.747885 -0.351000
3 0.980835 0.685542 0.650459 0.688447 0.388921 0.135097 0.721488 0.525354 0.310242 -0.045796 -0.259697 -0.564545
4 0.485835 0.889488 0.934044 0.357795 0.571530 0.321869 0.594300 0.337911 0.391619 -0.397955 0.691361 -0.178106
5 0.890274 0.227158 0.623187 0.084015 0.832644 0.787098 0.239369 0.876484 0.058568 -0.649014 -0.050018 12.439042
6 0.336117 0.150279 0.450339 0.796324 0.230642 0.052021 0.404552 0.198513 0.090753 0.968411 0.161849 -0.426782
7 0.580332 0.298696 0.671995 0.199515 0.942113 0.365110 0.105495 0.629108 0.927155 0.891226 0.497538 -0.606204
8 0.440377 0.954590 0.499896 0.425229 0.620213 0.995097 0.948944 0.460045 0.757729 -0.551893 0.348158 0.313262
9 0.497423 0.529312 0.785786 0.414656 0.734484 0.711143 0.932060 0.114933 0.729015 -0.555119 5.390557 -0.024516
10 0.927424 0.967926 0.014706 0.863640 0.981195 0.957210 0.148764 0.972629 0.889936 4.805437 0.008807 0.075595
11 0.822374 0.479988 0.232373 0.801881 0.923530 0.266130 0.538934 0.442753 0.931017 0.487900 1.085882 -0.714151
12 0.040511 0.732006 0.614373 0.028365 0.719220 0.015992 0.757951 0.512759 0.929104 -0.962576 0.402648 -0.982788
13 0.066082 0.841317 0.066690 0.344310 0.430299 0.966062 0.562232 0.258865 0.241676 -0.387601 0.662254 2.997349
14 0.888118 0.225869 0.124555 0.288331 0.586123 0.554091 0.809711 0.560476 0.288421 -0.643909 0.045760 0.921116
15 0.412896 0.818121 0.626506 0.959078 0.369404 0.552612 0.593924 0.848291 0.145474 0.614815 -0.564531 2.798708
16 0.406510 0.909959 0.043067 0.822706 0.415384 0.829804 0.009955 0.365046 0.078630 81.646166 0.137895 9.553270
17 0.652615 0.273849 0.702652 0.943801 0.126817 0.864778 0.059464 0.380771 0.429774 14.871772 -0.666946 1.012170
18 0.488850 0.976462 0.775691 0.308857 0.269837 0.863120 0.881307 0.510707 0.344296 -0.649546 -0.471640 1.506915
19 0.994917 0.315944 0.182712 0.880098 0.812335 0.667889 0.958414 0.925715 0.748249 -0.081714 -0.122477 -0.107396
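Since groupby(axis=1) is deprecated in recent pandas releases (from 2.1 on), the same grouping idea can be sketched on the transposed frame instead. This is an adaptation under the assumption that each _x column appears before its _y partner; the helper name rel_err and the result name out are hypothetical.

```python
import numpy as np
import pandas as pd

colnames = ['foo', 'bar', 'baz', 'a_x', 'b_x', 'c_x', 'a_y', 'b_y', 'c_y']
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((20, len(colnames))), columns=colnames)

def rel_err(group):
    # After transposing, each group has two rows: the _x row, then the _y row
    return (group.iloc[0] - group.iloc[1]) / group.iloc[1]

err = (df.filter(regex=r'_[xy]$')
         .T                                          # columns become rows
         .groupby(lambda name: name.split('_')[0])   # group rows by prefix
         .apply(rel_err)                             # one result row per prefix
         .T                                          # back to one column per prefix
         .add_suffix('_err'))

out = pd.concat([df, err], axis=1)
```

Transposing copies the data, so on a very large frame this variant may cost more memory than the approaches that avoid groupby.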
You can also split the columns with filter:
# Requires Python >= 3.8 (walrus operator)
# .values drops the labels, so the _x and _y columns are paired by position
err = (df.filter(regex='_x$').values - (y := df.filter(regex='_y$'))) / y
err.columns = err.columns.str.split('_').str[0] + '_err'
df = pd.concat([df, err], axis=1)
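Because the snippet above pairs columns by position, it relies on the _x and _y columns appearing in the same order. If that cannot be guaranteed, a label-aligned variant is one option. The sketch below uses a hypothetical frame with deliberately shuffled columns and strips the suffixes so that pandas aligns a with a and b with b.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical column order in which positional pairing would be wrong
cols = ['b_x', 'a_x', 'a_y', 'b_y']
df = pd.DataFrame(rng.random((5, len(cols))), columns=cols)

x = df.filter(regex='_x$')
y = df.filter(regex='_y$')
# Strip the suffixes so both frames share the labels 'a', 'b', ...
x.columns = x.columns.str.replace('_x$', '', regex=True)
y.columns = y.columns.str.replace('_y$', '', regex=True)

# Label-aligned arithmetic matches columns by name, regardless of order
err = ((x - y) / y).add_suffix('_err')
out = pd.concat([df, err], axis=1)
```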
If you already have the lists, you can use the following:
q = ['a', 'b', 'c']
x = ['a_x', 'b_x', 'c_x']
y = ['a_y', 'b_y', 'c_y']
e = ['a_err', 'b_err', 'c_err']
df[e] = (df[x].rename(columns=dict(zip(x, q))) - df[y].rename(columns=dict(zip(y, q)))) / df[y].rename(columns=dict(zip(y, q)))
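There may be a more efficient way, since this renames the columns each time so that the subtraction and division line up on matching labels.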
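One possibly cheaper route, assuming the lists x, y and e are already in matching order, is to drop down to NumPy arrays so that no renaming or label alignment is needed. This is a sketch, not a benchmarked recommendation; the names xv and yv are introduced here for illustration.

```python
import numpy as np
import pandas as pd

colnames = ['foo', 'bar', 'baz', 'a_x', 'b_x', 'c_x', 'a_y', 'b_y', 'c_y']
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((20, len(colnames))), columns=colnames)

x = ['a_x', 'b_x', 'c_x']
y = ['a_y', 'b_y', 'c_y']
e = ['a_err', 'b_err', 'c_err']

# Plain 2-D arrays: column i of xv is paired with column i of yv
xv = df[x].to_numpy()
yv = df[y].to_numpy()

# Re-attach the labels once, then append all new columns in a single concat
err = pd.DataFrame((xv - yv) / yv, columns=e, index=df.index)
df = pd.concat([df, err], axis=1)
```

The pairing here is purely positional, so the result is only correct if x[i] and y[i] really belong together for every i.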
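Because the x and y columns are paired one-to-one, you could restrict a for loop to the columns and still get good performance. The solution below uses a MultiIndex instead and hands as much of the processing as possible to pandas: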
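# Keep the paired columns (only they contain '_')
temp = df.filter(like='_')
# Split 'a_x' into ('a', 'x') to build a two-level column MultiIndex
temp.columns = temp.columns.str.split('_', expand=True)
# Put the x/y level on top so temp.x and temp.y select each side
temp = temp.swaplevel(axis='columns')
temp = temp.x.sub(temp.y).div(temp.y).add_suffix('_err')
# assign returns a new dataframe; df itself is not modified
df.assign(**temp)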
foo bar baz a_x b_x c_x a_y b_y c_y a_err b_err c_err
0 0.636962 0.269787 0.040974 0.016528 0.813270 0.912756 0.606636 0.729497 0.543625 -0.972755 0.114838 0.679017
1 0.935072 0.815854 0.002739 0.857404 0.033586 0.729655 0.175656 0.863179 0.541461 3.881166 -0.961091 0.347567
2 0.299712 0.422687 0.028320 0.124283 0.670624 0.647190 0.615385 0.383678 0.997210 -0.798040 0.747885 -0.351000
3 0.980835 0.685542 0.650459 0.688447 0.388921 0.135097 0.721488 0.525354 0.310242 -0.045796 -0.259697 -0.564545
4 0.485835 0.889488 0.934044 0.357795 0.571530 0.321869 0.594300 0.337911 0.391619 -0.397955 0.691361 -0.178106
5 0.890274 0.227158 0.623187 0.084015 0.832644 0.787098 0.239369 0.876484 0.058568 -0.649014 -0.050018 12.439042
6 0.336117 0.150279 0.450339 0.796324 0.230642 0.052021 0.404552 0.198513 0.090753 0.968411 0.161849 -0.426782
7 0.580332 0.298696 0.671995 0.199515 0.942113 0.365110 0.105495 0.629108 0.927155 0.891226 0.497538 -0.606204
8 0.440377 0.954590 0.499896 0.425229 0.620213 0.995097 0.948944 0.460045 0.757729 -0.551893 0.348158 0.313262
9 0.497423 0.529312 0.785786 0.414656 0.734484 0.711143 0.932060 0.114933 0.729015 -0.555119 5.390557 -0.024516
10 0.927424 0.967926 0.014706 0.863640 0.981195 0.957210 0.148764 0.972629 0.889936 4.805437 0.008807 0.075595
11 0.822374 0.479988 0.232373 0.801881 0.923530 0.266130 0.538934 0.442753 0.931017 0.487900 1.085882 -0.714151
12 0.040511 0.732006 0.614373 0.028365 0.719220 0.015992 0.757951 0.512759 0.929104 -0.962576 0.402648 -0.982788
13 0.066082 0.841317 0.066690 0.344310 0.430299 0.966062 0.562232 0.258865 0.241676 -0.387601 0.662254 2.997349
14 0.888118 0.225869 0.124555 0.288331 0.586123 0.554091 0.809711 0.560476 0.288421 -0.643909 0.045760 0.921116
15 0.412896 0.818121 0.626506 0.959078 0.369404 0.552612 0.593924 0.848291 0.145474 0.614815 -0.564531 2.798708
16 0.406510 0.909959 0.043067 0.822706 0.415384 0.829804 0.009955 0.365046 0.078630 81.646166 0.137895 9.553270
17 0.652615 0.273849 0.702652 0.943801 0.126817 0.864778 0.059464 0.380771 0.429774 14.871772 -0.666946 1.012170
18 0.488850 0.976462 0.775691 0.308857 0.269837 0.863120 0.881307 0.510707 0.344296 -0.649546 -0.471640 1.506915
19 0.994917 0.315944 0.182712 0.880098 0.812335 0.667889 0.958414 0.925715 0.748249 -0.081714 -0.122477 -0.107396
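The MultiIndex approach above assumes that only the paired columns contain an underscore and that each prefix contains none. If that does not hold, a variant that filters on the suffix and splits only at the last underscore may be more robust. The frame below is hypothetical and exists only to illustrate the idea.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical names: 'foo_bar' is unpaired, 'rate_in' is a prefix containing '_'
cols = ['foo_bar', 'rate_in_x', 'b_x', 'rate_in_y', 'b_y']
df = pd.DataFrame(rng.random((5, len(cols))), columns=cols)

# Select by suffix rather than by "contains an underscore"
paired = df.filter(regex=r'_[xy]$')
# Split only at the last '_': 'rate_in_x' -> ('rate_in', 'x')
paired.columns = paired.columns.str.rsplit('_', n=1, expand=True)

# Take each side of the pair; the remaining column labels are the prefixes
x = paired.xs('x', axis=1, level=1)
y = paired.xs('y', axis=1, level=1)

err = x.sub(y).div(y).add_suffix('_err')
out = df.assign(**err)
```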