在没有 for 循环的情况下从 2m 列计算 m 列

Question

我有一个数据框，其中一些列是名称配对的（对于以

_x

结尾的每一列，都有一个相应的以

_y

结尾的列），而其他列则不是。例如：

import pandas as pd
import numpy as np

colnames = [
    'foo', 'bar', 'baz',
    'a_x', 'b_x', 'c_x',
    'a_y', 'b_y', 'c_y',
]
rng = np.random.default_rng(0)
data = rng.random((20, len(colnames)))
df = pd.DataFrame(data, columns=colnames)

假设我有两个列表，其中包含所有以

_x

结尾的列名，以及所有以

_y

结尾的列名（构建这样的列表很容易），长度相同

（记住对于每个

_x

列只有一个对应的

_y

列）。我想用一个简单的公式创建

新列：

df['a_err'] = (df['a_x'] - df['a_y']) / df['a_y']

当然，无需对列名称进行硬编码。使用

for

循环很容易做到这一点，但我想知道是否可以在没有循环的情况下做同样的事情，希望它会更快（真正的数据帧比这个小例子大得多）。

Answer 1

您可以将

groupby_apply

与自定义功能一起使用：

func = lambda sr: (sr.iloc[:, 0] - sr.iloc[:, 1]) / sr.iloc[:, 1]

# r'' stands for raw strings like f'' for formatted strings
# Keep columns that end ($) with '_x' or '_y' (_[xy])
# Groupby column prefix (a_x -> a, b_x -> b, ..., c_y -> c)
# Apply your formula on each group (a_x, a_y), (b_x, b_y), (c_x, c_y)
err = (df.filter(regex=r'_[xy]$')
         .groupby(lambda x: x.split('_')[0], axis=1)
         .apply(func).add_suffix('_err'))

# Append error columns to your original dataframe
df = pd.concat([df, err], axis=1)

输出：

>>> df
         foo       bar       baz       a_x       b_x       c_x       a_y       b_y       c_y      a_err     b_err      c_err
0   0.636962  0.269787  0.040974  0.016528  0.813270  0.912756  0.606636  0.729497  0.543625  -0.972755  0.114838   0.679017
1   0.935072  0.815854  0.002739  0.857404  0.033586  0.729655  0.175656  0.863179  0.541461   3.881166 -0.961091   0.347567
2   0.299712  0.422687  0.028320  0.124283  0.670624  0.647190  0.615385  0.383678  0.997210  -0.798040  0.747885  -0.351000
3   0.980835  0.685542  0.650459  0.688447  0.388921  0.135097  0.721488  0.525354  0.310242  -0.045796 -0.259697  -0.564545
4   0.485835  0.889488  0.934044  0.357795  0.571530  0.321869  0.594300  0.337911  0.391619  -0.397955  0.691361  -0.178106
5   0.890274  0.227158  0.623187  0.084015  0.832644  0.787098  0.239369  0.876484  0.058568  -0.649014 -0.050018  12.439042
6   0.336117  0.150279  0.450339  0.796324  0.230642  0.052021  0.404552  0.198513  0.090753   0.968411  0.161849  -0.426782
7   0.580332  0.298696  0.671995  0.199515  0.942113  0.365110  0.105495  0.629108  0.927155   0.891226  0.497538  -0.606204
8   0.440377  0.954590  0.499896  0.425229  0.620213  0.995097  0.948944  0.460045  0.757729  -0.551893  0.348158   0.313262
9   0.497423  0.529312  0.785786  0.414656  0.734484  0.711143  0.932060  0.114933  0.729015  -0.555119  5.390557  -0.024516
10  0.927424  0.967926  0.014706  0.863640  0.981195  0.957210  0.148764  0.972629  0.889936   4.805437  0.008807   0.075595
11  0.822374  0.479988  0.232373  0.801881  0.923530  0.266130  0.538934  0.442753  0.931017   0.487900  1.085882  -0.714151
12  0.040511  0.732006  0.614373  0.028365  0.719220  0.015992  0.757951  0.512759  0.929104  -0.962576  0.402648  -0.982788
13  0.066082  0.841317  0.066690  0.344310  0.430299  0.966062  0.562232  0.258865  0.241676  -0.387601  0.662254   2.997349
14  0.888118  0.225869  0.124555  0.288331  0.586123  0.554091  0.809711  0.560476  0.288421  -0.643909  0.045760   0.921116
15  0.412896  0.818121  0.626506  0.959078  0.369404  0.552612  0.593924  0.848291  0.145474   0.614815 -0.564531   2.798708
16  0.406510  0.909959  0.043067  0.822706  0.415384  0.829804  0.009955  0.365046  0.078630  81.646166  0.137895   9.553270
17  0.652615  0.273849  0.702652  0.943801  0.126817  0.864778  0.059464  0.380771  0.429774  14.871772 -0.666946   1.012170
18  0.488850  0.976462  0.775691  0.308857  0.269837  0.863120  0.881307  0.510707  0.344296  -0.649546 -0.471640   1.506915
19  0.994917  0.315944  0.182712  0.880098  0.812335  0.667889  0.958414  0.925715  0.748249  -0.081714 -0.122477  -0.107396

您还可以使用

filter

拆分列：

# Python > 3.8, walrus operator
err = (df.filter(regex='_x$').values - (y := df.filter(regex='_y$'))) / y
err.columns = err.columns.str.split('_').str[0] + '_err'
df = pd.concat([df, err], axis=1)

Answer 2

如果您已经有了列表，则可以使用以下内容：

q = ['a', 'b', 'c']
x = ['a_x', 'b_x', 'c_x']
y = ['a_y', 'b_y', 'c_y']
e = ['a_err', 'b_err', 'c_err']


df[e] = (df[x].rename(columns=dict(zip(x,q))) - df[y].rename(columns=dict(zip(y,q)))) / df[x].rename(columns=dict(zip(x,q)))

可能有更有效的方法，因为这每次都会重命名列，以便减法和乘法与索引匹配。

Answer 3

由于

和

之间存在平衡，您可以将 for 循环限制在列上，并且仍然保持高性能。下面的解决方案使用 MultiIndex，并尽可能将大量处理传递给 Pandas：

temp = df.filter(like = '_')
temp.columns = temp.columns.str.split("_", expand = True)
temp = temp.swaplevel(axis='columns')
temp = temp.x.sub(temp.y).div(temp.y).add_suffix('_err')
df.assign(**temp)

         foo       bar       baz       a_x       b_x       c_x       a_y       b_y       c_y      a_err     b_err      c_err
0   0.636962  0.269787  0.040974  0.016528  0.813270  0.912756  0.606636  0.729497  0.543625  -0.972755  0.114838   0.679017
1   0.935072  0.815854  0.002739  0.857404  0.033586  0.729655  0.175656  0.863179  0.541461   3.881166 -0.961091   0.347567
2   0.299712  0.422687  0.028320  0.124283  0.670624  0.647190  0.615385  0.383678  0.997210  -0.798040  0.747885  -0.351000
3   0.980835  0.685542  0.650459  0.688447  0.388921  0.135097  0.721488  0.525354  0.310242  -0.045796 -0.259697  -0.564545
4   0.485835  0.889488  0.934044  0.357795  0.571530  0.321869  0.594300  0.337911  0.391619  -0.397955  0.691361  -0.178106
5   0.890274  0.227158  0.623187  0.084015  0.832644  0.787098  0.239369  0.876484  0.058568  -0.649014 -0.050018  12.439042
6   0.336117  0.150279  0.450339  0.796324  0.230642  0.052021  0.404552  0.198513  0.090753   0.968411  0.161849  -0.426782
7   0.580332  0.298696  0.671995  0.199515  0.942113  0.365110  0.105495  0.629108  0.927155   0.891226  0.497538  -0.606204
8   0.440377  0.954590  0.499896  0.425229  0.620213  0.995097  0.948944  0.460045  0.757729  -0.551893  0.348158   0.313262
9   0.497423  0.529312  0.785786  0.414656  0.734484  0.711143  0.932060  0.114933  0.729015  -0.555119  5.390557  -0.024516
10  0.927424  0.967926  0.014706  0.863640  0.981195  0.957210  0.148764  0.972629  0.889936   4.805437  0.008807   0.075595
11  0.822374  0.479988  0.232373  0.801881  0.923530  0.266130  0.538934  0.442753  0.931017   0.487900  1.085882  -0.714151
12  0.040511  0.732006  0.614373  0.028365  0.719220  0.015992  0.757951  0.512759  0.929104  -0.962576  0.402648  -0.982788
13  0.066082  0.841317  0.066690  0.344310  0.430299  0.966062  0.562232  0.258865  0.241676  -0.387601  0.662254   2.997349
14  0.888118  0.225869  0.124555  0.288331  0.586123  0.554091  0.809711  0.560476  0.288421  -0.643909  0.045760   0.921116
15  0.412896  0.818121  0.626506  0.959078  0.369404  0.552612  0.593924  0.848291  0.145474   0.614815 -0.564531   2.798708
16  0.406510  0.909959  0.043067  0.822706  0.415384  0.829804  0.009955  0.365046  0.078630  81.646166  0.137895   9.553270
17  0.652615  0.273849  0.702652  0.943801  0.126817  0.864778  0.059464  0.380771  0.429774  14.871772 -0.666946   1.012170
18  0.488850  0.976462  0.775691  0.308857  0.269837  0.863120  0.881307  0.510707  0.344296  -0.649546 -0.471640   1.506915
19  0.994917  0.315944  0.182712  0.880098  0.812335  0.667889  0.958414  0.925715  0.748249  -0.081714 -0.122477  -0.107396

在没有 for 循环的情况下从 2m 列计算 m 列

问题描述投票：0回答：3

3个回答

最新问题

在没有 for 循环的情况下从 2m 列计算 m 列

问题描述 投票：0回答：3

3个回答

最新问题

问题描述投票：0回答：3