Running this code:
import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
    'key': ['key1', 'key2', 'key3', 'key1', 'key2'],
    'colA': ['value1A', 'value2A', 'value3A', 'value4A', 'value5A'],
    'colB': ['value1B', 'value2B', 'value3B', 'value4B', 'value5B'],
    'colC': ['value1C', 'value2C', 'value3C', 'value4C', 'value5C'],
    'colD': ['value1D', 'value2D', 'value3D', 'value4D', 'value5D']
}
df = pd.DataFrame(data)
# Create conditions
conditions = [df['key'] == 'key1',
              df['key'] == 'key2',
              df['key'] == 'key3']
# Apply conditions and choices to the respective columns
df['colA'] = np.select([df['key'] == 'key1'], [df['colA']], default='NA')
df['colD'] = np.select([df['key'] == 'key1'], [df['colD']], default='NA')
df['colB'] = np.select([df['key'] == 'key2'], [df['colB']], default='NA')
df['colC'] = np.select([df['key'] == 'key3'], [df['colC']], default='NA')
# Display the resulting DataFrame
print(df)
produces this:
key colA colB colC colD
0 key1 value1A NA NA value1D
1 key2 NA value2B NA NA
2 key3 NA NA value3C NA
3 key1 value4A NA NA value4D
4 key2 NA value5B NA NA
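Is there a way to rewrite this efficiently so that I don't have to call numpy.select for every column I want to map? The 'key' column essentially determines which columns contain valid data for a given row. If the data is invalid, I want to mark it as NA; if it is valid, I want to keep the value in that row. In my real dataset, the 'key' column controls how I map multiple columns, just as key1 controls the mapping to colA and colD above. I would prefer numpy or another vectorized approach because, as I understand it, it is faster than alternatives such as map, but I'm open to any and all ideas.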
You can list the valid columns for each key in a dictionary, then melt and merge to do the filtering:
d = {'key1': ['colA', 'colD'],
     'key2': ['colB'],
     'key3': ['colC'],
     }
(df.reset_index()
   .melt(['index', 'key'])
   .merge(pd.Series(d).explode().rename_axis('key')
            .reset_index(name='variable'))
   .set_index(['index', 'key', 'variable'])['value']
   .unstack('variable', fill_value='NA')
   .reset_index('key').rename_axis(index=None, columns=None)
)
Output:
key colA colB colC colD
0 key1 value1A NA NA value1D
1 key2 NA value2B NA NA
2 key3 NA NA value3C NA
3 key1 value4A NA NA value4D
4 key2 NA value5B NA NA
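Since the question asks for a NumPy-style vectorized approach, here is one possible sketch of an alternative that builds a boolean mask from the same dictionary and applies it in a single `DataFrame.where` call. It is not part of the melt/merge approach above. The names `value_cols`, `lookup`, `codes`, `mask`, and `out` are made up for illustration, and the sketch assumes the columns to be masked are exactly the ones named in `d`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key': ['key1', 'key2', 'key3', 'key1', 'key2'],
    'colA': ['value1A', 'value2A', 'value3A', 'value4A', 'value5A'],
    'colB': ['value1B', 'value2B', 'value3B', 'value4B', 'value5B'],
    'colC': ['value1C', 'value2C', 'value3C', 'value4C', 'value5C'],
    'colD': ['value1D', 'value2D', 'value3D', 'value4D', 'value5D'],
})

# Same mapping as above: which columns hold valid data for each key
d = {'key1': ['colA', 'colD'],
     'key2': ['colB'],
     'key3': ['colC']}

# Columns to mask (assumption: every non-key column is covered by d)
value_cols = ['colA', 'colB', 'colC', 'colD']

# One row per key, one column per value column; True where that key keeps the column
lookup = np.array([[col in cols for col in value_cols] for cols in d.values()])
# Extra all-False row so that unknown keys (code -1) keep nothing
lookup = np.vstack([lookup, np.zeros(len(value_cols), dtype=bool)])

# Integer code per row: position of the key in d, or -1 if the key is not in d
codes = pd.Categorical(df['key'], categories=list(d)).codes

# Row-aligned boolean mask with shape (len(df), len(value_cols))
mask = lookup[codes]

out = df.copy()
# Keep values where the mask is True, write 'NA' everywhere else
out[value_cols] = df[value_cols].where(mask, 'NA')
print(out)
```

The lookup table has only one row per key, so the per-row work reduces to a single integer-indexing step, whatever the number of columns. I have not benchmarked it against the melt/merge version.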