I have a dataset with the columns ["Uni", "Region", "Profession", "Level_Edu", "Financial_Base", "Learning_Time", "GENDER"]. All values in ["Uni", "Region", "Profession"] are filled, while ["Level_Edu", "Financial_Base", "Learning_Time", "GENDER"] always contain NA.
For each column with NA there are several possible values:
Level_Edu = ['undergrad', 'grad', 'PhD']
Financial_Base = ['personal', 'grant']
Learning_Time = ["morning", "day", "evening"]
GENDER = ['Male', 'Female']
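For each observation in the initial data, I want to generate all possible combinations of ["Level_Edu", "Financial_Base", "Learning_Time", "GENDER"]. Each initial observation would then be represented by 36 new observations (from the combinatorics formula N1 * N2 * N3 * N4, where Ni is the length of the i-th vector of possible column values, here 3 * 2 * 3 * 2).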
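As a quick sanity check of that count, the product of the list lengths can be computed directly. This is only a small sketch; the dictionary name `options` is illustrative and not part of the original data.

```python
from math import prod

# Possible values for each column that is currently NA
# (the name `options` is hypothetical, chosen for this sketch).
options = {
    "Level_Edu": ["undergrad", "grad", "PhD"],
    "Financial_Base": ["personal", "grant"],
    "Learning_Time": ["morning", "day", "evening"],
    "GENDER": ["Male", "Female"],
}

# N1 * N2 * N3 * N4: multiply the lengths of the value lists.
n_combinations = prod(len(values) for values in options.values())
print(n_combinations)
```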
Here is Python code that recreates two initial observations and an approximation of the result I want (showing only 3 of the 36 combinations for each initial observation).
import pandas as pd
import numpy as np
sample_data_as_is = pd.DataFrame([["X1", "Y1", "Z1", np.nan, np.nan, np.nan, np.nan], ["X2", "Y2", "Z2", np.nan, np.nan, np.nan, np.nan]], columns=["Uni", 'Region', "Profession", "Level_Edu", 'Financial_Base', 'Learning_Time', 'GENDER'])
sample_data_to_be = pd.DataFrame([["X1", "Y1", "Z1", "undergrad", "personal", "morning", 'Male'], ["X2", "Y2", "Z2", "undergrad", "personal", "morning", 'Male'],
["X1", "Y1", "Z1", "grad", "personal", "morning", 'Male'], ["X2", "Y2", "Z2", "grad", "personal", "morning", 'Male'],
["X1", "Y1", "Z1", "undergrad", "grant", "morning", 'Male'], ["X2", "Y2", "Z2", "undergrad", "grant", "morning", 'Male']], columns=["Uni", 'Region', "Profession", "Level_Edu", 'Financial_Base', 'Learning_Time', 'GENDER'])
You can use itertools.product together with a cross merge:
from itertools import product
data = {'Level_Edu': ['undergrad', 'grad', 'PhD'],
'Financial_Base': ['personal', 'grant'],
'Learning_Time': ['morning', 'day', 'evening'],
'GENDER': ['Male', 'Female']}
out = (sample_data_as_is[['Uni', 'Region', 'Profession']]
.merge(pd.DataFrame(product(*data.values()), columns=data.keys()), how='cross')
)
Output:
Uni Region Profession Level_Edu Financial_Base Learning_Time GENDER
0 X1 Y1 Z1 undergrad personal morning Male
1 X1 Y1 Z1 undergrad personal morning Female
2 X1 Y1 Z1 undergrad personal day Male
3 X1 Y1 Z1 undergrad personal day Female
4 X1 Y1 Z1 undergrad personal evening Male
.. .. ... ... ... ... ... ...
67 X2 Y2 Z2 PhD grant morning Female
68 X2 Y2 Z2 PhD grant day Male
69 X2 Y2 Z2 PhD grant day Female
70 X2 Y2 Z2 PhD grant evening Male
71 X2 Y2 Z2 PhD grant evening Female
[72 rows x 7 columns]
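Note that `how='cross'` is only available in pandas 1.2 and later. On older versions, one possible workaround is to emulate the cross join with a constant helper key. The sketch below assumes the same sample data as above; the names `base`, `combos`, `out_fallback`, and the helper column `_key` are illustrative choices, not anything from the original post.

```python
from itertools import product

import pandas as pd

# The three fully populated columns from the sample data.
base = pd.DataFrame(
    [["X1", "Y1", "Z1"], ["X2", "Y2", "Z2"]],
    columns=["Uni", "Region", "Profession"],
)

data = {
    "Level_Edu": ["undergrad", "grad", "PhD"],
    "Financial_Base": ["personal", "grant"],
    "Learning_Time": ["morning", "day", "evening"],
    "GENDER": ["Male", "Female"],
}

# One row per combination of the possible values (36 rows).
combos = pd.DataFrame(list(product(*data.values())), columns=list(data.keys()))

# Emulate a cross join: give both frames the same constant key
# ("_key" is a hypothetical helper column), merge on it, then drop it.
out_fallback = (
    base.assign(_key=1)
    .merge(combos.assign(_key=1), on="_key")
    .drop(columns="_key")
)

print(out_fallback.shape)
```

Every row of `base` is paired with every row of `combos`, so the result should have the same 72 rows and 7 columns as the cross merge.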