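I have several datasets, and for each one I want to create a fake dataset that is representative of it. I need to do this dynamically, based only on the data types (numeric, object).
Here is an example: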
import pandas as pd
import random
# Create a dictionary with columns as lists
data = {
    'ObjectColumn1': [f'Object1_{i}' for i in range(1, 11)],
    'ObjectColumn2': [f'Object2_{i}' for i in range(1, 11)],
    'ObjectColumn3': [f'Object3_{i}' for i in range(1, 11)],
    'NumericColumn1': [random.randint(1, 100) for _ in range(10)],
    'NumericColumn2': [random.uniform(1.0, 10.0) for _ in range(10)],
    'NumericColumn3': [random.randint(1000, 2000) for _ in range(10)],
    'NumericColumn4': [random.uniform(10.0, 20.0) for _ in range(10)]
}
# Create the DataFrame
df = pd.DataFrame(data)
Suppose the dataset above has m (=3) object columns and n (=4) numeric columns, and x (=10) rows. I want to create a fake dataset with N (=10,000) rows such that:
If N = 4, fake_data should look like this:
IIUC, you can use:
import numpy as np

def fakeit(df, N, s=1):
    np.random.seed(s)  # seed NumPy's global RNG for reproducibility
    rans = ["ran1", "ran2", "ran3"]  # values for the extra object column
    objs = df.select_dtypes("object")
    nums = df.select_dtypes("number")
    # Per numeric column: {"low": min, "high": median}, passed on to np.random.uniform
    mim = nums.describe().loc[["min", "50%"]].set_axis(["low", "high"]).to_dict()
    # Resample whole rows with replacement (n=N gives exactly N rows), then add the extra column
    fake_objs = (objs.sample(n=N, replace=True, ignore_index=True)
                     .assign(ExtraObjectColumn=np.random.choice(rans, size=N)))
    fake_nums = pd.DataFrame({nc: np.random.uniform(size=N, **kw)
                              for nc, kw in mim.items()})
    return pd.concat([fake_objs, fake_nums], axis=1)  # or assign back to df

out = fakeit(df, 10_000)
Output:
print(out) # with `df` seeded at 0
# 4.19 ms ± 23.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
ObjectColumn1 ObjectColumn2 ... NumericColumn3 NumericColumn4
0 Object1_6 Object2_6 ... 1026.454938 15.001035
1 Object1_9 Object2_9 ... 1112.569853 12.358510
2 Object1_10 Object2_10 ... 1303.267506 13.656933
3 Object1_6 Object2_6 ... 1380.363296 12.951402
4 Object1_1 Object2_1 ... 1106.091309 14.267152
... ... ... ... ... ...
9995 Object1_10 Object2_10 ... 1188.281480 16.635908
9996 Object1_5 Object2_5 ... 1200.956953 13.403647
9997 Object1_6 Object2_6 ... 1271.411811 13.575587
9998 Object1_4 Object2_4 ... 1358.510617 16.788902
9999 Object1_6 Object2_6 ... 1387.046206 16.500843
[10000 rows x 8 columns]
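The function above seeds NumPy's global random state. As a minimal sketch of the same idea with a local generator, the variant below uses `np.random.default_rng`; the name `fake_with_rng` and the small demo frame are hypothetical and only serve as illustration. It also treats every non-numeric column as an object column, which is an assumption rather than something the question states.

```python
import numpy as np
import pandas as pd

def fake_with_rng(df, N, seed=1):
    # Hypothetical variant: a local Generator instead of np.random.seed
    rng = np.random.default_rng(seed)
    # Assumption: every non-numeric column is treated as an object column
    objs = df.select_dtypes(exclude="number")
    nums = df.select_dtypes("number")
    # Pick row positions with replacement so object values stay aligned per row
    idx = rng.integers(0, len(df), size=N)
    fake_objs = objs.iloc[idx].reset_index(drop=True)
    # Draw each numeric column uniformly between its minimum and its median
    fake_nums = pd.DataFrame({
        col: rng.uniform(nums[col].min(), nums[col].median(), size=N)
        for col in nums.columns
    })
    return pd.concat([fake_objs, fake_nums], axis=1)

# Small demo frame (illustrative data only)
demo = pd.DataFrame({
    "ObjectColumn1": [f"Object1_{i}" for i in range(1, 11)],
    "NumericColumn1": np.arange(1, 11),
    "NumericColumn2": np.linspace(1.0, 10.0, 10),
})
fake = fake_with_rng(demo, 1000, seed=0)
fake_again = fake_with_rng(demo, 1000, seed=0)
```

Because the generator is created inside the function, calling it twice with the same seed should give identical frames without touching any global state.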
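IIUC, something like this should do what you want. It splits the input dataframe into numeric columns and the remaining columns, draws the random samples described in the question from each, and finally adds the extra column as a random sample from the supplied list: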
import numpy as np

def make_fake_data(df, N, extra):
    # Object columns: sample each column independently, with replacement
    df_obj = df.loc[:, df.dtypes == 'object']
    obj_out = pd.DataFrame({col: np.random.choice(df_obj[col], N) for col in df_obj.columns})
    # Remaining (numeric) columns: uniform draws between each column's min and median
    df_num = df.loc[:, df.dtypes != 'object']
    num_out = pd.DataFrame({col: np.random.uniform(np.min(df_num[col]), np.median(df_num[col]), N) for col in df_num.columns})
    # Extra column: random picks from the supplied list
    ext_out = pd.DataFrame({'ExtraObjectColumn': random.choices(extra, k=N)})
    return pd.concat([obj_out, num_out, ext_out], axis=1)
Example usage:
make_fake_data(df, 20, ['a', 'b', 'c', 'd'])
Example output:
ObjectColumn1 ObjectColumn2 ObjectColumn3 ... NumericColumn3 NumericColumn4 ExtraObjectColumn
0 Object1_4 Object2_1 Object3_4 ... 1322.269370 14.502498 d
1 Object1_6 Object2_5 Object3_5 ... 1314.941227 12.478253 c
2 Object1_6 Object2_7 Object3_7 ... 1418.271732 11.214247 a
3 Object1_4 Object2_9 Object3_9 ... 1269.408303 11.404303 c
4 Object1_3 Object2_6 Object3_4 ... 1426.038132 14.251836 a
5 Object1_1 Object2_2 Object3_1 ... 1212.806903 14.750310 c
6 Object1_10 Object2_7 Object3_1 ... 1294.254746 10.692256 d
7 Object1_1 Object2_7 Object3_3 ... 1232.854020 10.438323 c
8 Object1_5 Object2_5 Object3_7 ... 1205.779688 14.763409 c
9 Object1_7 Object2_6 Object3_2 ... 1287.248660 10.384493 b
10 Object1_4 Object2_2 Object3_1 ... 1237.738855 14.054841 b
11 Object1_7 Object2_3 Object3_5 ... 1176.494651 12.869827 c
12 Object1_5 Object2_1 Object3_10 ... 1101.036149 10.978762 b
13 Object1_5 Object2_6 Object3_7 ... 1430.060873 13.473017 c
14 Object1_1 Object2_1 Object3_7 ... 1416.556459 12.281628 c
15 Object1_3 Object2_8 Object3_3 ... 1190.239080 15.257389 b
16 Object1_6 Object2_9 Object3_5 ... 1101.712808 10.551654 b
17 Object1_1 Object2_10 Object3_4 ... 1453.687960 15.070104 b
18 Object1_6 Object2_2 Object3_2 ... 1139.413534 11.744450 b
19 Object1_7 Object2_7 Object3_2 ... 1080.682206 13.962322 b
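Note that the two outputs in this thread differ in one respect: in the first, the object values in a row share the same suffix (e.g. Object1_6 with Object2_6), while here each object column is sampled independently (e.g. Object1_4 with Object2_1). The sketch below illustrates that difference on a small, made-up two-column frame; the helper `suffixes_match` is hypothetical and exists only for this comparison.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
objs = pd.DataFrame({
    "ObjectColumn1": [f"Object1_{i}" for i in range(1, 11)],
    "ObjectColumn2": [f"Object2_{i}" for i in range(1, 11)],
})

# Row-wise: whole rows are resampled, so values from one source row stay together
by_row = objs.sample(n=1000, replace=True, ignore_index=True, random_state=0)

# Column-wise: each column is drawn independently, so combinations get mixed
by_col = pd.DataFrame({
    col: rng.choice(objs[col].to_numpy(), size=1000) for col in objs.columns
})

def suffixes_match(frame):
    # Hypothetical helper: True where both columns carry the same trailing number
    a = frame["ObjectColumn1"].str.split("_").str[1]
    b = frame["ObjectColumn2"].str.split("_").str[1]
    return a == b

print(suffixes_match(by_row).all())   # row-wise sampling keeps pairs intact
print(suffixes_match(by_col).mean())  # share of rows that still match by chance
```

Which behaviour is preferable depends on whether the combinations of object values in a row are meant to be preserved; the question as quoted does not settle this.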