I've run into a problem with sklearn's PCA under multiprocessing. Specifically, the PCA reconstruction error varies dramatically with the number of processes set in Pool. For example, Pool(processes=4) yields a small error (np.abs(tmp_matrix-X_train).max() < 1e-2), but Pool(processes=5) or higher produces a large error, with np.abs(tmp_matrix-X_train).max() around 10 per column on average. I'm using the Intel sklearnex patch.
I tested various combinations and observed the following pattern:

Small error:
- 20 CPUs + processes=1
- 80 CPUs + processes=1
- 80 CPUs + processes=4
- 120 CPUs + processes=5

Large error:
- 80 CPUs + processes=5
- 100 CPUs + processes=5
- 120 CPUs + processes=5

(Yes, 120 CPUs + processes=5 appears in both lists: it is unstable.) Here is the relevant part of my code:
from sklearnex import patch_sklearn
patch_sklearn()
import numpy as np
from sklearn.decomposition import PCA
from functools import partial
from multiprocessing import Pool

def config_selection_single(df_entry: tuple, _some_arguments_including_data_object):
    # some pre-processing code
    for some_iteration_condition:
        # some data processing and transformation to keep data non-NaN and within [-1e20, 1e20]
        for another_iteration_condition:
            z_mean = X[train_cond][:].mean()
            z_std = X[train_cond][:].std() + 1e-10
            X_train = (X[train_cond][:] - z_mean) / z_std  # X_train has shape ~ 2e4 x 50
            pca = PCA(n_components=20, svd_solver='full')
            p_model = pca.fit(X_train)
            Q = p_model.transform(X_train)
            tmp_matrix = p_model.inverse_transform(Q)
            # diagnostic: the scores should match the projection of the
            # (already standardized) training data onto the principal axes
            if not np.allclose(Q, X_train.dot(p_model.components_.transpose())):
                print("reconstruction error is huge!")
                print(np.abs(tmp_matrix - X_train).max())

# bind the extra argument by keyword so pool.map supplies df_entry
config_selection_partial = partial(
    config_selection_single,
    _some_arguments_including_data_object=_some_arguments_including_data_object,
)
with Pool(processes=4) as pool:  # 4 is good, 5 and 6 are bad
    pool.map(config_selection_partial, list(my_df.items()))
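For reference, with stock scikit-learn the fit → transform → inverse_transform round trip is deterministic, and the scores satisfy the exact identity Q = (X - pca.mean_) @ pca.components_.T (transform centers by the fitted mean before projecting, which my check above skips since X_train is already standardized). A minimal, self-contained sanity check on synthetic data (shapes are a scaled-down stand-in for my ~2e4 x 50 case):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 50))                # stand-in for X_train
X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-10)  # standardize columns

pca = PCA(n_components=20, svd_solver='full')
Q = pca.fit_transform(X)

# transform() centers by pca.mean_ before projecting, so this holds exactly:
assert np.allclose(Q, (X - pca.mean_) @ pca.components_.T)

# with n_components < rank, reconstruction is lossy but deterministic:
recon = pca.inverse_transform(Q)
err = np.abs(recon - X).max()
print(err)  # truncation error; identical across runs and process counts
```

If a patched build is behaving correctly, this error should not depend on how many worker processes run the same computation.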
Unfortunately, I haven't been able to produce a small demo dataset that reproduces the issue.
Any insight into why the number of processes would affect PCA accuracy?
[Answering my own question] It turned out to be a bug in scikit-learn-intelex==2023.1.1. After updating to scikit-learn-intelex==2024.0.1, the results look fine.