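I am preprocessing data before applying a sklearn model, but I cannot work out why an error keeps occurring. When I run the code with each individual column index in ColumnTransformer, it works fine for every variable. However, when I apply it to multiple columns at once, an error occurs. Here is the code that reproduces the problem: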
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# Number of samples
num_samples = 1000
# Generating random data
data = {
    'Feature_1': np.random.rand(num_samples),
    'Feature_2': np.random.rand(num_samples),
    'Feature_3': np.random.choice(['A', 'B', 'C'], num_samples),
    'Feature_4': np.random.choice(['X', 'Y', 'Z'], num_samples),
    'Feature_5': np.random.choice(['M', 'N', 'O'], num_samples),  # Non-numeric values intentionally introduced
    'Feature_6': np.random.choice(['P', 'Q', 'R'], num_samples),  # Non-numeric values intentionally introduced
    'Feature_7': np.random.choice(['D', 'E', 'F'], num_samples),
    'Feature_8': np.random.choice(['G', 'H', 'I'], num_samples),
    'Feature_9': np.random.choice(['S', 'T', 'U'], num_samples),
    'Feature_10': np.random.rand(num_samples),
    'Feature_11': np.random.rand(num_samples),
    'Feature_12': np.random.choice(['V', 'W', 'X'], num_samples),
    'Feature_13': np.random.choice(['Y', 'Z'], num_samples),
    'Feature_14': np.random.choice(['P', 'Q', 'R'], num_samples),
    'Feature_15': np.random.choice(['A', 'B', 'C', 'D'], num_samples),
    'Target': np.random.choice([0, 1], num_samples)
}
categorical_indices = [3, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15]
d = pd.DataFrame(data)
X = d.values
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), categorical_indices)],
    remainder='passthrough'
)
X_1 = np.array(ct.fit_transform(X))
Error:
Traceback (most recent call last):
File "/Users/jaeyoungkim/opt/anaconda3/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py", line 588, in _hstack
converted_Xs = [check_array(X,
File "/Users/jaeyoungkim/opt/anaconda3/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py", line 588, in <listcomp>
converted_Xs = [check_array(X,
File "/Users/jaeyoungkim/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "/Users/jaeyoungkim/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 673, in check_array
array = np.asarray(array, order=order, dtype=dtype)
File "/Users/jaeyoungkim/opt/anaconda3/lib/python3.9/site-packages/numpy/core/_asarray.py", line 102, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: 'C'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/var/folders/25/5mycjlz1013629wcstsb_mwh0000gn/T/ipykernel_24019/2314645552.py", line 10, in <module>
X_1 = np.array(ct.fit_transform(X))
File "/Users/jaeyoungkim/opt/anaconda3/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py", line 529, in fit_transform
return self._hstack(list(Xs))
File "/Users/jaeyoungkim/opt/anaconda3/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py", line 593, in _hstack
raise ValueError(
ValueError: For a sparse output, all columns should be a numeric or convertible to a numeric.
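Your categorical_indices are off by one. Column indices start at 0, so Feature_3 is column 2, not column 3.
With the original list, string columns such as Feature_3 (index 2) and Feature_12 (index 11) are left in the remainder and passed through unencoded, which is why the sparse output cannot be converted to numeric. Index 15 also points at Target rather than Feature_15. Encoding a single column most likely worked because the result was dense enough to be returned as a dense array, which skips this numeric check. Shift every index down by one: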
categorical_indices = [3, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15]
categorical_indices = [x - 1 for x in categorical_indices]
# categorical_indices = [2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 14]
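To avoid hand-counting positions altogether, the indices can be derived from the DataFrame's dtypes. The following is a minimal sketch, not the asker's exact setup: it uses a smaller made-up frame with the same naming scheme, and it assumes that Target should be dropped before encoding and that every non-numeric column is categorical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Small illustrative frame (hypothetical data, same naming scheme as the question)
rng = np.random.default_rng(0)
n = 200
d = pd.DataFrame({
    'Feature_1': rng.random(n),
    'Feature_2': rng.choice(['A', 'B', 'C'], n),
    'Feature_3': rng.choice(['X', 'Y', 'Z'], n),
    'Feature_4': rng.random(n),
    'Target': rng.choice([0, 1], n),
})

# Assumption: the target is not a feature, so it is excluded before encoding
features = d.drop(columns='Target')

# Treat every non-numeric column as categorical and look up its 0-based position
categorical_indices = [
    features.columns.get_loc(col)
    for col in features.columns
    if not pd.api.types.is_numeric_dtype(features[col])
]

ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), categorical_indices)],
    remainder='passthrough',
)
X_1 = ct.fit_transform(features.values)
```

Because the positions come from the columns themselves, adding or reordering features does not silently shift the list.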