ValueError: For a sparse output, all columns should be a numeric or convertible to a numeric

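I am preprocessing data before applying an sklearn model, but I cannot work out why this error keeps occurring. When I run the code with each individual column index passed to ColumnTransformer, it works fine for every variable. However, when I apply it to several columns at once, the error occurs. My questions: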

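  1. What goes wrong when I run the columns together?
  2. How can I use code to identify which column causes the error? (So far I have checked by changing the arguments manually.)
  3. When a single column causes this error, how do I fix it?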

Data and example code

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Number of samples
num_samples = 1000

# Generating random data
data = {
    'Feature_1': np.random.rand(num_samples),
    'Feature_2': np.random.rand(num_samples),
    'Feature_3': np.random.choice(['A', 'B', 'C'], num_samples),
    'Feature_4': np.random.choice(['X', 'Y', 'Z'], num_samples),
    'Feature_5': np.random.choice(['M', 'N', 'O'], num_samples),  # Non-numeric values intentionally introduced
    'Feature_6': np.random.choice(['P', 'Q', 'R'], num_samples),  # Non-numeric values intentionally introduced
    'Feature_7': np.random.choice(['D', 'E', 'F'], num_samples),
    'Feature_8': np.random.choice(['G', 'H', 'I'], num_samples),
    'Feature_9': np.random.choice(['S', 'T', 'U'], num_samples),
    'Feature_10': np.random.rand(num_samples),
    'Feature_11': np.random.rand(num_samples),
    'Feature_12': np.random.choice(['V', 'W', 'X'], num_samples),
    'Feature_13': np.random.choice(['Y', 'Z'], num_samples),
    'Feature_14': np.random.choice(['P', 'Q', 'R'], num_samples),
    'Feature_15': np.random.choice(['A', 'B', 'C', 'D'], num_samples),
    'Target': np.random.choice([0, 1], num_samples)
}


categorical_indices = [3, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15]

d = pd.DataFrame(data)

X = d.values

ct = ColumnTransformer(
            transformers=[('encoder', OneHotEncoder(), categorical_indices)],
            remainder='passthrough'
                    )
X_1 = np.array(ct.fit_transform(X))

Error:

Traceback (most recent call last):

  File "/Users/jaeyoungkim/opt/anaconda3/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py", line 588, in _hstack
    converted_Xs = [check_array(X,

  File "/Users/jaeyoungkim/opt/anaconda3/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py", line 588, in <listcomp>
    converted_Xs = [check_array(X,

  File "/Users/jaeyoungkim/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)

  File "/Users/jaeyoungkim/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 673, in check_array
    array = np.asarray(array, order=order, dtype=dtype)

  File "/Users/jaeyoungkim/opt/anaconda3/lib/python3.9/site-packages/numpy/core/_asarray.py", line 102, in asarray
    return array(a, dtype, copy=False, order=order)

ValueError: could not convert string to float: 'C'


The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/var/folders/25/5mycjlz1013629wcstsb_mwh0000gn/T/ipykernel_24019/2314645552.py", line 10, in <module>
    X_1 = np.array(ct.fit_transform(X))

  File "/Users/jaeyoungkim/opt/anaconda3/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py", line 529, in fit_transform
    return self._hstack(list(Xs))

  File "/Users/jaeyoungkim/opt/anaconda3/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py", line 593, in _hstack
    raise ValueError(

ValueError: For a sparse output, all columns should be a numeric or convertible to a numeric.
python scikit-learn transformation one-hot-encoding categorical
1 Answer

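Your categorical_indices are off by one.

Indexing starts at 0, so Feature_3 is column 2, not column 3. With the list as written, the string columns Feature_3 (index 2) and Feature_12 (index 11) are left in the remainder and passed through unencoded, which is why the float conversion fails on 'C'. Meanwhile the numeric columns Feature_10 (index 9) and Target (index 15) are one-hot encoded by mistake.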
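To find the offending columns in code rather than by changing the arguments by hand, one option is to try casting every passthrough column to float and record the ones that fail. This is only a minimal sketch and not part of the original answer: non_numeric_remainder is a hypothetical helper name, and the data is a smaller random stand-in that assumes the same column layout as the question (Feature_1, Feature_2, Feature_10, Feature_11 and Target numeric, everything else strings).

```python
import numpy as np
import pandas as pd

# Stand-in data with the same column layout as the question (assumption:
# only Feature_1, Feature_2, Feature_10, Feature_11 and Target are numeric).
rng = np.random.default_rng(0)
n = 50
numeric_names = {'Feature_1', 'Feature_2', 'Feature_10', 'Feature_11'}
data = {}
for k in range(1, 16):
    name = f'Feature_{k}'
    if name in numeric_names:
        data[name] = rng.random(n)
    else:
        data[name] = rng.choice(['A', 'B', 'C'], n)
data['Target'] = rng.choice([0, 1], n)

d = pd.DataFrame(data)
X = d.values  # object-dtype array, as in the question

# The index list from the question (off by one).
categorical_indices = [3, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15]


def non_numeric_remainder(X, categorical_indices):
    """Hypothetical helper: indices of passthrough columns that cannot be cast to float."""
    bad = []
    for i in range(X.shape[1]):
        if i in categorical_indices:
            continue  # this column goes to the encoder, not the remainder
        try:
            X[:, i].astype(float)
        except (ValueError, TypeError):
            bad.append(i)
    return bad


bad = non_numeric_remainder(X, categorical_indices)
print(bad)                       # [2, 11]
print(list(d.columns[bad]))      # ['Feature_3', 'Feature_12']
```

With the question's index list this reports columns 2 and 11 (Feature_3 and Feature_12). Shifting every index down by one, as follows, resolves it.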

categorical_indices = [3, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15]
categorical_indices = [x - 1 for x in categorical_indices]
# categorical_indices = [2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 14]
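A way to avoid this kind of slip altogether is to derive the positions from the DataFrame's dtypes instead of typing them by hand. The following is a sketch under assumptions, not the original answer's method: it uses a small made-up four-column frame, and it treats every non-numeric column as categorical, which may not hold for real data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Small made-up frame (assumption): two numeric and two string columns.
rng = np.random.default_rng(0)
n = 100
d = pd.DataFrame({
    'Feature_1': rng.random(n),
    'Feature_2': rng.choice(['A', 'B', 'C'], n),
    'Feature_3': rng.random(n),
    'Feature_4': rng.choice(['X', 'Y'], n),
})

# Zero-based positions of every non-numeric column, taken from the dtypes.
categorical_indices = [
    i for i, dtype in enumerate(d.dtypes)
    if not pd.api.types.is_numeric_dtype(dtype)
]
print(categorical_indices)  # [1, 3]

ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), categorical_indices)],
    remainder='passthrough',
)
# The remainder now holds only numeric columns, so stacking succeeds.
X_1 = ct.fit_transform(d.values)
print(X_1.shape)
```

Because the indices come from the same DataFrame that produces X, they stay correct if columns are added or reordered later.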