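I am using scikit-learn for linear regression. I have tried various things, including reshaping the arrays, but the code still fails with the error shown below. The dataset is: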
    R&D Spend  Administration  Marketing Spend       State     Profit
0   165349.20       136897.80        471784.10    New York  192261.83
1   162597.70       151377.59        443898.53  California  191792.06
2   153441.51       101145.55        407934.54     Florida  191050.39
3   144372.41       118671.85        383199.62    New York  182901.99
4   142107.34        91391.77        366168.42     Florida  166187.94
5   131876.90        99814.71        362861.36    New York  156991.12
6   134615.46       147198.87        127716.82  California  156122.51
7   130298.13       145530.06        323876.68     Florida  155752.60
8   120542.52       148718.95        311613.29    New York  152211.77
9   123334.88       108679.17        304981.62  California  149759.96
10  101913.08       110594.11        229160.95     Florida  146121.95
11  100671.96        91790.61        249744.55  California  144259.40
12   93863.75       127320.38        249839.44     Florida  141585.52
13   91992.39       135495.07        252664.93  California  134307.35
14  119943.24       156547.42        256512.92     Florida  132602.65
I tried the code below:
import numpy as np
import pandas as pd
#Dataset
dataset=pd.read_csv(r'50_Startups.csv')
X=dataset.iloc[:,:-1]
y=dataset.iloc[:,-1]
#Encoding Categorical Data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
oHe=OneHotEncoder()
ct=ColumnTransformer(transformers=[('encoder',oHe,[3])],remainder='passthrough')
X = np.array(ct.fit_transform(X), dtype = np.str)
#Splitting into Training and Test sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=1)
#Training the Multiple Linear Regression
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(X_train,y_train)
The error is:
ValueError: dtype='numeric' is not compatible with arrays of bytes/strings.
Convert your data to numeric values explicitly instead.
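You should use a numeric dtype for X. With dtype = np.str every value, including the numeric columns, is stored as a string, which LinearRegression cannot fit: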
X = np.array(ct.fit_transform(X), dtype=np.float64)
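To illustrate why the dtype matters, here is a minimal sketch using NumPy only. The two rows are made-up stand-ins for one-hot columns followed by a numeric column, not the real transformer output, and the built-in str is used in place of np.str, which was only an alias for it and is no longer provided by newer NumPy releases.

```python
import numpy as np

# Hypothetical stand-in for the encoded features: two one-hot columns
# followed by one numeric column taken from the question's data.
rows = [[1.0, 0.0, 165349.20],
        [0.0, 1.0, 162597.70]]

# Forcing a string dtype turns every number into text.
as_text = np.array(rows, dtype=str)
print(as_text.dtype.kind)  # 'U' means a unicode string array

# A float dtype keeps the values numeric, which is what an estimator needs.
as_float = np.array(rows, dtype=np.float64)
print(as_float.dtype)

# An existing string array can be converted back explicitly,
# as the error message suggests.
recovered = as_text.astype(np.float64)
```

Checking X.dtype before calling fit is a quick way to catch this kind of problem.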
The regression then fits without error:
regressor.fit(X_train, y_train)
regressor.coef_
# array([ 2.21054629e+03, 2.33695693e+03, -4.54750322e+03, 8.05301486e-01,
# -9.57801181e-03, 1.17912512e-02])
regressor.intercept_
# 52971.480360281625
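For reference, the corrected steps can be put together in one self-contained sketch. It makes a few assumptions that are not in the original code: the first six rows of the dataset are inlined instead of reading 50_Startups.csv, the train/test split is skipped because six rows are too few to split meaningfully, and sparse_threshold=0 is passed to ColumnTransformer so the result is always a dense array. The fitted coefficients will therefore not match the values shown above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# First six rows of the question's dataset, inlined so no CSV file is needed.
dataset = pd.DataFrame({
    "R&D Spend": [165349.20, 162597.70, 153441.51, 144372.41, 142107.34, 131876.90],
    "Administration": [136897.80, 151377.59, 101145.55, 118671.85, 91391.77, 99814.71],
    "Marketing Spend": [471784.10, 443898.53, 407934.54, 383199.62, 366168.42, 362861.36],
    "State": ["New York", "California", "Florida", "New York", "Florida", "New York"],
    "Profit": [192261.83, 191792.06, 191050.39, 182901.99, 166187.94, 156991.12],
})
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]

# One-hot encode the State column (index 3) and pass the rest through.
# sparse_threshold=0 (an addition in this sketch) forces a dense result.
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [3])],
    remainder="passthrough",
    sparse_threshold=0,
)
X = np.array(ct.fit_transform(X), dtype=np.float64)

# Fit on all rows; with so little data this only demonstrates that fit() runs.
regressor = LinearRegression()
regressor.fit(X, y)
print(X.dtype, regressor.coef_.shape)
```

There is one coefficient per encoded column: three for the states and three for the numeric features.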
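Here, we first use LabelEncoder to convert the categorical variable to integer codes, then apply OneHotEncoder to the converted column. Finally, we drop the dtype argument from the np.array() call so that NumPy infers an appropriate numeric dtype from the transformed data. Dropping the dtype argument is the part that resolves the error; the LabelEncoder step is optional, because OneHotEncoder has accepted string categories directly since scikit-learn 0.20.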
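from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
le = LabelEncoder()
oHe = OneHotEncoder()
# Map the State strings to integer codes before one-hot encoding
X.iloc[:, 3] = le.fit_transform(X.iloc[:, 3])
ct = ColumnTransformer(transformers=[('encoder', oHe, [3])], remainder='passthrough')
# No dtype argument: NumPy infers a numeric dtype from the transformed data
X = np.array(ct.fit_transform(X))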
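As a rough check of this approach, the sketch below compares one-hot encoding the State strings directly with label-encoding them first. It assumes a three-row sample copied from the question's data; the helper one_hot_state and the sparse_threshold=0 setting are illustrative additions, not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Three feature rows copied from the question's dataset.
X = pd.DataFrame({
    "R&D Spend": [165349.20, 162597.70, 153441.51],
    "Administration": [136897.80, 151377.59, 101145.55],
    "Marketing Spend": [471784.10, 443898.53, 407934.54],
    "State": ["New York", "California", "Florida"],
})


def one_hot_state(frame):
    """Hypothetical helper: one-hot encode column 3, pass the rest through."""
    ct = ColumnTransformer(
        transformers=[("encoder", OneHotEncoder(), [3])],
        remainder="passthrough",
        sparse_threshold=0,  # always return a dense array
    )
    # No dtype argument, so NumPy keeps the numeric dtype of the output.
    return np.array(ct.fit_transform(frame))


# Route 1: hand the string column straight to OneHotEncoder.
X_direct = one_hot_state(X)

# Route 2: convert the strings to integer codes first, then one-hot encode.
X_labelled = X.copy()
X_labelled["State"] = LabelEncoder().fit_transform(X_labelled["State"])
X_via_labels = one_hot_state(X_labelled)

print(X_direct.dtype, np.array_equal(X_direct, X_via_labels))
```

Both encoders order the categories alphabetically, so the two routes should produce the same columns, and either array can be passed to regressor.fit.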