How can I improve (lower) the RMSE of my car price estimation?
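```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# `df` and `condition_mapping` are defined earlier in the full program (linked below).
# Average mileage of the 1000 highest-mileage cars whose condition maps to 2
new_condition_df = df[df['condition'].map(condition_mapping) == 2]
top_1000_highest_mileage = new_condition_df.nlargest(1000, 'mileage')['mileage']
average_top_1000_highest_mileage = top_1000_highest_mileage.mean()
# Filter the DataFrame for rows where condition is null or unspecified
# (.copy() avoids pandas' SettingWithCopyWarning on the assignments below)
null_condition_df = df[df['condition'].isnull() | (df['condition'] == '')].copy()
# Update 'condition' based on the mileage threshold
null_condition_df.loc[null_condition_df['mileage'] >= average_top_1000_highest_mileage, 'condition'] = 'CONDITION_USED'
null_condition_df.loc[null_condition_df['mileage'] < average_top_1000_highest_mileage, 'condition'] = 'CONDITION_NEW'
# Update the original DataFrame with the modified rows
df.update(null_condition_df)
```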
columns_with_null = ['color', 'vat_reclaimable', 'cubic_capacity', 'seller_country', 'feature']
df.dropna(subset=columns_with_null, inplace=True)
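# Assign the result back: fillna(inplace=True) on a selected column is chained assignment, which recent pandas deprecates
df['air_conditioning'] = df['air_conditioning'].fillna('AIRCONDITIONING_NONE')
df['parking_camera'] = df['parking_camera'].fillna('PARKINGCAMERA_NONE')
df['parking_sensors'] = df['parking_sensors'].fillna('PARKINGSENZOR_NONE')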
# Numeric columns plus the prefixed indicator columns, used to predict the missing 'drive' values
features = (
    ['mileage', 'cubic_capacity', 'power', 'year']
    + list(df.columns[df.columns.str.startswith('car_style_')])
    + list(df.columns[df.columns.str.startswith('transmission_')])
    + list(df.columns[df.columns.str.startswith('fuel_type_')])
)
train_data = df.dropna(subset=['drive'])  # Remove rows where 'drive' is missing
X_train, X_test, y_train, y_test = train_test_split(train_data[features], pd.get_dummies(train_data['drive']), test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
missing_data = df[df['drive'].isnull()]
X_missing = missing_data[features]
predicted_values = model.predict(X_missing)
# One-hot encode the known 'drive' values; rows where 'drive' is missing get all zeros for now
drive_dummies = pd.get_dummies(df['drive']).reindex(columns=y_train.columns, fill_value=0).astype(int)
# Threshold the regression outputs at 0.5 to turn them into 0/1 indicators
predicted_df_encoded = pd.DataFrame(predicted_values, columns=y_train.columns, index=missing_data.index)
predicted_df_encoded = (predicted_df_encoded > 0.5).astype(int)
# Fill in the predicted indicators only for the rows where 'drive' was missing
drive_dummies.loc[predicted_df_encoded.index] = predicted_df_encoded
# Replace the raw 'drive' column with its indicator columns
df = df.drop(columns=['drive']).join(drive_dummies)
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
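# Pass index=df.index so the indicator columns align with df's rows (the index has gaps after dropna)
df = df.join(pd.DataFrame(mlb.fit_transform(df['feature']), columns=mlb.classes_, index=df.index))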
df.fillna(0, inplace=True)
df = df.drop(columns=['feature'])
df_encoded = pd.get_dummies(df)
X_train, X_test, y_train, y_test = train_test_split(df_encoded.drop(columns=['price_with_vat_czk']), df_encoded['price_with_vat_czk'], test_size=0.25, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Full program: https://onecompiler.com/python/42brp9a4r Dataset: https://filetransfer.io/data-package/a0mFEfg4#link
My RMSE is about 64k.
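Any answer here will fall short: there is a lot to do, and I think that is exactly what a data scientist's job is about. You can start by asking yourself a few questions: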
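I think some exploratory data analysis of the data is missing. For example, a scatter plot may show you that linear regression is probably not the best approach. I just tried `sklearn.ensemble.RandomForestRegressor` with default hyperparameters: the resulting plot behaves better, the RMSE dropped to about 45K, and R2 went from 0.88 to 0.94 (see `sklearn.metrics.r2_score`).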
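To make that comparison easy to reproduce, here is a minimal sketch that fits both models on the same split and reports RMSE and R2. The data is synthetic (a made-up nonlinear price curve with hypothetical `year` and `mileage` columns), so the numbers will not match the 64K/45K figures. Swap in `df_encoded` and `price_with_vat_czk` for the real comparison. `random_state` is fixed only so the run is repeatable; everything else is left at the defaults.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the car data (assumption: this is NOT the real dataset).
rng = np.random.default_rng(42)
n = 2000
X = pd.DataFrame({
    "year": rng.integers(2000, 2024, size=n),
    "mileage": rng.uniform(0, 300_000, size=n),
})
# Price decays exponentially with age and mileage, which a straight line fits poorly.
age = 2024 - X["year"]
y = 900_000 * np.exp(-0.12 * age) * np.exp(-X["mileage"] / 200_000) + rng.normal(0, 10_000, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)


def evaluate(model):
    """Fit the model and return (RMSE, R2) on the held-out split."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    return rmse, r2_score(y_test, y_pred)


results = {
    "LinearRegression": evaluate(LinearRegression()),
    "RandomForestRegressor": evaluate(RandomForestRegressor(random_state=42)),
}
for name, (rmse, r2) in results.items():
    print(f"{name}: RMSE={rmse:,.0f}  R2={r2:.3f}")
```

Comparing both models on the same held-out split keeps the comparison fair; plotting `y_test` against each model's predictions is a quick way to see where the linear fit bends away from the data.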
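For example, you can compute the correlation between the features and the target to see what is going on with the variables before running any model. You will find that the price is highly correlated with `year`: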
df_encoded.corr(method="spearman")["price_with_vat_czk"].sort_values()
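Spearman works on ranks, so it picks up monotonic but nonlinear relationships that Pearson understates. A small sketch on made-up data illustrates the difference (the `age`, `price`, and `noise` columns are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = rng.uniform(0, 20, size=500)
toy = pd.DataFrame({
    "age": age,
    "price": 900_000 * np.exp(-0.3 * age),  # strictly decreasing, strongly nonlinear
    "noise": rng.normal(size=500),          # unrelated column for contrast
})

# Correlation of every column with the target, under both methods
pearson = toy.corr(method="pearson")["price"].sort_values()
spearman = toy.corr(method="spearman")["price"].sort_values()
print(pd.DataFrame({"pearson": pearson, "spearman": spearman}))
```

A large gap between the two coefficients for a feature is a hint that its relationship with the target is nonlinear, which is one more sign that a plain linear model may struggle.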
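You can use exploratory data analysis to tune your features better (statistics is your friend). Beyond that, you can try several algorithms, then different hyperparameters (see `sklearn.model_selection.GridSearchCV`), then put your model in a pipeline and compute feature importances to reduce the number of features you have (see SHAP or GINI for that)... In short, there is still a lot to do.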
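The last steps (pipeline, hyperparameter search, feature importances) can be sketched roughly as follows. This is only an illustration under assumptions: the data is synthetic, the column names and the parameter grid are made up, and the importances are the impurity-based ones built into the random forest (the "GINI"-style option), not SHAP values, which need the separate `shap` package.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the car data (assumption: this is NOT the real dataset).
rng = np.random.default_rng(0)
n = 600
X = pd.DataFrame({
    "year": rng.integers(2000, 2024, size=n),
    "mileage": rng.uniform(0, 300_000, size=n),
    "noise": rng.normal(size=n),  # carries no signal; should rank last
})
y = (900_000 * np.exp(-0.12 * (2024 - X["year"])) * np.exp(-X["mileage"] / 200_000)
     + rng.normal(0, 10_000, size=n))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Preprocessing and model in one object, so the search refits both on every fold.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("model", RandomForestRegressor(random_state=0)),
])

# Small illustrative grid; "<step name>__<parameter>" addresses a pipeline step.
param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [None, 8],
    "model__min_samples_leaf": [1, 5],
}
search = GridSearchCV(pipe, param_grid, scoring="neg_root_mean_squared_error", cv=3)
search.fit(X_train, y_train)
print("best params:", search.best_params_)
print(f"cross-validated RMSE: {-search.best_score_:,.0f}")

# Impurity-based importances of the best model, highest first.
importances = pd.Series(
    search.best_estimator_.named_steps["model"].feature_importances_,
    index=X.columns,
).sort_values(ascending=False)
print(importances)
```

Features that sit at the bottom of the ranking are candidates for removal. Rerun the search after dropping them to check that the cross-validated RMSE does not get worse.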