How can I improve the RMSE of my car price estimate?


  1. First, I fill in the missing condition values using an estimate based on mileage.
`
new_condition_df = df[df['condition'].map(condition_mapping) == 2]
top_1000_highest_mileage = new_condition_df.nlargest(1000, 'mileage')['mileage']
average_top_1000_highest_mileage = top_1000_highest_mileage.mean()

# Filter the DataFrame for rows where condition is null or unspecified
# (.copy() so the assignments below modify an independent frame, not a view of df)
null_condition_df = df[df['condition'].isnull() | (df['condition'] == '')].copy()

# Update 'condition' based on mileage condition
null_condition_df.loc[null_condition_df['mileage'] >= average_top_1000_highest_mileage, 'condition'] = 'CONDITION_USED'
null_condition_df.loc[null_condition_df['mileage'] < average_top_1000_highest_mileage, 'condition'] = 'CONDITION_NEW'

# Update the original DataFrame with the modified rows
df.update(null_condition_df)
`
  2. Drop some rows with missing values
columns_with_null = ['color', 'vat_reclaimable', 'cubic_capacity', 'seller_country', 'feature']
df.dropna(subset=columns_with_null, inplace=True)

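# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['air_conditioning'] = df['air_conditioning'].fillna('AIRCONDITIONING_NONE')
df['parking_camera'] = df['parking_camera'].fillna('PARKINGCAMERA_NONE')
df['parking_sensors'] = df['parking_sensors'].fillna('PARKINGSENZOR_NONE')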
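  3. Here I try to estimate the missing values of the drive column, which says whether the car is 4x4 or 4x2. drive contains a lot of nulls, which is why I estimate it in such a complicated way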
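import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Numeric columns plus the one-hot columns for car style, transmission and fuel type
features = (
    ['mileage', 'cubic_capacity', 'power', 'year']
    + list(df.columns[df.columns.str.startswith('car_style_')])
    + list(df.columns[df.columns.str.startswith('transmission_')])
    + list(df.columns[df.columns.str.startswith('fuel_type_')])
)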

train_data = df.dropna(subset=['drive'])  # Remove rows with missing values in the 'drive' column
X_train, X_test, y_train, y_test = train_test_split(train_data[features], pd.get_dummies(train_data['drive']), test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

missing_data = df[df['drive'].isnull()]
X_missing = missing_data[features]
predicted_values = model.predict(X_missing)


df_imputed = df.copy()
predicted_df = pd.DataFrame(predicted_values, columns=y_train.columns, index=missing_data.index)
df_imputed.loc[df_imputed['drive'].isnull(), y_train.columns] = predicted_df.values

predicted_df_encoded = pd.DataFrame(predicted_values, columns=y_train.columns, index=missing_data.index)
predicted_df_encoded = (predicted_df_encoded > 0.5).astype(int)

for column in predicted_df_encoded.columns:
    df_imputed[column] = 0  # Add the column with all values set to 0
    df_imputed.loc[predicted_df_encoded.index, column] = predicted_df_encoded[column].values  

unique_values_imputed_encoded = df_imputed['drive'].unique()
df = df_imputed
df.drop(columns=['drive'], inplace=True)
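One thing to watch in this step: the loop sets every indicator column to 0 for all rows before writing the predictions, so rows whose drive was already known end up with no drive information once the drive column is dropped. A minimal sketch of the same idea that keeps the known values and uses a classifier instead of a regression on dummies; the toy data, the `DRIVE_4X4`/`DRIVE_4X2` labels and the choice of `RandomForestClassifier` are assumptions, not part of the original program:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real data (assumption: column names mirror the question)
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "mileage": rng.integers(0, 300_000, n),
    "power": rng.integers(50, 300, n),
    "year": rng.integers(2000, 2024, n),
})
# Hypothetical labels: powerful cars are 4x4 in this toy data
toy["drive"] = np.where(toy["power"] > 180, "DRIVE_4X4", "DRIVE_4X2")
# Hide roughly 30% of the labels to simulate the missing values
toy.loc[rng.random(n) < 0.3, "drive"] = np.nan

features = ["mileage", "power", "year"]
known = toy["drive"].notna()

# Train a classifier on the rows where 'drive' is known
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(toy.loc[known, features], toy.loc[known, "drive"])

# Fill only the missing rows; known values are left untouched
toy_imputed = toy.copy()
toy_imputed.loc[~known, "drive"] = clf.predict(toy.loc[~known, features])

# One-hot encode afterwards, so every row gets exactly one indicator set
drive_dummies = pd.get_dummies(toy_imputed["drive"], dtype=int)
toy_imputed = toy_imputed.drop(columns=["drive"]).join(drive_dummies)

print(toy_imputed[["DRIVE_4X2", "DRIVE_4X4"]].sum())
```

Encoding after imputation means the known and the predicted rows go through the same path, so no row is left with all-zero indicators.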
  4. Here I encode the feature field
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
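# Keep df's index on the encoded frame so the join aligns row by row
# (earlier dropna calls leave gaps in the index, which a default RangeIndex would not match)
encoded = pd.DataFrame(mlb.fit_transform(df['feature']), columns=mlb.classes_, index=df.index)
df = df.join(encoded)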
df.fillna(0, inplace=True)

df = df.drop(columns=['feature'])
  5. The training itself
df_encoded = pd.get_dummies(df)

X_train, X_test, y_train, y_test = train_test_split(df_encoded.drop(columns=['price_with_vat_czk']), df_encoded['price_with_vat_czk'], test_size=0.25, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

Full program: https://onecompiler.com/python/42brp9a4r Dataset: https://filetransfer.io/data-package/a0mFEfg4#link

My RMSE is about 64k.

python pandas dataframe machine-learning linear-regression
1 answer

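Any answer here will fall short: there is a lot to do, and I think that is what the role of a data scientist is about. You can start by asking yourself some questions:

  • How bad is your model's RMSE? What value would you consider good? What is your point of comparison for choosing this metric over others?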
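One way to get a point of comparison is to score a trivial baseline that always predicts the mean price, and to look at a second metric such as MAE next to the RMSE. A minimal sketch; the prices and predictions below are made-up numbers, not results from the dataset:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical prices (CZK) and hypothetical model predictions
rng = np.random.default_rng(0)
y_true = rng.uniform(100_000, 900_000, 500)
y_pred = y_true + rng.normal(0, 60_000, 500)


def rmse(actual, predicted):
    """Root mean squared error, in the same unit as the target."""
    return float(np.sqrt(mean_squared_error(actual, predicted)))


# Baseline that ignores the features and always predicts the mean price
X_dummy = np.zeros((len(y_true), 1))
baseline = DummyRegressor(strategy="mean").fit(X_dummy, y_true)

baseline_rmse = rmse(y_true, baseline.predict(X_dummy))
model_rmse = rmse(y_true, y_pred)
model_mae = float(mean_absolute_error(y_true, y_pred))

print(f"baseline RMSE: {baseline_rmse:,.0f}")
print(f"model RMSE:    {model_rmse:,.0f}")
print(f"model MAE:     {model_mae:,.0f}")
```

RMSE penalises large errors more than MAE does, so a wide gap between the two usually points to a few very badly predicted cars.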

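I think some exploratory data analysis is missing. For example, a scatter plot may show you that linear regression is not the best approach. I just tried `sklearn.ensemble.RandomForestRegressor` with default hyperparameters: the resulting plot behaves better, the RMSE drops to around 45K, and R2 goes from 0.88 to 0.94 (see `sklearn.metrics.r2_score`).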
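For example, before fitting any model you can compute the correlation between each feature and the target to see how your variables behave; you will find that the price is highly correlated with "year":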

df_encoded.corr(method="spearman")["price_with_vat_czk"].sort_values()
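To see what that line returns, here is a small sketch on synthetic data (the columns and the price formula are assumptions). It also drops the target's correlation with itself, which is always 1 and only clutters the ranking:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: price depends on 'year' monotonically but not linearly
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "year": rng.integers(2000, 2024, n),
    "mileage": rng.integers(0, 300_000, n),  # unrelated to price in this toy data
})
toy["price_with_vat_czk"] = (
    900_000 * np.exp(-0.15 * (2024 - toy["year"])) + rng.normal(0, 10_000, n)
)

target = "price_with_vat_czk"
# Rank-based correlation captures monotonic relations, linear or not
spearman = toy.corr(method="spearman")[target].drop(target).sort_values()
pearson = toy.corr(method="pearson")[target].drop(target)

print(pd.DataFrame({"spearman": spearman, "pearson": pearson}))
```

Comparing the Spearman and Pearson columns is a quick hint about linearity: a feature with a high Spearman but clearly lower Pearson value is related to the price in a monotonic, non-linear way.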

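You can use exploratory data analysis to tune your features (statistics is your friend). You can also try several algorithms and then different hyperparameters (see `sklearn.model_selection.GridSearchCV`), then put your model in a pipeline and compute feature importances to reduce the number of features you have (see SHAP or Gini importance for this). In short, there is still a lot to do.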
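A minimal sketch of those last steps, assuming a random forest inside a `Pipeline`, a small hand-picked grid, and synthetic data; the grid values and column names are illustrative, not tuned recommendations:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the car data (assumption: not the real dataset)
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "year": rng.integers(2000, 2024, n),
    "power": rng.integers(50, 300, n),
    "noise": rng.normal(size=n),  # feature with no relation to the price
})
y = 800_000 * np.exp(-0.15 * (2024 - X["year"])) * (X["power"] / 150) + rng.normal(0, 20_000, n)

# Scaling is not required for trees; it is here only to show a preprocessing step
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestRegressor(random_state=0)),
])

# Step parameters are addressed as <step name>__<parameter name>
param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [None, 10],
}
search = GridSearchCV(pipe, param_grid, scoring="neg_root_mean_squared_error", cv=3)
search.fit(X, y)

print("best params:", search.best_params_)
print(f"cross-validated RMSE: {-search.best_score_:,.0f}")

# Impurity-based importances of the best forest, highest first
best_forest = search.best_estimator_.named_steps["model"]
importances = pd.Series(best_forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Features at the bottom of the ranking are candidates for removal. Impurity-based importances can favour high-cardinality features, so it is worth cross-checking them with permutation importance or SHAP before dropping columns.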
