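Lately I have been learning some core ML concepts and writing code with the sklearn library. After some basic practice, I tried the Airbnb NYC dataset from Kaggle (roughly 40,000 samples): https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data#New_York_City_.png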


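I used sklearn.linear_model.Ridge as a baseline and, after some basic data cleaning, got a very poor R^2 score of 0.12 on the test set. I then thought a linear model might be too simple, so I tried the "kernel trick" adapted for regression (sklearn.kernel_ridge.KernelRidge), but it took far too long to fit (> 1 hr)! To work around this, I used sklearn.kernel_approximation.Nystroem to approximate the kernel map, applied the transformation to the features before training, and then used a simple linear regression model. However, even that takes a long time to transform and fit once I increase the n_components parameter, which I have to do to get any meaningful improvement in accuracy.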
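For reference, the Nystroem-then-linear-model setup described above can be sketched roughly as follows. This is only an illustration on synthetic data; the feature matrix, target, `gamma`, `n_components` and `alpha` values are assumptions, not the settings used on the Airbnb data.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: a nonlinear target on 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale first (RBF kernels are distance-based), approximate the kernel map
# with Nystroem, then fit an ordinary linear Ridge on the transformed features.
approx_kernel_ridge = make_pipeline(
    StandardScaler(),
    Nystroem(kernel="rbf", gamma=0.1, n_components=200, random_state=0),
    Ridge(alpha=1.0),
)
approx_kernel_ridge.fit(X_train, y_train)

# Compare against a plain linear Ridge on the raw features.
linear_r2 = Ridge(alpha=1.0).fit(X_train, y_train).score(X_test, y_test)
nystroem_r2 = approx_kernel_ridge.score(X_test, y_test)
print(f"linear Ridge R^2:     {linear_r2:.3f}")
print(f"Nystroem + Ridge R^2: {nystroem_r2:.3f}")
```

The cost of the transform grows with `n_components`, so it is usually worth timing a few small values before committing to a large one.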



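  • There is no substitute for proper analysis. This may involve interviewing experts to understand the limitations of your dataset.
  • Your model (any model, not just a regression model) is only as good as your features. If house prices depend on local tax rates or school ratings, even a perfect model will not perform well without those features.
  • Some features cannot be included in the model by design, so do not expect a perfect score in the real world. For example, it is nearly impossible to account for access to grocery stores, restaurants, clubs, and so on. Many of these features are also moving targets, because they change over time. If human experts perform worse, even an R2 of 0.12 may be great.
  • Models have assumptions. Linear regression expects the dependent variable (price) to be linearly related to the independent variables (e.g., property size). By exploring the residuals you can spot some non-linearities and cover them with non-linear features. Some patterns are hard to spot, however, yet can still be handled by other models, such as non-parametric regression and neural networks.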
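To make the point about residuals concrete, here is a minimal sketch on made-up data. The `size` and `price` arrays and the quadratic relationship are assumptions chosen for illustration, not properties of the Airbnb dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
size = rng.uniform(20, 200, size=500)  # hypothetical property size
# Hypothetical price that grows quadratically with size, plus noise.
price = 50 + 0.02 * size ** 2 + rng.normal(0, 20, size=500)

# Fit a purely linear model and look at what it fails to explain.
X_lin = size.reshape(-1, 1)
lin = LinearRegression().fit(X_lin, price)
residuals = price - lin.predict(X_lin)

# If the linear assumption held, the residuals would be unrelated to any
# curvature term. A strong correlation hints that a squared term is missing.
centered_sq = (size - size.mean()) ** 2
curvature_corr = np.corrcoef(residuals, centered_sq)[0, 1]

# Add the non-linear feature and refit.
X_quad = np.column_stack([size, size ** 2])
quad = LinearRegression().fit(X_quad, price)

r2_lin = lin.score(X_lin, price)
r2_quad = quad.score(X_quad, price)
print(f"residual/curvature correlation: {curvature_corr:.2f}")
print(f"R^2 linear: {r2_lin:.3f}   R^2 with squared term: {r2_quad:.3f}")
```

In practice a scatter plot of the residuals against each feature often shows the same thing at a glance.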


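  • It is the simplest and fastest model. That matters a lot for real-time systems and for statistical analysis.
  • It is often used as a baseline model. Before trying a fancy neural network architecture, it helps to know how much improvement you get over a naive approach.
  • Regression is sometimes used to test certain hypotheses, such as the linearity of an effect and the relationships between variables.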
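The baseline idea can be sketched like this: score a model that always predicts the training mean, then see how far a linear model moves beyond it. The data here is synthetic and the coefficients are arbitrary assumptions.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a genuinely linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Naive baseline: always predict the mean of the training targets.
mean_baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

baseline_r2 = mean_baseline.score(X_test, y_test)
linear_r2 = linear.score(X_test, y_test)
print(f"mean baseline R^2: {baseline_r2:.3f}")
print(f"linear model R^2:  {linear_r2:.3f}")
```

Any fancier model can then be judged by how much it adds over both numbers, not just by its own score.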




import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
import sklearn
from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso

# Predicting Continuous Target Variables with Regression Analysis
df = pd.read_csv('C:\\your_path_here\\AB_NYC_2019.csv')

# keep only the neighbourhood column and one-hot encode it so it becomes numeric
df_new = df[['neighbourhood']]
df_new = pd.get_dummies(df_new)
# print(df_new.columns.values)

# df_new.shape
# df.shape

# Use a feature selection technique to see which features (independent variables)
# have the most influence on the target (dependent variable).
# price is continuous, so use the regressor rather than the classifier.
from sklearn.ensemble import RandomForestRegressor
features = df_new.columns.values
clf = RandomForestRegressor()
clf.fit(df_new[features], df['price'])

# from the calculated importances, order them from most to least important
# and make a barplot so we can visualize what is/isn't important
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)

# what kind of object is this
# type(sorted_idx)
padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")


X = df_new[features]
y = df['price']

reg = LassoCV()
reg.fit(X, y)
print("Best alpha using built-in LassoCV: %f" % reg.alpha_)
print("Best score using built-in LassoCV: %f" %reg.score(X,y))
coef = pd.Series(reg.coef_, index = X.columns)

print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " +  str(sum(coef == 0)) + " variables")


Best alpha using built-in LassoCV: 0.040582
Best score using built-in LassoCV: 0.103947
Lasso picked 78 variables and eliminated the other 146 variables


imp_coef = coef.sort_values()
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Lasso Model")

# get the top 25; plotting fewer features so we can actually read the chart
imp_coef = imp_coef.tail(25)
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Lasso Model")


X = df_new
y = df['price']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)

# Training the Model
# We will now train our model using the LinearRegression function from the sklearn library.

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)

# Prediction
# We will now make prediction on the test data using the LinearRegression function and plot a scatterplot between the test data and the predicted value.
prediction = lm.predict(X_test)
plt.scatter(y_test, prediction)

from sklearn import metrics
from sklearn.metrics import r2_score
print('MAE', metrics.mean_absolute_error(y_test, prediction))
print('MSE', metrics.mean_squared_error(y_test, prediction))
print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
print('R squared error', r2_score(y_test, prediction))


MAE 1004799260.0756996
MSE 9.87308783180938e+21
RMSE 99363412943.64531
R squared error -2.603867717517002e+17


X = df[['longitude','latitude']]
y = df['price']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)

# Training the Model
# We will now train our model using the LinearRegression function from the sklearn library.

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)

# Prediction
# We will now make prediction on the test data using the LinearRegression function and plot a scatterplot between the test data and the predicted value.
prediction = lm.predict(X_test)
plt.scatter(y_test, prediction)

df1 = pd.DataFrame({'Actual': y_test, 'Predicted':prediction})
df2 = df1.head(10)
df2.plot(kind = 'bar')


from sklearn import metrics
from sklearn.metrics import r2_score
print('MAE', metrics.mean_absolute_error(y_test, prediction))
print('MSE', metrics.mean_squared_error(y_test, prediction))
print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
print('R squared error', r2_score(y_test, prediction))
# better but not awesome


MAE 85.35438165291622
MSE 36552.6244271195
RMSE 191.18740655994972
R squared error 0.03598346983552425

# look at OLS
import statsmodels.api as sm
model = sm.OLS(y, X).fit()

# run the model and interpret the predictions
predictions = model.predict(X)
# Print out the statistics
print(model.summary())


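One-hot encoding is doing exactly what it is supposed to do, but it does not help you get the results you want. Likewise, using lng/lat does not help you get the results you want. As you know, you must use numeric data for a regression problem, but none of these features help you predict price, at least not very well. Of course, I may have made a mistake somewhere. If I did, please let me know!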
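One caveat on the one-hot run: the enormous negative R squared reported for the plain LinearRegression fit may come from numerically unstable coefficients on near-collinear dummy columns rather than from the encoding itself. That is an untested guess for this dataset. A minimal sketch, on synthetic data with a hypothetical `df_demo` frame standing in for the real one, of how a small Ridge penalty keeps the coefficients bounded:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in for the 'neighbourhood' column: 50 areas,
# each with its own typical price level.
n_rows, n_areas = 3000, 50
area_ids = rng.integers(0, n_areas, size=n_rows)
area_price = rng.uniform(50, 300, size=n_areas)
df_demo = pd.DataFrame({
    "neighbourhood": ["area_" + str(i) for i in area_ids],
    "price": area_price[area_ids] + rng.normal(0, 30, size=n_rows),
})

# One-hot encode exactly as in the answer, but force a float dtype.
X = pd.get_dummies(df_demo[["neighbourhood"]], dtype=float)
y = df_demo["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

# The L2 penalty makes the problem well conditioned even though the
# dummy columns sum to one (perfect collinearity with the intercept).
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
ridge_r2 = ridge.score(X_test, y_test)
max_abs_coef = np.abs(ridge.coef_).max()
print(f"Ridge R^2 on one-hot features: {ridge_r2:.3f}")
print(f"largest |coefficient|: {max_abs_coef:.1f}")
```

On the real data the score would still be limited by how much of the price the neighbourhood actually explains, so this addresses the blow-up, not the weak features.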

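See the link below for a good example of predicting house prices from a variety of features. Note: all of the variables are numeric, and the results are pretty good (give or take, about 70%, but still much better than the Airbnb results).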



