我尝试通过分析数据文件Google Apps Store来预测评级来实践线性回归,文件csv在Kaggle上。
清洁并尝试应用KNeighborsRegressor
运行模型后,结果,精度和r平方太低,我不知道为什么。
然而,预测和y检验之间的差异并不大,MSE也很低。
我认为这里有一些错误,我希望你能帮助我解决它。我想达到90%的准确率。
import re
import sys
import time
import datetime
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
df = pd.read_csv('googleplaystore.csv')
df['Rating'] = df['Rating'].fillna(df['Rating'].median())
replaces = [u'\u00AE', u'\u2013', u'\u00C3', u'\u00E3', u'\u00B3', '[', ']', "'"]
for i in replaces:
df['Current Ver'] = df['Current Ver'].astype(str).apply(lambda x : x.replace(i, ''))
regex = [r'[-+|/:/;(_)@]', r'\s+', r'[A-Za-z]+']
for j in regex:
df['Current Ver'] = df['Current Ver'].astype(str).apply(lambda x : re.sub(j, '0', x))
df['Current Ver'] = df['Current Ver'].astype(str).apply(lambda x : x.replace('.', ',',1).replace('.', '').replace(',', '.',1)).astype(float)
df['Current Ver'] = df['Current Ver'].fillna(df['Current Ver'].median())
df.drop([10472], axis = 0, inplace = True)
le = preprocessing.LabelEncoder()
df['App'] = le.fit_transform(df['App'])
category_list = df['Category'].unique().tolist()
category_list = ['cat_' + word for word in category_list]
df = pd.concat([df, pd.get_dummies(df['Category'], prefix='cat')], axis=1)
df['Genres'] = df['Genres'].str.split(';').str[0]
df['Genres'].replace('Music & Audio', 'Music', inplace =True)
le = preprocessing.LabelEncoder()
df['Genres'] = le.fit_transform(df['Genres'])
le = preprocessing.LabelEncoder()
df['Content Rating'] = le.fit_transform(df['Content Rating'])
df['Price'] = df['Price'].apply(lambda x : x.strip('$'))
df['Installs'] = df['Installs'].apply(lambda x : x.strip('+').replace(',', ''))
df['Type'] = pd.get_dummies(df['Type'])
def change_size(size):
if 'M' in size:
x = size[:-1]
x = float(x)*1000000
return(x)
elif 'k' == size[-1:]:
x = size[:-1]
x = float(x)*1000
return(x)
else:
return None
df['Size'] = df['Size'].apply(change_size)
df['Size'] = df['Size'].fillna(value=df['Size'].median(), axis = 0)
df['new'] = pd.to_datetime(df['Last Updated'])
df['lastupdate'] = (df['new'] - df['new'].max()).dt.days
features = ['App', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'lastupdate','Content Rating', 'Genres', 'Current Ver']
features.extend(category_list)
X = df[features]
y = df['Rating']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 101)
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
model = KNeighborsRegressor(n_neighbors=28)
predictions = model.predict(X_test)
model.fit(X_train, y_train)
accuracy = model.score(X_test,y_test)
'Accuracy: ' + str(np.round(accuracy*100, 2)) + '%'
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
result = pd.DataFrame({'Actual': y_test, 'Predicted': predictions})
result
这里测量精度的方式是与评级完全匹配(如在分类器中)。例如,如果实际评级为4.3且预测值为4.3001,则将其计为错误。但是,这里要做的是回归,在这种情况下,MSE和MAE是更好的指标。
您可以尝试用y_test
和predictions
来获取“精确度”的概念,如下所示:
>>> metrics.accuracy_score(np.around(y_test), np.around(predictions))
0.7666051660516605