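This dataset is great! However, I'm stuck. There seems to be a logarithmic relationship between the number of votes (x-axis) and the approval index (y-axis), but when I try to fit the data with scipy's curve_fit and with lmfit, I don't get a usable equation: curve_fit gives a wrong equation (and therefore a wrong plot), and lmfit raises a ValueError.
However, I generated a curve with an online tool and it worked, sort of. I sampled x values at a constant interval together with the corresponding y values, plotted those points on the website, and had it generate the curve. The equation I got is y = 0.986203*np.log(12.6695*x + 64567.3) - 8.30291. When I pass these values as the initial guess (p0) or to make_params, the fit still doesn't work. I've added my code below for reference!
If anyone could tell me what I'm doing wrong I'd really appreciate it, because I've spent about 2 hours on this. Thanks :)
Here is the link to the dataset as well: https://www.kaggle.com/datasets/alessandrolobello/the-ultimate-film-statistics-dataset-for-ml
Below is the code I tried: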
import pandas as pd
import numpy as np
import sklearn
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split as tts
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error
import os
import plotly.express as px
df=pd.read_csv('movie_statistic_dataset.csv')
#number of votes vs approval index
data = newdf[['movie_numerOfVotes','approval_Index']].reset_index(drop=True)
data.columns = ['Votes','Approval']
X = data[['Votes']].values
Y = data[['Approval']].values
#trying curve_fit
from scipy.optimize import curve_fit
def logfunction(a,b,x):
    return a*np.log(b*x)
popt,pcov = curve_fit(logfunction,X.ravel(),Y.ravel())
a,b = popt
ypred_log = logfunction(a,b,X.ravel())
plt.scatter(X.ravel(),ypred_log)
plt.scatter(X.ravel(),Y.ravel())
#the curve doesn't fit
#trying lmfit
from lmfit import Model, Minimizer
def logfunc(x):
    return a*np.log(b*x + c)
mod = Model(logfunc)
params = mod.make_params(a=1,b=15,c=6000)
result = mod.fit(np.array(data['Approval']),params=params,x=np.array(data['Votes']))
result.fit_report() #gives an error!
#equation I found online:
x = np.arange(0,6000001,10000)
y = 0.986203*np.log(12.6695*x + 64567.3) - 8.30291
#trying again with lmfit
def logfunc(x):
    return y = a*np.log(b*x + c) - d
mod = Model(logfunc)
params = mod.make_params(a=0.982, b=12.7,c=64567,d=8.302)
result = mod.fit(np.array(data['Approval']),params=params,x=np.array(data['Votes']))
#also doesn't work!!
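The main change I made is this: in the function passed to curve_fit, the first argument must be the independent variable (x in this case), followed by the parameters to fit.
import pandas as pd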
import numpy as np
import sklearn
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split as tts
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error
import os
df = pd.read_csv('movie_statistic_dataset.csv')
#number of votes vs approval index
data = df[['movie_numerOfVotes','approval_Index']].reset_index(drop=True)
data.columns = ['Votes','Approval']
X = data[['Votes']].values
Y = data[['Approval']].values
#trying curve_fit
from scipy.optimize import curve_fit
def logfunction(x, a, c): #first argument must be the independent variable
    return a * np.log(x + 1e-8)**3.8 + c
    # return a * np.log(x + 1e-6)**b + c #could estimate b instead
popt,pcov = curve_fit(logfunction, X.ravel(), Y.ravel())
a, c = popt
#Fitted curve at the data points
ypred_log = logfunction(X.ravel(), a, c)
#Fitted curve on a new axis
x_fine = np.linspace(X.min(), X.max(), num=300)
ypred_log_fine = logfunction(x_fine, a, c)
#Plot
plt.scatter(X.ravel(), Y.ravel(), marker='.', s=20, color='darkgray', label='data')
plt.plot(x_fine, ypred_log_fine, color='black', label='fit')
#Plot the fit at data points:
# plt.scatter(X.ravel(), ypred_log, marker='|', color='black', label='fit')
plt.legend()
plt.gcf().set_size_inches(8, 3)
plt.xlabel('X')
plt.ylabel('Y')
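A note on the four-parameter equation from the question, in case it is useful. Since a*log(b*x + c) - d = a*log(x + c/b) + (a*log(b) - d), the parameters b and d cannot both be determined from the data, which may be one reason the fit is unstable. The sketch below fits the equivalent three-parameter form a*log(x + c) + d instead. It runs on synthetic data, not on the Kaggle file: the function name `shifted_log`, the made-up "true" values, the starting guess and the bounds are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def shifted_log(x, a, c, d):
    # Independent variable first, then the parameters to fit.
    return a * np.log(x + c) + d

# Synthetic stand-in for votes/approval (made-up values, not the real dataset)
rng = np.random.default_rng(0)
x = np.geomspace(100, 2_000_000, 400)
y = shifted_log(x, 1.0, 5_000.0, -6.0) + rng.normal(scale=0.05, size=x.size)

# Rough starting guess; c is bounded below by 0 so that log(x + c) stays defined
p0 = [1.0, 1_000.0, 0.0]
bounds = ([-np.inf, 0.0, -np.inf], [np.inf, np.inf, np.inf])

popt, pcov = curve_fit(shifted_log, x, y, p0=p0, bounds=bounds)
a_hat, c_hat, d_hat = popt

# Root-mean-square error of the fitted curve against the noisy data
y_fit = shifted_log(x, *popt)
rmse = float(np.sqrt(np.mean((y - y_fit) ** 2)))
print("a, c, d:", a_hat, c_hat, d_hat)
print("RMSE:", rmse)
```

The same signature rule applies to lmfit: Model takes its parameter names from the function's arguments, so a function defined as logfunc(x) gives it no a, b, c or d to fit. Defining it as logfunc(x, a, c, d) should let make_params and fit work as intended.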