Logistic regression: X has 667 features per sample; expecting 74869

Question

Using the IMDb movie review dataset, I trained a logistic regression model to predict the sentiment of a review.

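from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None,
                        tokenizer=fill, use_idf=True, norm='l2', smooth_idf=True)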
y = df.sentiment.values
X = tfidf.fit_transform(df.review)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.3, shuffle=False)
clf = LogisticRegressionCV(cv=5, scoring="accuracy", random_state=1, n_jobs=-1, verbose=3,max_iter=300).fit(X_train, y_train)

yhat = clf.predict(X_test)


print("accuracy:")
print(clf.score(X_test, y_test))

model_performance(X_train, y_train, X_test, y_test, clf)

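The preprocessing was done before this step. model_performance is just a function that builds a confusion matrix. All of this works well, and the accuracy is good.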

I have now scraped some new IMDb reviews:

# The movie "Joker" IMDb review page
url_link='https://www.imdb.com/title/tt7286456/reviews'
html=urlopen(url_link)

content_bs=BeautifulSoup(html)

JokerReviews = []
# Every review sits in a div with class "text" (visible in the IMDb page source)
for b in content_bs.find_all('div',class_='text'):
  JokerReviews.append(b)

df = pd.DataFrame.from_records(JokerReviews)
df['sentiment'] = "0" 
jokerData=df[0]
jokerData = jokerData.apply(preprocessor)

Problem: I now want to use the same logistic regression model to predict the sentiment of these reviews:

tfidf2 = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None, tokenizer=fill, use_idf=True, norm='l2', smooth_idf=True)
y = df.sentiment.values
Xjoker = tfidf2.fit_transform(jokerData)

yhat = clf.predict(Xjoker)

But I get the error: ValueError: X has 667 features per sample; expecting 74869

I don't understand why it has to have the same number of features as X_test.

Answer 1

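The problem is that your model was trained on data whose preprocessing identified 74869 unique words, while the preprocessing of your inference input identified only 667 words. The model must receive data with the same number of columns it was trained on. On top of that, some of the 667 words identified at inference time may not be among the words the model expects at all.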
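To make the mismatch concrete, here is a minimal standard-library sketch, not the TfidfVectorizer implementation. The helpers build_vocabulary and vectorize are hypothetical names invented for illustration, and they use raw token counts instead of TF-IDF weights. The point is only that the number of columns equals the size of whatever vocabulary the text was vectorized against.

```python
def build_vocabulary(docs):
    """Map each unique whitespace-separated token to a column index."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorize(docs, vocabulary):
    """Count tokens per document, one column per vocabulary entry."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in doc.split():
            col = vocabulary.get(tok)
            if col is not None:  # tokens absent from the vocabulary are dropped
                row[col] += 1
        rows.append(row)
    return rows


train_docs = ["great movie great cast", "boring plot bad acting"]
new_docs = ["great acting but slow plot"]

train_vocab = build_vocabulary(train_docs)  # vocabulary seen during training
new_vocab = build_vocabulary(new_docs)      # vocabulary of the new text only

refit = vectorize(new_docs, new_vocab)      # width follows the new vocabulary
reused = vectorize(new_docs, train_vocab)   # width follows the training vocabulary

print(len(train_vocab), len(refit[0]), len(reused[0]))
```

Only the second call produces rows a model trained on train_docs could accept, because its columns line up with the training vocabulary.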

To create valid input for the model, you can use an approach like the following:

# Assumes X and Xjoker are pandas DataFrames with one column per word.
# Find the columns the model expects that are missing from the inference DataFrame
not_existing_cols = [c for c in X.columns.tolist() if c not in Xjoker]
# Add these columns to the inference DataFrame
Xjoker = Xjoker.reindex(Xjoker.columns.tolist() + not_existing_cols, axis=1)
# The new columns have no values yet, so replace the nulls with 0
Xjoker.fillna(0, inplace=True)
# Keep only the columns of the original X, in the original order
Xjoker = Xjoker[X.columns.tolist()]

After these steps, you can call the predict() method.
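When the features come straight from a TfidfVectorizer, as in the question, a common way to get matching columns without aligning them by hand is to reuse the vectorizer that was fitted on the training reviews and call transform() on the new text, instead of fitting a second vectorizer. The sketch below is only an illustration under stated assumptions: the review strings and labels are made up, the default tokenizer replaces the question's custom one, and a plain LogisticRegression stands in for LogisticRegressionCV because the toy data is too small for 5-fold cross-validation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up stand-ins for df.review / df.sentiment and the scraped reviews
train_reviews = [
    "a great movie with a great cast",
    "wonderful acting and a moving story",
    "boring plot and bad acting",
    "a dull, terrible waste of time",
]
train_labels = [1, 1, 0, 0]
new_reviews = ["great acting but a dull plot", "a wonderful story"]

# Fit the vectorizer once, on the training text only
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_reviews)
clf = LogisticRegression().fit(X_train, train_labels)

# transform() maps new text onto the vocabulary learned during fit;
# words that were not seen in training are ignored
X_new = tfidf.transform(new_reviews)

print(X_train.shape[1], X_new.shape[1])  # both widths are the same
predictions = clf.predict(X_new)
print(predictions)
```

Applied to the question's code, this would mean calling tfidf.transform(jokerData) with the original tfidf object rather than tfidf2.fit_transform(jokerData).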
