特征选择-提取使用Ngram和TF-IDF

问题描述 投票:0回答:1

我是数据挖掘新手,我使用 ChatGPT 来解决我的问题。我不明白哪里有错误。乍一看,一切似乎都是正确的。如果您能帮助我解决整个代码中的错误,我会很高兴。

我有一个 .csv 文件,其中包含目标(true 或 false)和 tweet(字符串)列,我对此“.csv”文件执行所有操作。

这是我的部分代码:

def train_and_evaluate_classifier(classifier, X_train, y_train, X_test, y_test, method_name):
    start_time = time.time()
    classifier.fit(X_train, y_train)
    training_time = time.time() - start_time

    start_time = time.time()
    y_pred = classifier.predict(X_test)
    testing_time = time.time() - start_time

    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    num_features = X_train.shape[1]

    result_table = pd.DataFrame({
        'Method': [method_name],
        'Accuracy': [accuracy],
        'F-measure': [f1],
        'Precision': [precision],
        'Recall': [recall],
        'Number of Features': [num_features],
        'Training Time': [training_time],
        'Testing Time': [testing_time]
    })

    return result_table
# TF-IDF
vectorizer = TfidfVectorizer(max_features=200)

#count vectorizer
coun_vect = CountVectorizer(ngram_range=(1,2), max_features=200)

knn_classifier = KNeighborsClassifier()


kbest = SelectKBest(chi2, k = 200)
pca = PCA(n_components = 2)

kf = KFold(n_splits=5, shuffle=True)

knn_results = []


X=data['tweet']
y=data['target']

for train_i, test_i in kf.split(X,y):
    X_train, X_test = X[train_i], X[test_i]
    y_train, y_test = y[train_i], y[test_i]

    #TF-IDF vectorizer
    X_train_tfidf = vectorizer.fit_transform(X_train)
    X_test_tfidf = vectorizer.transform(X_test)
    #Count vectorizer for ngram
    X_train_ngram = coun_vect.fit_transform(X_train)
    X_test_ngram = coun_vect.transform(X_test)
    #TF-IDF + PCA
    X_train_tfidf_pca = pca.fit_transform(X_train_tfidf)
    X_test_tfidf_pca = pca.transform(X_test_tfidf)
    #TF-IDF + FS
    X_train_tfidf_fs = kbest.fit_transform(X_train_tfidf, y_train)
    X_test_tfidf_fs = kbest.transform(X_test_tfidf)
    #Ngram + pca
    X_train_ngram_pca = pca.fit_transform(X_train_ngram)
    X_test_ngram_pca = pca.transform(X_test_ngram)
    #Ngram + fs
    X_train_ngram_fs = kbest.fit_transform(X_train_ngram, y_train)
    X_test_ngram_fs = kbest.transform(X_test_ngram)



    knn_result_tfidf = train_and_evaluate_classifier(knn_classifier, X_train_tfidf, y_train, X_test_tfidf, y_test, 'KNN TF-IDF')
    
    knn_result_ngram = train_and_evaluate_classifier(knn_classifier,X_train_ngram, y_train,X_test_ngram,y_test,'KNN N-gram')

    knn_result_ngram_fs = train_and_evaluate_classifier(knn_classifier, X_train_ngram_fs, y_train, X_test_ngram_fs, y_test,'KNN N-gram +fs')


    knn_result_ngram_pca = train_and_evaluate_classifier(knn_classifier, X_train_ngram_pca, y_train, X_test_ngram_pca, y_test,'KNN N-gram +pca')


    knn_result_tfidf_fs = train_and_evaluate_classifier(knn_classifier,X_train_tfidf_fs, y_train,X_test_tfidf_fs,y_test,'KNN TF-IDF + FS')


    knn_result_tfidf_pca = train_and_evaluate_classifier(knn_classifier,X_train_tfidf_pca, y_train,X_test_tfidf_pca,y_test,'KNN TF-IDF + PCA')


    
    knn_results.extend([knn_result_tfidf, knn_result_ngram, knn_result_ngram_fs, knn_result_ngram_pca, knn_result_tfidf_fs, knn_result_tfidf_pca])


    #TF-IDF + FS
    X_train_tfidf_fs = kbest.fit_transform(X_train_tfidf, y_train)
    X_test_tfidf_fs = kbest.transform(X_test_tfidf)

为什么gpt在这里使用额外的y_train作为参数?

我这部分有一个打字错误:

    105     #Ngram + pca
--> 106     X_train_ngram_pca = pca.fit_transform(X_train_ngram)
    107     X_test_ngram_pca = pca.transform(X_test_ngram)

TypeError:PCA 不支持稀疏输入。请参阅 TruncatedSVD 了解可能的替代方案。

python data-mining text-mining tfidfvectorizer k-fold
1个回答
0
投票

普通 PCA 不接受稀疏矩阵(检查是否有 NaN 或其他矩阵)。

要执行 PCA 步骤,您可以执行截断奇异值分解 (Truncated-SVD),如错误消息所示。

from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=n_components)
X_train_ngram_svd = svd.fit_transform(X_train_ngram)

如果您需要载荷来查看可变权重:

loading_vectors = svd.components_
© www.soinside.com 2019 - 2024. All rights reserved.