How do I make my algorithm work with KNN text classification?


I am trying to make my classification accept text (strings), not just numbers (numerics). I am processing data containing a large batch of pulled articles, and I want the classification algorithm to show me which ones to keep processing and which ones to discard. Feeding it a number works fine, but it is not very intuitive, even though I know the number represents a relation to one of the two classes I am working with.

How do I change the algorithm's logic so that it accepts text as the search term, rather than an anonymous number picked from the 'Unique_id' column? The columns are 'Title', 'Abstract', 'Relevant', 'Label', 'Unique_id'. The reason for concatenating the DataFrames at the end of the algorithm is that I want to compare the results at the end. It should be noted that the 'Label' column consists of lists of keywords, so essentially I want the algorithm to read from that column. A sketch of the data layout follows below.
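For context, the data has this shape (made-up rows, purely to show the layout; the real values come from the pulled articles):

Title;Abstract;Relevant;Label;Unique_id
First article;Text of the first abstract ...;1;keyword1, keyword2;0
Second article;Text of the second abstract ...;0;keyword3, keyword4;1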

I did try, when reading from the data source, changing index_col='Unique_id' to index_col='Label', but that did not solve it either.
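Roughly what that attempt looked like (a sketch of the changed line only; the rest of the code is as below):

# attempt: index by the keyword column instead of the numeric id
data1 = pd.read_csv('File_2_coltest_demo_KNN.csv', sep=';', encoding="ISO-8859-1", index_col='Label')

# hoped the lookup would then accept a keyword, but it did not:
print(get_closest_neighs1('some keyword'))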

An example of what I want:

print("\nPrint KNN1")
print(get_closest_neighs1('search word'), "\n")

print("\nPrint KNN2")
print(get_closest_neighs2('search word'), "\n")

print("\nPrint KNN3")
print(get_closest_neighs3('search word'), "\n")

Here is the full code (it runs as-is right now, using a number to identify the nearest neighbours; see the end of the algorithm for where the example above would go):

import pandas as pd

print("\nPerforming Analysis using Text Classification")
data = pd.read_csv('File_1_coltest_demo.csv', sep=';', encoding="ISO-8859-1").dropna()

# number every unique (Title, Abstract, Relevant) combination to get an id
data['Unique_id'] = data.groupby(['Title', 'Abstract', 'Relevant']).ngroup()

data.to_csv('File_2_coltest_demo_KNN.csv', sep=';', encoding="ISO-8859-1", index=False)

data1 = pd.read_csv('File_2_coltest_demo_KNN.csv', sep=';', encoding="ISO-8859-1", index_col='Unique_id')

# keep only the two columns the classifier works on
data2 = data1[['Abstract', 'Relevant']]

data2.to_csv('File_3_coltest_demo_KNN_reduced.csv', sep=';', encoding="ISO-8859-1", index=False)

print("\nData top 25 items")
print(data2.head(25))

print("\nData info")
print(data2.info())

print("\nData columns")
print(data2.columns)

from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer

# bag-of-words counts over the abstracts (only used by the first split below)
token = RegexpTokenizer(r'[a-zA-Z0-9]+')
cv = CountVectorizer(lowercase=True, stop_words='english', ngram_range=(1, 1), tokenizer=token.tokenize)
text_counts = cv.fit_transform(data2['Abstract'])

from sklearn.model_selection import train_test_split
# note: this split is overwritten by the TF-IDF split further down
X_train, X_test, y_train, y_test = train_test_split(
    text_counts, data2['Abstract'], test_size=0.5, random_state=1)

print("\nTF IDF")
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer()
text_tf = tf.fit_transform(data2['Abstract'])
print(text_tf)

X_train, X_test, y_train, y_test = train_test_split(
    text_tf, data2['Abstract'], test_size=0.3, random_state=123)

from sklearn.neighbors import NearestNeighbors

# unsupervised nearest-neighbour index over the TF-IDF matrix
nbrs = NearestNeighbors(n_neighbors=20, metric='euclidean').fit(text_tf)

def get_closest_neighs1(unique_id):
    # locate the row position of the given Unique_id, then return the
    # 20 nearest abstracts by TF-IDF distance
    row = data2.index.get_loc(unique_id)
    distances, indices = nbrs.kneighbors(text_tf.getrow(row))
    names_similar = pd.Series(indices.flatten()).map(data2.reset_index()['Abstract'])
    return pd.DataFrame({'distance1': distances.flatten(), 'Abstract': names_similar})

def get_closest_neighs2(unique_id):
    # same lookup, but report the Unique_id of each neighbour
    row = data2.index.get_loc(unique_id)
    distances, indices = nbrs.kneighbors(text_tf.getrow(row))
    names_similar = pd.Series(indices.flatten()).map(data2.reset_index()['Unique_id'])
    return pd.DataFrame({'Distance': distances.flatten() / 10, 'Unique_id': names_similar})

def get_closest_neighs3(unique_id):
    # same lookup, but report the 'Relevant' flag of each neighbour
    row = data2.index.get_loc(unique_id)
    distances, indices = nbrs.kneighbors(text_tf.getrow(row))
    names_similar = pd.Series(indices.flatten()).map(data2.reset_index()['Relevant'])
    return pd.DataFrame({'distance2': distances.flatten(), 'Relevant': names_similar})

print("\nPrint KNN1")
print(get_closest_neighs1(114), "\n")

print("\nPrint KNN2")
print(get_closest_neighs2(114), "\n")

print("\nPrint KNN3")
print(get_closest_neighs3(114), "\n")

data3 = get_closest_neighs1(114)
data4 = get_closest_neighs2(114)
data5 = get_closest_neighs3(114)

# keep a single distance column, then line the three result frames up side by side
del data5['distance2']

data6 = pd.concat([data3, data4, data5], axis=1).reindex(data3.index)

del data6['distance1']

data6.to_csv('File_4_coltest_demo_KNN_results.csv', sep=';', encoding="ISO-8859-1", index=False)
Tags: python-3.x, nlp, knn, text-classification
1 Answer

If I understand you right, you are trying to do this:
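kneighbors does not need a row that already exists in the training matrix; any vector in the same TF-IDF space will do. So instead of looking the search term up in the DataFrame index, transform it with the already-fitted TfidfVectorizer and query with the result. A minimal sketch (get_closest_neighs_text is my own name; it reuses your tf, nbrs and data2 objects from the question):

def get_closest_neighs_text(query, n=20):
    # vectorise the free-text query in the same TF-IDF space as the abstracts;
    # transform() expects an iterable of documents, hence the list
    query_vec = tf.transform([query])

    # ask the fitted NearestNeighbors model for the closest abstracts
    distances, indices = nbrs.kneighbors(query_vec, n_neighbors=n)

    # map the returned row positions back to the original columns
    flat = data2.reset_index()
    rows = indices.flatten()
    return pd.DataFrame({'Distance': distances.flatten(),
                         'Unique_id': flat['Unique_id'].iloc[rows].values,
                         'Abstract': flat['Abstract'].iloc[rows].values,
                         'Relevant': flat['Relevant'].iloc[rows].values})

print("\nPrint KNN (text search)")
print(get_closest_neighs_text('search word'), "\n")

If you want the search to run against the 'Label' keywords rather than the abstracts, fit the vectoriser on that column instead (text_tf = tf.fit_transform(data['Label'])) and refit nbrs on the result; the query function stays the same.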
