Tf-Idf output is not satisfactory


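I have a document in a text file with a few lines of text, shown below. I want to apply tf-idf to it, but I get the error shown below. I'm not sure where the int object in the file is. Why is this error raised?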

Env:

Jupyter notebook, Python 3.7

Error:

AttributeError: 'int' object has no attribute 'lower'

file.txt:

  Random person from the random hill came to a running mill and I have a count of the hill. This is my house. 

  A person is from a great hill and he loves to run a mill. 

  Sub-disciplines of biology are defined by the research methods employed and the kind of system studied: theoretical biology uses mathematical methods to formulate quantitative models while experimental biology performs empirical experiments.

  The objects of our research will be the different forms and manifestations of life, the conditions and laws under which these phenomena occur, and the causes through which they have been effected. The science that concerns itself with these objects we will indicate by the name biology.

Code:

import pandas as pd
import spacy
import csv
import collections
import sys
import itertools
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
from nltk.tokenize import sent_tokenize
from gensim import corpora, models
from stop_words import get_stop_words
from nltk.stem import PorterStemmer

data = pd.read_csv('file.txt', sep="\n", header=None)

data.dtypes
0    object
dtype: object

data.shape
(4, 1)

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(data)
print(X)

Tags: python-3.x, tf-idf, tfidfvectorizer

1 Answer

I solved it by reading the file like this:

with open('file.txt') as f:
    lines = [line.rstrip() for line in f]
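
For context on why reading the file into a plain list helps: iterating over a pandas DataFrame yields its column labels rather than its rows. With header=None the only label is the integer 0, so the vectorizer ends up calling .lower() on an int. The sketch below is a minimal illustration, not the original poster's exact code. It assumes two of the sample sentences are inlined instead of read from file.txt, and the names lines, column_labels, error_message, X and X_col are made up for the example.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: two sample sentences inlined in place of reading file.txt.
lines = [
    "A person is from a great hill and he loves to run a mill.",
    "Random person from the random hill came to a running mill.",
]

# A single-column DataFrame without a header gets the integer label 0.
data = pd.DataFrame(lines)

# Iterating over a DataFrame yields column labels, not the row contents.
column_labels = list(data)

# Passing the whole DataFrame therefore feeds the integer 0 to the vectorizer.
error_message = None
try:
    TfidfVectorizer(stop_words='english').fit_transform(data)
except AttributeError as exc:
    error_message = str(exc)

# Passing a list of strings (as in the answer) works.
X = TfidfVectorizer(stop_words='english').fit_transform(lines)

# Passing the column itself (a Series of strings) is another option.
X_col = TfidfVectorizer(stop_words='english').fit_transform(data[0])

print(column_labels)
print(error_message)
print(X.shape, X_col.shape)
```

If file.txt has blank lines between the sentences, the list comprehension in the answer keeps them as empty strings. Adding a filter such as `if line.strip()` to the comprehension is one way to drop them before vectorizing.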
