Reading the contents of several .txt files into Python


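I have two folders, each containing words stored in various .txt files. One folder is named "good" and the other "bad". I want to write a Python script that imports all of the data into a DataFrame with an "Id" column, a "word" column, and a "label" column. The label column will be "good" or "bad", depending on the folder name.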

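I have written the following Python script, but I seem to be running into a problem with the files' encoding. I installed the "chardet" library to detect each file's encoding, but I still get this error: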

UnicodeDecodeError: 'cp949' codec can't decode byte 0xb7 in position 1400: illegal multibyte sequence

import os

import chardet
import pandas as pd

good_path = "myfolder/good"
bad_path = "myfolder/bad"


ids = []
words = []
labels = []


for filename in os.listdir(good_path):
    with open(os.path.join(good_path, filename), "rb") as f:
        result = chardet.detect(f.read())
        encoding = result["encoding"]
    with open(os.path.join(good_path, filename), "r", encoding=encoding) as f:
        word_content = f.read()
        ids.append(filename)
        words.append(word_content)
        labels.append("good")


for filename in os.listdir(bad_path):
    with open(os.path.join(bad_path, filename), "rb") as f:
        result = chardet.detect(f.read())
        encoding = result["encoding"]
    with open(os.path.join(bad_path, filename), "r", encoding=encoding) as f:
        word_content = f.read()
        ids.append(filename)
        words.append(word_content)
        labels.append("bad")

# Create a dataframe from the lists ("word" matches the column name described above)
df = pd.DataFrame({"Id": ids, "word": words, "label": labels})
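A possible cause, offered as a guess rather than a diagnosis: `chardet.detect` only estimates the encoding, and its `"encoding"` value can be `None` when it cannot decide. Passing `encoding=None` to `open()` falls back to the locale default, which is cp949 on a Korean Windows system. The sketch below avoids the detection step by decoding the raw bytes with a fixed list of candidate encodings. The names `CANDIDATE_ENCODINGS`, `read_text_with_fallback`, and `collect_rows` are hypothetical, and the candidate order is an assumption that should be adapted to the actual files.

```python
import os

# Hypothetical fallback order; adjust to the encodings you expect in your data.
# "latin-1" accepts any byte sequence, so it acts as a catch-all and may
# produce wrong characters for files that are in some other encoding.
CANDIDATE_ENCODINGS = ("utf-8", "cp949", "latin-1")


def read_text_with_fallback(path, encodings=CANDIDATE_ENCODINGS):
    """Return the file's text, trying each encoding in order."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # No candidate worked: decode with the first encoding and replace
    # undecodable bytes with U+FFFD instead of raising.
    return raw.decode(encodings[0], errors="replace")


def collect_rows(base_dir, labels=("good", "bad")):
    """Build one row per .txt file found under base_dir/<label>/."""
    rows = []
    for label in labels:
        folder = os.path.join(base_dir, label)
        for filename in sorted(os.listdir(folder)):
            if not filename.endswith(".txt"):
                continue
            text = read_text_with_fallback(os.path.join(folder, filename))
            rows.append({"Id": filename, "word": text, "label": label})
    return rows
```

With pandas installed, `pd.DataFrame(collect_rows("myfolder"))` should give a frame with the "Id", "word", and "label" columns. Looping over the label names also removes the duplicated good/bad loops from the script.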

python pandas dataframe supervised-learning