How do I correctly insert text files with different encodings?

Question

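I have *.txt files generated by NetApp XCP and an Excel document containing a list of file names. I am trying to insert the data into a database and compare the two tables on the file_name column.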

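Every file contains some characters that are not valid UTF-8. I cannot ignore those characters, because the final result needs the full file path and file name. I have tried utf-8, ASCII, MacRoman and a few other encodings, as well as the chardet module (the script is not finished yet). The file names contain Chinese characters, and when I insert them into the database those characters end up as "?" instead.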
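One way to structure the trial and error described above is to decode each file's raw bytes strictly against an ordered list of candidate encodings and keep the first one that succeeds. The sketch below is only an illustration, not the asker's script: the helper name `decode_with_fallback` and the candidate list (`utf-8` first, then `gb18030` for Chinese text) are assumptions, and a decode that succeeds does not prove the guess was right.

```python
def decode_with_fallback(raw, candidates=("utf-8", "gb18030")):
    """Return (text, encoding) for the first candidate that decodes raw strictly."""
    for encoding in candidates:
        try:
            # bytes.decode() is strict by default, so bad bytes raise an error.
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


if __name__ == "__main__":
    sample = "中文.doc".encode("gb18030")  # these bytes are not valid UTF-8
    print(decode_with_fallback(sample))
```

Putting `utf-8` first is a deliberate choice: it rejects most non-UTF-8 input, whereas single-byte encodings such as MacRoman tend to accept almost anything and silently produce the wrong characters.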

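I need to insert data from files with different encodings into the database and check whether the file names match. I also need to get back the original path and file name so that the files can be accessed.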
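Because the original path has to stay usable, one option is to store the undecoded bytes next to a readable, decoded name. The sketch below assumes each path is available as `bytes` (for example, read from a file opened in binary mode). `make_record`, `original_path` and the field names are hypothetical and do not correspond to the `LeakedFile` model.

```python
def make_record(raw_path, encoding="utf-8"):
    """Pair a display name with a lossless hex copy of the original bytes."""
    return {
        # Undecodable bytes become U+FFFD here, so this field is for display only.
        "display_name": raw_path.decode(encoding, errors="replace"),
        # Hex text is plain ASCII, so it does not depend on the database charset.
        "raw_hex": raw_path.hex(),
    }


def original_path(record):
    """Recover the exact bytes that were read from the listing."""
    return bytes.fromhex(record["raw_hex"])
```

On POSIX systems Python's `open()` accepts a `bytes` path, so the recovered value can be used to reach the file even when the name cannot be decoded.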

import csv

def upload_leaked_files():
    # LEAKED_FILES and the LeakedFile model are defined elsewhere in the project.
    # df = pd.read_excel(LEAKED_FILES, sheet_name=0)  # , encoding_override='cp437')  # not working with pd.read

    with open(LEAKED_FILES, encoding="MacRoman") as file:  # MacRoman is one of the encodings tried
        data = list(csv.reader(file, delimiter=","))

    # Clear the table, then insert the first column of every row.
    LeakedFile.objects.all().delete()
    bulk_group = []
    for row in data:
        d = row[0]  # .encode(encoding="utf-8")
        bulk_group.append(LeakedFile(file=d))
    LeakedFile.objects.bulk_create(bulk_group, batch_size=100)

Example file name: 预批量检验证明NCS低线仪表预批量件质量明.doc

Answer 1

I tried setting the encoding to "utf8", and it seems to do the job:

Test file content:

预批量检验证明NCS低线仪表预批量件质量明.doc,someothertext
预批量检验证明NCS低线仪表预批量件质量明.doc,othertext

Code:

# Read the test file as UTF-8 and split each line into its comma-separated fields.
with open('./test_file.csv', 'r', encoding='utf8') as infile:
    rows = [line.rstrip('\n').split(',') for line in infile]

print(rows)

Output:

[['预批量检验证明NCS低线仪表预批量件质量明.doc', 'someothertext'], ['预批量检验证明NCS低线仪表预批量件质量明.doc', 'othertext']]
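Splitting on `','` gives the wrong fields if a file name itself contains a comma. A minimal sketch using the standard `csv` module with the same UTF-8 encoding might look like this. `read_rows` is a hypothetical helper, and it assumes that names containing commas are quoted in the file.

```python
import csv


def read_rows(path, encoding="utf-8"):
    """Read a CSV file into a list of rows, honouring quoted fields."""
    # newline="" lets the csv module handle line endings itself.
    with open(path, newline="", encoding=encoding) as infile:
        return list(csv.reader(infile))
```

For the test file above, `read_rows('./test_file.csv')` should give the same two rows as the manual split.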

Am I missing something in your explanation?

Cheers :)
