Python encoding difference between macOS and Ubuntu causes UnicodeEncodeError

Question

I am using a Python 3.6.4 script to copy a CSV file from one machine to a remote PostgreSQL 11 server.

The CSV file's headers (and values) can contain spaces as well as non-ASCII characters such as ∂ and µ. To save the CSV file, I use the following code:

with open(filename, "w", encoding="utf-8") as output:
    result.to_csv(output, sep='\t', header=False, index=False, na_rep='null', encoding="utf-8")

Here result is a pandas DataFrame. Next, I get the list of column names:

columns = ["\"{}\"".format(f) for f in list(result.head(0))]

Then I copy the CSV into Postgres using the column list:

    tn = f"{schema}.{tablename} ({', '.join(columns)})"
    with open(filename) as f:
        subprocess.check_call([
            'psql',
            '-c', f"COPY {tn} FROM STDIN WITH NULL as 'null'",
            '-d', url,
            '--set=ON_ERROR_STOP=true'
        ], stdin=f)

This works fine on macOS (Catalina 10.15.x). However, when the same code runs on an Ubuntu instance (18.04 or 16.04), I keep getting the following error:

UnicodeEncodeError: 'ascii' codec can't encode character '\xb5' in position 231: ordinal not in range(128)

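The error itself is well documented, and I have gone through the existing answers to similar questions, but none of them helped. As you can see, I have sprinkled encoding="utf-8" throughout the code, and I tried defining all the environment variables on the Ubuntu instance, to no avail.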
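
As a quick diagnostic, the sketch below reports the encodings Python derives from the environment. It assumes the problem is locale-related: on Linux, when the locale is unset or is C/POSIX, Python 3.6 typically falls back to ASCII for these values, whereas macOS usually reports UTF-8. The helper name encoding_report is hypothetical and used only for illustration.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python picked up from the environment."""
    return {
        # Default used by open() when no encoding argument is given.
        "preferred": locale.getpreferredencoding(False),
        # Used to encode file names and command-line arguments on POSIX.
        "filesystem": sys.getfilesystemencoding(),
        # Used when printing text to the terminal or a pipe.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in encoding_report().items():
        print(f"{name}: {value}")
```

Running this on both machines and comparing the output should show whether the Ubuntu instance reports an ASCII-based encoding for "filesystem" or "preferred".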

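I cannot strip the special characters; they must be preserved in the database. Is there anything I can do, either in the Python script or in the subprocess.check_call call, to fix this? I also tried switching to the following: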

import codecs


    with codecs.open(filename, encoding='utf-8') as f:
        subprocess.run([
            'psql',
            '-c', f"COPY {tn} FROM STDIN WITH NULL as 'null'",
            '-d', url,
            '--set=ON_ERROR_STOP=true'
        ], stdin=f, encoding='utf-8')

But the problem remains the same. Any help is appreciated.

Answer 1

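This is how I got it working. I am still hoping for a better approach, but for now it lets me sidestep the encoding problem (pun intended). This approach does not support non-ASCII characters in the table name, but since that is not a current requirement, I am not worried about it. It correctly handles non-ASCII characters in column names and field values.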

This line throws the UnicodeEncodeError because one of the columns contains a non-ASCII character:

tn = f"{schema}.{tablename} ({', '.join(columns)})"
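
A minimal sketch of the failure mode, under the assumption that the error surfaces when the command string containing the column names is encoded as ASCII (for example, when subprocess encodes its arguments under a C/POSIX locale). The schema, table, and column names are made up, and to_utf8_argv is a hypothetical helper. It shows one possible alternative on POSIX systems, where subprocess accepts bytes arguments: encoding the arguments as UTF-8 up front.

```python
def to_utf8_argv(args):
    """Encode every str argument as UTF-8 bytes; leave bytes untouched."""
    return [a.encode("utf-8") if isinstance(a, str) else a for a in args]


# Hypothetical COPY command with the micro sign (U+00B5) and the
# partial-derivative sign (U+2202) in the column names.
command = 'COPY myschema.mytable ("\u00b5 value", "\u2202x") FROM STDIN'

# Encoding as ASCII fails in the same way as the reported error.
try:
    command.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# Encoding as UTF-8 succeeds; the result could be handed to
# subprocess.check_call on a POSIX system (an assumption, not tested
# against psql here).
argv = to_utf8_argv(["psql", "-c", command])
```

This has not been verified against the original setup; it only illustrates why an ASCII default encoding breaks on µ while an explicit UTF-8 encoding does not.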

So I now keep the column names in the CSV file as well. Each of the following steps is essential; none of them can be skipped:

# First, re-arrange the dataframe columns to match definition order in database. 
# The dictionary table_headers contains the columns as defined within the database 
# for specified table.
for entry in table_headers[table]:
    if entry not in result:
        result[entry] = 'null'
result = result[table_headers[table]]
# Set index. You may or may not wish to do so. In this case it made sense for me.
result.set_index(result_id, inplace=True)
# Finally, write it out. Note that index=True is being set because of line above.
with open(fn, "w", encoding="utf-8") as output:
    result.to_csv(output, sep='\t', header=True, index=True, na_rep='null', encoding="utf-8")

Then, in the database copy stage, I do this:

tn = f"{schema}.\"{tablename}\""
with codecs.open(filename, encoding='utf-8') as f:
    subprocess.run([
        'psql',
        '-c', f"COPY {tn} FROM STDIN WITH NULL as 'null' CSV HEADER DELIMITER '\t'",
        '-d', url,
        '--set=ON_ERROR_STOP=true'
    ], stdin=f, encoding='utf-8')

And it works, on both macOS and Ubuntu. I probably do not need codecs.open and the other utf-8 settings, but they do no harm, so I am leaving them in for now.
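
On whether codecs.open is needed: when a file object is passed as stdin, subprocess hands the underlying file descriptor to the child process, so the encoding the parent used to open the file should not affect the bytes the child reads. The sketch below illustrates this with a child Python process standing in for psql. The function name pipe_file_through_child and the sample data are hypothetical.

```python
import os
import subprocess
import sys
import tempfile


def pipe_file_through_child(path):
    """Feed a file to a child process on stdin and return the bytes it saw."""
    # Stand-in for psql: a child Python process that echoes stdin unchanged.
    echo = "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"
    # Binary mode: the parent never decodes the file contents.
    with open(path, "rb") as f:
        done = subprocess.run(
            [sys.executable, "-c", echo],
            stdin=f,
            stdout=subprocess.PIPE,
            check=True,
        )
    return done.stdout


# Write a small tab-separated sample containing non-ASCII characters.
sample_text = "id\t\u00b5 value\n1\t\u2202x\n"
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", newline="", suffix=".tsv", delete=False
) as tmp:
    tmp.write(sample_text)
    sample_path = tmp.name

received = pipe_file_through_child(sample_path)
os.remove(sample_path)
```

If this holds in the real setup, opening the CSV with open(filename, "rb") would be enough for the stdin side, and the remaining encoding concern is limited to the command-line arguments.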
