Converting a binary column to a string column in Polars (Python library) when it contains non-UTF-8 characters

Question

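I am having trouble working in Python with a dataset that contains non-UTF-8 characters. The strings are imported as binary, and casting a binary column to a string fails when a cell contains non-UTF-8 bytes.

A minimal working example of my problem: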

import polars as pl
import pandas as pd

pd_df = pd.DataFrame([[b"bob", b"value 2", 3], [b"jane", b"\xc4", 6]], columns=["a", "b", "c"])
df = pl.from_pandas(pd_df)

column_names = df.columns

# Loop through the column names
for col_name in column_names:
    # Check if the column has binary values
    if df[col_name].dtype == pl.Binary:
        # Convert the binary column to string format
        print(col_name)
        df = df.with_columns(pl.col(col_name).cast(pl.String))

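An error is raised when converting column b. As a solution, I would be fine with any non-UTF-8 characters being converted to blanks.

I have tried many of the conversion suggestions found online, but I could not get any of them to work.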

Answer 1

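It looks like you are using the wrong encoding (the data is probably not UTF-8 but something else, which you will need to figure out). Coming back to your original request, though, you can still ignore decoding errors by applying a lambda that decodes each value with errors='ignore'. The invalid bytes are then dropped, so a cell such as b"\xc4" becomes an empty string. As you suspected, this is only a workaround rather than a proper fix: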

import polars as pl
import pandas as pd

pd_df = pd.DataFrame([[b"bob", b"value 2", 3], [b"jane", b"\xc4", 6]], columns=["a", "b", "c"])
df = pl.from_pandas(pd_df)

column_names = df.columns

# Loop through the column names
for col_name in column_names:
    # Check if the column has binary values
    if df[col_name].dtype == pl.Binary:
        # Convert the binary column to string format, handling non UTF8 characters
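        df = df.with_columns(
            # Note: Expr.apply was renamed to Expr.map_elements in newer Polars releases
            pl.col(col_name).map_elements(
                lambda x: x.decode(errors='ignore') if isinstance(x, bytes) else x,
                return_dtype=pl.String,
            ).alias(col_name)
        )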

print(df)
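
To make the difference between the workaround and a real fix more concrete, here is a small sketch in plain Python of what the sample cell b"\xc4" decodes to under different settings. The helper name decode_bytes is hypothetical (it is not part of Polars), and Latin-1 is only an assumed candidate encoding; the actual encoding of your data still has to be determined.

```python
def decode_bytes(value, encoding="utf-8", errors="replace"):
    """Decode a bytes value; pass anything else (e.g. None) through unchanged.

    Hypothetical helper for illustration only, not a Polars function.
    """
    if isinstance(value, bytes):
        return value.decode(encoding, errors=errors)
    return value


sample = b"\xc4"  # the problematic cell from the question

ignored = decode_bytes(sample, errors="ignore")    # invalid byte is silently dropped
replaced = decode_bytes(sample, errors="replace")  # invalid byte becomes U+FFFD
latin1 = decode_bytes(sample, encoding="latin-1")  # assumption: the data is Latin-1

print(repr(ignored), repr(replaced), repr(latin1))
```

A function like this could be passed to map_elements in place of the lambda above. Using errors="replace" keeps a visible marker where data was lost, while decoding with the correct encoding loses nothing.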
