Problem reading partitioned Parquet files created by Snowflake with pandas or arrow

Question

ArrowInvalid: Unable to merge: Field X has incompatible types: string vs dictionary<values=string, indices=int32, ordered=0>

ArrowInvalid: Unable to merge: Field X has incompatible types: decimal vs int32

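I am trying to write the results of a Snowflake query to disk and then query that data with arrow and duckdb. I created a partitioned Parquet dataset with the following query: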

COPY INTO 's3://path/to/folder/'
FROM (
    SELECT transaction.TRANSACTION_ID, OUTPUT_SCORE, MODEL_NAME, ACCOUNT_ID, to_char(TRANSACTION_DATE,'YYYY-MM') as SCORE_MTH
    FROM transaction
    )
partition by('SCORE_MTH=' || score_mth || '/ACCOUNT_ID=' || ACCOUNT_ID)
file_format = (type=parquet)
header=true

When I try to read the Parquet files, I get the following error:

df = pd.read_parquet('path/to/parquet/') # same result using pq.ParquetDataset or pq.read_table as they all use the same function under the hood

ArrowInvalid: Unable to merge: Field SCORE_MTH has incompatible types: string vs dictionary<values=string, indices=int32, ordered=0>

After some googling I also found this page. Following its instructions:

df = pd.read_parquet('path/to/parquet/', use_legacy_dataset=True)

ValueError: Schema in partition[SCORE_MTH=0, ACCOUNT_ID=0] /path/to/parquet was different. 
TRANSACTION_ID: string not null
OUTPUT_SCORE: double
MODEL_NAME: string
ACCOUNT_ID: int32
SCORE_MTH: string

vs

TRANSACTION_ID: string not null
OUTPUT_SCORE: double
MODEL_NAME: string

Depending on the data types, you may also get this error:

ArrowInvalid: Unable to merge: Field X has incompatible types: IntegerType vs DoubleType

or

ArrowInvalid: Unable to merge: Field X has incompatible types: decimal vs int32

This is a known issue.

Any idea how to read these Parquet files?

Answer 1

The only workaround I have found that works is:

import pyarrow.dataset as ds
dataset = ds.dataset('/path/to/parquet/', format="parquet", partitioning="hive")

You can then query it directly with duckdb:

import duckdb
con = duckdb.connect()
pandas_df = con.execute("Select * from dataset").df()

If you want a pandas DataFrame, you can do:

dataset.to_table().to_pandas()

Note that to_table() loads the entire dataset into memory.

Answer 2

I am dealing with the same problem, and for me it works if I pass a pyarrow schema to the function:

import pandas as pd
import pyarrow as pa

schema = pa.schema([('SCORE_MTH', pa.string()), ('ACCOUNT_ID', pa.int32())])
pd.read_parquet('s3://path/to/folder//', schema=schema)  # works also with filters
