I am trying to create a Parquet file from MongoDB records. To do that, I first created a schema like this:
import pyarrow as pa
import pyarrow.parquet as pq
USER = pa.schema([
    pa.field("_id", pa.string(), nullable=True),
    pa.field("appID", pa.string(), nullable=True),
    pa.field("group", pa.string(), nullable=True),
    pa.field("_created", pa.int64(), nullable=True),
    pa.field("_touched", pa.int64(), nullable=True),
    pa.field("_updated", pa.int64(), nullable=True)
])
writer = pq.ParquetWriter('output.parquet', USER)
Then, while looping over the mongo documents, I tried to add the data to the Parquet file with the following:
batch = pa.RecordBatch.from_pylist(chunk)
writer.write_batch(batch)
I get this error:
Table schema does not match schema used to create file
This happens because not all mongo records contain the group field. How can I solve this?
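To fix the "Table schema does not match schema used to create file" error when creating a Parquet file from MongoDB records, make sure every batch you write has exactly the schema you passed to the writer. When no schema is given, the batch schema is inferred from the records themselves, so documents that lack group can produce a batch without that column.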
Here is an example of how to change the code to fill in the missing fields:
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

USER = pa.schema([
    pa.field("_id", pa.string(), nullable=True),
    pa.field("appID", pa.string(), nullable=True),
    pa.field("group", pa.string(), nullable=True),
    pa.field("_created", pa.int64(), nullable=True),
    pa.field("_touched", pa.int64(), nullable=True),
    pa.field("_updated", pa.int64(), nullable=True)
])

writer = pq.ParquetWriter('output.parquet', USER)

# mongo_docs is the iterable of documents read from MongoDB
for doc in mongo_docs:
    # Build a dict with every schema field; missing keys become None
    filled_doc = {field.name: doc.get(field.name, None) for field in USER}
    # Convert the single document to a batch that matches the USER schema
    batch = pa.RecordBatch.from_pandas(pd.DataFrame([filled_doc]), schema=USER)
    writer.write_batch(batch)

writer.close()
After all batches are written, don't forget to close the writer with writer.close() so that the Parquet file is finalized correctly.
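You can also record the nullable flag in the field's metadata. Note that nullable=True already marks the field as nullable, so this metadata entry is only an extra annotation: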
USER = pa.schema([
    pa.field("_id", pa.string(), nullable=True),
    pa.field("appID", pa.string(), nullable=True),
    pa.field("group", pa.string(), nullable=True, metadata={'nullable': 'true'}),
    pa.field("_created", pa.int64(), nullable=True),
    pa.field("_touched", pa.int64(), nullable=True),
    pa.field("_updated", pa.int64(), nullable=True)
])