Serializing a Spark DataFrame to Avro in Spark using to_avro

Question

I have a Spark DataFrame with the following schema:

StructType(
    StructField(id,StringType,true),
    StructField(type,StringType,true),
)

I need to convert it to Avro using the to_avro function from spark-avro, called as shown here with the Avro schema that follows:

to_avro(spark_df, jsonFormatSchema)

{
  "type": "record",
  "name": "Value",
  "fields": [
    {
      "name": "id",
      "type": "string"
    },
    {
      "name": "type",
      "type": "string"
    },
    {
      "name": "x",
      "type": [
        "null",
        "string"
      ],
      "default": null
    },
    {
      "name": "y",
      "type": [
        {
          "type": "boolean",
          "connect.default": false
        },
        "null"
      ],
      "default": false
    }
  ]
}

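Obviously, my Spark DataFrame has no x and y columns. How can I define the Avro schema so that the Avro binary my DataFrame is serialized to contains null/default values for these fields, instead of throwing an IncompleteSchemaException?

I thought the "null" entry in the type array would cover fields that are missing from the input DataFrame, but that turned out to be wrong.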

Answer 1

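The problem is that default values are only used when decoding, not when encoding. See this section of the specification: https://avro.apache.org/docs/current/specification/#schema-record

Specifically, this part: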

A default value for this field, only used when reading instances that lack the field for schema evolution purposes. The presence of a default value does not make the field optional at encoding time.
