spark.read.json throws COLUMN_ALREADY_EXISTS for columns whose names differ only in case and type

Problem description

I'm trying to read a huge unstructured JSON file in Spark. I've hit an edge case that seems to involve columns that differ only in case and type. Consider the script:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.json("ldap5.json")

and the input:

{"ldap":{"supportedLdapVersion":"3"}}
{"ldap":{"supportedLDAPVersion":["2","3"]}}

Here is the full stack trace:

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/03/12 09:28:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/03/12 09:28:33 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
/home/d33tah/virtualenv/lib/python3.7/site-packages/pyspark/context.py:317: FutureWarning: Python 3.7 support is deprecated in Spark 3.4.
  warnings.warn("Python 3.7 support is deprecated in Spark 3.4.", FutureWarning)
Traceback (most recent call last):
  File "run2.py", line 3, in <module>
    df = spark.read.json("ldap5.json")
  File "/home/d33tah/virtualenv/lib/python3.7/site-packages/pyspark/sql/readwriter.py", line 418, in json
    return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/home/d33tah/virtualenv/lib/python3.7/site-packages/py4j/java_gateway.py", line 1323, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/d33tah/virtualenv/lib/python3.7/site-packages/pyspark/errors/exceptions/captured.py", line 175, in deco
    raise converted from None
pyspark.errors.exceptions.captured.AnalysisException: [COLUMN_ALREADY_EXISTS] The column `supportedldapversion` already exists. Consider to choose another name or rename the existing column.

Versions:

> pyspark --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.1
      /_/
                        
Using Scala version 2.12.18, OpenJDK 64-Bit Server VM, 11.0.19
Branch HEAD
Compiled by user heartsavior on 2024-02-15T11:24:58Z
Revision fd86f85e181fc2dc0f50a096855acf83a6cc5d9c
Url https://github.com/apache/spark
Type --help for more information.

Assuming I have zero control over the JSON file, is there anything I can do to get the input to load successfully?

json apache-spark pyspark
1 Answer

I later found that this answer applies here as well:

Try setting spark.sql.caseSensitive to true (it defaults to false):

spark.conf.set("spark.sql.caseSensitive", True)

You can see its definition in the source: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L833
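
Putting it together, here is a minimal sketch of the fixed script from the question (same ldap5.json input as above; the flag must be set before the read so schema inference sees it):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# With case-sensitive analysis, supportedLdapVersion and supportedLDAPVersion
# are kept as two distinct fields instead of colliding as `supportedldapversion`.
spark.conf.set("spark.sql.caseSensitive", True)

df = spark.read.json("ldap5.json")
df.printSchema()

With the flag enabled, both spellings should survive as separate fields under the ldap struct, one inferred as a string and the other as an array of strings.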

Additionally, you can see how it affects the JDBC connector in JDBCWriteSuite: https://github.com/apache/spark/blob/ee95ec35b4f711fada4b62bc27281252850bb475/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala
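
If you'd rather bake the setting into the session than mutate the runtime config, it can also be passed at build time; a sketch (getOrCreate reuses any existing session in the process, so in a fresh process this applies from the start):

from pyspark.sql import SparkSession

# Build the session with case sensitivity already enabled.
spark = (
    SparkSession.builder
    .config("spark.sql.caseSensitive", "true")
    .getOrCreate()
)

df = spark.read.json("ldap5.json")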
