Spark / Scala: parse a JSON file that contains only a list of lists

Problem description · votes: 0 · answers: 1

I am trying to parse a simple JSON file that contains only a list of lists, [[1,"a"],[2,"b"]], using Spark/Scala or using Scala with Jackson.

When I try Spark, I get the following error:

//simple line of code
spark.read.json(filePath).show
//error
 Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().;
    at org.apache.spark.sql.execution.datasources.json.JsonFileFormat.buildReader(JsonFileFormat.scala:118)
    at org.apache.spark.sql.execution.datasources.FileFormat$class.buildReaderWithPartitionValues(FileFormat.scala:129)
    at org.apache.spark.sql.execution.datasources.TextBasedFileFormat.buildReaderWithPartitionValues(FileFormat.scala:160)
    at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:295)
    at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:293)
    at org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:313)
    at org.apache.spark.sql.execution.BaseLimitExec$class.inputRDDs(limit.scala:62)
    at org.apache.spark.sql.execution.LocalLimitExec.inputRDDs(limit.scala:97)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:337)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
    at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2698)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:723)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:682)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:691)
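
Spark's JSON source expects one JSON object per line (or, with the multiLine option, a top-level array of objects), so for this file the inferred schema ends up containing only the internal _corrupt_record column, which is what the error above is complaining about. A minimal sketch that side-steps this (not from the question; it assumes the file holds exactly the array shown above, and the id/value column names are invented) is to read the file as plain text and parse it with json4s before handing it to Spark:

//sketch: read the file as text and parse it manually instead of spark.read.json
//assumes one small file containing exactly [[1,"a"],[2,"b"]]; "id"/"value" are invented names
import org.json4s.jackson.JsonMethods.parse
import spark.implicits._

val raw = spark.sparkContext.wholeTextFiles(filePath).values.first()
val pairs = parse(raw).values.asInstanceOf[List[List[Any]]]
  .map(inner => (inner(0).toString.toInt, inner(1).toString))
pairs.toDF("id", "value").show()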

I also tried parsing it into a case class with Jackson (via json4s), but it gives me an empty list:

import scala.util.Try
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

case class AppData(apps: List[(Int, String)])

def extractJsonFromStr[T](jsonString: String)(implicit m: Manifest[T]): Try[T] = {
  implicit val formats: DefaultFormats.type = DefaultFormats
  Try {
    parse(jsonString).extract[T]
  }
}

//gives back an empty apps list instead of the expected pairs
extractJsonFromStr[AppData]("""[[1,"a"],[2,"b"]]""")


json scala apache-spark apache-spark-sql jackson
1 Answer

Instead of using a case class, you can parse it by telling the parser to read basic types. This works with Jackson (via json4s), GSON, and probably other parsers (I only tested those two):
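
A minimal sketch of that idea with json4s, where the List[List[Any]] target type is just an illustrative choice (the parsed numbers come back as BigInt):

//basic-types parsing with json4s: no case class, just nested Scala lists
import org.json4s.jackson.JsonMethods.parse

val data = parse("""[[1,"a"],[2,"b"]]""").values.asInstanceOf[List[List[Any]]]
//data: List(List(1, a), List(2, b))

With GSON the equivalent would be roughly new Gson().fromJson(json, classOf[java.util.List[_]]), which returns nested java.util.Lists with the numbers as Double.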
