Kafka JSON data is null with a schema in PySpark Structured Streaming; mismatched input with the new schema

Question  Votes: -1  Answers: 1

I am trying to read JSON-formatted Kafka messages in Spark Structured Streaming. A sample message from Kafka looks like this.

{
  "_id": {
    "$oid": "5eb292531c7d910b8c98dbce"
  },
  "Id": 37,
  "Timestamp": {
    "$date": 1582889068616
  },
  "TTNR": "R902170286",
  "SNR": 91177446,
  "State": 0,
  "I_A1": "FALSE",
  "I_B1": "FALSE",
  "I1": 0.0037385,
  "Mabs": -20.9814753,
  "p_HD1": 31.0069236,
  "pG": 27.640614,
  "pT": 1.7169713,
  "pXZ": 3.4712914,
  "T3": 25.2174444,
  "nan": 179.3099976,
  "Q1": 0,
  "a_01X": [
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925,
    62.7839925
  ]
}

After reading the stream from Kafka, the value column, cast to a string, looks like this.

|value                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
{"_id":{"$oid":"5eb292531c7d910b8c98dbce"},"Id":37,"Timestamp":{"$date":1582889068616},"TTNR":"R902170286","SNR":91177446,"State":0,"I_A1":"FALSE","I_B1":"FALSE","I1":0.0037385,"Mabs":-20.9814753,"p_HD1":31.0069236,"pG":27.640614,"pT":1.7169713,"pXZ":3.4712914,"T3":25.2174444,"nan":179.3099976,"Q1":0,"a_01X":[62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925]}

|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I have defined a schema to select some of the fields, as shown below.

from pyspark.sql.types import (
    ArrayType, DoubleType, LongType, StringType, StructField, StructType,
)

# Schema as originally posted; "_id" and "Timestamp" are nested objects in the message.
json_schema = StructType([
    StructField("_id", StructField("$oid", StringType())),
    StructField("Id", DoubleType()),
    StructField("Timestamp", StructField("$date", LongType())),
    StructField("TTNR", StringType()),
    StructField("SNR", DoubleType()),
    StructField("State", LongType()),
    StructField("I_A1", StringType()),
    StructField("I_B1", StringType()),
    StructField("I1", DoubleType()),
    StructField("Mabs", DoubleType()),
    StructField("p_HD1", DoubleType()),
    StructField("pG", DoubleType()),
    StructField("pT", DoubleType()),
    StructField("pXZ", DoubleType()),
    StructField("T3", DoubleType()),
    StructField("nan", DoubleType()),
    StructField("Q1", LongType()),
    StructField("a_01X", ArrayType(DoubleType())),
])

(The parsing error is resolved.) However, when I try to print to the console, I get null values.

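from pyspark.sql.functions import col, from_json

# Parse the string "value" column into a struct column using the schema above.
data_stream_json = data_stream_value.select(from_json(col("value"), json_schema).alias("json_detail"))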
data_stream_output = data_stream_json \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

+----+----+----+----+
|  Id|TTNR| SNR|  Q1|
+----+----+----+----+
|null|null|null|null|
+----+----+----+----+

(New error) After changing the schema, a new problem appeared: an error while parsing the schema string.

pyspark.sql.utils.ParseException: u'\nmismatched input \'{\' expecting {\'SELECT\', \'FROM\', \'ADD\', \'AS\', \'ALL\', \'ANY\', \'DISTINCT\', \'WHERE\', \'GROUP\', \'BY\', \'GROUPING\', \'SETS\', \'CUBE\', \'ROLLUP\', \'ORDER\', \'HAVING\', \'LIMIT\', \'AT\', \'OR\', \'AND\', \'IN\', NOT, \'NO\', \'EXISTS\', \'BETWEEN\', \'LIKE\', RLIKE, \'IS\', \'NULL\', \'TRUE\', \'FALSE\', \'NULLS\', \'ASC\', \'DESC\', \'FOR\', \'INTERVAL\', \'CASE\', \'WHEN\', \'THEN\', \'ELSE\', \'END\', \'JOIN\', \'CROSS\', \'OUTER\', \'INNER\', \'LEFT\', \'SEMI\', \'RIGHT\', \'FULL\', \'NATURAL\', \'ON\', \'PIVOT\', \'LATERAL\', \'WINDOW\', \'OVER\', \'PARTITION\', \'RANGE\', \'ROWS\', \'UNBOUNDED\', \'PRECEDING\', \'FOLLOWING\', \'CURRENT\', \'FIRST\', \'AFTER\', \'LAST\', \'ROW\', \'WITH\', \'VALUES\', \'CREATE\', \'TABLE\', \'DIRECTORY\', \'VIEW\', \'REPLACE\', \'INSERT\', \'DELETE\', \'INTO\', \'DESCRIBE\', \'EXPLAIN\', \'FORMAT\', \'LOGICAL\', \'CODEGEN\', \'COST\', \'CAST\', \'SHOW\', \'TABLES\', \'COLUMNS\', \'COLUMN\', \'USE\', \'PARTITIONS\', \'FUNCTIONS\', \'DROP\', \'UNION\', \'EXCEPT\', \'MINUS\', \'INTERSECT\', \'TO\', \'TABLESAMPLE\', \'STRATIFY\', \'ALTER\', \'RENAME\', \'ARRAY\', \'MAP\', \'STRUCT\', \'COMMENT\', \'SET\', \'RESET\', \'DATA\', \'START\', \'TRANSACTION\', \'COMMIT\', \'ROLLBACK\', \'MACRO\', \'IGNORE\', \'BOTH\', \'LEADING\', \'TRAILING\', \'IF\', \'POSITION\', \'EXTRACT\', \'DIV\', \'PERCENT\', \'BUCKET\', \'OUT\', \'OF\', \'SORT\', \'CLUSTER\', \'DISTRIBUTE\', \'OVERWRITE\', \'TRANSFORM\', \'REDUCE\', \'SERDE\', \'SERDEPROPERTIES\', \'RECORDREADER\', \'RECORDWRITER\', \'DELIMITED\', \'FIELDS\', \'TERMINATED\', \'COLLECTION\', \'ITEMS\', \'KEYS\', \'ESCAPED\', \'LINES\', \'SEPARATED\', \'FUNCTION\', \'EXTENDED\', \'REFRESH\', \'CLEAR\', \'CACHE\', \'UNCACHE\', \'LAZY\', \'FORMATTED\', \'GLOBAL\', TEMPORARY, \'OPTIONS\', \'UNSET\', \'TBLPROPERTIES\', \'DBPROPERTIES\', \'BUCKETS\', \'SKEWED\', \'STORED\', \'DIRECTORIES\', \'LOCATION\', \'EXCHANGE\', 
\'ARCHIVE\', \'UNARCHIVE\', \'FILEFORMAT\', \'TOUCH\', \'COMPACT\', \'CONCATENATE\', \'CHANGE\', \'CASCADE\', \'RESTRICT\', \'CLUSTERED\', \'SORTED\', \'PURGE\', \'INPUTFORMAT\', \'OUTPUTFORMAT\', DATABASE, DATABASES, \'DFS\', \'TRUNCATE\', \'ANALYZE\', \'COMPUTE\', \'LIST\', \'STATISTICS\', \'PARTITIONED\', \'EXTERNAL\', \'DEFINED\', \'REVOKE\', \'GRANT\', \'LOCK\', \'UNLOCK\', \'MSCK\', \'REPAIR\', \'RECOVER\', \'EXPORT\', \'IMPORT\', \'LOAD\', \'ROLE\', \'ROLES\', \'COMPACTIONS\', \'PRINCIPALS\', \'TRANSACTIONS\', \'INDEX\', \'INDEXES\', \'LOCKS\', \'OPTION\', \'ANTI\', \'LOCAL\', \'INPATH\', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 0)\n\n== SQL ==\n{"fields":[{"metadata":{},"name":"_id","nullable":true,"type":{"metadata":{},"name":"$oid","nullable":true,"type":"string"}},{"metadata":{},"name":"Id","nullable":true,"type":"double"},{"metadata":{},"name":"Timestamp","nullable":true,"type":{"metadata":{},"name":"$date","nullable":true,"type":"long"}},{"metadata":{},"name":"TTNR","nullable":true,"type":"string"},{"metadata":{},"name":"SNR","nullable":true,"type":"double"},{"metadata":{},"name":"State","nullable":true,"type":"long"},{"metadata":{},"name":"I_A1","nullable":true,"type":"string"},{"metadata":{},"name":"I_B1","nullable":true,"type":"string"},{"metadata":{},"name":"I1","nullable":true,"type":"double"},{"metadata":{},"name":"Mabs","nullable":true,"type":"double"},{"metadata":{},"name":"p_HD1","nullable":true,"type":"double"},{"metadata":{},"name":"pG","nullable":true,"type":"double"},{"metadata":{},"name":"pT","nullable":true,"type":"double"},{"metadata":{},"name":"pXZ","nullable":true,"type":"double"},{"metadata":{},"name":"T3","nullable":true,"type":"double"},{"metadata":{},"name":"nan","nullable":true,"type":"double"},{"metadata":{},"name":"Q1","nullable":true,"type":"long"},{"metadata":{},"name":"a_01X","nullable":true,"type":{"containsNull":true,"elementType":"double","type":"array"}}],"type":"struct"}\n^^^\n'

I would appreciate some help.

apache-spark pyspark apache-kafka apache-spark-sql spark-streaming
1 answer
1
votes

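Note: If you have complex nested JSON, try this approach. The DataType.fromJson method converts a JSON schema into a StructType schema and lets you keep the JSON schema outside your code. For any schema change, you only need to update the JSON schema and restart your application; it will pick up the new schema automatically.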

I have converted the JSON data into a schema string; please check the code below.

scala> val jsonSchema = """{"type":"struct","fields":[{"name":"I1","type":"double","nullable":true,"metadata":{}},{"name":"I_A1","type":"string","nullable":true,"metadata":{}},{"name":"I_B1","type":"string","nullable":true,"metadata":{}},{"name":"Id","type":"long","nullable":true,"metadata":{}},{"name":"Mabs","type":"double","nullable":true,"metadata":{}},{"name":"Q1","type":"long","nullable":true,"metadata":{}},{"name":"SNR","type":"long","nullable":true,"metadata":{}},{"name":"State","type":"long","nullable":true,"metadata":{}},{"name":"T3","type":"double","nullable":true,"metadata":{}},{"name":"TTNR","type":"string","nullable":true,"metadata":{}},{"name":"Timestamp","type":{"type":"struct","fields":[{"name":"$date","type":"long","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"_id","type":{"type":"struct","fields":[{"name":"$oid","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"a_01X","type":{"type":"array","elementType":"double","containsNull":true},"nullable":true,"metadata":{}},{"name":"nan","type":"double","nullable":true,"metadata":{}},{"name":"pG","type":"double","nullable":true,"metadata":{}},{"name":"pT","type":"double","nullable":true,"metadata":{}},{"name":"pXZ","type":"double","nullable":true,"metadata":{}},{"name":"p_HD1","type":"double","nullable":true,"metadata":{}}]}"""
jsonSchema: String = {"type":"struct","fields":[{"name":"I1","type":"double","nullable":true,"metadata":{}},{"name":"I_A1","type":"string","nullable":true,"metadata":{}},{"name":"I_B1","type":"string","nullable":true,"metadata":{}},{"name":"Id","type":"long","nullable":true,"metadata":{}},{"name":"Mabs","type":"double","nullable":true,"metadata":{}},{"name":"Q1","type":"long","nullable":true,"metadata":{}},{"name":"SNR","type":"long","nullable":true,"metadata":{}},{"name":"State","type":"long","nullable":true,"metadata":{}},{"name":"T3","type":"double","nullable":true,"metadata":{}},{"name":"TTNR","type":"string","nullable":true,"metadata":{}},{"name":"Timestamp","type":{"type":"struct","fields":[{"name":"$date","type":"long","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"_id","type":{"type":"struct","fields":[{"name":"$oid","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"a_01X","type":{"type":"array","elementType":"double","containsNull":true},"nullable":true,"metadata":{}},{"name":"nan","type":"double","nullable":true,"metadata":{}},{"name":"pG","type":"double","nullable":true,"metadata":{}},{"name":"pT","type":"double","nullable":true,"metadata":{}},{"name":"pXZ","type":"double","nullable":true,"metadata":{}},{"name":"p_HD1","type":"double","nullable":true,"metadata":{}}]}

scala> val schema = DataType.fromJson(jsonSchema).asInstanceOf[StructType]
schema: org.apache.spark.sql.types.StructType = StructType(StructField(I1,DoubleType,true), StructField(I_A1,StringType,true), StructField(I_B1,StringType,true), StructField(Id,LongType,true), StructField(Mabs,DoubleType,true), StructField(Q1,LongType,true), StructField(SNR,LongType,true), StructField(State,LongType,true), StructField(T3,DoubleType,true), StructField(TTNR,StringType,true), StructField(Timestamp,StructType(StructField($date,LongType,true)),true), StructField(_id,StructType(StructField($oid,StringType,true)),true), StructField(a_01X,ArrayType(DoubleType,true),true), StructField(nan,DoubleType,true), StructField(pG,DoubleType,true), StructField(pT,DoubleType,true), StructField(pXZ,DoubleType,true), StructField(p_HD1,DoubleType,true))




1
votes

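I don't know your full code, but from what you have posted here, I think you first need to convert your Kafka input to a string, because it initially arrives in hexadecimal (binary) form, and then apply your schema to that string.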


0
votes

I figured it out.

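The trick was to change my Kafka serializer from AVRO to string format. Although AVRO preserves the schema, it also introduces some prefix characters, such as a newline (see below), which in my case were hard to strip out before parsing the value as JSON.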

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
{"_id":{"$oid":"5e58f86d5afd84039c135405"},"Id":1,"Timestamp":{"$date":1582889068580},"TTNR":"R902170286","SNR":92177446,"State":0,"I_A1":"FALSE","I_B1":"FALSE","I1":0.0036622,"Mabs":-20.5236976,"p_HD1":30.985062,"pG":27.7779473,"pT":1.727958,"pXZ":3.4487671,"T3":25.2296518,"nan":215.3000031,"Q1":0,"a_01X":[62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925]}
|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Taking my input as a string introduced more fields, but those are easier to drop. I had to define a larger schema, but the parsing succeeded.
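For readers who cannot change the serializer, the effect of such prefix characters can be illustrated outside Spark. The sketch below is not the approach taken in this answer; it assumes the payload is JSON text preceded by a few non-JSON bytes and drops everything before the first `{`. The function name `parse_prefixed_json` and the sample prefix bytes are hypothetical.

```python
import json


def parse_prefixed_json(raw: bytes) -> dict:
    """Drop any bytes before the first '{' and parse the remainder as JSON."""
    start = raw.find(b"{")
    if start == -1:
        raise ValueError("no JSON object found in payload")
    return json.loads(raw[start:].decode("utf-8"))


# Hypothetical payload: two prefix bytes followed by a shortened JSON message.
payload = b"\x00\x0a" + b'{"Id": 1, "TTNR": "R902170286", "Q1": 0}'
record = parse_prefixed_json(payload)
print(record["TTNR"])
```

This only handles a leading prefix; trailing bytes, or a prefix that itself contains `{`, would need different handling.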
