Read each JSON object as a single row in a DataFrame using PySpark?


I have the following JSON file:

{"name":"John", "age":31, "city":"New York"}
{"name":"Henry", "age":41, "city":"Boston"}
{"name":"Dave", "age":26, "city":"New York"}

So I need to read each JSON line as a single row in a DataFrame.

Below is the expected output.

[Image: expected output]

I tried the following code:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('Read Json') \
    .getOrCreate()

df = spark.read.format('json').load('sample_json')
df.show()

But I only get the output below.

[Image: actual output]

Please help me with this. Thanks in advance.

python python-3.x apache-spark pyspark apache-spark-sql
1 Answer

Read the file as JSON, then use the to_json function to create the Json_column.

1. Using the to_json function:

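from pyspark.sql.functions import col, struct, to_json

# Parse the file as JSON, then serialize each row's columns back into one JSON string.
spark.read.json("sample.json").\
withColumn("Json_column", to_json(struct(col("age"), col("city"), col("name")))).\
show(10, False)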
#+---+--------+-----+------------------------------------------+
#|age|city    |name |Json_column                               |
#+---+--------+-----+------------------------------------------+
#|31 |New York|John |{"age":31,"city":"New York","name":"John"}|
#|41 |Boston  |Henry|{"age":41,"city":"Boston","name":"Henry"} |
#|26 |New York|Dave |{"age":26,"city":"New York","name":"Dave"}|
#+---+--------+-----+------------------------------------------+

# Or, more dynamically, build the struct from every column in the DataFrame
df = spark.read.json("sample.json")
df.withColumn("Json_column", to_json(struct([col(c) for c in df.columns]))).show(10, False)
#+---+--------+-----+------------------------------------------+
#|age|city    |name |Json_column                               |
#+---+--------+-----+------------------------------------------+
#|31 |New York|John |{"age":31,"city":"New York","name":"John"}|
#|41 |Boston  |Henry|{"age":41,"city":"Boston","name":"Henry"} |
#|26 |New York|Dave |{"age":26,"city":"New York","name":"Dave"}|
#+---+--------+-----+------------------------------------------+
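
One thing worth noticing in the output above: Json_column is rebuilt from the parsed columns, so it is not a character-for-character copy of the line in the file. The keys appear in column order (age, city, name) and the spaces after the commas are gone. The following is a plain-Python sketch of that effect using the standard json module, not Spark itself. The helper name reserialize is hypothetical, and sort_keys=True is used only because it happens to reproduce the alphabetical column order shown above.

```python
import json

# The three lines from the question's sample file.
raw_lines = [
    '{"name":"John", "age":31, "city":"New York"}',
    '{"name":"Henry", "age":41, "city":"Boston"}',
    '{"name":"Dave", "age":26, "city":"New York"}',
]

def reserialize(line):
    """Parse one JSON line and write it back compactly with sorted keys."""
    record = json.loads(line)
    return json.dumps(record, sort_keys=True, separators=(",", ":"))

for line in raw_lines:
    # Same data, but the text differs from the original line.
    print(reserialize(line))
```

If the original text of each line has to be preserved exactly, reading the file as text, as in the second approach, leaves each line untouched.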

2. Another approach, using the get_json_object function:

Read the JSON file as text, then create the name, age, and city columns from the JSON object.

from pyspark.sql.functions import col, get_json_object

# Read each line as plain text (a single "value" column), then extract fields by JSON path.
spark.read.text("sample.json").\
withColumn("name", get_json_object(col("value"), "$.name")).\
withColumn("city", get_json_object(col("value"), "$.city")).\
withColumn("age", get_json_object(col("value"), "$.age")).\
withColumnRenamed("value", "Json_column").\
select("age", "city", "name", "Json_column").\
show(10, False)
#+---+--------+-----+--------------------------------------------+
#|age|city    |name |Json_column                                 |
#+---+--------+-----+--------------------------------------------+
#|31 |New York|John |{"name":"John", "age":31, "city":"New York"}|
#|41 |Boston  |Henry|{"name":"Henry", "age":41, "city":"Boston"} |
#|26 |New York|Dave |{"name":"Dave", "age":26, "city":"New York"}|
#+---+--------+-----+--------------------------------------------+