PySpark: convert array to string and group by

Question

I am trying to group all the "documentTitle" values by the "code" column, where "code" is an array of strings. I came across a similar question, but the solution there does not seem to work. Here is the data schema:

(screenshot of the data schema, not reproduced here)

sqlContext.sql("""SELECT code,documentTitle FROM psyc2 """).show(10)

Query result:

+------------+--------------------+
|        code|       documentTitle|
+------------+--------------------+
|      [3297]|Discussions on ph...|
|      [3410]|Reflecting on lea...|
|      [3410]|Changing educatio...|
|[2227, 3410]|Assessment of med...|
|      [3410]|Training the trai...|
|[2224, 3371]|Improving the qua...|
|      [3410]|The effectiveness...|
|      [3410]|The impact of mul...|
|[3410, 4100]|Computer-aided le...|
|      [3410]|Setting and maint...|
+------------+--------------------+

How can I select the individual strings from the "code" column, which has an array type?

sql python-3.x pyspark apache-spark-sql pyspark-sql
1 Answer

Create the DataFrame:

from pyspark.sql.types import *
from pyspark.sql import functions as F

# Avoid shadowing the built-in name `list`
data = [[['3927'], 'Hey'],
        [['3410'], 'Yo'],
        [['3927'], 'Why'],
        [['2227', '3410'], 'Am'],
        [['3410', '3927'], 'Here']]
cSchema = StructType([StructField("Code", ArrayType(StringType())),
                      StructField("Document_title", StringType())])
df = spark.createDataFrame(data, schema=cSchema)
df.show()
df.show()
+------------+--------------+
|        Code|Document_title|
+------------+--------------+
|      [3927]|           Hey|
|      [3410]|            Yo|
|      [3927]|           Why|
|[2227, 3410]|            Am|
|[3410, 3927]|          Here|
+------------+--------------+

Explode the array column, then use groupBy with collect_list to gather the documents for each code:

df1=df.withColumn("Code",F.explode(F.col("Code")))
df1.groupBy(F.col("Code"))\
   .agg(F.collect_list("Document_title"))\
   .show()
+----+----------------------------+
|Code|collect_list(Document_title)|
+----+----------------------------+
|3927|            [Hey, Why, Here]|
|3410|              [Yo, Am, Here]|
|2227|                        [Am]|
+----+----------------------------+