I am trying to group all the document titles by the "code" column. "code" is an array of strings. I found a similar question, but its solution does not seem to work. Here is the data schema:
sqlContext.sql("""SELECT code,documentTitle FROM psyc2 """).show(10)
Query result:
+------------+--------------------+
| code| documentTitle|
+------------+--------------------+
| [3297]|Discussions on ph...|
| [3410]|Reflecting on lea...|
| [3410]|Changing educatio...|
|[2227, 3410]|Assessment of med...|
| [3410]|Training the trai...|
|[2224, 3371]|Improving the qua...|
| [3410]|The effectiveness...|
| [3410]|The impact of mul...|
|[3410, 4100]|Computer-aided le...|
| [3410]|Setting and maint...|
+------------+--------------------+
How can I select the individual strings from the code column, given that it has an array type?
from pyspark.sql.types import *
from pyspark.sql import functions as F

# Sample data mirroring the structure of the original table
data = [[['3927'], 'Hey'],
        [['3410'], 'Yo'],
        [['3927'], 'Why'],
        [['2227', '3410'], 'Am'],
        [['3410', '3927'], 'Here']]

cSchema = StructType([StructField("Code", ArrayType(StringType())),
                      StructField("Document_title", StringType())])

df = spark.createDataFrame(data, schema=cSchema)
df.show()
+------------+--------------+
| Code|Document_title|
+------------+--------------+
| [3927]| Hey|
| [3410]| Yo|
| [3927]| Why|
|[2227, 3410]| Am|
|[3410, 3927]| Here|
+------------+--------------+
# explode turns each element of the Code array into its own row,
# so a title appears once for every code it is tagged with
df1 = df.withColumn("Code", F.explode(F.col("Code")))
df1.groupBy(F.col("Code"))\
   .agg(F.collect_list("Document_title"))\
   .show()
+----+----------------------------+
|Code|collect_list(Document_title)|
+----+----------------------------+
|3927| [Hey, Why, Here]|
|3410| [Yo, Am, Here]|
|2227| [Am]|
+----+----------------------------+
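Conceptually, `explode` followed by `groupBy`/`collect_list` is the array-column equivalent of the following plain-Python aggregation (a sketch using the same toy rows, not Spark code):

```python
from collections import defaultdict

rows = [(["3927"], "Hey"),
        (["3410"], "Yo"),
        (["3927"], "Why"),
        (["2227", "3410"], "Am"),
        (["3410", "3927"], "Here")]

titles_by_code = defaultdict(list)
for codes, title in rows:
    for code in codes:                        # "explode": one entry per array element
        titles_by_code[code].append(title)    # "collect_list": gather titles per code

print(dict(titles_by_code))
# {'3927': ['Hey', 'Why', 'Here'], '3410': ['Yo', 'Am', 'Here'], '2227': ['Am']}
```

The result matches the Spark output above; Spark just distributes this same explode-then-group logic across the cluster.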