How to extract an element from an array in PySpark

Question

I have a DataFrame of the following form:

col1|col2|col3|col4
xxxx|yyyy|zzzz|[1111],[2222]

I want my output to look like this:

col1|col2|col3|col4|col5
xxxx|yyyy|zzzz|1111|2222

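My col4 is an array, and I want to split it into separate columns. What do I need to do?

I have seen many answers that use flatMap, but they add rows. I want the tuple placed in another column but in the same row.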

Here is my current schema:

root
 |-- PRIVATE_IP: string (nullable = true)
 |-- PRIVATE_PORT: integer (nullable = true)
 |-- DESTINATION_IP: string (nullable = true)
 |-- DESTINATION_PORT: integer (nullable = true)
 |-- collect_set(TIMESTAMP): array (nullable = true)
 |    |-- element: string (containsNull = true)

Also, could someone explain the difference between DataFrames and RDDs?

Answer 1

Create sample data:

from pyspark.sql import Row
# One sample row whose col4 is an array column
x = [Row(col1="xx", col2="yy", col3="zz", col4=[123, 234])]
rdd = sc.parallelize(x)
df = spark.createDataFrame(rdd)
df.show()
#+----+----+----+----------+
#|col1|col2|col3|      col4|
#+----+----+----+----------+
#|  xx|  yy|  zz|[123, 234]|
#+----+----+----+----------+

Use getItem to extract elements from the array column. In your actual case, replace col4 with collect_set(TIMESTAMP):

df = df.withColumn("col5", df["col4"].getItem(1)).withColumn("col4", df["col4"].getItem(0))
df.show()
#+----+----+----+----+----+
#|col1|col2|col3|col4|col5|
#+----+----+----+----+----+
#|  xx|  yy|  zz| 123| 234|
#+----+----+----+----+----+

Answer 2

You have 4 options for extracting values from an array:

df = spark.createDataFrame([[1, [10, 20, 30, 40]]], ['A', 'B'])
df.show()

+---+----------------+
|  A|               B|
+---+----------------+
|  1|[10, 20, 30, 40]|
+---+----------------+

from pyspark.sql import functions as F

df.select(
    "A",
    df.B[0].alias("B0"), # dot notation and index        
    F.col("B")[1].alias("B1"), # function col and index
    df.B.getItem(2).alias("B2"), # dot notation and method getItem
    F.col("B").getItem(3).alias("B3"), # function col and method getItem
).show()

+---+---+---+---+---+
|  A| B0| B1| B2| B3|
+---+---+---+---+---+
|  1| 10| 20| 30| 40|
+---+---+---+---+---+

If you have many columns, use a list comprehension:

df.select(
    'A', *[F.col('B')[i].alias(f'B{i}') for i in range(4)]
).show()

+---+---+---+---+---+
|  A| B0| B1| B2| B3|
+---+---+---+---+---+
|  1| 10| 20| 30| 40|
+---+---+---+---+---+
