How to convert the string representation of an array into an actual array type in PySpark


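I have a column whose data arrives as the string representation of an array.

I tried casting it to an array type, but the data gets modified.

I also tried using a regular expression to strip the extra brackets, but that did not work.

The code is below. This snippet is meant to convert the string representation of the array into an actual array: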

df = df.withColumn("columns", split(df["columns"], ", "))

This is the regex code I tried:

df = df.withColumn(
    'columns',
    expr("transform(split(columns, ','), x -> trim('\"[]', x))")
)

Any help is greatly appreciated.

azure apache-spark pyspark casting
1 Answer

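Here is one possible approach. The strings in the column are valid JSON arrays, so they can be parsed with from_json and an ArrayType(StringType()) schema.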

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# Local session for the demo
spark = SparkSession.builder.master("local").getOrCreate()

# Each 'val' is a JSON array serialized as a string
data = [
    (1, '["First", "Second", "Third"]'),
    (2, '["First"]'),
    (3, '["Second", "Third"]'),
    (4, '["First", "Fourth"]'),
]

df1 = spark.createDataFrame(data, ['id', 'val'])

df1.show(n=100, truncate=False)

# Target type: an array of strings
my_schema = ArrayType(StringType())

# Parse the JSON string into a real array column
intermediate_df = df1.withColumn("array_string", F.from_json("val", schema=my_schema))

print("intermediate_df dataframe")
intermediate_df.show(n=20, truncate=False)

Output:

+---+----------------------------+
|id |val                         |
+---+----------------------------+
|1  |["First", "Second", "Third"]|
|2  |["First"]                   |
|3  |["Second", "Third"]         |
|4  |["First", "Fourth"]         |
+---+----------------------------+

intermediate_df dataframe
+---+----------------------------+----------------------+
|id |val                         |array_string          |
+---+----------------------------+----------------------+
|1  |["First", "Second", "Third"]|[First, Second, Third]|
|2  |["First"]                   |[First]               |
|3  |["Second", "Third"]         |[Second, Third]       |
|4  |["First", "Fourth"]         |[First, Fourth]       |
+---+----------------------------+----------------------+
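
To see why the plain split in the question appears to modify the data, here is a small sketch in plain Python (not PySpark) that applies the same two ideas to one sample value. It is only an analogy: `json.loads` stands in for what from_json does per row, and `parse_array_string` is a hypothetical helper name, not part of any Spark API.

```python
import json

raw = '["First", "Second", "Third"]'

# Splitting on ", " keeps the brackets and quotes inside the elements,
# so the result is not the list of values you actually want.
naive = raw.split(", ")
print(naive)

# Parsing the string as JSON recovers the intended elements.
parsed = json.loads(raw)
print(parsed)


def parse_array_string(s):
    """Hypothetical helper: return the parsed list, or None if s is not a JSON array."""
    try:
        value = json.loads(s)
    except (TypeError, ValueError):
        # None input or malformed JSON
        return None
    return value if isinstance(value, list) else None
```

Returning None for bad input mirrors, as far as I know, how from_json yields null for strings it cannot parse, so it is worth checking the new column for nulls if some rows are not valid JSON.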