I have a column where the data comes in as a string representation of an array.
I tried casting it to an array type, but the data gets mangled.
I also tried using a regular expression to strip the extra brackets, but that didn't work.
The code below is what I used to convert the string representation into an actual array:
df = df.withColumn("columns", split(df["columns"], ", "))
Here is the regex-based code I tried:
df = df.withColumn(
'columns',
expr("transform(split(columns, ','), x -> trim('\"[]', x))")
)
Any help is greatly appreciated.
Here is one possible approach.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.master('local').getOrCreate()

data = [
    (1, '["First", "Second", "Third"]'),
    (2, '["First"]'),
    (3, '["Second", "Third"]'),
    (4, '["First", "Fourth"]')
]
df1 = spark.createDataFrame(data, ['id', 'val'])
df1.show(n=100, truncate=False)

print("Parse the string representation into an array column")
# The string values are valid JSON arrays, so from_json can parse them
# directly into ArrayType(StringType()) -- no regex handling needed.
my_schema = ArrayType(StringType())
intermediate_df = df1.withColumn("array_string", F.from_json("val", schema=my_schema))
print("intermediate_df dataframe")
intermediate_df.show(n=20, truncate=False)
Output:
+---+----------------------------+
|id |val |
+---+----------------------------+
|1 |["First", "Second", "Third"]|
|2 |["First"] |
|3 |["Second", "Third"] |
|4 |["First", "Fourth"] |
+---+----------------------------+
intermediate_df dataframe
+---+----------------------------+----------------------+
|id |val |array_string |
+---+----------------------------+----------------------+
|1 |["First", "Second", "Third"]|[First, Second, Third]|
|2 |["First"] |[First] |
|3 |["Second", "Third"] |[Second, Third] |
|4 |["First", "Fourth"] |[First, Fourth] |
+---+----------------------------+----------------------+
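Since the incoming strings are valid JSON arrays, `from_json` is effectively doing per row what Python's `json.loads` does to a single string. A minimal plain-Python sketch of that per-row parsing (no Spark needed, just to illustrate why no regex cleanup is required):

```python
import json

# Sample strings shaped like the 'val' column above
rows = [
    '["First", "Second", "Third"]',
    '["First"]',
]

# json.loads turns each JSON array string into a real Python list,
# mirroring what from_json does for each row of the column.
parsed = [json.loads(s) for s in rows]
print(parsed)  # [['First', 'Second', 'Third'], ['First']]
```

If the strings ever contain malformed JSON, `from_json` returns null for that row rather than raising, so it is also safer than hand-rolled bracket stripping.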