Converting a string list to an array type

Question

I have a DataFrame with a column whose data type is string, but whose contents actually represent an array.

import pyspark
from pyspark.sql import Row
item = spark.createDataFrame([Row(item='fish',geography=['london','a','b','hyd']),
                              Row(item='chicken',geography=['a','hyd','c']),
                              Row(item='rice',geography=['a','b','c','blr']),
                              Row(item='soup',geography=['a','kol','simla']),
                              Row(item='pav',geography=['a','del']),
                              Row(item='kachori',geography=['a','guj']),
                              Row(item='fries',geography=['a','chen']),
                              Row(item='noodles',geography=['a','mum'])])
item.show()
# +-------+-------------------+
# |   item|          geography|
# +-------+-------------------+
# |   fish|[london, a, b, hyd]|
# |chicken|        [a, hyd, c]|
# |   rice|     [a, b, c, blr]|
# |   soup|    [a, kol, simla]|
# |    pav|           [a, del]|
# |kachori|           [a, guj]|
# |  fries|          [a, chen]|
# |noodles|           [a, mum]|
# +-------+-------------------+

item.printSchema()
#  root
#  |-- item: string (nullable = true)
#  |-- geography: string (nullable = true)

How can I convert the geography column in the dataset above to an array type?

Answer 1

F.expr(r"regexp_extract_all(geography, '(\\w+)', 1)")

regexp_extract_all is available from Spark 3.1+.

regexp_extract_all(str, regexp[, idx]) - Extracts all strings in str that match the regexp expression and correspond to the regex group index.

from pyspark.sql import Row, functions as F

item = spark.createDataFrame([Row(item='fish',geography="['london','a','b','hyd']"),
                              Row(item='noodles',geography="['a','mum']")])
item.printSchema()
# root
#  |-- item: string (nullable = true)
#  |-- geography: string (nullable = true)


item = item.withColumn('geography', F.expr(r"regexp_extract_all(geography, '(\\w+)', 1)"))

item.printSchema()
# root
#  |-- item: string (nullable = true)
#  |-- geography: array (nullable = true)
#  |    |-- element: string (containsNull = true)
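
To see what the pattern `(\w+)` pulls out of each cell without starting Spark, here is a minimal sketch that uses Python's `re` module as a stand-in for Spark's regex engine. The helper name `extract_all` is hypothetical and only mimics the documented behaviour of `regexp_extract_all` on this ASCII sample.

```python
import re

def extract_all(text, pattern=r"(\w+)", idx=1):
    """Return capture group `idx` of every match of `pattern` in `text`."""
    return [m.group(idx) for m in re.finditer(pattern, text)]

# Same string-typed cells as in the answer above.
rows = ["['london','a','b','hyd']", "['a','mum']"]
parsed = [extract_all(r) for r in rows]
print(parsed)  # [['london', 'a', 'b', 'hyd'], ['a', 'mum']]

# Caveat: \w+ also splits values that contain spaces or hyphens.
print(extract_all("['new york','a']"))  # ['new', 'york', 'a']
```

If the real values can contain spaces or punctuation, a pattern that captures everything between the quotes may be a better fit than `(\w+)`.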

Answer 2

Use split

Option 1

from pyspark.sql.functions import col, regexp_replace, split
new = item.withColumn('geography', split(regexp_replace('geography', r'[^\w,]', ''), ','))
new.printSchema()

Option 2

new1 = item.withColumn('geography', split(col('geography').cast('string'), ','))
new1.printSchema()  # elements keep the brackets and quotes, e.g. "['london'"
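
The difference between the two options is easier to see on a single value. The sketch below reproduces both in plain Python, with `re.sub` and `str.split` standing in for Spark's `regexp_replace` and `split`; it assumes ASCII values like the sample data.

```python
import re

raw = "['london','a','b','hyd']"

# Option 1: drop everything except word characters and commas, then split.
option1 = re.sub(r"[^\w,]", "", raw).split(",")

# Option 2: split only; brackets and quotes stay attached to the elements.
option2 = raw.split(",")

print(option1)  # ['london', 'a', 'b', 'hyd']
print(option2)  # ["['london'", "'a'", "'b'", "'hyd']"]
```

Both options yield an array column, but only Option 1 gives clean element values.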

Answer 3

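You can define a schema for the column and then convert it to an array type, as follows:

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import ArrayType, StringType
schema = ArrayType(StringType())
item = item.withColumn('geography', from_json(col('geography'), schema))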
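
Note that the sample strings use single quotes, so they are not strict JSON. As far as I know, Spark's JSON parser accepts single-quoted strings by default (the `allowSingleQuotes` option), which would be why `from_json` can still parse them; treat that as an assumption to verify on your Spark version. A quick local check in plain Python, using `ast.literal_eval` as a stand-in parser for sample values:

```python
import ast
import json

raw = "['a','mum']"

# Strict JSON requires double quotes, so json.loads rejects this value.
try:
    json.loads(raw)
    strict_json_ok = True
except json.JSONDecodeError:
    strict_json_ok = False

# The value is a valid Python list literal, so literal_eval can parse it.
parsed = ast.literal_eval(raw)

print(strict_json_ok)  # False
print(parsed)          # ['a', 'mum']
```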

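If you want to create another DataFrame with a single value per cell, you can do the following (this assumes the DataFrame also has an id column, which the sample data in the question does not include):

from pyspark.sql.functions import explode
item_geography = item.select("id", "item", explode("geography").alias("geography")).withColumnRenamed('id', 'item_primary_id')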
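
What `explode` does here can be sketched in plain Python: every array element becomes its own row, and the other columns are repeated. The `id` values below are hypothetical, since the question's data has no such column.

```python
# Rows after the geography column has been converted to an array.
rows = [
    {"id": 1, "item": "fish", "geography": ["london", "a", "b", "hyd"]},
    {"id": 2, "item": "noodles", "geography": ["a", "mum"]},
]

# One output row per array element, mirroring explode("geography")
# followed by renaming id to item_primary_id.
exploded = [
    {"item_primary_id": r["id"], "item": r["item"], "geography": g}
    for r in rows
    for g in r["geography"]
]

for row in exploded:
    print(row)
```

As in this sketch, a row whose array is empty produces no output rows.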
