How to convert the input string in each row of a PySpark column into a dictionary


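I have a DataFrame column whose values are input strings like the ones shown below. For each run of identical characters, startIndex is the index where the run begins, endIndex is the index where the run ends in the string, and flag is the character itself.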

    +---+------------------+
    | id|    Values        |
    +---+------------------+
    |01 |  AABBBAA         |
    |02 |  SSSAAAA         |
    +---+------------------+

Now I want to convert the string in each row into a list of dictionaries, as shown below:

    +---+--------------------+
    | id|    Values          |
    +---+--------------------+
    |01 |  [{"startIndex":0, |
    |   |    "endIndex" : 1, | 
    |   |    "flag" : A },   |
    |   |   {"startIndex":2, |
    |   |    "endIndex" : 4, |
    |   |    "flag" : B },   |
    |   |   {"startIndex":5, |
    |   |    "endIndex" : 6, |
    |   |    "flag" : A }]   |
    |02 |  [{"startIndex":0, |
    |   |    "endIndex" : 2, |
    |   |    "flag" : S },   |
    |   |   {"startIndex":3, |
    |   |    "endIndex" : 6, |
    |   |    "flag" : A }]   |
    +---+--------------------+

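I have pseudocode that builds the dictionaries, but I am not sure how to apply it to all rows at once without using a loop. Another problem with this approach is that only the last dictionary built ends up overwriting every row.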


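        import re

        x = "aaabbbbccaa"
        # each match is a tuple: (whole run, repeated character)
        xs = re.findall(r"((.)\2*)", x)
        print(xs)
        start = 0
        for item in xs:
            # a run of length n starting at `start` ends at start + n - 1
            end = start + (len(item[0]) - 1)
            startIndex = start
            endIndex = end
            qualityFlag = item[1]
            print(startIndex, endIndex, qualityFlag)
            # the next run starts right after this one ends
            start = end + 1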

python-3.x pyspark pyspark-sql pyspark-dataframes
1 answer
Use udf() to wrap the code logic, and use to_json() to convert the array of structs into a string:

    from pyspark.sql.functions import udf, to_json
    import re

    df = spark.createDataFrame(
        [('01', 'AABBBAA'), ('02', 'SSSAAAA')],
        ['id', 'Values'],
    )

    # argument `x` is a StringType() over the udf function
    # return `row` as a list of dicts
    @udf('array<struct<startIndex:long,endIndex:long,flag:string>>')
    def set_fields(x):
        row = []
        for m in re.finditer(r'(.)\1*', x):
            row.append({
                'startIndex': m.start(),
                'endIndex': m.end() - 1,
                'flag': m.group(1),
            })
        return row

    df.select('id', to_json(set_fields('Values')).alias('Values')).show(truncate=False)

    +---+----------------------------------------------------------------------------------------------------------------------------+
    |id |Values                                                                                                                      |
    +---+----------------------------------------------------------------------------------------------------------------------------+
    |01 |[{"startIndex":0,"endIndex":1,"flag":"A"},{"startIndex":2,"endIndex":4,"flag":"B"},{"startIndex":5,"endIndex":6,"flag":"A"}]|
    |02 |[{"startIndex":0,"endIndex":2,"flag":"S"},{"startIndex":3,"endIndex":6,"flag":"A"}]                                         |
    +---+----------------------------------------------------------------------------------------------------------------------------+
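If it helps to check the run-splitting logic before involving Spark, the same regex can be exercised as a plain Python function. This is only a minimal sketch: `runs_to_dicts` is a hypothetical name that does not appear in the answer, and the `None` guard is an added assumption to cover null cells, which a Python UDF receives as `None`.

```python
import json
import re


def runs_to_dicts(value):
    """Return one dict per run of identical characters in `value`.

    Hypothetical helper, not part of the original answer.
    """
    if value is None:
        # Assumption: null cells should stay null rather than raise.
        return None
    return [
        {
            "startIndex": m.start(),    # index of the first character in the run
            "endIndex": m.end() - 1,    # m.end() is exclusive, so subtract 1
            "flag": m.group(1),         # the repeated character itself
        }
        for m in re.finditer(r"(.)\1*", value)
    ]


if __name__ == "__main__":
    # Print the sample inputs from the question as JSON strings.
    for s in ("AABBBAA", "SSSAAAA"):
        print(json.dumps(runs_to_dicts(s)))
```

A function like this could presumably be wrapped with the same `udf(...)` schema string used above, so the logic gets tested once in plain Python and then reused per row.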