Splitting a string that contains nested parentheses in Spark SQL

Problem description

I have a string column that contains nested parentheses, and I need to recursively extract the sub-expressions from that string.

For example:

A1 AND (A4 OR (A2 AND A3)) AND (B1 OR B2) should be split into (expected output):

 1. A2 AND A3 -> logic1
 2. A4 OR logic1 -> logic2
 3. B1 OR B2 -> logic3
 4. logic2 AND logic3 -> logic4
 5. A1 AND logic4 -> logic5

(A2 AND A3) OR (B1 AND B2) should be split into (expected output):

 1. B1 AND B2 -> logic1
 2. A2 AND A3 -> logic2
 3. logic1 OR logic2 -> logic3

So far I have tried the split function (splitting on '(' in Spark SQL) together with dense_rank(), ordered by the number of parentheses in each piece:

with t1 as
(
  select '(A2 AND A3) OR (B1 AND B2)' as logic from table limit 1
),
t1_exploded as
(
  select logic, explode(split(logic, '\\(')) as logic_splitted from t1
),
t1_logic_splitted as
(
  select logic,
         trim(logic_splitted) as logic_splitted,
         dense_rank() over (
           order by len(logic_splitted) - len(replace(logic_splitted, ')', '')) desc,
                    logic_splitted desc
         ) as logic_group
  from t1_exploded
)
select * from t1_logic_splitted

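However, I run into problems when there are more than two AND/OR operators. Please let me know if there is a better option. Any help is much appreciated.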

python pyspark apache-spark-sql
1 Answer

This is quite easy to do with the lark Python library.

$ pip install lark --upgrade

Then you need to create a grammar that can parse your expressions.

Here is the script:

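# Explicit imports avoid shadowing Python builtins with pyspark.sql.functions names
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col, explode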


spark = SparkSession.builder.appName("NoUnstack").getOrCreate()

schema = StructType([
    StructField("exp_id", IntegerType(), True),
    StructField("boolean_expression", StringType(), True),

])

data = [
    (1, "A1 AND (A4 OR (A2 AND A3)) AND (B1 OR B2)"),
    (2, "(A2 AND A3) OR (B1 AND B2)"),

]

df = spark.createDataFrame(data, schema)
df.show(truncate=False)

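# AND and OR share a single precedence level here. As the output further down
# shows, a chain such as X AND Y AND Z is grouped to the right: X AND (Y AND Z).
# A variable is one uppercase letter followed by one digit (A1, B2, ...).
grammar = """
    ?start: expression
    ?expression: atom
        | expression "AND" expression -> and_op
        | expression "OR" expression  -> or_op
        | "(" expression ")" -> bracket_exp
    ?atom: /[A-Z][0-9]/ -> variable

    %import common.WS
    %ignore WS
"""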


def evaluate_exp(expression):
    """Parse one boolean expression and return its steps as 'left OP right -> logicN' strings."""
    from lark import Lark, Transformer, v_args

    class MyTransformer(Transformer):
        def __init__(self):
            super().__init__()
            self.logic_counter = 0
            self.transformations = []

        def variable(self, items):
            return str(items[0])

        @v_args(inline=True)
        def and_op(self, left, right):
            self.logic_counter += 1
            result = f"{left} AND {right}"
            logic_label = f"logic{self.logic_counter}"
            self.transformations.append((result, logic_label))
            return logic_label

        @v_args(inline=True)
        def or_op(self, left, right):
            self.logic_counter += 1
            result = f"{left} OR {right}"
            logic_label = f"logic{self.logic_counter}"
            self.transformations.append((result, logic_label))
            return logic_label

        def bracket_exp(self, items):
            return items[0]

        def get_transformations(self):
            string_repr = []
            for original, label in self.transformations:
                string_repr.append(f"{original} -> {label}")
            return string_repr

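    # Build the parser, then run the transformer bottom-up so inner groups are labelled first.
    # Note: the parser is rebuilt on every call, i.e. once per row when used as a UDF.
    parser = Lark(grammar, start='start', parser='lalr')
    parsed = parser.parse(expression)

    transformer = MyTransformer()
    transformer.transform(parsed)
    return transformer.get_transformations()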


# Wrap the function as a UDF that returns an array of step strings.
evaluate_exp_udf = udf(evaluate_exp, ArrayType(StringType()))

# One array of steps per expression, then one row per step.
df = df.withColumn("ast_tree", evaluate_exp_udf(col("boolean_expression")))
df = df.withColumn("exploded_col", explode(col("ast_tree")))
df.show(n=40, truncate=False)
df.select("exp_id", "exploded_col").show(n=40, truncate=False)

Output:

+------+---------------------------+
|exp_id|exploded_col               |
+------+---------------------------+
|1     |A2 AND A3 -> logic1        |
|1     |A4 OR logic1 -> logic2     |
|1     |B1 OR B2 -> logic3         |
|1     |logic2 AND logic3 -> logic4|
|1     |A1 AND logic4 -> logic5    |
|2     |A2 AND A3 -> logic1        |
|2     |B1 AND B2 -> logic2        |
|2     |logic1 OR logic2 -> logic3 |
+------+---------------------------+
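The order of the rows above follows from the tree being processed bottom-up: both operands of an operator are handled before the operator itself, so the innermost group receives logic1. The sketch below reproduces that numbering on a hand-built tree without lark or Spark. The nested-tuple representation and the name label_steps are assumptions made for illustration only; they are not part of the script above.

```python
def label_steps(tree):
    """Return 'left OP right -> logicN' strings in post-order (children first)."""
    steps = []

    def visit(node):
        # A plain string is a variable such as "A1" and needs no label.
        if isinstance(node, str):
            return node
        op, left, right = node
        # Visit the children before the parent, left before right.
        left_name = visit(left)
        right_name = visit(right)
        label = f"logic{len(steps) + 1}"
        steps.append(f"{left_name} {op} {right_name} -> {label}")
        return label

    visit(tree)
    return steps


# Hand-built tree for: A1 AND (A4 OR (A2 AND A3)) AND (B1 OR B2),
# with the AND chain grouped to the right as in the table above.
tree = ("AND", "A1",
        ("AND",
         ("OR", "A4", ("AND", "A2", "A3")),
         ("OR", "B1", "B2")))

for step in label_steps(tree):
    print(step)
```

Printing the steps for this tree gives the same five lines as exp_id 1 in the table.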

Full output of the script:

+------+-----------------------------------------+
|exp_id|boolean_expression                       |
+------+-----------------------------------------+
|1     |A1 AND (A4 OR (A2 AND A3)) AND (B1 OR B2)|
|2     |(A2 AND A3) OR (B1 AND B2)               |
+------+-----------------------------------------+

+------+-----------------------------------------+-----------------------------------------------------------------------------------------------------------------------+---------------------------+
|exp_id|boolean_expression                       |ast_tree                                                                                                               |exploded_col               |
+------+-----------------------------------------+-----------------------------------------------------------------------------------------------------------------------+---------------------------+
|1     |A1 AND (A4 OR (A2 AND A3)) AND (B1 OR B2)|[A2 AND A3 -> logic1, A4 OR logic1 -> logic2, B1 OR B2 -> logic3, logic2 AND logic3 -> logic4, A1 AND logic4 -> logic5]|A2 AND A3 -> logic1        |
|1     |A1 AND (A4 OR (A2 AND A3)) AND (B1 OR B2)|[A2 AND A3 -> logic1, A4 OR logic1 -> logic2, B1 OR B2 -> logic3, logic2 AND logic3 -> logic4, A1 AND logic4 -> logic5]|A4 OR logic1 -> logic2     |
|1     |A1 AND (A4 OR (A2 AND A3)) AND (B1 OR B2)|[A2 AND A3 -> logic1, A4 OR logic1 -> logic2, B1 OR B2 -> logic3, logic2 AND logic3 -> logic4, A1 AND logic4 -> logic5]|B1 OR B2 -> logic3         |
|1     |A1 AND (A4 OR (A2 AND A3)) AND (B1 OR B2)|[A2 AND A3 -> logic1, A4 OR logic1 -> logic2, B1 OR B2 -> logic3, logic2 AND logic3 -> logic4, A1 AND logic4 -> logic5]|logic2 AND logic3 -> logic4|
|1     |A1 AND (A4 OR (A2 AND A3)) AND (B1 OR B2)|[A2 AND A3 -> logic1, A4 OR logic1 -> logic2, B1 OR B2 -> logic3, logic2 AND logic3 -> logic4, A1 AND logic4 -> logic5]|A1 AND logic4 -> logic5    |
|2     |(A2 AND A3) OR (B1 AND B2)               |[A2 AND A3 -> logic1, B1 AND B2 -> logic2, logic1 OR logic2 -> logic3]                                                 |A2 AND A3 -> logic1        |
|2     |(A2 AND A3) OR (B1 AND B2)               |[A2 AND A3 -> logic1, B1 AND B2 -> logic2, logic1 OR logic2 -> logic3]                                                 |B1 AND B2 -> logic2        |
|2     |(A2 AND A3) OR (B1 AND B2)               |[A2 AND A3 -> logic1, B1 AND B2 -> logic2, logic1 OR logic2 -> logic3]                                                 |logic1 OR logic2 -> logic3 |
+------+-----------------------------------------+-----------------------------------------------------------------------------------------------------------------------+---------------------------+

+------+---------------------------+
|exp_id|exploded_col               |
+------+---------------------------+
|1     |A2 AND A3 -> logic1        |
|1     |A4 OR logic1 -> logic2     |
|1     |B1 OR B2 -> logic3         |
|1     |logic2 AND logic3 -> logic4|
|1     |A1 AND logic4 -> logic5    |
|2     |A2 AND A3 -> logic1        |
|2     |B1 AND B2 -> logic2        |
|2     |logic1 OR logic2 -> logic3 |
+------+---------------------------+
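If installing lark on the cluster is not an option, a small hand-written recursive-descent parser can produce the same steps with only the standard library. This is a minimal sketch under stated assumptions, not a replacement for the script above: the names tokenize and decompose are hypothetical, variables are assumed to be one uppercase letter followed by one or more digits, and AND and OR are given the same precedence and grouped to the right so that the result matches the output shown above.

```python
import re

# One token per match: a parenthesis, an operator, or a variable such as A1.
TOKEN_RE = re.compile(r"\s*(\(|\)|AND\b|OR\b|[A-Z][0-9]+)")


def tokenize(text):
    """Split the expression into tokens, failing loudly on unknown input."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if not match:
            raise ValueError(f"unexpected input at position {pos}: {text[pos:]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def decompose(text):
    """Return the 'left OP right -> logicN' steps for one expression."""
    tokens = tokenize(text)
    steps = []
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise ValueError(f"expected {expected or 'a token'}, got {tok!r}")
        pos += 1
        return tok

    def parse_atom():
        tok = take()
        if tok == "(":
            name = parse_expr()
            take(")")
            return name
        if tok in (")", "AND", "OR"):
            raise ValueError(f"unexpected token {tok!r}")
        return tok

    def parse_expr():
        left = parse_atom()
        if peek() in ("AND", "OR"):
            op = take()
            # Right-recursive call: X AND Y AND Z becomes X AND (Y AND Z).
            right = parse_expr()
            label = f"logic{len(steps) + 1}"
            steps.append(f"{left} {op} {right} -> {label}")
            return label
        return left

    parse_expr()
    if peek() is not None:
        raise ValueError(f"unexpected trailing token {peek()!r}")
    return steps


for line in decompose("A1 AND (A4 OR (A2 AND A3)) AND (B1 OR B2)"):
    print(line)
```

Because decompose is plain Python and returns a list of strings, it should be possible to wrap it with udf(decompose, ArrayType(StringType())) in the same way as evaluate_exp. If AND is meant to bind more tightly than OR, the parser would need a separate precedence level for each operator.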