I have a string column that contains nested parentheses, and I need to extract the sub-expressions from it recursively.
For example:
A1 AND (A4 OR (A2 AND A3)) AND (B1 OR B2) should be split into
(expected output)
1. A2 AND A3->logic1
2. A4 OR logic1 ->logic2
3. B1 OR B2 ->logic3
4. logic2 AND logic3 ->logic4
5. A1 AND logic4 ->logic5
(A2 AND A3) OR (B1 AND B2) should be split into (expected output)
1. B1 AND B2->logic1
2. A2 AND A3->logic2
3. logic1 OR logic2->logic3
So far I have tried the split function in Spark SQL (splitting on '(') and then dense_rank() ordered by the number of closing parentheses in each piece:
with t1 as (
    select '(A2 AND A3) OR (B1 AND B2)' as logic from table limit 1
),
t1_exploded as (
    select logic, explode(split(logic, '\\(')) as logic_splitted from t1
),
t1_logic_splitted as (
    select logic,
           trim(logic_splitted) as logic_splitted,
           dense_rank() over (
               order by len(logic_splitted) - len(replace(logic_splitted, ')', '')) desc,
                        logic_splitted desc
           ) as logic_group
    from t1_exploded
)
select * from t1_logic_splitted
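However, I run into problems when there are more than two AND/OR operators. Please let me know if there is a better option; any help is much appreciated.
This is fairly easy to do with the lark Python library. Install it first:
$ pip install lark --upgrade
Then you need to create a grammar that can parse your expressions.
Here is the script: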
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, udf
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.appName("NoUnstack").getOrCreate()

schema = StructType([
    StructField("exp_id", IntegerType(), True),
    StructField("boolean_expression", StringType(), True),
])
data = [
    (1, "A1 AND (A4 OR (A2 AND A3)) AND (B1 OR B2)"),
    (2, "(A2 AND A3) OR (B1 AND B2)"),
]
df = spark.createDataFrame(data, schema)
df.show(truncate=False)
# Grammar: variables such as A1, the binary operators AND / OR, and parenthesized sub-expressions.
grammar = """
    ?start: expression
    ?expression: atom
        | expression "AND" expression -> and_op
        | expression "OR" expression  -> or_op
        | "(" expression ")"          -> bracket_exp
    ?atom: /[A-Z][0-9]/ -> variable
    %import common.WS
    %ignore WS
"""
def evaluate_exp(expression):
    # Imported inside the function so the import also happens where the UDF runs.
    from lark import Lark, Transformer, v_args

    class MyTransformer(Transformer):
        def __init__(self):
            super().__init__()
            self.logic_counter = 0
            self.transformations = []

        def variable(self, items):
            return str(items[0])

        @v_args(inline=True)
        def and_op(self, left, right):
            # Record "left AND right" and hand the new label up to the parent node.
            self.logic_counter += 1
            result = f"{left} AND {right}"
            logic_label = f"logic{self.logic_counter}"
            self.transformations.append((result, logic_label))
            return logic_label

        @v_args(inline=True)
        def or_op(self, left, right):
            # Same as and_op, for OR.
            self.logic_counter += 1
            result = f"{left} OR {right}"
            logic_label = f"logic{self.logic_counter}"
            self.transformations.append((result, logic_label))
            return logic_label

        def bracket_exp(self, items):
            # Parentheses only group; pass the inner value through unchanged.
            return items[0]

        def get_transformations(self):
            string_repr = []
            for original, label in self.transformations:
                string_repr.append(f"{original} -> {label}")
            return string_repr

    parser = Lark(grammar, start='start', parser='lalr')
    parsed = parser.parse(expression)
    # The transformer visits the tree bottom-up, so inner expressions are recorded first.
    transformer = MyTransformer()
    transformer.transform(parsed)
    value = transformer.get_transformations()
    return value

evaluate_exp_udf = udf(evaluate_exp, ArrayType(StringType()))
df = df.withColumn("ast_tree", evaluate_exp_udf(col("boolean_expression")))
df = df.withColumn("exploded_col", explode(col("ast_tree")))
df.show(n=40, truncate=False)
df.select("exp_id", "exploded_col").show(n=40, truncate=False)
Output:
+------+---------------------------+
|exp_id|exploded_col               |
+------+---------------------------+
|1     |A2 AND A3 -> logic1        |
|1     |A4 OR logic1 -> logic2     |
|1     |B1 OR B2 -> logic3         |
|1     |logic2 AND logic3 -> logic4|
|1     |A1 AND logic4 -> logic5    |
|2     |A2 AND A3 -> logic1        |
|2     |B1 AND B2 -> logic2        |
|2     |logic1 OR logic2 -> logic3 |
+------+---------------------------+
Full output:
+------+-----------------------------------------+
|exp_id|boolean_expression                       |
+------+-----------------------------------------+
|1     |A1 AND (A4 OR (A2 AND A3)) AND (B1 OR B2)|
|2     |(A2 AND A3) OR (B1 AND B2)               |
+------+-----------------------------------------+
+------+-----------------------------------------+-----------------------------------------------------------------------------------------------------------------------+---------------------------+
|exp_id|boolean_expression |ast_tree |exploded_col |
+------+-----------------------------------------+-----------------------------------------------------------------------------------------------------------------------+---------------------------+
|1 |A1 AND (A4 OR (A2 AND A3)) AND (B1 OR B2)|[A2 AND A3 -> logic1, A4 OR logic1 -> logic2, B1 OR B2 -> logic3, logic2 AND logic3 -> logic4, A1 AND logic4 -> logic5]|A2 AND A3 -> logic1 |
|1 |A1 AND (A4 OR (A2 AND A3)) AND (B1 OR B2)|[A2 AND A3 -> logic1, A4 OR logic1 -> logic2, B1 OR B2 -> logic3, logic2 AND logic3 -> logic4, A1 AND logic4 -> logic5]|A4 OR logic1 -> logic2 |
|1 |A1 AND (A4 OR (A2 AND A3)) AND (B1 OR B2)|[A2 AND A3 -> logic1, A4 OR logic1 -> logic2, B1 OR B2 -> logic3, logic2 AND logic3 -> logic4, A1 AND logic4 -> logic5]|B1 OR B2 -> logic3 |
|1 |A1 AND (A4 OR (A2 AND A3)) AND (B1 OR B2)|[A2 AND A3 -> logic1, A4 OR logic1 -> logic2, B1 OR B2 -> logic3, logic2 AND logic3 -> logic4, A1 AND logic4 -> logic5]|logic2 AND logic3 -> logic4|
|1 |A1 AND (A4 OR (A2 AND A3)) AND (B1 OR B2)|[A2 AND A3 -> logic1, A4 OR logic1 -> logic2, B1 OR B2 -> logic3, logic2 AND logic3 -> logic4, A1 AND logic4 -> logic5]|A1 AND logic4 -> logic5 |
|2 |(A2 AND A3) OR (B1 AND B2) |[A2 AND A3 -> logic1, B1 AND B2 -> logic2, logic1 OR logic2 -> logic3] |A2 AND A3 -> logic1 |
|2 |(A2 AND A3) OR (B1 AND B2) |[A2 AND A3 -> logic1, B1 AND B2 -> logic2, logic1 OR logic2 -> logic3] |B1 AND B2 -> logic2 |
|2 |(A2 AND A3) OR (B1 AND B2) |[A2 AND A3 -> logic1, B1 AND B2 -> logic2, logic1 OR logic2 -> logic3] |logic1 OR logic2 -> logic3 |
+------+-----------------------------------------+-----------------------------------------------------------------------------------------------------------------------+---------------------------+
+------+---------------------------+
|exp_id|exploded_col               |
+------+---------------------------+
|1     |A2 AND A3 -> logic1        |
|1     |A4 OR logic1 -> logic2     |
|1     |B1 OR B2 -> logic3         |
|1     |logic2 AND logic3 -> logic4|
|1     |A1 AND logic4 -> logic5    |
|2     |A2 AND A3 -> logic1        |
|2     |B1 AND B2 -> logic2        |
|2     |logic1 OR logic2 -> logic3 |
+------+---------------------------+
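
If installing lark on the cluster is not an option, the same steps can probably be approximated with the standard library alone. The following is only a minimal sketch and not part of the script above: it swaps the parser for a regex loop that repeatedly replaces the leftmost innermost parenthesized group with a label, then folds whatever flat chain is left from right to left. The names reduce_expression and fold_chain are made up for illustration. It assumes operands contain no spaces or parentheses, that operators are exactly AND / OR surrounded by whitespace, and that a chain mixing AND and OR without parentheses may simply be folded right to left (no operator precedence).

```python
import re

# One innermost group: "(" ... ")" with no parentheses in between.
INNERMOST = re.compile(r"\(([^()]*)\)")
# An operator surrounded by whitespace; the capture group keeps it in the split result.
OPERATOR = re.compile(r"\s+(AND|OR)\s+")


def reduce_expression(expression):
    """Hypothetical helper: list the binary steps, e.g. 'A2 AND A3 -> logic1'."""
    steps = []

    def fold_chain(flat):
        # flat has no parentheses, e.g. "A1 AND logic2 AND logic3".
        # Splitting gives [operand, op, operand, op, operand, ...].
        parts = [p.strip() for p in OPERATOR.split(flat.strip())]
        result = parts[-1]
        # Assumption: fold from the right, ignoring AND/OR precedence.
        for i in range(len(parts) - 2, 0, -2):
            left, op = parts[i - 1], parts[i]
            label = f"logic{len(steps) + 1}"
            steps.append(f"{left} {op} {result} -> {label}")
            result = label
        return result

    current = expression
    match = INNERMOST.search(current)
    while match:
        # Replace the group, parentheses included, with the label it reduces to.
        label = fold_chain(match.group(1))
        current = current[:match.start()] + label + current[match.end():]
        match = INNERMOST.search(current)
    if "(" in current or ")" in current:
        raise ValueError(f"unbalanced parentheses in {expression!r}")
    fold_chain(current)
    return steps


for step in reduce_expression("A1 AND (A4 OR (A2 AND A3)) AND (B1 OR B2)"):
    print(step)
# A2 AND A3 -> logic1
# A4 OR logic1 -> logic2
# B1 OR B2 -> logic3
# logic2 AND logic3 -> logic4
# A1 AND logic4 -> logic5
```

Since it returns a list of strings, it could be wrapped with udf(reduce_expression, ArrayType(StringType())) in the same way as evaluate_exp. For anything beyond this simple shape (operator precedence, NOT, quoted operands), a real grammar such as the lark one is likely the safer choice.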