I have two dataframes in my Databricks notebook. For example, the data looks like this:

df1:
id name
1  200/300A/200B
2  805/805B/500
3  22A+100B
4  200/300A/200B+22A+100B;
5  100+-805/+22A+100B;
6  ;
7  NULL

df2:
ids
805
200B
22A
I want to replace every value in df1's name column that appears in df2's ids with 0, and every other value with 1. For example, wherever the data shows 805, 200B, or 22A, those values in df1 should become 0, and all other tokens should become 1.
Expected output:
id name
1 1/1/0
2 0/1/1
3 0+1
4 1/1/0+0+1;
5 1+-0/+0+1;
6 ;
7 NULL
The following approach collects all values from df2 into a Python list and then builds Spark regexp lookups from it, which may be hard to maintain.

Input:
from pyspark.sql import functions as F
df1 = spark.createDataFrame(
[(1, '200/300A/200B'),
(2, '805/805B/500'),
(3, '22A+100B'),
(4, '200/300A/200B+22A+100B;'),
(5, '100+-805/+22A+100B;'),
(6, ';'),
(7, None)],
['id', 'name'])
df2 = spark.createDataFrame([('805',), ('200B',), ('22A',)], ['ids'])
Script:
zeros = df2.agg(F.collect_set('ids')).head()[0]
for_zeros = '|'.join(zeros)
pat_for_zeros = rf'(?<![\d\w])({for_zeros})(?![\w\d])'
pat_for_other = r'(?<![\d\w])(?!0([^\w\d]|$))[\d\w]+'
replaced_zeros = F.regexp_replace('name', pat_for_zeros, '0')
replaced_other = F.regexp_replace(replaced_zeros, pat_for_other, '1')
df = df1.withColumn('name', replaced_other)
df.show()
# +---+----------+
# | id| name|
# +---+----------+
# | 1| 1/1/0|
# | 2| 0/1/1|
# | 3| 0+1|
# | 4|1/1/0+0+1;|
# | 5|1+-0/+0+1;|
# | 6| ;|
# | 7| NULL|
# +---+----------+
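For reference, the two patterns can be illustrated with Python's built-in re module (Spark's regexp_replace uses Java regex, but these lookarounds behave the same way in both engines). pat_for_zeros matches a df2 token only when it is not embedded in a longer alphanumeric run; pat_for_other then matches every remaining token except a bare 0:

```python
import re

zeros = ['805', '200B', '22A']  # the values collected from df2
pat_for_zeros = rf"(?<![\d\w])({'|'.join(zeros)})(?![\w\d])"
pat_for_other = r'(?<![\d\w])(?!0([^\w\d]|$))[\d\w]+'

s = '200/300A/200B+22A+100B;'
step1 = re.sub(pat_for_zeros, '0', s)      # listed tokens -> 0
step2 = re.sub(pat_for_other, '1', step1)  # everything else -> 1
print(step1)  # 200/300A/0+0+100B;
print(step2)  # 1/1/0+0+1;
```

One caveat: this joins the df2 values into the pattern verbatim, so if they ever contain regex metacharacters you would want to escape them first (re.escape in Python, or java.util.regex.Pattern.quote on the Spark side).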
Check out the following code:
WITH input AS (
SELECT
id,
name,
r.ids
FROM VALUES (1 ,'200/300A/200B'),(2 ,'805/805B/500'),(3 ,'22A+100B'),(4 ,';'),(5, NULL) AS (id, name)
JOIN ( SELECT COLLECT_LIST(ids) AS ids FROM VALUES ('805'),('200B'),('22A') AS (ids) ) r
)
SELECT
id,
CASE WHEN LENGTH(name) == 1 THEN name
ELSE
transform(
array_distinct(
regexp_extract_all(name, '[^a-zA-Z0-9]', 0)
),
t ->
CONCAT_WS(
t,
transform(
split(name, concat('[',t,']')),
s -> if(array_contains(ids,s), 0, 1)
)
)
)[0]
END AS name
FROM input
+---+-----+
|id |name |
+---+-----+
|1 |1/1/0|
|2 |0/1/1|
|3 |0+1 |
|4 |; |
|5 |NULL |
+---+-----+
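The per-value logic of the SQL above can be sketched in plain Python (illustration only; in the query the same steps run through regexp_extract_all, array_distinct, split, transform, and array_contains). Note that, like the SQL's trailing [0], this uses only the first distinct delimiter found, so values mixing / and + in one string are not fully handled:

```python
import re

ids = {'805', '200B', '22A'}  # the values from df2

def encode(name):
    # Mirror of the CASE WHEN: pass single-char and NULL rows through unchanged.
    if name is None or len(name) == 1:
        return name
    # Distinct delimiters, like regexp_extract_all + array_distinct in the SQL.
    delims = list(dict.fromkeys(re.findall(r'[^a-zA-Z0-9]', name)))
    # The SQL's transform(...)[0]: only the first delimiter is actually used.
    d = delims[0]
    # split + transform + array_contains, then CONCAT_WS to rejoin.
    return d.join('0' if tok in ids else '1' for tok in name.split(d))

print(encode('200/300A/200B'))  # 1/1/0
print(encode('22A+100B'))       # 0+1
```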