Using a dictionary with the regexp_replace function in pyspark

Problem description

I want to use a dictionary to perform a regexp_replace operation on a pyspark dataframe column.

Dictionary: {'RD': 'ROAD', 'DR': 'DRIVE', 'AVE': 'AVENUE', ....} The dictionary will have roughly 270 key-value pairs.

Input dataframe:

ID  | Address    
1   | 22, COLLINS RD     
2   | 11, HEMINGWAY DR    
3   | AVIATOR BUILDING    
4   | 33, PARK AVE MULLOHAND DR

Desired output dataframe:

ID   | Address  | Address_Clean    
1    | 22, COLLINS RD    | 22, COLLINS ROAD    
2    | 11, HEMINGWAY DR     | 11, HEMINGWAY DRIVE    
3    | AVIATOR BUILDING      | AVIATOR BUILDING    
4    | 33, PARK AVE MULLOHAND DR    | 33, PARK AVENUE MULLOHAND DRIVE

I could not find any documentation for this on the internet. If I try to pass the dictionary as in the code below -

data=data.withColumn('Address_Clean',regexp_replace('Address',dict))

it throws the error "regexp_replace takes 3 arguments, 2 given".

The dataset is about 20 million rows, so a UDF solution will be slow (it operates row by row), and we do not have access to Spark 2.3.0, which supports pandas_udf. Is there any efficient way to do this, other than possibly using a loop?

regex dictionary pyspark spark-dataframe
1 Answer

It throws that error because regexp_replace() takes three arguments:

regexp_replace('column_to_change','pattern_to_be_changed','new_pattern')
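
For example, to replace one fixed pattern with one fixed replacement, the call looks like this (a minimal sketch using the column name from your question):

import pyspark.sql.functions as sf

# Three arguments: the column, the regex pattern, and the replacement text.
data = data.withColumn('Address_Clean', sf.regexp_replace('Address', r'\bRD\b', 'ROAD'))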

But you are right that you do not need a UDF or a loop. You just need some more regexp and a directory table that looks exactly like your original dictionary :)

Here is my solution:

import pyspark.sql.functions as sf

# First, strip all the endings you want to replace.
# You can use the OR (|) operator for that.
# You could probably automate that and pass in a string built the same way, but I will leave that to you.

input_df = input_df.withColumn('start_address', sf.regexp_replace('original_address', "RD|DR|etc...", ""))
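
As the comment above suggests, the OR pattern could be generated from the dictionary keys instead of being typed out by hand (a sketch; suffix_map is an assumed name for your original dictionary):

suffix_map = {'RD': 'ROAD', 'DR': 'DRIVE', 'AVE': 'AVENUE'}  # ... the ~270 pairs

# Build "RD|DR|AVE|..." from the keys and strip those endings in one pass.
pattern = '|'.join(suffix_map.keys())
input_df = input_df.withColumn('start_address', sf.regexp_replace('original_address', pattern, ''))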


# You will still need the old ending in a separate column,
# so that you have something to join to the directory table on.

# "(.*) (.*)" captures everything up to the last space in group 1 and the last word in group 2.
input_df = input_df.withColumn('end_of_address', sf.regexp_extract('original_address', "(.*) (.*)", 2))
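
The directory table used in the join below can be built straight from the original dictionary (a sketch; spark is your SparkSession, and the column names only have to match the join and concat steps):

# Two columns: the ending to look up and the ending to put in its place.
directory_df = spark.createDataFrame(list(suffix_map.items()), ['end_of_address', 'correct_end'])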


# Now we join the directory table that has two columns - ends you want to replace and ends you want to have instead.

input_df = directory_df.join(input_df,'end_of_address')


# And now you just need to concatenate the address with the correct ending.

input_df = input_df.withColumn('address_clean',sf.concat('start_address','correct_end'))
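
One caveat: an inner join drops rows whose last word is not in the directory (e.g. "AVIATOR BUILDING" in your sample data). A left join plus coalesce keeps those rows unchanged; this is a variant of the steps above, not something the original answer states:

# Left join so unmatched endings survive with a null 'correct_end'.
input_df = input_df.join(directory_df, 'end_of_address', 'left')

# concat() returns null when 'correct_end' is null, so fall back to the original address.
input_df = input_df.withColumn('address_clean',
                               sf.coalesce(sf.concat('start_address', 'correct_end'),
                                           sf.col('original_address')))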