For every token in a description I have a corresponding tag that specifies the token's type. I want to extract the address from the description, i.e. all tokens tagged `<street>`, `<city>`, or `<state>`. How can I achieve this in PySpark?
+-------------------------------------------+--------------------------------------------------------------------------+
|description                                |tags                                                                      |
+-------------------------------------------+--------------------------------------------------------------------------+
|"aci*credit one bank, n"                   |<vendor_name> <vendor_name> <vendor_name> <vendor_name>                   |
|odot dmv2u 503-9455400 or 06/30            |<vendor_name> <vendor_name> <phone_number> <state> <trans_date>           |
|# 7-eleven 41066 5050 hunter rd ooltewah tn|<other> <vendor_name> <store_id> <street> <street> <street> <city> <state>|
+-------------------------------------------+--------------------------------------------------------------------------+
The output I am looking for is:
NULL
OR
5050 hunter rd ooltewah tn
The result should not include anything whose tag is not an address tag.
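The per-row logic being asked for can be sketched in plain Python first (the helper name `extract_address` is my own, for illustration): zip the space-split tokens with their tags, keep only tokens whose tag is an address tag, and rejoin.

```python
ADDRESS_TAGS = {'<street>', '<city>', '<state>'}

def extract_address(description, tags):
    """Keep only the description tokens whose positional tag is an address tag."""
    pairs = zip(description.split(' '), tags.split(' '))
    kept = [token for token, tag in pairs if tag in ADDRESS_TAGS]
    return ' '.join(kept) or None  # no address tokens -> None (NULL)

extract_address('# 7-eleven 41066 5050 hunter rd ooltewah tn',
                '<other> <vendor_name> <store_id> <street> <street> <street> <city> <state>')
# -> '5050 hunter rd ooltewah tn'
```

The PySpark answer below implements exactly this token/tag zip-and-filter, but column-wise with `arrays_zip` and the higher-order `filter` function.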
Check out this solution:
import pyspark.sql.functions as f

df = spark.createDataFrame([
    ('"aci*credit one bank, n"', '<vendor_name> <vendor_name> <vendor_name> <vendor_name>'),
    ('odot dmv2u 503-9455400 or 06/30', '<vendor_name> <vendor_name> <phone_number> <state> <trans_date>'),
    ('# 7-eleven 41066 5050 hunter rd ooltewah tn', '<other> <vendor_name> <store_id> <street> <street> <street> <city> <state>')
], ['description', 'tags'])

address_tags = ['<state>', '<street>', '<city>']
address_tags_concatenated = '"' + '","'.join(address_tags) + '"'

df = (
    df
    # Can't use a map because tag values repeat (e.g. three <street> entries).
    .withColumn('content_zip', f.arrays_zip(
        f.split(f.col('description'), ' ').alias('description'),
        f.split(f.col('tags'), ' ').alias('tag')))
    # Keep only the (token, tag) pairs whose tag is an address tag.
    .withColumn('content_zip_filtered', f.expr(f'filter(content_zip, x -> x.tag in ({address_tags_concatenated}))'))
    # Rejoin the surviving tokens into a single address string.
    .select(f.concat_ws(' ', f.col('content_zip_filtered.description')).alias('address'))
)

df.show(truncate=False)
Output:
+--------------------------+
|address |
+--------------------------+
| |
|or |
|5050 hunter rd ooltewah tn|
+--------------------------+