PySpark: splitting a string with a regular expression inside a lambda

Question (votes: 1, answers: 1)

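I am trying to split a string with a regular expression inside a lambda function, but the string does not get split. I am sure the regex itself works; see the regex test at https://regex101.com/r/ryRio6/1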

from pyspark.sql.functions import col,split
import re

r = re.compile(r"(?=\s\w+=)")
adsample = sc.textFile("hdfs://hostname/user/hdfs/sample/Log18Dec.txt")
splitted_sample = adsample.flatMap(lambda (x): ((v) for v in r.split(x)))

for m in splitted_sample.collect():
    print(m)

I don't know where I went wrong.

Sample line from the file:

|RECEIVE|Low| eventId=139569 msg=W4N Alert :: Critical : Interface Utilization for GigabitEthernet0/1 90.0 % in=2442 out=0 categorySignificance=/Normal categoryBehavior=/Communicate/Query categoryDeviceGroup=/Application

The regex should match the whitespace that precedes each key.

Expected output:

|RECEIVE|Low|
eventId=139569
msg=W4N Alert :: Critical : Interface Utilization for GigabitEthernet0/1 90.0 %
in=2442
out=0
categorySignificance=/Normal
categoryBehavior=/Communicate/Query
categoryDeviceGroup=/Application
python apache-spark lambda pyspark pyspark-sql
1 answer (votes: 1)
from pyspark.sql.functions import col,split
import re

# The original pattern only looks ahead and matches an empty string:
#r = re.compile(r"(?=\s\w+=)")
adsample = sc.textFile("hdfs://hostname/user/hdfs/sample/Log18Dec.txt")
# Consume the whitespace itself and look ahead only for "key=".
splitted_sample = adsample.flatMap(lambda x: re.split(r'\s+(?=\w+=)', x))

for m in splitted_sample.collect():
    print(m)
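
The difference between the two patterns can be checked without Spark. The pattern in the question, `(?=\s\w+=)`, is a pure lookahead, so every match is an empty string, and Python releases before 3.7 do not split on empty matches. The `lambda (x):` syntax suggests the question was run on Python 2, so this is a likely cause, though that is an inference rather than something the question states. The answer's pattern, `\s+(?=\w+=)`, consumes the whitespace, so each match is non-empty. Below is a minimal sketch on the sample line from the question; the names `KEY_BOUNDARY` and `split_fields` are illustrative and do not appear in the original post.

```python
import re

# Sample line copied from the question.
line = ("|RECEIVE|Low| eventId=139569 msg=W4N Alert :: Critical : "
        "Interface Utilization for GigabitEthernet0/1 90.0 % in=2442 out=0 "
        "categorySignificance=/Normal categoryBehavior=/Communicate/Query "
        "categoryDeviceGroup=/Application")

# Consume the whitespace itself and look ahead only for "key=".
# (KEY_BOUNDARY is a hypothetical name chosen for this sketch.)
KEY_BOUNDARY = re.compile(r"\s+(?=\w+=)")


def split_fields(text):
    """Split a log line at the whitespace that precedes each key=value pair."""
    return KEY_BOUNDARY.split(text)


fields = split_fields(line)
for field in fields:
    print(field)
```

Because `split_fields` returns a list, it could be passed to Spark directly as `adsample.flatMap(split_fields)` in place of the lambda.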