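I'm working with Spark in Scala and need to filter an RDD by a specific field within a column, in this case user.
I want to get back an RDD containing only the users ["Joe","Plank","Willy"], but I can't figure out how.
This is my RDD: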
2020-03-01T00:00:05Z my.local5.url {"request_method":"GET","request_length":281,"user":"Joe"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Willy"}
2020-03-01T00:00:05Z my.local6.url {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Tracy"}
2020-03-01T00:00:05Z my.local6.url {"request_method":"GET","request_length":281,"user":"Roger"}
Expected output:
2020-03-01T00:00:05Z my.local5.url {"request_method":"GET","request_length":281,"user":"Joe"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Willy"}
2020-03-01T00:00:05Z my.local6.url {"request_method":"GET","request_length":281,"user":"Plank"}
2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Plank"}
I loaded the RDD with Spark using something like this (pseudocode):
val sparkConf = new SparkConf().setAppName("MyApp")
master.foreach(sparkConf.setMaster)
val sc = new SparkContext(sparkConf)
val rdd = sc.textFile(inputDir)
rdd.filter(_.contains("\"user\":\"THE_ARRAY_OF_NAMES_"))
It would be easier for you to use DataFrames.
With the from_json function you can convert the JSON column into multiple columns.
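import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._ // enables the $"..." column syntax; assumes a SparkSession named spark

// Schema of the JSON payload in each row
val jsonSchema = StructType(Array(
  StructField("request_method", StringType, true),
  StructField("request_length", IntegerType, true),
  StructField("user", StringType, true)
))
// path and column_name are placeholders: your input location and the column holding the JSON string
val myDf = spark.read.option("header", "true").csv(path)
val formatedDf = myDf.withColumn("formated_json", from_json($"column_name", jsonSchema))
  .select($"formated_json.*")
  .where($"user".isin("Joe", "Plank", "Willy"))
formatedDf.show()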
But if you'd prefer an RDD-based approach, let me know.
Edit with the RDD version: keep in mind this is just one of many ways to do it.
//Define a regex pattern
val Pattern = """(?i)"user":"([a-zA-Z]+)"""".r
//Define a Set with your filtered values
val userSet = Set("Joe","Plank","Willy")
//Filter only the values you want
val filteredRdd = rdd.filter { x =>
  //Extract the user using the pattern we just declared
  val user = for (m <- Pattern.findFirstMatchIn(x)) yield m.group(1)
  //If the extracted user is in the set, this returns true and the row is kept
  userSet(user.getOrElse(""))
}
To check that the result is correct, you can use:
filteredRdd.collect().foreach(println)
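If you want to sanity-check the filter logic without starting Spark, here is a minimal sketch that runs the same pattern and set against a few of the sample lines in a plain Scala collection. The object name UserFilterSketch and the helper keepLine are hypothetical names used only for this illustration; they are not part of Spark.

```scala
object UserFilterSketch {
  // Same pattern and allowed users as in the RDD version above
  val Pattern = """(?i)"user":"([a-zA-Z]+)"""".r
  val userSet = Set("Joe", "Plank", "Willy")

  // Hypothetical helper: true when the line has a "user" field whose value is in userSet
  def keepLine(line: String): Boolean =
    Pattern.findFirstMatchIn(line).map(_.group(1)).exists(userSet)

  // Three of the sample lines from the question
  val sample: Seq[String] = Seq(
    """2020-03-01T00:00:05Z my.local5.url {"request_method":"GET","request_length":281,"user":"Joe"}""",
    """2020-03-01T00:00:05Z my.local2.url {"request_method":"GET","request_length":281,"user":"Tracy"}""",
    """2020-03-01T00:00:05Z my.local6.url {"request_method":"GET","request_length":281,"user":"Plank"}"""
  )

  // Keeps the Joe and Plank lines and drops the Tracy line
  val kept: Seq[String] = sample.filter(keepLine)

  def main(args: Array[String]): Unit = kept.foreach(println)
}
```

Since keepLine is an ordinary String => Boolean function, the same predicate could be passed to the RDD as rdd.filter(UserFilterSketch.keepLine). Note that (?i) only makes the regex match case-insensitively; the set lookup itself is still case-sensitive, so "joe" would not be kept.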