I have the following dataframe, and I want to split it into columns using Spark's Dataset API.
How can I do that?
Each row of the dataframe holds one line of a typical Apache combined log.
import org.apache.spark.sql.functions.regexp_extract

// show() returns Unit, so keep the DataFrame and the display step separate
val df = spark.sparkContext.textFile("apachelog.log").toDF
df.show(1, false)
Result:
+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|66.249.69.97 - - [24/Sep/2014:22:25:44 +0000] "GET /071300/242153 HTTP/1.1" 404 514 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 1 row
Desired final output:
+-------------+------------+-----------+-------------------+-------------+------+--------+--------------------+-------------+--------+-------+--------------------+
|client_identd|content_size| date| endpoint| ip_address|method|protocol| referer|response_code| time|user_id| useragent|
+-------------+------------+-----------+-------------------+-------------+------+--------+--------------------+-------------+--------+-------+--------------------+
| -| 514|24/Sep/2014| /071300/222153| 65.156.23.76| GET|HTTP/1.1| "-"| 404|10:22:14| -|Mozilla/9.9 (comp...|
+-------------+------------+-----------+-------------------+-------------+------+--------+--------------------+-------------+--------+-------+--------------------+
Ideally, you would use the RDD API for unstructured data like this.
Reading a text file returns an RDD[String], which you can map over with plain Scala functions to transform (restructure) each line.
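For instance, here is a minimal sketch of that RDD route (the regex, the flatMap-based handling of malformed lines, and the column names, taken from your desired output, are my own illustration, not part of the original answer):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("log-parse").master("local[2]").getOrCreate()
import spark.implicits._

// Combined-log-format pattern (an assumption); one capture group per target column
val logLine = """^(\S+) (\S+) (\S+) \[([^:]+):(\S+) [^\]]+\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$""".r

val parsed = spark.sparkContext
  .textFile("apachelog.log")   // RDD[String]: one raw log line per element
  .flatMap {
    case logLine(ip, identd, user, date, time, method, endpoint,
                 protocol, code, size, referer, agent) =>
      Some((ip, identd, user, date, time, method, endpoint,
            protocol, code, size, referer, agent))
    case _ => None             // skip malformed lines instead of failing the job
  }

parsed
  .toDF("ip_address", "client_identd", "user_id", "date", "time", "method",
        "endpoint", "protocol", "response_code", "content_size", "referer", "useragent")
  .show(1, false)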
Applied to your sample line, you can use something like this:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("test").master("local[2]").getOrCreate()
import spark.implicits._

val testDf = Seq("66.249.69.97 - - [24/Sep/2014:22:25:44 +0000] \"GET /071300/242153 HTTP/1.1\" 404 514 \"-\" \"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)\"").toDF

testDf.map { r =>
  // Split the raw line on "-"; for this sample that produces four segments
  val parts = r.getString(0).split("-")
  (parts(0), parts(1), parts(2), parts(3))
}.show
Result: each line comes back as a four-column row (_1 to _4), one column per split segment.
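Note that splitting on "-" happens to yield four segments for this sample, but it is brittle: any extra dash in the endpoint or user agent shifts every field. Since your question already imports regexp_extract, here is an alternative DataFrame-only sketch (the pattern and the column list are my assumptions, based on the combined log format and your desired output, not code from the answer above):

import org.apache.spark.sql.functions.{col, regexp_extract}

// Combined-log-format pattern (an assumption); capture group i feeds column i
val pattern = """^(\S+) (\S+) (\S+) \[([^:]+):(\S+) [^\]]+\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$"""

val columns = Seq("ip_address", "client_identd", "user_id", "date", "time", "method",
  "endpoint", "protocol", "response_code", "content_size", "referer", "useragent")

// One regexp_extract per column, each pulling its capture group out of `value`
val parsedDf = testDf.select(
  columns.zipWithIndex.map { case (name, i) =>
    regexp_extract(col("value"), pattern, i + 1).as(name)
  }: _*
)
parsedDf.show(1, false)

regexp_extract returns an empty string when the pattern does not match, so malformed lines become rows of empty columns rather than runtime errors.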
Even though your question says you only want to parse the log with Apache Spark, I will also offer an alternative workable solution, using the following configuration: