How to split a dataframe - Scala Spark

Problem description (votes: 0, answers: 3)

I have the following dataframe, and I want to split it into columns using Spark's Dataset API.

How can I do this?

The data in the dataframe is a single line from a typical Apache combined log.

import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._ // needed for .toDF on an RDD[String]
val df = spark.sparkContext.textFile("apachelog.log").toDF // single "value" column
df.show(1, false) // show() returns Unit, so keep it separate from the val

Result:

+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                             |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|66.249.69.97 - - [24/Sep/2014:22:25:44 +0000] "GET /071300/242153 HTTP/1.1" 404 514 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 1 row

Desired output (final result):

+-------------+------------+-----------+-------------------+-------------+------+--------+--------------------+-------------+--------+-------+--------------------+
|client_identd|content_size|       date|           endpoint|   ip_address|method|protocol|             referer|response_code|    time|user_id|           useragent|
+-------------+------------+-----------+-------------------+-------------+------+--------+--------------------+-------------+--------+-------+--------------------+
|            -|         514|24/Sep/2014|     /071300/222153| 65.156.23.76|   GET|HTTP/1.1|                 "-"|          404|10:22:14|      -|Mozilla/9.9 (comp...|
+-------------+------------+-----------+-------------------+-------------+------+--------+--------------------+-------------+--------+-------+--------------------+
Tags: regex, scala, apache-spark, parsing, apache-spark-dataset
3 Answers

0 votes

Ideally, you should use the RDD API to process unstructured data like this.

Reading the text file returns an RDD[String], which you can map with plain Scala functions to transform each line into a structured record.
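For the combined log format shown in the question, a minimal sketch of that approach could look like the following; the regular expression, column names, and file path are illustrative assumptions, not part of the original answer:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parse-logs").master("local[2]").getOrCreate()
import spark.implicits._

// One capture group per field of the combined log format (illustrative pattern).
val logPattern = """^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$""".r

val parsed = spark.sparkContext
  .textFile("apachelog.log")
  .flatMap {
    case logPattern(ip, identd, user, ts, method, endpoint, protocol, code, size, referer, agent) =>
      Some((ip, identd, user, ts, method, endpoint, protocol, code, size, referer, agent))
    case _ => None // drop lines that do not match the pattern
  }
  .toDF("ip_address", "client_identd", "user_id", "time", "method",
        "endpoint", "protocol", "response_code", "content_size", "referer", "useragent")

parsed.show(1, false)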


0 votes

You can use something like this:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("test").master("local[2]").getOrCreate()

import spark.implicits._

val testDf = Seq("66.249.69.97 - - [24/Sep/2014:22:25:44 +0000] \"GET /071300/242153 HTTP/1.1\" 404 514 \"-\" \"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)\"").toDF

// Split each line on "-" and return the pieces as columns of a tuple.
testDf.map(r => {
  val splittedString = r.getString(0).split("-")
  (splittedString(0), splittedString(1), splittedString(2), splittedString(3))
}).show

Result: (screenshot of the resulting DataFrame from the original answer not reproduced here)


-2 votes

Even though your question says you only want to parse the logs with Apache Spark, I will offer an alternative, workable solution using the following setup:

  1. Configure Logstash from the Elastic Stack (configuration example)
  2. Take the Logstash output and send it to Apache Kafka (example)
  3. Then read the stream with Spark's Kafka integration (spark kafka doc), as sketched below
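A minimal sketch of step 3, reading the Kafka topic with Spark Structured Streaming (this needs the spark-sql-kafka-0-10 package on the classpath); the broker address and topic name are placeholder assumptions, not values from the answer:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("log-stream").getOrCreate()
import spark.implicits._

// Subscribe to the Kafka topic that Logstash writes to.
// Broker address and topic name are placeholders.
val logLines = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "apache-logs")
  .load()
  .selectExpr("CAST(value AS STRING) AS value") // Kafka values arrive as bytes

// From here the same parsing logic as in the other answers can be applied
// before writing the stream out, e.g. to the console for debugging.
val query = logLines.writeStream
  .format("console")
  .option("truncate", "false")
  .start()

query.awaitTermination()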