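I am trying to convert a column containing date values in string format to timestamp format in Apache Spark with Scala.
Below is the content of the DataFrame (retailsNullRem):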
+---------+---------+--------------+----------+
|InvoiceNo|StockCode|   InvoiceDate|customerID|
+---------+---------+--------------+----------+
|   536365|   85123A|12/1/2010 8:26|     17850|
|   536365|    71053|12/1/2010 8:26|     17850|
|   536365|   84406B|12/1/2010 8:26|     17850|
|   536365|   84029G|12/1/2010 8:26|     17850|
|   536365|   84029E|12/1/2010 8:26|     17850|
|   536365|    22752|12/1/2010 8:26|     17850|
|   536365|    21730|12/1/2010 8:26|     17850|
|   536366|    22633|12/1/2010 8:28|     17850|
|   536366|    22632|12/1/2010 8:28|     17850|
|   536367|    84879|12/1/2010 8:34|     13047|
+---------+---------+--------------+----------+
"InvoiceDate" is the column I want to convert to a timestamp. I tried the following code for the conversion.
val timeFmt = "MM/dd/yyyy HH:mm"
val retails = retailsNullRem
  .withColumn("InvoiceDateTS", to_timestamp(col("InvoiceDate"), timeFmt))
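The data source states that the date format is month/day/year hour:minute, but the code above returns null for the InvoiceDateTS column. I even tried a format like "%M/%d/%y %H:%m", since in some cases the month, day, and hour have no leading zero, but I still get null. Please point out what I am missing.
Here is the sample output: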
+---------+---------+--------------+----------+-------------+
|InvoiceNo|StockCode|   InvoiceDate|customerID|InvoiceDateTS|
+---------+---------+--------------+----------+-------------+
|   536365|   85123A|12/1/2010 8:26|     17850|         null|
|   536365|    71053|12/1/2010 8:26|     17850|         null|
|   536365|   84406B|12/1/2010 8:26|     17850|         null|
|   536365|   84029G|12/1/2010 8:26|     17850|         null|
|   536365|   84029E|12/1/2010 8:26|     17850|         null|
|   536365|    22752|12/1/2010 8:26|     17850|         null|
|   536365|    21730|12/1/2010 8:26|     17850|         null|
|   536366|    22633|12/1/2010 8:28|     17850|         null|
|   536366|    22632|12/1/2010 8:28|     17850|         null|
|   536367|    84879|12/1/2010 8:34|     13047|         null|
+---------+---------+--------------+----------+-------------+
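I'm not sure why it isn't working for you. I tried the following: a plain cast to timestamp returns null, but to_timestamp with your format string works.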
import spark.implicits._
scala> val df=Seq("12/1/2010 8:26", "12/1/2010 8:29").toDF("t")
df: org.apache.spark.sql.DataFrame = [t: string]
scala> df.withColumn("s",col("t").cast("timestamp")).show
+--------------+----+
|             t|   s|
+--------------+----+
|12/1/2010 8:26|null|
|12/1/2010 8:29|null|
+--------------+----+
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> df.withColumn("s",to_timestamp(col("t"),"MM/dd/yyyy HH:mm")).show
+--------------+-------------------+
|             t|                  s|
+--------------+-------------------+
|12/1/2010 8:26|2010-12-01 08:26:00|
|12/1/2010 8:29|2010-12-01 08:29:00|
+--------------+-------------------+
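Whether the zero-padded pattern accepts a value like 12/1/2010 depends on the parser behind to_timestamp. The session above accepts it. A strict java.time parser, which as far as I know Spark 3.0 and later use by default, rejects it, and single-letter fields (M, d, H) are needed instead. The following is a minimal sketch in plain Scala with no Spark involved; `PatternCheck` and `parse` are names made up for illustration.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

object PatternCheck {
  // Parse `s` with `pattern` using java.time; None when the text does not match.
  def parse(s: String, pattern: String): Option[LocalDateTime] =
    Try(LocalDateTime.parse(s, DateTimeFormatter.ofPattern(pattern))).toOption

  def main(args: Array[String]): Unit = {
    val sample = "12/1/2010 8:26"

    // "dd" and "HH" demand two digits, so "1" and "8" do not match.
    println(parse(sample, "MM/dd/yyyy HH:mm"))

    // Single letters accept one or two digits.
    println(parse(sample, "M/d/yyyy H:mm"))
  }
}
```

If the same strictness applies in your Spark version, passing "M/d/yyyy H:mm" to to_timestamp is worth trying before suspecting the data.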
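Maybe there is a problem with the data in your file. I tried the same thing with your own data and it worked fine. You can use either the DataFrame functions or Spark SQL.
Your data file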
InvoiceNo,StockCode,InvoiceDate,customerID
536365,85123A,12/1/2010 8:26,17850
536365,71053,12/1/2010 8:26,17850
536365,84406B,12/1/2010 8:26,17850
536365,84029G,12/1/2010 8:26,17850
536365,84029E,12/1/2010 8:26,17850
536365,22752,12/1/2010 8:26,17850
536365,21730,12/1/2010 8:26,17850
536366,22633,12/1/2010 8:28,17850
536366,22632,12/1/2010 8:28,17850
536367,84879,12/1/2010 8:34,13047
Code in IntelliJ
val df = sqlContext
  .read
  .option("header", true)
  .option("inferSchema", true)
  .csv("/home/cloudera/files/tests/timestamp.csv")
  .cache()
df.show(5, truncate = false)
df.printSchema()
import org.apache.spark.sql.functions._
// You can try this with dataframe functions
val retails = df
  .withColumn("InvoiceDateTS", to_timestamp(col("InvoiceDate"), "MM/dd/yyyy HH:mm"))
retails.show(5, truncate = false)
retails.printSchema()
// or sparkSQL
df.createOrReplaceTempView("df")
val retailsSQL = sqlContext.sql(
  """
    |SELECT InvoiceNo,StockCode,InvoiceDate,customerID, TO_TIMESTAMP(InvoiceDate,"MM/dd/yyyy HH:mm") AS InvoiceDateTS
    |FROM df
    |""".stripMargin)
retailsSQL.show(5, truncate = false)
retailsSQL.printSchema()
Output
+---------+---------+--------------+----------+
|InvoiceNo|StockCode|InvoiceDate   |customerID|
+---------+---------+--------------+----------+
|536365   |85123A   |12/1/2010 8:26|17850     |
|536365   |71053    |12/1/2010 8:26|17850     |
+---------+---------+--------------+----------+
only showing top 2 rows
root
|-- InvoiceNo: integer (nullable = true)
|-- StockCode: string (nullable = true)
|-- InvoiceDate: string (nullable = true)
|-- customerID: integer (nullable = true)
+---------+---------+--------------+----------+-------------------+
|InvoiceNo|StockCode|InvoiceDate   |customerID|InvoiceDateTS      |
+---------+---------+--------------+----------+-------------------+
|536365   |85123A   |12/1/2010 8:26|17850     |2010-12-01 08:26:00|
|536365   |71053    |12/1/2010 8:26|17850     |2010-12-01 08:26:00|
+---------+---------+--------------+----------+-------------------+
only showing top 2 rows
root
|-- InvoiceNo: integer (nullable = true)
|-- StockCode: string (nullable = true)
|-- InvoiceDate: string (nullable = true)
|-- customerID: integer (nullable = true)
|-- InvoiceDateTS: timestamp (nullable = true)
+---------+---------+--------------+----------+-------------------+
|InvoiceNo|StockCode|InvoiceDate   |customerID|InvoiceDateTS      |
+---------+---------+--------------+----------+-------------------+
|536365   |85123A   |12/1/2010 8:26|17850     |2010-12-01 08:26:00|
|536365   |71053    |12/1/2010 8:26|17850     |2010-12-01 08:26:00|
+---------+---------+--------------+----------+-------------------+
only showing top 2 rows
root
|-- InvoiceNo: integer (nullable = true)
|-- StockCode: string (nullable = true)
|-- InvoiceDate: string (nullable = true)
|-- customerID: integer (nullable = true)
|-- InvoiceDateTS: timestamp (nullable = true)
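If the file data is the suspect, two things worth ruling out are stray whitespace around the value and the zero-padding assumption in the pattern. Below is a minimal sketch, assuming Spark 2.2 or later is on the classpath and a local SparkSession is acceptable. The sample rows, the `parsed` column, and the `TrimAndParse` / `withParsed` names are made up for illustration and are not from the original data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, to_timestamp, trim}

object TrimAndParse {
  // Adds a `parsed` timestamp column (hypothetical name) built from a string column.
  // The value is trimmed first, and the pattern does not require leading zeros.
  def withParsed(df: DataFrame, source: String): DataFrame =
    df.withColumn("parsed", to_timestamp(trim(col(source)), "M/d/yyyy H:mm"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("TrimAndParse")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows: the second value carries stray whitespace.
    val df = Seq("12/1/2010 8:26", " 12/1/2010 8:28 ").toDF("InvoiceDate")

    val parsed = withParsed(df, "InvoiceDate")
    parsed.show(truncate = false)

    // Rows whose value is present but did not parse are the ones to inspect.
    val failed = parsed.filter(col("InvoiceDate").isNotNull && col("parsed").isNull)
    println(s"unparsed rows: ${failed.count()}")

    spark.stop()
  }
}
```

Running the same filter against the real DataFrame should show whether every row fails, which points at the pattern, or only some, which points at the data.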