to_timestamp returns null when converting a string column to a timestamp

Question

I am trying to convert a column that holds date values as strings into timestamp format in Apache Spark (Scala).

Here is the content of the DataFrame (retailsNullRem):

+---------+---------+--------------+----------+
|InvoiceNo|StockCode|   InvoiceDate|customerID|
+---------+---------+--------------+----------+
|   536365|   85123A|12/1/2010 8:26|     17850|
|   536365|    71053|12/1/2010 8:26|     17850|
|   536365|   84406B|12/1/2010 8:26|     17850|
|   536365|   84029G|12/1/2010 8:26|     17850|
|   536365|   84029E|12/1/2010 8:26|     17850|
|   536365|    22752|12/1/2010 8:26|     17850|
|   536365|    21730|12/1/2010 8:26|     17850|
|   536366|    22633|12/1/2010 8:28|     17850|
|   536366|    22632|12/1/2010 8:28|     17850|
|   536367|    84879|12/1/2010 8:34|     13047|
+---------+---------+--------------+----------+

"InvoiceDate" is the column I want to convert to a timestamp. I tried the following code for the conversion.

import org.apache.spark.sql.functions.{col, to_timestamp}

val timeFmt = "MM/dd/yyyy HH:mm"
val retails = retailsNullRem
  .withColumn("InvoiceDateTS", to_timestamp(col("InvoiceDate"), timeFmt))

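The data source documents the date format as month/day/year hour:minute. However, the code above returns null for the InvoiceDateTS column. I even tried a format like "%M/%d/%y %H:%m", because in some rows the month, day, and hour have no leading zero, but the result is still null. Please point out what I am missing.

Here is the sample output: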

+---------+---------+--------------+----------+-------------+
|InvoiceNo|StockCode|   InvoiceDate|customerID|InvoiceDateTS|
+---------+---------+--------------+----------+-------------+
|   536365|   85123A|12/1/2010 8:26|     17850|         null|
|   536365|    71053|12/1/2010 8:26|     17850|         null|
|   536365|   84406B|12/1/2010 8:26|     17850|         null|
|   536365|   84029G|12/1/2010 8:26|     17850|         null|
|   536365|   84029E|12/1/2010 8:26|     17850|         null|
|   536365|    22752|12/1/2010 8:26|     17850|         null|
|   536365|    21730|12/1/2010 8:26|     17850|         null|
|   536366|    22633|12/1/2010 8:28|     17850|         null|
|   536366|    22632|12/1/2010 8:28|     17850|         null|
|   536367|    84879|12/1/2010 8:34|     13047|         null|
+---------+---------+--------------+----------+-------------+

Answer 1

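I am not sure why it does not work for you. I tried the following: a plain cast to timestamp returns null, but to_timestamp with the same format string works.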

import spark.implicits._

scala> val df=Seq("12/1/2010 8:26", "12/1/2010 8:29").toDF("t")
df: org.apache.spark.sql.DataFrame = [t: string]

scala> df.withColumn("s",col("t").cast("timestamp")).show
+--------------+----+
|             t|   s|
+--------------+----+
|12/1/2010 8:26|null|
|12/1/2010 8:29|null|
+--------------+----+


scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> df.withColumn("s",to_timestamp(col("t"),"MM/dd/yyyy HH:mm")).show
+--------------+-------------------+
|             t|                  s|
+--------------+-------------------+
|12/1/2010 8:26|2010-12-01 08:26:00|
|12/1/2010 8:29|2010-12-01 08:29:00|
+--------------+-------------------+
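
One possible reason for the different results, offered as an assumption and not a confirmed diagnosis: the session above appears to run on a Spark version whose parser tolerates single-digit values under "dd" and "HH", whereas Spark 3.x parses patterns with java.time rules, where a two-letter field requires exactly two digits. The sketch below uses plain java.time (not Spark) to show that difference on the sample value; the object and method names are hypothetical.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

object PatternWidthDemo {
  // Returns Some(value) when the input matches the pattern, None otherwise.
  def parse(value: String, pattern: String): Option[LocalDateTime] =
    Try(LocalDateTime.parse(value, DateTimeFormatter.ofPattern(pattern))).toOption

  def main(args: Array[String]): Unit = {
    val sample = "12/1/2010 8:26"

    // "dd" and "HH" demand two digits, so "1" and "8" do not match.
    println(parse(sample, "MM/dd/yyyy HH:mm")) // None

    // Single letters accept one or two digits.
    println(parse(sample, "M/d/yyyy H:mm"))    // Some(2010-12-01T08:26)
  }
}
```

If that is the cause, passing "M/d/yyyy H:mm" to to_timestamp should be worth trying, since single-letter fields also accept two-digit values such as the month 12.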

Answer 2

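There may be a problem with the data in your file. I tried the same thing with your own data and it worked fine. You can use either DataFrame functions or Spark SQL.

Your data file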

InvoiceNo,StockCode,InvoiceDate,customerID
536365,85123A,12/1/2010 8:26,17850
536365,71053,12/1/2010 8:26,17850
536365,84406B,12/1/2010 8:26,17850
536365,84029G,12/1/2010 8:26,17850
536365,84029E,12/1/2010 8:26,17850
536365,22752,12/1/2010 8:26,17850
536365,21730,12/1/2010 8:26,17850
536366,22633,12/1/2010 8:28,17850
536366,22632,12/1/2010 8:28,17850
536367,84879,12/1/2010 8:34,13047

Code in IntelliJ

      val df = sqlContext
        .read
        .option("header", true)
        .option("inferSchema", true)
        .csv("/home/cloudera/files/tests/timestamp.csv")
        .cache()

      df.show(5, truncate = false)
      df.printSchema()

      import org.apache.spark.sql.functions._
      // You can try this with dataframe functions
      val retails = df
        .withColumn("InvoiceDateTS", to_timestamp(col("InvoiceDate"), "MM/dd/yyyy HH:mm"))

      retails.show(5, truncate = false)
      retails.printSchema()

      // or sparkSQL
      df.createOrReplaceTempView("df")
      val retailsSQL = sqlContext.sql(
        """
          |SELECT InvoiceNo,StockCode,InvoiceDate,customerID, TO_TIMESTAMP(InvoiceDate,"MM/dd/yyyy HH:mm") AS InvoiceDateTS
          |FROM df
          |""".stripMargin)

      retailsSQL.show(5,truncate = false)
      retailsSQL.printSchema()

Output

+---------+---------+--------------+----------+
|InvoiceNo|StockCode|InvoiceDate   |customerID|
+---------+---------+--------------+----------+
|536365   |85123A   |12/1/2010 8:26|17850     |
|536365   |71053    |12/1/2010 8:26|17850     |
+---------+---------+--------------+----------+
only showing top 2 rows

root
 |-- InvoiceNo: integer (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- customerID: integer (nullable = true)

+---------+---------+--------------+----------+-------------------+
|InvoiceNo|StockCode|InvoiceDate   |customerID|InvoiceDateTS      |
+---------+---------+--------------+----------+-------------------+
|536365   |85123A   |12/1/2010 8:26|17850     |2010-12-01 08:26:00|
|536365   |71053    |12/1/2010 8:26|17850     |2010-12-01 08:26:00|
+---------+---------+--------------+----------+-------------------+
only showing top 2 rows

root
 |-- InvoiceNo: integer (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- customerID: integer (nullable = true)
 |-- InvoiceDateTS: timestamp (nullable = true)

+---------+---------+--------------+----------+-------------------+
|InvoiceNo|StockCode|InvoiceDate   |customerID|InvoiceDateTS      |
+---------+---------+--------------+----------+-------------------+
|536365   |85123A   |12/1/2010 8:26|17850     |2010-12-01 08:26:00|
|536365   |71053    |12/1/2010 8:26|17850     |2010-12-01 08:26:00|
+---------+---------+--------------+----------+-------------------+
only showing top 2 rows

root
 |-- InvoiceNo: integer (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- customerID: integer (nullable = true)
 |-- InvoiceDateTS: timestamp (nullable = true)
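
If the file data is the suspect, a rough way to check is to list the rows where the source string is present but the parsed timestamp is null. The sketch below is only an illustration under assumptions: the object and helper names are hypothetical, the padded sample value is made up, trim is applied in case of stray whitespace, and the single-letter pattern "M/d/yyyy H:mm" is used so that values without leading zeros are accepted. It also assumes a configuration in which a failed parse yields null, as in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, to_timestamp, trim}

object InvoiceDateCheck {
  // Adds InvoiceDateTS, trimming the source string before parsing.
  def withInvoiceTimestamp(df: DataFrame): DataFrame =
    df.withColumn(
      "InvoiceDateTS",
      to_timestamp(trim(col("InvoiceDate")), "M/d/yyyy H:mm"))

  // Rows that had a source value but did not produce a timestamp.
  def unparsedRows(df: DataFrame): DataFrame =
    df.filter(col("InvoiceDate").isNotNull && col("InvoiceDateTS").isNull)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("invoice-date-check")
      .getOrCreate()
    import spark.implicits._

    // Sample values from the question, plus one hypothetical padded value.
    val df = Seq("12/1/2010 8:26", "12/1/2010 8:28", " 12/1/2010 8:34 ")
      .toDF("InvoiceDate")

    val parsed = withInvoiceTimestamp(df)
    parsed.show(truncate = false)

    // Any rows listed here point at values that need a closer look.
    unparsedRows(parsed).show(truncate = false)

    spark.stop()
  }
}
```

Running the same two helpers against the real file, in place of the inline sample, would show whether specific rows are responsible for the nulls.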
