PySpark date_trunc changes the time zone: how to prevent the shift?

Question (votes: 0, answers: 1)

Context: I am using the date_trunc function imported from pyspark.sql.functions to truncate timestamps to the minute.

from pyspark.sql.functions import date_trunc

df_truncated = df.withColumn('dt', date_trunc('minute', df["timestamp"]))
df_truncated.show(truncate=False)

The output looks like this:

+------------------------+-------------------+
|timestamp               |dt                 |
+------------------------+-------------------+
|2020-01-02T00:30:47.178Z|2020-01-02 02:30:00|
|2020-01-02T00:30:47.160Z|2020-01-02 02:30:00|
|2020-01-02T00:30:46.327Z|2020-01-02 02:30:00|
|2020-01-02T00:30:45.003Z|2020-01-02 02:30:00|
|2020-01-02T00:30:44.054Z|2020-01-02 02:30:00|
+------------------------+-------------------+

Problem: the issue is that two hours are "added" to the original timestamp, i.e. it is converted from UTC to local time.

Question: how can I avoid this? Do I need to truncate the timestamp manually, or does the date_trunc function have some undocumented parameter? Or do I need to touch a global Spark setting, and if so, how, and which one?

date pyspark timestamp timezone truncation
1 Answer

0 votes
Here I am selecting a substring of the "timestamp" column: take everything up to the seconds and convert that to a timestamp.

df.withColumn("hour", F.to_timestamp(F.substring("timestamp_value", 0, 19), "yyyy-MM-dd'T'HH:mm:ss")).show()

+-------------------------+-------------------+
|timestamp                |hour               |
+-------------------------+-------------------+
|2017-08-01T14:30:00+05:30|2017-08-01 14:30:00|
|2017-08-01T14:30:00+06:30|2017-08-01 14:30:00|
|2017-08-01T14:30:00+07:30|2017-08-01 14:30:00|
+-------------------------+-------------------+

For more techniques, you can refer to the link: Link

© www.soinside.com 2019 - 2024. All rights reserved.