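I am trying to find the login start time, the login end time, and the total time spent on a web page in seconds, using the lead and datediff functions in Hive.
Sample dataset:
id | login_time |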
---|---|
1 | 2023-05-03 00:20:37.000 |
1 | 2023-05-03 00:20:51.000 |
2 | 2023-05-03 15:42:31.000 |
Query used:
with temp as
(
select id,
login_time as start_time
from login_table
)
select id,
start_time,
lead(start_time) over(partition by id) as end_time,
datediff('second', start_time, lead(start_time) over (partition by id)) as time_spent_on_page
from temp
But I get this error:
Invalid number of arguments in datediff. Expected 2, found 3
Please advise how to find the time difference in seconds between two values in Hive.
Expected output:
id | start_time | end_time | time_spent_on_page |
---|---|---|---|
1 | 2023-05-03 00:20:37.000 | 2023-05-03 00:20:51.000 | 14 |
1 | 2023-05-03 00:20:51.000 | | |
2 | 2023-05-03 12:00:54.000 | 2023-05-03 12:01:09.000 | 15 |
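Hive's datediff accepts two date strings and returns the difference between them in days. It has no unit argument, which is why the three-argument call is rejected. To get the difference in seconds between two datetime strings, we can convert each one to epoch seconds with unix_timestamp and subtract. The window also needs an order by login_time so that lead returns the next login for each id: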
select id,
login_time as start_time,
lead(login_time) over(partition by id order by login_time) as end_time,
unix_timestamp(lead(login_time) over (partition by id order by login_time))
- unix_timestamp(login_time) as time_spent_on_page
from login_table
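
To try the approach without touching the real table, the query can be run against the sample rows from the question. This is a minimal sketch, not a tested implementation: sample_logins is a hypothetical inline stand-in for login_table, the literals are cast to timestamp on the assumption that login_time is a timestamp (or a string in that format), and it assumes a Hive version that allows select without a from clause and union all inside a CTE.

```sql
-- Hypothetical inline data standing in for login_table (rows taken from the question's sample).
with sample_logins as (
  select 1 as id, cast('2023-05-03 00:20:37' as timestamp) as login_time
  union all
  select 1 as id, cast('2023-05-03 00:20:51' as timestamp) as login_time
  union all
  select 2 as id, cast('2023-05-03 15:42:31' as timestamp) as login_time
)
select id,
       login_time as start_time,
       -- next login for the same id, in chronological order
       lead(login_time) over (partition by id order by login_time) as end_time,
       -- both sides converted to epoch seconds, so the difference is in seconds
       unix_timestamp(lead(login_time) over (partition by id order by login_time))
         - unix_timestamp(login_time) as time_spent_on_page
from sample_logins;
```

For the first row of id 1 this should give 14 (00:20:51 minus 00:20:37). The last row of each id has no following login, so lead returns NULL and time_spent_on_page is NULL as well, which matches the empty cells in the expected output.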