I'm trying to determine what time of day Yelp check-ins occur most often

Question

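First row of data: Row(business_id='--1UhMGODdWsrMastO9DZw', date='2016-04-26 19:49:16, 2016-08-30 18:36:57, 2016-10-15 02:45:18, 2016-11-18 01:54:50, 2017-04-20 18:39:06, 2017-05-03 17:58:02')

My task is to create a variable hours_by_checkin_count. It should be a PySpark DataFrame, ordered by count and containing 24 rows. The DataFrame should contain these columns (in this order): hour (the hour of the day as an integer, with the hour after midnight being 0) and count (the number of check-ins that occurred in that hour).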

from pyspark.sql.functions import * 
checkin.select('business_id',datesplit('date').alias('dates')).withColumn('checkin_date',explode('dates'))
hours_by_checkin_count = checkin.withColumn('hour', hour('date')) \
    .groupBy('hour') \
    .count() \
    .orderBy('count', ascending=False)

hours_by_checkin_count = hours_by_checkin_count.limit(24)

Both my output and its length are incorrect. I expected hour 1 to be in the first row.

Answer 1

There are a few small issues in your code that cause it to produce the wrong output.

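PySpark has no in-place operations (unlike Pandas with inplace=True): every transformation returns a new DataFrame. In your second line of code the operation is therefore executed, but its result is not stored in a variable, so the next statement still works on the original checkin DataFrame.
I don't know exactly what your `datesplit` function does. From the context provided, it does not appear to be needed.
Structure your code so that you can debug it. That way you can print the intermediate DataFrames and check that each step shows the result you expect.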

Here is code that creates the required DataFrame, groups it, counts the rows in each group, and then sorts the result.

from pyspark.sql.functions import explode, hour, split

# The sample row shows 'date' as one comma-separated string, so split it into an array,
# then explode it into one row per check-in with 'business_id' and 'checkin_date' columns
checkin = checkin.select('business_id', explode(split('date', r',\s*')).alias('checkin_date'))

# Add a new 'hour' column to the 'checkin' DataFrame by extracting the hour from 'checkin_date'
checkin = checkin.withColumn('hour', hour('checkin_date'))

# Group the 'checkin' DataFrame by 'hour' and count the occurrences of each hour
# Also, order the results by 'hour'
hours_by_checkin_count = checkin.groupBy('hour').count().orderBy('hour')
