Google BigQuery: build a table with every day of the year from a table that covers only a few days

I have this (sample) table:

+------------+-------------------+-----------+
|    Date    |       User        | Attribute |
+------------+-------------------+-----------+
| 2019-01-01 | [email protected] | apple     |
| 2019-02-01 | [email protected] | pear      |
| 2019-03-01 | [email protected] | carrot    |
| 2019-03-01 | [email protected] | orange    |
+------------+-------------------+-----------+

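I need to create the full set of (date, user) pairs, filling in every missing day of 2019 (with attribute set to null).

My example has 2 distinct users, so the result table should be: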

+------------+-------------------+-----------+
|    Date    |       User        | Attribute |
+------------+-------------------+-----------+
| 2019-01-01 | [email protected] | apple     |
| ...        | [email protected] | null      |
| 2019-03-01 | [email protected] | carrot    |
| ...        | [email protected] | null      |
| 2019-12-31 | [email protected] | null      |
| 2019-01-01 | [email protected] | null      |
| ...        | [email protected] | null      |
| 2019-02-01 | [email protected] | pear      |
| ...        | [email protected] | null      |
| 2019-03-01 | [email protected] | orange    |
| ...        | [email protected] | null      |
| 2019-12-31 | [email protected] | null      |
+------------+-------------------+-----------+

Here ... means there is one row for every day of the year: attribute has a value where the source table provides one, and is null otherwise.

As a first step, to create all (date, user) combinations, I thought of using the bigquery-public-data.utility_eu.date_greg table with a CROSS JOIN to generate all the required rows.

Here is the sample table:

#standardSQL
WITH sample AS (
  SELECT DATE('2019-01-01') date, '[email protected]' user, 'apple' attribute
  UNION ALL
  SELECT DATE('2019-02-01'), '[email protected]', 'pear'
  UNION ALL
  SELECT DATE('2019-03-01'), '[email protected]', 'carrot'
  UNION ALL
  SELECT DATE('2019-03-01'), '[email protected]', 'orange'
)

This is the first query I tried:

SELECT d.date,s.* EXCEPT(date)
FROM sample s
  CROSS JOIN `bigquery-public-data.utility_eu.date_greg` d 
WHERE d.year = 2019
ORDER BY date,user

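But this returns too many rows: because the attribute column is carried through the join, each value is repeated on every day of the year, including days unrelated to its original date.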
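To make the over-counting concrete, here is a minimal sketch that compares the row count of the naive CROSS JOIN with the row count actually wanted. The user names user_a and user_b are hypothetical stand-ins for the two distinct users (their assignment to rows follows the expected result table above), and the dates CTE built with GENERATE_DATE_ARRAY is an assumption used only to keep the query self-contained.

```sql
#standardSQL
-- Hypothetical inline data: user_a / user_b are placeholder user names.
WITH sample AS (
  SELECT DATE '2019-01-01' AS date, 'user_a' AS user, 'apple' AS attribute UNION ALL
  SELECT DATE '2019-02-01', 'user_b', 'pear' UNION ALL
  SELECT DATE '2019-03-01', 'user_a', 'carrot' UNION ALL
  SELECT DATE '2019-03-01', 'user_b', 'orange'
), dates AS (
  -- One row per calendar day of 2019 (365 days).
  SELECT day
  FROM UNNEST(GENERATE_DATE_ARRAY(DATE '2019-01-01', DATE '2019-12-31')) AS day
)
SELECT
  -- Every source row is paired with every day: 4 * 365 = 1460.
  (SELECT COUNT(*) FROM sample AS s CROSS JOIN dates AS d) AS naive_rows,
  -- Only distinct users are paired with every day: 2 * 365 = 730.
  (SELECT COUNT(*)
   FROM (SELECT DISTINCT user FROM sample) AS u
   CROSS JOIN dates AS d) AS wanted_rows
```

With this data the naive join yields 1460 rows while the target is 730, which is why the users need to be deduplicated before the cross join.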

I think I need some kind of DISTINCT to get the unique (date, user) pairs first, and then attach the attribute value only where one exists.

This is the first working solution I found:

-- appended to the WITH clause, right after the sample CTE
, distinct_couples AS (
  SELECT DISTINCT d.date,s.user
  FROM sample s CROSS JOIN `bigquery-public-data.utility_eu.date_greg` d 
  WHERE d.year = 2019
)

SELECT d.*, s.attribute
FROM distinct_couples d
  LEFT JOIN sample s USING(date,user)
ORDER BY date,user

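But I am joining with sample twice (first in the temporary table, then again in the main query), so I am trying to understand whether this can be optimized.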

Do you have any suggestions on how to make this more efficient? Thanks

google-bigquery cartesian-product cross-join
1 Answer

The following is for BigQuery Standard SQL:

#standardSQL
WITH users AS (
  SELECT DISTINCT user
  FROM `project.dataset.sample`
)
SELECT d.date, u.user, s.attribute
FROM `bigquery-public-data.utility_eu.date_greg` d  
CROSS JOIN users u
LEFT JOIN `project.dataset.sample` s
ON s.date = d.date
AND s.user = u.user
WHERE d.year = 2019

As a side note: you don't need any extra dates table, because you can generate the dates on the fly, as in the example below:

#standardSQL
WITH users AS (
  SELECT DISTINCT user
  FROM `project.dataset.sample`
), dates AS (
  SELECT `date` 
  FROM UNNEST(GENERATE_DATE_ARRAY('2019-01-01', '2019-12-31')) `date`
)
SELECT d.date, u.user, s.attribute
FROM dates d  
CROSS JOIN users u
LEFT JOIN `project.dataset.sample` s
ON s.date = d.date
AND s.user = u.user
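
For anyone who wants to try this without creating a table first, the following is a self-contained sketch of the same approach with the sample data inlined. The user names user_a and user_b are hypothetical placeholders for the two distinct users, and the ORDER BY is added only so the output reads like the result table in the question.

```sql
#standardSQL
-- Hypothetical inline data: user_a / user_b are placeholder user names.
WITH sample AS (
  SELECT DATE '2019-01-01' AS date, 'user_a' AS user, 'apple' AS attribute UNION ALL
  SELECT DATE '2019-02-01', 'user_b', 'pear' UNION ALL
  SELECT DATE '2019-03-01', 'user_a', 'carrot' UNION ALL
  SELECT DATE '2019-03-01', 'user_b', 'orange'
), users AS (
  -- Deduplicate users before the cross join.
  SELECT DISTINCT user
  FROM sample
), dates AS (
  -- One row per calendar day of 2019.
  SELECT day AS date
  FROM UNNEST(GENERATE_DATE_ARRAY(DATE '2019-01-01', DATE '2019-12-31')) AS day
)
SELECT d.date, u.user, s.attribute
FROM dates d
CROSS JOIN users u
-- Attach the attribute only where the source has a row for that day and user.
LEFT JOIN sample s
  ON s.date = d.date
 AND s.user = u.user
ORDER BY u.user, d.date
```

With this data the query returns 730 rows (2 users times 365 days), four of which carry a non-null attribute. This assumes each (date, user) pair appears at most once in sample; duplicates in the source would produce extra rows.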