Is there a simplified SQL query to return the count and percentage of missing values in a table? (BigQuery)


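The BigQuery earthquake public dataset has 47 columns, most of which have missing values. I need an output that shows a summary, with column_name, total_entries, non_missing_entries, and percentage_missing as the columns of that table.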

Currently I am using the query shown here, repeating the block for all 47 columns:

SELECT
    'id' AS column_name,
    COUNT(id) AS non_missing_entries,
    (COUNT(*) - COUNT(id)) * 100.0 / COUNT(*) AS percentage_missing
FROM
    `youtube-factcheck.earthquake_analysis.earthquakes_copy`

UNION ALL

SELECT
    'flag_tsunami' AS column_name,
    COUNT(flag_tsunami) AS non_missing_entries,
    (COUNT(*) - COUNT(flag_tsunami)) * 100.0 / COUNT(*) AS percentage_missing
FROM
    `youtube-factcheck.earthquake_analysis.earthquakes_copy`

UNION ALL

-- Repeat the above block for other columns
-- ...

Output:

| column_name|non_missing_entries | percentage_missing|
| -----------| -------------------| ------------------|
|flag_tsunami|                1869|  70.20564323290291|
|          id|                6273|                  0|
|         ...|                 ...|                ...|
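
The percentage in the output can be checked by hand from the two counts. The sketch below is only an illustration in Python (the function name percentage_missing is made up here); it assumes the total row count is 6273, which follows from the id row showing 0 percent missing.

```python
def percentage_missing(total_entries: int, non_missing_entries: int) -> float:
    """Share of NULL entries, as a percentage of all rows.

    Same arithmetic as (COUNT(*) - COUNT(col)) * 100.0 / COUNT(*) in the query.
    """
    return (total_entries - non_missing_entries) * 100.0 / total_entries


# Assumption: id is never NULL, so its non-missing count equals the row count.
total = 6273

print(percentage_missing(total, 1869))  # flag_tsunami: roughly 70.2
print(percentage_missing(total, 6273))  # id: 0.0
```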

Is there a way in SQL to avoid the long, tedious work of writing 47 identical queries?

sql google-bigquery statistics missing-data calculation
1 Answer

UNPIVOT is your friend. (Note that I had to change the source table to bigquery-public-data.noaa_significant_earthquakes.earthquakes, because I don't have access to the one in the question.)

with cte as (
  -- UNPIVOT drops NULL values by default, so count(*) per column_name
  -- is the number of non-missing entries in that column
  select column_name,
         count(*) as non_missing_entries
  from (
    select *
    from (
      -- all unpivoted columns must share one type, so cast non-string columns to string
      select cast(id as string) as id,
             flag_tsunami,
             cast(year as string) as year,
             cast(month as string) as month,
             cast(day as string) as day,
             cast(hour as string) as hour,
             cast(minute as string) as minute,
             cast(second as string) as second
      from `bigquery-public-data.noaa_significant_earthquakes.earthquakes`)
      unpivot (value for column_name in (id, flag_tsunami, year, month, day, hour, minute, second))
  )
  group by column_name
),
id_only as (
  -- assumes id is never NULL, so its count doubles as the total row count
  select column_name, non_missing_entries
  from cte
  where column_name = 'id'
)
select cte.column_name,
       cte.non_missing_entries,
       (id_only.non_missing_entries - cte.non_missing_entries) * 100.0 / id_only.non_missing_entries as percentage_missing
from cte
cross join id_only;

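It returns one row per column named in the UNPIVOT list, with column_name, non_missing_entries, and percentage_missing.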
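
To make the logic of the query easier to follow, here is a small Python simulation of it. It is not BigQuery code: the sample rows are invented, None stands in for NULL, and the helper names unpivot_counts and missing_summary are hypothetical. It mirrors two things the query relies on: UNPIVOT excludes NULL values by default, and the id count is used as the total row count.

```python
from collections import Counter

# Invented sample rows; None stands in for SQL NULL.
rows = [
    {"id": "1", "flag_tsunami": "Tsu", "year": "2001"},
    {"id": "2", "flag_tsunami": None, "year": "2002"},
    {"id": "3", "flag_tsunami": None, "year": None},
    {"id": "4", "flag_tsunami": "Tsu", "year": "2004"},
]


def unpivot_counts(rows, columns):
    """Count non-NULL values per column, like UNPIVOT plus GROUP BY column_name."""
    counts = Counter()
    for row in rows:
        for col in columns:
            if row[col] is not None:  # UNPIVOT drops NULL values by default
                counts[col] += 1
    return counts


def missing_summary(rows, columns, key="id"):
    """Return (column_name, non_missing_entries, percentage_missing) tuples."""
    counts = unpivot_counts(rows, columns)
    total = counts[key]  # mirrors the id_only CTE: assumes `key` is never NULL
    return [
        (col, counts[col], (total - counts[col]) * 100.0 / total)
        for col in counts
    ]


summary = missing_summary(rows, ["id", "flag_tsunami", "year"])
for line in summary:
    print(line)
# ('id', 4, 0.0)
# ('flag_tsunami', 2, 50.0)
# ('year', 3, 25.0)
```

One consequence visible in the simulation: a column whose values are all NULL produces no unpivoted rows at all, so it would be absent from the result rather than reported as 100 percent missing.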

You have to:

  • list every column in the UNPIVOT operator
  • include all the columns in the inner SELECT, casting the integers to strings

But I think that beats UNIONing 47 times.
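
If typing out 47 casts by hand is still too tedious, the two manual steps listed above could be generated. The following is a minimal sketch, not part of the answer's method: build_missing_summary_sql is a hypothetical helper that only builds the query text from a list of column names (which you might copy from the table schema or from the dataset's INFORMATION_SCHEMA.COLUMNS view). Unlike the query above, it casts every column to string, including ones that are already strings, which is harmless. Column names are interpolated directly, so it should only be fed trusted names.

```python
from typing import Sequence


def build_missing_summary_sql(table: str, columns: Sequence[str], key: str = "id") -> str:
    """Build the UNPIVOT-based missing-value query as a string.

    `key` names a column assumed to have no NULLs; its count is used as the total.
    """
    casts = ",\n        ".join(f"cast({c} as string) as {c}" for c in columns)
    names = ", ".join(columns)
    return f"""with cte as (
  select column_name, count(*) as non_missing_entries
  from (
    select *
    from (
      select
        {casts}
      from `{table}`
    )
    unpivot (value for column_name in ({names}))
  )
  group by column_name
),
key_only as (
  select non_missing_entries as total_entries
  from cte
  where column_name = '{key}'
)
select cte.column_name,
       cte.non_missing_entries,
       (key_only.total_entries - cte.non_missing_entries) * 100.0
         / key_only.total_entries as percentage_missing
from cte
cross join key_only;"""


# Usage: the same eight columns as in the query above.
print(build_missing_summary_sql(
    "bigquery-public-data.noaa_significant_earthquakes.earthquakes",
    ["id", "flag_tsunami", "year", "month", "day", "hour", "minute", "second"],
))
```

The printed text can then be pasted into the BigQuery console; the script itself never contacts BigQuery.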
