The BigQuery earthquake public dataset has 47 columns, most of which have missing values. I need a summary output with column_name, total_entries, non_missing_entries, and percentage_missing as the columns of the table.
Currently I'm using the query shown here, repeating the block for all 47 columns:
SELECT
'id' AS column_name,
COUNT(id) AS non_missing_entries,
(COUNT(*) - COUNT(id)) * 100.0 / COUNT(*) AS percentage_missing
FROM
`youtube-factcheck.earthquake_analysis.earthquakes_copy`
UNION ALL
SELECT
'flag_tsunami' AS column_name,
COUNT(flag_tsunami) AS non_missing_entries,
(COUNT(*) - COUNT(flag_tsunami)) * 100.0 / COUNT(*) AS percentage_missing
FROM
`youtube-factcheck.earthquake_analysis.earthquakes_copy`
UNION ALL
-- Repeat the above block for other columns
-- ...
Output:
| column_name|non_missing_entries | percentage_missing|
| -----------| -------------------| ------------------|
|flag_tsunami| 1869| 70.20564323290291|
| id| 6273| 0|
| ...| ...| ...|
Is there a way in SQL to avoid the tedious work of writing 47 near-identical queries?
UNPIVOT is your friend. (Note that I had to change the source to bigquery-public-data.noaa_significant_earthquakes.earthquakes, because I don't have access to your table.)
with cte as (
  -- UNPIVOT drops NULL values by default, so count(*) per column_name
  -- is the number of non-missing entries for that column
  select column_name,
         count(*) as non_missing_entries
  from (
    select *
    from (
      -- the unpivoted columns must share one type, hence the casts to string
      select cast(id as string) as id,
             flag_tsunami,
             cast(year as string) as year,
             cast(month as string) as month,
             cast(day as string) as day,
             cast(hour as string) as hour,
             cast(minute as string) as minute,
             cast(second as string) as second
      from `bigquery-public-data.noaa_significant_earthquakes.earthquakes`)
    unpivot (value for column_name in (id, flag_tsunami, year, month, day, hour, minute, second))
  )
  group by column_name
),
id_only as (
  -- id has no missing values (0% in the question's output),
  -- so its count stands in for the total row count
  select column_name, non_missing_entries
  from cte
  where column_name = 'id'
)
select cte.column_name,
       cte.non_missing_entries,
       (id_only.non_missing_entries - cte.non_missing_entries) * 100.0 / id_only.non_missing_entries as percentage_missing
from cte
cross join id_only;
It returns one row per unpivoted column, with column_name, non_missing_entries, and percentage_missing.
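To see why count(*) after UNPIVOT gives the non-missing count, here is a minimal sketch on a made-up three-row table (the toy CTE and its values are purely illustrative, not taken from the earthquake data). It relies on BigQuery's UNPIVOT excluding NULLs by default.

```sql
-- Hypothetical toy data: id is always present, flag_tsunami is NULL in two rows.
with toy as (
  select '1' as id, 'Tsu' as flag_tsunami union all
  select '2', null union all
  select '3', null
)
select column_name,
       count(*) as non_missing_entries  -- NULL values never become rows, so they are not counted
from toy
unpivot (value for column_name in (id, flag_tsunami))
group by column_name;
```

With this toy input, id should come back with 3 entries and flag_tsunami with 1. Writing unpivot include nulls (...) instead would keep the NULL rows, in which case count(*) would be the total row count for every column.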
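You still have to list every column (and cast each one to a common type, as above), but I think this is better than UNIONing 47 times.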
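If typing out the 47 column names is itself the tedious part, the two lists can probably be generated from the dataset's INFORMATION_SCHEMA.COLUMNS view and pasted into the query. This is only a sketch, assuming you are allowed to read that dataset's metadata; cast_list and unpivot_list are illustrative names, and you would swap in your own project, dataset, and table.

```sql
-- Builds the text of the select list and the UNPIVOT column list for one table.
-- Backticks are added around each name in case a column name needs quoting.
select
  string_agg(
    format('cast(`%s` as string) as `%s`', column_name, column_name),
    ', ' order by ordinal_position
  ) as cast_list,
  string_agg(
    format('`%s`', column_name),
    ', ' order by ordinal_position
  ) as unpivot_list
from `bigquery-public-data`.noaa_significant_earthquakes.INFORMATION_SCHEMA.COLUMNS
where table_name = 'earthquakes';
```

The first string goes in the innermost select, the second inside unpivot (value for column_name in (...)). Casting a column that is already a string is harmless, so every column can be treated the same way.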