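I am working with Netezza SQL.
In a previous question (Replacing CTEs with individual queries), I learned the basics of "gaps and islands" problems, where the goal is to "fill in" the missing records for each name.
Suppose there is a table containing the names of different people in different years (some years are missing). Let's assume each person has a favorite color, a favorite food, and a favorite sport, and that these do not change over the years. Each person's age, however, changes every year.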
CREATE TABLE sample_table
(
name VARCHAR(50),
age INTEGER,
year INTEGER,
color VARCHAR(50),
food VARCHAR(50),
sport VARCHAR(50)
);
INSERT INTO sample_table (name, age, year, color, food, sport)
VALUES ('aaa', 41, 2010, 'Red', 'Pizza', 'hockey');
INSERT INTO sample_table (name, age, year, color, food, sport)
VALUES ('aaa', 42, 2012, 'Red', 'Pizza', 'hockey');
INSERT INTO sample_table (name, age, year, color, food, sport)
VALUES ('aaa', 47, 2017, 'Red', 'Pizza', 'hockey');
INSERT INTO sample_table (name, age, year, color, food, sport)
VALUES ('bbb', 20, 2000, 'Blue', 'Burgers', 'football');
INSERT INTO sample_table (name, age, year, color, food, sport)
VALUES ('bbb', 26, 2006, 'Blue', 'Burgers', 'football');
INSERT INTO sample_table (name, age, year, color, food, sport)
VALUES ('bbb', 30, 2010, 'Blue', 'Burgers', 'football');
+------+-----+------+-------+---------+----------+
| name | age | year | color | food | sport |
+------+-----+------+-------+---------+----------+
| aaa | 41 | 2010 | Red | Pizza | hockey |
| aaa | 42 | 2012 | Red | Pizza | hockey |
| aaa | 47 | 2017 | Red | Pizza | hockey |
| bbb | 20 | 2000 | Blue | Burgers | football |
| bbb | 26 | 2006 | Blue | Burgers | football |
| bbb | 30 | 2010 | Blue | Burgers | football |
+------+-----+------+-------+---------+----------+
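In this question, I am interested in filling in the missing information (e.g. age, sport, food, color) only between each person's minimum and maximum year. Specifically, I would like to learn how to solve this without CTEs, using "standard queries".
Here is what I have tried so far:
-- https://stackoverflow.com/questions/75677585/replacing-ctes-with-individual-queries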
create table years_table (year integer);
insert into years_table(year)
values(2010);
insert into years_table(year)
values(2011);
insert into years_table(year)
values(2012);
insert into years_table(year)
values(2013);
insert into years_table(year)
values(2014);
insert into years_table(year)
values(2015);
insert into years_table(year)
values(2016);
insert into years_table(year)
values(2017);
insert into years_table(year)
values(2018);
insert into years_table(year)
values(2019);
insert into years_table(year)
values(2020);
select name, year, max(color) over(partition by name, grp order by year) color, max(sport) over(partition by name, grp order by year) sport, max(food) over(partition by name, grp order by year) food
from (
select n.name, y.year, t.color, t.food, t.sport,
sum(case when t.name is null then 0 else 1 end) over(partition by n.name order by y.year) grp
from (
select name, min(year) min_year, max(year) max_year
from sample_table
group by name
) n
inner join years_table y on y.year between n.min_year and n.max_year
left join sample_table t on t.name = n.name and t.year = y.year
) t
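But I am not sure how to adapt this SQL code so that each person's "age" changes over time.
Can someone show me how to do this? Ideally, I would like to learn how to do it without recursive CTEs, since Netezza does not support them.
Thanks!
Note: the final result should look like this: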
+------+-----+------+-------+---------+----------+
| name | age | year | color | food | sport |
+------+-----+------+-------+---------+----------+
| aaa | 41 | 2010 | Red | Pizza | hockey |
| aaa | 42 | 2011 | Red | Pizza | hockey |
| aaa | 42 | 2012 | Red | Pizza | hockey |
| aaa | 43 | 2013 | Red | Pizza | hockey |
| aaa | 44 | 2014 | Red | Pizza | hockey |
| aaa | 45 | 2015 | Red | Pizza | hockey |
| aaa | 46 | 2016 | Red | Pizza | hockey |
| aaa | 47 | 2017 | Red | Pizza | hockey |
| bbb | 20 | 2000 | Blue | Burgers | football |
| bbb | 21 | 2001 | Blue | Burgers | football |
| bbb | 22 | 2002 | Blue | Burgers | football |
| bbb | 23 | 2003 | Blue | Burgers | football |
| bbb | 24 | 2004 | Blue | Burgers | football |
| bbb | 25 | 2005 | Blue | Burgers | football |
| bbb | 26 | 2006 | Blue | Burgers | football |
| bbb | 27 | 2007 | Blue | Burgers | football |
| bbb | 28 | 2008 | Blue | Burgers | football |
| bbb | 29 | 2009 | Blue | Burgers | football |
| bbb | 30 | 2010 | Blue | Burgers | football |
+------+-----+------+-------+---------+----------+
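The way I would do this is to calculate the birth year from each source row, propagate it to the missing rows, and then recalculate the age using
Age = Year - BirthYear
where BirthYear = source.year - source.age.
That way you don't need to account for inconsistencies or changes in the source data (the result reflects the source, consistent or not).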
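As a quick check of that arithmetic, here is a minimal Python sketch. The function name `age_in_year` is made up for illustration; it only restates the two formulas above for a single source row.

```python
def age_in_year(source_year, source_age, target_year):
    """Age in target_year, derived from one known (year, age) source row."""
    birth_year = source_year - source_age  # BirthYear = source.year - source.age
    return target_year - birth_year        # Age = Year - BirthYear


# Source row ('aaa', 41, 2010) propagated to the missing year 2011
print(age_in_year(2010, 41, 2011))  # 42
# Source row ('bbb', 26, 2006) propagated to the missing year 2009
print(age_in_year(2006, 26, 2009))  # 29
```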
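I also don't think you need a gaps-and-islands approach; you can just use
CROSS APPLY
to find the most recent row on or before the current year, and propagate the values from there...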
WITH
person
AS
(
SELECT
name,
MIN(year) AS min_year,
MAX(year) AS max_year
FROM
sample_table
GROUP BY
name
)
SELECT
p.name,
y.year,
s.color,
s.food,
s.sport,
y.year - (s.year - s.age) AS age
FROM
person AS p
INNER JOIN
years_table AS y
ON y.year BETWEEN p.min_year AND p.max_year
CROSS APPLY
(
SELECT *
FROM sample_table
WHERE name = p.name
AND year <= y.year
ORDER BY year DESC
LIMIT 1
)
AS s
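If you really need to avoid CTEs (sorry, I only just read that part), you can simply move the CTE's definition into the main query as a subquery; it behaves the same.
Demo: https://dbfiddle.uk/G9AMFDdE
(Using TOP 1 instead of LIMIT 1, so that I can abuse SQL Server as a stand-in for Netezza SQL.)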
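For reference, the CTE-free form might look roughly like the sketch below. Several things here are assumptions rather than part of the answer above: it runs in SQLite through Python's `sqlite3` module (I have no Netezza instance to test on), it swaps CROSS APPLY for a correlated subquery in the join condition because SQLite has no CROSS APPLY, and it fills `years_table` with 2000 to 2020 so that 'bbb' is covered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample_table (name TEXT, age INTEGER, year INTEGER,
                           color TEXT, food TEXT, sport TEXT);
INSERT INTO sample_table VALUES
    ('aaa', 41, 2010, 'Red',  'Pizza',   'hockey'),
    ('aaa', 42, 2012, 'Red',  'Pizza',   'hockey'),
    ('aaa', 47, 2017, 'Red',  'Pizza',   'hockey'),
    ('bbb', 20, 2000, 'Blue', 'Burgers', 'football'),
    ('bbb', 26, 2006, 'Blue', 'Burgers', 'football'),
    ('bbb', 30, 2010, 'Blue', 'Burgers', 'football');
CREATE TABLE years_table (year INTEGER);
""")
# Assumption: the calendar table spans 2000-2020, not 2010-2020.
conn.executemany("INSERT INTO years_table (year) VALUES (?)",
                 [(y,) for y in range(2000, 2021)])

QUERY = """
SELECT p.name, y.year - (s.year - s.age) AS age, y.year,
       s.color, s.food, s.sport
FROM (
    -- the former CTE, moved inline as a derived table
    SELECT name, MIN(year) AS min_year, MAX(year) AS max_year
    FROM sample_table
    GROUP BY name
) AS p
INNER JOIN years_table AS y
    ON y.year BETWEEN p.min_year AND p.max_year
INNER JOIN sample_table AS s
    ON  s.name = p.name
    -- stand-in for CROSS APPLY: latest source row on or before y.year
    AND s.year = (SELECT MAX(s2.year)
                  FROM sample_table AS s2
                  WHERE s2.name = p.name AND s2.year <= y.year)
ORDER BY p.name, y.year
"""
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The correlated `MAX(year)` lookup relies on (name, year) being unique in `sample_table`, which holds for the sample data but is not guaranteed in general.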
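First, I would call this (fairly common) problem a gap-filling task. In my view, gaps and islands is a different animal.
That said, Netezza also supports the
LAST_VALUE(col IGNORE NULLS) OVER (PARTITION BY ... ORDER BY ...)
OLAP function. Using it, you get a query that is more readable and more efficient than a lot of complicated self-joins...
-- complete with the input data, in the shape of Common Table Expressions in a WITH clause ...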
WITH
sample_table("name",age,"year",colour,food,sport) AS (
SELECT 'aaa',41,2010,'Red','Pizza','hockey'
UNION ALL SELECT 'aaa',42,2012,'Red','Pizza','hockey'
UNION ALL SELECT 'aaa',47,2017,'Red','Pizza','hockey'
UNION ALL SELECT 'bbb',20,2000,'Blue','Burgers','football'
UNION ALL SELECT 'bbb',26,2006,'Blue','Burgers','football'
UNION ALL SELECT 'bbb',30,2010,'Blue','Burgers','football'
)
,
years_table("year") AS (
SELECT 2010 UNION ALL SELECT 2011
UNION ALL SELECT 2012 UNION ALL SELECT 2013
UNION ALL SELECT 2014 UNION ALL SELECT 2015
UNION ALL SELECT 2016 UNION ALL SELECT 2017
UNION ALL SELECT 2018 UNION ALL SELECT 2019
UNION ALL SELECT 2020
)
SELECT
LAST_VALUE("name" IGNORE NULLS) OVER(order by y."year") AS "name"
, LAST_VALUE(y."year" IGNORE NULLS) OVER(order by y."year")
- LAST_VALUE(s."year" IGNORE NULLS) OVER(order by y."year")
+ LAST_VALUE(age IGNORE NULLS) OVER(order by y."year")
AS age
, y."year"
, LAST_VALUE(colour IGNORE NULLS) OVER(order by y."year") AS colour
, LAST_VALUE(food IGNORE NULLS) OVER(order by y."year") AS food
, LAST_VALUE(sport IGNORE NULLS) OVER(order by y."year") AS sport
FROM years_table y
LEFT JOIN sample_table s
ON y."year" >= s."year"
AND y."year" <= s."year"
ORDER BY name, "year"
;
Result:
name | age | year | colour | food | sport |
---|---|---|---|---|---|
aaa | 42 | 2,012 | Red | Pizza | hockey |
aaa | 43 | 2,013 | Red | Pizza | hockey |
aaa | 44 | 2,014 | Red | Pizza | hockey |
aaa | 45 | 2,015 | Red | Pizza | hockey |
aaa | 46 | 2,016 | Red | Pizza | hockey |
aaa | 47 | 2,017 | Red | Pizza | hockey |
aaa | 48 | 2,018 | Red | Pizza | hockey |
aaa | 49 | 2,019 | Red | Pizza | hockey |
aaa | 50 | 2,020 | Red | Pizza | hockey |
bb | 30 | 2,010 | Blue | Burgers | football |
bb | 30 | 2,010 | Blue | Burgers | football |
bb | 31 | 2,011 | Blue | Burgers | football |
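To make the carry-forward behaviour concrete, here is a small Python sketch of what LAST_VALUE(col IGNORE NULLS) computes as a running value within one partition. The helper name and the hard-coded lists are illustrative only; they model the 'aaa' rows for 2010 to 2013 after a left join onto the year list, assuming the window is partitioned by name.

```python
def last_value_ignore_nulls(values):
    """Running last non-None value, for rows already sorted by the ORDER BY key."""
    filled, last_seen = [], None
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


# One partition (name = 'aaa'): only 2010 and 2012 matched a source row,
# so the other years carry NULL (None) in every source column.
years = [2010, 2011, 2012, 2013]
src_years = [2010, None, 2012, None]
src_ages = [41, None, 42, None]
colours = ["Red", None, "Red", None]

filled_years = last_value_ignore_nulls(src_years)
filled_ages = last_value_ignore_nulls(src_ages)
# Same arithmetic as the query: year - last source year + last source age
ages = [y - sy + a for y, sy, a in zip(years, filled_years, filled_ages)]

print(last_value_ignore_nulls(colours))  # ['Red', 'Red', 'Red', 'Red']
print(ages)                              # [41, 42, 42, 43]
```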
I don't know Netezza well, but if it supports window functions, you can use
LEAD
to get the year of the next record in the table.
SELECT sample_table.*,
COALESCE(LEAD(year) OVER (partition by name ORDER BY year) -1, year) AS lastyeartogenerate
FROM sample_table
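Personally, I would stop the SQL work here and let whatever application is responsible for producing the final table take over from there with a nested loop. In pseudocode: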
for each record in recordset, do {
for y from year to lastyeartogenerate do {
[...]
}
}
This limits both the complexity of the query and the amount of redundant data sent over the network.
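A runnable version of that loop might look like the Python sketch below. It assumes the recordset arrives as dictionaries keyed by column name, including the lastyeartogenerate column from the LEAD query; the hard-coded rows and the function name `expand` are placeholders for a real database cursor.

```python
def expand(recordset):
    """Yield one output row per year, from `year` through `lastyeartogenerate`."""
    for record in recordset:
        for y in range(record["year"], record["lastyeartogenerate"] + 1):
            # Copy the record, overriding year and shifting age by the offset
            row = dict(record, year=y, age=record["age"] + y - record["year"])
            del row["lastyeartogenerate"]  # helper column, not part of the output
            yield row


# Hypothetical recordset: the 'aaa' rows as the LEAD query would return them
recordset = [
    {"name": "aaa", "age": 41, "year": 2010, "color": "Red", "food": "Pizza",
     "sport": "hockey", "lastyeartogenerate": 2011},
    {"name": "aaa", "age": 42, "year": 2012, "color": "Red", "food": "Pizza",
     "sport": "hockey", "lastyeartogenerate": 2016},
    {"name": "aaa", "age": 47, "year": 2017, "color": "Red", "food": "Pizza",
     "sport": "hockey", "lastyeartogenerate": 2017},
]

result = list(expand(recordset))
for row in result:
    print(row["name"], row["age"], row["year"])
```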
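If you still want the expected table you mentioned to come straight from the database, simply join to
years_table
, although it will need to start from the year 2000.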
SELECT name,
age - t.year + years_table.year AS age,
years_table.year,
color,
food,
sport
FROM (
SELECT sample_table.*,
COALESCE(LEAD(year) OVER (partition by name ORDER BY year) -1, year) AS lastyeartogenerate
FROM sample_table
) T
JOIN years_table ON years_table.year BETWEEN T.year AND t.lastyeartogenerate
ORDER BY name, year
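Better yet, replace your
years_table
with what is described here (I must admit this is slightly beyond my knowledge of that DBMS, and I have no access to it for testing):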
SELECT name,
age + idx AS age,
year + idx AS year,
color,
food,
sport
FROM (
SELECT sample_table.*,
COALESCE(LEAD(year) OVER (partition by name ORDER BY year) -1, year) - year AS indicestogenerate
FROM sample_table
) T
JOIN _v_vector_idx ON idx <= indicesToGenerate
ORDER BY name, year + idx
I hope it works but, once again, I have no access to the database to test it.
Here is a version with no CTEs and no window functions:
select
AllCombos.name
, AllCombos.year
, coalesce(st.age,stprev.age+(AllCombos.year-stprev.year)) age
, coalesce(st.food,stprev.food) food
, coalesce(st.color,stprev.color) color
, coalesce(st.sport,stprev.sport) sport
from (
select name, year
from (
select name, min(year) min_year, max(year) max_year
from #sample_table
group by name
) n
inner join #years_table y on y.year between n.min_year and n.max_year
) AllCombos
left join
#sample_table st
on st.name=AllCombos.name
and st.year=AllCombos.year
left join
#sample_table stprev
on stprev.name=AllCombos.name
and stprev.year<AllCombos.year
and not exists (select 1
from #sample_table morerecent
where morerecent.name=stprev.name
and morerecent.year<AllCombos.year
and morerecent.year>stprev.Year)
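It finds all the name-year combinations that need to be listed, then tries (via a left join) to find an exact match; if none is found, it falls back to the most recent previous row. It also adjusts the age by the number of years elapsed (though this is not exact: records are per year, and a person is two different ages over the course of any given year).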
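One way to check this query against the expected table is the sketch below. It is run on SQLite through Python's `sqlite3` module, not on Netezza, so treat it as an approximation. Assumptions on my side: the `#` temp-table prefixes are dropped, `years_table` is filled with 2000 to 2020, and an ORDER BY is added to make the output order stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample_table (name TEXT, age INTEGER, year INTEGER,
                           color TEXT, food TEXT, sport TEXT);
INSERT INTO sample_table VALUES
    ('aaa', 41, 2010, 'Red',  'Pizza',   'hockey'),
    ('aaa', 42, 2012, 'Red',  'Pizza',   'hockey'),
    ('aaa', 47, 2017, 'Red',  'Pizza',   'hockey'),
    ('bbb', 20, 2000, 'Blue', 'Burgers', 'football'),
    ('bbb', 26, 2006, 'Blue', 'Burgers', 'football'),
    ('bbb', 30, 2010, 'Blue', 'Burgers', 'football');
CREATE TABLE years_table (year INTEGER);
""")
conn.executemany("INSERT INTO years_table (year) VALUES (?)",
                 [(y,) for y in range(2000, 2021)])

QUERY = """
select AllCombos.name, AllCombos.year,
       coalesce(st.age, stprev.age + (AllCombos.year - stprev.year)) age,
       coalesce(st.food, stprev.food) food,
       coalesce(st.color, stprev.color) color,
       coalesce(st.sport, stprev.sport) sport
from (
    select name, year
    from (select name, min(year) min_year, max(year) max_year
          from sample_table group by name) n
    inner join years_table y on y.year between n.min_year and n.max_year
) AllCombos
left join sample_table st
    on st.name = AllCombos.name and st.year = AllCombos.year
left join sample_table stprev
    on stprev.name = AllCombos.name
   and stprev.year < AllCombos.year
   and not exists (select 1 from sample_table morerecent
                   where morerecent.name = stprev.name
                     and morerecent.year < AllCombos.year
                     and morerecent.year > stprev.year)
order by AllCombos.name, AllCombos.year
"""
# Map (name, year) -> age so individual filled-in years are easy to inspect
ages = {(name, year): age for name, year, age, *_ in conn.execute(QUERY)}
print(ages[("aaa", 2011)], ages[("bbb", 2009)])
```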