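This query feeds a Tableau dashboard. Users want to see the active clients in a given industry, the contacts at those companies, and how the makeup of that data changes from quarter to quarter. I created a baseline time subquery and am essentially layering the fact data onto it.
I'm using LATERAL JOINs with the timestamps of historical changes to place the data in the appropriate quarters. I tested the query on a subset of the data and it ran fine. When opened up to the full data set, the query cost became enormous. I've made a few optimizations, but nothing seems to bring the cost down to an acceptable level.
All tables are adequately indexed. The last WITH subquery, HACPCQ, is the problem: that is where I use LATERAL JOINs to layer in the contact data. The two joined tables have 500K and 3.5M rows respectively. I can't get it to complete.
The database engineers won't create views or stored procedures because of the potential cost to the production server. There is currently no real replica database available.
SQL and query plan below. The only posts I've seen on performance improvements are for earlier Postgres versions. How should I rewrite these LATERAL JOINs?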
-- Q creates a baseline of quarters to analyze the client and contact tables over time. Last full year plus current year.
WITH Q AS (
SELECT Q_Start, ((Q_Start + INTERVAL '3 months') - INTERVAL '1 day') :: DATE Q_End, RANK() OVER (ORDER BY Q_Start DESC) Q_Rank
FROM (
SELECT GENERATE_SERIES( DATE_TRUNC('year', CURRENT_DATE) - INTERVAL '1 year', DATE_TRUNC('quarter', CURRENT_DATE), INTERVAL '3 months') :: DATE Q_Start
) q1
),
-- HA gathers the historic client records within the timeframe of Q and filters to the last change per quarter.
HA AS (
SELECT id CL_ID, history_id CL_Hist_ID, history_date CL_Hist_Date FROM (
SELECT ha.id, history_id, history_date, ROW_NUMBER() OVER (PARTITION BY ha.id, EXTRACT(year FROM history_date),
EXTRACT(quarter FROM history_date) ORDER BY history_date DESC) AS Row_Rank
FROM historicalclient ha
WHERE EXTRACT(year FROM history_date) >= (EXTRACT(year FROM CURRENT_DATE) - 1)
) ha1
WHERE Row_Rank = 1
),
-- Tag clients to relevant quarters
HAQ AS (
SELECT *
FROM Q
JOIN LATERAL (SELECT CL_ID, MAX(CL_Hist_ID) CL_Hist_ID, MAX(CL_Hist_Date) CL_Hist_Date FROM HA WHERE CL_Hist_Date :: DATE <= Q.Q_End GROUP BY CL_ID) HA
ON TRUE
),
-- Same deal as HA but for contactposition.
HCP AS (
SELECT CP_ID, client_id CP_CL_ID, Cont_ID CP_Cont_ID, history_id CP_Hist_ID, history_date CP_Hist_Date, current FROM (
SELECT hcp.id CP_ID, client_id, Cont_ID, history_id, history_date, current,
ROW_NUMBER() OVER (PARTITION BY hcp.id, EXTRACT(year FROM history_date), EXTRACT(quarter FROM history_date) ORDER BY history_date DESC) AS Row_Rank
FROM contactposition hcp
WHERE history_date >= '10/1/2023'
) hcp1
WHERE Row_Rank = 1
AND current = TRUE
),
-- HC is HCP but for contacts
HC AS (
SELECT id Cont_ID, history_id Con_Hist_ID, history_date Con_Hist_Date FROM (
SELECT hc.id, hc.history_id, hc.history_date, ROW_NUMBER() OVER (PARTITION BY hc.id, EXTRACT(year FROM hc.history_date), EXTRACT(quarter FROM hc.history_date) ORDER BY hc.history_date DESC) AS Row_Rank
FROM historicalcontact hc
LEFT JOIN contactposition cp
ON hc.id = cp.Cont_ID
WHERE EXTRACT(year FROM hc.history_date) >= (EXTRACT(year FROM CURRENT_DATE) - 1)
) hc1
WHERE Row_Rank = 1
),
-- Joining all of the base data sources together
HACPCQ AS (
SELECT DISTINCT *
FROM HAQ
LEFT JOIN LATERAL (SELECT CP_CL_ID, CP_ID, CP_Cont_ID, MAX(CP_Hist_ID) CP_Hist_ID, MAX(CP_Hist_Date) CP_Hist_Date FROM HCP WHERE CP_CL_ID = HAQ.CL_ID AND CP_Hist_Date :: DATE <= HAQ.Q_End GROUP BY CP_CL_ID, CP_ID, CP_Cont_ID) HCP
ON TRUE
LEFT JOIN LATERAL (SELECT Cont_ID, MAX(Con_Hist_ID) Con_Hist_ID, MAX(Con_Hist_Date) Con_Hist_Date FROM HC WHERE Cont_ID = HCP.CP_Cont_ID AND Con_Hist_Date :: DATE <= HAQ.Q_End GROUP BY Cont_ID) HC
ON TRUE
)
SELECT *
FROM HACPCQ a
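Query plan
Edit: swapped the EXPLAIN query plan for the full EXPLAIN ANALYZE query plan, filtered to one client. To free up characters in my post, the plan is at the link below.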
https://explain.depesz.com/s/Cor5
Edit 2: Trimmed the query down to make it more readable. Still filtered to one client. The new EXPLAIN ANALYZE query plan is below.
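1. An interval constant can hold an expression. You can fold interval '1 month -1 day' into a single constant instead of adding interval '1 month' and then subtracting interval '1 day' separately.
2. You can select directly from set-returning functions (such as generate_series()) without wrapping them in a subquery.
3. Use row_number() rather than rank() to number the rows.
4. If you add WITH ORDINALITY AS g(val,row_num) to the set-returning function call, the values come back already numbered. That way you don't need to collect, sort and number them separately, which removes the need for rank() or row_number() altogether.
5. When you add an untyped literal (even a negative one) to a timestamp(tz), it is assumed to be an interval, which saves the explicit cast.
All five points above apply to the first CTE, Q; the rewritten query below shows them together.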
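The five points can also be tried on their own. The following is a minimal sketch, assuming a fixed date range (the 2023-01-01 and 2024-04-01 bounds are made up for reproducibility; the real query derives them from CURRENT_DATE):

```sql
-- Standalone version of Q with hypothetical fixed bounds.
SELECT q_start::date                       AS q_start,
       (q_start + '3 months -1 day')::date AS q_end,  -- untyped literal is read as an interval
       q_rank                                         -- numbering supplied by WITH ORDINALITY
FROM generate_series(timestamp '2023-01-01',
                     timestamp '2024-04-01',
                     interval '3 months')
     WITH ORDINALITY AS g(q_start, q_rank)
ORDER BY q_rank;
-- first row: 2023-01-01 | 2023-03-31 | 1
```

One thing to keep in mind: WITH ORDINALITY numbers the rows in the order the function emits them, so the oldest quarter gets 1, while the original RANK() OVER (ORDER BY Q_Start DESC) gave 1 to the newest quarter. If the direction matters downstream, swapping the start and stop values and using interval '-3 months' as the step should give the descending numbering.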
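6. Use DISTINCT ON. It is a simple construct that fetches the top/sample record per group without aggregating or numbering and then filtering out everything except Row_Rank = 1, a pattern that keeps recurring in this code.
7. In HAQ, the two max() calls can return values from different rows. Your data/structure may dictate that they always come from the same row, but nothing in the query enforces it. DISTINCT ON does enforce it, and typically performs and optimizes better as well.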
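To make the difference between two max() calls and DISTINCT ON concrete, here is a small sketch on made-up data. The table ha_demo and its rows are hypothetical; they are chosen so that the highest history id does not sit on the latest history date:

```sql
-- Hypothetical toy data, not the real historicalclient table.
CREATE TEMP TABLE ha_demo (cl_id int, cl_hist_id int, cl_hist_date date);
INSERT INTO ha_demo VALUES
  (1, 10, DATE '2024-03-01'),
  (1, 11, DATE '2024-01-15');   -- higher id, earlier date

-- Two independent aggregates: each max() picks its value from its own row.
SELECT cl_id,
       max(cl_hist_id)   AS cl_hist_id,
       max(cl_hist_date) AS cl_hist_date
FROM ha_demo
GROUP BY cl_id;
-- yields (1, 11, 2024-03-01), a combination that exists in no single row

-- DISTINCT ON keeps one whole row per cl_id: the first in ORDER BY order.
SELECT DISTINCT ON (cl_id)
       cl_id, cl_hist_id, cl_hist_date
FROM ha_demo
ORDER BY cl_id, cl_hist_date DESC;
-- yields (1, 10, 2024-03-01)
```

Which column decides "latest" (the id or the date) is your call; the point is only that DISTINCT ON makes that choice explicit in the ORDER BY and returns a row that really exists.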
8. CROSS JOIN is equivalent to JOIN ... ON TRUE, and to a plain comma (,) between the tables.
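The equivalence of the three spellings can be checked with a throwaway query. This is only a sketch; the inline tables a and b are invented for the illustration:

```sql
-- Hypothetical two-row inputs; every form produces the same four combinations.
WITH a(x) AS (VALUES (1), (2)),
     b(y) AS (VALUES ('p'), ('q'))
SELECT 'CROSS JOIN' AS form, count(*) AS n FROM a CROSS JOIN b
UNION ALL
SELECT 'JOIN ON TRUE', count(*) FROM a JOIN b ON TRUE
UNION ALL
SELECT 'comma', count(*) FROM a, b;
-- each row reports n = 4
```

The one practical difference is precedence: JOIN binds more tightly than the comma, so when both appear in the same FROM clause an ON condition may not be able to see a table that was listed before a comma.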
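9. The JOINs don't need to be LATERAL, and you don't need those subqueries at all: they redo what you have already done to the data, and the columns to match on are already there.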
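As a rough illustration of that last point, the sketch below runs a LATERAL subquery and a plain join with DISTINCT ON over the same toy data. The tables haq_demo and hcp_demo and their rows are hypothetical stand-ins for the HAQ and HCP CTEs, and in this data the highest history id also has the latest date, so both forms agree:

```sql
-- Hypothetical stand-ins for HAQ (client per quarter) and HCP (contact positions).
CREATE TEMP TABLE haq_demo (cl_id int, q_end date);
CREATE TEMP TABLE hcp_demo (cp_id int, cp_cl_id int, cp_hist_id int, cp_hist_date date);
INSERT INTO haq_demo VALUES (1, DATE '2024-03-31'), (1, DATE '2024-06-30');
INSERT INTO hcp_demo VALUES
  (100, 1, 1, DATE '2024-02-10'),
  (100, 1, 2, DATE '2024-05-20');

-- LATERAL form: the subquery is evaluated once per haq_demo row.
SELECT h.cl_id, h.q_end, l.cp_id, l.cp_hist_id
FROM haq_demo h
LEFT JOIN LATERAL (
    SELECT cp_id, max(cp_hist_id) AS cp_hist_id
    FROM hcp_demo
    WHERE cp_cl_id = h.cl_id
      AND cp_hist_date <= h.q_end
    GROUP BY cp_id
) l ON TRUE
ORDER BY h.q_end;

-- Plain join form: one join, then DISTINCT ON keeps the latest row per group.
SELECT DISTINCT ON (h.cl_id, h.q_end, c.cp_id)
       h.cl_id, h.q_end, c.cp_id, c.cp_hist_id
FROM haq_demo h
LEFT JOIN hcp_demo c
       ON c.cp_cl_id = h.cl_id
      AND c.cp_hist_date <= h.q_end
ORDER BY h.cl_id, h.q_end, c.cp_id, c.cp_hist_date DESC;
-- both return (1, 2024-03-31, 100, 1) and (1, 2024-06-30, 100, 2)
```

Whether the plain join actually costs less on the full tables is something only EXPLAIN ANALYZE on your data can confirm.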
-- Q creates a baseline of quarters to analyze the client and contact tables over time.
-- Last full year plus current year.
WITH Q AS (
SELECT Q_Start::DATE,
(Q_Start + '3 months -1 day')::DATE Q_End,
Q_Rank
FROM GENERATE_SERIES(DATE_TRUNC('year', CURRENT_DATE) - INTERVAL '1 year',
DATE_TRUNC('quarter',CURRENT_DATE),
INTERVAL '3 months') WITH ORDINALITY g(Q_Start, Q_Rank)
),
HA AS (-- HA gathers the historic client records within the timeframe of Q
-- and filters to the last change per quarter.
SELECT DISTINCT ON (ha.id,
EXTRACT(year FROM history_date),
EXTRACT(quarter FROM history_date) )
ha.id AS CL_ID,
history_id AS CL_Hist_ID,
history_date AS CL_Hist_Date
FROM historicalclient ha
WHERE EXTRACT(year FROM history_date) >= (EXTRACT(year FROM CURRENT_DATE) - 1)
ORDER BY ha.id,
EXTRACT(year FROM history_date),
EXTRACT(quarter FROM history_date),
history_date DESC
),
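HAQ AS (-- Tag clients to relevant quarters
-- One row per quarter and client; the Q columns are kept because the final SELECT filters on Q_End.
SELECT DISTINCT ON (Q.Q_Start, CL_ID)
Q.Q_Start,
Q.Q_End,
Q.Q_Rank,
CL_ID,
CL_Hist_ID,
CL_Hist_Date
FROM Q, HA
WHERE CL_Hist_Date::DATE <= Q.Q_End
ORDER BY Q.Q_Start,
CL_ID,
CL_Hist_ID DESC,
CL_Hist_Date DESC
),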
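HCP AS (-- Same deal as HA but for contactposition.
-- Note: current is filtered before the latest row is picked; the original filtered afterwards.
SELECT DISTINCT ON (hcp.id,
EXTRACT(year FROM history_date),
EXTRACT(quarter FROM history_date))
hcp.id AS CP_ID,
client_id AS CP_CL_ID,
Cont_ID AS CP_Cont_ID,
history_id AS CP_Hist_ID,
history_date AS CP_Hist_Date,
current
FROM contactposition hcp
WHERE history_date >= '10/1/2023'
AND current IS TRUE
-- Without ORDER BY, DISTINCT ON keeps an arbitrary row per group.
ORDER BY hcp.id,
EXTRACT(year FROM history_date),
EXTRACT(quarter FROM history_date),
history_date DESC
),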
HC AS (-- HC is HCP but for contacts
-- Columns are qualified with hc because contactposition also has history_id and history_date.
SELECT DISTINCT ON (hc.id,
EXTRACT(year FROM hc.history_date),
EXTRACT(quarter FROM hc.history_date))
hc.id AS Cont_ID,
hc.history_id AS Con_Hist_ID,
hc.history_date AS Con_Hist_Date
FROM historicalcontact hc
LEFT JOIN contactposition cp
ON hc.id = cp.Cont_ID
WHERE EXTRACT(year FROM hc.history_date) >= (EXTRACT(year FROM CURRENT_DATE) - 1)
ORDER BY hc.id,
EXTRACT(year FROM hc.history_date),
EXTRACT(quarter FROM hc.history_date),
hc.history_date DESC
)
-- Joining all of the base data sources together
SELECT DISTINCT *
FROM HAQ, HCP, HC
WHERE HCP.CP_CL_ID = HAQ.CL_ID
AND HCP.CP_Hist_Date::DATE <= HAQ.Q_End
AND HC.Cont_ID = HCP.CP_Cont_ID
AND HC.Con_Hist_Date::DATE <= HAQ.Q_End;