Duplicate records on the first run of a dbt incremental model on BigQuery

Question

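In one of our organization's use cases we have an incremental table, which basically holds append-only records of incoming events, and a current table, which stores the latest state of the incremental records for each unique key.

This use case looks like a perfect fit for a model materialized as incremental.

The documentation states: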

...the first time a model is run, the table is built by transforming all rows of source data.

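Because the append-only incremental table contains multiple records with the same unique key, the first incremental run writes multiple records with the same unique key into the current table. As a result, subsequent batches fail with the following error: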

UPDATE/MERGE must match at most one source row for each target row

Can someone tell me how to work around this, or am I missing something?

Thanks in advance, Sona

Answer 1

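If you only need to keep the latest version of each ID, you can remove the duplicates first with a window function in a CTE.

In my case I am on BigQuery, and my data contains both exact duplicates of the same ID and older, less recently updated rows for the same ID. Here is how I handle it.

For the header (the config block), I materialize the model as incremental and choose id as the unique_key.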

{{
    config(
        materialized='incremental',
        unique_key='id'
    )
}}
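
For context, dbt's default incremental strategy on BigQuery is a merge keyed on unique_key. The statement below is only a simplified, hand-written sketch of that shape, not the SQL dbt actually generates; the tables my_dataset.current and my_dataset.new_batch and the status column are hypothetical.

```sql
-- Hand-written illustration of a merge keyed on id (hypothetical tables/columns).
merge into my_dataset.current as dest
using my_dataset.new_batch as src
on dest.id = src.id
when matched then
    -- an existing id is overwritten with the incoming row
    update set datemodified = src.datemodified, status = src.status
when not matched then
    -- an unseen id is appended
    insert (id, datemodified, status)
    values (src.id, src.datemodified, src.status)
```

If new_batch holds two rows with the same id that both match one row in current, BigQuery rejects the statement with the "UPDATE/MERGE must match at most one source row for each target row" error, which is why the model query should return at most one row per id.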

You query the relevant columns as usual. I put the is_incremental() filter inside this CTE.

with cte as ( ... {% if is_incremental() %} ... ),
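
The answer leaves the body of this CTE elided. As a rough sketch only, one common way to fill it in is shown below; the source name source('events', 'incremental'), the status column, and the assumption that datemodified is a TIMESTAMP are all hypothetical and not taken from the answer.

```sql
with cte as (
    select
        id,
        datemodified,
        status  -- hypothetical payload column
    from {{ source('events', 'incremental') }}  -- hypothetical source name
    {% if is_incremental() %}
    -- on incremental runs, only read rows newer than what the target already holds;
    -- coalesce guards against an empty target table
    where datemodified > (
        select coalesce(max(datemodified), timestamp '1900-01-01')
        from {{ this }}
    )
    {% endif %}
)

select * from cte
```

On the first run (or with --full-refresh) is_incremental() is false, so the filter is skipped and every source row is read, which is exactly when the deduplication step matters most.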

Now handle the duplicates:

no_dupe as (
    select
        *,
        -- rank the rows of each id, latest datemodified first;
        -- exact duplicates tie, so only one of them gets rank 1
        -- (a distinct here would not help, because it is applied after row_number)
        row_number() over (partition by id order by datemodified desc) as rank_uptodate
    from cte
),

no_dupe_final as (
    select * except(rank_uptodate) 
    from no_dupe
    where rank_uptodate = 1 -- this is where you only fetch the most up to date version of the same id
)

select * from no_dupe_final

This works for both full refreshes and incremental runs. Replace id and datemodified with the relevant columns in your case.
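
As an aside, BigQuery also supports a QUALIFY clause, which can express the same "keep rank 1" filter without the two helper CTEs. The following is a standalone sketch that can be pasted into the BigQuery console; the inline rows and the status column are made-up sample data, and the where true line is included only as a conservative guard in case QUALIFY needs an accompanying WHERE clause.

```sql
with cte as (
    -- made-up sample rows: id 1 has a stale row plus two exact duplicates
    select * from unnest([
        struct(1 as id, timestamp '2024-01-01' as datemodified, 'old' as status),
        struct(1, timestamp '2024-01-02', 'new'),
        struct(1, timestamp '2024-01-02', 'new'),
        struct(2, timestamp '2024-01-01', 'only')
    ])
)

select *
from cte
where true
-- keep only the most recent row per id
qualify row_number() over (partition by id order by datemodified desc) = 1
```

The result should contain one row per id, with the latest datemodified for each. In a dbt model the sample CTE would be replaced by the real one that carries the is_incremental() filter.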
