How do I delete duplicate rows in BigQuery using a DELETE statement?


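I have a BigQuery table with millions of rows. Some of those rows are duplicated (X rows have exactly the same values in every column), and I want to remove them without creating a temporary table (because of some constraints), using only a DELETE statement to remove the offending rows.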

The combination of the following columns can be used to identify duplicates:

Timestamp, column_1, column_2, column_3, column_4

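I am currently using the following query to delete the duplicate rows, but I don't think it works as expected (it may not delete all of the duplicate rows):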

DELETE FROM BIGQUERY_TABLE t
WHERE EXISTS (
  SELECT 1
  FROM (
    SELECT
      Timestamp,
      column_1,
      column_2,
      column_3,
      column_4,
      ROW_NUMBER() OVER (
        PARTITION BY Timestamp, column_1, column_2, column_3, column_4
        ORDER BY Timestamp DESC
      ) AS rn
    FROM
      BIGQUERY_TABLE
    WHERE Timestamp > 1709424000000 AND Timestamp < 1709510400000
  ) d
  WHERE
    t.Timestamp = d.Timestamp AND
    t.column_1 = d.column_1 AND
    t.column_2 = d.column_2 AND
    t.column_3 = d.column_3 AND
    t.column_4 = d.column_4 AND
    d.rn > 1
)
AND t.Timestamp > 1709424000000 AND t.Timestamp < 1709510400000;
sql database google-bigquery duplicates
1 Answer

I think your approach and query are almost correct, apart from a few mistakes.

Here is a revised version of your query that may work:

DELETE FROM BIGQUERY_TABLE
WHERE STRUCT(Timestamp, column_1, column_2, column_3, column_4) IN (
  SELECT AS STRUCT Timestamp, column_1, column_2, column_3, column_4
  FROM (
    SELECT *,
      ROW_NUMBER() OVER(
        PARTITION BY Timestamp, column_1, column_2, column_3, column_4 
        ORDER BY Timestamp
      ) as rn
    FROM BIGQUERY_TABLE
    WHERE Timestamp > 1709424000000 AND Timestamp < 1709510400000
  )
  WHERE rn > 1
) AND Timestamp > 1709424000000 AND Timestamp < 1709510400000;
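
One caveat worth checking before running this: because the duplicate rows are identical in every column, a key built from those columns cannot tell the copy you want to keep from the copies you want to drop. Matching on it selects every row in a duplicated group, so a DELETE of this shape is likely to remove all copies instead of leaving one. A read-only check such as the sketch below (same table, columns, and time window as in the question) shows which keys are affected and how many copies each has:

```sql
-- Read-only check: list each duplicated key in the time window
-- together with the number of copies it currently has.
SELECT
  Timestamp, column_1, column_2, column_3, column_4,
  COUNT(*) AS copies
FROM BIGQUERY_TABLE
WHERE Timestamp > 1709424000000 AND Timestamp < 1709510400000
GROUP BY Timestamp, column_1, column_2, column_3, column_4
HAVING COUNT(*) > 1
ORDER BY copies DESC;
```

If a single DML statement other than DELETE is acceptable, one commonly used pattern is a MERGE with ON FALSE, which deletes the rows in the window and re-inserts one copy of each in the same statement, with no temporary table. Treat this as a sketch, not a tested fix. It assumes every column type supports SELECT DISTINCT (ARRAY and STRUCT columns, for example, do not) and that the time window covers all the rows you want deduplicated:

```sql
-- Assumption: all columns of BIGQUERY_TABLE can be used with SELECT DISTINCT.
MERGE BIGQUERY_TABLE t
USING (
  -- One copy of every distinct row inside the time window.
  SELECT DISTINCT *
  FROM BIGQUERY_TABLE
  WHERE Timestamp > 1709424000000 AND Timestamp < 1709510400000
) s
ON FALSE  -- never match, so both NOT MATCHED branches fire
WHEN NOT MATCHED BY SOURCE
  AND t.Timestamp > 1709424000000 AND t.Timestamp < 1709510400000 THEN
  DELETE        -- remove every existing row in the window
WHEN NOT MATCHED BY TARGET THEN
  INSERT ROW;   -- re-insert the deduplicated rows
```

Running the check query again afterwards should return no rows if the window was deduplicated.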