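I have a BigQuery table with several million rows. Some of these rows are duplicates (X rows have exactly the same values in all columns), and I want to remove them without creating a temporary table (due to some constraints), using only a DELETE statement to delete the offending rows.
The following combination of columns can be used to identify duplicates:
Timestamp, column_1, column_2, column_3, column_4
I am currently using the following query to delete the duplicate rows, but I don't think it works as expected (it may not delete all duplicate rows):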
DELETE FROM BIGQUERY_TABLE t
WHERE EXISTS (
    SELECT 1
    FROM (
        SELECT
            Timestamp,
            column_1,
            column_2,
            column_3,
            column_4,
            ROW_NUMBER() OVER (
                PARTITION BY Timestamp, column_1, column_2, column_3, column_4
                ORDER BY Timestamp DESC
            ) AS rn
        FROM
            BIGQUERY_TABLE
        WHERE Timestamp > 1709424000000 AND Timestamp < 1709510400000
    ) d
    WHERE
        t.Timestamp = d.Timestamp AND
        t.column_1 = d.column_1 AND
        t.column_2 = d.column_2 AND
        t.column_3 = d.column_3 AND
        t.column_4 = d.column_4 AND
        d.rn > 1
)
AND t.Timestamp > 1709424000000 AND t.Timestamp < 1709510400000;
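I think your approach and query are almost correct, with just a few mistakes.
Here is a revised version of your query that may work. It replaces the correlated EXISTS with an IN test on a STRUCT of the key columns: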
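-- Delete rows in the time window whose key occurs more than once.
-- Caution: rows are matched on key values only, so every row sharing a
-- duplicated key is deleted, not just the extra copies.
DELETE FROM BIGQUERY_TABLE
WHERE STRUCT(Timestamp, column_1, column_2, column_3, column_4) IN (
    SELECT AS STRUCT Timestamp, column_1, column_2, column_3, column_4
    FROM (
        SELECT *,
            -- Number the rows within each group of identical key values.
            ROW_NUMBER() OVER (
                PARTITION BY Timestamp, column_1, column_2, column_3, column_4
                ORDER BY Timestamp
            ) AS rn
        FROM BIGQUERY_TABLE
        WHERE Timestamp > 1709424000000 AND Timestamp < 1709510400000
    )
    -- rn > 1 exists only for keys that appear at least twice.
    WHERE rn > 1
) AND Timestamp > 1709424000000 AND Timestamp < 1709510400000;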
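One caveat worth checking before running either query: since the duplicates are identical in every column, a predicate on the key columns cannot tell the copy you want to keep from the extras, and a key column containing NULL will not match under = or IN. If a single MERGE statement is acceptable under your constraints (an assumption; it is not a plain DELETE, but it does not create a temporary table either), the following is a minimal sketch of the ON FALSE pattern. It deletes every row in the window and re-inserts one distinct copy of each. It assumes all columns have types that SELECT DISTINCT supports (no STRUCT, ARRAY, GEOGRAPHY or JSON columns).

```sql
-- Sketch only: rewrites the rows in the time window, keeping one copy of each.
MERGE BIGQUERY_TABLE t
USING (
    -- One copy of every distinct row in the window.
    SELECT DISTINCT *
    FROM BIGQUERY_TABLE
    WHERE Timestamp > 1709424000000 AND Timestamp < 1709510400000
) s
ON FALSE  -- never matches, so both NOT MATCHED branches apply
WHEN NOT MATCHED BY SOURCE
    AND t.Timestamp > 1709424000000 AND t.Timestamp < 1709510400000 THEN
    DELETE          -- remove every existing row in the window
WHEN NOT MATCHED BY TARGET THEN
    INSERT ROW;     -- put back the de-duplicated rows
```

To check the result, a grouped count over the same window should return no rows once the duplicates are gone:

```sql
SELECT Timestamp, column_1, column_2, column_3, column_4, COUNT(*) AS copies
FROM BIGQUERY_TABLE
WHERE Timestamp > 1709424000000 AND Timestamp < 1709510400000
GROUP BY Timestamp, column_1, column_2, column_3, column_4
HAVING COUNT(*) > 1;
```

Running this check before and after, on a copy of the data if possible, also shows whether the DELETE above removed more rows than intended.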