I have a table with about 700K rows, each with a `REPEATED` field (on average 150 values per id). I need to find the pairs of ids that share at least one value.
My table:
| id | value |
|---|---|
| A | v1 |
|   | v2 |
|   | v3 |
| B | v2 |
| C | v8 |
| D | v2 |
|   | v3 |
Output:
id1 | id2 |
---|---|
A | B |
A | D |
B | D |
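I flattened the table (~100 million rows) and then joined it with itself to get the id pairs that have the same value, but the query runs for a very long time. Given the scale, is there a more efficient way to do this?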
WITH
  -- Results in ~100 million rows
  flattened_table AS (
    SELECT
      id,
      value,
    FROM
      my_table,
      UNNEST(value) AS value
  )
SELECT
  t1.id AS id1,
  t2.id AS id2,
FROM
  flattened_table t1,
  flattened_table t2
WHERE
  t1.id < t2.id -- Get (A,B) only, don't need (B,A)
  AND t1.value = t2.value
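You can try aggregating the ids grouped by value. This avoids the self-join of the flattened table, which is probably what slows the query down.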
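WITH
  values_aggregated AS (
    SELECT
      value,
      ARRAY_AGG(id) AS ids
    FROM
      my_table,
      UNNEST(value) AS value
    GROUP BY
      value
  )
-- Expand every pair of ids inside each value's array. Reading only
-- ids[OFFSET(0)] and ids[OFFSET(1)] would return at most one pair per value
-- and miss pairs such as (A,D) and (B,D) for v2.
SELECT DISTINCT -- two ids can share several values, so deduplicate
  id1,
  id2
FROM
  values_aggregated,
  UNNEST(ids) AS id1,
  UNNEST(ids) AS id2
WHERE
  id1 < id2 -- keep (A,B) only; drops (B,A), self-pairs, and single-id values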
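To check the aggregation idea against the sample table from the question, the rows can be inlined as a CTE. The following is a minimal sketch, assuming BigQuery Standard SQL. The inline `my_table` CTE and the alias `v` are illustrative names, not part of the original post. Note that it expands every pair of ids per value with two `UNNEST` calls rather than reading only the first two array elements, because a value such as v2 is shared by three ids.

```sql
-- Sketch only: inline copy of the question's sample rows (BigQuery Standard SQL assumed).
WITH
  my_table AS (
    SELECT 'A' AS id, ['v1', 'v2', 'v3'] AS value UNION ALL
    SELECT 'B', ['v2'] UNION ALL
    SELECT 'C', ['v8'] UNION ALL
    SELECT 'D', ['v2', 'v3']
  ),
  values_aggregated AS (
    SELECT
      v,                              -- hypothetical alias for one unnested value
      ARRAY_AGG(DISTINCT id) AS ids   -- all ids that contain this value
    FROM
      my_table,
      UNNEST(value) AS v
    GROUP BY
      v
  )
SELECT DISTINCT
  id1,
  id2
FROM
  values_aggregated,
  UNNEST(ids) AS id1,
  UNNEST(ids) AS id2
WHERE
  id1 < id2   -- one orientation per pair, no self-pairs
ORDER BY
  id1, id2
```

On the sample data, v2 groups A, B and D, and v3 groups A and D, so the distinct pairs are (A, B), (A, D) and (B, D), which matches the expected output above. A value shared by n ids still produces n(n-1)/2 pairs, so values shared by very many ids may dominate the cost at full scale.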