Find pairs of rows in a table that share a value in a given column

Question  Votes: 0  Answers: 1

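I have a table with about 700K rows, each of which has a REPEATED field (on average 150 values per id). I need to find the pairs of ids that have at least one value in common.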

My table:

id  value
A   v1
    v2
    v3
B   v2
C   v8
D   v2
    v3

Output:

id1 id2
A B
A D
B D

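I flattened the table (about 100 million rows) and then joined it with itself to get the id pairs whose values match, but the query runs for a very long time. Given the scale, is there a more efficient way to do this?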

WITH

-- Results in ~100 million rows
flattened_table AS (
  SELECT 
    id,
    value,
  FROM
    my_table,
    UNNEST(value) AS value
)

SELECT 
  t1.id AS id1,
  t2.id AS id2,
FROM
  flattened_table t1,
  flattened_table t2
WHERE
  t1.id < t2.id      -- Get (A,B) only, don't need (B,A)
  AND t1.value = t2.value
sql google-bigquery
1 Answer
0 votes

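You could try aggregating the ids grouped by value, which avoids the self-join on the flattened table that may be slowing the query down.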

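WITH
-- One row per distinct value, carrying every id that contains it
values_aggregated AS (
  SELECT
    value,
    ARRAY_AGG(DISTINCT id) AS ids
  FROM
    my_table,
    UNNEST(value) AS value
  GROUP BY
    value
)

-- Reading only ids[OFFSET(0)] and ids[OFFSET(1)] would miss pairs when
-- three or more ids share a value, so expand each array into all pairs.
SELECT DISTINCT
  id1,
  id2
FROM
  values_aggregated,
  UNNEST(ids) AS id1,
  UNNEST(ids) AS id2
WHERE
  id1 < id2  -- keep (A,B) only, not (B,A); also drops values with a single id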
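
As a minimal sketch, the query below runs the group-then-pair idea against an inline copy of the sample table from the question, so it can be pasted into the BigQuery console as is. The inline `my_table` CTE and the alias `v` are illustrative stand-ins, not part of the original schema. It expands each id array into all ordered pairs instead of reading fixed offsets, which is an assumption about the intended result based on the expected output shown in the question.

```sql
-- Hypothetical inline copy of the sample table from the question.
WITH my_table AS (
  SELECT 'A' AS id, ['v1', 'v2', 'v3'] AS value UNION ALL
  SELECT 'B', ['v2'] UNION ALL
  SELECT 'C', ['v8'] UNION ALL
  SELECT 'D', ['v2', 'v3']
),

-- One row per distinct value, carrying every id that contains it.
values_aggregated AS (
  SELECT
    v,
    ARRAY_AGG(DISTINCT id) AS ids
  FROM
    my_table,
    UNNEST(value) AS v
  GROUP BY
    v
)

-- Expand each group into ordered pairs. DISTINCT collapses pairs that
-- share more than one value, such as (A, D) via both v2 and v3.
SELECT DISTINCT
  id1,
  id2
FROM
  values_aggregated,
  UNNEST(ids) AS id1,
  UNNEST(ids) AS id2
WHERE
  id1 < id2
ORDER BY
  id1, id2
```

On the sample data this returns (A, B), (A, D) and (B, D), matching the expected output. Note that a value shared by n ids still produces n(n-1)/2 pairs before deduplication, so a few very common values can dominate the run time at the 100-million-row scale.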

