Redshift - array returns a single value per record

Question

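I have a table with the following fields:

email

- the email of the logged-in user

allowed_id

- the ID of another user

The table contains multiple entries for the same email, each with a different allowed_id.

I am trying to aggregate these into an array so that I can store it in Redis and speed up one of our internal processes.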

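Normally I would use ArrayAgg, but it is not available in Redshift. Redshift has a ListAgg function that works similarly, but it converts everything to a string and has a 64K length limit, which I already hit on my first attempt. The dataset will be even larger once this moves to production.

Note that query time is not a concern: it will run as a cron job every day at around 2:00 AM.

I have been trying to use the Array function, but it returns something like this: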

email, [id]
same_email, [another_id]

That is not what I am looking for.

Here is my query:


    SELECT
      email,
      ARRAY(allowed_id) AS user_ids
    FROM
      sec_table
    GROUP BY
      email, allowed_id;

To make it clearer, this is the kind of result I want to achieve:

email, [id1, id2, id3]

Answer 1

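I believe the 64K listagg limit is just that - a hard limit.

See: How to handle Listagg size limit in Redshift? (N.B. adjust the 10000 used below to suit your data. NTILE(n) splits each email's rows into at most n chunks, so an email with fewer than n ids ends up with one id per chunk; pick the smallest n for which every chunk's string still fits within the limit.)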

    -- Split each email's ids into chunks so that each LISTAGG result stays under the limit
    WITH numbered_rows AS (
      SELECT
        email,
        allowed_id,
        NTILE(10000) OVER (PARTITION BY email ORDER BY allowed_id) AS chunk
      FROM sec_table
    )
    SELECT
      email,
      chunk,
      LISTAGG(allowed_id, ',') WITHIN GROUP (ORDER BY allowed_id) AS allowed_ids
    FROM numbered_rows
    GROUP BY email, chunk;
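
As a rough way to pick the NTILE bucket count instead of guessing, the arithmetic can be sketched as below. This is only an estimate under stated assumptions: the helper name `min_buckets` and its parameters are hypothetical, the limit is taken to be 65535 bytes, and you would supply the largest number of ids any single email has and the longest id (in bytes) from your own data.

```python
import math

LISTAGG_LIMIT_BYTES = 65535  # assumed maximum size of one LISTAGG result


def min_buckets(max_ids_per_email, max_id_bytes, delimiter_bytes=1):
    """Estimate the smallest NTILE bucket count that keeps every chunk under the limit.

    Hypothetical helper: both arguments are worst-case figures measured on your data.
    """
    # A chunk of k ids needs k * id_bytes + (k - 1) * delimiter_bytes bytes at most.
    ids_per_chunk = (LISTAGG_LIMIT_BYTES + delimiter_bytes) // (max_id_bytes + delimiter_bytes)
    # NTILE spreads rows evenly, so the largest chunk holds ceil(rows / buckets) ids.
    return max(1, math.ceil(max_ids_per_email / ids_per_chunk))


# Example: at most 100,000 ids per email, each id at most 10 bytes long.
buckets = min_buckets(100_000, 10)
print(buckets)
```

The result is a lower bound for the worst-case email; rounding it up a little leaves some headroom if the data grows.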

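With this approach you end up with fewer rows, some of which still need to be stitched together (perhaps using Python? I am not sure whether that solves the memory issue).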
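
The stitching step could look something like the following minimal sketch. It assumes the chunked rows arrive as (email, chunk, allowed_ids) tuples already sorted by email and then chunk (for example by appending ORDER BY email, chunk to the query), so only one email's ids are held in memory at a time. The function name `stitch_chunks` and the sample rows are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def stitch_chunks(rows):
    """Yield (email, ids) pairs from (email, chunk, allowed_ids) rows.

    Assumes rows are sorted by email, then chunk, so consecutive rows
    for the same email can be merged without buffering everything.
    """
    for email, group in groupby(rows, key=itemgetter(0)):
        ids = []
        for _, _, allowed_ids in group:
            # Each allowed_ids value is a comma-separated string; skip empty pieces.
            ids.extend(part for part in allowed_ids.split(",") if part)
        yield email, ids


# Hypothetical rows, shaped like the output of the chunked LISTAGG query.
rows = [
    ("a@example.com", 1, "10,20"),
    ("a@example.com", 2, "30,40"),
    ("b@example.com", 1, "7"),
]
stitched = list(stitch_chunks(rows))
print(stitched)
```

Because the function is a generator, each email's list can be written to Redis as soon as it is produced rather than collecting the whole result first.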

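Or - and I almost never suggest this - try a procedural approach.

Create a summary table with a SUPER column, e.g.:

    CREATE TABLE email_summary (
        email VARCHAR(256),
        allowed_ids SUPER
    );

Now use a stored procedure to populate that table, e.g.: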

    CREATE OR REPLACE PROCEDURE create_summary()
    LANGUAGE plpgsql
    AS $$
    DECLARE
        cur_email VARCHAR(256);
        cur_allowed_id VARCHAR(256);
        -- Initialize an empty SUPER array (casting the string '[]' to SUPER would give a string, not an array)
        cur_allowed_ids SUPER := ARRAY();
        prev_email VARCHAR(256) := NULL;
    BEGIN
        FOR cur_email, cur_allowed_id IN SELECT email, allowed_id FROM sec_table ORDER BY email
        LOOP
            IF cur_email != prev_email AND prev_email IS NOT NULL THEN
                -- Insert the previous email and its allowed_ids into the summary table
                INSERT INTO email_summary (email, allowed_ids) VALUES (prev_email, cur_allowed_ids);
                -- Reset the allowed_ids array for the next email
                cur_allowed_ids := ARRAY();
            END IF;
            -- Append the current allowed_id as a one-element array
            cur_allowed_ids := ARRAY_CONCAT(cur_allowed_ids, ARRAY(cur_allowed_id));
            -- Remember the current email for the next iteration
            prev_email := cur_email;
        END LOOP;
        -- Don't forget to insert the last email and its allowed_ids into the summary table
        IF prev_email IS NOT NULL THEN
            INSERT INTO email_summary (email, allowed_ids) VALUES (prev_email, cur_allowed_ids);
        END IF;
    END;
    $$;

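Caveats: try this on a small scale first, because what you see above is completely untested and, if it works, may be slow. You then face the problem of getting the data out of the summary table - that is probably a separate question, and not one I want to go into here.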
