How to select records in SQL according to a probability distribution on one column while ensuring another column's values are unique

Question

Background

I have the following tables in Microsoft SQL Server:

MainTable

UserId
EmailType
Column 1
Column 2
..
Column N

EmailDistribution

EmailType
WeightageRequired

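In both tables, EmailType has about 10 distinct possible values, say A-J. Also, the WeightageRequired values for those 10 types sum to 1.

In MainTable there are many combinations of UserId and EmailType, but each UserId-EmailType combination is unique. So the following are possible:

UserId - EmailType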

1-A
1-B
1-D
2-A
2-C
2-G
3-A
3-B

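and so on. Also, a given UserId does not need to be combined with every EmailType value. In the example above, UserId 1 has rows only for A, B and D, and none for the remaining possible EmailType values.

Now, the requirement:

Out of the 50000 rows in MainTable, I want to select 1 row per UserId so that the distribution of EmailType is as close as possible to the WeightageRequired in EmailDistribution.

For example, if the 50000 rows contain 12000 unique UserIds, the result set must have 12000 rows (and only 12000 rows). Within the set of rows for a given UserId, a row should be picked at random in such a way that the required weight distribution is achieved.

There is no hard requirement that the weights match exactly, but the closer they are, the better the model fit.

I hope I have explained my problem correctly.

Requesting help from the experts on StackOverflow.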

Answer 1

Here is a non-looping version. It is not ideal, but you can create a series of CTEs to pick distinct user IDs.

Let's assume these are your weights by email type:

EmailType   Weights
A           0.000976563
B           0.001953125
C           0.00390625
D           0.0078125
E           0.015625
F           0.03125
G           0.0625
H           0.125
I           0.25
J           0.5

Assume there are 10,000 distinct UserIDs in the main table. Group A should then account for about 10 rows (10,000 * 0.00097).
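The same arithmetic can be repeated for every group with a query along these lines. This is only a minimal sketch: it assumes the table and column names from the question (MainTable, EmailDistribution, UserId, EmailType, WeightageRequired), and the TotalUsers and TargetRows aliases are made up for illustration.

```sql
-- Sketch: target number of rows per EmailType.
-- Assumes the tables and columns described in the question;
-- TotalUsers and TargetRows are illustrative aliases.
SELECT d.EmailType,
       d.WeightageRequired,
       u.TotalUsers,
       -- distinct users * weight, rounded to a whole number of rows
       CAST(ROUND(u.TotalUsers * d.WeightageRequired, 0) AS int) AS TargetRows
FROM EmailDistribution AS d
CROSS JOIN (SELECT COUNT(DISTINCT UserId) AS TotalUsers
            FROM MainTable) AS u
ORDER BY d.EmailType;
```

Because each target is rounded separately, the TargetRows values may not add up exactly to the number of distinct users. The TOP values in the query that follows are rough versions of these targets.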

with group_a as (
select distinct top 10 userid, 'A' as email_type
  from mainTable
 where emailType = 'A'
),
group_b as (
select distinct top 20 userid, 'B' as email_type
  from mainTable
 where emailType = 'B'
   and userid not in (select userid from group_a)
),
group_c as (
select distinct top 40 userid, 'C' as email_type
  from mainTable
 where emailType = 'C'
   and userid not in (select userid from group_a)
   and userid not in (select userid from group_b)
),
...
group_j as (
select distinct top 5000 userid, 'J' as email_type
  from mainTable
 where emailType = 'J'
   and userid not in (select userid from group_a)
   and userid not in (select userid from group_b)
   and userid not in (select userid from group_c)
   and userid not in (select userid from group_d)
   and userid not in (select userid from group_e)
   and userid not in (select userid from group_f)
   and userid not in (select userid from group_g)
   and userid not in (select userid from group_h)
   and userid not in (select userid from group_i)
)
select userid, email_type from group_a union
select userid, email_type from group_b union 
select userid, email_type from group_c union
...
select userid, email_type from group_j

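This sort of "works", but it does not guarantee that you get all 10,000 user IDs. For example, the first 3 CTEs might take some of the users you need for group D. If you are looking for a non-looping solution, then perhaps first count how many unique IDs you have per EmailType and compare that with your weights. That may help you decide how to order the CTEs (if you go with this approach).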
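The count-and-compare step could look something like the sketch below. It assumes the table and column names from the question; the CTE names (available, total) and the AvailableUsers, TargetRows and Headroom aliases are hypothetical, and nothing here has been run against real data.

```sql
-- Sketch: compare how many distinct users exist per EmailType
-- with the number of rows the weights ask for.
WITH available AS (
    SELECT EmailType, COUNT(DISTINCT UserId) AS AvailableUsers
    FROM MainTable
    GROUP BY EmailType
),
total AS (
    SELECT COUNT(DISTINCT UserId) AS TotalUsers
    FROM MainTable
)
SELECT d.EmailType,
       COALESCE(a.AvailableUsers, 0) AS AvailableUsers,
       CAST(ROUND(t.TotalUsers * d.WeightageRequired, 0) AS int) AS TargetRows,
       -- ratio of available users to the unrounded target;
       -- NULLIF avoids a divide-by-zero for a zero weight
       COALESCE(a.AvailableUsers, 0) * 1.0
           / NULLIF(t.TotalUsers * d.WeightageRequired, 0) AS Headroom
FROM EmailDistribution AS d
CROSS JOIN total AS t
LEFT JOIN available AS a
       ON a.EmailType = d.EmailType
ORDER BY Headroom;
```

One possible reading of the output: an EmailType whose Headroom is close to or below 1 has barely enough users to meet its target, so it may be worth placing its CTE earlier in the chain, before other groups claim those users. This is a heuristic, not a guarantee.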

Again, this is not ideal, but it may get you close to what you need.
