How to improve bulk inserts into a Postgres DB


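I currently have a C# service that uses Dapper to call a stored procedure that does two things: if the customer exists, it fetches the customer GUID and adds it to the CustomerInformations table; if the customer does not exist, it inserts the customer, then returns the new GUID and adds it to the CustomerInformations table.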

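Previously, inserts ran at about 1.75 million records per hour. Now the service barely manages 200,000 records per hour. My CustomerInformations table holds about 75 million records, and I am looking to resolve the bottleneck.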

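The service loops over the customer properties and calls the stored procedure once per property. Each call can perform 2 inserts into the database: first it adds the customer to the Customers table, then it adds the property to the CustomerInformations table. I know this may not be the ideal way to store the data, but it is not something I can change.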

C# service:

foreach (var info in request.Data)
{
    string sql = "add_one_by_customer";
    object parameters = new
    {
        p_customer_first_name = info.FirstName,
        p_customer_last_name = info.LastName,
        p_customer_property_name = info.PropertyName,
        p_customer_property_value = info.PropertyValue
    };

    try
    {
        // One database round trip per property.
        await db.ExecuteAsync(sql, parameters, transaction: transaction, commandType: CommandType.StoredProcedure);
    }
    catch (Exception e)
    {
        // Pass the original exception along as InnerException so the cause is not lost.
        throw new Exception("Failed to insert", e);
    }
}

Postgres stored procedure:

CREATE OR REPLACE PROCEDURE add_one_by_customer(
    p_customer_first_name  VARCHAR,
    p_customer_last_name  VARCHAR,
    p_customer_property_name  VARCHAR,
    p_customer_property_value  VARCHAR
    )
    LANGUAGE plpgsql
AS $procedure$
DECLARE p_customer_id uuid;
        p_current_item_value varchar;
begin
    -- Look up an existing customer by name.
    SELECT customer_id
    INTO p_customer_id
    FROM customers
    WHERE customer_first_name = p_customer_first_name AND
          customer_last_name = p_customer_last_name
    limit 1;

    IF (p_customer_id IS NULL) THEN
        begin
            INSERT INTO customers(customer_first_name, customer_last_name)
            VALUES (p_customer_first_name, p_customer_last_name) RETURNING customer_id into p_customer_id;
        EXCEPTION WHEN unique_violation THEN
            -- A parallel worker inserted the same customer first: read its id instead.
            p_customer_id = (SELECT customer_id
                             FROM customers
                             WHERE customer_first_name = p_customer_first_name AND
                                   customer_last_name = p_customer_last_name);
        END;
    end if;

    -- Current value of this property for the customer, if any.
    p_current_item_value := (select customer_property_value
                             from customer_informations
                             where customer_id = p_customer_id AND
                                   customer_property_name = p_customer_property_name);

    if (p_current_item_value is NULL) THEN
        INSERT INTO customer_informations(customer_id, customer_property_name, customer_property_value)
        VALUES (p_customer_id, p_customer_property_name, p_customer_property_value);
    elseif (p_current_item_value is not null AND p_current_item_value != p_customer_property_value) then
        -- Overwrite only this property, and only when the value changed.
        UPDATE customer_informations
        SET customer_property_value = p_customer_property_value
        WHERE customer_id = p_customer_id AND
              customer_property_name = p_customer_property_name;
    end if;
end; $procedure$;

My CustomerInformations table currently has a unique constraint on (Customer_Id, Customer_property_name).

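Things I have tried or considered:

  • Parallelizing in the service (which is why you see the unique_violation exception handler in the stored procedure). This did speed things up, but not enough.
  • Dropping the unique constraints and indexes. I am not sure how easy it would be to clean up duplicates afterwards, since other people interact with the database.

Any tips or suggestions would be greatly appreciated.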

Unique constraint on CustomerInformations:

CONSTRAINT ux_customer_informations UNIQUE (customer_id, customer_property_name)

Unique constraint on Customers:

CONSTRAINT ux_customers UNIQUE (customer_first_name, customer_last_name)
c# postgresql bulkinsert postgresql-performance
1 Answer

Your current procedure is extremely inefficient.

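Avoid nested code blocks with error handling (an EXCEPTION clause); entering and leaving such a block is considerably more expensive than a plain block. The same result can be achieved properly with the "SELECT or INSERT" technique I use here.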

The second part is an UPSERT in disguise. That is also much cheaper now:

CREATE OR REPLACE PROCEDURE add_one_by_customer(
      p_customer_first_name      text
    , p_customer_last_name       text
    , p_customer_property_name   text
    , p_customer_property_value  text
      )
  LANGUAGE plpgsql AS
$proc$
DECLARE
   p_customer_id uuid;
BEGIN
   -- "SELECT or INSERT": loop until the customer is either found or inserted.
   LOOP
      SELECT customer_id
      FROM   customers
      WHERE  customer_first_name = p_customer_first_name
      AND    customer_last_name = p_customer_last_name
      INTO   p_customer_id;

      EXIT WHEN FOUND;

      -- Returns no row if a concurrent transaction inserted the same customer;
      -- in that case loop again and let the SELECT pick the row up.
      INSERT INTO customers
             (  customer_first_name,   customer_last_name)
      VALUES (p_customer_first_name, p_customer_last_name)
      ON     CONFLICT (customer_first_name, customer_last_name) DO NOTHING
      RETURNING customer_id
      INTO   p_customer_id;

      EXIT WHEN FOUND;
   END LOOP;

   -- UPSERT the property; skip the write when the stored value is already the same.
   INSERT INTO customer_informations AS ci
          (  customer_id,   customer_property_name,   customer_property_value)
   VALUES (p_customer_id, p_customer_property_name, p_customer_property_value)
   ON     CONFLICT (customer_id, customer_property_name) DO UPDATE
   SET    customer_property_value = EXCLUDED.customer_property_value
   WHERE  ci.customer_property_value IS DISTINCT FROM EXCLUDED.customer_property_value;
END
$proc$;

This requires a UNIQUE constraint on each of the two tables, which is exactly what you have declared (ux_customer_informations and ux_customers).

If neither the stored customer_property_value nor the new value can be null, the final WHERE clause simplifies to:

...
WHERE  ci.customer_property_value <> EXCLUDED.customer_property_value;
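
To try the procedure end to end, something like the following sketch may help. Table, column and constraint names are taken from the question. The data types, the gen_random_uuid() default (built in since PostgreSQL 13, provided by the pgcrypto extension before that), the NOT NULL markers and the foreign key are assumptions, and the sample names and values are made up.

```sql
-- Assumed schema, for illustration only: names follow the question,
-- but types, defaults and the foreign key are guesses.
CREATE TABLE customers (
    customer_id         uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    customer_first_name text NOT NULL,
    customer_last_name  text NOT NULL,
    CONSTRAINT ux_customers UNIQUE (customer_first_name, customer_last_name)
);

CREATE TABLE customer_informations (
    customer_id             uuid NOT NULL REFERENCES customers,
    customer_property_name  text NOT NULL,
    customer_property_value text,
    CONSTRAINT ux_customer_informations UNIQUE (customer_id, customer_property_name)
);

-- New customer and new property: both rows are inserted.
CALL add_one_by_customer('Ada', 'Lovelace', 'email', 'ada@example.com');

-- Same customer and property, different value: the existing row is updated.
CALL add_one_by_customer('Ada', 'Lovelace', 'email', 'ada@example.org');

-- Same value again: nothing needs to change.
CALL add_one_by_customer('Ada', 'Lovelace', 'email', 'ada@example.org');

-- Inspect the result.
SELECT c.customer_first_name, c.customer_last_name,
       ci.customer_property_name, ci.customer_property_value
FROM   customers c
JOIN   customer_informations ci USING (customer_id);
```

Both ON CONFLICT clauses need a unique index or constraint on exactly the listed columns, so without ux_customers and ux_customer_informations (or equivalents) the INSERT statements in the procedure raise an error instead of resolving the conflict.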