MySQL: Concatenate duplicate data but ignore a string in the duplicates

Question  Votes: 0  Answers: 3

Is there a way to find duplicate data while ignoring a given string?

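For example, if I have a table of names, is there a way to concatenate the rows whose name is "Ann Smith" while ignoring the string "Dr."? Rows containing "Ann Smith" and "Dr. Ann Smith" should be merged into a single row whose name is "Dr. Ann Smith". If the names match (minus the "Dr." string) and the addresses of the two rows match, the phone numbers should be concatenated. I want to keep the larger of the two names, which I think will involve using MAX.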

Currently I have a table named t:

name          | phone      | address
ann smith     | 1234567899 | 123 home address
dr. ann smith | 1234567890 | 123 home address
brian smith   | 1235551234 | 789 city street
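
To reproduce this, the table can be created roughly as follows. Only the table name and column names come from the listing above; the column types and lengths are assumptions made for this sketch.

```sql
-- Hypothetical setup: the varchar lengths are assumptions, not taken from the question.
create table t (
    name    varchar(100),
    phone   varchar(20),
    address varchar(200)
);

-- The three sample rows shown in the question.
insert into t (name, phone, address) values
    ('ann smith',     '1234567899', '123 home address'),
    ('dr. ann smith', '1234567890', '123 home address'),
    ('brian smith',   '1235551234', '789 city street');
```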

I want to get:

name          | phone                  | address
dr. ann smith | 1234567890, 1234567899 | 123 home address
brian smith   | 1235551234             | 789 city street
mysql sql mysql-5.7 fuzzy-comparison
3 Answers
1
vote

To do what you want, you would probably need CTEs (common table expressions) and LATERAL queries. Unfortunately, MySQL 5.x implements neither of them.

The following query finds the duplicated names:

select plain_name, count(*)
  from (
    select name, trim(replace(lower(name), lower('Dr.'), '')) as plain_name
      from my_table
  ) x
  group by plain_name
  having count(*) > 1

This is a step in the right direction, but you will need further processing to get the result you want.

If you upgrade to MySQL 8 you get CTEs. LATERAL derived tables arrived only in MySQL 8.0.14, so earlier 8.0 releases still lack them.

Edit: I went one step further to identify the rows with duplicated names. Without CTEs, this query gets uglier and uglier:

select z.*, y.times
  from (
    select name, trim(replace(lower(name), lower('Dr.'), '')) as plain_name
      from my_table
  ) z,
  (
    select plain_name, count(*) as times
      from (
        select name, trim(replace(lower(name), lower('Dr.'), '')) as plain_name
          from my_table
      ) x
      group by plain_name
      having count(*) > 1
  ) y
  where z.plain_name = y.plain_name;
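
For comparison, on MySQL 8.0 or later the same lookup can be written with CTEs, so the plain_name expression appears only once. This is a sketch under that version assumption and will not run on 5.7. The CTE names plain and dups are made up for illustration.

```sql
-- MySQL 8.0+ only: CTEs are not available in 5.x.
with plain as (
    -- normalise each name by lower-casing it and removing 'dr.'
    select name, trim(replace(lower(name), lower('Dr.'), '')) as plain_name
      from my_table
),
dups as (
    -- keep only the normalised names that occur more than once
    select plain_name, count(*) as times
      from plain
      group by plain_name
      having count(*) > 1
)
select p.*, d.times
  from plain p
  join dups d on d.plain_name = p.plain_name;
```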

1
vote

Assuming the names are totally nested (the short name is contained in the long one), you can get the "long form" with:

select name,
       (select t2.name
        from t t2
        where t2.name like concat('%', t.name, '%')
        order by length(t2.name) desc
        limit 1
       ) as long_form
from t;

You can then use this in an aggregation. I would use a subquery:

select long_form, group_concat(distinct phone) as phones,
       group_concat(distinct address) as addresses
from (select t.*,
             (select t2.name
              from t t2
              where t2.name like concat('%', t.name, '%')
              order by length(t2.name) desc
              limit 1
             ) as long_form
      from t
     ) tt
group by long_form;
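
The query above groups on the long form alone and does not compare addresses. Since the question only wants rows merged when the addresses also match, one possible variant adds the address to both the correlated subquery and the GROUP BY. This is a sketch and not part of the original answer. The `order by ... separator ', '` inside group_concat is an addition to get a stable, comma-and-space separated list.

```sql
select long_form as name,
       group_concat(distinct phone order by phone separator ', ') as phone,
       address
from (select t.*,
             (select t2.name
              from t t2
              where t2.name like concat('%', t.name, '%')
                and t2.address = t.address   -- only rows at the same address count as duplicates
              order by length(t2.name) desc
              limit 1
             ) as long_form
      from t
     ) tt
group by long_form, address;
```

Note that any % or _ characters inside a name would be treated as LIKE wildcards here.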

0
votes

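I ended up using a combination of the answers above. First, I created a temporary table with a plain_name column, in which the 'dr.' prefix is removed and the result is trimmed.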

create temporary table if not exists temp_names as (
select *,
    -- strip a leading 'dr. ' (4 characters); replace() does not understand the % wildcard
    case when lower(name) like 'dr. %' then trim(substring(lower(name), 5))
    else lower(name) end as plain_name from t);

Then I used select with group by to concatenate the values of the rows in that table that share the same plain_name.

select max(name) as name,
    group_concat(distinct phone order by phone separator ', ') as phone_number,
    address from temp_names
    group by plain_name, address;

This produces a table with the desired result:

name          | phone_number           | address
dr. ann smith | 1234567890, 1234567899 | 123 home address
brian smith   | 1235551234             | 789 city street
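
One caveat: max(name) returns the alphabetically greatest name, which happens to be 'dr. ann smith' here but would not be for a name that sorts after 'dr.' (for example 'zoe smith'). If the longest name is what is wanted, a possible alternative that still works on MySQL 5.7 is sketched below. It assumes the phone column is called phone, as in table t from the question, and that no name contains a newline.

```sql
select substring_index(
           -- list the names longest first, then keep only the first entry
           group_concat(name order by char_length(name) desc separator '\n'),
           '\n', 1) as name,
       group_concat(distinct phone order by phone separator ', ') as phone_number,
       address
  from temp_names
 group by plain_name, address;
```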