使用 PostgreSQL 从 url 中提取域

问题描述 投票:0回答:4

我需要使用 PostgreSQL 提取 URL 列表的域名。在第一个版本中,我尝试使用 REGEXP_REPLACE 替换不需要的字符,如 www.、biz.、sports. 等来获取域名。

 SELECT REGEXP_REPLACE(url, ^((www|www2|www3|static1|biz|health|travel|property|edu|world|newmedia|digital|ent|staging|cpelection|dev|m-staging|m|maa|cdnnews|testing|cdnpuc|shipping|sports|life|static01|cdn|dev1|ad|backends|avm|displayvideo|tand|static03|subscriptionv3|mdev|beta)\.)?', '') AS "Domain", 
 COUNT(DISTINCT(user)) AS "Unique Users"
 FROM db
 GROUP BY 1
 ORDER BY 2 DESC;

这似乎不利,因为查询需要不断更新不需要的单词列表。

我确实尝试过 https://stackoverflow.com/a/21174423/10174021 使用 PostgreSQL REGEXP_SUBSTR 从行尾提取,但是,我得到了空白行作为回报。有更好的方法吗?

尝试使用的数据集示例:

 CREATE TABLE sample (
 url VARCHAR(100) NOT NULL);

 INSERT INTO sample url) 
 VALUES 
 ("sample.co.uk"),
 ("www.sample.co.uk"),
 ("www3.sample.co.uk"),
 ("biz.sample.co.uk"),
 ("digital.testing.sam.co"),
 ("sam.co"),
 ("m.sam.co");

所需输出

+------------------------+--------------+
|    url                 |  domain      |
+------------------------+--------------+
| sample.co.uk           | sample.co.uk |
| www.sample.co.uk       | sample.co.uk |
| www3.sample.co.uk      | sample.co.uk |
| biz.sample.co.uk       | sample.co.uk |
| digital.testing.sam.co | sam.co       |
| sam.co                 | sam.co       |
| m.sam.co               | sam.co       |
+------------------------+--------------+
regex postgresql url
4个回答
3
投票

所以,我使用 Jeremy 和 Rémy Baron 的答案找到了解决方案。

  1. public suffix中提取所有的public后缀并存储到 我标记为 tlds 的表格。

  2. 获取数据集中的唯一 URL 并与其 TLD 匹配。

  3. 使用 regexp_replace(在此查询中使用)或替代方案
    regexp_substr(t1.url, '([a-z]+)(.)'||t1."tld")
    提取域名。最终输出:

SQL查询如下:

WITH stored_tld AS(
SELECT 
DISTINCT(s.url),
FIRST_VALUE(t.domain) over (PARTITION BY s.url ORDER BY length(t.domain) DESC
                            rows between unbounded preceding and unbounded following) AS "tld" 
FROM sample s 
JOIN tlds t 
ON (s.url like '%%'||domain))

SELECT 
t1.url,
CASE WHEN t1."tld" IS NULL THEN t1.url ELSE regexp_replace(t1.url,'(.*\.)((.[a-z]*).*'||replace(t1."tld",'.','\.')||')','\2') 
END AS "extracted_domain" 
FROM(
    SELECT a.url,st."tld"
    FROM sample a
    LEFT JOIN stored_tld st
    ON a.url = st.url
    )t1

尝试链接:SQL Tester


1
投票

你可以试试这个:

with tlds as (
     select * from (values('.co.uk'),('.co'),('.uk')) a(tld)
) ,
sample as (
    select * from (values ('sample.co.uk'),
                          ('www.sample.co.uk'),
                          ('www3.sample.co.uk'),
                          ('biz.sample.co.uk'),
                          ('digital.testing.sam.co'),
                          ('sam.co'),
                          ('m.sam.co')
                   ) a(url)
     ) 
  select url,regexp_replace(url,'(.*\.)(.*'||replace(tld,'.','\.')||')','\2') "domain" from (
            select distinct url,first_value(tld) over (PARTITION BY url order by length(tld) DESC) tld 
               from sample join tlds on (url like '%'||tld) 
         ) a

1
投票

我使用 split_part(url,'/',3) 为此:

select split_part('https://stackoverflow.com/questions/56019744', '/', 3) ;

输出

stackoverflow.com

0
投票

这是我的解决方案(稍微复杂一点)

WITH
fqdn AS (
    SELECT
        row_number() over () as id,
        url,
        FQDN(url) AS "fqdn"
    FROM urls
),
stored_tld AS (
    SELECT DISTINCT ON (id)
        id,
        url,
        tld,
        fqdn
    FROM fqdn
    LEFT JOIN tlds
           ON reverse(fqdn(url)) LIKE 
              replace(lower(reverse(tld)), '*', '%') || '.%' COLLATE "C"
    ORDER BY id, -- for correct distinct on
             tld LIKE '%*%' DESC, -- prefer tld with wildcard
             length(tld) DESC -- prefer longer tld
),  extrated_domain AS (
    SELECT
        id,
        url,
        fqdn,
        reverse(
                substring(
                        reverse(fqdn), 
                        '#"' || replace(lower(reverse(tld)), '*', '[^.]*') || '.[^.]*#"(.%|)', 
                        '#'
                )
        ) AS "extracted_domain"
    FROM stored_tld
)
SELECT
    url,
    fqdn,
    coalesce(extracted_domain, fqdn) AS "domain",
    extracted_domain IS NOT NULL AS "extracted"
FROM extrated_domain

摆弄评论:https://dbfiddle.uk/QSDKx2-t

FQDN 提取

为了从网址中提取FQDN,您可以使用更复杂的正则表达式https://regex101.com/r/vT9k3d/2

/^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)/igm

此外,您可以将此正则表达式存储为函数

CREATE OR REPLACE FUNCTION fqdn(url TEXT)
  RETURNS TEXT
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
select (regexp_matches(url, '^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)', 'i'))[1]
$function$;

保存行顺序

可能存在重复项,尤其是在提取域之后。最好使用 row_number() 而不是 () 来保存订单

SELECT
    row_number() over () as id,
    url,
    (regexp_matches(url, '^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)', 'i'))[1] AS "fqdn"
FROM urls

模式匹配

首先,我们需要匹配该域的所有模式

fqdn LIKE '%.' || replace(lower(tld), '*', '%') COLLATE "C"

或者,更好的是,使用反向字符串来加速稍后使用带有前缀匹配的索引的过程

reverse(fqdn) LIKE replace(lower(reverse(tld)), '*', '%') || '.%' COLLATE "C"

成绩排名

为了提取最匹配的顶级域名,应使用顺序规则

ORDER BY id, -- for correct distinct on
         tld LIKE '%*%' DESC, -- prefer tld with wildcard
         length(tld) DESC -- prefer longer tld

提取带后缀名的域名,不带子域名

我们将使用 substing postgresql 函数和模式匹配

经过一些实验,我发现这对我有用(对于反向后缀

abc.def

select substring('db.abc.def.fsdfsd', '#"db.[a-z0-9]*.[a-z0-9]*#"(.%|)', '#');
select substring('db.abc.def', '#"db.[a-z0-9]*.[a-z0-9]*#"(.%|)', '#');

所得提取物是

reverse(
        substring(
                reverse(fqdn), 
                '#"' || replace(lower(reverse(tld)), '*', '[^.]*') || '.[^.]*#"(.%|)', 
                '#'
        )
) AS "extracted_domain"

合并结果

在最后一步,我们为未找到的域添加

coalesce
,并添加一个标志来监控域是否被提取。

SELECT
    url,
    fqdn,
    coalesce(extracted_domain, fqdn) AS "domain",
    extracted_domain IS NOT NULL AS "extracted"
FROM extrated_domain

摆弄评论:https://dbfiddle.uk/QSDKx2-t

© www.soinside.com 2019 - 2024. All rights reserved.