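I need to extract the domain name from a list of URLs using PostgreSQL. In my first version, I tried using REGEXP_REPLACE to strip unwanted prefixes such as www., biz., sports., etc. to get the domain name.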
SELECT REGEXP_REPLACE(url, '^((www|www2|www3|static1|biz|health|travel|property|edu|world|newmedia|digital|ent|staging|cpelection|dev|m-staging|m|maa|cdnnews|testing|cdnpuc|shipping|sports|life|static01|cdn|dev1|ad|backends|avm|displayvideo|tand|static03|subscriptionv3|mdev|beta)\.)?', '') AS "Domain",
COUNT(DISTINCT(user)) AS "Unique Users"
FROM db
GROUP BY 1
ORDER BY 2 DESC;
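This doesn't seem ideal, because the list of unwanted prefixes in the query has to be updated constantly.
I did try https://stackoverflow.com/a/21174423/10174021, which uses PostgreSQL REGEXP_SUBSTR to extract from the end of the string, but I got blank rows back. Is there a better way to do this?
A sample dataset to try with: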
CREATE TABLE sample (
url VARCHAR(100) NOT NULL);
INSERT INTO sample (url)
VALUES
('sample.co.uk'),
('www.sample.co.uk'),
('www3.sample.co.uk'),
('biz.sample.co.uk'),
('digital.testing.sam.co'),
('sam.co'),
('m.sam.co');
Desired output
+------------------------+--------------+
| url                    | domain       |
+------------------------+--------------+
| sample.co.uk           | sample.co.uk |
| www.sample.co.uk       | sample.co.uk |
| www3.sample.co.uk      | sample.co.uk |
| biz.sample.co.uk       | sample.co.uk |
| digital.testing.sam.co | sam.co       |
| sam.co                 | sam.co       |
| m.sam.co               | sam.co       |
+------------------------+--------------+
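So, I found a solution using Jeremy's and Rémy Baron's answers.
Extract all the public suffixes from the public suffix list and store them in a table, which I named tlds.
Once each URL has been matched to its suffix, an expression such as
regexp_substr(t1.url, '([a-z]+)(.)'||t1."tld")
can extract the domain name; the final query below uses regexp_replace instead. The SQL query is as follows: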
WITH stored_tld AS(
SELECT
DISTINCT(s.url),
FIRST_VALUE(t.domain) over (PARTITION BY s.url ORDER BY length(t.domain) DESC
rows between unbounded preceding and unbounded following) AS "tld"
FROM sample s
JOIN tlds t
ON (s.url like '%%'||domain))
SELECT
t1.url,
CASE WHEN t1."tld" IS NULL THEN t1.url ELSE regexp_replace(t1.url,'(.*\.)((.[a-z]*).*'||replace(t1."tld",'.','\.')||')','\2')
END AS "extracted_domain"
FROM(
SELECT a.url,st."tld"
FROM sample a
LEFT JOIN stored_tld st
ON a.url = st.url
)t1
Link to try it out: SQL Tester
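The query above depends on a tlds table that is not shown. Here is a minimal sketch of what it might look like. The column name domain comes from the query; the leading dot on each suffix and the three sample rows are my assumptions, and a real table would be loaded from the full public suffix list.

```sql
-- Hypothetical layout: one public suffix per row, stored with a leading dot
-- so that '%' || domain only matches on a label boundary.
CREATE TABLE tlds (
    domain VARCHAR(100) PRIMARY KEY
);

-- A tiny hand-picked subset, just enough for the sample data in the question.
INSERT INTO tlds (domain)
VALUES ('.co.uk'), ('.uk'), ('.co');

-- Both '.co.uk' and '.uk' match this URL; ordering by length keeps the longest one.
SELECT domain
FROM tlds
WHERE 'www.sample.co.uk' LIKE '%' || domain
ORDER BY length(domain) DESC
LIMIT 1;
```

Storing the suffix without the leading dot would also let a suffix like co match a host such as foo.disco, which is why the dot is kept here.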
You can try this:
with tlds as (
select * from (values('.co.uk'),('.co'),('.uk')) a(tld)
) ,
sample as (
select * from (values ('sample.co.uk'),
('www.sample.co.uk'),
('www3.sample.co.uk'),
('biz.sample.co.uk'),
('digital.testing.sam.co'),
('sam.co'),
('m.sam.co')
) a(url)
)
select url,regexp_replace(url,'(.*\.)(.*'||replace(tld,'.','\.')||')','\2') "domain" from (
select distinct url,first_value(tld) over (PARTITION BY url order by length(tld) DESC) tld
from sample join tlds on (url like '%'||tld)
) a
I use split_part(url,'/',3) for this:
select split_part('https://stackoverflow.com/questions/56019744', '/', 3) ;
Output
stackoverflow.com
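Note that split_part(url, '/', 3) relies on the URL starting with a scheme such as https://, and it returns the whole host, subdomains included. For scheme-less values like the ones in the question, a possible variation (a sketch only; the scheme-stripping pattern is my own assumption) is to remove an optional scheme first and then take the first '/'-separated field:

```sql
SELECT url,
       -- Only meaningful when the URL starts with "scheme://".
       split_part(url, '/', 3) AS host_if_scheme_present,
       -- Strip an optional scheme, then keep everything before the first '/'.
       split_part(regexp_replace(url, '^[a-zA-Z][a-zA-Z0-9+.-]*://', ''), '/', 1) AS host
FROM (VALUES ('https://stackoverflow.com/questions/56019744'),
             ('www.sample.co.uk/path')) AS u(url);
```

For the second row the first expression comes back as an empty string, while the second yields www.sample.co.uk. Reducing that to sample.co.uk still needs one of the suffix-based approaches from the other answers.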
Here is my solution (a bit more complex)
WITH
fqdn AS (
SELECT
row_number() over () as id,
url,
FQDN(url) AS "fqdn"
FROM urls
),
stored_tld AS (
SELECT DISTINCT ON (id)
id,
url,
tld,
fqdn
FROM fqdn
LEFT JOIN tlds
ON reverse(fqdn(url)) LIKE
replace(lower(reverse(tld)), '*', '%') || '.%' COLLATE "C"
ORDER BY id, -- for correct distinct on
tld LIKE '%*%' DESC, -- prefer tld with wildcard
length(tld) DESC -- prefer longer tld
), extrated_domain AS (
SELECT
id,
url,
fqdn,
reverse(
substring(
reverse(fqdn),
'#"' || replace(lower(reverse(tld)), '*', '[^.]*') || '.[^.]*#"(.%|)',
'#'
)
) AS "extracted_domain"
FROM stored_tld
)
SELECT
url,
fqdn,
coalesce(extracted_domain, fqdn) AS "domain",
extracted_domain IS NOT NULL AS "extracted"
FROM extrated_domain
Fiddle with comments: https://dbfiddle.uk/QSDKx2-t
To extract the FQDN from a URL, you can use a more elaborate regular expression: https://regex101.com/r/vT9k3d/2
/^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)/igm
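To sanity-check the expression in PostgreSQL itself, something like the following could be used. This is only a sketch: the three URLs are made-up sample values, and regexp_match (available from PostgreSQL 10) is swapped in because it returns NULL instead of dropping the row when nothing matches.

```sql
SELECT url,
       -- Same pattern as above, without the /.../igm delimiters;
       -- the 'i' flag makes the match case-insensitive.
       (regexp_match(url,
                     '^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)',
                     'i'))[1] AS fqdn
FROM (VALUES ('https://www.sample.co.uk/path?x=1'),
             ('http://biz.sample.co.uk:8080/'),
             ('m.sam.co')) AS u(url);
```

On these inputs I would expect sample.co.uk, biz.sample.co.uk and m.sam.co: the scheme, a leading www., the port and the path are dropped, while other subdomains are kept.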
You can also wrap this regular expression in a function
CREATE OR REPLACE FUNCTION fqdn(url TEXT)
RETURNS TEXT
LANGUAGE sql
IMMUTABLE
STRICT
AS $function$
select (regexp_matches(url, '^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)', 'i'))[1]
$function$;
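There may be duplicates, especially after extracting the domain, so it is better to use row_number() over () to give each row an id and keep the order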
SELECT
row_number() over () as id,
url,
(regexp_matches(url, '^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)', 'i'))[1] AS "fqdn"
FROM urls
First, we need to find all the TLD patterns that match the domain
fqdn LIKE '%.' || replace(lower(tld), '*', '%') COLLATE "C"
Or, better, use reversed strings, so that the process can later be sped up with an index that supports prefix matching
reverse(fqdn) LIKE replace(lower(reverse(tld)), '*', '%') || '.%' COLLATE "C"
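To make the reversed comparison easier to follow, here is a small worked sketch. The host names and suffixes are illustrative values only, written without a leading dot and with * as the wildcard, as in public suffix list rules.

```sql
SELECT fqdn,
       tld,
       -- The LIKE pattern that the reversed suffix turns into.
       replace(lower(reverse(tld)), '*', '%') || '.%' AS like_pattern,
       reverse(fqdn) LIKE replace(lower(reverse(tld)), '*', '%') || '.%' AS matches
FROM (VALUES ('www.sample.co.uk', 'co.uk'),
             ('www.sample.co.uk', 'co'),
             ('www.foo.bar.ck',   '*.ck')) AS v(fqdn, tld);
```

The first row compares ku.oc.elpmas.www against ku.oc.% and matches. The second compares it against oc.% and does not. In the third, the wildcard suffix becomes kc.%.%, which matches kc.rab.oof.www.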
To pick the best-matching TLD, the following ordering rules should be used
ORDER BY id, -- for correct distinct on
tld LIKE '%*%' DESC, -- prefer tld with wildcard
length(tld) DESC -- prefer longer tld
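As an illustration of how DISTINCT ON combines with this ordering, the sketch below uses inline VALUES lists as hypothetical stand-ins for the real tlds table and the fqdn CTE:

```sql
WITH tlds(tld) AS (
    VALUES ('uk'), ('co.uk'), ('co')
),
hosts(id, fqdn) AS (
    VALUES (1, 'www.sample.co.uk'),
           (2, 'm.sam.co')
)
SELECT DISTINCT ON (id)
       id,
       fqdn,
       tld
FROM hosts
LEFT JOIN tlds
       ON reverse(fqdn) LIKE replace(lower(reverse(tld)), '*', '%') || '.%'
ORDER BY id,                   -- DISTINCT ON keeps the first row per id
         tld LIKE '%*%' DESC,  -- wildcard suffixes first
         length(tld) DESC;     -- then the longest suffix
```

For www.sample.co.uk both uk and co.uk match, and the length rule keeps co.uk. For m.sam.co only co matches.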
We will use the PostgreSQL substring function with pattern matching.
After some experimenting, I found that this works for me (for the reversed suffix abc.def)
select substring('db.abc.def.fsdfsd', '#"db.[a-z0-9]*.[a-z0-9]*#"(.%|)', '#');
select substring('db.abc.def', '#"db.[a-z0-9]*.[a-z0-9]*#"(.%|)', '#');
The resulting extraction expression is
reverse(
substring(
reverse(fqdn),
'#"' || replace(lower(reverse(tld)), '*', '[^.]*') || '.[^.]*#"(.%|)',
'#'
)
) AS "extracted_domain"
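To see what this expression does on concrete values, here is a small sketch with made-up host and suffix pairs. In this SQL-style pattern the dot is a literal character, #" marks the part to return, and the trailing (.%|) swallows any remaining subdomain labels.

```sql
SELECT fqdn,
       tld,
       reverse(
           substring(
               reverse(fqdn),
               -- reversed suffix, then exactly one more label, then the rest
               '#"' || replace(lower(reverse(tld)), '*', '[^.]*') || '.[^.]*#"(.%|)',
               '#'
           )
       ) AS extracted_domain
FROM (VALUES ('www.sample.co.uk', 'co.uk'),
             ('sam.co',           'co'),
             ('www.foo.bar.ck',   '*.ck')) AS v(fqdn, tld);
```

For the first row the reversed host ku.oc.elpmas.www is cut to ku.oc.elpmas, which reverses back to sample.co.uk. The other two rows give sam.co and foo.bar.ck.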
In the last step, we add coalesce for domains where no match was found, plus a flag to track whether the domain was extracted.
SELECT
url,
fqdn,
coalesce(extracted_domain, fqdn) AS "domain",
extracted_domain IS NOT NULL AS "extracted"
FROM extrated_domain