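I want to extract specific links from a website.
The links look like this:
/topic/Funny/G1pdeJm
The links are always the same, except for the random characters at the end.
I'm having trouble combining these two parts: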
(preg_match("/^http:\/\//i",$str) || is_file($str))
and
(preg_match("/Funny(.*)/", $str) || is_file($str))
The first snippet extracts every link; the second extracts only the /topic/Funny/* links.
Unfortunately, I can't combine them, and I also want to block these tags:
/topic/Funny/viral
/topic/Funny/time
/topic/Funny/top
/topic/Funny/top/week
/topic/Funny/top/month
/topic/Funny/top/year
/topic/Funny/top/all
You can try using a negative lookahead to "filter out" the URLs you don't want:
.*\/Funny\/(?!viral|time|top\/week|top\/month|top\/year|top\/all|top(\n|$)).*
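As a rough sketch of how that pattern could be applied in PHP: the helper name `passesLookaheadFilter` and the sample paths are assumptions made for illustration, and the pattern is simply the one quoted above wrapped in `/` delimiters.

```php
<?php
// Hypothetical helper: returns true when the URL passes the lookahead filter.
function passesLookaheadFilter(string $url): bool
{
    // Same pattern as above; single quotes keep the backslashes intact for PCRE.
    $pattern = '/.*\/Funny\/(?!viral|time|top\/week|top\/month|top\/year|top\/all|top(\n|$)).*/';
    return preg_match($pattern, $url) === 1;
}

// Sample paths taken from the question.
$candidates = [
    '/topic/Funny/G1pdeJm',
    '/topic/Funny/viral',
    '/topic/Funny/top',
    '/topic/Funny/top/week',
];

foreach ($candidates as $url) {
    echo $url, ' => ', passesLookaheadFilter($url) ? 'keep' : 'skip', PHP_EOL;
}
```

Of these four samples, only /topic/Funny/G1pdeJm should be kept.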
I'll prepare a set of test strings and show an implementation that filters URLs with a regular expression.
Regex breakdown:
^
http:// #match literal characters
[^/]+ #match one or more non-slash characters (domain portion)
/topic/Funny/ #match literal characters
(?! #not followed by:
viral #viral
|time #OR time
|top(?:/week|/month|/year|/all)? #OR top, top/week, top/month, top/year, top/all
)
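The breakdown above can also be kept inside the code itself by using PCRE's `x` (extended) flag, which ignores whitespace and `#` comments in the pattern. This is only a sketch; the function name `matchesFunnyTopic` is an assumption, not part of the original answer.

```php
<?php
// Hypothetical helper using a free-spacing (x flag) version of the pattern.
function matchesFunnyTopic(string $url): bool
{
    $pattern = '~
        ^
        http://          # match literal characters
        [^/]+            # one or more non-slash characters (domain portion)
        /topic/Funny/    # match literal characters
        (?!              # not followed by:
            viral        # viral
            |time        # OR time
            |top(?:/week|/month|/year|/all)?  # OR top, top/week, top/month, top/year, top/all
        )
    ~x';
    return preg_match($pattern, $url) === 1;
}

var_dump(matchesFunnyTopic('http://example.com/topic/Funny/G1pdeJm'));
var_dump(matchesFunnyTopic('http://example.com/topic/Funny/top/week'));
```

With the `x` flag, a literal space or `#` in the pattern would have to be escaped, and the comments must not contain the delimiter character (`~` here).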
Implementation (demo):
$tests = [
    'http://example.com/topic/Funny/G1pdeJm',
    'http://example.com/topic/Funny/viral',
    'http://example.com/topic/Funny/time',
    'http://example.com/topic/Funny/top',
    'http://example.com/topic/Funny/top/week',
    'http://example.com/topic/Funny/top/month',
    'http://example.com/topic/Funny/top/year',
    'http://example.com/topic/Funny/top/all',
    'http://example.com/topic/NotFunny/IL2dsRq',
];
$result = [];
foreach ($tests as $str) {
    if (preg_match('~^http://[^/]+/topic/Funny/(?!viral|time|top(?:/week|/month|/year|/all)?)~', $str)) {
        $result[] = $str;
    }
}
var_export($result);
Output:
array (
  0 => 'http://example.com/topic/Funny/G1pdeJm',
)
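One caveat worth noting: the lookahead in that pattern is a prefix check, so a random ID that merely begins with "viral", "time", or "top" would also be rejected. If that matters, one possible variation is to anchor the blocked words to the end of the string. The sketch below assumes that the blocked tags always end the URL; the function name `isWantedFunnyLink` and the sample ID `topX9k2` are made up for illustration.

```php
<?php
// Hypothetical variant: block a tag only when it is the entire final part of the URL.
function isWantedFunnyLink(string $url): bool
{
    // The $ inside the lookahead means "viral", "time", "top", "top/week", and so on
    // are rejected only when nothing follows them.
    $pattern = '~^http://[^/]+/topic/Funny/(?!(?:viral|time|top(?:/(?:week|month|year|all))?)$)~';
    return preg_match($pattern, $url) === 1;
}

$tests = [
    'http://example.com/topic/Funny/G1pdeJm', // ordinary ID
    'http://example.com/topic/Funny/topX9k2', // made-up ID that starts with "top"
    'http://example.com/topic/Funny/top',
    'http://example.com/topic/Funny/top/week',
];

// array_values() reindexes the filtered list from 0.
var_export(array_values(array_filter($tests, 'isWantedFunnyLink')));
```

Here the first two URLs are kept and the last two are filtered out, whereas the unanchored pattern would also drop the `topX9k2` link.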