How to add rel="nofollow" to links with preg_replace()

Question  Votes: 10  Answers: 7

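The function below is meant to apply the rel="nofollow" attribute to all external links and to no internal links, unless the path matches a predefined root URL, defined below as $my_folder.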

So given the variables...

$my_folder = 'http://localhost/mytest/go/';
$blog_url = 'http://localhost/mytest';

And the content...

<a href="http://localhost/mytest/">internal</a>

<a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>

<a href="http://cnn.com">external</a>

The end result, after replacement, should be...

<a href="http://localhost/mytest/">internal</a>

<a href="http://localhost/mytest/go/hostgator" rel="nofollow">internal cloaked link</a>

<a href="http://cnn.com" rel="nofollow">external</a>

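Note that the first link is not changed, because it is an internal link.

The link on the second line is also an internal link, but because it matches our $my_folder string, it gets nofollow too.

The third link is the easiest: since it does not match $blog_url, it is obviously an external link.

However, with the script below, ALL of my links get nofollow. How can I fix the script to do what I want?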

function save_rseo_nofollow($content) {
$my_folder =  $rseo['nofollow_folder'];
$blog_url = get_bloginfo('url');
    preg_match_all('~<a.*>~isU',$content["post_content"],$matches);
    for ( $i = 0; $i <= sizeof($matches[0]); $i++){
        if ( !preg_match( '~nofollow~is',$matches[0][$i])
            && (preg_match('~' . $my_folder . '~', $matches[0][$i]) 
               || !preg_match( '~'.$blog_url.'~',$matches[0][$i]))){
            $result = trim($matches[0][$i],">");
            $result .= ' rel="nofollow">';
            $content["post_content"] = str_replace($matches[0][$i], $result, $content["post_content"]);
        }
    }
    return $content;
}
php regex preg-match
7 Answers
9
votes

First try to make it more readable, and only then make your if rules more complex:

function save_rseo_nofollow($content) {
    $content["post_content"] =
    preg_replace_callback('~<(a\s[^>]+)>~isU', "cb2", $content["post_content"]);
    return $content;
}

function cb2($match) { 
    list($original, $tag) = $match;   // regex match groups

    $my_folder =  "/hostgator";       // re-add quirky config here
    $blog_url = "http://localhost/";

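    // strpos() can return 0 for a match at the start, so compare against false explicitly
    if (strpos($tag, "nofollow") !== false) {
        return $original;
    }
    elseif (strpos($tag, $blog_url) !== false && (!$my_folder || strpos($tag, $my_folder) === false)) {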
        return $original;
    }
    else {
        return "<$tag rel='nofollow'>";
    }
}

This gives the following output:

[post_content] =>
  <a href="http://localhost/mytest/">internal</a>
  <a href="http://localhost/mytest/go/hostgator" rel='nofollow'>internal cloaked link</a>
  <a href="http://cnn.com" rel='nofollow'>external</a>

The problem in the original code was probably $rseo, which is not declared anywhere.
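
To illustrate why every link got nofollow in the question's code: with $rseo undefined inside the function, $my_folder is empty, and the pattern built from it becomes '~~', which matches any string. Below is a minimal sketch of that effect. The sample tags come from the question; wrapping the value in preg_quote() at the end is a suggested hardening, not part of the answer above.

```php
<?php
// Inside save_rseo_nofollow(), $rseo is undefined, so $my_folder ends up empty.
$my_folder = '';
$tag = '<a href="http://localhost/mytest/">';

// '~' . '' . '~' is the empty pattern, which matches every subject string.
$empty_pattern_hits = preg_match('~' . $my_folder . '~', $tag);

// With a real folder value, quote it so characters such as '.' are taken literally.
$my_folder = 'http://localhost/mytest/go/';
$pattern = '~href=["\']' . preg_quote($my_folder, '~') . '~i';

$internal_hits = preg_match($pattern, $tag);
$cloaked_hits  = preg_match($pattern, '<a href="http://localhost/mytest/go/hostgator">');

var_dump($empty_pattern_hits, $internal_hits, $cloaked_hits);
```

Where $rseo is supposed to come from is not shown in the question, so it would presumably have to be passed in or declared global before the check can work.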


14
votes

Here is a DOMDocument solution...

$str = '<a href="http://localhost/mytest/">internal</a>

<a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>

<a href="http://cnn.com" rel="me">external</a>

<a href="http://google.com">external</a>

<a href="http://example.com" rel="nofollow">external</a>

<a href="http://stackoverflow.com" rel="junk in the rel">external</a>
';
$dom = new DOMDocument();

$dom->preserveWhiteSpace = FALSE;

$dom->loadHTML($str);

$a = $dom->getElementsByTagName('a');

$host = strtok($_SERVER['HTTP_HOST'], ':');

foreach($a as $anchor) {
        $href = $anchor->attributes->getNamedItem('href')->nodeValue;

        if (preg_match('/^https?:\/\/' . preg_quote($host, '/') . '/', $href)) {
           continue;
        }

        $noFollowRel = 'nofollow';
        $oldRelAtt = $anchor->attributes->getNamedItem('rel');

        if ($oldRelAtt == NULL) {
            $newRel = $noFollowRel;
        } else {
            $oldRel = $oldRelAtt->nodeValue;
            $oldRel = explode(' ', $oldRel);
            if (in_array($noFollowRel, $oldRel)) {
                continue;
            }
            $oldRel[] = $noFollowRel;
            // separator goes first; the reversed argument order was removed in PHP 8
            $newRel = implode(' ', $oldRel);
        }

        // setAttribute() creates the rel attribute or overwrites the existing one
        $anchor->setAttribute('rel', $newRel);

}

var_dump($dom->saveHTML());

Output

string(509) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<a href="http://localhost/mytest/">internal</a>

<a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>

<a href="http://cnn.com" rel="me nofollow">external</a>

<a href="http://google.com" rel="nofollow">external</a>

<a href="http://example.com" rel="nofollow">external</a>

<a href="http://stackoverflow.com" rel="junk in the rel nofollow">external</a>
</body></html>
"

7
votes

Try this (PHP 5.3+):

  • skips selected addresses
  • allows the rel parameter to be set manually

And the code:

function nofollow($html, $skip = null) {
    return preg_replace_callback(
        "#(<a[^>]+?)>#is", function ($mach) use ($skip) {
            return (
                !($skip && strpos($mach[1], $skip) !== false) &&
                strpos($mach[1], 'rel=') === false
            ) ? $mach[1] . ' rel="nofollow">' : $mach[0];
        },
        $html
    );
}

Examples:

echo nofollow('<a href="link somewhere" rel="something">something</a>');
// stays the same because it already contains a rel parameter

echo nofollow('<a href="http://www.cnn.com">something</a>');
// adds the rel="nofollow" parameter to the anchor

echo nofollow('<a href="http://localhost">something</a>', 'localhost');
// skips this link because it is an internal link

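3
votes

Doing this job properly with regular expressions would be quite complicated. It is easier to use an actual parser, such as the one in the DOM extension. DOM is not very beginner-friendly, so what you can do is load the HTML with DOM and then run the modifications with SimpleXML. They are backed by the same library, so it is easy to use one with the other.

Here is what it looks like: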

$my_folder = 'http://localhost/mytest/go/';
$blog_url = 'http://localhost/mytest';

$html = '<html><body>
<a href="http://localhost/mytest/">internal</a>
<a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>
<a href="http://cnn.com">external</a>
</body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html);

$sxe = simplexml_import_dom($dom);

// grab all <a> nodes with an href attribute
foreach ($sxe->xpath('//a[@href]') as $a)
{
    if (substr($a['href'], 0, strlen($blog_url)) === $blog_url
     && substr($a['href'], 0, strlen($my_folder)) !== $my_folder)
    {
        // skip all links that start with the URL in $blog_url, as long as they
        // don't start with the URL from $my_folder;
        continue;
    }

    if (empty($a['rel']))
    {
        $a['rel'] = 'nofollow';
    }
    else
    {
        $a['rel'] .= ' nofollow';
    }
}

$new_html = $dom->saveHTML();
echo $new_html;

As you can see, it is really short and simple. Depending on your needs, you may want to use preg_match() in place of the substr() comparisons, for example:

    // change the regexp to your own rules, here we match everything under
    // "http://localhost/mytest/" as long as it's not followed by "go"
    if (preg_match('#^http://localhost/mytest/(?!go)#', $a['href']))
    {
        continue;
    }

Note

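When I first read the question, I missed the last code block in the OP. The code I posted (and basically any DOM-based solution) is better suited to processing a whole page than a block of HTML. Otherwise, DOM will try to "fix" your HTML and may add a <body> tag, a DOCTYPE, and so on.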
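
One possible way around the fragment problem described in the note is to wrap the fragment in a container element, let DOM parse it, and then serialize only the container's children. The sketch below assumes the $blog_url and $my_folder values from the question; the function name nofollow_fragment and the wrapping <div> are made up for illustration, and character-encoding handling is left out, so it is only shown with ASCII input.

```php
<?php
// Hypothetical helper: add rel="nofollow" to the links of an HTML fragment
// without returning the DOCTYPE/html/body wrapper that DOM adds.
function nofollow_fragment(string $fragment, string $blog_url, string $my_folder): string
{
    $dom = new DOMDocument();

    // Silence parser warnings about imperfect markup, then restore the setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML('<div>' . $fragment . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $a) {
        $href     = $a->getAttribute('href');
        $internal = strpos($href, $blog_url) === 0;   // starts with the blog URL
        $cloaked  = strpos($href, $my_folder) === 0;  // starts with the cloaking folder

        // Leave anchors without href and plain internal links untouched.
        if ($href === '' || ($internal && !$cloaked)) {
            continue;
        }

        // Merge nofollow into any existing rel tokens instead of overwriting them.
        $rel = preg_split('/\s+/', $a->getAttribute('rel'), -1, PREG_SPLIT_NO_EMPTY);
        if (!in_array('nofollow', $rel, true)) {
            $rel[] = 'nofollow';
            $a->setAttribute('rel', implode(' ', $rel));
        }
    }

    // The wrapper is the first <div> in document order; emit only its children.
    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$fragment = '<a href="http://localhost/mytest/">internal</a>' . "\n"
    . '<a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>' . "\n"
    . '<a href="http://cnn.com">external</a>';

$result = nofollow_fragment($fragment, 'http://localhost/mytest', 'http://localhost/mytest/go/');
echo $result;
```

Serializing node by node with saveHTML($child) is what keeps the added <html> and <body> elements out of the result.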


0
votes
<?php

$str='<a href="http://localhost/mytest/">internal</a>
<a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>
<a href="http://cnn.com">external</a>';

function test($x){
  if (preg_match('@localhost/mytest/(?!go/)@i',$x[0])>0) return $x[0];
  return 'rel="nofollow" '.$x[0];
}

echo preg_replace_callback('/href=[\'"][^\'"]+/i', 'test', $str);

?>

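0
votes

Here is another solution that has a whitelist option and also adds a target="_blank" attribute. It also checks whether a rel attribute already exists before adding a new one.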

function Add_Nofollow_Attr($Content, $Whitelist = [], $Add_Target_Blank = true) 
{
    $Whitelist[] = $_SERVER['HTTP_HOST'];
    foreach ($Whitelist as $Key => $Link) 
    {
        $Host = preg_replace('#^https?://#', '', $Link);
        // quote for the '#' delimiter used when the pattern is matched below
        $Host = "https?://". preg_quote($Host, '#');
        $Whitelist[$Key] = $Host;
    }

    if(preg_match_all("/<a .*?>/", $Content, $matches, PREG_SET_ORDER)) 
    {
        foreach ($matches as $Anchor_Tag) 
        {
            $IS_Rel_Exist = $IS_Follow_Exist = $IS_Target_Blank_Exist = $Is_Valid_Tag =  false;
            if(preg_match_all("/(\w+)\s*=\s*['|\"](.*?)['|\"]/",$Anchor_Tag[0],$All_matches2)) 
            {
                foreach ($All_matches2[1] as $Key => $Attr_Name)
                {
                    if($Attr_Name == 'href')
                    {
                        $Is_Valid_Tag = true;
                        $Url = $All_matches2[2][$Key];
                        // bypass #.. or internal links like "/"
                        if(preg_match('/^\s*[#|\/].*/', $Url)) 
                        {
                            continue 2;
                        }

                        foreach ($Whitelist as $Link) 
                        {
                            if (preg_match("#$Link#", $Url)) {
                                continue 3;
                            }
                        }
                    }
                    else if($Attr_Name == 'rel')
                    {
                        $IS_Rel_Exist = true;
                        $Rel = $All_matches2[2][$Key];
                        preg_match("/[n|d]ofollow/", $Rel, $match, PREG_OFFSET_CAPTURE);
                        if( count($match) > 0 )
                        {
                            $IS_Follow_Exist = true;
                        }
                        else
                        {
                            $New_Rel = 'rel="'. $Rel . ' nofollow"';
                        }
                    }
                    else if($Attr_Name == 'target')
                    {
                        $IS_Target_Blank_Exist = true;
                    }
                }
            }

            // With PREG_SET_ORDER each match is an array; index 0 holds the tag string
            $New_Anchor_Tag = $Anchor_Tag[0];
            if(!$IS_Rel_Exist)
            {
                $New_Anchor_Tag = str_replace(">",' rel="nofollow">',$New_Anchor_Tag);
            }
            else if(!$IS_Follow_Exist)
            {
                $New_Anchor_Tag = preg_replace("/rel=[\"'].*?[\"']/",$New_Rel,$New_Anchor_Tag);
            }

            if($Add_Target_Blank && !$IS_Target_Blank_Exist)
            {
                $New_Anchor_Tag = str_replace(">",' target="_blank">',$New_Anchor_Tag);
            }

            $Content = str_replace($Anchor_Tag[0],$New_Anchor_Tag,$Content);
        }
    }
    return $Content;
}

To use it:

$Page_Content = '<a href="http://localhost/">internal</a>
                 <a href="http://yoursite.com">internal</a>
                 <a href="http://google.com">google</a>
                 <a href="http://example.com" rel="nofollow">example</a>
                 <a href="http://stackoverflow.com" rel="random">stackoverflow</a>';

$Whitelist = ["http://yoursite.com","http://localhost"];

echo Add_Nofollow_Attr($Page_Content,$Whitelist,true);

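0
votes

Thanks to @alex for the nice solution. However, I ran into a problem with Japanese text, which I fixed as shown below. In addition, this code can skip multiple domains using the $whiteList array.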

public function addRelNoFollow($html, $whiteList = [])
{
    $dom = new \DOMDocument();
    $dom->preserveWhiteSpace = false;
    $dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
    $a = $dom->getElementsByTagName('a');

    /** @var \DOMElement $anchor */
    foreach ($a as $anchor) {
        $href = $anchor->attributes->getNamedItem('href')->nodeValue;
        $domain = parse_url($href, PHP_URL_HOST);

        // Skip whiteList domains
        if (in_array($domain, $whiteList, true)) {
            continue;
        }

        // Check & get existing rel attribute values
        $noFollow = 'nofollow';
        $rel = $anchor->attributes->getNamedItem('rel');
        if ($rel) {
            $values = explode(' ', $rel->nodeValue);
            if (in_array($noFollow, $values, true)) {
                continue;
            }
            $values[] = $noFollow;
            // separator goes first; the reversed argument order was removed in PHP 8
            $newValue = implode(' ', $values);
        } else {
            $newValue = $noFollow;
        }

        // Create new rel attribute
        $rel = $dom->createAttribute('rel');
        $node = $dom->createTextNode($newValue);
        $rel->appendChild($node);
        $anchor->appendChild($rel);
    }

    // There is a problem with saveHTML() and saveXML(), both of them do not work correctly in Unix.
    // They do not save UTF-8 characters correctly when used in Unix, but they work in Windows.
    // So we need to do as follows. @see https://stackoverflow.com/a/20675396/1710782
    return $dom->saveHTML($dom->documentElement);
}
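
One thing to keep in mind with the whitelist check above: relative links such as /about have no host, so parse_url() returns null for them, the in_array() test fails, and they receive nofollow like external links. Below is a small sketch of a helper that treats host-less URLs as internal. The name is_whitelisted_href is hypothetical, and both the treat-as-internal rule and the lowercase comparison are assumptions, not part of the answer.

```php
<?php
// Hypothetical helper: decide whether a link should be left without nofollow.
// $whiteList is expected to hold lowercase host names, e.g. ['example.com'].
function is_whitelisted_href(string $href, array $whiteList): bool
{
    $host = parse_url($href, PHP_URL_HOST);

    // null: no host component (relative URL or "#anchor"); false: malformed URL.
    // Assumption: both are treated as internal and therefore skipped.
    if ($host === null || $host === false) {
        return true;
    }

    // Host names are case-insensitive, so normalize before comparing.
    return in_array(strtolower($host), $whiteList, true);
}

var_dump(is_whitelisted_href('/about', ['example.com']));
var_dump(is_whitelisted_href('http://Example.com/page', ['example.com']));
var_dump(is_whitelisted_href('http://cnn.com/news', ['example.com']));
```

Inside the loop, the call would take the place of the in_array($domain, $whiteList, true) condition.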