PHP strip_tags: regex to replace a single < with HTML entities

Question

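I use strip_tags to make sure every HTML tag is removed before a string is saved. The problem is that a single `<` without any closing tag is removed as well. My idea is to first replace every such `<` with the matching HTML entity `&#60;`.

I have a regex for that, but it only replaces the first occurrence. Any idea how to adjust it?

This is the regex I have right now: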

preg_replace("/<([^>]*(<|$))/", "&lt;$1", $string);

I want this:

<p> Hello < 30 </p> < < < <!-- Test --> &#60;> > > >

to first become, after `preg_replace(REGEX, REPLACE, $string)`:

<p> Hello &#60; 30 </p> &#60; &#60; &#60; <!-- Test --> &#60;> > > >

and then, after `strip_tags($string)`:

Hello &#60; 30  &#60; &#60; &#60;  &#60;> > > >

Any idea how to achieve this?
Maybe you even know a better approach.

Answer 1

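Your question is interesting, so I took the time to try to solve it. I think the only way is to do it in several steps:

The first step is to remove the HTML comments.
The next step is to match all HTML tags with a regular expression and rewrite them in another form, replacing `<` and `>` with other characters, such as `[[` and `]]`.
After that, you can replace `<` with `&lt;` and `>` with `&gt;`.
Then `[[tag attr="value"]]` and `[[/tag]]` are turned back into the original HTML tags `<tag attr="value">` and `</tag>`.
Finally, the HTML tags can be removed with `strip_tags()` or with a safer and more flexible library such as HTMLPurifier.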
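The steps above can be sketched compactly on the sample string from the question. This is only a minimal illustration under some assumptions: the helper name `escapeLoneBrackets()` is hypothetical, the tag pattern is deliberately simplified (it does not handle a `>` inside a quoted attribute value), and passing `false` as the fourth argument of `htmlspecialchars()` is a variation that keeps existing entities such as `&#60;` from being encoded a second time.

```php
<?php
// Hypothetical helper, not part of the answer's code: a condensed version of the steps.
function escapeLoneBrackets(string $html): string
{
    // 1. Remove HTML comments ("s" lets "." match newlines).
    $s = preg_replace('~<!--.*?-->~s', '', $html);

    // 2. Protect anything that looks like a tag: <tag ...> becomes [[tag ...]].
    //    Simplified pattern: "<", optional "/", a letter, then no "<" or ">" until ">".
    $s = preg_replace('~<(/?[a-z][^<>]*)>~i', '[[$1]]', $s);

    // 3. Escape the remaining "<" and ">". The last argument (double_encode = false)
    //    leaves entities that are already present untouched.
    $s = htmlspecialchars($s, ENT_NOQUOTES | ENT_HTML5, 'UTF-8', false);

    // 4. Turn the protected tags back into real tags.
    $s = preg_replace('~\[\[(.*?)\]\]~s', '<$1>', $s);

    // 5. Strip the real tags; the escaped brackets survive as entities.
    return strip_tags($s);
}

$result = escapeLoneBrackets('<p> Hello < 30 </p> < < < <!-- Test --> &#60;> > > >');
echo $result, "\n";
```

Note that any literal `[[...]]` already present in the input would also be turned into a tag in step 4, so a real implementation would need a placeholder that cannot occur in the text.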

PHP code

Note that I used Nowdoc strings so that the patterns and the sample HTML are easy to edit:

<?php

define('LINE_LENGTH', 60);

// A regular expression to find HTML tags.
define('REGEX_HTML_TAG', <<<'END_OF_REGEX'
~
<(?!!--)                # Opening of a tag, but not for HTML comments.
(?<tagcontent>          # Capture the text between the "<" and ">" chars.
  \s*/?\s*              # Optional spaces, optional slash (for closing tags).
  (?<tagname>[a-z][a-z0-9]*\b)  # The tag name (digits allowed after the first letter, e.g. h1).
  (?<attributes>        # The tag attributes. Handle a possible ">" in one of them.
    (?:
      (?<quote>["']).*?\k<quote>
      |                 # Or
      [^>]              # Any char not being ">"
    )*
  )
  \s*/?\s*              # For self-closing tags such as <img .../>.
)
>                       # Closing of a tag.
~isx
END_OF_REGEX);

// A regular expression to find the temporary [[bracketed]] tags.
define('REGEX_BRACKETED_TAG', <<<'END_OF_REGEX'
~
\[\[            # Opening of a bracketed tag.
(?<tagcontent>  # Capture the text between the brackets.
  .*?
)
\]\]            # Closing of a bracketed tag.
~xs
END_OF_REGEX);

$html = <<<'END_OF_HTML'
<p> Hello < 30 </p> < < < <!-- Test --> &#60;> > > >
<p><span class="icon icon-print">print</SPAN></p>

<LABEL for="firstname">First name:</LABEL>
<input required type="text" id="firstname" name="firstname" /><!-- with self-closing slash -->
<label for="age">Age:</label>
<INPut required type="number" id="age" name="age"><!-- without self-closing slash -->

Shit should not happen with malformed HTML --> <p id="paragraph-58">Isn't closed
Or something not opened </div>
Be carefull with ">" in tag attribute values (seems to be allowed unescaped):
<input type="password" pattern="(?!.*[><]).{8,}" name="password">
<abbr data-symbol=">" title="Greater than">gt</abbr>

<nav class="floating-nav">
  <ul class="menu">
    <li><a href="/">Home</a></li>
    <li><a href="/contact">Contact</a></li>
  </ul>
</nav>

Test with spaces: <
textarea id="text"
         name="text"
         class="decorative"
>Some text< / textarea>
END_OF_HTML;

/**
 * Just to print a title or a step of the operations.
 *
 * @param string $text The text to print.
 * @param bool $is_a_step If set to false then no step counter will be printed
 *                        and incremented.
 * @return void
 */
function printTitle($text, $is_a_step = true) {
    static $counter = 1;
    if ($is_a_step) {
        print "\n\nSTEP $counter : $text\n";
        $counter++;
    } else {
        print "\n\n$text\n";
    }
    print str_repeat('=', LINE_LENGTH) . "\n\n";
}

printTitle('Input HTML:', false);
print $html;

printTitle('Strip out HTML comments');
$output = preg_replace('/<!--.*?-->/s', '', $html); // "s" lets "." match newlines, so multi-line comments are removed too.
print $output;

printTitle('replace all HTML tags by [[tag]]');
// The replacement string of preg_replace() cannot reference named groups, but preg_replace_callback() can, so we'll use $1.
$output = preg_replace(REGEX_HTML_TAG, '[[$1]]', $output);
print $output;

printTitle('replace all < and > by &lt; and &gt;');
$output = htmlspecialchars($output, ENT_HTML5); // ENT_HTML5 alone (without ENT_QUOTES or ENT_COMPAT) leaves single and double quotes untouched.
print $output;

printTitle('replace back [[tag]] by <tag>');
$output = preg_replace(REGEX_BRACKETED_TAG, '<$1>', $output);
print $output;

printTitle('Strip the HTML tags with strip_tags()');
$output = strip_tags($output);
print $output;

// It seems that the crappy strip_tags() doesn't always do its job!
// So let's check whether any HTML tags are left.
printTitle('Check what strip_tags() did');
if (preg_match_all(REGEX_HTML_TAG, $output, $matches, PREG_SET_ORDER)) {
    print "Oups! strip_tags() didn't clean up everything!\n";
    print "Found " . count($matches) . " occurrences:\n";
    foreach ($matches as $i => $match) {
        $indented_match = preg_replace('/\r?\n/', '$0  ', $match[0]);
        print '- match ' . ($i + 1) . " : $indented_match\n";
    }

    print "\n\nLet's try to do it ourselves, by replacing the matched tags by nothing.\n\n";
    $output = preg_replace(REGEX_HTML_TAG, '', $output);
    print $output;
}
else {
    print "Ok, no tag found.\n";
}

You can run it here: https://onlinephp.io/c/17665
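The comment in the code mentions that `preg_replace_callback()` gives access to named groups. A small sketch of what that could look like is shown here; the pattern is a simplified, hypothetical one (group names `slash`, `tagname` and `rest` are chosen for illustration), and lower-casing the tag name is just an example of something a callback can do that a plain replacement string cannot.

```php
<?php
// Simplified pattern for illustration only; it does not handle ">" inside attribute values.
$pattern = '~<(?<slash>/?)(?<tagname>[a-z][a-z0-9]*)\b(?<rest>[^<>]*)>~i';

$bracketed = preg_replace_callback(
    $pattern,
    function (array $m): string {
        // Named groups are available as string keys of the match array.
        return '[[' . $m['slash'] . strtolower($m['tagname']) . $m['rest'] . ']]';
    },
    '<P class="a">Hi</P> 1 < 2'
);

echo $bracketed, "\n"; // [[p class="a"]]Hi[[/p]] 1 < 2
```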

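For the regular expressions, I used `~` instead of the usual `/` to delimit the pattern and the flags. This is only so that a slash can be used in the pattern without escaping it.

I also used the `x` flag (extended notation), so that I can add comments to the pattern and write it over several lines.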
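To make the effect of the delimiter and of the `x` flag concrete, here is a small comparison. The URL and the variable names are only placeholders for the example.

```php
<?php
$url = 'https://example.com/path/to/page';

// With "/" as the delimiter, every slash inside the pattern has to be escaped.
$withSlashes = preg_match('/^https:\/\/[^\/]+\/path\//', $url);

// With "~" as the delimiter, slashes can be written as-is. The "x" flag makes
// the engine ignore whitespace and "#" comments inside the pattern.
$withTilde = preg_match('~
    ^https://   # Scheme.
    [^/]+       # Host name.
    /path/      # First path segment.
~x', $url);

var_dump($withSlashes === 1, $withTilde === 1);
```

Both calls match the same thing; the second form is simply easier to read and to annotate.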

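For readability and flexibility, I also used named capture groups, such as `(?<quote>)`, so that we do not depend on indexes, which could shift if other capture groups are added. The backreference is written `\k<quote>` instead of the indexed version `\4`.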
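As a small standalone illustration of a named group with a named backreference (the group name `value` and the sample string are made up for this example):

```php
<?php
// The closing quote must be the same character that the "quote" group captured,
// so an apostrophe inside a double-quoted value does not end the match.
$pattern = '~(?<quote>["\'])(?<value>.*?)\k<quote>~';

preg_match($pattern, 'title="It\'s > here"', $m);

echo $m['quote'], "\n"; // "
echo $m['value'], "\n"; // It's > here
```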

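HTML5 seems to be rather permissive, as the `>` character can apparently be put in an attribute value without replacing it with `&gt;`. I assume this was not allowed in the past and is now "OK" to help users write a readable `pattern` attribute on `<input>` fields. I added the example of a password field that does not let you use the `<` and `>` characters. It is there to show how to handle this case in the regular expression, by accepting attributes with single- or double-quoted values.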
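The difference this makes can be seen by comparing a naive tag pattern with a quote-aware one on one of the sample lines. Both patterns below are reduced versions written for this illustration, not the full `REGEX_HTML_TAG`.

```php
<?php
$html = '<abbr data-symbol=">" title="Greater than">gt</abbr>';

// Naive pattern: stops at the first ">", which here sits inside an attribute value.
preg_match('~<[^>]*>~', $html, $naive);

// Quote-aware pattern: a quoted value is consumed as a whole,
// so a ">" inside it does not end the tag.
preg_match('~<(?:(?<quote>["\']).*?\k<quote>|[^>])*>~s', $html, $aware);

echo $naive[0], "\n"; // <abbr data-symbol=">
echo $aware[0], "\n"; // <abbr data-symbol=">" title="Greater than">
```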

Output:



Input HTML:
============================================================

<p> Hello < 30 </p> < < < <!-- Test --> &#60;> > > >
<p><span class="icon icon-print">print</SPAN></p>

<LABEL for="firstname">First name:</LABEL>
<input required type="text" id="firstname" name="firstname" /><!-- with self-closing slash -->
<label for="age">Age:</label>
<INPut required type="number" id="age" name="age"><!-- without self-closing slash -->

Shit should not happen with malformed HTML --> <p id="paragraph-58">Isn't closed
Or something not opened </div>
Be carefull with ">" in tag attribute values (seems to be allowed unescaped):
<input type="password" pattern="(?!.*[><]).{8,}" name="password">
<abbr data-symbol=">" title="Greater than">gt</abbr>

<nav class="floating-nav">
  <ul class="menu">
    <li><a href="/">Home</a></li>
    <li><a href="/contact">Contact</a></li>
  </ul>
</nav>

Test with spaces: <
textarea id="text"
         name="text"
         class="decorative"
>Some text< / textarea>

STEP 1 : Strip out HTML comments
============================================================

<p> Hello < 30 </p> < < <  &#60;> > > >
<p><span class="icon icon-print">print</SPAN></p>

<LABEL for="firstname">First name:</LABEL>
<input required type="text" id="firstname" name="firstname" />
<label for="age">Age:</label>
<INPut required type="number" id="age" name="age">

Shit should not happen with malformed HTML --> <p id="paragraph-58">Isn't closed
Or something not opened </div>
Be carefull with ">" in tag attribute values (seems to be allowed unescaped):
<input type="password" pattern="(?!.*[><]).{8,}" name="password">
<abbr data-symbol=">" title="Greater than">gt</abbr>

<nav class="floating-nav">
  <ul class="menu">
    <li><a href="/">Home</a></li>
    <li><a href="/contact">Contact</a></li>
  </ul>
</nav>

Test with spaces: <
textarea id="text"
         name="text"
         class="decorative"
>Some text< / textarea>

STEP 2 : replace all HTML tags by [[tag]]
============================================================

[[p]] Hello < 30 [[/p]] < < <  &#60;> > > >
[[p]][[span class="icon icon-print"]]print[[/SPAN]][[/p]]

[[LABEL for="firstname"]]First name:[[/LABEL]]
[[input required type="text" id="firstname" name="firstname" /]]
[[label for="age"]]Age:[[/label]]
[[INPut required type="number" id="age" name="age"]]

Shit should not happen with malformed HTML --> [[p id="paragraph-58"]]Isn't closed
Or something not opened [[/div]]
Be carefull with ">" in tag attribute values (seems to be allowed unescaped):
[[input type="password" pattern="(?!.*[><]).{8,}" name="password"]]
[[abbr data-symbol=">" title="Greater than"]]gt[[/abbr]]

[[nav class="floating-nav"]]
  [[ul class="menu"]]
    [[li]][[a href="/"]]Home[[/a]][[/li]]
    [[li]][[a href="/contact"]]Contact[[/a]][[/li]]
  [[/ul]]
[[/nav]]

Test with spaces: [[
textarea id="text"
         name="text"
         class="decorative"
]]Some text[[ / textarea]]

STEP 3 : replace all < and > by &lt; and &gt;
============================================================

[[p]] Hello &lt; 30 [[/p]] &lt; &lt; &lt;  &amp;#60;&gt; &gt; &gt; &gt;
[[p]][[span class="icon icon-print"]]print[[/SPAN]][[/p]]

[[LABEL for="firstname"]]First name:[[/LABEL]]
[[input required type="text" id="firstname" name="firstname" /]]
[[label for="age"]]Age:[[/label]]
[[INPut required type="number" id="age" name="age"]]

Shit should not happen with malformed HTML --&gt; [[p id="paragraph-58"]]Isn't closed
Or something not opened [[/div]]
Be carefull with "&gt;" in tag attribute values (seems to be allowed unescaped):
[[input type="password" pattern="(?!.*[&gt;&lt;]).{8,}" name="password"]]
[[abbr data-symbol="&gt;" title="Greater than"]]gt[[/abbr]]

[[nav class="floating-nav"]]
  [[ul class="menu"]]
    [[li]][[a href="/"]]Home[[/a]][[/li]]
    [[li]][[a href="/contact"]]Contact[[/a]][[/li]]
  [[/ul]]
[[/nav]]

Test with spaces: [[
textarea id="text"
         name="text"
         class="decorative"
]]Some text[[ / textarea]]

STEP 4 : replace back [[tag]] by <tag>
============================================================

<p> Hello &lt; 30 </p> &lt; &lt; &lt;  &amp;#60;&gt; &gt; &gt; &gt;
<p><span class="icon icon-print">print</SPAN></p>

<LABEL for="firstname">First name:</LABEL>
<input required type="text" id="firstname" name="firstname" />
<label for="age">Age:</label>
<INPut required type="number" id="age" name="age">

Shit should not happen with malformed HTML --&gt; <p id="paragraph-58">Isn't closed
Or something not opened </div>
Be carefull with "&gt;" in tag attribute values (seems to be allowed unescaped):
<input type="password" pattern="(?!.*[&gt;&lt;]).{8,}" name="password">
<abbr data-symbol="&gt;" title="Greater than">gt</abbr>

<nav class="floating-nav">
  <ul class="menu">
    <li><a href="/">Home</a></li>
    <li><a href="/contact">Contact</a></li>
  </ul>
</nav>

Test with spaces: <
textarea id="text"
         name="text"
         class="decorative"
>Some text< / textarea>

STEP 5 : Strip the HTML tags with strip_tags()
============================================================

 Hello &lt; 30  &lt; &lt; &lt;  &amp;#60;&gt; &gt; &gt; &gt;
print

First name:

Age:


Shit should not happen with malformed HTML --&gt; Isn't closed
Or something not opened 
Be carefull with "&gt;" in tag attribute values (seems to be allowed unescaped):

gt


  
    Home
    Contact
  


Test with spaces: <
textarea id="text"
         name="text"
         class="decorative"
>Some text< / textarea>

STEP 6 : Check what strip_tags() did
============================================================

Oups! strip_tags() didn't clean up everything!
Found 2 occurrences:
- match 1 : <
  textarea id="text"
           name="text"
           class="decorative"
  >
- match 2 : < / textarea>


Let's try to do it ourselves, by replacing the matched tags by nothing.

 Hello &lt; 30  &lt; &lt; &lt;  &amp;#60;&gt; &gt; &gt; &gt;
print

First name:

Age:


Shit should not happen with malformed HTML --&gt; Isn't closed
Or something not opened 
Be carefull with "&gt;" in tag attribute values (seems to be allowed unescaped):

gt


  
    Home
    Contact
  


Test with spaces: Some text

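As you can see, `strip_tags()` does not handle whitespace around the tag name, which I find completely unsafe! This is why I would suggest using a library such as HTMLPurifier or a DOM parser.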
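The behaviour seen in the last steps of the output can be reproduced in isolation with a short test string (the string is made up for this example): a `<` that is directly followed by whitespace is left in place by `strip_tags()`.

```php
<?php
$input = '<b>bold</b> and < b >not stripped< / b >';

// Only the tags written without a space after "<" are removed.
$stripped = strip_tags($input);

echo $stripped, "\n"; // bold and < b >not stripped< / b >
```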
