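I'm trying to create a bad word filter in PHP that searches a text, matches it against a set of known bad words, and then replaces every character of each bad word (except the first letter) with an asterisk.
Example:
fook
would become f***
shoot
would become s****
The only part I can't figure out is how to keep the first letter of the string, and how to replace the remaining letters with another character while keeping the same string length.
My code isn't suitable because it always replaces the whole word with exactly 3 asterisks.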
$string = preg_replace("/\b(". $word .")\b/i", "***", $string);
$string = 'fook would become';
$word = 'fook';
$string = preg_replace("~\b". preg_quote($word, '~') ."\b~i", $word[0] . str_repeat('*', strlen($word) - 1), $string);
var_dump($string);
This can be done in many ways, including with some very odd auto-generated regexes... but I believe using preg_replace_callback() would end up being more robust.
<?php
# as already pointed out, your words *may* need sanitization
foreach ($words as $k => $v) {
    $words[$k] = preg_quote($v, '/');
}
# and to be collapsed into a **big regexpy goodness**
$words=implode('|',$words);
# after that, a single preg_replace_callback() would do
$string = preg_replace_callback('/\b('. $words .')\b/i', "my_beloved_callback", $string);
function my_beloved_callback($m)
{
    $len = strlen($m[1]) - 1;
    return $m[1][0] . str_repeat('*', $len);
}
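For anyone who wants to run the callback approach end to end, here is a minimal self-contained sketch. The word list and the input sentence are made up for illustration, and an anonymous function stands in for the named callback. Note that strlen() counts bytes, so this assumes the bad words are plain ASCII.

```php
<?php
// Hypothetical sample data; not from the original question.
$words  = ['fook', 'shoot'];
$string = 'Fook this, I will SHOOT some hoops while shooting a film.';

// Escape each word for use inside a '/'-delimited pattern, then join with '|'.
$alternation = implode('|', array_map(function ($w) {
    return preg_quote($w, '/');
}, $words));

// Keep the first matched character (preserving its case) and mask the rest.
$masked = preg_replace_callback(
    '/\b(' . $alternation . ')\b/i',
    function ($m) {
        return $m[1][0] . str_repeat('*', strlen($m[1]) - 1);
    },
    $string
);

echo $masked, PHP_EOL;
// F*** this, I will S**** some hoops while shooting a film.
```

Because the first character is taken from the matched text rather than from the word list, the original capitalisation survives, and "shooting" is left alone thanks to the trailing \b.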
$string = preg_replace("/\b(".$word[0].')'.substr($word, 1)."\b/i", '$1'.str_repeat('*', strlen($word) - 1), $string);
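Assuming your blacklist of bad words consists entirely of letters, or at least of word characters (digits and underscores are fine), you don't need to call preg_quote() before imploding the list and inserting it into the regex pattern.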
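To illustrate when that assumption stops holding, here is a small sketch with a made-up blacklist entry that contains a regex metacharacter. The entry 'f.ok', the sample text and the '[x]' replacement are all hypothetical.

```php
<?php
// Hypothetical entry containing a regex metacharacter (the dot).
$word = 'f.ok';
$text = 'fook and f.ok';

// Unquoted: "." matches any character, so "fook" is caught as well.
$unquoted = preg_replace('/\b' . $word . '\b/i', '[x]', $text);

// Quoted: preg_quote() escapes the dot, so only the literal "f.ok" matches.
$quoted = preg_replace('/\b' . preg_quote($word, '/') . '\b/i', '[x]', $text);

echo $unquoted, PHP_EOL; // [x] and [x]
echo $quoted, PHP_EOL;   // fook and [x]
```

So skipping preg_quote() is only safe as long as every entry really is made of word characters.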
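After the first letter of a qualifying word has been matched, the \G metacharacter is used to continue matching from where the previous match ended. Each subsequently matched letter of the bad word is replaced one-for-one with an asterisk.
\K is used to forget/release the first letter of the bad word.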
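This approach avoids calling preg_replace_callback() to measure every matched string and write N asterisks after the first letter of each matched bad word in a block of text.
Breakdown: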
/ #start of pattern delimiter
(?: #non-capturing group to encapsulate logic
\b #position separating word character and non-word character
(?= #start lookahead -- to match without consuming letters
(?:fook|shoot) #OR-delimited bad words
\b #position separating word character and non-word character
) #end lookahead
\w #first word character of bad word
\K #forget first matched word character
| #OR -- to set up \G technique
\G(?!^) #continue matching from previous match but not from the start of the string
) #end of non-capturing group
\w #match non-first letter of bad word
/ #ending pattern delimiter
i #make pattern case-insensitive
Code: (Demo)
$bad = ['fook', 'shoot'];
$pattern = '/(?:\b(?=(?:' . implode('|', $bad) . ')\b)\w\K|\G(?!^))\w/i';
echo preg_replace($pattern, '*', 'Holy fook n shoot, Batman; The Joker\'s shooting The Riddler!');
// Holy f*** n s****, Batman; The Joker's shooting The Riddler!
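If it helps to see what the pattern actually consumes, the following sketch lists each single-character match with its byte offset using preg_match_all() and PREG_OFFSET_CAPTURE. The input string is made up and includes a near-miss ("shooting"); the list destructuring in foreach assumes PHP 7.1 or later.

```php
<?php
$bad = ['fook', 'shoot'];
$pattern = '/(?:\b(?=(?:' . implode('|', $bad) . ')\b)\w\K|\G(?!^))\w/i';

// Hypothetical input chosen to include a near-miss ("shooting").
$text = 'fook shooting shoot';

preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);

// Each entry of $matches[0] is [matched character, byte offset].
$trace = [];
foreach ($matches[0] as [$char, $offset]) {
    $trace[] = $char . '@' . $offset;
}
echo implode(' ', $trace), PHP_EOL;
// o@1 o@2 k@3 h@15 o@16 o@17 t@18
```

The first letters (offsets 0 and 14) never appear in the results because \K drops them from the reported match, and nothing inside "shooting" is matched because the lookahead requires a word boundary right after "shoot".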
Here is a Unicode-friendly regex for PHP. It should give you the idea.
function do_something_except_first_letter($s) {
    // \B\w matches every word character that is not at the start of a word,
    // so the first letter of each word is skipped and the rest go to the callback.
    // This keeps the first letter even for words inside quotes and brackets.
    // An alternative regex is '/(?<!^|\s|\W)(\w)/u'.
    return preg_replace_callback('/(\B\w)/u', function($m) {
        // do what you need...
        // for example, lowercase all characters except the first letter
        return mb_strtolower($m[1]);
    }, $s);
}
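Applied back to the original masking problem, a multibyte-aware variant might look like the sketch below. The helper name mask_word_multibyte, the word list and the input are hypothetical; it assumes the mbstring extension is loaded and that both the source file and the input are UTF-8.

```php
<?php
// Hypothetical helper and word list, for illustration only.
function mask_word_multibyte(array $m): string
{
    // Work in characters rather than bytes so a multibyte first letter stays intact.
    $first = mb_substr($m[0], 0, 1, 'UTF-8');
    $rest  = mb_strlen($m[0], 'UTF-8') - 1;
    return $first . str_repeat('*', $rest);
}

$bad = ['šoot'];
$alternation = implode('|', array_map(function ($w) {
    return preg_quote($w, '/');
}, $bad));

// The "u" modifier makes the pattern and subject be treated as UTF-8.
$result = preg_replace_callback(
    '/\b(?:' . $alternation . ')\b/iu',
    'mask_word_multibyte',
    'šoot first, ask later'
);
echo $result, PHP_EOL; // š*** first, ask later
```

With strlen() and $m[0][0] instead, the two-byte "š" would be cut in half and one asterisk too many would be written.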