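I'd like to use preg_split() with its PREG_SPLIT_OFFSET_CAPTURE option to capture both the words and the index at which each one starts in the original string. However, my string contains multibyte characters, which throws off the counts. There doesn't seem to be an mb_ equivalent. What are my options?
Example: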
$text = "Hello world — goodbye";
$words = preg_split("/(\w+)/x",
$text,
-1,
PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_OFFSET_CAPTURE);
foreach($words as $word) {
print("$word[0]: $word[1]<br>");
}
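Output:
Hello: 0
 : 5
world: 6
 — : 11
goodbye: 16
Because the dash is an em dash rather than a standard hyphen, it is a multibyte character, so the offset for "goodbye" is reported as 16 instead of 14.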
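To make the discrepancy concrete, here is a minimal sketch comparing byte counts with character counts for the same string. It assumes the source file and the string are UTF-8 encoded and that the mbstring extension is loaded.

```php
<?php
$text = "Hello world — goodbye";

// The em dash (U+2014) takes three bytes in UTF-8 but is one character.
$dashBytes = strlen("—");
$dashChars = mb_strlen("—", "UTF-8");

// strpos() counts bytes (like PREG_SPLIT_OFFSET_CAPTURE); mb_strpos() counts characters.
$byteOffset = strpos($text, "goodbye");
$charOffset = mb_strpos($text, "goodbye", 0, "UTF-8");

printf(
    "dash: %d bytes, %d character; goodbye: byte offset %d, character offset %d\n",
    $dashBytes,
    $dashChars,
    $byteOffset,
    $charOffset
);
```

The two-position gap between the offsets is exactly the two extra bytes the em dash occupies.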
This is a hack, but it seems to work. Use str_replace() to replace the multibyte character with a single-byte one, then run preg_split() on the resulting string.
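$text = 'Hello world — goodbye';
$mb = '—';
$rplmnt = "X";
function chkPlc($text, $mb, $rplmnt){
// Swap the multibyte character for a single-byte one so that
// byte offsets and character offsets line up.
$rpl = str_replace($mb, $rplmnt, $text);
$stmt = '';
$words = preg_split("/(\w+)/x",
$rpl,
-1,
PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_OFFSET_CAPTURE);
foreach($words as $word) {
// Collect the output instead of printing it inside the loop.
$stmt .= "$word[0]: $word[1]<br>";
}
$stmt .= 'New string with the mb char replaced by a non-mb char: '.$rpl.'<br>';
return $stmt;
}
print(chkPlc($text, $mb, $rplmnt));
Output (because "X" is itself a word character, it is split off as its own token):
Hello: 0
 : 5
world: 6
 : 11
X: 12
 : 13
goodbye: 14
New string with the mb char replaced by a non-mb char: Hello world X goodbye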
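A more thorough function could first find a single-byte character that does not occur in the string, and then use it as the replacement for the multibyte character. Again, it's a bit of a hack, but it does work.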
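A rough sketch of that idea follows. The helper name find_unused_placeholder() and its candidate list are hypothetical, chosen only for illustration. It replaces every non-ASCII character, not only one known character, so it assumes you are happy for all of them to act as word separators.

```php
<?php
/**
 * Hypothetical helper: returns the first candidate character that does not
 * occur in $text, or null if every candidate is already in use.
 */
function find_unused_placeholder(string $text, string $candidates = '#%&@~^|=+'): ?string
{
    foreach (str_split($candidates) as $candidate) {
        if (strpos($text, $candidate) === false) {
            return $candidate;
        }
    }
    return null;
}

$text = 'Hello world — goodbye';
$placeholder = find_unused_placeholder($text);
if ($placeholder === null) {
    throw new RuntimeException('No unused placeholder character available.');
}

// Replace each non-ASCII character with the one-byte placeholder, so that
// byte offsets in $flat equal character offsets in $text.
$flat = preg_replace('/[^\x00-\x7F]/u', $placeholder, $text);
echo $flat, "\n";
```

The candidates are all non-word characters, so a pattern such as /(\w+)/ still treats the placeholder as a delimiter. The catch is that an accented letter inside a word (the "é" in "Héllo") would also be replaced and would split that word in two.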
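Here's another less-than-ideal solution: use mb_convert_encoding() to convert the text to something like ISO-8859-1, which gets rid of the multibyte characters. Characters that exist in the target encoding become single bytes, and anything else becomes a question mark.
So, convert $text like this before using preg_split() on it: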
$text = mb_convert_encoding($text, "ISO-8859-1", "UTF-8");
Result:
Hello: 0
 : 5
world: 6
 ? : 11
goodbye: 14
Although this mangles the text, you can of course keep a copy of the original.
I found this via a comment about the iconv() function.
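One way to keep the original text intact is to use the converted copy only for measuring, and then read the words back out of the original with mb_substr(). The following is a sketch under that assumption rather than a tested general solution. It relies on every character of the original turning into exactly one byte in the converted copy.

```php
<?php
$original = "Hello world — goodbye";

// Single-byte copy used only for measuring; the em dash becomes "?".
$flat = mb_convert_encoding($original, "ISO-8859-1", "UTF-8");

preg_match_all('/\w+/', $flat, $matches, PREG_OFFSET_CAPTURE);

$words = [];
foreach ($matches[0] as [$flatWord, $offset]) {
    // In the single-byte copy, byte offsets and lengths are also character
    // offsets and lengths, so they can index into the original string.
    $words[] = [mb_substr($original, $offset, strlen($flatWord), "UTF-8"), $offset];
}

print_r($words);
```

Note that without the /u modifier, whether \w matches accented Latin-1 bytes (such as a converted "é") can depend on the locale, so words containing such letters may still be split unexpectedly.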
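More than a year later I revisited this problem and came up with a function that does a better job. The upside is that it handles multibyte strings without having to throw away the multibyte characters. The downside is that it can't use a regular expression the way preg_split() does.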
/**
* Splits a piece of text into individual words and the words' position within
* the text.
*
* @param string $text The text to split.
* @return array Each element is an array, of the word and its 0-based position.
*/
function split_offset_capture($text) {
$words = array();
// We split into words based on these characters:
$non_word_chars = array(
" ", "-", "–", "—", ".", ",", ";" ,":", "(", ")", "/",
"\\", "?", "!", "*", "'", "’", "\n", "\r", "\t",
);
// To keep track within the loop:
$word_started = FALSE;
$current_word = "";
$current_word_position = 0;
$characters = mb_str_split($text);
foreach($characters as $i => $letter) {
if ( ! in_array($letter, $non_word_chars)) {
// A character in a word.
if ( ! $word_started) {
// We're starting a brand new word.
if ($current_word != "") {
// Save the previous, now complete, word's info.
$words[] = array($current_word, $current_word_position);
}
$current_word_position = $i;
$word_started = TRUE;
$current_word = "";
}
$current_word .= $letter;
} else {
$word_started = FALSE;
}
};
// Add on the final word.
if ($current_word != "") {
$words[] = array($current_word, $current_word_position);
}
return $words;
}
Doing this:
$text = "Héllo world — goodbye";
$words = split_offset_capture($text);
ends up with $words containing the following:
array(
array("Héllo", 0),
array("world", 6),
array("goodbye", 14),
);
You may need to add more characters to $non_word_chars.
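For real-world text, one awkward thing is handling punctuation that immediately follows a word (e.g. Russ' or Russ’), or punctuation within a word (e.g. Bob's, Bob’s or new-found). To deal with this I came up with this modified function, which has three arrays of characters to look for. So it may do more than preg_split() but, again, it doesn't use regular expressions: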
/**
* Splits a piece of text into individual words and the words' position within
* the text.
*
* @param string $text The text to split.
* @return array Each element is an array, of the word and its 0-based position.
*/
function split_offset_capture_2($text) {
$words = array();
// We split into words based on these characters:
$non_word_chars = array(
" ", "-", "–", "—", ".", ",", ";" ,":", "(", ")", "/",
"\\", "?", "!", "*", "'", "’", "\n", "\r", "\t"
);
// EXCEPT, these characters are allowed to be WITHIN a word:
// e.g. "up-end", "Bob's", "O'Brien"
$in_word_chars = array("-", "'", "’");
// AND, these characters are allowed to END a word:
// e.g. "Russ'"
$end_word_chars = array("'", "’");
// To keep track within the loop:
$word_started = FALSE;
$current_word = "";
$current_word_position = 0;
$characters = mb_str_split($text);
foreach($characters as $i => $letter) {
if ( ! in_array($letter, $non_word_chars)
||
(
// It's a non-word-char that's allowed within a word:
// both neighbours must exist and be word characters.
in_array($letter, $in_word_chars)
&&
$i > 0 && ! in_array($characters[$i-1], $non_word_chars)
&&
isset($characters[$i+1]) && ! in_array($characters[$i+1], $non_word_chars)
)
||
(
// It's a non-word-char that's allowed at the end of a word:
// there must be a word character directly before it.
in_array($letter, $end_word_chars)
&&
$i > 0 && ! in_array($characters[$i-1], $non_word_chars)
)
) {
// A character in a word.
if ( ! $word_started) {
// We're starting a brand new word.
if ($current_word != "") {
// Save the previous, now complete, word's info.
$words[] = array($current_word, $current_word_position);
}
$current_word_position = $i;
$word_started = TRUE;
$current_word = "";
}
$current_word .= $letter;
} else {
$word_started = FALSE;
}
};
// Add on the final word.
if ($current_word != "") {
$words[] = array($current_word, $current_word_position);
}
return $words;
}
So if we have:
$text = "Héllo Bob's and Russ’ new-found folks — goodbye";
then the first function, split_offset_capture(), gives us:
array(
array("Héllo", 0),
array("Bob", 6),
array("s", 10),
array("and", 12),
array("Russ", 16),
array("new", 22),
array("found", 26),
array("folks", 32),
array("goodbye", 40),
);
and the second function, split_offset_capture_2(), gives us:
array(
array("Héllo", 0),
array("Bob's", 6),
array("and", 12),
array("Russ’", 16),
array("new-found", 22),
array("folks", 32),
array("goodbye", 40),
);
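Use preg_match_all() and maintain a running multibyte character count as you isolate each successive word.
In my regex pattern, I define a "word" as a run of consecutive characters made up of letters, backticks, single quotes, and hyphens. All other characters are treated as non-word characters. You can adjust these definitions if and when you need to.
Code: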
$text = "Héllo Bob's and Russ’ new-found folks — goodbye";
var_export(
array_reduce(
preg_match_all("~([^\p{L}`'-]*)([\p{L}`'-]+)~u", $text, $m, PREG_SET_ORDER) ? $m : [],
function($result, $m) {
static $last = 0;
$last += mb_strlen($m[1]);
$result[] = [$m[2] => $last];
$last += mb_strlen($m[2]);
return $result;
},
[]
)
);
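As a variation on the same idea, the byte offsets reported by PREG_OFFSET_CAPTURE can be converted to character offsets after the fact, which avoids the static counter. This is a sketch rather than the code above rewritten. The function name words_with_char_offsets() is made up for illustration, and it returns [word, offset] pairs instead of word => offset maps.

```php
<?php
/**
 * Hypothetical helper: finds words with a Unicode-aware pattern and converts
 * each byte offset reported by PCRE into a character offset.
 */
function words_with_char_offsets(string $text): array
{
    $result = [];
    preg_match_all("~[\p{L}`'-]+~u", $text, $matches, PREG_OFFSET_CAPTURE);
    foreach ($matches[0] as [$word, $byteOffset]) {
        // Count the characters in everything that precedes the match.
        $charOffset = mb_strlen(substr($text, 0, $byteOffset), 'UTF-8');
        $result[] = [$word, $charOffset];
    }
    return $result;
}

var_export(words_with_char_offsets("Hello world — goodbye"));
```

Re-measuring the prefix for every match is quadratic in the worst case, so for long texts the running count used above is likely the better choice.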