Here is my simplified test code:
<!DOCTYPE html>
<?php
//uncommenting the next line results in the whole page displaying in "chinese -simplified"
//header("content-type: text/html; charset=UTF-16");
header('Content-language: he');
?>
<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=UTF-16">
<meta http-equiv="content-language" content="he-il">
</head>
<body>
<?php
// in Production, we are grabbing the hebrew word from the database
//$sql = "SELECT masoretic FROM codex WHERE id = 20"; // just grabs a word from the database
// it is stored using UTF16_general_ci on mySQL
// in this test we can mock the exact same data that was copy and pasted in
// the results were the same with the data from the db
$masoretic = "בָּרָ֣א";
echo $masoretic . '<br>'; // displays correctly in HEBREW = בָּרָ֣א
// now loop through the word and process each letter
$length = strlen($masoretic);
// even though there are only 3 real letters, the diacritic marks count as characters, so we should get at least 7 loops
for ($x = 0; $x <= $length; $x++) {
$letter = substr($masoretic,0,1); // process this letter
$masoretic = substr($masoretic, 1); // the rest of the word
$name = '';
$recognized = false;
switch($letter){
case 'ר':
$recognized = true;
$name = 'Raysh';
break;
case 'א':
$recognized = true;
$name = 'Aleph';
break;
default:
$recognized = false;
break;
}
if($recognized){
echo ('found a ' . $name);
echo $letter; // for now just display it
}else{
echo 'unrecognized letter:';
print_r($letter);
echo '<br>';
}
}
?>
</body>
The page displays the following:
בָּרָ֣א
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:
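I find it odd that the complete Hebrew word displays fine, but each individual letter does not. I assumed something strange was going on with UTF-16, so I added the header, but in some cases that actually made things worse (see the inline comments).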
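In UTF-8, every character outside the ASCII range (basic English letters, digits, and punctuation) is encoded as more than one byte. strlen() and substr() work on bytes, not characters, so your loop cuts each two-byte Hebrew character in half and prints invalid fragments. You need to use multibyte-aware string functions instead, such as mb_str_split() (available since PHP 7.4):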
$in = 'בָּרָ֣א';
foreach(mb_str_split($in) as $glyph) {
var_dump($glyph);
}
Output:
string(2) "ב"
string(2) "ָ"
string(2) "ּ"
string(2) "ר"
string(2) "ָ"
string(2) "֣"
string(2) "א"
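Applied to the loop from the question, this could look roughly like the sketch below. It assumes the source file and the string are UTF-8 encoded and that PHP 7.4 or later with the mbstring extension is available. The `$names` table and the `describe_letters()` helper are hypothetical names introduced for illustration; the two letter names are taken from the question's switch statement.

```php
<?php
// Hypothetical lookup table replacing the switch from the question.
$names = [
    'ר' => 'Raysh',
    'א' => 'Aleph',
];

// Hypothetical helper: returns one description line per code point.
function describe_letters(string $word, array $names): array
{
    $lines = [];
    // mb_str_split() splits by character (code point), not by byte,
    // so each element is a complete, valid UTF-8 sequence.
    foreach (mb_str_split($word, 1, 'UTF-8') as $char) {
        if (isset($names[$char])) {
            $lines[] = 'found a ' . $names[$char] . ' ' . $char;
        } else {
            $lines[] = 'unrecognized letter: ' . $char;
        }
    }
    return $lines;
}

$masoretic = 'בָּרָ֣א';

// strlen() counts bytes, mb_strlen() counts characters.
echo strlen($masoretic), ' bytes, ', mb_strlen($masoretic, 'UTF-8'), " characters<br>\n";
echo implode("<br>\n", describe_letters($masoretic, $names)), "\n";
```

The diacritic marks are still returned as separate elements, as in the output above, so they fall into the "unrecognized" branch unless they are added to the table or skipped explicitly.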