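I'm having trouble with string encoding when feeding text to preg_match. My source page is in windows-1250.
I'm using DOMDocument and XPath to get this text: "Nabídkový katalog na aukci dne 8. června 2014"
I have tried many approaches and cannot get this text into a form that works as input for preg_match. I want to extract the date from this string, but preg_match does not work properly with the multibyte characters.
The best result I got is: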
echo 'htmlentities: ' . htmlentities($string) . "<br>\n";
// htmlentities: Nabídkový katalog na aukci dne 8. ÄŤervna 2014
I think one character's encoding is broken, but it displays correctly when viewed in a web browser.
I could do this:
$a = array('ÄŤ' => 'č');
But I need a general solution.
Here is my test code:
$url = 'http://numismatika.cz/web/Nummus-Praha/U064_Katalog/obsah.htm';
// <meta http-equiv="Content-Type" content="text/html; charset=windows-1250" />
$xpath = '/html/body/table//tr/td[2]/p[2]/font';
$html = @file_get_contents($url, FALSE);
if ($html === FALSE) {
throw new Exception($url);
}
$domDocument = new \DOMDocument();
if (!@$domDocument->loadHTML($html)) {
throw new Exception($url);
}
$domXpath = new \DOMXPath($domDocument);
$element = $domXpath->query($xpath);
if (isset($element->item(0)->textContent)) {
$string = $element->item(0)->textContent;
// Expected output: "Nabídkový katalog na aukci dne 8. června 2014"
echo 'Original: ' . $string . "<br>\n";
// Original: NabĂdkovĂ˝ katalog na aukci dne 8. ÄŤervna 2014
echo 'utf8_decode: ' . utf8_decode($string) . "<br>\n";
// utf8_decode: Nabídkový katalog na aukci dne 8. ?ervna 2014
echo 'iconv (windows-1250 to ISO-8859-1): ' . iconv('windows-1250', 'ISO-8859-1', $string) . "<br>\n";
// Notice: iconv(): Detected an illegal character in input string in C:\xampp\htdocs\testDateExtract.php on line 37
// iconv (windows-1250 to ISO-8859-1):
echo 'iconv (windows-1250 to ISO-8859-1//TRANSLIT): ' . iconv('windows-1250', 'ISO-8859-1//TRANSLIT', $string) . "<br>\n";
// iconv (windows-1250 to ISO-8859-1//TRANSLIT): NabAdkovA" katalog na aukci dne 8. ÄTervna 2014
echo 'iconv (windows-1250 to ISO-8859-1//IGNORE): ' . iconv('windows-1250', 'ISO-8859-1//IGNORE', $string) . "<br>\n";
// iconv (windows-1250 to ISO-8859-1//IGNORE): Nabdkov katalog na aukci dne 8. Äervna 2014
echo 'iconv (windows-1250 to ISO-8859-1//TRANSLIT//IGNORE): ' . iconv('windows-1250', 'ISO-8859-1//IGNORE', $string) . "<br>\n";
// iconv (windows-1250 to ISO-8859-1//TRANSLIT//IGNORE): Nabdkov katalog na aukci dne 8. Äervna 2014
echo 'iconv (windows-1250 to ASCII): ' . iconv('windows-1250', 'ASCII', $string) . "<br>\n";
// Notice: iconv(): Detected an illegal character in input string in C:\xampp\htdocs\testDateExtract.php on line 56
// iconv (windows-1250 to ASCII):
echo 'iconv (windows-1250 to ASCII//TRANSLIT): ' . iconv('windows-1250', 'ASCII//TRANSLIT', $string) . "<br>\n";
// iconv (windows-1250 to ASCII//TRANSLIT): NabA-dkovA" katalog na aukci dne 8. "ATervna 2014
echo 'iconv (windows-1250 to ASCII//IGNORE): ' . iconv('windows-1250', 'ASCII//IGNORE', $string) . "<br>\n";
// iconv (windows-1250 to ASCII//IGNORE): Nabdkov katalog na aukci dne 8. ervna 2014
echo 'iconv (windows-1250 to ASCII//TRANSLIT//IGNORE): ' . iconv('windows-1250', 'ASCII//IGNORE', $string) . "<br>\n";
// iconv (windows-1250 to ASCII//TRANSLIT//IGNORE): Nabdkov katalog na aukci dne 8. ervna 2014
echo 'iconv("UTF-8", "ASCII//TRANSLIT", utf8_encode($string)): ' . iconv('UTF-8', 'ASCII//TRANSLIT', utf8_encode($string)) . "<br>\n";
// Notice: iconv(): Detected an illegal character in input string in C:\xampp\htdocs\testDateExtract.php on line 71
// iconv("UTF-8", "ASCII//TRANSLIT", utf8_encode($string)):
echo 'htmlentities: ' . htmlentities($string) . "<br>\n";
// htmlentities: Nabídkový katalog na aukci dne 8. ÄŤervna 2014
echo 'htmlspecialchars_decode(htmlentities($string), ENT_COMPAT): ' . htmlspecialchars_decode(htmlentities($string), ENT_COMPAT) . "<br>\n";
// htmlspecialchars_decode(htmlentities($string), ENT_COMPAT): Nabídkový katalog na aukci dne 8. ÄŤervna 2014
$regex = '([0-9]{1,2})[.]?[\s]*([0-9a-zúřěčáí]+)[.]?[\s]*([0-9]{4})';
echo 'preg_match /([0-9]{1,2})[.]?[\s]*([0-9a-zúřěčáí]+)[.]?[\s]*([0-9]{4})/: ' . preg_match('/' . $regex . '/', $string, $matches) . "<br>\n";
// preg_match /([0-9]{1,2})[.]?[\s]*([0-9a-zúřěčáí]+)[.]?[\s]*([0-9]{4})/: 0
$regex = '([\p{N}]{1,2})[.]?[\S]*([\p{N}\p{L}]+)[.]?[\S]*([\p{N}]{4})';
echo 'preg_match /([\p{N}]{1,2})[.]?[\S]*([\p{N}\p{L}]+)[.]?[\S]*([\p{N}]{4})/u: ' . preg_match('/' . $regex . '/u', $string, $matches, PREG_OFFSET_CAPTURE) . "<br>\n";
// preg_match /([\p{N}]{1,2})[.]?[\S]*([\p{N}\p{L}]+)[.]?[\S]*([\p{N}]{4})/u: 0
}
I tried your code and printed $string in the browser, and it displayed correctly. My guess is that file_get_contents breaks the UTF-8 encoding.
You might want to try this:
// Tell PHP that we're using UTF-8 strings until the end of the script
mb_internal_encoding('UTF-8');
// Tell PHP that we'll be outputting UTF-8 to the browser
mb_http_output('UTF-8');
$string = $element->item(0)->textContent;
$string = mb_convert_encoding($string, 'HTML-ENTITIES', "UTF-8");
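Since the goal in the question is to extract the date, here is a minimal sketch of a Unicode-aware match. It assumes $string is valid UTF-8, which is what DOMDocument's textContent normally returns. The helper name extractDateParts is hypothetical and is not part of the original code.

```php
<?php
// Hypothetical helper (not part of the original code): returns [day, month, year]
// from a UTF-8 string, or null when no date-like fragment is found.
function extractDateParts($text)
{
    // \s* allows optional whitespace between the parts; the /u modifier makes
    // PCRE treat pattern and subject as UTF-8, so \p{L}+ matches "června" whole.
    $pattern = '/(\p{N}{1,2})\.?\s*(\p{L}+)\.?\s*(\p{N}{4})/u';
    if (preg_match($pattern, $text, $matches) !== 1) {
        return null; // no match, or the subject was not valid UTF-8
    }
    return array($matches[1], $matches[2], $matches[3]);
}

// "Nabídkový katalog na aukci dne 8. června 2014", written with byte escapes
// so the example does not depend on the encoding of this source file.
$string = "Nab\xC3\xADdkov\xC3\xBD katalog na aukci dne 8. \xC4\x8Dervna 2014";
$parts = extractDateParts($string);
if ($parts !== null) {
    echo implode(' | ', $parts), "\n";
}
```

Note the lowercase \s here. The second pattern in the question uses an uppercase \S, which matches non-whitespace, so it cannot get past the space after "8." even with the /u modifier.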
If you want to output the result to the browser, you should set the correct header:
header('Content-Type: text/html; charset=UTF-8');
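To illustrate why the declared charset matters, the sketch below reproduces the "ÄŤ" from the question: it is what the two UTF-8 bytes of "č" look like when a viewer decodes them as windows-1250. This suggests, though it does not prove, that the string itself may already be valid UTF-8 and only the display charset is wrong. The helper name misreadAsWindows1250 is hypothetical, and the iconv and mbstring extensions are assumed to be available.

```php
<?php
// Hypothetical helper: shows how UTF-8 bytes look when a viewer wrongly
// decodes them as windows-1250. The result is returned as UTF-8 for display.
function misreadAsWindows1250($utf8Text)
{
    return iconv('windows-1250', 'UTF-8', $utf8Text);
}

$c = "\xC4\x8D"; // "č" encoded as UTF-8 (two bytes: 0xC4 0x8D)

// The bytes are valid UTF-8 even though they can look "broken" on screen.
var_dump(mb_check_encoding($c, 'UTF-8'));

// In windows-1250, 0xC4 is "Ä" and 0x8D is "Ť", so this prints "ÄŤ".
echo misreadAsWindows1250($c), "\n";
```

If this check returns true for the real $string, sending the UTF-8 header above should be enough for the browser to render it correctly.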