PHP: string extracted from a web page has broken encoding

Problem description

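I'm having trouble converting the encoding of a string before passing it to preg_match. The source page is in windows-1250.

I'm using DOMDocument and XPath to get this text: "Nabídkový katalog na aukci dne 8. června 2014"

I've tried many approaches and can't get this output in a form I can feed into preg_match. I want to extract the date from this string, but preg_match doesn't work properly with multibyte characters.

The best result so far is: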

echo 'htmlentities: ' . htmlentities($string) . "<br>\n";
// htmlentities: Nabídkový katalog na aukci dne 8. ÄŤervna 2014

I think the encoding of one character is broken, yet it displays correctly when viewed in a web browser.

I could do this:

$a = array('ÄŤ' => 'č');

But I need a general solution.

Here is my test code:

$url = 'http://numismatika.cz/web/Nummus-Praha/U064_Katalog/obsah.htm';
// <meta http-equiv="Content-Type" content="text/html; charset=windows-1250" />

$xpath = '/html/body/table//tr/td[2]/p[2]/font';

$html = @file_get_contents($url, FALSE);
if ($html === FALSE) {
    throw new Exception($url);
}

$domDocument = new \DOMDocument();
if (!@$domDocument->loadHTML($html)) {
    throw new Exception($url);
}

$domXpath = new \DOMXPath($domDocument);

$element = $domXpath->query($xpath);
if (isset($element->item(0)->textContent)) {

    $string = $element->item(0)->textContent;

    // Expected output: "Nabídkový katalog na aukci dne 8. června 2014"

    echo 'Original: ' . $string . "<br>\n";
    // Original: NabĂ­dkovĂ˝ katalog na aukci dne 8. ÄŤervna 2014

    echo 'utf8_decode: ' . utf8_decode($string) . "<br>\n";
    // utf8_decode: Nabídkový katalog na aukci dne 8. ?ervna 2014

    echo 'iconv (windows-1250 to ISO-8859-1): ' . iconv('windows-1250', 'ISO-8859-1', $string) . "<br>\n";
    // Notice: iconv(): Detected an illegal character in input string in C:\xampp\htdocs\testDateExtract.php on line 37
    // iconv (windows-1250 to ISO-8859-1):

    echo 'iconv (windows-1250 to ISO-8859-1//TRANSLIT): ' . iconv('windows-1250', 'ISO-8859-1//TRANSLIT', $string) . "<br>\n";
    // iconv (windows-1250 to ISO-8859-1//TRANSLIT): NabA­dkovA" katalog na aukci dne 8. ÄTervna 2014

    echo 'iconv (windows-1250 to ISO-8859-1//IGNORE): ' . iconv('windows-1250', 'ISO-8859-1//IGNORE', $string) . "<br>\n";
    // iconv (windows-1250 to ISO-8859-1//IGNORE): Nab­dkov katalog na aukci dne 8. Äervna 2014

    echo 'iconv (windows-1250 to ISO-8859-1//TRANSLIT//IGNORE): ' . iconv('windows-1250', 'ISO-8859-1//IGNORE', $string) . "<br>\n";
    // iconv (windows-1250 to ISO-8859-1//TRANSLIT//IGNORE): Nab­dkov katalog na aukci dne 8. Äervna 2014

    echo 'iconv (windows-1250 to ASCII): ' . iconv('windows-1250', 'ASCII', $string) . "<br>\n";
    // Notice: iconv(): Detected an illegal character in input string in C:\xampp\htdocs\testDateExtract.php on line 56
    // iconv (windows-1250 to ASCII):

    echo 'iconv (windows-1250 to ASCII//TRANSLIT): ' . iconv('windows-1250', 'ASCII//TRANSLIT', $string) . "<br>\n";
    // iconv (windows-1250 to ASCII//TRANSLIT): NabA-dkovA" katalog na aukci dne 8. "ATervna 2014

    echo 'iconv (windows-1250 to ASCII//IGNORE): ' . iconv('windows-1250', 'ASCII//IGNORE', $string) . "<br>\n";
    // iconv (windows-1250 to ASCII//IGNORE): Nabdkov katalog na aukci dne 8. ervna 2014

    echo 'iconv (windows-1250 to ASCII//TRANSLIT//IGNORE): ' . iconv('windows-1250', 'ASCII//IGNORE', $string) . "<br>\n";
    // iconv (windows-1250 to ASCII//TRANSLIT//IGNORE): Nabdkov katalog na aukci dne 8. ervna 2014

    echo 'iconv("UTF-8", "ASCII//TRANSLIT", utf8_encode($string)): ' . iconv('UTF-8', 'ASCII//TRANSLIT', utf8_encode($string)) . "<br>\n";
    // Notice: iconv(): Detected an illegal character in input string in C:\xampp\htdocs\testDateExtract.php on line 71
    // iconv("UTF-8", "ASCII//TRANSLIT", utf8_encode($string)):

    echo 'htmlentities: ' . htmlentities($string) . "<br>\n";
    // htmlentities: Nabídkový katalog na aukci dne 8. ÄŤervna 2014

    echo 'htmlspecialchars_decode(htmlentities($string), ENT_COMPAT): ' . htmlspecialchars_decode(htmlentities($string), ENT_COMPAT) . "<br>\n";
    // htmlspecialchars_decode(htmlentities($string), ENT_COMPAT): Nabídkový katalog na aukci dne 8. ÄŤervna 2014

    $regex = '([0-9]{1,2})[.]?[\s]*([0-9a-zúřěčáí]+)[.]?[\s]*([0-9]{4})';
    echo 'preg_match /([0-9]{1,2})[.]?[\s]*([0-9a-zúřěčáí]+)[.]?[\s]*([0-9]{4})/: ' . preg_match('/' . $regex . '/', $string, $matches) . "<br>\n";
    // preg_match /([0-9]{1,2})[.]?[\s]*([0-9a-zúřěčáí]+)[.]?[\s]*([0-9]{4})/: 0

    $regex = '([\p{N}]{1,2})[.]?[\S]*([\p{N}\p{L}]+)[.]?[\S]*([\p{N}]{4})';
    echo 'preg_match /([\p{N}]{1,2})[.]?[\S]*([\p{N}\p{L}]+)[.]?[\S]*([\p{N}]{4})/u: ' . preg_match('/' . $regex . '/u', $string, $matches, PREG_OFFSET_CAPTURE) . "<br>\n";
    // preg_match /([\p{N}]{1,2})[.]?[\S]*([\p{N}\p{L}]+)[.]?[\S]*([\p{N}]{4})/u: 0
}
1 Answer

I tried your code and output $string in the browser, and it displays correctly. My guess is that file_get_contents breaks the UTF-8 encoding.
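One way to check where the garbling creeps in is to look at the bytes rather than at what the browser renders. The following is a minimal sketch, not the asker's code: the sample string is an assumption standing in for `$element->item(0)->textContent` (DOMDocument returns text content as UTF-8), and it is written with `\u{...}` escapes so the result does not depend on how the script file itself is saved.

```php
<?php
// Assumption: this stands in for $element->item(0)->textContent.
// "\u{...}" escapes need PHP 7 or later.
$string = "Nab\u{00ED}dkov\u{00FD} katalog na aukci dne 8. \u{010D}ervna 2014";

// 1. Are the bytes valid UTF-8?
$isUtf8 = mb_check_encoding($string, 'UTF-8');
var_dump($isUtf8); // bool(true)

// 2. Simulate a viewer that reads those UTF-8 bytes as windows-1250.
//    The result should resemble the "Original:" line in the question.
$misread = iconv('windows-1250', 'UTF-8', $string);
echo $misread, "\n";

// 3. The UTF-8 bytes of the letter U+010D, for comparison with a hex dump.
echo bin2hex("\u{010D}"), "\n"; // c48d
```

If step 2 reproduces the garbled text, the string itself is probably fine and only the charset the browser assumes for the output is off.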

You might want to try this:

// Tell PHP that we're using UTF-8 strings until the end of the script
mb_internal_encoding('UTF-8');

// Tell PHP that we'll be outputting UTF-8 to the browser
mb_http_output('UTF-8');


$string = $element->item(0)->textContent;
$string = mb_convert_encoding($string, 'HTML-ENTITIES', "UTF-8");

If you are sending the result to the browser, you should also set the correct header:

header('Content-Type: text/html; charset=UTF-8');
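With the string treated as UTF-8, the date can then be extracted using a Unicode-aware pattern. This is a sketch under assumptions: the sample string again stands in for `textContent`, and the pattern is an adaptation of the second one in the question, with `\s` (whitespace) where the question has `\S` and with the `/u` modifier so that `\p{L}` and `\p{N}` operate on UTF-8 characters rather than single bytes.

```php
<?php
// Assumption: sample value standing in for $element->item(0)->textContent.
$string = "Nab\u{00ED}dkov\u{00FD} katalog na aukci dne 8. \u{010D}ervna 2014";

// day, optional dot, whitespace, month (letters or digits), optional dot,
// whitespace, four-digit year; /u switches the pattern to UTF-8 mode.
$pattern = '/(\p{N}{1,2})\.?\s*([\p{N}\p{L}]+)\.?\s*(\p{N}{4})/u';

$matches = [];
if (preg_match($pattern, $string, $matches) === 1) {
    [, $day, $month, $year] = $matches;
    echo "day=$day month=$month year=$year\n";
} else {
    echo "no date found\n";
}
```

The month is kept as text here; mapping Czech month names to numbers would need a lookup table of its own.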