simple_html_dom does not fetch data from some websites

Question

simple_html_dom does not fetch data from some websites. For www.google.pl it downloads the page source, but for others, such as gearbest.com and stooq.pl, it does not download any data.

require('simple_html_dom.php');

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://www.google.com/"); // works

/*
curl_setopt($ch, CURLOPT_URL, "https://www.gearbest.com/"); // does not work
curl_setopt($ch, CURLOPT_URL, "https://stooq.pl/"); // does not work
*/

curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$response = curl_exec($ch);
curl_close($ch);

$html = new simple_html_dom();
$html->load($response);

echo $html;

What should I change in the code to receive data from these websites?

Answer 1

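The root problem here (at least on my computer; your setup may differ) is that the site returns gzip-compressed data, and it is not being decompressed properly by PHP and curl before it is passed to the DOM parser. If you are using PHP 5.4 or later, you can decompress it yourself: fetch the page with file_get_contents, then call gzdecode (or, as in the snippet below, strip the 10-byte gzip header and 8-byte trailer and call gzinflate).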
<?php
    // download the site
    $data = file_get_contents("http://www.tsetmc.com/loader.aspx?ParTree=151311&i=49776615757150035");
    // decompress it (a bit hacky to strip off the gzip header)
    $data = gzinflate(substr($data, 10, -8));
    include("simple_html_dom.php");
    // parse and use
    $html = str_get_html($data);
    echo $html->root->innertext();
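Note that this hack will not work on most sites. The main reason behind the problem seems to be that curl does not announce that it accepts gzip data, yet the web server on that domain ignores that header and gzips the response anyway. Then neither curl nor PHP actually checks the Content-Encoding header on the response; they assume it is not gzipped, so the data is passed through with no error and no call to gunzip. Both the server and the client are at fault here!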

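For a more robust solution, you could use curl to fetch the headers and check them yourself to decide whether the body needs decompressing. Or you could use this hack for this site only and the normal method for the others, to keep things simple.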
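
The header-checking approach could look roughly like the sketch below. It is only an illustration under a few assumptions: maybe_gunzip and fetch_html are hypothetical helper names (they are not part of simple_html_dom or curl), and the code assumes the zlib and curl extensions are loaded. The body is decompressed only when the Content-Encoding header says gzip or the data starts with the gzip magic bytes.

```php
<?php
// Hypothetical helper: decompress a response body only if it is really gzip.
function maybe_gunzip(string $body, string $contentEncoding = ''): string
{
    $declaredGzip = stripos($contentEncoding, 'gzip') !== false;
    // Every gzip stream starts with the magic bytes 0x1f 0x8b.
    $looksGzip = strncmp($body, "\x1f\x8b", 2) === 0;

    if ($declaredGzip || $looksGzip) {
        $decoded = @gzdecode($body); // returns false if the data is not valid gzip
        if ($decoded !== false) {
            return $decoded;
        }
    }
    return $body; // not gzip (or not decodable): hand it back unchanged
}

// Hypothetical helper: fetch a URL with curl, remember the Content-Encoding
// header of the final response, and return the decompressed body.
function fetch_html(string $url): string
{
    $encoding = '';
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_HEADERFUNCTION, function ($ch, $line) use (&$encoding) {
        if (strncmp($line, 'HTTP/', 5) === 0) {
            $encoding = ''; // a new response started (e.g. after a redirect)
        } elseif (stripos($line, 'Content-Encoding:') === 0) {
            $encoding = trim(substr($line, strlen('Content-Encoding:')));
        }
        return strlen($line); // curl expects the number of bytes handled
    });

    $body = curl_exec($ch);
    $error = curl_error($ch);
    curl_close($ch);

    if ($body === false) {
        throw new RuntimeException('curl request failed: ' . $error);
    }
    return maybe_gunzip($body, $encoding);
}
```

With these helpers, str_get_html(fetch_html($url)) would replace the manual curl calls. A simpler option that may be enough on its own is to set CURLOPT_ENCODING to an empty string, which makes curl advertise every encoding it supports and decode the response itself.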

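It may also help to set the character encoding on your output. Add this before echoing anything, so that the data you read is not interpreted with the wrong character set in the user's browser: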
header('Content-Type: text/html; charset=utf-8');
