PHP cURL and Simple HTML DOM

Question  Votes: 0  Answers: 1

Sorry, I only speak a little English.

I am using this code:

<?php

function file_get_contents_curl ( $url ) {

    $ch = curl_init ();

    curl_setopt ( $ch, CURLOPT_AUTOREFERER, TRUE );
    curl_setopt ( $ch, CURLOPT_HEADER, 0 );
    curl_setopt ( $ch, CURLOPT_RETURNTRANSFER, 1 );
    curl_setopt ( $ch, CURLOPT_URL, $url );
    curl_setopt ( $ch, CURLOPT_FOLLOWLOCATION, TRUE );
    curl_setopt ( $ch, CURLOPT_SSL_VERIFYPEER, 0 ); // skip certificate verification (insecure)
    curl_setopt ( $ch, CURLOPT_SSL_VERIFYHOST, 0 ); // skip host name verification (insecure)
    curl_setopt ( $ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; rv:71.0) Gecko/20100101 Firefox/71.0' ); // spoof

    $data = curl_exec ( $ch );

    curl_close ( $ch );

    return $data;

}

include ( __DIR__ . '/simplehtmldom_1_9_1/simple_html_dom.php' );

// 1. OK:     $url = 'https://www.p***hub.com/model/ashley-porner';
// 2. OK:     $url = 'https://www.p***hub.com/model/ashley-diamond-and-diamond-king';
// 3. NOT OK: $url = 'https://www.p***hub.com/model/ambercashh';
// 4. NOT OK: $url = 'https://www.p***hub.com/model/autumn-raine';

$html = file_get_contents_curl ( $url );
$html = str_get_html ( $html );

var_dump ( $html ); // boolean(false) if NOT OK

?>

URLs 1 and 2 work, but URLs 3 and 4 do not. Nothing is displayed and nothing is rendered; str_get_html returns false.
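One way to narrow this down is to look at what cURL actually returned before handing it to the parser. The sketch below assumes that str_get_html returns false when its input is empty or longer than MAX_FILE_SIZE (which is what the constant edited later in this question suggests). The helper name explain_parse_failure is hypothetical and not part of the library, and the demonstration uses a synthetic string rather than a real download.

```php
<?php
// Hypothetical helper: reports the likely reason str_get_html() would return false.
// $limit mirrors the default MAX_FILE_SIZE quoted in the question (600000).
function explain_parse_failure($data, int $limit = 600000): string
{
    if ($data === false) {
        // curl_exec() returns false on a transfer error
        return 'curl_exec failed (check curl_error on the handle before closing it)';
    }
    if ($data === '') {
        return 'empty response body';
    }
    $size = strlen($data);
    if ($size > $limit) {
        return sprintf('response is %d bytes, above the %d byte limit', $size, $limit);
    }
    return 'ok';
}

// Offline demonstration with synthetic input instead of a real page.
echo explain_parse_failure(str_repeat('a', 700000)), PHP_EOL;
```

In the script above this would be called as explain_parse_failure ( $html ) right after file_get_contents_curl ( $url ), before str_get_html runs.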

I tried changing MAX_FILE_SIZE from 600000 to 6000000 (in ~/simplehtmldom_1_9_1/simple_html_dom.php), but with the new value the page takes much longer to load and, worse, my site crashes:

// OLD: defined('MAX_FILE_SIZE') || define('MAX_FILE_SIZE', 600000);
defined('MAX_FILE_SIZE') || define('MAX_FILE_SIZE', 6000000); // NEW
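Because the library line is guarded with defined('MAX_FILE_SIZE') ||, the limit can probably be raised from the calling script without editing the library file. A minimal sketch, assuming the include path from the script above and that nothing else has defined the constant first:

```php
<?php
// Define the limit BEFORE the library is loaded; the library's own
// "defined('MAX_FILE_SIZE') || define(...)" line then leaves this value alone.
defined('MAX_FILE_SIZE') || define('MAX_FILE_SIZE', 6000000);

// Path taken from the question; the is_file() guard only keeps this sketch
// runnable when the library is not present.
$lib = __DIR__ . '/simplehtmldom_1_9_1/simple_html_dom.php';
if (is_file($lib)) {
    include_once $lib;
}

echo MAX_FILE_SIZE, PHP_EOL;
```

This only moves the setting; it does not change how much time or memory the parser needs for a large page.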

What is wrong?

Thanks.

php curl simple-html-dom php-curl
1 Answer
0 votes

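As a test, you can run the following. You will obviously need to edit the URL, but it shows reasonable performance, so the cause of the out-of-memory problem must lie in code that was not included in the question.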

<?php


    function file_get_contents_curl ( $url ) {
        $ch = curl_init ();
        curl_setopt ( $ch, CURLOPT_AUTOREFERER, TRUE );
        curl_setopt ( $ch, CURLOPT_HEADER, 0 );
        curl_setopt ( $ch, CURLOPT_RETURNTRANSFER, 1 );
        curl_setopt ( $ch, CURLOPT_URL, $url );
        curl_setopt ( $ch, CURLOPT_FOLLOWLOCATION, TRUE );
        curl_setopt ( $ch, CURLOPT_SSL_VERIFYPEER, 0 );
        curl_setopt ( $ch, CURLOPT_SSL_VERIFYHOST, 0 );
        curl_setopt ( $ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; rv:71.0) Gecko/20100101 Firefox/71.0' ); // spoof
        $data = curl_exec ( $ch );
        curl_close ( $ch );
        return $data;
    }


    $start=time();
    $memstart=memory_get_usage();


    $baseurl='https://www.*******.com/model/';
    $models=['ashley-porner','ashley-diamond-and-diamond-king','ambercashh','autumn-raine'];


    libxml_use_internal_errors( true );
    $dom=new DOMDocument;
    $dom->validateOnParse=false;
    $dom->recover=true;
    $dom->strictErrorChecking=false;


    /* do some expensive DOM operations to test performance */
    $query='//section[ @class="topProfileHeader" ]/div/div/div[ @class="content-columns" ]/div[ @class="infoPiece" ]';


    foreach( $models as $model ){
        $url = $baseurl . $model;
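        $res = file_get_contents_curl( $url );
        if( $res === false || $res === '' ) continue; /* skip failed requests: loadHTML cannot parse an empty string */

        $dom->loadHTML( $res );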
        $xp=new DOMXPath( $dom );
        libxml_clear_errors();

        $col=$xp->query( $query );
        if( $col->length > 0 ){
            foreach( $col as $node ) {
                echo str_repeat( '.', strlen( $node->nodeValue ) ) . '<br />';
            }
        }
    }

    $memory=memory_get_usage() - $memstart;
    printf(
        '<div style="padding:1rem; border:1px solid red;">Script took approx: %ss - consumed: %sMb, Peak memory consumption: %sMb</div>', 
        ( time() - $start ), 
        round( $memory / pow(1024,2), 2 ), 
        round( memory_get_peak_usage() / pow(1024,2), 2 )
    );

?>  
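To try the DOMDocument / DOMXPath part without any network access, here is a small offline sketch. The class names are copied from the XPath query above; the sample markup and its text content are made up for illustration, and microtime(true) is swapped in for time() so that sub-second runs are visible.

```php
<?php
// Made-up sample markup that matches the structure the XPath query expects.
$html = '<section class="topProfileHeader"><div><div>'
      . '<div class="content-columns">'
      . '<div class="infoPiece">Gender: n/a</div>'
      . '<div class="infoPiece">Joined: n/a</div>'
      . '</div></div></div></section>';

$start    = microtime(true);
$memstart = memory_get_usage();

// Collect libxml parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xp    = new DOMXPath($dom);
$query = '//section[@class="topProfileHeader"]/div/div/div[@class="content-columns"]/div[@class="infoPiece"]';
$nodes = $xp->query($query);

// Gather the text of every matching node.
$values = [];
foreach ($nodes as $node) {
    $values[] = trim($node->nodeValue);
}
print_r($values);

printf(
    "took %.4fs, used %.2f MB\n",
    microtime(true) - $start,
    (memory_get_usage() - $memstart) / 1048576
);
```

Swapping the sample string for a downloaded page gives a quick comparison against the Simple HTML DOM parser on the same input.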

The result...
