How do I extract image paths and their recommended new dimensions for automated image optimization?


I am writing a PHP script to scrape images and their corresponding size recommendations from https://gtmetrix.com/reports/example.com/a_unique_code.

Once the image paths and the recommended new heights and widths have been extracted, my images will be optimized programmatically.

Here is the relevant part of the HTML returned from the URL:

<tr class="rules-details" style="display: none">
    <td colspan="4">
        <a href="/serve-scaled-images.html" class="rule-help btn hover-tooltip" data-tooltip-interactive data-tooltip-max-width="450" title="&lt;h4&gt;Serve scaled images&lt;/h4&gt;&lt;p&gt;Serving appropriately-sized images can save many bytes of data and improve the performance of your webpage, especially on low-powered (eg. mobile) devices.&lt;/p&gt;&lt;p class=&quot;rule-help-tooltip-more&quot;&gt;&lt;a href=&quot;/serve-scaled-images.html&quot;&gt;Read more&lt;/a&gt;&lt;/p&gt;"><i class="sprite-question"></i><span class="resp-hidden">What's this mean?</span></a>
        <div>
            <p>The following images are resized in HTML or CSS. Serving scaled images could save 1.3MiB (45% reduction).
                <ul>
                    <li><a href="https://www.example.com/Pictures/thumbs/0029.jpg" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/0029.jpg</a> is resized in HTML or CSS from 300x623 to 123x200. Serving a scaled image could save 51.3KiB (86% reduction).</li>
                    <li><a href="https://www.example.com/Pictures/thumbs/0133.jpg" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/0133.jpg</a> is resized in HTML or CSS from 300x578 to 135x200. Serving a scaled image could save 44.0KiB (84% reduction).</li>
                    <li><a href="https://www.example.com/Pictures/thumbs/0075.jpg" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/0075.jpg</a> is resized in HTML or CSS from 300x390 to 176x200. Serving a scaled image could save 43.2KiB (69% reduction).</li>
                    <li><a href="https://www.example.com/Pictures/thumbs/0057.jpg" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/0057.jpg</a> is resized in HTML or CSS from 300x436 to 174x200. Serving a scaled image could save 35.0KiB (73% reduction).</li>
                    <li><a href="https://www.example.com/Pictures/thumb/thumb.png" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/thumb.png</a> is resized in HTML or CSS from 148x100 to 68x46. Serving a scaled image could save 31.4KiB (78% reduction).</li>
                    <li><a href="https://www.example.com/Pictures/thumb/thumb.png" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/thumb.png</a> is resized in HTML or CSS from 148x100 to 68x46. Serving a scaled image could save 30.9KiB (78% reduction).</li>
                    <li><a href="https://www.example.com/Pictures/thumb/thumb.png" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/thumb.png</a> is resized in HTML or CSS from 148x100 to 68x46. Serving a scaled image could save 30.7KiB (78% reduction).</li>
                    <li><a href="https://www.example.com/Pictures/thumb/thumb.png" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/thumb.png</a> is resized in HTML or CSS from 148x100 to 68x46. Serving a scaled image could save 30.7KiB (78% reduction).</li>
                    <li><a href="https://www.example.com/Pictures/thumbs/0093.jpg" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/0093.jpg</a> is resized in HTML or CSS from 300x458 to 138x200. Serving a scaled image could save 28.9KiB (79% reduction).</li>
                </ul>
            </p>
        </div>
    </td>
</tr>

After John Conde suggested using a DOM parser, this is my coding attempt:

$html = file_get_contents('https://gtmetrix.com/reports/example.com/a_unique_code');
$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXpath($document);
$stack = array();

$expression = './/tr[contains(concat(" ", normalize-space(@class), " "), " rules-details ")]';
foreach ($xpath->evaluate($expression) as $tr) 
{
    array_push($stack, $tr->nodeValue);
}
$i=0;
foreach ($stack as $string) 
{
    $search_string = $string;
    $find = 'reduction';
    $pos = strpos($search_string, $find);
    if($pos===false){}
    else
    {
        $string = str_replace("What's this mean?","",$string);
        $string = trim(preg_replace("/\s+/", " ", $string));
        $string_array = explode(').', $string);
        for($i=0;$i<sizeof($string_array);$i++)
        {
            $search_string = $string_array[$i];
            $find = 'The following images are resized in HTML or CSS.';
            $pos = strpos($search_string, $find);
            if($pos===false){}
            else
            {
                unset($string_array[$i]);
            }

            $find = "Optimize the following images to reduce their size by";
            $pos = strpos($search_string, $find);
            if($pos===false){}
            else
            {
                $current_index = $string_array[$i];
                $array_size = sizeof($string_array);

                for($j=$current_index;$j<$array_size;$j++)
                {
                    unset($string_array[$i]);
                }
            }

            echo '<pre>'.$string_array[$i];
        }
    }
}

The question is: given the following string, how do I extract the URL and the second set of image dimensions?

example.com/Pictures/thumbs/0093.jpg is resized in HTML or CSS from 300x458 to 138x200. Serving a scaled image could save 28.9KiB (79% reduction).

I need:

  • example.com/Pictures/thumbs/0093.jpg
  • 138x200
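For the plain string shown above (rather than the DOM), one pattern with two capture groups may be enough. This is only a sketch: it assumes the sentence always follows the "... is resized in HTML or CSS from WxH to WxH." wording seen in the sample, and the variable names are illustrative.

```php
<?php
// Sample sentence copied from the question.
$line = 'example.com/Pictures/thumbs/0093.jpg is resized in HTML or CSS from 300x458 to 138x200. '
      . 'Serving a scaled image could save 28.9KiB (79% reduction).';

// Assumed wording: "<url> is resized in HTML or CSS from <WxH> to <WxH>."
// Group 1 captures the URL (everything up to the first space),
// group 2 captures the second (recommended) dimensions.
$pattern = '~^(\S+) is resized in HTML or CSS from \d+x\d+ to (\d+x\d+)\.~';

if (preg_match($pattern, $line, $m) === 1) {
    $url     = $m[1];
    $newSize = $m[2];
    echo $url, PHP_EOL, $newSize, PHP_EOL;
} else {
    echo 'No match', PHP_EOL;
}
```

If the report wording changes, the pattern will simply stop matching, which is why the else branch is kept.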

I will refine this prototype script, but this is how I implemented John Conde's answer:

<?php

// #########################################
// AUTOMATED IMAGE OPTIMIZATION
// #########################################

class Image
{
    public $image_url;
    public $image_name;
    public $image_path;
    public $image_full_path;
    public $original_size;
    public $new_size;
}

$debugging = true;

if($debugging === true){echo '<ul class="Results" style="display:block; height:auto;">';}

try
{

    $HTML = file_get_contents('https://gtmetrix.com/reports/www.example.com/a_unique_code');// Get Webpage
    //var_dump($HTML);
    switch($HTML)
    {

        case false:
            if($debugging === true)
            {
                $error = error_get_last();
                echo '<li class="Error_Msg" style="display:block; height:auto;">';
                echo '<span><b>## FATAL ERROR - PROGRAM ABORTED ##</b></span>';
                echo '<span><b>Message:</b> Could not retrieve the HTML document</span>';
                echo '</li>';
                error_clear_last();
                exit;
            }
            break;

        default:// START OF WRAPPER

            $DOMdoc = new DOMDocument();// Object to store an HTML document
            libxml_use_internal_errors(true);// Collect libxml parse errors instead of emitting warnings
            $DOMdoc->loadHTML($HTML);// Parse the HTML
            $racks = (new DOMXPath($DOMdoc))->query('//tr/td/div//ul/li');// Select every <li> in the report's recommendation lists
            $images_info_array = array();// Array for storing image details objects
            $document_root = $_SERVER['DOCUMENT_ROOT'];// Define the document root

            foreach($racks as $rack)// Traverse over the HTML structure
            {
                // Define a pattern to search for
                $expression = '/https?:\/\/[^",]+ is resized in HTML or CSS from \d{1,4}x\d{1,4} to \d{1,4}x\d{1,4}\./';// Final dot escaped; digit counts match the extraction below
                if(preg_match($expression, $rack->nodeValue) === 1)// If the pattern is found then
                {
                    $url = $rack->firstChild->nodeValue;// Get the URL from the string
                    preg_match_all('/\d{1,4}x\d{1,4}/', $rack->nodeValue, $matches);// Get the image dimensions from the string
                    [$original_size, $new_size] = $matches[0];// First match is the original size, second is the recommended size

                    $url_parts = parse_url($url);// Break the URL up into sections
                    $directory_path = $url_parts['path'];// Get the directory path without the domain
                    $path_parts = pathinfo($directory_path);// Get information about a file path

                    $position = strpos($directory_path, '/');// Find the first / in the file path
                    if ($position !== false)// If found 
                    {

                        $new_directory_path = substr_replace($directory_path, "", $position, strlen('/'));// Remove the /

                        $image_info = new Image();// Create a new Image Object 
                        $image_info->image_url = $url;// Store the image URL
                        $image_info->image_name = basename($url);// Store just the image name
                        $image_info->image_path = $path_parts['dirname'];// Store image directory without domain & file name
                        $image_info->image_full_path = $new_directory_path;// Store the file path relative to the document root (no leading /)
                        $image_info->original_size = $original_size;// Store the original image size
                        $image_info->new_size = $new_size;// Store the new image size

                        array_push($images_info_array, $image_info);// Add the image information to an array

                    }else{
                        if($debugging === true)
                        {
                            $error = error_get_last();
                            echo '<li class="Warning_Msg">';
                            echo '<span><b>## WARNING - FILE PATH CHARACTER MISSING ##</b></span>';
                            echo '<span><b>Message:</b> / in the file path not found</span>';
                            echo '</li>';
                            error_clear_last();
                        }
                    }

                }else{// If the pattern is not found then
                    if($debugging === true)
                    {
                        $error = error_get_last();
                        echo '<li class="Error_Msg" style="display:block; height:auto;">';
                        echo '<span><b>## FATAL ERROR - PROGRAM ABORTED ##</b></span>';
                        echo '<span><b>Message:</b> Could not find the pattern required to extract the URL & size information</span>';
                        echo '</li>';
                        error_clear_last();
                        exit;
                    }
                }
            }

            //$command = 'ls /kunden/homepages/25/d828767522/htdocs/Pages/Who/Pictures/thumbs/ 2>&1';
            //$last_line = system($command, $return_value);

            foreach($images_info_array as $image_info)// Traverse the image info array
            {
                if(file_exists($document_root.'/'.$image_info->image_full_path))// Check if the image exists on disk (same absolute path the convert command uses)
                {
                    $temp_path = $document_root.$image_info->image_path.'/temp/';// Define a temporary folder location


                    if(!is_dir($temp_path))// Create the temporary folder only if it is missing
                    {
                        // Deleting an existing folder here would remove images resized earlier in this loop
                        // and leave convert without a folder to write to
                        mkdir($temp_path, 0777, true);
                    }

                    // Define the convert command for recommended optimization of the image
                    $command = 'convert -thumbnail '.escapeshellarg($image_info->new_size).' '.escapeshellarg($document_root.'/'.$image_info->image_full_path).' '.escapeshellarg($temp_path.$image_info->image_name).' 2>&1';// Arguments are shell-escaped because they come from scraped text
                    $last_line = system($command, $return_value);// Run the defined command

                    if($debugging === true)
                    {
                        if($return_value === 0)// An exit status of 0 means the command succeeded
                        {
                            echo '<li class="Normal_Message">';
                            echo '<span><b>MESSAGE - THE COMMAND COMPLETED SUCCESSFULLY</b></span>';
                            echo '<span><b>Command:</b> '.$command.'</span>';
                            echo '<span><b>Directory:</b> '.$image_info->image_full_path.'</span>';
                            echo '<span><b>Resized:</b> '.$image_info->new_size.'</span>';
                            echo '<span><b>Returned:</b> '.$return_value.'</span>';
                            echo '<span><b>Output:</b> '.$last_line.'</span>';
                            echo '</li>';
                        }
                        else// Any other exit status means the command failed
                        {
                            echo '<li class="Error_Msg" style="display:block; height:auto;">';
                            echo '<span><b>## ERROR - THE COMMAND DID NOT COMPLETE ##</b></span>';
                            echo '<span><b>Command:</b> '.$command.'</span>';
                            echo '<span><b>Returned:</b> '.$return_value.'</span>';
                            echo '<span><b>Output:</b> '.$last_line.'</span>';
                            echo '</li>';
                        }
                    }
                }
                else// If the file does not exist
                {
                    echo '<li class="Warning_Message" style="display:block; height:auto;">The file doesn\'t exist</li>';
                }

            }

            break;// END OF WRAPPER

    }


}
catch(Exception $Error_Message)
{
    echo $Error_Message;
}

echo '</ul>';

?>
php web-scraping data-extraction domparser image-optimization
2 answers
2
votes

This will parse that HTML and output the text you are looking for:

$html = '<tr class="rules-details" style="display: none">
    <td colspan="4">
        <a href="/serve-scaled-images.html" class="rule-help btn hover-tooltip" data-tooltip-interactive data-tooltip-max-width="450" title="&lt;h4&gt;Serve scaled images&lt;/h4&gt;&lt;p&gt;Serving appropriately-sized images can save many bytes of data and improve the performance of your webpage, especially on low-powered (eg. mobile) devices.&lt;/p&gt;&lt;p class=&quot;rule-help-tooltip-more&quot;&gt;&lt;a href=&quot;/serve-scaled-images.html&quot;&gt;Read more&lt;/a&gt;&lt;/p&gt;"><i class="sprite-question"></i><span class="resp-hidden">What\'s this mean?</span></a>
        <div>
            <p>The following images are resized in HTML or CSS. Serving scaled images could save 1.3MiB (45% reduction).
                <ul>
                    <li><a href="https://www.example.com/Pictures/thumbs/0029.jpg" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/0029.jpg</a> is resized in HTML or CSS from 300x623 to 123x200. Serving a scaled image could save 51.3KiB (86% reduction).</li>
                    <li><a href="https://www.example.com/Pictures/thumbs/0133.jpg" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/0133.jpg</a> is resized in HTML or CSS from 300x578 to 135x200. Serving a scaled image could save 44.0KiB (84% reduction).</li>
                    <li><a href="https://www.example.com/Pictures/thumbs/0075.jpg" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/0075.jpg</a> is resized in HTML or CSS from 300x390 to 176x200. Serving a scaled image could save 43.2KiB (69% reduction).</li>
                    <li><a href="https://www.example.com/Pictures/thumbs/0057.jpg" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/0057.jpg</a> is resized in HTML or CSS from 300x436 to 174x200. Serving a scaled image could save 35.0KiB (73% reduction).</li>
                    <li><a href="https://www.example.com/Pictures/thumb/thumb.png" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/thumb.png</a> is resized in HTML or CSS from 148x100 to 68x46. Serving a scaled image could save 31.4KiB (78% reduction).</li>
                    <li><a href="https://www.example.com/Pictures/thumb/thumb.png" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/thumb.png</a> is resized in HTML or CSS from 148x100 to 68x46. Serving a scaled image could save 30.9KiB (78% reduction).</li>
                    <li><a href="https://www.example.com/Pictures/thumb/thumb.png" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/thumb.png</a> is resized in HTML or CSS from 148x100 to 68x46. Serving a scaled image could save 30.7KiB (78% reduction).</li>
                    <li><a href="https://www.example.com/Pictures/thumb/thumb.png" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/thumb.png</a> is resized in HTML or CSS from 148x100 to 68x46. Serving a scaled image could save 30.7KiB (78% reduction).</li>
                    <li><a href="https://www.example.com/Pictures/thumbs/0093.jpg" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/0093.jpg</a> is resized in HTML or CSS from 300x458 to 138x200. Serving a scaled image could save 28.9KiB (79% reduction).</li>
                </ul>
            </p>
        </div>
    </td>
</tr>';

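$doc = new DOMDocument();
@$doc->loadHTML($html); // @ silences the parser warnings; the boolean result is not needed
$items = (new DOMXPath($doc))->query('//tr/td/div//ul/li');
foreach ($items as $item) {
    $url = $item->firstChild->nodeValue; // text of the <a>, the first child of each <li>
    // \d+ instead of \d{1,3}, so dimensions of 1000px or more are not truncated
    preg_match_all('/\d+x\d+/', $item->nodeValue, $matches);
    [$original, $resized] = $matches[0]; // first pair is the original size, second the recommended size
    printf('URL:%s Original: %s Resized: %s%s', $url, $original, $resized, PHP_EOL);
}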

Output

URL:https://www.example.com/Pictures/thumbs/0029.jpg Original: 300x623 Resized: 123x200
URL:https://www.example.com/Pictures/thumbs/0133.jpg Original: 300x578 Resized: 135x200
URL:https://www.example.com/Pictures/thumbs/0075.jpg Original: 300x390 Resized: 176x200
URL:https://www.example.com/Pictures/thumbs/0057.jpg Original: 300x436 Resized: 174x200
URL:https://www.example.com/Pictures/thumbs/thumb.png Original: 148x100 Resized: 68x46
URL:https://www.example.com/Pictures/thumbs/thumb.png Original: 148x100 Resized: 68x46
URL:https://www.example.com/Pictures/thumbs/thumb.png Original: 148x100 Resized: 68x46
URL:https://www.example.com/Pictures/thumbs/thumb.png Original: 148x100 Resized: 68x46
URL:https://www.example.com/Pictures/thumbs/0093.jpg Original: 300x458 Resized: 138x200
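If the width and height are needed as separate integers (for example to pass them to a resizing routine), the "WxH" string can be split afterwards. The helper below is a hypothetical addition, not part of the answer above; `split_size` is an illustrative name.

```php
<?php
// Hypothetical helper: split a "WIDTHxHEIGHT" string into two integers.
function split_size(string $size): ?array
{
    // When variables are passed by reference, sscanf returns the number of values it assigned.
    if (sscanf($size, '%dx%d', $width, $height) === 2) {
        return ['width' => $width, 'height' => $height];
    }
    return null; // the string did not look like WIDTHxHEIGHT
}

var_export(split_size('123x200'));
```

Returning null for malformed input lets the caller skip an image instead of resizing it to a bogus size.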

1
vote

I'll offer an approach that differs slightly from John's answer.

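Use XPath to target the required <a> tags and read their values, then take the text of each <a> tag's parent and use preg_match to isolate the dimension substring that follows the keyword to (\K resets the start of the full-string match, so no capture group is needed).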

Code:

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = [];
foreach ($xpath->query('//tr/td/div//ul/li/a') as $a) {
    $result[] = [
        $a->nodeValue,
        preg_match('~to \K\d+x\d+~', $a->parentNode->nodeValue, $m) ? $m[0] : ''
    ];
}
var_export($result);

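Note that I am suppressing the HTML errors generated by the <p> tag (via libxml_use_internal_errors(true)).

Why: see "Should ol/ul be inside <p> or outside?"

For that reason, the XPath expression steps past the p tag and goes straight to the ul inside it (div//ul).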
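To see what the parser complains about, the collected messages can be printed instead of discarded. This is a sketch on a reduced sample with the same <p><ul> nesting. The exact messages, and where the parser places the <ul>, depend on the libxml version, so nothing here asserts them.

```php
<?php
// Reduced sample (an assumption) with the same <p><ul> nesting as the report fragment.
$html = '<div><p>Intro text<ul><li>one</li><li>two</li></ul></p></div>';

libxml_use_internal_errors(true); // collect parser messages instead of emitting warnings
$dom = new DOMDocument();
$dom->loadHTML($html);

foreach (libxml_get_errors() as $error) {
    echo trim($error->message), PHP_EOL; // whatever libxml reported, if anything
}
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// The descendant axis (//) finds the list items wherever the parser placed the <ul>.
$count = $xpath->query('//div//ul/li')->length;
echo $count, PHP_EOL;
```

Using div//ul rather than div/p/ul keeps the query independent of how the parser repairs the invalid nesting.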

Output:

array (
  0 => 
  array (
    0 => 'https://www.example.com/Pictures/thumbs/0029.jpg',
    1 => '123x200',
  ),
  1 => 
  array (
    0 => 'https://www.example.com/Pictures/thumbs/0133.jpg',
    1 => '135x200',
  ),
  2 => 
  array (
    0 => 'https://www.example.com/Pictures/thumbs/0075.jpg',
    1 => '176x200',
  ),
  3 => 
  array (
    0 => 'https://www.example.com/Pictures/thumbs/0057.jpg',
    1 => '174x200',
  ),
  4 => 
  array (
    0 => 'https://www.example.com/Pictures/thumbs/thumb.png',
    1 => '68x46',
  ),
  5 => 
  array (
    0 => 'https://www.example.com/Pictures/thumbs/thumb.png',
    1 => '68x46',
  ),
  6 => 
  array (
    0 => 'https://www.example.com/Pictures/thumbs/thumb.png',
    1 => '68x46',
  ),
  7 => 
  array (
    0 => 'https://www.example.com/Pictures/thumbs/thumb.png',
    1 => '68x46',
  ),
  8 => 
  array (
    0 => 'https://www.example.com/Pictures/thumbs/0093.jpg',
    1 => '138x200',
  ),
)
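The output lists thumb.png four times, so an optimization step would resize the same file repeatedly. One possible way to collapse the duplicates is to key the results by URL. This is a sketch and assumes a given URL is always reported with the same target size; otherwise the last entry wins.

```php
<?php
// Shortened sample in the same shape as the $result array above.
$result = [
    ['https://www.example.com/Pictures/thumbs/0029.jpg', '123x200'],
    ['https://www.example.com/Pictures/thumbs/thumb.png', '68x46'],
    ['https://www.example.com/Pictures/thumbs/thumb.png', '68x46'],
];

$unique = [];
foreach ($result as [$url, $size]) {
    $unique[$url] = $size; // a repeated URL overwrites the same key
}
var_export($unique);
```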