I'm building a simple web spider using PHP's built-in cURL multi functions. It works well. Here is the basic implementation:
<?php
$remainingTargets = ...;
$concurrency = 30;
$multiHandle = curl_multi_init();
$targets = [];
while (count($targets) < $concurrency && count($remainingTargets) > 0) {
    $target = array_shift($remainingTargets);
    $alreadyChecked = ...;
    if ($alreadyChecked !== false) {
        continue;
    }
    $curl = curl_init($target);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36');
    curl_setopt($curl, CURLOPT_FAILONERROR, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 4);
    curl_setopt($curl, CURLOPT_TIMEOUT, 5);
    curl_multi_add_handle($multiHandle, $curl);
    $targets[$target] = $curl;
}
// Run loop for downloading
$running = null;
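do {
    curl_multi_exec($multiHandle, $running);
    // Block until there is socket activity instead of spinning the CPU
    if ($running) {
        curl_multi_select($multiHandle);
    }
} while ($running);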
// Harvest results
foreach ($targets as $target => $curl) {
    $html = curl_multi_getcontent($curl);
    curl_multi_remove_handle($multiHandle, $curl);
    // Process this page
}
curl_multi_close($multiHandle);
// If done show results, or continue processing queue...
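But I'm wondering: is it possible to do the harvesting inside the "run loop" here? I imagine that would free resources sooner and perform better. It seems I want something like a C-style select, but curl_multi_select doesn't return the specific resource that finished.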
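I know this is old, but I'm answering because I had the same problem:
The solution seems to be curl_multi_info_read. Each call returns an array describing one completed transfer (with the keys msg, result and handle), or false once no more messages are queued.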
$mh = curl_multi_init();
// Add CurlHandles to CurlMultiHandle
foreach ([
    'https://example.com',
    'https://example.net',
    'https://example.org',
] as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
}
do {
    // Run sub-connections
    curl_multi_exec($mh, $running);
    // Wait for activity on CurlMultiHandle
    curl_multi_select($mh);
    // Consume any completed transfers
    while ($curlMultiInfoRead = curl_multi_info_read($mh)) {
        // Check CurlHandle has not had an error
        if ($curlMultiInfoRead['result'] !== CURLE_OK) {
            throw new \RuntimeException(curl_error($curlMultiInfoRead['handle']));
        }
        // Get information on the request
        $curlGetInfo = curl_getinfo($curlMultiInfoRead['handle']);
        echo $curlGetInfo['http_code'].'<br>';
        echo $curlGetInfo['url'].'<br>';
        // Get contents of the request etc.
        $curlMultiGetContent = curl_multi_getcontent($curlMultiInfoRead['handle']);
        echo htmlentities(substr($curlMultiGetContent, 0, 100)).'<br>';
        // Remove this CurlHandle from the CurlMultiHandle first, then close it
        curl_multi_remove_handle($mh, $curlMultiInfoRead['handle']);
        curl_close($curlMultiInfoRead['handle']);
    }
} while ($running > 0);
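This is especially useful in combination with CURLMOPT_MAX_TOTAL_CONNECTIONS, which limits the total number of active connections, and with a Generator that yields each cURL response as soon as it completes.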
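As a rough sketch of that combination (not a definitive implementation), something like the following could work. The function name `fetchAll` and its parameters are made up for illustration, and I'm assuming PHP 7.0.7+ built against libcurl 7.30.0+, which is what `CURLMOPT_MAX_TOTAL_CONNECTIONS` needs. Failed transfers are yielded as `null` here rather than thrown, which is just one possible choice.

```php
<?php
/**
 * Hypothetical helper: fetch many URLs and yield each body as soon as
 * its transfer completes, instead of waiting for the whole batch.
 *
 * @param string[] $urls           URLs to fetch
 * @param int      $maxConnections cap on simultaneously open connections
 * @return Generator URL => body string, or null if the transfer failed
 */
function fetchAll(array $urls, int $maxConnections = 10): Generator
{
    $mh = curl_multi_init();
    // Let libcurl queue transfers beyond this limit by itself
    curl_multi_setopt($mh, CURLMOPT_MAX_TOTAL_CONNECTIONS, $maxConnections);

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        // Remember the original URL so it can be recovered on completion
        curl_setopt($ch, CURLOPT_PRIVATE, $url);
        curl_multi_add_handle($mh, $ch);
    }

    try {
        do {
            $status = curl_multi_exec($mh, $running);
            if ($status !== CURLM_OK) {
                throw new \RuntimeException(curl_multi_strerror($status));
            }

            // Drain every transfer that has finished so far
            while ($info = curl_multi_info_read($mh)) {
                $ch = $info['handle'];
                $url = curl_getinfo($ch, CURLINFO_PRIVATE);
                $body = $info['result'] === CURLE_OK
                    ? curl_multi_getcontent($ch)
                    : null;
                curl_multi_remove_handle($mh, $ch);
                yield $url => $body;
            }

            // Sleep until there is socket activity (or one second passes)
            if ($running > 0) {
                curl_multi_select($mh, 1.0);
            }
        } while ($running > 0);
    } finally {
        // Also runs if the caller stops iterating early
        curl_multi_close($mh);
    }
}
```

It would be consumed with a plain loop such as `foreach (fetchAll($urls, 30) as $url => $html) { /* process this page */ }`, so each page can be processed while the remaining transfers are still running.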