I've written a basic web crawler in Perl. How can I make it more sophisticated by having it retrieve pages "breadth first", the way wget does?

This is from the wget docs:

Recursive retrieval of HTTP and HTML/CSS content is breadth-first. This means that Wget first downloads the requested document, then the documents linked from that document, then the documents linked by them, and so on. In other words, Wget first downloads the documents at depth 1, then those at depth 2, and so on until the specified maximum depth.

Any comments or suggestions on my code would also be appreciated.
use strict;
use warnings;
use feature 'say';
use IO::Handle;
use WWW::Mechanize;
use List::MoreUtils 'any';

##############
# parameters #
##############
my $url = "https://www.crawler-test.com/"; # starting URL
my $depth_level = 2;  # depth level for crawling (level 1 will only look at links on the starting page)
my $filter = ".pdf";  # for multiple types use this format: ".pdf|.docx|.doc|.rtf"
my $wait = 2;         # wait this number of seconds between HTTP requests (be kind to the server)
my $domain = "";      # only crawl links with host ending in this string; leave blank if not required. For multiple domains, use this format: "domain1|domain2"
my $nocrawlagain = 1; # setting this to 1 means the same link will not be crawled again; set to 0 to turn this off
##############

$domain = quotemeta($domain);
$domain =~ s/\\\|/|/g;

my @alreadycrawled;

open my $log, '>', 'mecherrors.log' or die "Cannot open log: $!";
$log->autoflush;

my $mech = WWW::Mechanize->new(stack_depth => 0, onerror => \&mecherror);

sub crawl {
    my ($url, $filter, $depth) = @_;
    $depth ||= 1;
    return if $depth > $depth_level;

    say "Crawling $url";
    $mech->get($url);
    sleep $wait;
    return unless $mech->success and $mech->is_html;

    my @linkstocrawl;

    # only get http(s) links (excludes things like mailto:)
    for my $link ($mech->find_all_links(url_abs_regex => qr/^http/)) {
        next if $link->url =~ /#/; # exclude URLs that refer to an anchor

        if ($link->url =~ /($filter)$/) {
            # the link matches the filter, so download it
            my $urlfilename = ($link->URI->path_segments)[-1];
            next if -e $urlfilename;
            $mech->get($url); # go back to the base page
            sleep $wait;
            $mech->get($link->url);
            sleep $wait;
            my $filename = $mech->response->filename;
            next if -e $filename;
            $mech->save_content($filename);
            say "Saved $filename";
        }
        else {
            push @linkstocrawl, $link;
        }
    }

    for my $link (@linkstocrawl) {
        next unless $link->url_abs->host =~ /($domain)$/;
        if ($nocrawlagain) {
            # skip if this link has already been crawled
            next if any { $_ eq $link->url_abs->abs } @alreadycrawled;
            push @alreadycrawled, $link->url_abs->abs;
        }
        crawl($link->url_abs->abs, $filter, $depth + 1);
    }
}

crawl($url, $filter);

sub mecherror {
    print $log "[", $mech->uri, "] ", $mech->response->message, "\n";
}
In order to do breadth-first, you need to move the my @linkstocrawl declaration out of sub crawl so that there is a single master to-do list rather than a separate list for each invocation of the crawl sub.

It will also be easier to do breadth-first if you make the code non-recursive, because recursion lends itself more or less automatically to depth-first. (When you recursively call a sub to handle a portion of the search space, that sub doesn't return until that portion is completely finished, which is not what you want for breadth-first.)

So the general structure you want is something like this (not complete or tested code):
my @linkstocrawl = ($starting_url);
my %linkscrawled; # hash instead of array for faster/easier lookups

while (my $url = shift @linkstocrawl) {
    next if exists $linkscrawled{$url}; # already saw it, so skip it
    $linkscrawled{$url}++;

    my $page = fetch($url);
    push @linkstocrawl, find_links_on($page);
    # You could also push the links onto @linkstocrawl one-by-one, depending on
    # whether you prefer to parse the page incrementally or grab them all at once

    # Do whatever else you want to do with $page
}
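This queue-driven loop can be exercised without any network access. Here is a minimal, self-contained sketch in the same shape, where the fetch/find_links_on pair is replaced by a hard-coded link map (the %links_on structure and its page names are invented for illustration):

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical in-memory "site": page => the links found on it.
my %links_on = (
    'a' => [ 'b', 'c' ],
    'b' => [ 'd' ],
    'c' => [ 'd', 'a' ],  # links back to 'a' (a cycle)
    'd' => [],
);

my @linkstocrawl = ('a');
my %linkscrawled;   # hash instead of array for faster/easier lookups
my @crawl_order;

while (my $url = shift @linkstocrawl) {
    next if exists $linkscrawled{$url};  # already saw it, so skip it
    $linkscrawled{$url}++;
    push @crawl_order, $url;             # "do something with the page"
    push @linkstocrawl, @{ $links_on{$url} };
}

say "@crawl_order";  # breadth-first: both children of 'a' before 'd'
```

Because shift takes from the front of the queue, 'b' and 'c' (depth 1) are both visited before 'd' (depth 2), and the %linkscrawled hash keeps the cycle back to 'a' from looping forever.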
The difference between depth-first search (DFS) and breadth-first search (BFS) is simple:

DFS uses a to-do stack:
my @todo = ...;
while (@todo) {
    my $job = pop(@todo);
    push @todo, process($job);
}
BFS uses a to-do queue:
my @todo = ...;
while (@todo) {
    my $job = shift(@todo);
    push @todo, process($job);
}
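To see the stack/queue difference concretely, here is a toy example that runs both loops over the same data; the %children tree and the order-recording process() sub are invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical search space: node => its children.
my %children = (
    root => [ 'a', 'b' ],
    a    => [ 'a1', 'a2' ],
    b    => [ 'b1' ],
);

# process() records the visit and returns new jobs, as in the loops above.
sub process {
    my ($job, $order) = @_;
    push @$order, $job;
    return @{ $children{$job} || [] };
}

# DFS: to-do stack (push and pop the same end)
my @dfs_order;
my @todo = ('root');
while (@todo) {
    my $job = pop(@todo);
    push @todo, process($job, \@dfs_order);
}

# BFS: to-do queue (push at the back, shift from the front)
my @bfs_order;
@todo = ('root');
while (@todo) {
    my $job = shift(@todo);
    push @todo, process($job, \@bfs_order);
}
```

The only difference between the two loops is pop versus shift, yet DFS here visits root, b, b1, a, a2, a1 (finishing each branch before starting the next), while BFS visits root, a, b, a1, a2, b1 (level by level).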
Recursion is a technique that uses the call stack to hold execution state. That's why recursive search routines perform depth-first searches. You will need to eliminate the recursive calls to crawl.

Each request requires two pieces of information: the URL to request and the depth of the page. The elements of our to-do list will consist of these two parts of the task definition.

Using the above as a guide, here is the overall flow of the code needed:
my @todo = ( [ $starting_url, 0 ] );
my %urls_seen = map { $_ => 1 } $starting_url;

while (@todo) {
    my ($url, $depth) = @{ shift(@todo) };

    my $response = fetch($url);

    # Optionally do something with $response.

    my $linked_depth = $depth + 1;
    if ($linked_depth <= $max_depth) {
        push @todo,
            map  [ $_, $linked_depth ],
            grep !$urls_seen{$_}++,
            find_links($response);
    }

    # Optionally do something with $response.
}
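The depth-limiting behaviour of this flow can be checked offline. In the sketch below, fetch() and find_links() are stand-ins over an invented %site map (a "response" here is just the URL itself), so the loop can run without HTTP:

```perl
use strict;
use warnings;

# Invented link map; in the real crawler these would be HTTP fetches.
my %site = (
    'http://ex/'    => [ 'http://ex/1', 'http://ex/2' ],
    'http://ex/1'   => [ 'http://ex/1/a' ],
    'http://ex/2'   => [ 'http://ex/' ],      # links back to the start
    'http://ex/1/a' => [ 'http://ex/1/a/x' ], # beyond $max_depth
);
sub fetch      { return $_[0] }                     # "response" is the URL
sub find_links { return @{ $site{ $_[0] } || [] } }

my $starting_url = 'http://ex/';
my $max_depth    = 2;

my @todo      = ( [ $starting_url, 0 ] );
my %urls_seen = map { $_ => 1 } $starting_url;
my @fetched;

while (@todo) {
    my ($url, $depth) = @{ shift(@todo) };
    my $response = fetch($url);
    push @fetched, $url;   # "do something with $response"
    my $linked_depth = $depth + 1;
    if ($linked_depth <= $max_depth) {
        push @todo,
            map  [ $_, $linked_depth ],
            grep !$urls_seen{$_}++,
            find_links($response);
    }
}
```

Note the two guards working together: %urls_seen stops the cycle through http://ex/, and the depth check means http://ex/1/a is fetched (depth 2) but its link http://ex/1/a/x (depth 3) never enters the queue.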
By the way, you should prevent the UA from automatically following redirects (requests_redirectable => []) to avoid downloading previously downloaded pages.
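With WWW::Mechanize (which subclasses LWP::UserAgent), that looks roughly like the sketch below: requests_redirectable is the inherited LWP::UserAgent option, and autocheck => 0 stops Mechanize from dying on the 3xx responses you will now see directly.

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new(
    stack_depth           => 0,
    autocheck             => 0,   # don't die on non-success responses
    requests_redirectable => [],  # follow no redirects (default: GET and HEAD)
);

# A redirect response is now returned as-is; inspect it yourself, e.g.:
# my $res = $mech->get($url);
# if ($res->is_redirect) { my $target = $res->header('Location'); ... }
```

You then decide per redirect target whether it is new (add it to the to-do list) or already seen (skip it), instead of the UA silently re-fetching it.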