How can I make a Perl web crawler do a "breadth-first" retrieval, like wget does?


I've written a basic web crawler in Perl. How can I make it more sophisticated by having it do the retrieval "breadth first", the same way wget does?

This is from the wget docs:

Recursive retrieval of HTTP and HTML/CSS content is breadth-first. This means that Wget first downloads the requested document, then the documents linked from that document, then the documents linked by them, and so on. In other words, Wget first downloads the documents at depth 1, then those at depth 2, and so on until the specified maximum depth.

Any comments on my code would also be appreciated.

use strict;
use warnings;
use feature 'say';
use WWW::Mechanize;
use List::MoreUtils 'any';

##############
# parameters #
##############
my $url = "https://www.crawler-test.com/"; # starting url
my $depth_level = 2; # depth level for crawling (level 1 will only look at links on the starting page)
my $filter = ".pdf"; # for multiple types use this format: ".pdf|.docx|.doc|.rtf"
my $wait = 2; # wait this number of seconds between http requests (be kind to the server)
my $domain = ""; # only crawl links with host ending in this string, leave blank if not required. For multiple domains, use this format: "domain1|domain2"
my $nocrawlagain = 1; # setting this to 1 will mean that the same link will not be crawled again, set to 0 to turn this off
##############


$domain = quotemeta($domain);
$domain =~ s/\\\|/|/g;

my @alreadycrawled; # URLs already crawled (used when $nocrawlagain is set)

open LOG, '>', 'mecherrors.log' or die "Can't open log file: $!";
LOG->autoflush;

my $mech = WWW::Mechanize->new(stack_depth => 0, onerror => \&mecherror);

sub crawl {

    my $url = shift;
    my $filter = shift;
    my $depth = shift || 1;

    return if $depth > $depth_level;

    say "Crawling $url";
    $mech->get($url);
    sleep $wait;
    return unless ($mech->success and $mech->is_html);


    my @linkstocrawl;

    for my $link ($mech->find_all_links(url_abs_regex => qr/^http/))  # only get http links (excludes things like mailto:)
    {

        next if $link->url =~ /#/;  # excludes URLs that are referring to an anchor

        # if the link matches the filter then download it
        if ($link->url =~ /($filter)$/)
        {
            my $urlfilename = ($link->URI->path_segments)[-1];
            next if -e $urlfilename;
            $mech->get($url); # go to base page
            sleep $wait;
            $mech->get($link->url);
            sleep $wait;
            my $filename = $mech->response->filename;
            next if -e $filename;
            $mech->save_content($filename);
            say "Saved $filename";

        } else {

            push @linkstocrawl, $link;

        }
    }

    for my $link (@linkstocrawl)
    {
        next unless $link->url_abs->host =~ /($domain)$/;
        if ($nocrawlagain)
        {
            # skip if already crawled this link
            next if any { $_ eq $link->url_abs->as_string } @alreadycrawled;
            push @alreadycrawled, $link->url_abs->as_string;
        }
        crawl($link->url_abs->as_string, $filter, $depth + 1);
    }

}


crawl($url, $filter);

sub mecherror {
    print LOG "[", $mech->uri, "] ", $mech->response->message, "\n";
}
perl web-crawler wget
2 Answers
3 votes

If you want to do breadth-first, you need to take the my @linkstocrawl declaration out of sub crawl so that there is only one master to-do list rather than a separate list for each invocation of the crawl sub.

It's also easier to do breadth-first if you make the code non-recursive, because recursion lends itself more-or-less automatically to depth-first. (When you recursively call a sub to handle one piece of the search space, that sub won't return until that piece has been completely processed, which is not what you want for breadth-first.)

So the general structure you want is something like this (not complete or tested code):

my @linkstocrawl = $starting_url;
my %linkscrawled; # hash instead of array for faster/easier lookups

while (my $url = shift @linkstocrawl) {
  next if exists $linkscrawled{$url}; # already saw it, so skip it
  $linkscrawled{$url}++;

  my $page = fetch($url);
  push @linkstocrawl, find_links_on($page);
  # you could also push the links onto @linkstocrawl one-by-one, depending on
  # whether you prefer to parse the page incrementally or grab them all at once

  # Do whatever else you want to do with $page
}
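
For example, the two placeholders could be backed by the WWW::Mechanize object the question already uses. This is only a rough, untested sketch assuming the $mech and $wait variables from the original script; the sub names simply match the placeholders above:

# Hypothetical fill-ins for the placeholders above, reusing the question's
# $mech and $wait. fetch() navigates $mech to the page, so find_links_on()
# can simply read $mech's current page and ignore its argument.
sub fetch {
    my ($url) = @_;
    my $response = $mech->get($url);
    sleep $wait;                       # be kind to the server
    return $response;
}

sub find_links_on {
    my ($page) = @_;
    return () unless $mech->success and $mech->is_html;
    return grep { !/#/ }                              # skip links to anchors
           map  { $_->url_abs->as_string }            # absolute URLs as strings
           $mech->find_all_links(url_abs_regex => qr/^http/);
}

Returning plain strings here keeps the %linkscrawled keys in the loop above consistent, since the seen-check compares the URLs as strings.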

2 votes

The difference between a depth-first search (DFS) and a breadth-first search (BFS) is simple:

  • DFS uses a stack of to-do items:

    my @todo = ...;
    while (@todo) {
       my $job = pop(@todo);
       push @todo, process($job);
    }
    
  • BFS uses a queue of to-do items:

    my @todo = ...;
    while (@todo) {
       my $job = shift(@todo);
       push @todo, process($job);
    }
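
To see the difference concretely, here is a tiny self-contained demonstration (the link graph and variable names are made up for illustration) that runs the same loop both ways; the only change is pop versus shift:

use strict;
use warnings;
use feature 'say';

# A tiny made-up link graph: page => pages it links to
my %children = (
    A => [qw(B C)],
    B => [qw(D E)],
    C => [qw(F G)],
);

for my $mode (qw(DFS BFS)) {
    my @todo = ('A');
    my @order;
    while (@todo) {
        # stack for DFS (pop), queue for BFS (shift) -- the only difference
        my $job = $mode eq 'DFS' ? pop @todo : shift @todo;
        push @order, $job;
        push @todo, @{ $children{$job} // [] };
    }
    say "$mode visit order: @order";
}

# Prints:
#   DFS visit order: A C G F B E D
#   BFS visit order: A B C D E F G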
    

Recursion is a technique that leverages the stack of execution states, which is why recursive search routines perform depth-first searches. You will need to eliminate the recursive call to crawl.

Two pieces of information are needed for each request: the URL to request and the depth of the page. The elements of our to-do list will consist of these two-part job definitions.

Using the above as a guide, here's the overall flow of the code that's needed:

my @todo = [ $starting_url, 0 ];
my %urls_seen = map { $_ => 1 } $starting_url;

while (@todo) {
   my ($url, $depth) = @{ shift(@todo) };

   my $response = fetch($url);

   # Optionally do something with $response.

   my $linked_depth = $depth+1;
   if ($linked_depth <= $max_depth) {
      push @todo,
         map [ $_, $linked_depth ],
            grep !$urls_seen{$_}++,
               find_links($response);
   }

   # Optionally do something with $response.
}
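
The question's filter-and-save logic could slot in at the first "Optionally do something with $response" comment. A rough sketch, assuming fetch() is implemented with the question's $mech object and that $filter is the same variable as in the original script:

# Hypothetical: if the fetched URL matches the filter, save it to disk,
# reusing $mech->save_content() as in the original crawl sub.
if ($url =~ /($filter)$/) {
    my $filename = $mech->response->filename;
    $mech->save_content($filename) unless -e $filename;
    say "Saved $filename";
    next;   # a downloaded document isn't HTML, so skip link extraction
}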

By the way, you should prevent the UA from automatically following redirects (requests_redirectable => []) to avoid downloading pages you've previously downloaded.
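
Since WWW::Mechanize is a subclass of LWP::UserAgent, that option can be passed straight to the constructor; for example, adapting the constructor call from the question:

my $mech = WWW::Mechanize->new(
    stack_depth           => 0,
    onerror               => \&mecherror,
    requests_redirectable => [],   # don't follow redirects automatically
);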
