Movie scraper: regular expression does not capture every movie

Question (votes: 0, answers: 1)

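This is the output of my program for this link (http://www.rottentomatoes.com/movie/box_office.php). As you can see, some movies on the page are missing; for example, movie number 18 (One for the Money) is not there. Could anyone check my regular expression and help me figure out why it does not capture every movie, or whether there is a bug in my code that I cannot find?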

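I should add that I am using the lynx command to fetch the data. Yes, I have to use it =(. I have updated the code to show how I get the information from the web page.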

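Also, I only want to print 35 characters of the movie title, so if a title is longer than that, I want to cut off everything after the 35th character.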

Output:

##  ##  Movie Title                           Weekend      Cume   T-Meter
1   2   Safe House                             $78.2M     $7.7k       52%
2   1   The Vow                                $85.5M     $8.0k       30%
3   --  Ghost Rider: Spirit of Vengeance       $22.0M     $6.9k       15%
4   3   Journey 2: The Mysterious Island       $53.2M     $5.7k       43%
5   --  This Means War                         $19.2M     $5.5k       25%
6   4   Star Wars: Episode I - The Phantom Menace (in 3D) $33.7M     $3.0k       57%
7   5   Chronicle                              $51.0M     $2.9k       84%
8   6   The Woman in Black                     $45.3M     $2.6k       63%
9   --  The Secret World of Arrietty            $6.4M     $4.2k       93%
10  7   The Grey                               $47.9M     $1.4k       78%
11  9   The Descendants                        $75.0M     $2.4k       89%
12  13  The Artist                             $27.4M     $2.9k       97%
13  8   Big Miracle                            $16.6M     $1.3k       73%
14  14  Hugo                                   $66.7M     $2.9k       93%
15  11  Red Tails                              $47.5M     $1.4k       36%
16  10  Underworld Awakening                   $61.3M     $1.3k       28%
17  18  The Iron Lady                          $24.4M     $1.7k       53%
19  15  Extremely Loud & Incredibly Close      $30.6M     $1.1k       45%
20  17  Contraband                             $65.7M     $1.2k       49%
21  23  Alvin and the Chipmunks: Chipwrecked  $129.7M     $1.2k       13%
22  20  Mission: Impossible Ghost Protocol    $207.3M     $1.8k       93%
23  22  Tinker Tailor Soldier Spy              $22.7M     $2.6k       84%
24  29  The Adventures of Tintin               $76.4M     $1.3k       75%
25  33  A Separation                            $2.1M     $6.2k       99%
27  31  Albert Nobbs                            $2.4M     $1.6k       53%
28  --  Thin Ice                                $0.2M     $3.6k       72%
29  36  My Week with Marilyn                   $13.6M     $1.5k       84%
30  37  A Dangerous Method                      $5.2M     $1.7k       77%
31  35  Puss in Boots                         $149.0M     $1.0k       83%
33  53  In Darkness                             $0.1M     $5.5k       86%
34  44  We Need to Talk About Kevin             $0.6M     $4.0k       80%
36  48  W.E.                                    $0.2M     $2.5k       13%
37  47  Rampart                                 $0.1M     $1.8k       73%
38  52  Coriolanus                              $0.3M     $2.9k       94%
39  --  Bullhead                               $33.6k     $4.8k       86%
40  --  Undefeated                             $30.9k     $6.2k       92%
42  55  Chico & Rita                           $56.2k     $5.3k       93%
43  54  Pariah                                  $0.7M     $1.5k       96%


Biggest Debut: Ghost Rider: Spirit of Vengeance (3)
Weakest Debut: Undefeated (40)
Biggest Gain: In Darkness (20 places)
Biggest Loss: Underworld Awakening (6 places)

Code:

my $pageToGrab = "http://www.rottentomatoes.com/movie/box_office.php";
my $command = "/usr/bin/lynx -dump -width=150 $pageToGrab";
my $tempPageFile = `$command`;


print "##  "."##  "."Movie Title                           "."Weekend      "."Cume   "."T-Meter  \n";
do
{
        if ($tempPageFile =~ /(\d+)\s+(\d+|\-\-)\s+(\d+\%)\s+\[\d+\](.*)\s+(\d+)\s+(\$\d+(?:.\d+)?[Mk])\s+(\$\d+(?:.\d+)?[Mk])\s+(\$\d+(?:.\d+)?[Mk])\s+(\d+)/g)
        {
            $curweek[$i] = $1;
            $lastweek[$i] = $2;
            $tmeter[$i] = $3;
            $title[$i] = $4;
            $weekend[$i] = $7;
            $cume[$i] = $8;
            printf("%-4s%-4s%-38s%7s%10s%10s\n",$curweek[$i], $lastweek[$i], $title[$i], $weekend[$i], $cume[$i], $tmeter[$i]);

            if ($lastweek[$i] ne '--')
            {
                    $gain = $lastweek[$i] - $curweek[$i];
            }

            if( $gain > $largest)
            {
                    $largest = $gain;
                    $biggestgaintitle = $title[$i];
            }

            if( $gain < $smallest)
            {
                    $smallest = $gain;
                    $biggestlosstitle = $title[$i];
            }

            if( $lastweek[$i] eq '--')
            {
                    $moviedebut[$j] = $curweek[$i];
                    $lastmovie = $title[$i];
                    $j++;
            }
            $i++;
    }
}
while($i < 38);
Tags: regex, perl
1 Answer
2 votes

Here is entry 18:

18 12 2% [82]One for the Money 4 $0.8M $25.5M $830 933

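Note that the third dollar amount ($830) has no M or k suffix, so the pattern's mandatory [Mk] fails on that row. Use [Mk]? to make the suffix optional, perhaps for all three dollar amounts: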

if ($tempPageFile =~ /(\d+)\s+(\d+|\-\-)\s+(\d+\%)\s+\[\d+\](.*)\s+(\d+)\s+(\$\d+(?:.\d+)?[Mk])\s+(\$\d+(?:.\d+)?[Mk])\s+(\$\d+(?:.\d+)?[Mk]?)\s+(\d+)/g) {
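
As a rough illustration of why the optional suffix matters, the sketch below runs the row for entry 18 through two variants of the pattern. The helper names (row_pattern, parse_row) and the hash keys are hypothetical and not part of the original program. The sketch also escapes the decimal point and makes the title capture non-greedy, which are my own adjustments rather than something the answer prescribes.

```perl
use strict;
use warnings;

# Row 18 as it appears in the lynx dump (copied from the answer above).
my $line = '18 12 2% [82]One for the Money 4 $0.8M $25.5M $830 933';

# One dollar amount: digits, optional decimals, then an M/k suffix.
my $strict = qr/\$\d+(?:\.\d+)?[Mk]/;     # suffix required (original behaviour)
my $loose  = qr/\$\d+(?:\.\d+)?[Mk]?/;    # suffix optional (suggested fix)

# Build the full row pattern around a given money sub-pattern.
# Hypothetical helper, for illustration only.
sub row_pattern {
    my ($money) = @_;
    return qr/(\d+)\s+(\d+|--)\s+(\d+%)\s+\[\d+\](.*?)\s+(\d+)\s+($money)\s+($money)\s+($money)\s+(\d+)/;
}

# Return a hash ref of the captured columns, or undef if the row does not match.
sub parse_row {
    my ($text, $pattern) = @_;
    return unless $text =~ $pattern;
    return {
        cur       => $1,
        last_week => $2,
        tmeter    => $3,
        title     => $4,
        weekend   => $7,
        cume      => $8,
    };
}

my $missed = parse_row($line, row_pattern($strict));
my $movie  = parse_row($line, row_pattern($loose));

print "strict pattern: ", (defined $missed ? "match" : "no match"), "\n";
printf "%-4s%-4s%-38s%7s%10s%10s\n",
    @{$movie}{qw(cur last_week title weekend cume tmeter)} if defined $movie;
```

Building the money sub-pattern once with qr// keeps the three amounts consistent, so a change such as the optional suffix only has to be made in one place.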

To truncate:

$title[$i] = substr $4, 0, 35;

perldoc -f substr
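
A small sketch of the truncation, using the longest title from the output above. It shows substr next to an alternative that is my own suggestion, not part of the answer: a precision in the printf format (%-38.35s), which limits the printed string to 35 characters while still padding the column to 38.

```perl
use strict;
use warnings;

# Longest title from the question's output; it overflows the 38-wide column.
my $title = 'Star Wars: Episode I - The Phantom Menace (in 3D)';

# Option 1: truncate the stored value with substr (as in the answer).
my $short = substr $title, 0, 35;

# Option 2 (assumption: only the printed width matters): let the format
# truncate. ".35" caps the string at 35 characters, "-38" left-justifies
# and pads the field to 38.
my $cell = sprintf '%-38.35s', $title;

# A title shorter than 35 characters passes through substr unchanged.
my $unchanged = substr 'Hugo', 0, 35;

print length($short), "\n";
print length($cell), "\n";
print "$unchanged\n";
```

The substr form changes the stored title, which also affects later lines such as the "Biggest Gain" summary; the format-precision form leaves the stored title intact and only shortens what the table prints.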
