nutch 1.16在文件系统抓取中跳过文件:/目录样式的链接

问题描述 投票:0回答:1

我试图使用从主教程(https://cwiki.apache.org/confluence/display/nutch/FAQ#FAQ-HowdoIindexmylocalfilesystem?)以及其他来源获得的示例在某些本地目录上作为爬行器运行。 Nutch完全能够毫无问题地爬网,但是由于某种原因,它拒绝扫描本地目录。

我的配置文件如下:

regex-urlfilter:

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip http: ftp: and mailto: urls
-^(http|ftp|mailto):

# This change is not necessary but may make your life easier.  
# Any file types you do not want to index need to be added to the list otherwise 
# Nutch will often try to parse them and fail in doing so as it doesnt know 
# how to deal with a lot of binary file types.:
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
#-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS
#|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|gz|GZ|rpm|RPM|tgz|TGZ|mov
#|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|asp|ASP|xxx|XXX|yyy|YYY
#|cs|CS|dll|DLL|refresh|REFRESH)$

# skip URLs longer than 2048 characters, see also db.max.outlink.length
#-^.{2049,}

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# For safe web crawling if crawled content is exposed in a public search interface:
# - exclude private network addresses to avoid that information
#   can be leaked by placing links pointing to web interfaces of services
#   running on the crawling machines (e.g., HDFS, Hadoop YARN)
# - in addition, file:// URLs should be either excluded by a URL filter rule
#   or ignored by not enabling protocol-file
#
# - exclude localhost and loop-back addresses
#     http://localhost:8080
#     http://127.0.0.1/ .. http://127.255.255.255/
#     http://[::1]/
#-^https?://(?:localhost|127(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3}|\[::1\])(?::\d+)?(?:/|$)
#
# - exclude private IP address spaces
#     10.0.0.0/8
#-^https?://(?:10(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3})(?::\d+)?(?:/|$)
#     192.168.0.0/16
#-^https?://(?:192\.168(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
#     172.16.0.0/12
#-^https?://(?:172\.(?:1[6789]|2[0-9]|3[01])(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)

# accept anything else
+.

nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
 <name>http.agent.name</name>
 <value>NutchSpiderTest</value>
</property>

<property>
  <name>http.robots.agents</name>
  <value>NutchSpiderTest,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>I am just testing nutch, please tell me if it's bothering your website</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-http|protocol-httpclient|urlfilter-(regex|validator)|parse-(html|tika|text)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|index-more</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  By default Nutch includes plugins to crawl HTML and various other
  document formats via HTTP/HTTPS and indexing the crawled content
  into Solr.  More plugins are available to support more indexing
  backends, to fetch ftp:// and file:// URLs, for focused crawling,
  and many other use cases.
  </description>
</property>

<property>
 <name>file.content.limit</name>
 <value>-1</value>
 <description> Needed to stop buffer overflow errors - Unable to read.....</description>
</property>

<property>
  <name>file.crawl.parent</name>
  <value>false</value>
  <description>The crawler is not restricted to the directories that you specified in the
    Urls file but it is jumping into the parent directories as well. For your own crawlings you can
    change this behavior (set to false) the way that only directories beneath the directories that you specify get
    crawled.</description>
</property>

</configuration>

最后,我注释了regex-normalize.xml的这一部分:

<!-- removes duplicate slashes but -->
<!-- * allow 2 slashes after colon ':' (indicating protocol) -->
<!-- we do not need this with files
<regex>
  <pattern>(?&lt;!:)/{2,}</pattern>
  <substitution>/</substitution>
</regex>
-->

使用以下命令在运行时/本地目录中使用Cygwin构建的发行版上的ant,Windows 10上的运行坚果:

bin/crawl -s dirs dircrawl 2 >& dircrawl.log

使用Dirs时,文件夹中包含以下seed.txt文件(我尝试包含链接的不同版本,因为哪种版本的链接似乎不一致,但是我可以将其归因于我没有找到明确的答案=:

/cygdrive/c/Users/abc/Desktop/adirectory/
file:/cygdrive/c/Users/abc/Desktop/adirectory/
file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
file://cygdrive/c/Users/abc/Desktop/anotherdirectory/
file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/

dircrawl是我想将爬网保存到的目录,并指示转数/最大深度为'2'。几秒钟后,nutch爬网将输出以下hadoop.txt日志文件:

2020-03-24 14:08:58,184 INFO  crawl.Injector - Injector: starting at 2020-03-24 14:08:58
2020-03-24 14:08:58,184 INFO  crawl.Injector - Injector: crawlDb: dircrawl/crawldb
2020-03-24 14:08:58,184 INFO  crawl.Injector - Injector: urlDir: dirs
2020-03-24 14:08:58,184 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2020-03-24 14:08:58,948 INFO  crawl.Injector - Injecting seed URL file file:/C:/Users/abc/Desktop/nutch/runtime/local/dirs/seed.txt
2020-03-24 14:08:59,011 WARN  impl.MetricsConfig - Cannot locate configuration: tried hadoop-metrics2-jobtracker.properties,hadoop-metrics2.properties
2020-03-24 14:08:59,888 INFO  mapreduce.Job - The url to track the job: http://localhost:8080/
2020-03-24 14:08:59,890 INFO  mapreduce.Job - Running job: job_local1269520609_0001
2020-03-24 14:09:00,897 WARN  crawl.Injector - Skipping /cygdrive/c/Users/abc/Desktop/adirectory/:java.net.MalformedURLException: no protocol: /cygdrive/c/Users/abc/Desktop/adirectory/
2020-03-24 14:09:00,902 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2020-03-24 14:09:00,906 INFO  mapreduce.Job - Job job_local1269520609_0001 running in uber mode : false
2020-03-24 14:09:00,908 INFO  mapreduce.Job -  map 0% reduce 0%
2020-03-24 14:09:01,158 WARN  impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2020-03-24 14:09:01,447 WARN  zlib.ZlibFactory - Failed to load/initialize native-zlib library
2020-03-24 14:09:01,461 INFO  crawl.Injector - Injector: overwrite: false
2020-03-24 14:09:01,461 INFO  crawl.Injector - Injector: update: false
2020-03-24 14:09:01,924 INFO  mapreduce.Job -  map 100% reduce 100%
2020-03-24 14:09:01,926 INFO  mapreduce.Job - Job job_local1269520609_0001 completed successfully
2020-03-24 14:09:01,951 INFO  mapreduce.Job - Counters: 31
    File System Counters
        FILE: Number of bytes read=1857050
        FILE: Number of bytes written=3067581
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=5
        Map output records=0
        Map output bytes=0
        Map output materialized bytes=6
        Input split bytes=289
        Combine input records=0
        Combine output records=0
        Reduce input groups=0
        Reduce shuffle bytes=6
        Reduce input records=0
        Reduce output records=0
        Spilled Records=0
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=13
        Total committed heap usage (bytes)=402653184
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    injector
        urls_filtered=5
    File Input Format Counters 
        Bytes Read=0
    File Output Format Counters 
        Bytes Written=239
2020-03-24 14:09:02,022 INFO  crawl.Injector - Injector: Total urls rejected by filters: 5
2020-03-24 14:09:02,023 INFO  crawl.Injector - Injector: Total urls injected after normalization and filtering: 0
2020-03-24 14:09:02,023 INFO  crawl.Injector - Injector: Total urls injected but already in CrawlDb: 0
2020-03-24 14:09:02,023 INFO  crawl.Injector - Injector: Total new urls injected: 0
2020-03-24 14:09:02,054 INFO  crawl.Injector - Injector: finished at 2020-03-24 14:09:02, elapsed: 00:00:03
2020-03-24 14:09:08,708 INFO  crawl.Generator - Generator: starting at 2020-03-24 14:09:08
2020-03-24 14:09:08,708 INFO  crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2020-03-24 14:09:08,708 INFO  crawl.Generator - Generator: filtering: false
2020-03-24 14:09:08,708 INFO  crawl.Generator - Generator: normalizing: true
2020-03-24 14:09:08,715 INFO  crawl.Generator - Generator: topN: 50000
2020-03-24 14:09:08,879 WARN  impl.MetricsConfig - Cannot locate configuration: tried hadoop-metrics2-jobtracker.properties,hadoop-metrics2.properties
2020-03-24 14:09:10,418 INFO  mapreduce.Job - The url to track the job: http://localhost:8080/
2020-03-24 14:09:10,424 INFO  mapreduce.Job - Running job: job_local828841059_0001
2020-03-24 14:09:11,450 INFO  mapreduce.Job - Job job_local828841059_0001 running in uber mode : false
2020-03-24 14:09:11,453 INFO  mapreduce.Job -  map 0% reduce 0%
2020-03-24 14:09:11,784 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2020-03-24 14:09:11,784 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
2020-03-24 14:09:11,784 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2020-03-24 14:09:11,816 WARN  zlib.ZlibFactory - Failed to load/initialize native-zlib library
2020-03-24 14:09:12,073 WARN  impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2020-03-24 14:09:12,475 INFO  mapreduce.Job -  map 100% reduce 100%
2020-03-24 14:09:12,505 WARN  impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2020-03-24 14:09:13,485 INFO  mapreduce.Job - Job job_local828841059_0001 completed successfully
2020-03-24 14:09:13,502 INFO  mapreduce.Job - Counters: 30
    File System Counters
        FILE: Number of bytes read=2784859
        FILE: Number of bytes written=4605489
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=0
        Map output records=0
        Map output bytes=0
        Map output materialized bytes=28
        Input split bytes=156
        Combine input records=0
        Combine output records=0
        Reduce input groups=0
        Reduce shuffle bytes=28
        Reduce input records=0
        Reduce output records=0
        Spilled Records=0
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=15
        Total committed heap usage (bytes)=603979776
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=98
    File Output Format Counters 
        Bytes Written=16
2020-03-24 14:09:13,502 INFO  crawl.Generator - Generator: number of items rejected during selection:
2020-03-24 14:09:13,521 WARN  crawl.Generator - Generator: 0 records selected for fetching, exiting ...

虽然dirclaw.txt日志产生:

Injecting seed URLs
/cygdrive/c/Users/abc/Desktop/nutch/runtime/local/bin/nutch inject dircrawl/crawldb dirs
Injector: starting at 2020-03-24 14:08:58
Injector: crawlDb: dircrawl/crawldb
Injector: urlDir: dirs
Injector: Converting injected urls to crawl db entries.
Injecting seed URL file file:/C:/Users/abc/Desktop/nutch/runtime/local/dirs/seed.txt
Skipping /cygdrive/c/Users/abc/Desktop/adirectory/:java.net.MalformedURLException: no protocol: /cygdrive/c/Users/abc/Desktop/adirectory/
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 5
Injector: Total urls injected after normalization and filtering: 0
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 0
Injector: finished at 2020-03-24 14:09:02, elapsed: 00:00:03
24 Mar 2020 14:09:02 : Iteration 1 of 2
Generating a new segment
/cygdrive/c/Users/abc/Desktop/nutch/runtime/local/bin/nutch generate -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true dircrawl/crawldb dircrawl/segments -topN 50000 -numFetchers 1 -noFilter
Generator: starting at 2020-03-24 14:09:08
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: number of items rejected during selection:
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now

所以基本上现在我有点卡住了。我试图撤消某些更改,但是无论如何我都无法使配置与本地目录一起使用。有人知道我在做什么错吗?

regex web-crawler nutch
1个回答
0
投票

爬网文件的麻烦:URL以及为什么斜杠很重要NUTCH-1483中描述:

  • 这些种子URL应该有效:
© www.soinside.com 2019 - 2024. All rights reserved.