Scraping a file using HTML saved on the local system

Question

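For example, I have a site, "www.example.com". I want to scrape this site's HTML from a copy saved on my local system, so for testing I saved the page on my desktop as example.html.

I have written the following spider code for this: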

class ExampleSpider(BaseSpider):
   name = "example"
   start_urls = ["example.html"]

   def parse(self, response):
       print response
       hxs = HtmlXPathSelector(response)

But when I run the code above, I get the following error:

ValueError: Missing scheme in request url: example.html

In short, I want to scrape the www.example.com page from the HTML saved in example.html on my local system.

Can anyone suggest how to assign that example.html file in start_urls?

Thanks in advance.

Answer 1

You can crawl a local file using a URL of the following form:

 file://127.0.0.1/path/to/file.html

This does not require an HTTP server to be installed on your machine.
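To see why the bare filename fails and what a usable URL looks like, here is a minimal sketch using only the Python standard library. The helper name `local_file_url` is made up for illustration, and the sketch assumes the file is addressed relative to the current working directory.

```python
from pathlib import Path
from urllib.parse import urlparse


def local_file_url(path):
    """Hypothetical helper: turn a local path into an absolute file:// URL."""
    # as_uri() only works on absolute paths, so resolve the path first.
    return Path(path).resolve().as_uri()


# A bare filename has no scheme, which is what the ValueError complains about.
print(repr(urlparse("example.html").scheme))

# The converted value carries the "file" scheme and an absolute path.
url = local_file_url("example.html")
print(urlparse(url).scheme, url)
```

The resulting string is the kind of value that can go into start_urls.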

Answer 2

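You can use HttpCacheMiddleware, which lets you run the spider from the cache. The HttpCacheMiddleware settings are described in the Scrapy documentation.

Basically, adding the following settings to settings.py makes it work: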

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0 # Set to 0 to never expire

However, this requires an initial spider run against the web to populate the cache.
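For a run that should never touch the network once the cache is populated, a settings.py sketch might look like the following. HTTPCACHE_DIR and HTTPCACHE_IGNORE_MISSING are standard Scrapy settings that this answer does not mention, and the directory name is an assumed value.

```python
# settings.py (sketch; values are assumptions, adjust to your project)

# Turn on Scrapy's HTTP cache middleware.
HTTPCACHE_ENABLED = True

# 0 means cached responses never expire.
HTTPCACHE_EXPIRATION_SECS = 0

# Where cached responses are stored; a relative path is resolved
# against the project data directory.
HTTPCACHE_DIR = "httpcache"

# When True, requests missing from the cache are ignored rather than downloaded.
HTTPCACHE_IGNORE_MISSING = True
```

Leave HTTPCACHE_IGNORE_MISSING at False for the first run, otherwise nothing gets downloaded and the cache stays empty.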

Answer 3

In Scrapy, you can crawl a local file like this:

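from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class ExampleSpider(BaseSpider):
   name = "example"
   # A file:// URL supplies the scheme that a bare filename lacks
   start_urls = ["file:///path_of_directory/example.html"]

   def parse(self, response):
       print response
       hxs = HtmlXPathSelector(response)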

I suggest checking it with scrapy shell 'file:///path_of_directory/example.html'.

Answer 4

scrapy shell "file:E:\folder\to\your\script\Scrapy\teste1\teste1.html"

This worked for me today on Windows 10. I had to give the full path, without the ////.
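One way to avoid guessing the number of slashes on Windows is to let pathlib build the URL. This is only a sketch using the path from the command above; PureWindowsPath is used so the conversion behaves the same on any operating system.

```python
from pathlib import PureWindowsPath

# Path taken from the command above.
path = PureWindowsPath(r"E:\folder\to\your\script\Scrapy\teste1\teste1.html")

# as_uri() emits the standard three-slash form with forward slashes.
url = path.as_uri()
print(url)
```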

Answer 5

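If you look at the Scrapy source code for Request, for example on GitHub, you can see that Scrapy sends a request to an HTTP server and gets the page it needs in the server's response. Your file system is not an HTTP server. To test with Scrapy, you have to set up an HTTP server. Then you can give Scrapy URLs like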

http://127.0.0.1/example.html
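If you do want to go the HTTP route, Python's standard library can serve the directory without installing anything else. The sketch below is an illustration, not part of the answer: it writes a stand-in example.html into a temporary directory, serves it on a free local port, and fetches it back once to show the URL works.

```python
import functools
import tempfile
import threading
import urllib.request
from http.server import HTTPServer, SimpleHTTPRequestHandler
from pathlib import Path

# Stand-in for the saved page; in practice, point `directory` at the real folder.
directory = tempfile.mkdtemp()
Path(directory, "example.html").write_text("<html><title>Example</title></html>")

# Serve that directory; port 0 asks the OS for any free port.
handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
server = HTTPServer(("127.0.0.1", 0), handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/example.html"

# Bypass any configured proxy for this local request.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(url) as resp:
    body = resp.read().decode()
print(url, len(body))

server.shutdown()
server.server_close()
```

For a quick manual test, running python -m http.server from the folder does the same job; it listens on port 8000 by default.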
