I'm on a Windows machine, and I want to set up a new Scrapy crawler in Visual Studio Code. The crawler itself works fine, but I want to debug the code, so I added this to my launch.json file:
{
    "name": "Scrapy with Integrated Terminal/Console",
    "type": "python",
    "request": "launch",
    "stopOnEntry": true,
    "pythonPath": "${config:python.pythonPath}",
    "program": "C:/Users/neo/.virtualenvs/Gers-Crawler-77pVkqzP/Scripts/scrapy.exe",
    "cwd": "${workspaceRoot}",
    "args": [
        "crawl",
        "amazon",
        "-o",
        "amazon.json"
    ],
    "console": "integratedTerminal",
    "env": {},
    "envFile": "${workspaceRoot}/.env",
    "debugOptions": [
        "RedirectOutput"
    ]
}
However, I can't hit any breakpoints. PS: I took the JSON config from here: http://www.stevetrefethen.com/blog/debugging-a-python-scrapy-project-in-vscode
Create a runner.py module:
import os
from scrapy.cmdline import execute

os.chdir(os.path.dirname(os.path.realpath(__file__)))

try:
    execute(
        [
            'scrapy',
            'crawl',
            'SPIDER NAME',
            '-o',
            'out.json',
        ]
    )
except SystemExit:
    pass
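With this runner in place, the debug configuration can point at the script itself rather than at scrapy.exe, so breakpoints in the spider code are hit in-process. A minimal launch.json entry might look like the sketch below (the entry name and the runner path are illustrative; adjust them to your project layout):

```json
{
    "name": "Python: Scrapy runner",
    "type": "python",
    "request": "launch",
    "program": "${workspaceRoot}/runner.py",
    "cwd": "${workspaceRoot}",
    "console": "integratedTerminal"
}
```

Because execute() runs the crawl in the same Python process the debugger attached to, no special arguments are needed.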
I got this working. The easiest way is to create a runner script, runner.py:
import scrapy
from scrapy.crawler import CrawlerProcess
from g4gscraper.spiders.g4gcrawler import G4GSpider

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEED_FORMAT': 'json',
    'FEED_URI': 'data.json'
})

process.crawl(G4GSpider)
process.start()  # the script will block here until the crawling is finished
Then I set breakpoints inside the spider and launched the debugger on this file. Reference: https://doc.scrapy.org/en/latest/topics/practices.html
There is no need to modify launch.json; the default "Python: Current File (Integrated Terminal)" configuration works perfectly. For Python 3 projects, remember to place the runner.py file at the same level as the scrapy.cfg file (that is, the project root).
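For reference, that default configuration looks roughly like this in launch.json (exact field values may vary slightly between versions of the VS Code Python extension):

```json
{
    "name": "Python: Current File (Integrated Terminal)",
    "type": "python",
    "request": "launch",
    "program": "${file}",
    "console": "integratedTerminal"
}
```

With runner.py open as the active editor, this launches the crawl under the debugger.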
The runner.py code is the same as @naqushab's above. Note process.crawl(ClassName), where ClassName is the spider class you want to set breakpoints in.