I want to deploy a spider to Scrapy Cloud. I run my spider from a main.py file using CrawlerProcess, then use pandas to do some post-processing on the crawled data. Finally, I publish the cleaned data to a table on Google BigQuery and send an e-mail notification.
The problem I'm facing is with the last two steps. To interact with GCP or Gmail I need two JSON files containing my account credentials, and my problem is the path to these two JSON files. On my local machine the JSON files sit in the project directory, so I simply reference them with os.getcwd() + "/credentials.json".
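That local lookup resolves the file against the process's working directory; for reference, a path anchored to the module's own location (a sketch, with credentials.json standing in for either file) removes that dependency, though the file still has to ship alongside the code:

```python
import os

# os.getcwd() depends on where the process was launched, which on a hosted
# platform is usually not the project directory. Anchoring the path to this
# module's own location avoids that (but the JSON file still has to be
# deployed next to the code).
here = os.path.dirname(os.path.abspath(__file__))
key_path = os.path.join(here, "credentials.json")
```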
When I deploy to Scrapy Cloud, I get the error shown in the screenshot below. I followed the steps shown in the Scrapy Cloud documentation, but I still run into the same error.

Here is a screenshot of my project tree, and my setup.py file:
from setuptools import setup, find_packages

setup(
    name='project',
    version='1.0',
    packages=find_packages(),
    scripts=['main_scraper_1.py'],
    package_data={'indeed': ['*.json']},
    entry_points={'scrapy': ['settings = indeed.settings']},
    zip_safe=False,
)
...and a code snippet showing the exact lines where the pipeline fails. My dilemma is how to reference these JSON files in my script. Which path should I give?
import os
from datetime import datetime

import yagmail
from google.cloud import bigquery
from google.oauth2 import service_account

# Upload the results to BigQuery
# First, set the credentials
key_path_local = os.getcwd() + "/bq_credentials.json"  # <-- This works locally but does not work on Scrapy Cloud
credentials = service_account.Credentials.from_service_account_file(
    key_path_local,
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

# Now, instantiate the client and upload the table to BigQuery
client = bigquery.Client(project="web-scraping-371310", credentials=credentials)
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("job_title_name", "STRING"),
        bigquery.SchemaField("job_type", "STRING"),
        bigquery.SchemaField("company_name", "STRING"),
        bigquery.SchemaField("company_indeed_url", "STRING"),
        bigquery.SchemaField("city", "STRING"),
        bigquery.SchemaField("remote", "STRING"),
        bigquery.SchemaField("salary", "STRING"),
        bigquery.SchemaField("crawled_page_rank", "INT64"),
        bigquery.SchemaField("job_page_url", "STRING"),
        bigquery.SchemaField("listing_page_url", "STRING"),
        bigquery.SchemaField("job_description", "STRING"),
        bigquery.SchemaField("crawled_timestamp", "TIMESTAMP"),
        bigquery.SchemaField("salary_type", "STRING"),
        bigquery.SchemaField("salary_low", "FLOAT64"),
        bigquery.SchemaField("salary_high", "FLOAT64"),
        bigquery.SchemaField("crawler_name", "STRING"),
    ]
)
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

# Upload the table
client.load_table_from_dataframe(
    dataframe=df,
    destination="web-scraping-371310.crawled_datasets.chris_indeed_workflow",
    job_config=job_config,
).result()

# Step 16: Send success e-mail
yag = yagmail.SMTP(
    "[email protected]",
    oauth2_file=os.getcwd() + "/email_authentication.json",  # <-- This works locally but does not work on Scrapy Cloud
)
contents = [
    "This is an automatic notification to inform you that the Indeed crawler ran successfully"
]
yag.send(
    ["[email protected]"],
    f"The Indeed crawler ran successfully at {datetime.now()} CET",
    contents,
)
Any insight would be greatly appreciated, since I couldn't find a solution anywhere online. Thank you!
In the same directory as your setup.py file, create a new file called MANIFEST.in. Inside it, put:

include *.json

You can also remove the package_data line from your setup.py file, since it isn't doing anything. Then re-package your project and deploy it to Scrapy Cloud again.
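One way to check the re-packaged archive before deploying is to list which JSON files actually ended up inside it. A minimal sketch; the egg path in the example is an assumption about what your build produced:

```python
import zipfile

def bundled_json_files(egg_path):
    """Return the .json entries bundled inside a built egg (a zip archive)."""
    with zipfile.ZipFile(egg_path) as egg:
        return [name for name in egg.namelist() if name.endswith(".json")]

# Example (hypothetical path -- use whatever your build actually produced):
# bundled_json_files("dist/project-1.0-py3.11.egg")
```

If the credentials files don't show up in the listing, they were never packaged, and no runtime path will find them.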
Following Zyte's (Scrapy Cloud) support article on deploying non-code files, I was able to authenticate against the Google Cloud APIs successfully.

Note where I placed the credentials file, and the matching configuration in the setup.py file.

With the configuration above, I can read the credentials file on Zyte's cloud and authenticate to Google Cloud using the following:
import json
import pkgutil

from google.oauth2 import service_account
from googleapiclient import discovery

data = pkgutil.get_data("contractor", "resources/pipedrive-automations-381a9162f013.json")
data_str = data.decode("utf-8")
credentials = service_account.Credentials.from_service_account_info(
    json.loads(
        data_str,
        strict=False,  # set the strict flag to False if the JSON string contains control characters that would fail the parser
    ),
    scopes=["https://www.googleapis.com/auth/spreadsheets"],
)
service = discovery.build("sheets", "v4", credentials=credentials)
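The pkgutil lookup above can be exercised locally without any Google libraries; a self-contained sketch (the demo_pkg package and its contents are made up for illustration):

```python
import json
import os
import pkgutil
import sys
import tempfile

# Build a throwaway package with a bundled resources/creds.json, then read it
# back with pkgutil.get_data -- the same call used above, which resolves the
# file relative to the package rather than the working directory.
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, "demo_pkg", "resources"))
open(os.path.join(tmp, "demo_pkg", "__init__.py"), "w").close()
with open(os.path.join(tmp, "demo_pkg", "resources", "creds.json"), "w") as f:
    json.dump({"client_email": "[email protected]"}, f)

sys.path.insert(0, tmp)
raw = pkgutil.get_data("demo_pkg", "resources/creds.json")
info = json.loads(raw.decode("utf-8"))
print(info["client_email"])  # -> [email protected]
```

Because the file is looked up through the package's loader, this works no matter what directory the Scrapy Cloud job happens to run from.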