What path should I use for the JSON files containing my account credentials when deploying a spider to Scrapy Cloud?

Problem description · Votes: 0 · Answers: 2

I want to use Scrapy Cloud to deploy a crawler. I run my spider from a main.py file using CrawlerProcess. I then use pandas to do some post-processing on the crawled data. Finally, I publish the cleaned data to a table on Google BigQuery and send an email notification.

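Roughly, my main.py looks like the sketch below (simplified; the spider name here is illustrative and the pandas clean-up is omitted):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Run the spider with the project's settings (the spider name is illustrative)
process = CrawlerProcess(get_project_settings())
process.crawl("indeed")
process.start()

# ... afterwards the scraped items are loaded into a pandas DataFrame `df`,
# cleaned, pushed to BigQuery, and a notification e-mail is sent (see below)
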
The problem I am facing is with the last two steps. To interact with GCP and Gmail, I need two JSON files containing my account credentials. My question is about the path to these two JSON files. On my local machine the JSON files sit in the project directory, so I simply reference them with

os.getcwd() + "/credentials.json"

When I deploy them to Scrapy Cloud, I get the error shown in the screenshot below.

I followed the steps shown in the Scrapy Cloud documentation, but I still run into the same error.

Here is a screenshot of my project tree.

My setup.py file:

from setuptools import setup, find_packages

setup(
    name         = 'project',
    version      = '1.0',
    packages     = find_packages(),
    scripts      = ['main_scraper_1.py'],
    package_data = {'indeed': ['*.json']},
    entry_points = {'scrapy': ['settings = indeed.settings']},
    zip_safe=False,
)

...and a code snippet showing the exact lines where the pipeline fails. My dilemma is how to reference these JSON files in my script. What path should I give them?

import os
from google.cloud import bigquery
from google.oauth2 import service_account
import yagmail
from datetime import datetime

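# `df` below is assumed to be the cleaned pandas DataFrame produced by the earlier post-processing step (not shown in this snippet)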
# Upload the results to bigquery
# First, set the credentials
key_path_local = os.getcwd() + "/bq_credentials.json" # <-- This works locally but does not work on Scrapy Cloud
credentials = service_account.Credentials.from_service_account_file(
    key_path_local, scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

# Now, instantiate the client and upload the table to BigQuery
client = bigquery.Client(project="web-scraping-371310", credentials=credentials)
job_config = bigquery.LoadJobConfig(
    schema = [
        bigquery.SchemaField("job_title_name", "STRING"), 
        bigquery.SchemaField("job_type", "STRING"), 
        bigquery.SchemaField("company_name", "STRING"), 
        bigquery.SchemaField("company_indeed_url", "STRING"), 
        bigquery.SchemaField("city", "STRING"), 
        bigquery.SchemaField("remote", "STRING"), 
        bigquery.SchemaField("salary", "STRING"), 
        bigquery.SchemaField("crawled_page_rank", "INT64"),  
        bigquery.SchemaField("job_page_url", "STRING"), 
        bigquery.SchemaField("listing_page_url", "STRING"), 
        bigquery.SchemaField("job_description", "STRING"), 
        bigquery.SchemaField("crawled_timestamp", "TIMESTAMP"), 
        bigquery.SchemaField("salary_type", "STRING"), 
        bigquery.SchemaField("salary_low", "FLOAT64"),
        bigquery.SchemaField("salary_high", "FLOAT64"),
        bigquery.SchemaField("crawler_name", "STRING"),
    ]
)
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

# Upload the table
client.load_table_from_dataframe(
    dataframe=df,
    destination="web-scraping-371310.crawled_datasets.chris_indeed_workflow",
    job_config=job_config
).result()

# Step 16: Send success E-mail
yag = yagmail.SMTP("[email protected]", oauth2_file=os.getcwd() + "/email_authentication.json") # <-- This works locally but does not work on Scrapy Cloud
contents = [
    f"This is an automatic notification to inform you that the Indeed crawler ran successfully"
]
yag.send(["[email protected]"], f"The Indeed crawler ran successfully at {datetime.now()} CET", contents)

Any insight is greatly appreciated, as I could not find a solution anywhere online. Thank you!!

google-cloud-platform deployment scrapy
2 Answers
0 votes

Create a new file called MANIFEST.in in the same directory as your setup.py file.

Inside it, put:

include *.json

You can also remove the package_data line from your setup.py file, since it is not doing anything.

Then repackage your project and redeploy it to Scrapy Cloud.
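
Once the JSON files are bundled this way, they can be opened at run time without relying on the working directory by reading them as package data, e.g. with pkgutil. A minimal sketch, assuming the key file ends up inside the indeed package named in the question's setup.py:

import pkgutil

# Sketch: read the bundled key file from inside the deployed package.
# The package name and file name are taken from the question and may differ.
data = pkgutil.get_data("indeed", "bq_credentials.json")  # errors out if the file was not bundled
print("bundled key file size:", len(data))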


0 votes

Following the Zyte (Scrapy Cloud) support article on deploying non-code files, I was able to authenticate against the Google Cloud APIs successfully.

Note where I placed the credentials file and the matching configuration in my setup.py file.

With the above configuration I can read the credentials file on the Zyte cloud, and I can authenticate to Google Cloud with the following:

import json
import pkgutil

from google.oauth2 import service_account
from googleapiclient import discovery

# Read the credentials file bundled inside the "contractor" package
data = pkgutil.get_data("contractor", "resources/pipedrive-automations-381a9162f013.json")
data_str = data.decode('utf-8')

credentials = service_account.Credentials.from_service_account_info(
    json.loads(
        data_str,
        strict=False  # set the strict flag to False if the JSON string contains control characters that would otherwise fail the parser
    ),
    scopes=['https://www.googleapis.com/auth/spreadsheets']
)

service = discovery.build('sheets', 'v4', credentials=credentials)
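
The BigQuery step in the question can be adapted the same way (from_service_account_info instead of from_service_account_file). The e-mail step is slightly different, because yagmail's oauth2_file argument expects a real filesystem path rather than a bytes object. One possible workaround, sketched here as an assumption rather than something taken from the Zyte article, is to write the bundled JSON out to a temporary file first:

import pkgutil
import tempfile

import yagmail

# Sketch: extract the bundled Gmail OAuth2 JSON to a temporary file so that
# yagmail can be handed a real path. Package and file names follow the
# question's project layout and may differ.
oauth_bytes = pkgutil.get_data("indeed", "email_authentication.json")
with tempfile.NamedTemporaryFile(mode="wb", suffix=".json", delete=False) as tmp:
    tmp.write(oauth_bytes)

yag = yagmail.SMTP("[email protected]", oauth2_file=tmp.name)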