Unable to scrape content from registry.terraform.io using the requests module

Question (votes: 0, answers: 2)

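I'm trying to use the requests module to scrape the content of the left-hand sidebar of the azurerm provider docs from this webpage (the URL is the `link` variable in the code below).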

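I've gone through the dev tools looking for any request that returns the expected results, but I couldn't find one. I also searched the page source in case the content sits inside a script tag, but found nothing there either.

I've already managed to get the content with Selenium, so I don't want to go that route.

This is my failed attempt using the requests module: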

import requests
from bs4 import BeautifulSoup

link = 'https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/api_management'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
}

with requests.Session() as session:
    session.headers.update(headers)
    res = session.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select("ul.provider-docs-menu-list .menu-list-category"):
        category_name = item.select_one("a.menu-list-category-link span.menu-list-category-link-title").get_text(strip=True)
        category_content = [i.get_text(strip=True) for i in item.select("li.menu-list-link > a")]
        print(category_name,category_content)

Expected output:

Azure Provider: Authenticating via a Service Principal and a Client Certificate
Azure Provider: Authenticating via a Service Principal and a Client Secret
Azure Provider: Authenticating via a Service Principal and OpenID Connect
Azure Provider: Authenticating via Managed Identity
Azure Provider: Authenticating via the Azure CLI

Tags: python, python-3.x, web-scraping, beautifulsoup, python-requests
2 Answers

2 votes

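HashiCorp builds the documentation dynamically through API calls, which is why the HTML returned by requests does not contain the sidebar menu.

You first need to fetch the provider version, and with it the list of provider docs. You can then request the body of each document through its provider-docs link.

For example: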

import requests
from tabulate import tabulate

provider_versions = "https://registry.terraform.io/v2/provider-versions/38614?include=provider-docs"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
}

with requests.Session() as session:
    session.headers.update(headers)
    provider_docs = session.get(provider_versions).json()
    docs = [
        [
            doc['attributes']['title'],
            f"https://registry.terraform.io{doc['links']['self']}",
        ]
        for doc in provider_docs['included']
    ]
    print(tabulate(docs, headers=['Title', 'Link']))

This should output:

Title                                                                            Link
-------------------------------------------------------------------------------  ------------------------------------------------------
private_dns_resolver_dns_forwarding_ruleset                                      https://registry.terraform.io/v2/provider-docs/2530275
sentinel_threat_intelligence_indicator                                           https://registry.terraform.io/v2/provider-docs/2530362
proximity_placement_group                                                        https://registry.terraform.io/v2/provider-docs/2529517
bot_channel_sms                                                                  https://registry.terraform.io/v2/provider-docs/2529741
elastic_cloud_elasticsearch                                                      https://registry.terraform.io/v2/provider-docs/2529924
key_vault_certificate_contacts                                                   https://registry.terraform.io/v2/provider-docs/2530018
netapp_volume                                                                    https://registry.terraform.io/v2/provider-docs/2530213
site_recovery_hyperv_replication_policy                                          https://registry.terraform.io/v2/provider-docs/2530388
subscription_policy_assignment                                                   https://registry.terraform.io/v2/provider-docs/2530499
automation_webhook                                                               https://registry.terraform.io/v2/provider-docs/2529715
dev_test_linux_virtual_machine                                                   https://registry.terraform.io/v2/provider-docs/2529897
express_route_connection                                                         https://registry.terraform.io/v2/provider-docs/2529945
kusto_iothub_data_connection                                                     https://registry.terraform.io/v2/provider-docs/2530039
logz_sub_account_tag_rule                                                        https://registry.terraform.io/v2/provider-docs/2530094
private_dns_txt_record                                                           https://registry.terraform.io/v2/provider-docs/2529511
log_analytics_cluster                                                            https://registry.terraform.io/v2/provider-docs/2530063
machine_learning_workspace                                                       https://registry.terraform.io/v2/provider-docs/2529442
dev_test_virtual_network                                                         https://registry.terraform.io/v2/provider-docs/2529900
private_endpoint                                                                 https://registry.terraform.io/v2/provider-docs/2530284

and much more ...
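
To get closer to the sidebar layout from the question, the entries in `included` could be grouped before printing. The following is only a minimal sketch: it assumes each entry's `attributes` also carries `category` and `subcategory` keys (only `title` is confirmed by the code above), and it falls back to placeholder labels when they are missing. The `group_docs` helper and the sample payload are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def group_docs(included):
    """Group doc titles by (category, subcategory).

    ASSUMPTION: 'category' and 'subcategory' are attribute names the
    registry response may provide; only 'title' is confirmed above.
    """
    groups = defaultdict(list)
    for doc in included:
        attrs = doc.get("attributes", {})
        key = (
            attrs.get("category") or "uncategorized",
            attrs.get("subcategory") or "general",
        )
        groups[key].append(attrs.get("title", ""))
    # Sort groups by key and titles alphabetically within each group
    return {key: sorted(titles) for key, titles in sorted(groups.items())}


# Hypothetical sample shaped like provider_docs['included']
sample_included = [
    {"attributes": {"title": "private_endpoint",
                    "category": "resources", "subcategory": "Network"}},
    {"attributes": {"title": "netapp_volume",
                    "category": "resources", "subcategory": "NetApp"}},
    {"attributes": {"title": "express_route_connection",
                    "category": "resources", "subcategory": "Network"}},
    {"attributes": {"title": "Azure Provider: Authenticating via the Azure CLI",
                    "category": "guides"}},
]

for (category, subcategory), titles in group_docs(sample_included).items():
    print(f"{category} / {subcategory}: {titles}")
```

With a real response, `group_docs(provider_docs['included'])` would take the place of the sample list, provided those attribute names are actually present.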

Then, using a link, you can fetch the content of the document.

For example:

    # Get the first doc and its body
    first_doc = session.get(docs[0][1]).json()
    print(first_doc['data']['attributes']['content'])

This should give you:

---
subcategory: "IoT Hub"
layout: "azurerm"
page_title: "Azure Resource Manager: azurerm_iothub_device_update_instance"
description: |-
  Manages an IoT Hub Device Update Instance.
---

# azurerm_iothub_device_update_instance

Manages an IoT Hub Device Update Instance.

## Example Usage

>> truncated <<
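
The content shown above starts with a front-matter block between two `---` lines, so the `subcategory` and `page_title` values can be pulled out of it without a YAML library. This is a rough sketch that assumes every document body follows the same layout as this sample; `parse_front_matter` is a hypothetical helper, and it only understands flat `key: value` lines plus the `|-` block form seen here.

```python
def parse_front_matter(content):
    """Split a doc body into (metadata dict, remaining Markdown).

    Handles only flat 'key: value' lines and indented continuation
    lines of a block scalar; it is not a full YAML parser.
    """
    lines = content.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, content
    meta, key = {}, None
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter: everything after it is the Markdown body
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return meta, body
        if line.startswith((" ", "\t")) and key is not None:
            # Continuation line of a block scalar such as 'description: |-'
            meta[key] = (meta[key] + " " + line.strip()).strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            key = key.strip()
            value = value.strip()
            meta[key] = "" if value in ("|", "|-", ">", ">-") else value.strip('"')
    # No closing '---' found: treat the whole text as body
    return {}, content


# Sample copied from the output above
sample = """---
subcategory: "IoT Hub"
layout: "azurerm"
page_title: "Azure Resource Manager: azurerm_iothub_device_update_instance"
description: |-
  Manages an IoT Hub Device Update Instance.
---

# azurerm_iothub_device_update_instance
"""

meta, body = parse_front_matter(sample)
print(meta["subcategory"], "|", meta["page_title"])
print(body.splitlines()[0])
```

Inside the session block, the same helper could be called on `first_doc['data']['attributes']['content']`.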

0 votes

In provider_versions = "https://registry.terraform.io/v2/provider-versions/38614?include=provider-docs", the ID for Azure is 38614.

Does anyone know the provider version IDs for OCI (Oracle Cloud), AWS, and GCP?
