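I'm trying to use the requests module to scrape the content listed on the left side of this webpage, under azurerm_provider.
I've gone through the browser dev tools looking for any request that returns the expected results, but couldn't find one. I also searched the page source in case the content sits inside a script tag, with no luck there either.
I've already managed to get the content with Selenium, so I'd rather not go down that road.
Here is my failed attempt using the requests module: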
import requests
from bs4 import BeautifulSoup

link = 'https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/api_management'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
}

with requests.Session() as session:
    session.headers.update(headers)
    res = session.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select("ul.provider-docs-menu-list .menu-list-category"):
        category_name = item.select_one("a.menu-list-category-link span.menu-list-category-link-title").get_text(strip=True)
        category_content = [i.get_text(strip=True) for i in item.select("li.menu-list-link > a")]
        print(category_name,category_content)
Expected output:
Azure Provider: Authenticating via a Service Principal and a Client Certificate
Azure Provider: Authenticating via a Service Principal and a Client Secret
Azure Provider: Authenticating via a Service Principal and OpenID Connect
Azure Provider: Authenticating via Managed Identity
Azure Provider: Authenticating via the Azure CLI
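HashiCorp builds the docs dynamically through API calls.
You need to get the provider version first, then the provider docs. Finally, you can request each doc's body using its provider-docs link.
For example: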
import requests
from tabulate import tabulate

provider_versions = "https://registry.terraform.io/v2/provider-versions/38614?include=provider-docs"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
}

with requests.Session() as session:
    session.headers.update(headers)
    provider_docs = session.get(provider_versions).json()
    docs = [
        [
            doc['attributes']['title'],
            f"https://registry.terraform.io{doc['links']['self']}",
        ]
        for doc in provider_docs['included']
    ]
    print(tabulate(docs, headers=['Title', 'Link']))
This should output:
Title Link
------------------------------------------------------------------------------- ------------------------------------------------------
private_dns_resolver_dns_forwarding_ruleset https://registry.terraform.io/v2/provider-docs/2530275
sentinel_threat_intelligence_indicator https://registry.terraform.io/v2/provider-docs/2530362
proximity_placement_group https://registry.terraform.io/v2/provider-docs/2529517
bot_channel_sms https://registry.terraform.io/v2/provider-docs/2529741
elastic_cloud_elasticsearch https://registry.terraform.io/v2/provider-docs/2529924
key_vault_certificate_contacts https://registry.terraform.io/v2/provider-docs/2530018
netapp_volume https://registry.terraform.io/v2/provider-docs/2530213
site_recovery_hyperv_replication_policy https://registry.terraform.io/v2/provider-docs/2530388
subscription_policy_assignment https://registry.terraform.io/v2/provider-docs/2530499
automation_webhook https://registry.terraform.io/v2/provider-docs/2529715
dev_test_linux_virtual_machine https://registry.terraform.io/v2/provider-docs/2529897
express_route_connection https://registry.terraform.io/v2/provider-docs/2529945
kusto_iothub_data_connection https://registry.terraform.io/v2/provider-docs/2530039
logz_sub_account_tag_rule https://registry.terraform.io/v2/provider-docs/2530094
private_dns_txt_record https://registry.terraform.io/v2/provider-docs/2529511
log_analytics_cluster https://registry.terraform.io/v2/provider-docs/2530063
machine_learning_workspace https://registry.terraform.io/v2/provider-docs/2529442
dev_test_virtual_network https://registry.terraform.io/v2/provider-docs/2529900
private_endpoint https://registry.terraform.io/v2/provider-docs/2530284
and much more ...
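Since the question is after the sidebar entries grouped under their headings, the flat listing above could be bucketed by category. The following is only a sketch: it assumes each entry in `included` carries a `category` attribute next to `title`, which is not shown anywhere in this thread, so check the real JSON first. The `sample` dict is a hand-written stand-in for `session.get(provider_versions).json()`.

```python
from collections import defaultdict


def group_titles_by_category(payload):
    """Group doc titles from a provider-versions response by category."""
    groups = defaultdict(list)
    for doc in payload.get("included", []):
        attrs = doc.get("attributes", {})
        # "category" is an ASSUMED attribute name; fall back if it is absent.
        category = attrs.get("category") or "uncategorized"
        groups[category].append(attrs.get("title", ""))
    return {category: sorted(titles) for category, titles in groups.items()}


# Hand-written stand-in for the real API response (hypothetical values).
sample = {
    "included": [
        {"attributes": {"title": "netapp_volume", "category": "resources"}},
        {"attributes": {"title": "private_endpoint", "category": "resources"}},
        {
            "attributes": {
                "title": "Azure Provider: Authenticating via the Azure CLI",
                "category": "guides",
            }
        },
    ]
}

grouped = group_titles_by_category(sample)
for category, titles in grouped.items():
    print(category, titles)
```

With the real response you would pass `provider_docs` instead of `sample`.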
Then, using those links, you can fetch each doc's content.
For example:
# Get the first doc and its body
first_doc = session.get(docs[0][1]).json()
print(first_doc['data']['attributes']['content'])
This should give you:
---
subcategory: "IoT Hub"
layout: "azurerm"
page_title: "Azure Resource Manager: azurerm_iothub_device_update_instance"
description: |-
Manages an IoT Hub Device Update Instance.
---
# azurerm_iothub_device_update_instance
Manages an IoT Hub Device Update Instance.
## Example Usage
>> truncated <<
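The content comes back as Markdown with a front-matter header between the `---` lines, and that header includes the `subcategory` the doc belongs to. If you need those fields, here is a rough sketch of pulling them out with plain string handling. It is not a full YAML parser: it only handles simple `key: value` lines and indented continuation lines like the `description` above. `split_front_matter` is a name made up for this example.

```python
def split_front_matter(content):
    """Split a Markdown doc into (front_matter_dict, body)."""
    lines = content.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, content
    # Find the line that closes the front-matter header.
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        return {}, content
    meta, key = {}, None
    for line in lines[1:end]:
        if line[:1].isspace() and key is not None:
            # Indented line: continuation of a block value such as "description: |-".
            meta[key] = (meta[key] + " " + line.strip()).strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            key, value = key.strip(), value.strip()
            meta[key] = "" if value in ("|", "|-", ">", ">-") else value.strip('"')
    return meta, "\n".join(lines[end + 1:])


# Sample taken from the output shown above (body truncated).
sample = (
    '---\n'
    'subcategory: "IoT Hub"\n'
    'layout: "azurerm"\n'
    'page_title: "Azure Resource Manager: azurerm_iothub_device_update_instance"\n'
    'description: |-\n'
    '  Manages an IoT Hub Device Update Instance.\n'
    '---\n'
    '\n'
    '# azurerm_iothub_device_update_instance\n'
)

meta, body = split_front_matter(sample)
print(meta["subcategory"], "|", meta["page_title"])
```

For anything beyond these simple headers, a real YAML parser would be the safer choice.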
In provider_versions = "https://registry.terraform.io/v2/provider-versions/38614?include=provider-docs", the ID for Azure is 38614.
Does anyone know what the provider version IDs are for OCI (Oracle Cloud), AWS, and GCP?
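One possible way to find these IDs without hard-coding them, offered as an unverified sketch: the registry appears to expose a provider lookup that can include its versions. The endpoint shape in `provider_lookup_url`, the `"provider-versions"` type string, and the `version` attribute are all assumptions that this thread does not confirm, and the IDs in `sample` are made up. The helper just picks the highest version number from such a response.

```python
REGISTRY = "https://registry.terraform.io"


def provider_lookup_url(namespace, name):
    # ASSUMED endpoint shape; inspect the real response before relying on it.
    return f"{REGISTRY}/v2/providers/{namespace}/{name}?include=provider-versions"


def version_key(version):
    # Compare "3.40.0"-style strings numerically; non-numeric parts count as 0.
    return tuple(int(part) if part.isdigit() else 0 for part in version.split("."))


def latest_version_id(payload):
    """Return the id of the highest-numbered provider version, or None."""
    versions = [
        item
        for item in payload.get("included", [])
        if item.get("type") == "provider-versions"  # assumed type name
    ]
    if not versions:
        return None
    newest = max(versions, key=lambda item: version_key(item["attributes"]["version"]))
    return newest["id"]


# Hypothetical response fragment; the ids and versions are invented.
sample = {
    "included": [
        {"type": "provider-versions", "id": "1001", "attributes": {"version": "3.9.0"}},
        {"type": "provider-versions", "id": "1002", "attributes": {"version": "3.10.0"}},
    ]
}

print(provider_lookup_url("hashicorp", "aws"))
print(latest_version_id(sample))
```

If the endpoint behaves as assumed, `session.get(provider_lookup_url("hashicorp", "aws")).json()` would replace `sample`, and the returned ID would go into the provider-versions URL in place of 38614.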