Scraping a website with dynamic JavaScript using Beautiful Soup


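I'm trying to scrape the IBM documentation. The URL I'm looking at is below. I'd like to know how to programmatically expand all the toggles in the left pane so that I can collect every URL and fetch the data.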

https://www.ibm.com/docs/en/b2b-integrator/6.1.0

It seems RPA is one workable approach: expand each toggle with a library such as Selenium and then scrape the data.

But can anyone suggest other ideas?

Thanks and regards

python selenium-webdriver beautifulsoup web-crawler
1 Answer

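The table of contents in the left pane appears to be available from a JSON endpoint, so you can request it directly instead of expanding the toggles in a browser. Try: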

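import json

import requests

# Table-of-contents endpoint for this documentation set (returns JSON)
doc_url = "https://www.ibm.com/docs/api/v1/toc/b2b-integrator/6.1.0?lang=en"


def print_topics(o, lvl=0):
    """Recursively print each topic's label and href, indented by depth."""
    if isinstance(o, dict):
        print("\t" * lvl, o.get("label"), "->", o.get("href", ""))
        for t in o.get("topics", []):
            print_topics(t, lvl + 1)
    elif isinstance(o, list):
        for v in o:
            print_topics(v, lvl)


response = requests.get(doc_url, timeout=30)
response.raise_for_status()  # fail early on an HTTP error
data = response.json()
# print(json.dumps(data, indent=4))  # uncomment to inspect the raw JSON

print_topics(data["toc"])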

This prints:

 IBM Sterling B2B Integrator -> SS3JSW_6.1.0
     IBM Sterling B2B Integrator v6.1.0 documentation -> SS3JSW_6.1.0/kc_welcome_b2bi.html
     What's new in the release? -> 
         IBM Sterling B2B Integrator -> 
             What's new in 6.1.0.0 -> SS3JSW_6.1.0/whatsnew/whats_new/integrator/integrator_whats_new.html
                 What's new in 6.1.0.1 -> SS3JSW_6.1.0/whatsnew/whats_new/integrator/integrator_whats_new_6101.html
                 What's new in 6.1.0.3 -> SS3JSW_6.1.0/whatsnew/whats_new/integrator/integrator_whats_new_6103.html
                 What's new in 6.1.0.4 -> SS3JSW_6.1.0/whatsnew/whats_new/integrator/integrator_whats_new_6104.html
                 What's new in 6.1.0.5 -> SS3JSW_6.1.0/whatsnew/whats_new/integrator/integrator_whats_new_6105.html
                 What's new in 6.1.0.5_1 -> SS3JSW_6.1.0/whatsnew/whats_new/integrator/integrator_whats_new_6105_1.html
                 What's new in 6.1.0.5_2 -> SS3JSW_6.1.0/whatsnew/whats_new/integrator/integrator_whats_new_6105_2.html
                 What's new in 6.1.0.6 -> SS3JSW_6.1.0/whatsnew/whats_new/integrator/integrator_whats_new_6106.html
                 What's new in 6.1.0.7 -> SS3JSW_6.1.0/whatsnew/whats_new/integrator/integrator_whats_new_6107.html
                 What's new in 6.1.0.8 -> SS3JSW_6.1.0/whatsnew/whats_new/integrator/integrator_whats_new_6108.html
             What's deprecated -> SS3JSW_6.1.0/whatsnew/whats_new/integrator/integrator_whats_deprecated.html
                 What's deprecated in 6.1.0.1 -> SS3JSW_6.1.0/whatsnew/whats_new/integrator/integrator_whats_deprecated_6101.html
             Support policy for container delivery models -> SS3JSW_6.1.0/whatsnew/whats_new/integrator/integrator_support_policy.html
             Resolved issues -> SS3JSW_6.1.0/whatsnew/whats_new/integrator/integrator_resolved_issues.html
             Known issues -> SS3JSW_6.1.0/whatsnew/whats_new/integrator/integrator_known_issues.html
         IBM Global Mailbox -> 
             Known issues -> SS3JSW_6.1.0/whatsnew/whats_new/globalmailbox/gm_known_issues.html
     Release Notes -> SS3JSW_6.1.0/ReleaseNotes.html
     APAR Fixes -> SS3JSW_6.1.0/APAR_Fixes.html
     Quick Start Guide -> SS3JSW_6.1.0/QuickStartGuide.html
     Downloading installation media and components -> SS3JSW_6.1.0/IBMB2BIntegratorDownloadDoc.html
     Overview -> 
         IBM Sterling B2B Integrator Overview -> 
             System requirements -> SS3JSW_6.1.0/overview/overview/integrator/SI_system_requirements.html
             Sterling B2B Integrator overview -> SS3JSW_6.1.0/overview/overview/integrator/si_overview.html
                 Introduction to Sterling B2B Integrator -> SS3JSW_6.1.0/overview/overview/integrator/SI_Introduction.html
                     Evolving Business and Integration Objectives  -> SS3JSW_6.1.0/overview/overview/integrator/SI_EvolvBusIntObj.html

...
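
If you need the links as a flat list rather than a printed tree, the same traversal can be written as a generator. This is only a minimal sketch, assuming the `label` / `href` / `topics` keys seen above. `iter_topics` and `sample_toc` are names made up for illustration, and the sample stands in for `data["toc"]` so the snippet runs without a network call. How an `href` maps to a full page URL is not visible in the output, so that step is left out.

```python
from typing import Iterator, Tuple


def iter_topics(node, depth: int = 0) -> Iterator[Tuple[int, str, str]]:
    """Yield (depth, label, href) for every topic with a non-empty href."""
    if isinstance(node, dict):
        href = node.get("href", "")
        if href:
            yield depth, node.get("label", ""), href
        # "topics" may be missing on leaf entries
        for child in node.get("topics") or []:
            yield from iter_topics(child, depth + 1)
    elif isinstance(node, list):
        for item in node:
            yield from iter_topics(item, depth)


# Hypothetical sample shaped like the "toc" value; replace with data["toc"]
sample_toc = {
    "label": "IBM Sterling B2B Integrator",
    "href": "SS3JSW_6.1.0",
    "topics": [
        {"label": "Release Notes", "href": "SS3JSW_6.1.0/ReleaseNotes.html"},
        {
            "label": "Overview",  # grouping node without an href
            "topics": [
                {
                    "label": "System requirements",
                    "href": "SS3JSW_6.1.0/overview/overview/integrator/SI_system_requirements.html",
                },
            ],
        },
    ],
}

hrefs = [href for _, _, href in iter_topics(sample_toc)]
for href in hrefs:
    print(href)
```

Grouping entries with an empty `href` (such as "Overview" in the output above) are skipped, so the resulting list contains only entries that point at a page.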