Web scraping Google Scholar outputs more links than expected, and I can't seem to filter them


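I am new to Python in general (hello everyone). I am trying to web scrape Google Scholar, and so far it has been going fairly well (lots of headaches, but I am getting there). My question now is about the output.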

This is my code so far.

from bs4 import BeautifulSoup
import requests
url = 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=clinical+practice+guidelines+esc&oq=clinical+practice+guidelines+'

data = requests.get(url).text

soup = BeautifulSoup(data,'html5lib')

Object = soup.find_all('h3')

num = 0
lnk = 0
for line in Object:
    for word in line:
        num=num+1
        if word.text != "":
            print('Article',num)
            print(word.text)

for link in soup.find_all('a', href=True):
    link1 = link.get('href')
    # Skip relative links, javascript: links and Google account links
    if not link1.startswith(('/', 'java', 'https://account')):
        lnk = lnk+1
        print('Link', lnk)
        print(link1)

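I can't post the results because the page flags them as spam (fml), but the output is basically all the article titles (with some errors) and all the links, each numbered one by one. There are more links than articles, and I can't seem to filter out the extra ones. The extra links are links 3, 6, 9, 11 and 14; they correspond to the links shown to the right of the article titles. (This is what I mean)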


I know there are mistakes (Article 5 reads [HTML][HTML] and Article 6 is literally blank), but my main issue is the repeated links, which make it hard to match each article with its link.

If anyone could give me some guidance, it would be extremely useful and I would be very grateful.

Here is an example of an output of the variable link:

The ESC clinical practice guidelines for the management of adult congenital heart disease 2020


I have tried filtering on the data-clk attribute, but its value is different for each link. I have also tried using ct=res&amp, but those links do not seem to be reachable, because I get a bunch of None values.
python web-scraping beautifulsoup
1 Answer

Try:

import requests
from bs4 import BeautifulSoup

url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=clinical+practice+guidelines+esc&oq=clinical+practice+guidelines+"

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

for i, a in enumerate(soup.select("h3 a"), 1):
    print(f"{i}.", "\t", a.text)
    print(a["href"])
    print("-" * 80)

Prints:

1.       The ESC clinical practice guidelines for the management of adult congenital heart disease 2020
https://academic.oup.com/eurheartj/article-abstract/41/43/4153/5944166
--------------------------------------------------------------------------------
2.       Type of evidence supporting ACC/AHA and ESC clinical practice guidelines for acute coronary syndrome
https://link.springer.com/article/10.1007/s00392-023-02262-9
--------------------------------------------------------------------------------
3.       ESC Committee for Practice Guidelines: providing knowledge to everyday clinical practice
https://academic.oup.com/cardiovascres/article-abstract/116/11/e146/5897471
--------------------------------------------------------------------------------
4.       Application of hypertension guidelines in clinical practice: implementation of the 2007 ESH/ESC European practice Guidelines in Spain
https://journals.lww.com/jhypertension/fulltext/2009/06003/2007_ESH_ESC_Practice_Guidelines_for_the.5.aspx
--------------------------------------------------------------------------------
5.       … coronary syndromes in patients presenting without persistent ST-segment elevation: key points from the ESC 2020 Clinical Practice Guidelines for the general …
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8002777/
--------------------------------------------------------------------------------
6.       Updates to the ACCF/AHA and ESC STEMI and NSTEMI guidelines: putting guidelines into clinical practice
https://www.ajconline.org/article/S0002-9149(15)00034-X/abstract
--------------------------------------------------------------------------------
7.       … Care Excellence (NICE) and European Society of Cardiology (ESC) guidelines for the diagnosis and management of stable angina: implications for clinical practice
https://openheart.bmj.com/content/3/1/e000406.abstract
--------------------------------------------------------------------------------
8.       … Force on Practice Guidelines and the European Society of Cardiology Committee for Practice Guidelines and Policy Conferences (Committee to Develop Guidelines …
https://www.jacc.org/doi/abs/10.1016/S0735-1097(01)01586-8
--------------------------------------------------------------------------------
9.       Cardiovascular prevention in clinical practice (ESC and German guidelines 2007)
https://search.proquest.com/openview/a84efbf29cc4fd5b3b4c6fd7cd274f25/1?pq-origsite=gscholar&cbl=38117
--------------------------------------------------------------------------------
10.      ESC Clinical Practice Guidelines on the management of valvular heart disease-2017 update
https://academic.oup.com/eurheartj/article-abstract/38/36/2697/4209312
--------------------------------------------------------------------------------
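
To see why restricting the search to "h3 a" drops the extra links, here is a minimal offline sketch. The HTML is a hand-written, simplified stand-in for a results page (the example.org URLs, the class names, and the extract_results helper are assumptions for illustration, not Google Scholar's actual markup or part of the answer above). It assumes that the side link (e.g. a PDF link) sits outside the h3 heading, and that a label such as [HTML] sits inside the h3 but outside the title's a tag, which would also explain the "[HTML][HTML]" title in the question.

```python
from bs4 import BeautifulSoup

# Simplified, hand-written stand-in for a results page (an assumption,
# not a copy of Google Scholar's real markup).
SAMPLE_HTML = """
<div class="gs_r">
  <div class="gs_ggs"><a href="https://example.org/paper1.pdf">[PDF] example.org</a></div>
  <div class="gs_ri">
    <h3 class="gs_rt"><a href="https://example.org/paper1">First article title</a></h3>
  </div>
</div>
<div class="gs_r">
  <div class="gs_ri">
    <h3 class="gs_rt"><span>[HTML]</span> <a href="https://example.org/paper2">Second article title</a></h3>
  </div>
</div>
"""


def extract_results(html):
    """Return one {'title', 'link'} dict per <a href> found inside an <h3>."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {"title": a.get_text(strip=True), "link": a["href"]}
        for a in soup.select("h3 a[href]")
    ]


results = extract_results(SAMPLE_HTML)

# For comparison: every link on the page, including the side PDF link.
all_links = BeautifulSoup(SAMPLE_HTML, "html.parser").find_all("a", href=True)

print(len(all_links), "links on the page,", len(results), "inside <h3>")
for i, r in enumerate(results, 1):
    print(i, r["title"], r["link"])
```

Because the title and the link come from the same a element, they stay paired and there is no need to join two separately numbered lists afterwards.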