Scraping Google search results with Python and BeautifulSoup

Question  Votes: 0  Answers: 1

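I have a Google query that returns about 8,000 results with links, and I only want to scrape the links (URLs) from the search results. I can get the links on the first page. Is there any way to scrape the following pages as well? Here is my code: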

# Assumes `driver` is an existing Selenium WebDriver and that `linkedin_urls`
# already holds the result elements found earlier (that part is not shown).
from time import sleep
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

for page in range(0, 7):
    linkedin_urls = [url.text for url in linkedin_urls]
    # print(linkedin_urls)
    # Loop to iterate through all links in the Google search query
    for gol_url in linkedin_urls:
        print(gol_url)
        # driver.get(Xmen_url)
        # sel = Selector(text=driver.page_source)
        sleep(3)
        # Go back to Google search
        driver.get('https://www.google.com')
        sleep(3)
        # Locate the search form by name
        search_query = driver.find_element(By.NAME, 'q')
        sleep(3)
        # Input search words
        search_query.send_keys('inurl:https://www.ama-assn.org/system/files')

        # Simulate the Return key
        search_query.send_keys(Keys.RETURN)

    # Find the next-page link in Google search
    # Next_Google_page = driver.find_element_by_link_text("Next").click()
    Next_Google_page = driver.find_element(By.LINK_TEXT, "Next").click()

    page += 1

python web-scraping beautifulsoup google-search-api
1 answer
0
votes

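Google Search no longer uses pagination; it uses infinite scroll instead. You need to scroll to the bottom of the page and wait for more results to load automatically. Once automatic loading stops, you have to click "More results" to see further results.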

Here is sample code that uses Selenium to keep scrolling until the Google search page stops loading new results.

import time
from selenium import webdriver

# Placeholder: replace with the URL of your Google search query
search_query_link = 'google_search_query_link'
driver = webdriver.Chrome()
driver.get(search_query_link)

current_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll to the bottom and give the page time to load more results
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(5)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == current_height:
        # The height stopped growing, so nothing new was loaded
        break
    current_height = new_height

# Your code to extract all the links goes here
driver.quit()
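
For the "extract all the links" step, here is a minimal sketch of one way to do it. It parses the page HTML with the standard library's html.parser rather than BeautifulSoup, so it runs without extra installs. The names `extract_result_links` and `_LinkCollector` are made up for illustration, and the rule "keep absolute http(s) links that are not on google.com" is only an assumption about what counts as a result link. Country domains such as google.co.uk would need their own check.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class _LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag in order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_result_links(page_source):
    """Return unique absolute http(s) links that do not point at google.com."""
    collector = _LinkCollector()
    collector.feed(page_source)
    links = []
    for href in collector.hrefs:
        parsed = urlparse(href)
        host = parsed.netloc.lower()
        if parsed.scheme not in ("http", "https"):
            continue  # skips relative links such as /search?q=...
        if host == "google.com" or host.endswith(".google.com"):
            continue  # assumption: google.com links are navigation, not results
        if href not in links:
            links.append(href)
    return links
```

With the script above you would call `links = extract_result_links(driver.page_source)` before `driver.quit()`.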

You can further wrap the scrolling code above in a loop that clicks "More results" each time it appears.
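
A rough sketch of that outer loop, under a few assumptions: the function names are invented for illustration, the "More results" control is assumed to be findable by its link text (check this against the live page and adjust `more_locator` if needed), and `max_clicks` is a safety cap added here. The string "link text" is the value of Selenium's `By.LINK_TEXT`, so the sketch needs no Selenium import of its own and works with any driver object passed in.

```python
import time


def scroll_to_bottom_until_stable(driver, pause=5.0):
    """Scroll down repeatedly until the page height stops growing."""
    height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == height:
            return
        height = new_height


def load_all_results(driver, pause=5.0, max_clicks=50,
                     more_locator=("link text", "More results")):
    """Alternate scrolling and clicking "More results"; return the click count."""
    clicks = 0
    while True:
        scroll_to_bottom_until_stable(driver, pause)
        # find_elements returns an empty list when nothing matches
        buttons = [b for b in driver.find_elements(*more_locator)
                   if b.is_displayed()]
        if not buttons or clicks >= max_clicks:
            return clicks
        buttons[0].click()
        clicks += 1
```

Call `load_all_results(driver)` after `driver.get(search_query_link)` and extract the links once it returns.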
