Web scraping Google search results with BeautifulSoup

Question

The following code:

import requests
import sys
import webbrowser
from bs4 import BeautifulSoup

headers = {
    'User-agent':
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0"
}

res = requests.get('https://google.com/search?q='+''.join(sys.argv[1:]),
                    headers=headers,
                    cookies={'CONSENT':'YES+'})
res.raise_for_status()

soup = BeautifulSoup(res.text, 'html.parser')
link_elems = soup.select('.r a')
num_to_open = min(5, len(link_elems))

for i in range(num_to_open):
    webbrowser.open('https://google.com' + link_elems[i].get('href'))

does not open any pages; link_elems is an empty list.

This is an exercise from the book "Automate the Boring Stuff" (https://nostarch.com/automatestuff2/).

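I am still not sure how to use BeautifulSoup. The examples given in the book are much easier than this exercise. Simply copying the code does not work, which is fine, but there are probably many extra details involved. I used the line soup.select('.r a') instead of the soup.select('.package-snippet') from the original code, but it does not work.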

What is going wrong?

Answer 1

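Understanding the problem and the solution

The underlying problem is how Google structures its search results and how they are retrieved. Google's HTML is complex and can change frequently, which makes the results hard to parse.

Here is a breakdown of the problem and the solution applied:

HTML structure and selectors: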

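The original code uses soup.select('.r a') to find the search result links. However, Google's HTML structure changes often, and the .r selector may no longer match the desired links.
The updated code uses soup.xpath('//h3/parent::a') to find the links. This tends to be more robust, but it is still not guaranteed to work, because Google's content is dynamic and its markup can change.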
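
Since the question asked about BeautifulSoup, the same idea (take each h3 whose direct parent is an a element) can also be sketched with bs4. The HTML below is a made-up sample, not real Google markup, and result_links is a hypothetical helper name used only for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample markup (an assumption, not actual Google output).
SAMPLE_HTML = """
<div><a href="https://example.com/one"><h3>First result</h3></a></div>
<div><a href="/url?q=https://example.com/two"><h3>Second result</h3></a></div>
<div><span><h3>Heading without a link</h3></span></div>
<div><a href="https://example.com/nav">Navigation link, no heading</a></div>
"""


def result_links(html):
    """Return the href of every <a> that is the direct parent of an <h3>."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for heading in soup.find_all('h3'):
        parent = heading.parent
        # Mirrors the XPath //h3/parent::a: only the direct parent counts.
        if parent is not None and parent.name == 'a' and parent.get('href'):
            links.append(parent['href'])
    return links


print(result_links(SAMPLE_HTML))
```

Only the first two anchors are returned here: the third h3 is not wrapped in a link, and the fourth link contains no heading.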

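Handling Google's anti-bot measures:

Google often presents a CAPTCHA or other challenges to prevent automated scraping, which is why requests.get sometimes fails to retrieve the full content.
The updated code falls back to Selenium when requests.get does not work, allowing the user to solve the CAPTCHA manually.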
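
The updated code only falls back to Selenium when the status code is not 200. A response may also come back with status 200 but without any result markup, which would be one way to end up with an empty link_elems, so a slightly broader check might look like the sketch below. The function name and the "<h3" test are illustrative assumptions, not part of the answer's code.

```python
def needs_browser_fallback(status_code, html):
    """Hypothetical heuristic: True when the response looks unusable."""
    if status_code != 200:
        return True
    # Assumption: a page with no <h3> headings carries no result links,
    # for example a consent or challenge page served with status 200.
    return '<h3' not in html.lower()


print(needs_browser_fallback(200, '<a href="/x"><h3>Result</h3></a>'))
print(needs_browser_fallback(200, '<p>Before you continue</p>'))
```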

Libraries added and removed

Libraries added:

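lxml.html: parses the HTML content so it can be queried with XPath, which can be more precise and flexible than BeautifulSoup's CSS selectors.
selenium: automates web browser interaction, which is especially useful for handling CAPTCHAs and dynamic content that requires JavaScript execution.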

Libraries removed:

BeautifulSoup (from bs4): the updated solution no longer uses BeautifulSoup to parse the HTML.

Updated code with explanation

Here is the updated code with explanatory comments:

import requests
import sys
import webbrowser
from lxml.html import fromstring
from selenium import webdriver

headers = {
    'User-agent': "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0"
}

res = requests.get('https://google.com/search?q=' + 'new song', headers=headers, cookies={'CONSENT': 'YES+'})

# Check if the request was successful
if res.status_code != 200:
    print('Failed to fetch page')
    driver = webdriver.Chrome()  # Initialize Selenium WebDriver
    driver.get(res.url)  # Open the URL with the CAPTCHA in the browser
    input('[*] Solve the CAPTCHA and press enter to continue')  # Wait for user to solve CAPTCHA
    html = driver.page_source  # Get the HTML content after CAPTCHA is solved
    driver.quit()  # Close the browser
else:
    html = res.text  # Use the response text if no CAPTCHA was encountered

# Parse the HTML content
soup = fromstring(html)
link_elems = soup.xpath('//h3/parent::a')  # XPath to find the links in search results
num_to_open = min(5, len(link_elems))  # Limit to opening a maximum of 5 links

# Open the found links in web browser
for i in range(num_to_open):
    href = link_elems[i].get('href')
    print(href)
    # Only prepend the Google host when the href is relative
    link = href if href.startswith('http') else 'https://google.com' + href
    webbrowser.open(link)

This prints:

https://www.youtube.com/watch?v=3oa5ao0u1Ys
https://www.youtube.com/watch?v=B4wDwdtq7mc
https://www.youtube.com/playlist?list=PLcVfz1-_0rj-t2TzP5iBg1Bq_2bncnAk_
https://www.youtube.com/watch?v=DLZD47lj82o
https://gaana.com/newrelease

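This approach combines the simplicity of plain HTTP requests with Selenium's ability to handle interactive challenges, which makes scraping Google search results more robust.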
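
The hrefs printed above are absolute URLs, while other result pages may contain relative ones. A small helper can handle both cases before a link is passed to webbrowser.open. This is a minimal sketch: normalize_href is a hypothetical name, and the /url?q=<target> redirect form is an assumption about how some result links may look, not something the answer confirms.

```python
from urllib.parse import parse_qs, urljoin, urlparse

BASE_URL = 'https://google.com'


def normalize_href(href):
    """Turn an href from the results page into an absolute URL."""
    # urljoin leaves absolute URLs unchanged and resolves relative ones.
    absolute = urljoin(BASE_URL, href)
    parsed = urlparse(absolute)
    # Assumption: redirect-style links look like /url?q=<target>&...
    if parsed.netloc.endswith('google.com') and parsed.path == '/url':
        target = parse_qs(parsed.query).get('q')
        if target:
            return target[0]
    return absolute


print(normalize_href('https://www.youtube.com/watch?v=3oa5ao0u1Ys'))
print(normalize_href('/url?q=https://example.com/page&sa=U'))
```

Using urljoin instead of string concatenation avoids producing addresses such as https://google.comhttps://... when the href is already absolute.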
