Scraping search results with Python

Question

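I am using Python to scrape Google search results: for each keyword I fetch the number of search results and save it to a CSV file. After nearly 100 keyword searches, however, it shows "Not found" instead of a number. I suspect I am being blocked by Google or something similar. I have heard of IP blocking and API limits, but I don't know whether either of these is the problem.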

This is the code I am using:

import requests
import csv
from bs4 import BeautifulSoup

# Read the keywords from a file
with open("keywords.txt", "r") as file:
    keywords = file.read().splitlines()

# Define the User-Agent header
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"}

# Create a new CSV file and write the headers
with open("results.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["Keyword", "Total Results"])

    # Perform the search for each keyword and write the total number of results to the CSV file
    for keyword in keywords:
        response = requests.get(f"https://www.google.com/search?q={keyword}", headers=headers)
        soup = BeautifulSoup(response.content, "html.parser")
        result_stats = soup.find("div", {"id": "result-stats"})
        if result_stats:
            total_results = result_stats.get_text().split()[1].replace(",", "")
            writer.writerow([keyword, total_results])
            print(f"Keyword: {keyword}, Total Results: {total_results}")
        else:
            writer.writerow([keyword, "Not found"])
            print(f"Keyword: {keyword}, Total Results: Not found")

If there is a problem with the code, please fix it; if not, please tell me what I should do.

I have tried Python with BeautifulSoup (bs4) and requests.

Answer 1

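As you suspect, you are probably running into Google's anti-scraping measures, such as IP blocking or rate limiting. When you send too many requests to Google in a short period, Google may start blocking your IP address or challenging your requests (for example with a CAPTCHA) to prevent further scraping. A blocked response does not contain the result-stats element, so your code falls through to the "Not found" branch.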
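To tell a block apart from a page that merely lacks the counter, the loop could inspect the response before parsing it. The sketch below is only an illustration: the helper name classify_response is hypothetical, and the two block signals it checks (HTTP status 429 and a redirect to a URL containing "/sorry/") are assumptions about how Google reports rate limiting, not documented guarantees.

```python
def classify_response(status_code, final_url, html):
    """Roughly classify a search response (hypothetical helper).

    Assumption: a rate-limited request comes back with HTTP 429 or is
    redirected to a URL containing "/sorry/". Verify this against the
    responses you actually receive.
    """
    if status_code == 429 or "/sorry/" in final_url:
        return "blocked"
    if status_code != 200:
        return "error"
    # Same element the question's code looks up with BeautifulSoup.
    if 'id="result-stats"' in html:
        return "ok"
    return "no-stats"
```

Inside the question's loop this could be called as classify_response(response.status_code, response.url, response.text), writing "Blocked" rather than "Not found" to the CSV so the two cases can be distinguished later.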

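To avoid these problems, you can try the following:

1) Slow down your requests: add a delay between requests so that you do not send too many in a short period. You can use the time.sleep() function to add the delay.

2) Use proxies: route your requests through proxies so that they come from different IP addresses and are less likely to be blocked by Google. Many proxy services are available.

3) Use a Google API: instead of scraping Google search results, you can retrieve them through the Google Custom Search API. It is designed for this purpose and is much less likely to trigger anti-scraping measures. (The Google Search Console API is also sometimes suggested, but it reports search performance data for sites you own rather than general search results.)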
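The first suggestion can be sketched as a retry loop with growing delays. This is a minimal sketch, not a definitive implementation: fetch_with_backoff, its parameters, and the fetch callable are all hypothetical names, and the delay values are arbitrary starting points. The fetch callable is assumed to return the result count, or None when the request looks blocked.

```python
import random
import time


def fetch_with_backoff(fetch, keyword, max_attempts=4, base_delay=2.0,
                       sleep=time.sleep):
    """Call fetch(keyword) until it returns a value or attempts run out.

    fetch is a hypothetical callable: it should return the result count,
    or None when the request appears to have been blocked.
    """
    for attempt in range(max_attempts):
        result = fetch(keyword)
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2x, 4x, ... plus up to one
            # second of random jitter so requests are not evenly spaced.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    return None
```

In the question's loop, the request and parsing for one keyword would move into the fetch callable, and a further time.sleep(random.uniform(5, 10)) between keywords would keep the overall request rate low. The sleep parameter exists only so that the delay can be replaced in tests.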
