Search Google programmatically in Python without an API key


Is there a way to make requests to Google without an API key? I've tried several Python packages, and they work fine, except that they also return links when Google says no results were found. It would also be nice if the short page description shown under each link were included.

python-3.x
2 Answers

1 vote

To scrape Google search results, you can use the BeautifulSoup web-scraping library.

If you use requests, your request may be blocked, because the default user-agent in the requests library is python-requests, and Google may decide that you are a bot.

To get around a possible block, you can add headers with your real User-Agent to the code.
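For illustration, a minimal sketch of that idea (the User-Agent string below is just an example of a real browser string, not a required value):

import requests

# override the default "python-requests/x.y" user-agent with a real browser
# string so Google is less likely to treat the request as coming from a bot
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}

response = requests.get("https://www.google.com/search",
                        params={"q": "web scraping"},
                        headers=headers,
                        timeout=30)
print(response.status_code)  # 200 means the request was not blocked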

The next step could be to rotate the user-agent, for example switching between PC, mobile, and tablet devices, as well as between browsers, e.g. Chrome, Firefox, Safari, Edge, and so on. The most reliable approach is to use rotating proxies, user-agents, and a CAPTCHA solver.
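A rough sketch of the proxy part, assuming you have your own pool of proxies (the addresses below are placeholders, not real working proxies):

import random
import requests

# placeholder proxy URLs - replace these with proxies from your own provider
proxies_pool = [
    "http://user:pass@111.111.111.111:8080",
    "http://user:pass@222.222.222.222:8080",
]

proxy = random.choice(proxies_pool)  # pick a different proxy for each request
response = requests.get(
    "https://www.google.com/search",
    params={"q": "web scraping"},
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"},
    proxies={"http": proxy, "https": proxy},  # route the request through the chosen proxy
    timeout=30,
)
print(response.status_code)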

If we need to dynamically extract all results from all available pages, we need a while loop with a specific condition to exit it. This is non-token-based pagination: it will go through all pages no matter how many there are, so we don't hard-code a fixed range of page numbers.

Check the code in the online IDE.

from bs4 import BeautifulSoup
import requests, json, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "web scraping", # query example
    "hl": "en",          # language
    "gl": "uk",          # country of the search, UK -> United Kingdom
    "start": 0,          # results offset; 0 is the first page
    # "num": 100         # parameter defines the maximum number of results to return
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

page_limit = 10           # page limit, if you do not need to parse all pages
page_num = 0

data = []

while True:
    page_num += 1
    print(f"page: {page_num}")
        
    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')
    
    for result in soup.select(".tF2Cxc"):
        title = result.select_one(".DKV0Md").text
        try:
            snippet = result.select_one(".lEBKkf span").text
        except AttributeError:  # this result has no snippet
            snippet = None
        links = result.select_one(".yuRUbf a")["href"]
      
        data.append({
          "title": title,
          "snippet": snippet,
          "links": links
        })

    if page_num == page_limit:
        break
    if soup.select_one(".d6cvqb a[id=pnnext]"):
        params["start"] += 10
    else:
        break
print(json.dumps(data, indent=2, ensure_ascii=False))

Example output:

[
  {
    "title": "Web scraping - Wikipedia",
    "snippet": "Scraping a web page involves fetching it and extracting from it. Fetching is the downloading of a page (which a browser does when a user views a page).",
    "links": "https://en.wikipedia.org/wiki/Web_scraping"
  },
  {
    "title": "What Is Scraping | About Price & Web Scraping Tools - Imperva",
    "snippet": "Web scraping is the process of using bots to extract content and data from a website. Unlike screen scraping, which only copies pixels displayed onscreen, web ...",
    "links": "https://www.imperva.com/learn/application-security/web-scraping-attack/"
  },
  other results...
]

Alternatively, you can also use a third-party API, such as the Google Search Engine Results API from SerpApi. It's a paid API with a free plan, but it does require an API key. It returns the same results that you would see in a browser.

The difference is that it bypasses Google's blocks (including CAPTCHA), with no need to create a parser and maintain it.

Code example:

from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json

params = {
  "api_key": "...",                  # serpapi key from https://serpapi.com/manage-api-key
  "engine": "google",                # serpapi parser engine
  "q": "web scraping",               # search query
  "gl": "uk",                        # country of the search, UK -> United Kingdom
  "num": "100"                       # number of results per page (100 per page in this case)
  # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)      # where data extraction happens

page_limit = 10
organic_results_data = []
page_num = 0

while True:
    results = search.get_dict()    # JSON -> Python dictionary
    
    page_num += 1
    
    for result in results["organic_results"]:
        organic_results_data.append({
            "title": result.get("title"),
            "snippet": result.get("snippet"),
            "link": result.get("link")
        })

    if page_num == page_limit:
        break
      
    if "next_link" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break
    
print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))

Output: exactly the same as in the previous solution.


0 votes

There's no need to import lxml, since it's used internally by beautifulsoup, although I did need to run pip install lxml --upgrade. Strange.
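If you want the script to keep working even when lxml isn't installed, one option is to fall back to the parser from the standard library (a small sketch, not part of the original answer):

from bs4 import BeautifulSoup, FeatureNotFound

html = "<div class='tF2Cxc'><h3 class='DKV0Md'>example title</h3></div>"

try:
    soup = BeautifulSoup(html, "lxml")          # faster, but needs lxml installed
except FeatureNotFound:
    soup = BeautifulSoup(html, "html.parser")   # stdlib fallback, no extra install

print(soup.select_one(".DKV0Md").text)  # -> example title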

So anyway, here is my slightly modified version of Denis's non-serpapi code:

from bs4 import BeautifulSoup
import json
import random
import requests


countrycode = "DE"
params = {
    "q": f"web scraping {countrycode}",       # search query
    "start": 0,
    "num": 4,       # maximum number of results to return per page
}

uas_list = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
]
header = {"User-Agent": random.choice(uas_list)}

data = []

while True:
    print(f"header: {header['User-Agent']}")

    html = requests.get("https://www.google.com/search", params=params, headers=header, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')

    for result in soup.select(".tF2Cxc"):
        title_elem = result.select_one(".DKV0Md")
        link_elem = result.select_one(".yuRUbf a")
        snippet_elem = result.select_one(".VwiC3b span")

        # check that the elements exist before accessing their attributes
        if title_elem:
            title = title_elem.text
        else:
            title = "Title not found"
        if snippet_elem:
            snippet = snippet_elem.text
        else:
            snippet = "Snippet not found"
        if link_elem and "href" in link_elem.attrs:
            link = link_elem["href"]
        else:
            link = "Link not found"

        data.append({
            "title": title,
            "link": link,
            "snippet": snippet,
        })

    if soup.select_one(".d6cvqb a[id=pnnext]"):
        params["start"] += params["num"]  # step by the page size instead of a fixed 10
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))

The main thing I want to point out is that I was getting null for "snippet", so after a lot of digging around and inspecting the code in the dev console, I found a new class they use to hold the description (snippet_elem = result.select_one(".VwiC3b span")).
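Since Google rotates these class names from time to time, one way to make the extraction a bit more robust is to try all the selectors known so far and take the first match (a small sketch building on both answers, not code from the original):

# try the known snippet selectors in order, newest class first
SNIPPET_SELECTORS = [".VwiC3b span", ".lEBKkf span"]

def extract_snippet(result):
    for selector in SNIPPET_SELECTORS:
        elem = result.select_one(selector)
        if elem:
            return elem.text
    return "Snippet not found"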

Sorry if my answer is formatted badly, I'm not used to this way of formatting code without visual cues. Cheers!
