How can I scrape data from any website using Python (e.g., with AI)?


I want to scrape text from a website and store it according to my needs. I'd like to do this in Python using Google AI/ML services.

I tried this from scratch:

import requests
from bs4 import BeautifulSoup

def scrape_website(url):
    response = requests.get(url)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup
    else:
        print(f"Error: Unable to fetch the URL. Status code: {response.status_code}")
        return None

def extract_information(soup, query):
    # Your HTML parsing logic here to extract information based on the query
    # For demonstration, let's extract the title of the page
    if query.lower() == "project name":
        project_name = soup.title.text.strip()
        return f"Project Name: {project_name}"
    else:
        return "Query not supported."

if __name__ == "__main__":
    url = input("Enter the URL: ")

    # Scrape website content
    webpage_content = scrape_website(url)

    if webpage_content:
        while True:
            query = input("Enter your question (e.g., 'Project Name', 'Status'): ")

            if query.lower() == "exit":
                break

            result = extract_information(webpage_content, query)
            print(result)

The code above gives me the output shown below, but it doesn't meet my expectations:

Enter the URL: https://h2v.eu/Hydrogen-valleys/crystal-brook-Hydrogen-superhub

Enter your question (e.g., 'Project Name', 'Status'): Project Name

Project Name: Hydrogen Valleys | Crystal Brook Hydrogen Superhub

Enter your question (e.g., 'Project Name', 'Status'): Status

Query not supported.
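(For context: the 'Status' query returns "Query not supported." simply because extract_information only implements the "project name" branch. A minimal sketch of an extra branch, assuming the target page renders the value next to a literal "Status" label, could look like this:)

def extract_information(soup, query):
    if query.lower() == "project name":
        return f"Project Name: {soup.title.text.strip()}"
    elif query.lower() == "status":
        # Assumption: the page shows the value right after a literal "Status" label.
        label = soup.find(string=lambda s: s and s.strip().lower() == "status")
        if label:
            value = label.find_next(string=True)
            if value:
                return f"Status: {value.strip()}"
        return "Status not found on the page."
    else:
        return "Query not supported."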

I also tried:

import tkinter as tk
from tkinter import ttk
from bs4 import BeautifulSoup
from google.cloud import language_v1
import requests

def scrape_and_analyze(url):
    # Web scraping
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        text_content = soup.get_text()
    except Exception as e:
        return f"Error in web scraping: {str(e)}"

    # Google Cloud Natural Language API
    try:
        client = language_v1.LanguageServiceClient()
        document = language_v1.Document(content=text_content, type_=language_v1.Document.Type.PLAIN_TEXT)
        annotations = client.analyze_entities(document=document)
        entities = annotations.entities
    except Exception as e:
        return f"Error in text analysis: {str(e)}"

    # Filter entities of interest (customize this based on your needs)
    filtered_entities = [entity.name for entity in entities if entity.type_ == language_v1.Entity.Type.PERSON]

    return filtered_entities

def on_submit():
    url = url_entry.get()
    result = scrape_and_analyze(url)
    result_text.delete(1.0, tk.END)
    result_text.insert(tk.END, "\n".join(result))

# UI setup
root = tk.Tk()
root.title("Web Scraping and Text Analysis")

# URL input
url_label = ttk.Label(root, text="Enter URL:")
url_label.pack(pady=10)

url_entry = ttk.Entry(root, width=50)
url_entry.pack(pady=10)

# Submit button
submit_button = ttk.Button(root, text="Submit", command=on_submit)
submit_button.pack(pady=10)

# Result text
result_text = tk.Text(root, height=10, width=50, wrap="word")
result_text.pack(pady=10)

# End
root.mainloop()

This errors out as well.
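(A frequent cause of failure in the second script is missing Google Cloud credentials: language_v1.LanguageServiceClient() needs a service-account key or other Application Default Credentials. Assuming that is the error here, one way to wire it up, with a hypothetical key path, is:)

import os

# Hypothetical path to a downloaded service-account key file;
# the Google Cloud client library reads this variable when authenticating.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"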

python google-chrome google-apps-script google-assistant-sdk google-assist-api
1 Answer

Web scraping involves extracting data from websites and can be a useful tool. Before scraping, always check the website's robots.txt file and terms of service.
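You can even automate the robots.txt check with Python's standard library; a minimal sketch using urllib.robotparser (the URLs are placeholders):

from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://example.com/robots.txt")
robots.read()

# can_fetch() reports whether the given user agent may crawl the URL
# under the site's published rules.
if robots.can_fetch("*", "https://example.com/some-page"):
    print("Allowed to scrape this page.")
else:
    print("Disallowed by robots.txt.")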

Here is a simple example that uses Python with the BeautifulSoup library for HTML parsing and requests for making HTTP requests. Make sure to install these libraries first:

pip install beautifulsoup4
pip install requests

Now you can use the following example as a starting point for web scraping:

import requests
from bs4 import BeautifulSoup

def scrape_website(url):
    # Send a GET request to the URL
    response = requests.get(url)

    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the HTML content of the page
        soup = BeautifulSoup(response.text, 'html.parser')

        # Example: Extract all the links on the page
        links = soup.find_all('a')
        for link in links:
            print(link.get('href'))

    else:
        print(f"Error: Unable to retrieve the page. Status code: {response.status_code}")

# Example usage
url_to_scrape = 'https://example.com'
scrape_website(url_to_scrape)

Remember:

  • Respect the website's terms of service and robots.txt.
  • Web scraping should be done responsibly and ethically. Avoid putting excessive load on the server, and consider adding delays in your code (see the sketch after this list).
  • Some websites may take measures to prevent or limit scraping. Respect those measures.
  • Websites can change their structure, so your scraping code may need updating if a site's layout changes.
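For the delay point above, a common pattern is simply to pause between requests; a minimal sketch with placeholder URLs:

import time
import requests

urls = ["https://example.com/page1", "https://example.com/page2"]

for url in urls:
    response = requests.get(url)
    print(response.status_code)

    # Pause between requests so the server isn't hit in a tight loop.
    time.sleep(2)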