I want to scrape text from a website and store it the way I need. I'd like to do this in Python using Google AI/ML services.
Here is my first attempt:
import requests
from bs4 import BeautifulSoup

def scrape_website(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup
    else:
        print(f"Error: Unable to fetch the URL. Status code: {response.status_code}")
        return None

def extract_information(soup, query):
    # Your HTML parsing logic here to extract information based on the query
    # For demonstration, let's extract the title of the page
    if query.lower() == "project name":
        project_name = soup.title.text.strip()
        return f"Project Name: {project_name}"
    else:
        return "Query not supported."

if __name__ == "__main__":
    url = input("Enter the URL: ")
    # Scrape website content
    webpage_content = scrape_website(url)
    if webpage_content:
        while True:
            query = input("Enter your question (e.g., 'Project Name', 'Status'): ")
            if query.lower() == "exit":
                break
            result = extract_information(webpage_content, query)
            print(result)
The code above gives me the output below, but it doesn't meet my expectations:

Enter the URL: https://h2v.eu/Hydrogen-valleys/crystal-brook-Hydrogen-superhub
Enter your question (e.g., 'Project Name', 'Status'): Project Name
Project Name: Hydrogen valleys | Crystal brook hydrogen superhub
Enter your question (e.g., 'Project Name', 'Status'): Status
Query not supported.
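The "Status" query fails because extract_information only handles the hard-coded "project name" case. One way it could be generalized is to look up a label in the page and return the element that follows it. A minimal sketch, assuming the target page renders each field as a label element followed by a value element (the extract_field helper and the inline HTML are illustrative, not taken from the real site):

```python
from bs4 import BeautifulSoup

def extract_field(soup, label):
    """Find a text node matching the label and return the text of the
    element that follows the label's tag."""
    node = soup.find(string=lambda s: s and s.strip().lower() == label.lower())
    if node is None:
        return None
    # The value is often the next sibling element of the label's tag
    value = node.find_parent().find_next_sibling()
    return value.get_text(strip=True) if value else None

html = """
<dl>
  <dt>Project name</dt><dd>Crystal Brook Hydrogen Superhub</dd>
  <dt>Status</dt><dd>In planning</dd>
</dl>
"""
soup = BeautifulSoup(html, "html.parser")
print(extract_field(soup, "Status"))  # In planning
```

This only works when labels and values sit in adjacent elements, so inspect the actual page structure and adjust the lookup accordingly.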
I also tried:

import tkinter as tk
from tkinter import ttk
from bs4 import BeautifulSoup
from google.cloud import language_v1
import requests

def scrape_and_analyze(url):
    # Web scraping
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        text_content = soup.get_text()
    except Exception as e:
        return f"Error in web scraping: {str(e)}"

    # Google Cloud Natural Language API
    try:
        client = language_v1.LanguageServiceClient()
        document = language_v1.Document(content=text_content, type_=language_v1.Document.Type.PLAIN_TEXT)
        annotations = client.analyze_entities(document=document)
        entities = annotations.entities
    except Exception as e:
        return f"Error in text analysis: {str(e)}"

    # Filter entities of interest (customize this based on your needs)
    filtered_entities = [entity.name for entity in entities if entity.type_ == language_v1.Entity.Type.PERSON]
    return filtered_entities

def on_submit():
    url = url_entry.get()
    result = scrape_and_analyze(url)
    result_text.delete(1.0, tk.END)
    result_text.insert(tk.END, "\n".join(result))

root = tk.Tk()
root.title("Web scraping and text analysis")

url_label = ttk.Label(root, text="Enter the URL:")
url_label.pack(pady=10)

url_entry = ttk.Entry(root, width=50)
url_entry.pack(pady=10)

submit_button = ttk.Button(root, text="Submit", command=on_submit)
submit_button.pack(pady=10)

result_text = tk.Text(root, height=10, width=50, wrap="word")
result_text.pack(pady=10)

root.mainloop()
This also throws errors.
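Two common reasons a script like the second one errors are missing credentials (the Natural Language client reads the GOOGLE_APPLICATION_CREDENTIALS environment variable, so authentication must be set up first) and oversized input: the API enforces a per-document content size limit (commonly cited as 1,000,000 bytes; check the current quota documentation). A stdlib-only sketch for trimming scraped text to a byte budget before sending it (truncate_utf8 is a hypothetical helper name):

```python
def truncate_utf8(text, max_bytes=1_000_000):
    """Trim text to at most max_bytes of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops any partial multi-byte sequence at the cut point
    return encoded[:max_bytes].decode("utf-8", errors="ignore")

print(truncate_utf8("héllo wörld", max_bytes=6))  # héllo
```

Calling truncate_utf8(text_content) before building the language_v1.Document would rule out the size limit as the failure cause.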
Web scraping means extracting data from websites, and it can be a useful tool. Before scraping, always check the site's robots.txt file and terms of service.

Here is a simple example that uses Python with the BeautifulSoup library for HTML parsing and requests for HTTP requests. Make sure to install these libraries first:

pip install beautifulsoup4
pip install requests

Now you can use the following example as a starting point for web scraping:
import requests
from bs4 import BeautifulSoup

def scrape_website(url):
    # Send a GET request to the URL
    response = requests.get(url)

    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the HTML content of the page
        soup = BeautifulSoup(response.text, 'html.parser')

        # Example: Extract all the links on the page
        links = soup.find_all('a')
        for link in links:
            print(link.get('href'))
    else:
        print(f"Error: Unable to retrieve the page. Status code: {response.status_code}")

# Example usage
url_to_scrape = 'https://example.com'
scrape_website(url_to_scrape)
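Since the goal is also to store the extracted text, here is a minimal stdlib sketch that appends scraped records to a JSON file (save_results and the file name are illustrative choices, not part of any library):

```python
import json
from pathlib import Path

def save_results(records, path):
    """Append a list of dict records to a JSON file, creating it if needed."""
    p = Path(path)
    existing = json.loads(p.read_text(encoding="utf-8")) if p.exists() else []
    existing.extend(records)
    p.write_text(json.dumps(existing, ensure_ascii=False, indent=2), encoding="utf-8")

save_results([{"url": "https://example.com", "project_name": "Demo"}], "scraped.json")
```

JSON keeps the field structure intact; swap in the csv module if you prefer a spreadsheet-friendly format.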
Remember: respect robots.txt.
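The robots.txt check can itself be automated with the standard library's urllib.robotparser. A minimal sketch, parsing an inline ruleset instead of fetching a real one (the Disallow rule is made up for illustration):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# In practice: rp.set_url("https://example.com/robots.txt"); rp.read()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/page"))       # True
print(rp.can_fetch("*", "https://example.com/private/x"))  # False
```

Calling can_fetch before each requests.get keeps the scraper within the site's stated rules.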