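This question is related to my previous question, so here I assume that I am able to open all the "plus" signs on this ESCO web page. Once I have expanded the plus signs under "handling and disposing of waste and hazardous materials" (the skill the link above points to), how can I go from there (the expanded page) to a dataframe with two columns, one called "Skill" and the other "Skill code", with the skill names in the "Skill" column and the code of the "parent node" in the "Skill code" column? For the mock URI above, this should give something like the following in the "Skill" column:
with S6.13.0 as the skill code in the "Skill code" column.
I also tried the following, but failed miserably: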
import pandas as pd
from bs4 import BeautifulSoup

# Replace with your actual HTML file path
file_path = '/Users/federiconutarelli/Desktop/esco.html'

# Read the HTML file
with open(file_path, 'r', encoding='utf-8') as file:
    soup = BeautifulSoup(file, 'html.parser')

# Initialize lists to store skills and codes
skills = []
codes = []

# Find elements containing skills and their codes
# Replace 'skill_class_name' and 'code_class_name' with the actual class names or use other selectors based on your HTML structure
for skill_element in soup.find_all('div', class_='classification_item'):
    skill_name = skill_element.get_text(strip=True)
    skills.append(skill_name)

    # Assuming the code is in a close relation to the skill element, you might need to adjust the method of finding it
    code_element = skill_element.find_next_sibling('div', class_='main_item')
    if code_element:
        skill_code = code_element.get_text(strip=True)
        codes.append(skill_code)
    else:
        codes.append('')  # Append an empty string or None if no code is found

# Create a DataFrame
df = pd.DataFrame({
    'Skill': skills,
    'Code': codes
})

print(df)
Note: I call the URI a "mock URI" because in reality I have many more URIs and have to repeat the same process over and over again.
I found the anchor tags (<a>) easier to scrape than the divs:
import re

pattern = re.compile("(.+) - (.+)")

data = []
for anchor in soup.find_all('a', class_='change_right_content'):
    skill_name = anchor.get_text(strip=True)
    if m := pattern.match(skill_name):
        data.append(m.groups())

df = pd.DataFrame(data, columns=['Code', 'Skill'])
You can refine the regex pattern so that it matches S1.0 but not S, and so on.
:= is called the "walrus operator" and requires Python 3.8 or later.
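As a rough sketch of such a refinement, the pattern below only accepts codes made of one capital letter, a number, and at least one dotted number group, so S1.0 and S6.13.0 match while a bare S does not. The HTML snippet and the skill names in it are made up for illustration, and the exact code format is an assumption about the ESCO page, so adjust the pattern to what you actually see there.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real ESCO page may differ.
html = """
<a class="change_right_content">S - skills</a>
<a class="change_right_content">S6.13.0 - handling and disposing of waste and hazardous materials</a>
<a class="change_right_content">S6.13.1 - dispose of non-hazardous waste</a>
<a class="other">S1.0 - ignored because the class differs</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Assumed code format: a capital letter, digits, then one or more ".digits" groups.
# This accepts "S1.0" and "S6.13.0" but rejects a bare "S".
pattern = re.compile(r"([A-Z]\d+(?:\.\d+)+) - (.+)")

data = []
for anchor in soup.find_all("a", class_="change_right_content"):
    text = anchor.get_text(strip=True)
    # fullmatch requires the whole anchor text to fit the pattern
    if m := pattern.fullmatch(text):
        data.append(m.groups())

df = pd.DataFrame(data, columns=["Code", "Skill"])
print(df)
```

If some codes on the page have no dotted part (for example S1), change `(?:\.\d+)+` to `(?:\.\d+)*` so that they are kept as well.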