HTML tags change while scraping LinkedIn with Selenium and BeautifulSoup

Problem description

I'm having an issue where I can't scrape the Education and Experience sections of a LinkedIn profile using Selenium and BeautifulSoup.

So far I have managed to scrape the name, title, and location. But for the Education and Experience sections, I noticed that the HTML tags change whenever I open the inspector, which keeps me from identifying those sections and extracting them with BeautifulSoup. Does anyone have a solution? Here is a code sample:

# soup is the BeautifulSoup object built from the profile's page source
experience = soup.find("section", {"id": "experience-section"}).find('ul')
print(experience)

li_tags = experience.find('div')
a_tags = li_tags.find("a")
job_title = a_tags.find("h3").get_text().strip()
print(job_title)

company_name = a_tags.find_all("p")[1].get_text().strip()
print(company_name)

joining_date = a_tags.find_all("h4")[0].find_all("span")[1].get_text().strip()
employment_duration = a_tags.find_all("h4")[1].find_all("span")[1].get_text().strip()
print(joining_date + ", " + employment_duration)

[Screenshot: the section id in the inspector, where the number keeps changing]

[Screenshot: the inspector output I expected]
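What I'm after is a lookup that tolerates the changing numeric suffix, something along these lines (a rough sketch, assuming soup is the parsed profile page and only the trailing number of the id changes):

import re

# match any section whose id starts with "experience-section", regardless of the suffix
experience = soup.find("section", id=re.compile(r"^experience-section"))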

Tags: python, selenium-webdriver, web-scraping, beautifulsoup, linkedin
1 Answer

You may find the following helpful. The script below first logs in to LinkedIn with an email and password, then navigates to the profile by clicking the profile avatar, and finally grabs the profile's page source so it can be parsed with BeautifulSoup.

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver import ChromeOptions, Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

options = ChromeOptions()

# start maximized and hide the "controlled by automated software" infobar
options.add_argument("--start-maximized")
options.add_experimental_option("useAutomationExtension", False)
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option(
    "prefs",
    {
        "credentials_enable_service": False,
        "profile.password_manager_enabled": False,
        "profile.default_content_setting_values.notifications": 2
        # 2 blocks notifications, 1 allows them
    },
)

driver = webdriver.Chrome(options=options)

url = "https://www.linkedin.com/uas/login"
driver.get(url)
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID,"organic-div")))
container = driver.find_element(By.ID, "organic-div")

# login: fill the email account, password
email = container.find_element(By.ID, 'username')
password = container.find_element(By.ID, 'password')
email.send_keys("xxxxxxxxxxxxxxxx")
password.send_keys("xxxxxxxxxxxxxx")
password.send_keys(Keys.ENTER)
time.sleep(2)

WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "authentication-outlet")))
# open the logged-in user's profile by clicking the avatar in the share box
driver.find_element(By.CLASS_NAME, 'share-box-feed-entry__avatar').click()

time.sleep(2)

soup = BeautifulSoup(driver.page_source, 'lxml')

# the Experience section sits under the div with id="experience"
experience_div = soup.find('div', {"id": "experience"})
exp_list = (
    experience_div.find_next('div')
    .find_next('div', {"class": "pvs-list__outer-container"})
    .find('ul')
    .find_all('li')
)

experiences = []

for each_exp in exp_list:
    # pull the logo, job title, company, timeframe and location for each entry
    company_logo = each_exp.find_next('img').get('src')
    col = each_exp.find_next("div", {"class": "display-flex flex-column full-width"})
    profile_title = col.find_next('div').find_next('span').find_next('span').text
    company_name = col.find_next('span', {"class": "t-14 t-normal"}).find_next('span').text
    timeframe = col.find_all('span', {"class": "t-14 t-normal t-black--light"})[0].find_next('span').text
    location = col.find_all('span', {"class": "t-14 t-normal t-black--light"})[1].find_next('span').text

    experiences.append({
        "company_logo": company_logo,
        "profile_title": profile_title.replace('\n', '').strip(),
        "company_name": company_name.replace('\n', '').strip(),
        "timeframe": timeframe.replace('\n', '').strip(),
        "location": location.replace('\n', '').strip(),
    })

print(experiences)

You can parse the other sections, such as Education and Certifications, the same way I did for the Experience section.
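For example, here is a rough sketch of the Education section handled the same way, assuming its markup mirrors the Experience section (the container id "education" and the class names below are carried over from the Experience markup and may differ on the live page):

# assumes the Education section uses div id="education" and the same pvs-list markup as Experience
education_div = soup.find('div', {"id": "education"})
edu_list = (
    education_div.find_next('div')
    .find_next('div', {"class": "pvs-list__outer-container"})
    .find('ul')
    .find_all('li')
)

educations = []

for each_edu in edu_list:
    col = each_edu.find_next("div", {"class": "display-flex flex-column full-width"})
    school_name = col.find_next('div').find_next('span').find_next('span').text
    degree = col.find_next('span', {"class": "t-14 t-normal"}).find_next('span').text
    years = col.find_all('span', {"class": "t-14 t-normal t-black--light"})[0].find_next('span').text

    educations.append({
        "school_name": school_name.replace('\n', '').strip(),
        "degree": degree.replace('\n', '').strip(),
        "years": years.replace('\n', '').strip(),
    })

print(educations)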
