I am learning Python and decided to build a web-scraping project using BeautifulSoup and Selenium.

Website: https://careers.amgen.com/ListJobs?

Goal: retrieve every variable related to each job posting. The variables I identified are: ID, job title, URL, city, state, zip, country, and the date the job was added.

Problem: I managed to extract the data from the first page of the table, but I cannot extract it from any of the other pages, even though I also use the control that moves to the next page.

Any help would be greatly appreciated. Please find my code below.

```
import re
import os
import selenium
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import presence_of_element_located
from bs4 import BeautifulSoup
#driver = webdriver.Chrome(ChromeDriverManager().install())
browser = webdriver.Chrome("")  # path to your chromedriver executable; adjust for your system
browser.get('https://careers.amgen.com/ListJobs?')
browser.implicitly_wait(100)  # implicit wait (seconds) applied to element lookups
soup = BeautifulSoup(browser.page_source, 'html.parser')
code_soup = soup.find_all('tr', attrs={'role': 'row'})
# creating data set
df = pd.DataFrame({'id': [],
                   'jobs': [],
                   'url': [],
                   'city': [],
                   'state': [],
                   'zip': [],
                   'country': [],
                   'added': []
                   })
d = code_soup
next_page = browser.find_element_by_xpath('//*[@id="jobGrid0"]/div[2]/a[3]/span')
for i in range(2, 12):  # catch error, out of bounds?
    df = df.append({'id': d[i].find_all("td", {"class": "DisplayJobId-cell"}),
                    "jobs": d[i].find_all("td", {"class": "JobTitle-cell"}),
                    "url": d[i].find("a").attrs['href'],
                    "city": d[i].find_all("td", {"class": "City-cell"}),
                    "state": d[i].find_all("td", {"class": "State-cell"}),
                    "zip": d[i].find_all("td", {"class": "Zip-cell"}),
                    "country": d[i].find_all("td", {"class": "Country-cell"}),
                    "added": d[i].find_all("td", {"class": "AddedOn-cell"})}, ignore_index=True)
df['url'] = 'https://careers.amgen.com/' + df['url'].astype(str)
df["company"] = "Amgen"
df
```
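Note, incidentally, that `find_all` returns a list of `Tag` objects, so each cell of `df` ends up holding something like `[<td class="City-cell">...</td>]` rather than plain text; using `d[i].find("td", {"class": "City-cell"}).get_text(strip=True)` per column would yield clean strings instead.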
```
#iterate through the pages
next_page = browser.find_element_by_xpath('//*[@id="jobGrid0"]/div[2]/a[3]/span')
for p in range(1, 7):  # go from page 1 to 6
    next_page.click()
    browser.implicitly_wait(20)
    print(p)
```
I tried multiple things; this is my latest attempt, and it did not work either:
```
p = 0
next_page = browser.find_element_by_xpath('//*[@id="jobGrid0"]/div[2]/a[3]/span')
for p in range(1, 7):
    for i in range(2, 12):
        df1 = df.append({'id': d[i].find_all("td", {"class": "DisplayJobId-cell"}),
                         "jobs": d[i].find_all("td", {"class": "JobTitle-cell"}),
                         "url": d[i].find("a").attrs['href'],
                         "city": d[i].find_all("td", {"class": "City-cell"}),
                         "state": d[i].find_all("td", {"class": "State-cell"}),
                         "zip": d[i].find_all("td", {"class": "Zip-cell"}),
                         "country": d[i].find_all("td", {"class": "Country-cell"}),
                         "added": d[i].find_all("td", {"class": "AddedOn-cell"})}, ignore_index=True)
    p += 1
    next_page.click()
    print(p)
```
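A likely cause in both attempts: `d` comes from `browser.page_source` parsed once, before any click, so every iteration appends the same first-page rows; and the `next_page` reference found before the loop goes stale once the grid re-renders after the first click. The HTML has to be re-parsed, and the arrow re-located, on each pass (see the sketch after the second answer below).

One suggested solution skips the browser entirely and queries the JSON endpoint that feeds the table:

```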
import requests
import re
import pandas as pd
params = {
    'sort': 'AddedOn-desc',
    'page': '1',
    'pageSize': '1000',
    'group': '',
    'filter': '',
    'fields': 'JobTitle,DisplayJobId,City,State,Zip,Country,AddedOn,UrlJobTitle'
}

headers = {
    "Origin": 'https://careers.amgen.com'
}

def main(url):
    r = requests.get(url)
    # the page embeds the grid's API endpoint as JobsApiUrl="..."
    api = re.search(r'JobsApiUrl="(.*?)"', r.text).group(1)
    r = requests.get(api, params=params, headers=headers).json()
    df = pd.DataFrame(r['Data'])
    print(df)
    df.to_csv("data.csv", index=False)

main("https://careers.amgen.com/ListJobs")
```
Output: a sample can be viewed online.
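The trick in this answer is that the careers page embeds the address of the JSON API backing the grid in its own HTML: the regex pulls out the `JobsApiUrl` value, and the follow-up request asks that endpoint for up to 1000 rows in a single call (`pageSize`), sorted by `AddedOn`, with exactly the fields the question needs. The response's `Data` list loads straight into a DataFrame, so no pagination or browser automation is required.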
Changing one line in your code will do the job for you. Instead of the XPath you are currently using to select the "next" arrow that advances the table, use the following XPath:

```
next_page = browser.find_element_by_xpath('//a[@class="k-link k-pager-nav"]//following::a[@class="k-link k-pager-nav"]')
```
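For completeness, here is a minimal sketch of the paging loop built around that locator. It assumes six pages, as in the question, and re-parses the HTML and re-locates the arrow on every pass, since both go stale once the grid redraws:

```
import time
from bs4 import BeautifulSoup

all_rows = []
for p in range(1, 7):  # six pages, as in the question
    # parse whichever page is currently displayed
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    all_rows.extend(soup.find_all('tr', attrs={'role': 'row'}))
    if p < 6:
        # re-locate the arrow each time; the previous reference goes stale
        browser.find_element_by_xpath(
            '//a[@class="k-link k-pager-nav"]'
            '//following::a[@class="k-link k-pager-nav"]'
        ).click()
        time.sleep(2)  # crude pause while the grid redraws; WebDriverWait is cleaner
```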