How do I skip output for data that doesn't meet all requirements in BeautifulSoup and Pandas?

Question

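I'm new to Python and BeautifulSoup. I'm trying to write a script that asks what job/career you want and where you want to work, then scrapes Indeed and writes the data to an Excel file. Everything works, except that some job listings don't show a salary on Indeed itself. BeautifulSoup still scrapes every title and location but not every salary, so the output gets mixed up: jobs end up next to the wrong salary.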

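As I see it, I have two options. I could make the script output all the data and leave the salary empty for listings that don't have one, so the Excel spreadsheet shows "Salary" as blank, but I don't know how to do that.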

Or I could skip listings with no salary entirely and only parse the ones that meet all three criteria (job title, location, salary). But again, I don't know how to do that.

Here is my code. I know it's very messy; I'm new to Python.

 import requests
 from bs4 import BeautifulSoup
 import pandas

 new_jobs = []
 data = []
 new_wage = []
 new_location = []

 indeed_search = input("What jobs are you interested in?")
 for chars in indeed_search:
     chars.replace(" ","+")

 location = input("Where do you wanna work?")
 for loc in location:
     loc.replace(" ","+")

 file_name = input("What's your file going to be called?") + ".xlsx"

 for i in range(1,10):
     page = requests.get("https://www.indeed.co.uk/jobs?q=" + indeed_search + "&l=" + "&start={}".format(i))
     soup = BeautifulSoup(page.content, "html.parser")
     for wage in soup.find_all(class_="salaryText"):
         new_wage.append(wage.text.strip())

     for jobs in soup.find_all(class_="title"):
         new_jobs.append(jobs.text.strip())

     for location in soup.find_all(class_="location accessible-contrast-color-location"):
         new_location.append(location.text.strip())

     data_new = list(zip(new_jobs, new_location, new_wage))
     d = pandas.DataFrame(data_new, columns=["Job", "Location", "Salary"])
     d.to_excel("C://Users//steve//Downloads//" + file_name)
python pandas beautifulsoup
1 Answer

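First, I think you need to find the job cards and then loop over them. Within each card you can look up the title, salary and location, checking whether each one exists and substituting np.nan or some other placeholder if it doesn't.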
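To illustrate why collecting each field into its own list goes wrong, here is a minimal sketch in plain Python. The job titles, locations and salaries are made-up sample values, not scraped data. It shows that `zip` pairs items purely by position, so one missing salary shifts every later salary onto the wrong row.

```python
# Hypothetical sample data: three listings, but only two of them show a salary.
titles = ["Data Analyst", "Python Developer", "QA Tester"]
locations = ["London", "Leeds", "Bristol"]
salaries = ["£30,000 a year", "£25,000 a year"]  # "Python Developer" listed none

# zip pairs items by position and stops at the shortest input, so the
# QA Tester's salary is attached to the Python Developer row and the
# QA Tester row is dropped altogether.
rows = list(zip(titles, locations, salaries))
for row in rows:
    print(row)
```

Looking up each field inside its own card avoids this, because a missing salary stays attached to the listing it belongs to.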

So I adjusted your code a little, as shown below:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

data = []

# str.replace returns a new string, so the result has to be assigned back
indeed_search = input("What jobs are you interested in?").replace(" ", "+")
location = input("Where do you wanna work?").replace(" ", "+")
file_name = input("What's your file going to be called?") + ".xlsx"

# Two requests: start=0 and start=10
for i in range(0, 20, 10):
    url = "https://www.indeed.co.uk/jobs?q=" + indeed_search + "&l=" + location + "&start={}".format(i)
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    # Each job card holds one listing, so its fields stay together
    for job in soup.find_all(class_='jobsearch-SerpJobCard'):
        title = job.find(class_="title")
        salary = job.find(class_="salaryText")
        loc = job.find(class_="location")

        # find() returns None when the element is missing; store NaN instead
        # (np.nan rather than np.NaN, which was removed in NumPy 2.0)
        title = title.get_text().strip() if title else np.nan
        salary = salary.get_text().strip() if salary else np.nan
        loc = loc.get_text().strip() if loc else np.nan
        data.append((title, salary, loc))

# Build the DataFrame once, after all pages have been scraped
d = pd.DataFrame(data, columns=['title', 'salary', 'location'])
d.to_excel("C://Users//steve//Downloads//" + file_name)
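The question also asked about skipping listings with no salary altogether. Below is a rough offline sketch of both options, run against a small hard-coded HTML snippet rather than the live site. The markup, the sample listings and the `text_or_none` helper are assumptions made for illustration; only the class names are taken from the code above, and the real page structure may differ.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the class names used above.
# The second card deliberately has no salary.
SAMPLE_HTML = """
<div class="jobsearch-SerpJobCard">
  <h2 class="title">Data Analyst</h2>
  <span class="location">London</span>
  <span class="salaryText">£30,000 a year</span>
</div>
<div class="jobsearch-SerpJobCard">
  <h2 class="title">Python Developer</h2>
  <span class="location">Leeds</span>
</div>
"""


def text_or_none(card, class_name):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = card.find(class_=class_name)
    return tag.get_text(strip=True) if tag else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = [
    {
        "title": text_or_none(card, "title"),
        "salary": text_or_none(card, "salaryText"),
        "location": text_or_none(card, "location"),
    }
    for card in soup.find_all(class_="jobsearch-SerpJobCard")
]

# Option 1: keep every listing; a missing salary is stored as a missing value.
all_jobs = pd.DataFrame(rows, columns=["title", "salary", "location"])

# Option 2: keep only the listings that have a salary.
with_salary = all_jobs.dropna(subset=["salary"])

print(all_jobs)
print(with_salary)
```

Either frame can then be passed to `to_excel` in the same way as `d` above. With the first option the missing salaries should appear as empty cells in the spreadsheet.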