Iterating over a DataFrame column of URLs and parsing out an HTML tag

Question · Votes: 0 · Answers: 1

This shouldn't be too hard, but I can't figure it out; I bet I'm making a silly mistake.

Here is the code that works on a single link and returns the Zestimate (the req_headers variable prevents a captcha from being thrown):

import requests
from bs4 import BeautifulSoup

req_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

link = 'https://www.zillow.com/homedetails/1404-Clearwing-Cir-Georgetown-TX-78626/121721750_zpid/'
test_soup = BeautifulSoup(requests.get(link, headers=req_headers).content, 'html.parser')
results = test_soup.select_one('h4:contains("Home value")').find_next('p').get_text(strip=True)
print(results)

And here is the code I'm trying to get working, which should return the Zestimate for each link and add it to a new DataFrame column, but I get AttributeError: 'NoneType' object has no attribute 'find_next'. (Assume I have a DataFrame column of different Zillow house links.)

req_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

for link in df['links']:
    test_soup = BeautifulSoup(requests.get(link, headers=req_headers).content, 'html.parser')
    results = test_soup.select_one('h4:contains("Home value")').find_next('p').get_text(strip=True)
    df['zestimate'] = results
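As an aside, a defensive variant of the loop above would avoid the AttributeError by checking whether select_one found anything before calling find_next, and would collect one value per row instead of overwriting the whole 'zestimate' column on every iteration. A minimal, self-contained sketch (using stand-in HTML strings in place of the live Zillow pages, so no network calls; the column names are assumptions for illustration):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Stand-in pages: the first contains the "Home value" section, the second does not.
pages = [
    '<h4>Home value</h4><p>$300,000</p>',
    '<h4>Something else</h4><p>n/a</p>',
]
df = pd.DataFrame({'html': pages})

zestimates = []
for html in df['html']:
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.select_one('h4:contains("Home value")')
    # select_one returns None when the selector matches nothing;
    # guard before calling find_next to avoid the AttributeError.
    zestimates.append(tag.find_next('p').get_text(strip=True) if tag else None)

df['zestimate'] = zestimates  # one value per row, not the same value everywhere
```

In the real script the loop body would fetch each page with requests.get(link, headers=req_headers) as in the question; the guard works the same way either way.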

Thanks in advance for any help.

python beautifulsoup html-parsing
1 Answer · 0 votes

It turned out I had a space before and after each link in my DataFrame column :/ . That was it; the code works fine. It was just an oversight on my part. Thanks, everyone.
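For anyone hitting the same thing: the stray whitespace can be removed in one vectorized step before looping, using pandas' string accessor (the 'links' column name is from the question; the URL here is a made-up placeholder):

```python
import pandas as pd

# Placeholder link with the same kind of stray whitespace described above.
df = pd.DataFrame({'links': ['  https://www.zillow.com/homedetails/example_zpid/ ']})

# Strip leading/trailing whitespace from every link in the column.
df['links'] = df['links'].str.strip()

print(df['links'].iloc[0])  # https://www.zillow.com/homedetails/example_zpid/
```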
