Using BeautifulSoup on a list without losing the url attribute

Question · Votes: 0 · Answers: 1

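I have successfully scraped all the li elements on the page and built a DataFrame. The part I am struggling with is extracting and keeping the "url_for_rowN" part of each row. I want a .csv that captures the two text fields plus the URL applied to one of them; the code below includes a sample of the format.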

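It would be a bonus if I could also capture the "date" from the h3 and keep it in the .csv; all of the list items on the page sit in a single list punctuated by the inserted H3 headings.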

My current code:

from bs4 import BeautifulSoup
import pandas as pd

#
# Sample format of file I want to extract 
"""
<h2>Date1</h2>
    <ol>
        <li>Part1_of_row1: <a href="url_for_row1">Part2_of row1</a></li>
        .
        .
        .
        <li>Part1_of_rowN: <a href="url_for_rown">Part2_of rown</li>
<h2>Date2</h2>
        <li>Part1_of_rowNplus1: <a href="url_for_rown">Part2_of rowNplus1</a></li>
.
.

""" 
# Desired output is a .csv where each line has ["date","part1_of_rowN","Part2_of_rowN","url_for_rowN"]

with open("myfile.html", "r") as og_file:
    page = str(og_file.read())

soup = BeautifulSoup(page, "html5lib")

list_items = soup.find_all('li')

#separate each li in the soup object into columns
list_output = []
for li in list_items:
    company = li.find_all('span')
    row = [li.text for li in company]
    list_output.append(row)
df = pd.DataFrame(list_output, columns=["Company", "Role","C","D","E"])

#clean up some crap in the Company column
df["Company"] = df["Company"].str.replace(':','')
df["Company"] = df["Company"].str.strip()


with open('scrape.csv', 'w', newline='') as file:
    df.to_csv('scrape.csv')

Tags: python html pandas beautifulsoup etl

1 answer
0 votes

Try:

import pandas as pd
from bs4 import BeautifulSoup

text = """\
<h2>Date1</h2>
    <ol>
        <li>Part1_of_row1: <a href="url_for_row1">Part2_of row1</a></li>
        <li>Part1_of_rowN: <a href="url_for_rown">Part2_of rown</li>
<h2>Date2</h2>
        <li>Part1_of_rowNplus1: <a href="url_for_rown">Part2_of rowNplus1</a></li>
    </ol>
"""

soup = BeautifulSoup(text, "html.parser")


all_data = []
for li in soup.select("li"):
    d = li.find_previous("h2")
    d = d.text if d else "-"

    text2 = li.a.text
    url = li.a["href"]
    li.a.extract()
    text1 = li.text
    all_data.append((d, text1, text2, url))

df = pd.DataFrame(
    all_data, columns=["date", "part1_of_rowN", "Part2_of_rowN", "url_for_rowN"]
)
df["part1_of_rowN"] = df["part1_of_rowN"].str.strip(" :")
print(df)

Prints:

    date       part1_of_rowN       Part2_of_rowN  url_for_rowN
0  Date1       Part1_of_row1       Part2_of row1  url_for_row1
1  Date1       Part1_of_rowN       Part2_of rown  url_for_rown
2  Date2  Part1_of_rowNplus1  Part2_of rowNplus1  url_for_rown
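
The question asks for a .csv rather than printed output, so here is a minimal sketch of the same idea that ends in `DataFrame.to_csv`. It makes a few assumptions that are not in the original post: the helper name `extract_rows` is hypothetical, the sample markup is well formed, and one row has no link, only to show one way of handling that case. Unlike the `li.a.extract()` approach above, it reads only the text nodes directly under each `li`, so the parsed tree is left unmodified.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample markup; the second row deliberately has no <a> tag.
html = """\
<h2>Date1</h2>
<ol>
  <li>Part1_of_row1: <a href="url_for_row1">Part2_of row1</a></li>
  <li>No link in this row</li>
</ol>
<h2>Date2</h2>
<ol>
  <li>Part1_of_row3: <a href="url_for_row3">Part2_of row3</a></li>
</ol>
"""


def extract_rows(markup):
    """Return (date, part1, part2, url) tuples, one per <li>."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for li in soup.find_all("li"):
        # The nearest preceding <h2> is treated as the date for this row.
        heading = li.find_previous("h2")
        date = heading.get_text(strip=True) if heading else "-"

        # Only the text nodes that are direct children of the <li>,
        # i.e. everything outside the <a> tag.
        own_text = "".join(li.find_all(string=True, recursive=False))
        part1 = own_text.strip(" :\n\t")

        # Rows without a link get empty strings instead of raising.
        link = li.find("a")
        part2 = link.get_text(strip=True) if link else ""
        url = link.get("href", "") if link else ""

        rows.append((date, part1, part2, url))
    return rows


rows = extract_rows(html)
df = pd.DataFrame(
    rows, columns=["date", "part1_of_rowN", "Part2_of_rowN", "url_for_rowN"]
)

# With no path, to_csv returns the CSV text; pass a filename such as
# "scrape.csv" to write a file instead.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Passing `index=False` keeps the DataFrame's row index out of the file, which matches the four-column layout described in the question. If the headings on the real page are `h3` rather than `h2`, the tag name in `find_previous` would need to change accordingly.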