I have successfully scraped all the li elements from the page and created a DataFrame. The part I'm stuck on is extracting and keeping the "url_for_rowN" portion of each row. I want a .csv that captures the two text fields plus the url attached to one of them; there is a sample of the format in the code below.
It would be a bonus if I could somehow capture the "date" from each h2 and keep it in the .csv; all the list items on the page sit in a single list punctuated by interspersed h2 headings.
My current code:
from bs4 import BeautifulSoup
import pandas as pd
#
# Sample format of file I want to extract
"""
<h2>Date1</h2>
<ol>
<li>Part1_of_row1: <a href="url_for_row1">Part2_of row1</a></li>
.
.
.
<li>Part1_of_rowN: <a href="url_for_rown">Part2_of rown</li>
<h2>Date2</h2>
<li>Part1_of_rowNplus1: <a href="url_for_rown">Part2_of rowNplus1</a></li>
.
.
"""
# Desired output is a .csv where each line has ["date","part1_of_rowN","Part2_of_rowN","url_for_rowN"]
with open("myfile.html", "r") as og_file:
    page = og_file.read()
soup = BeautifulSoup(page, "html5lib")
list_items = soup.find_all('li')
# separate each li in the soup object into columns
list_output = []
for li in list_items:
    company = li.find_all('span')
    row = [span.text for span in company]
    list_output.append(row)
df = pd.DataFrame(list_output, columns=["Company", "Role", "C", "D", "E"])
# clean up some crap in the Company column
df["Company"] = df["Company"].str.replace(':', '')
df["Company"] = df["Company"].str.strip()
df.to_csv('scrape.csv')
Try:
import pandas as pd
from bs4 import BeautifulSoup
text = """\
<h2>Date1</h2>
<ol>
<li>Part1_of_row1: <a href="url_for_row1">Part2_of row1</a></li>
<li>Part1_of_rowN: <a href="url_for_rown">Part2_of rown</li>
<h2>Date2</h2>
<li>Part1_of_rowNplus1: <a href="url_for_rown">Part2_of rowNplus1</a></li>
</ol>
"""
soup = BeautifulSoup(text, "html.parser")
all_data = []
for li in soup.select("li"):
    # the nearest preceding <h2> carries the date for this <li>
    d = li.find_previous("h2")
    d = d.text if d else "-"
    text2 = li.a.text
    url = li.a["href"]
    # remove the <a> so li.text leaves only the "Part1" portion
    li.a.extract()
    text1 = li.text
    all_data.append((d, text1, text2, url))

df = pd.DataFrame(
    all_data, columns=["date", "part1_of_rowN", "Part2_of_rowN", "url_for_rowN"]
)
df["part1_of_rowN"] = df["part1_of_rowN"].str.strip(" :")
print(df)
Prints:
date part1_of_rowN Part2_of_rowN url_for_rowN
0 Date1 Part1_of_row1 Part2_of row1 url_for_row1
1 Date1 Part1_of_rowN Part2_of rown url_for_rown
2 Date2 Part1_of_rowNplus1 Part2_of rowNplus1 url_for_rown
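To get the .csv the question asks for rather than just a printed frame, pass the DataFrame to `to_csv` with `index=False` so pandas' row numbers stay out of the file. A minimal sketch, using a couple of literal rows to stand in for the `all_data` list built above (the filename `scrape.csv` comes from the question's code):

```python
import pandas as pd

# stand-ins for the (date, part1, part2, url) tuples collected in the loop above
all_data = [
    ("Date1", "Part1_of_row1", "Part2_of row1", "url_for_row1"),
    ("Date2", "Part1_of_rowNplus1", "Part2_of rowNplus1", "url_for_rown"),
]
df = pd.DataFrame(
    all_data, columns=["date", "part1_of_rowN", "Part2_of_rowN", "url_for_rowN"]
)
# index=False keeps the 0, 1, 2, ... index column out of the output file
df.to_csv("scrape.csv", index=False)
```

`to_csv` handles quoting automatically, so commas or colons inside the text fields won't break the file.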