Selecting multiple elements with BeautifulSoup and handling them individually


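I am using BeautifulSoup to parse a web page of poetry. The poems are marked up with h3 for each poem title and .line for each line of the poem. I can fetch both kinds of element and add them to a list. However, I want to convert each h3 to upper case, follow it with a line break, and then insert it into the list of lines at the position where it occurs.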

    linesArr = []
    for lines in full_text:
        booktitles = lines.select('h3')
        for booktitle in booktitles:
            linesArr.append(booktitle.text.upper())
            linesArr.append('')
        for line in lines.select('h3, .line'):
            linesArr.append(line.text)

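This code appends all the book titles to the start of the list and only then goes on to collect the h3 and .line items. I tried handling the titles inside the loop instead, like this: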

    linesArr = []
    for lines in full_text:
        for line in lines.select('h3, .line'):
            if line.find('h3'):
                linesArr.append(line.text.upper())
                linesArr.append('')
            else:
                linesArr.append(line.text)

Tags: python, web-scraping, beautifulsoup, html-parsing

1 Answer

I'm not sure what you are trying to do, but this way you get an array containing the upper-cased title followed by all of your lines:

    #!/usr/bin/python3
    # coding: utf8

    from bs4 import BeautifulSoup
    import requests

    page = requests.get("https://quod.lib.umich.edu/c/cme/CT/1:1?rgn=div2;view=fulltext")
    soup = BeautifulSoup(page.text, 'html.parser')

    title = soup.find('h3')
    full_lines = soup.find_all('div',{'class':'line'})

    linesArr = []
    linesArr.append(title.get_text().upper())
    for line in full_lines:
        linesArr.append(line.get_text())

    # Print full array with the title and text
    print(linesArr)

    # Print text here with line break
    for linea in linesArr:
        print(linea + '\n')
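
The snippet above reads only the first h3, so it does not keep several titles in their places among the lines, which is what the question asks for. In the question's second attempt, line.find('h3') searches for an h3 nested inside the current element, so it does not match when the element is itself the h3; comparing tag.name is one way around that. Below is a minimal sketch of that idea. The SAMPLE_HTML markup and the collect_lines helper are hypothetical stand-ins for illustration, not the structure of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (an assumption).
SAMPLE_HTML = """
<div class="poem">
  <h3>First title</h3>
  <div class="line">first line</div>
  <div class="line">second line</div>
  <h3>Second title</h3>
  <div class="line">third line</div>
</div>
"""


def collect_lines(html):
    """Return titles (upper-cased, followed by '') and lines in document order."""
    soup = BeautifulSoup(html, "html.parser")
    lines_arr = []
    # select() returns matches in document order, so titles stay in place.
    for tag in soup.select("h3, .line"):
        text = tag.get_text(strip=True)
        if tag.name == "h3":
            # The element itself is the title: upper-case it and add a blank entry.
            lines_arr.append(text.upper())
            lines_arr.append("")
        else:
            lines_arr.append(text)
    return lines_arr


if __name__ == "__main__":
    for entry in collect_lines(SAMPLE_HTML):
        print(entry)
```

To try this against the real page, the text returned by requests.get(...) could be passed to collect_lines in place of SAMPLE_HTML, assuming the titles and lines there really are h3 and .line elements as the question describes.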