How to pair scraped paragraphs with the most recently scraped heading from Wikipedia

Question (votes: 1, answers: 1)

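I am currently scraping a Wikipedia page to collect every paragraph, and I am also scraping all the headings so that I can put the two together. I then pass the result through a summarizer to extract the important information.

I am trying to pair each heading with its related paragraphs. However, when a heading has more than one paragraph, the code has no way of knowing that, so when I write everything to a text file it emits one heading followed by one paragraph, with no indication of whether the two actually belong together. I'm not sure I have explained what I need clearly, so feel free to ask questions.

The code I'm using: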

from bs4 import BeautifulSoup
import requests
from summarizer import summarize
# Here, we're just importing both Beautiful Soup and the Requests library

page_link = 'https://en.wikipedia.org/wiki/England'
# this is the url that we've already determined is safe and legal to scrape from.#

page_response = requests.get(page_link, timeout=5)
# here, we fetch the content from the url, using the requests library

page_content = BeautifulSoup(page_response.content, "html.parser")
#we use the html parser to parse the url content and store it in a variable.

# VVV this is where i find the paragraphs and the headings.
textContent = []
for i in range(0,100):
    paragraphs = page_content.find_all("p")[i].text
    while True:
        try:
            headings = page_content.find_all("h2")[i].text
            textContent.append(headings)
            break
        except IndexError:
            break
    textContent.append(paragraphs)
# this is the summariser
for i in range(len(textContent)):
    textContent[i] = summarize("{}".format(i),textContent[i], count=2)
# write to file here
with open('test.txt', 'w') as f:
    for item in textContent:
        f.write("%s\n" % item)
        f.write("\n")

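The output I currently get looks like this:

['Toponymy']

['\xa0in Europe\xa0(green &\xa0dark grey)–\xa0in the United Kingdom\xa0(green)']

['History']

['[5][6][7] It shares land borders with Wales to the west and Scotland to the north.', 'England is separated from continental Europe by the North Sea to the east and the English Channel to the south.']

And so on, until at the end there is just a cluster of paragraphs that are not paired with any heading.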
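The mismatch described above can be reproduced without any scraping. The following is a minimal sketch using made-up lists as stand-ins for the results of find_all("h2") and find_all("p"); the strings are hypothetical and not taken from the real page. It pairs heading i with paragraph i, the way the loop in the question does.

```python
# Hypothetical stand-ins for what find_all("h2") and find_all("p") return.
headings = ["Toponymy", "History"]
paragraphs = [
    "Infobox caption",        # appears on the page before any heading
    "Toponymy paragraph 1",
    "Toponymy paragraph 2",   # second paragraph under the same heading
    "History paragraph 1",
]

# Index-based pairing: heading i is emitted right before paragraph i,
# whether or not that paragraph actually sits under that heading.
paired = []
for i, paragraph in enumerate(paragraphs):
    if i < len(headings):
        paired.append(headings[i])
    paired.append(paragraph)

print(paired)
```

Because the two lists are indexed independently, "History" ends up in front of a Toponymy paragraph, and once the headings run out the remaining paragraphs are appended with no heading at all, which matches the trailing cluster seen in the output.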

Thanks.

python python-3.x web-scraping beautifulsoup
1 answer
0 votes

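Try the code below. It calls find_all('h2') to collect the headings, then uses find_next_siblings('p') to walk the p tags that follow each h2, keeping only those that come before the next h2.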

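from bs4 import BeautifulSoup
import requests

page_link = 'https://en.wikipedia.org/wiki/England'
# Fetch the page; TLS certificate verification is left at its default (enabled).
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")

textContent = []
# Skip the first h2, which is the table-of-contents heading rather than a section.
for tag in page_content.find_all('h2')[1:]:
    texth2 = tag.text.strip()
    textContent.append(texth2)
    # Visit every later sibling paragraph, but keep only those whose
    # nearest preceding h2 is the current heading.
    for item in tag.find_next_siblings('p'):
        previous_h2 = item.find_previous_sibling('h2')
        if previous_h2 is not None and previous_h2.text.strip() == texth2:
            textContent.append(item.text.strip())

print(textContent)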

Console output:

 ['Toponymy', 'The name "England" is derived from the Old English name Englaland, which means "land of the Angles".[15] The Angles were one of the Germanic tribes that settled in Great Britain during the Early Middle Ages. The Angles came from the Anglia peninsula in the Bay of Kiel area (present-day German state of Schleswig–Holstein) of the Baltic Sea.[16] The earliest recorded use of the term, as "Engla londe", is in the late-ninth-century translation into Old English of Bede\'s Ecclesiastical History of the English People. The term was then used in a different sense to the modern one, meaning "the land inhabited by the English", and it included English people in what is now south-east Scotland but was then part of the English kingdom of Northumbria. The Anglo-Saxon Chronicle recorded that the Domesday Book of 1086 covered the whole of England, meaning the English kingdom, but a few years later the Chronicle stated that King Malcolm III went "out of Scotlande into Lothian in Englaland", thus using it in the more ancient sense.[17]', 'The earliest attested reference to the Angles occurs in the 1st-century work by Tacitus, Germania, in which the Latin word Anglii is used.[18] The etymology of the tribal name itself is disputed by scholars; it has been suggested that it derives from the shape of the Angeln peninsula, an angular shape.[19] How and why a term derived from the name of a tribe that was less significant than others, such as the Saxons, came to be used for the entire country and its people is not known, but it seems this is related to the custom of calling the Germanic people in Britain Angli Saxones or English Saxons to distinguish them from continental Saxons (Eald-Seaxe) of Old Saxony between the Weser and Eider rivers in Northern Germany.[20] In Scottish Gaelic, another language which developed on the island of Great Britain, the Saxon tribe gave their name to the word for England (Sasunn);[21] similarly, the Welsh name for the English language is 
"Saesneg". A romantic name for England is Loegria, related to the Welsh word for England, Lloegr, and made popular by its use in Arthurian legend. Albion is also applied to England in a more poetic capacity,[22] though its original meaning is the island of Britain as a whole.', 'History', 'The earliest known evidence of human presence in the area now known as England was that of Homo antecessor, dating to approximately 780,000 years ago. The oldest proto-human bones discovered in England date from 500,000\xa0years ago.[23] Modern humans are known to have inhabited the area during the Upper Paleolithic period, though permanent settlements were only established within the last 6,000 years.[24][25]\nAfter the last ice age only large mammals such as mammoths, bison and woolly rhinoceros remained. Roughly 11,000\xa0years ago, when the ice sheets began to recede, humans repopulated the area; genetic research suggests they came from the northern part of the Iberian Peninsula.[26] The sea level was lower than now and Britain was connected by land bridge to Ireland and Eurasia.[27]\nAs the seas rose, it was separated from Ireland 10,000\xa0years ago and from Eurasia two millennia later.', 
    ....so on]
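The sibling-based approach above depends on each h2 and its paragraphs sharing the same parent. As an alternative, here is a minimal sketch that walks the h2 and p tags once in document order and files each paragraph under the most recent heading. The SAMPLE_HTML string, the group_paragraphs_by_heading helper, and the "Lead" label are all hypothetical names introduced for illustration. The sample deliberately wraps each h2 in a div, an assumed layout in which the heading and its paragraphs are no longer siblings.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; each h2 is nested in a wrapper div,
# so the paragraphs are siblings of the div, not of the h2.
SAMPLE_HTML = """
<div id="content">
  <p>Lead paragraph before any heading.</p>
  <div class="mw-heading"><h2>Toponymy</h2></div>
  <p>First toponymy paragraph.</p>
  <p>Second toponymy paragraph.</p>
  <div class="mw-heading"><h2>History</h2></div>
  <p>First history paragraph.</p>
</div>
"""


def group_paragraphs_by_heading(soup, lead_title="Lead"):
    """Map each h2 text to the list of paragraphs that follow it."""
    sections = {lead_title: []}  # paragraphs seen before the first h2
    current = lead_title
    # find_all with a list of names yields matches in document order.
    for tag in soup.find_all(["h2", "p"]):
        text = tag.get_text(" ", strip=True)
        if not text:
            continue  # skip empty tags
        if tag.name == "h2":
            current = text
            sections.setdefault(current, [])
        else:
            sections[current].append(text)
    return sections


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sections = group_paragraphs_by_heading(soup)

for heading, paragraphs in sections.items():
    print(heading)
    for paragraph in paragraphs:
        print("    " + paragraph)
```

Since every heading maps to a list of its own paragraphs, each section can be summarized as a unit and written to the file under its heading, so the link between the two stays explicit however many paragraphs a section has.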