Questions about beautifulsoup

I am trying to scrape https://www.goodreads.com/list/show/83612.NY_Times_Fiction_Best_Sellers_2015. This is the HTML I get after parsing the page:

    <td valign="top" width="100%">
      <a class="bookTitle" href="/book/show/18143977-all-the-light-we-cannot-see" itemprop="url">
        All the Light We Cannot See
      </a>
      by
      <div class="authorName__container">
        <a class="authorName" href="https://www.goodreads.com/author/show/28186.Anthony_Doerr" itemprop="url">
          Anthony Doerr
        </a>
        (Goodreads Author)
      </div>
      <div>
        4.32 avg rating — 1,622,659 ratings
      </div>
      <div style="margin-top: 5px">
        <a href="#" onclick="Lightbox.showBoxByID('score_explanation', 300); return false;">
          score: 12,706
        </a>, and
        <a href="#" id="loading_link_849498" onclick="new Ajax.Request('/list/list_book/4645304', {asynchronous:true, evalScripts:true, onFailure:function(request){Element.hide('loading_anim_849498');$('loading_link_849498').innerHTML = 'ERRORtry again';$('loading_link_849498').show();;Element.hide('loading_anim_849498');}, onLoading:function(request){;Element.show('loading_anim_849498');Element.hide('loading_link_849498')}, onSuccess:function(request){Element.hide('loading_anim_849498');Element.show('loading_link_849498');}, parameters:'authenticity_token=' + encodeURIComponent('6lhx6sV5qy11Lg1m0knlIiiBHvcno5EeLkBs5xzr5p9R3JgvjO2eYcBCjOVqr2bMfLnqe+8H9kU+pGpxJ4wsPw==')}); return false;">
          128 people voted
        </a>
        <img alt="Loading trans" class="loading" id="loading_anim_849498" src="https://s.gr-assets.com/assets/loading-trans-ced157046184c3bc7c180ffbfc6825a4.gif" style="display:none" />
      </div>
    </td>

This is the code I use for scraping:

    import bs4
    from urllib.request import urlopen as uReq
    from bs4 import BeautifulSoup as soup

    myurl = "https://www.goodreads.com/list/show/83612.NY_Times_Fiction_Best_Sellers_2015"
    myurl
    uClient = uReq(myurl)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")
    tds = page_soup.findAll("td", {"width":"100%","valign":"top"})
    for td in tds :
        juduls = td.a.select("span")
        judul = juduls[0].text
        print(judul)
        authors = td.div.select("a")
        author = authors[0].text
        print(author)
        rates = td.findAll("span", {"class":"greyText smallText uitext"})
        rating = rates[0].text.strip().replace(",",".")
        print(rating)
        scores = td.findAll("div", {"style":"margin-top: 5px"})
        score = scores[0].span.a.text.strip().replace("score: ","").replace(",",".")
        print(score)
        votes = td.findAll("div", {"style":"margin-top: 5px"})
        vote = votes[0].find("a", {"id":"loading_link_849498"}).text.strip()
        print(vote)

Result:

    All the Light We Cannot See
    Anthony Doerr
    4.32 avg rating — 1.622.659 ratings
    12.706
    128 people voted
    The Nightingale
    Kristin Hannah
    4.62 avg rating — 1.307.242 ratings
    10.824
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    e:\project\webscrapping\scrap_for.py in line 38
         36 print(score)
         37 votes = td.findAll("div", {"style":"margin-top: 5px"})
    ---> 38 vote = votes[0].find("a", {"id":"loading_link_849498"}).text.strip()
         39 print(vote)

    AttributeError: 'NoneType' object has no attribute 'text'

I know that the "id": "loading_link_849498" is different for each book. I don't know how to get the text of a tag whose id changes from book to book. Is there a solution? Thank you in advance.

PS: Sorry for my poor English; it is not my native language.

I prefer the requests API and tested it with the following code.

    import requests
    from bs4 import BeautifulSoup

    # URL of the Goodreads list
    url = "https://www.goodreads.com/list/show/83612.NY_Times_Fiction_Best_Sellers_2015"

    # Fetch the page
    response = requests.get(url)
    page_content = response.text

    # Parse the HTML
    soup = BeautifulSoup(page_content, "html.parser")

    # Find all book entries. This might need adjustments based on actual page structure.
    books = soup.find_all("tr", itemtype="http://schema.org/Book")

    for book in books:
        # Extracting title, author, and rating
        title = book.find("a", class_="bookTitle").text.strip()
        author = book.find("a", class_="authorName").text.strip()
        rating = book.find("span", class_="minirating").text.strip()

        # Extracting score
        score_tag = book.find("a", onclick=lambda x: x and "Lightbox.showBoxByID('score_explanation'" in x)
        score = score_tag.text.split(': ')[1].replace(',', '') if score_tag else "N/A"

        # Extracting votes
        votes_tag = book.find("a", id=lambda x: x and x.startswith("loading_link_"))
        votes = votes_tag.text.split(' ')[0] if votes_tag else "N/A"

        # Print the extracted details
        print(f"Title: {title}")
        print(f"Author: {author}")
        print(f"Rating: {rating}")
        print(f"Score: {score}")
        print(f"Votes: {votes}")
        print("-" * 40)
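The AttributeError in the question arises because the id of the first book (loading_link_849498) is hard-coded, so find() returns None for every other book. As a minimal offline sketch of matching on the shared id prefix instead, the snippet below uses a CSS attribute selector on a trimmed-down copy of the HTML from the question. The second table cell, its title, and its id are made up for illustration, and the helper name extract_votes is hypothetical; the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Trimmed-down sample modelled on the HTML in the question. The second
# cell (title, score, id loading_link_111111) is invented for illustration.
SAMPLE_HTML = """
<table>
  <tr>
    <td valign="top" width="100%">
      <a class="bookTitle" href="/book/show/18143977-all-the-light-we-cannot-see">All the Light We Cannot See</a>
      <div style="margin-top: 5px">
        <a href="#">score: 12,706</a>, and
        <a href="#" id="loading_link_849498">128 people voted</a>
      </div>
    </td>
  </tr>
  <tr>
    <td valign="top" width="100%">
      <a class="bookTitle" href="/book/show/0-hypothetical">Hypothetical Book</a>
      <div style="margin-top: 5px">
        <a href="#">score: 1,000</a>, and
        <a href="#" id="loading_link_111111">97 people voted</a>
      </div>
    </td>
  </tr>
</table>
"""


def extract_votes(html):
    """Return (title, votes) pairs, locating the vote link by id prefix."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for td in soup.select('td[valign="top"][width="100%"]'):
        title_tag = td.select_one("a.bookTitle")
        # [id^="..."] matches any id starting with the given prefix,
        # so the per-book numeric suffix no longer matters.
        vote_tag = td.select_one('a[id^="loading_link_"]')
        title = title_tag.get_text(strip=True) if title_tag else None
        votes = None
        if vote_tag:
            # "128 people voted" -> 128
            votes = int(vote_tag.get_text(strip=True).split()[0].replace(",", ""))
        results.append((title, votes))
    return results


votes_by_book = extract_votes(SAMPLE_HTML)
for title, votes in votes_by_book:
    print(title, votes)
```

Checking each tag for None before reading its text keeps one malformed row from aborting the whole loop, which is what happened at the second book in the question.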

python web-scraping beautifulsoup

1 answer, 0 votes

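Web scraping in Python using values exported from a spreadsheet

A. My goal: use Python to extract the unique OCPO IDs from an Excel spreadsheet, then use those IDs to scrape the corresponding company names and NIN IDs. (Note: NIN and OCPO IDs are both unique...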

python python-3.x web-scraping beautifulsoup openpyxl

1 answer, 0 votes

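Scraping a website when the page source does not contain what I see in the browser

Say I want to scrape the data (marathon running times) on this website: https://www.valenciaciudaddelrunning.com/en/marathon/2021-marathon-ranking/ When I right-click and select "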

python html selenium-webdriver web-scraping beautifulsoup

2 answers, 0 votes

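Problem downloading links ending in .zip

I am trying to batch-download a bunch of hyperlinked zip files from a web page. I am using Python and beautifulsoup. I wrote the code in Notepad and saved it as a .py file. Then I ran the code...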