Extracting text from an overflowing span tag using BeautifulSoup in Python

Question

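I am new to Python and am working through practice exercises. When I extract text from an HTML span tag, part of the text sits under "read more", and the span tag is not updated with the full text unless I click on it. This means that when I run BeautifulSoup and findAll for the tag and class, only the first part, without the "read more" portion, is returned as the excerpt. I can't figure out what to do. This is for a text-mining exercise on hotel reviews. The code is below (the full script is not provided):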

# Assumed import: the "soup" alias is not defined in the posted snippet
from bs4 import BeautifulSoup as soup

# url_html holds the page source fetched earlier (not shown)
url_soup = soup(url_html, "html.parser")
profiles = url_soup.findAll("div", {"class": "hotels-community-tab-common-Card__card--ihfZB hotels-community-tab-common-Card__section--4r93H"})
for profile in profiles:
    Review_Body = profile.findAll("q", {"class": "location-review-review-list-parts-ExpandableReview__reviewText--gOmRC"})
    Review_Body = Review_Body[0].text.replace(",", "").replace("\r\n", "").strip(" ")

[Image: page without clicking "read more"]
[Image: page after clicking "read more", when the entire text till the end is visible]

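As noted above, this returns only the part shown without clicking "more", followed by "...". Please help. PS: I have not installed and am not using the Scrapy or Selenium modules. Would they make this easier?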

Answer 1

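I think the website you are working with (not the one you provided, but: link) uses a different page structure, so unfortunately I will not be of much help. However, in case it is useful, you could do the following and adapt the code for each iteration (I would be curious to know whether there is a better solution). So, again, to help: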


from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup as _BS

url_html = "https://www.tripadvisor.com/Profile/HollyABC"


def get_web_request(url_to_open: str):
    my_header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
    request = requests.get(url=url_to_open, headers=my_header)
    return request


web_page = get_web_request(url_to_open=url_html)
my_soup = _BS(web_page.text, "lxml")

container_tag = my_soup.find_all('div', attrs={'id': 'content'})
# exactly one container is expected; zero matches would break container_tag[0] below
if len(container_tag) != 1:
    raise SystemExit('Error with the defined container: expected exactly one match.')
print('len of container_tag', len(container_tag))

row_tags = container_tag[0].find_all('div', attrs={
    'class': 'social-section-core-CardSection__card_section--33UYa ui_card section'})
print('len of rows_tag', len(row_tags))

# find_all never returns None, so an emptiness check is enough
if not row_tags:
    raise SystemExit('No result found in container')

href_url_list = []
for row_tag in row_tags:
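    # find the TripAdvisor href (assumed to be the third link in each card)
    href_tag = row_tag.find_all('a', href=True)
    if len(href_tag) < 3:
        continue  # skip cards that do not have the expected links
    href = href_tag[2].get('href')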
    href_url = urljoin(url_html, href)
    href_url_list.append(href_url)

print(href_url_list)

for href_url in href_url_list:
    web_page = get_web_request(url_to_open=href_url)
    my_soup = _BS(web_page.text, "lxml")
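    # assuming it is always the 1st post box...
    text_tag = my_soup.find('div', attrs={'class': 'firstPostBox'})
    post_body = text_tag.find('div', attrs={'class': 'postBody'}) if text_tag is not None else None
    body_tag = post_body.find('p') if post_body is not None else None
    if body_tag is None:
        print('No post body found at', href_url)
        continue
    print(body_tag.get_text())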

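So you should get the actual URL for each hotel, but you will then have to handle the content of each different page structure. I did it for the first one, but to be fair, this does not look like a great solution. Hopefully it gets the ball rolling for you or the community. Best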
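One way to cope with several page structures is to keep a list of candidate CSS selectors and return the text of the first one that matches. The following is only a minimal sketch under assumptions: the helper name `extract_first_post` and the `CANDIDATE_SELECTORS` list are hypothetical, the first selector mirrors the `firstPostBox` / `postBody` classes used in the code above, and the second reuses the class name from the question. The real pages may need different selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical list: one selector per page layout you have inspected by hand.
CANDIDATE_SELECTORS = [
    # forum-style layout handled in the answer's code
    "div.firstPostBox div.postBody p",
    # review layout, using the class name quoted in the question
    "q.location-review-review-list-parts-ExpandableReview__reviewText--gOmRC",
]


def extract_first_post(html, selectors=CANDIDATE_SELECTORS):
    """Return the text of the first matching selector, or None if none match."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in selectors:
        tag = soup.select_one(selector)
        if tag is not None:
            return tag.get_text(" ", strip=True)
    return None
```

In the loop over `href_url_list`, a call such as `extract_first_post(web_page.text)` could replace the hard-coded lookups, and a `None` result would flag a page whose layout still needs a selector.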

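NB:

I am using 'lxml', which you need to install with pip, but I think it would also work with 'html.parser' (that is not the issue here).
I do not think Selenium would solve the problem, because after clicking you would still end up with a different page structure. One option is to collect the href / URL (as I did) but also collect the partial text, and then look for that partial text in a new loop over the final URLs. That should work.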
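The partial-text idea from the last note can be sketched as follows. This is an illustration under assumptions, not a tested solution for the actual site: the function names `normalize` and `find_full_text` are hypothetical, and the candidate strings are assumed to come from something like `[p.get_text() for p in my_soup.find_all('p')]` on the linked page.

```python
import re


def normalize(text):
    """Collapse runs of whitespace so excerpts and full texts compare equal."""
    return re.sub(r"\s+", " ", text).strip()


def find_full_text(excerpt, candidates):
    """Return the first candidate that starts with the truncated excerpt."""
    # Drop the trailing ellipsis ("..." or "…") that marks the cut-off point.
    prefix = normalize(excerpt).rstrip(".…").rstrip()
    if not prefix:
        return None
    for candidate in candidates:
        full = normalize(candidate)
        if full.startswith(prefix):
            return full
    return None
```

Matching on a whitespace-normalized prefix keeps the lookup independent of the exact tag that holds the full review, which is the part that differs between page structures.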
