I'm new to Python and just practicing. When extracting text from an HTML span tag, part of the content sits behind a "Read more" link, and unless I click it, the span tag is not updated with the full text. This means that when I run BeautifulSoup and findAll for that tag and class, only the first part, without the "read more" content, is returned as an excerpt. I can't figure out what to do. This is a text-mining exercise on hotel reviews. The code is below (only the relevant part is shown):
url_soup = soup(url_html, "html.parser")
profiles = url_soup.findAll("div", {"class": "hotels-community-tab-common-Card__card--ihfZB hotels-community-tab-common-Card__section--4r93H"})
for profile in profiles:
    Review_Body = profile.findAll("q", {"class": "location-review-review-list-parts-ExpandableReview__reviewText--gOmRC"})
    Review_Body = Review_Body[0].text.replace(",", "").replace("\r\n", "").strip(" ")
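For reference, the replace()/strip() chain in the last line behaves like this on a sample string (the review text here is made up purely for illustration):

```python
# Illustrative only: a made-up review string, showing what the
# cleaning chain in the snippet above actually does.
raw = "  Great hotel,\r\n would stay again  "
cleaned = raw.replace(",", "").replace("\r\n", "").strip(" ")
print(cleaned)  # -> Great hotel would stay again
```

Note that `.replace("\r\n", "")` deletes the line break outright, so two words separated only by "\r\n" (with no surrounding space) would be glued together.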
(Screenshots: the page before clicking "Read more", and the page after clicking it, where the full text is visible.)
As described above, this returns only the excerpt, as seen without clicking "Read more", followed by "...". Please help. PS: I don't have the Scrapy or Selenium modules installed and am not using them. Would they make this easier?
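To see why this happens: the HTML that requests downloads already contains only the truncated excerpt; the full review is injected by JavaScript after you click "Read more", so a static parser never sees it. A minimal stdlib-only illustration (the HTML snippet is made up, mimicking the page structure):

```python
import re

# Made-up server-side HTML mimicking the structure: the q tag only
# ever contains the truncated excerpt ending in "..."; the full text
# is loaded by JavaScript after clicking "Read more", so a plain
# requests/BeautifulSoup pass can never see it.
static_html = (
    '<q class="location-review-review-list-parts-ExpandableReview__reviewText--gOmRC">'
    '<span>Lovely stay, the staff were...</span></q>'
)

excerpt = re.search(r"<span>(.*?)</span>", static_html).group(1)
print(excerpt)  # -> Lovely stay, the staff were...
```

This is why tools that drive a real browser (e.g. Selenium) can help: they can click the link and read the updated DOM, which no amount of re-parsing the original response will achieve.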
I think the site you are working with (not the one you posted, but this one: link) serves a different page structure for each hotel, so unfortunately a single parser won't cover everything. Still, in case it helps, you can do the following and adapt the code on each iteration (I'd be curious to know if there is a better solution). So, here goes:
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup as _BS
url_html = "https://www.tripadvisor.com/Profile/HollyABC"
def get_web_request(url_to_open: str):
    my_header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
    request = requests.get(url=url_to_open, headers=my_header)
    return request
web_page = get_web_request(url_to_open=url_html)
my_soup = _BS(web_page.text, "lxml")
container_tag = my_soup.find_all('div', attrs={'id': 'content'})
if len(container_tag) > 1:
    exit('Error with the defined container: too many answers (len should be one).')
print('len of container_tag', len(container_tag))
row_tags = container_tag[0].find_all('div', attrs={
'class': 'social-section-core-CardSection__card_section--33UYa ui_card section'})
print('len of row_tags', len(row_tags))
if row_tags is None or len(row_tags) == 0:
    exit('No result found in container')
href_url_list = []
for row_tag in row_tags:
    # find the TripAdvisor href
    href_tag = row_tag.find_all('a', href=True)
    href = href_tag[2].get('href')
    href_url = urljoin(url_html, href)
    href_url_list.append(href_url)
print(href_url_list)
for href_url in href_url_list:
    web_page = get_web_request(url_to_open=href_url)
    my_soup = _BS(web_page.text, "lxml")
    # assuming it is always the 1st post box...
    text_tag = my_soup.find('div', attrs={'class': 'firstPostBox'})
    body_tag = text_tag.find('div', attrs={'class': 'postBody'}).find('p')
    print(body_tag.get_text())
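The urljoin call above is what turns the relative hrefs collected from the profile cards into absolute URLs you can request. A quick stdlib check (the relative path is a made-up example, not a real review URL):

```python
from urllib.parse import urljoin

base = "https://www.tripadvisor.com/Profile/HollyABC"
# Hypothetical relative href of the kind found in the profile cards
relative_href = "/ShowUserReviews-g60763-d93358-r123456789.html"

full_url = urljoin(base, relative_href)
print(full_url)
# -> https://www.tripadvisor.com/ShowUserReviews-g60763-d93358-r123456789.html
```

Because the href starts with "/", urljoin keeps the scheme and host from the base URL and replaces the whole path, which is exactly what you want here.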
With this you should get the actual URL of each hotel, but you will then have to handle the content of each different page structure yourself. I'm the first to admit this doesn't look like a great solution, but hopefully it gets the ball rolling for you or the community. Best,