Trying to extract reviews, but running into some problems

Question (votes: 1, answers: 2)

This is the script I use to extract the reviews of one of the books:

URL: www.goodreads.com/book/show/2657.To_Kill_a_Mockingbird

from selenium import webdriver
import time

driver = webdriver.Chrome()
time.sleep(3)

driver.get('https://www.goodreads.com/book/show/2657.To_Kill_a_Mockingbird')

time.sleep(5)

reviews = driver.find_elements_by_css_selector("div.reviewText")
for r in reviews:
    spanText = r.find_element_by_css_selector("span.readable:nth-child(2)").text
    print("Span text:", spanText)

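The problem I'm facing is that I can't extract the whole text from div.reviewText > span. That span contains two nested spans: the first holds a truncated version of the review (you have to click the ...more link to see the full text), and the second holds the full text. So I want to get the text from the second span. Can anyone help me?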

HTML (or you can visit the site using the link above)

<div class="reviewText stacked">
    <span id="reviewTextContainer35272288" class="readable">
        <span id="freeTextContainer13558188749606170457">If I could give this no stars, I would. This is possibly one of my least favorite books in the world, one that I would happily take off of shelves and stow in dark corners where no one would ever have to read it again.
            <br>
                <br>I think that To Kill A Mockingbird has such a prominent place in (American) culture because it is a naive, idealistic piece of writing in which naivete and idealism are ultimately rewarded. It's a saccharine, rose-tinted eulogy for the nineteen thirties from an orator who comes not
                </span>
                <span id="freeText13558188749606170457" style="display:none">If I could give this no stars, I would. This is possibly one of my least favorite books in the world, one that I would happily take off of shelves and stow in dark corners where no one would ever have to read it again.
                    <br>
                        <br>I think that To Kill A Mockingbird has such a prominent place in (American) culture because it is a naive, idealistic piece of writing in which naivete and idealism are ultimately rewarded. It's a saccharine, rose-tinted eulogy for the nineteen thirties from an orator who comes not to bury, but to praise. Written in the late fifties, TKAM is free of the social changes and conventions that people at the time were (and are, to some extent) still grating at. The primary dividing line in TKAM is not one of race, but is rather one of good people versus bad people -- something that, of course, Atticus and the children can discern effortlessly. 
                            <br>
                                <br>The characters are one dimensional. Calpurnia is the Negro who knows her place and loves the children; Atticus is a good father, wise and patient; Tom Robinson is the innocent wronged; Boo is the kind eccentric; Jem is the little boy who grows up; Scout is the precocious, knowledgable child. They have no identity outside of these roles. The children have no guile, no shrewdness--there is none of the delightfully subversive slyness that real children have, the sneakiness that will ultimately allow them to grow up. Jem and Scout will be children forever, existing in a world of black and white in which lacking knowledge allows people to see the truth in all of its simple, nuanceless glory. 
                                    <br>
                                        <br>I think that's why people find it soothing: TKAM privileges, celebrates, even, the child's point of view. Other YA classics--Huckleberry Finn; Catcher in the Rye; A Wrinkle in Time; The Day No Pigs Would Die; Are You There, God? It's Me, Margaret; Bridge to Terabithia--feature protagonists who are, if not actively fighting to become adults, at least fighting to find themselves as people. There is an active struggle throughout each of those books to make sense of the world, to define the world as something larger than oneself, as something that the protagonist can somehow be a part of. To Kill A Mockingbird has no struggle to become part of the world--in it, the children *are* the world, and everything else is just only relevant in as much as it affects them. There's no struggle to make sense of things, because to them, it already makes sense; there's no struggle to be a part of something, because they're already a part of everything. There's no sense of maturation--their world changes, but it leaves them, in many ways, unchanged, and because of that, it fails as a story for me. The whole point of a coming of age story--which is what TKAM is generally billed as--is that the characters come of age, or at least mature in some fashion, and it just doesn't happen. 
                                            <br>
                                                <br>All thematic issues aside, I think that the writing is very, er, uneven, shall we say? Overwhelmingly episodic, not terribly consistent, and largely as dimensionless as the characters.
                                                    <br>
                                                    </span>
                                                    <a data-text-id="13558188749606170457" href="#" onclick="swapContent($(this));; return false;">...more</a>
                                                </span>
                                            </div>
python-3.x selenium selenium-webdriver
2 Answers
0 votes

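The second span is hidden (style="display:none"), and the text property only returns visible text, so you can't get its content that way.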

Instead, try

spanText = r.find_elements_by_css_selector("span.readable > span")[-1].get_attribute('textContent')

to get the content of the hidden element.
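One thing to watch for: textContent returns the text exactly as it sits in the markup, including the line breaks and indentation around each <br>. Here is a minimal sketch of a helper that reads the last child span and collapses that whitespace. The name full_review_text is made up for illustration, and it assumes `review` is one of the div.reviewText elements from the question. It passes the locator strategy as the plain string "css selector", which is the value of Selenium's By.CSS_SELECTOR, so the find_elements(by, value) form should also work on newer Selenium 4 releases, where the find_elements_by_* helpers were removed.

```python
def full_review_text(review):
    """Return the complete text of one review element, hidden or not.

    `review` is assumed to be a Selenium WebElement for div.reviewText;
    any object exposing find_elements/get_attribute behaves the same way.
    """
    # "css selector" is the string value of selenium's By.CSS_SELECTOR.
    spans = review.find_elements("css selector", "span.readable > span")
    if not spans:
        return ""
    # The last direct child span holds the full text; if a review has
    # only one span, [-1] simply picks that one.
    raw = spans[-1].get_attribute("textContent") or ""
    # textContent keeps the markup's line breaks and indentation, so
    # collapse every run of whitespace into a single space.
    return " ".join(raw.split())
```

In the question's loop this would be called as print("Span text:", full_review_text(r)).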


0 votes

Use get_attribute() to extract the hidden content. You also don't need the sleep calls.

from selenium import webdriver

driver = webdriver.Chrome()

driver.get('https://www.goodreads.com/book/show/2657.To_Kill_a_Mockingbird')

reviews = driver.find_elements_by_css_selector("span.readable span:nth-child(2)")
for r in reviews:
    spanText = r.get_attribute('textContent')
    print("Span text:", spanText)
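Dropping the sleeps works here presumably because driver.get() blocks until the page has loaded by default. If the reviews ever turn out to be injected by JavaScript after that point, polling for them is more reliable than a fixed sleep. Selenium ships WebDriverWait for exactly this. The sketch below is a hand-rolled stand-in, with the hypothetical name wait_until, just to show the idea without depending on Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without a truthy result.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Hypothetical usage with the driver from the answer above:
#   reviews = wait_until(lambda: driver.find_elements_by_css_selector(
#       "span.readable span:nth-child(2)"))
```

An empty list is falsy, so the lambda keeps being retried until at least one matching span exists, and the call returns as soon as it does rather than always waiting a fixed number of seconds.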