Scraping a website with embedded JavaScript using Selenium

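I'm new to Selenium and am trying to scrape the content of this website. However, the site appears to be built from a template plus JavaScript that runs to populate it, and I don't know how to use Selenium to access the content I see, such as the title (Auf dem Bahnhof) or the objectives.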

I can find the tags of the elements I need by browsing the web developer tools, but the sample script below returns nothing for them:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import Select,WebDriverWait


class Demo():

    def demo_get_contents(self):

        # create webdriver object
        service = Service(executable_path=ChromeDriverManager().install())
        driver = webdriver.Chrome(service=service)

        driver.get('https://gloss.dliflc.edu/LessonViewer.aspx?lessonId=26143&lessonName=ger_soc434&linkTypeId=0')
        element = WebDriverWait(driver, 2).until(EC.visibility_of_all_elements_located((By.CLASS_NAME,'gloss_Overview')))
        print(element.get_attribute('text'))


demo = Demo()
demo.demo_get_contents()

I'm using Python 3.8.

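Looking at the page source, I can see JavaScript that presumably runs an accessActivity() function, as well as an iframe, but I don't know how to run that function with Selenium to reach the actual page content.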

javascript python selenium-webdriver web-scraping selenium-chromedriver
1 Answer

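As an alternative, you don't actually need Selenium here. If you inspect the network calls, you'll see that the data is available as an XML file from:

https://gloss.dliflc.edu/GlossHtml/templates/linksLO/glossLOs/ger_soc434.xml

You can use Python's built-in ElementTree library to extract the relevant lesson data.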

import requests
import xml.etree.ElementTree as ET


# XML file behind the lesson page (found via the browser's network tab)
url = 'https://gloss.dliflc.edu/GlossHtml/templates/linksLO/glossLOs/ger_soc434.xml'


def get_element_text(element):
    # Join the text of the element and all of its descendants
    return ''.join(element.itertext()).strip()


def find_elements_texts(root, tag):
    # Match every <tag> with dir='ltr' and esbox='0', at any depth
    elements = root.findall(f".//{tag}[@dir='ltr'][@esbox='0']")
    return [get_element_text(elem) for elem in elements]


response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early on an HTTP error
root = ET.fromstring(response.content)

objectives_texts = find_elements_texts(root, "OBJECTIVES")
descriptions_texts = find_elements_texts(root, "ACTY_DESCRIPTION")

print(f"Objective:\n {''.join(objectives_texts)}\n")

print(f"Descriptions:\n {descriptions_texts}")

Prints:

Objective:
 Strengthen listening skills and improve comprehension by focusing on terms related to train travel in an audio about a family at a train station before a trip.

Descriptions:
 ['Identify relevant vocabulary and get a more detailed idea of the topic.', 'Preview useful terms and expressions that appear in the upcoming dialogue.', 'Become familiar with the specifics of the situation by listening to several dialogues.', 'Transcribe portions of another dialogue.', 'Assess your knowledge by matching questions with answers.']
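
The question also asks for the title, which the snippet above does not extract. One way to find out which tag holds it is to list the tag names in the XML first and then reuse the same attribute filter. The following is only a rough sketch run against a made-up inline sample: the LESSON, TITLE and ACTIVITY tag names and the nesting are assumptions for illustration, not the confirmed layout of ger_soc434.xml, so check the real file before relying on any of them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. Tag names such as LESSON, TITLE and
# ACTIVITY are assumptions, not taken from the real ger_soc434.xml.
SAMPLE_XML = """
<LESSON>
  <TITLE dir="ltr" esbox="0">Auf dem Bahnhof</TITLE>
  <OBJECTIVES dir="ltr" esbox="0">Strengthen <b>listening</b> skills.</OBJECTIVES>
  <OBJECTIVES dir="ltr" esbox="1">Some other variant.</OBJECTIVES>
  <ACTIVITY>
    <ACTY_DESCRIPTION dir="ltr" esbox="0">Identify relevant vocabulary.</ACTY_DESCRIPTION>
  </ACTIVITY>
</LESSON>
""".strip()


def tag_counts(root):
    # Count how often each tag occurs (root included), to see what is available
    counts = {}
    for elem in root.iter():
        counts[elem.tag] = counts.get(elem.tag, 0) + 1
    return counts


def texts_for(root, tag):
    # Same filter as in the answer: dir='ltr' and esbox='0', at any depth
    path = f".//{tag}[@dir='ltr'][@esbox='0']"
    return ["".join(e.itertext()).strip() for e in root.findall(path)]


root = ET.fromstring(SAMPLE_XML)
print(tag_counts(root))           # overview of tag names in the document
print(texts_for(root, "TITLE"))   # "TITLE" is a hypothetical tag name
print(texts_for(root, "OBJECTIVES"))
```

Against the real file, replace SAMPLE_XML with the downloaded response content and pick the tag name that the tag overview actually shows for the title.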