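I'm trying to scrape some content from a website (the URL is in the code). I can scrape the brand name and the SDR value, but I can't seem to scrape anything below SDR. I'm only testing against the first result for now; once I get that working, I'll make it dynamic. Hopefully people only need to include Selenium and ChromeDriver in their project, and then they can copy/paste this code.
The code below produces the following error: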
Exception in thread "main" org.openqa.selenium.TimeoutException: Expected condition failed: waiting for visibility of element located by By.xpath: /html/body/app-root/ecl-app/div[2]/app-search-page/app-search-container/div/div/section/div/app-elec-display-search-result/app-search-result/eui-block-content/div/app-search-result-item[1]/article/div[2]/div/app-elec-display-search-result-parameters/app-search-parameter-item[4]/div[2]/div/div[2]/div/div[1]/span (tried for 10 second(s) with 500 milliseconds interval)
Code:
public void scrape() throws InterruptedException {
    System.out.println("Starting Scrape!");
    String url = "https://eprel.ec.europa.eu/screen/product/electronicdisplays";
    WebDriver driver = new ChromeDriver();
    WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10));
    driver.get(url);
    driver.manage().window().maximize();
    WebElement until = wait.until(ExpectedConditions.presenceOfElementLocated(By.className("eui-block-content__wrapper")));
    //The results have been loaded now
    //Click on accept cookie page:
    new WebDriverWait(driver, Duration.ofSeconds(3)).until(ExpectedConditions.elementToBeClickable(By.linkText("Accept all cookies"))).click();
    String moreButton = "/html/body/app-root/ecl-app/div[2]/app-search-page/app-search-container/div/div/section/div/app-elec-display-search-result/app-search-result/eui-block-content/div/app-search-result-item[1]/article/div[3]/div/a";
    String xPathBrandName = "/html/body/app-root/ecl-app/div[2]/app-search-page/app-search-container/div/div/section/div/app-elec-display-search-result/app-search-result/eui-block-content/div/app-search-result-item[1]/article/div[1]/div/div/div[1]/span[1]";
    String xPathSDR = "/html/body/app-root/ecl-app/div[2]/app-search-page/app-search-container/div/div/section/div/app-elec-display-search-result/app-search-result/eui-block-content/div/app-search-result-item[1]/article/div[2]/div/app-elec-display-search-result-parameters/app-search-parameter-item[3]/div[1]/div/div[2]/div/div[1]/span";
    String energyRatingString = "/html/body/app-root/ecl-app/div[2]/app-search-page/app-search-container/div/div/section/div/app-elec-display-search-result/app-search-result/eui-block-content/div/app-search-result-item[1]/article/div[2]/div/app-elec-display-search-result-parameters/app-search-parameter-item[4]/div[2]/div/div[2]/div/div[1]/span";
    //Clicking on more button to load more results to be visible
    driver.findElement(By.xpath(moreButton)).click();
    WebElement SDR = driver.findElement(By.xpath(xPathSDR));
    //Using this logic to scroll to each of the result so it's visible on the web-page
    JavascriptExecutor js = (JavascriptExecutor) driver;
    js.executeScript("arguments[0].scrollIntoView();", SDR);
    WebElement brandName = driver.findElement(By.xpath(xPathBrandName));
    WebElement energyRating = wait.until(ExpectedConditions.visibilityOfElementLocated(By.xpath(energyRatingString)));
    System.out.println("Brand name: " + brandName.getText());
    System.out.println("SDR name: " + SDR.getText());
    System.out.println("energyRating: " + energyRating.getText());
}
However, replacing
    WebElement energyRating = wait.until(ExpectedConditions.visibilityOfElementLocated(By.xpath(energyRatingString)));
with
    WebElement energyRating = driver.findElement(By.xpath(energyRatingString));
gives the following output:
Starting Scrape!
Brand name: Samsung
SDR name: 63
energyRating:
So I'm confused about why energyRating is empty and yet no NoSuchElementException is thrown.
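The problem you're running into is that there are two copies of each field, one visible and one hidden. Your XPath points at one of the hidden copies, and because it never becomes visible, the wait times out. That is also why findElement() doesn't throw NoSuchElementException: the element does exist in the DOM, but getText() only returns text that is actually rendered, so you get an empty string.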
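To make the "two copies" point concrete, here is a toy sketch that uses the JDK's built-in javax.xml.xpath on a hand-written XML fragment instead of the real page. The markup is an assumption made up for illustration: the class name hypothetical-hidden-copy is invented, and ecl-u-d-l-block is simply borrowed from the selectors used in the code further down. It only shows that a positional step such as div[2] selects by order and says nothing about visibility, whereas a class-based predicate names the copy you actually want. It may be worth noting that in the question the working SDR path goes through .../div[1]/... while the failing energy-rating path goes through .../div[2]/....

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class DuplicateFieldDemo {
    // Hand-written stand-in for one parameter item; NOT the real EPREL markup.
    // The same value is present twice, once per copy.
    static final String ITEM =
          "<item>"
        + "<div class='ecl-u-d-l-block'><span>G</span></div>"
        + "<div class='hypothetical-hidden-copy'><span>G</span></div>"
        + "</item>";

    /** Returns the class attribute of the first div matched by divXPath. */
    static String classOfMatchedDiv(String divXPath) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(divXPath + "/@class",
                    new InputSource(new StringReader(ITEM)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Positional steps pick a copy purely by its order in the markup.
        System.out.println(classOfMatchedDiv("/item/div[1]")); // ecl-u-d-l-block
        System.out.println(classOfMatchedDiv("/item/div[2]")); // hypothetical-hidden-copy
        // A class-based predicate states which copy is intended.
        System.out.println(classOfMatchedDiv(
                "/item/div[contains(@class,'ecl-u-d-l-block')]")); // ecl-u-d-l-block
    }
}
```

In Selenium the same idea is expressed with a CSS selector or a short relative XPath that includes the distinguishing class, rather than an index.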
I wrote my own code to do what you described.
import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

// Inside your scrape method:
String url = "https://eprel.ec.europa.eu/screen/product/electronicdisplays";
WebDriver driver = new ChromeDriver();
driver.manage().window().maximize();
driver.get(url);

// Each product on the results page is one <app-search-result-item>
List<WebElement> results = driver.findElements(By.cssSelector("app-search-result-item"));
String brandName = "";
String sdr = "";
String energyRating = "";
for (WebElement result : results) {
    // The leading "." scopes the XPath to this result; a bare "//a" would search the whole page
    result.findElement(By.xpath(".//a[text()=' More ']")).click();
    brandName = result.findElement(By.cssSelector("span.ecl-u-type-2xl")).getText();
    // div.ecl-u-d-l-block selects the visible copy of each field
    sdr = result.findElement(By.cssSelector("app-search-parameter-item[label='field.electronic-display.powerOnModeSDRV2'] div.ecl-u-d-l-block span.ecl-u-type-bold")).getText();
    energyRating = result.findElement(By.cssSelector("app-search-parameter-item[label='field.electronic-display.energyClassHDR'] div.ecl-u-d-l-block span.ecl-u-type-bold")).getText();
    result.findElement(By.xpath(".//a[text()=' Less ']")).click();
    System.out.println("Brand name: " + brandName);
    System.out.println("SDR name: " + sdr);
    System.out.println("energyRating: " + energyRating);
}
It outputs
Brand name: Samsung
SDR name: 63
energyRating: G
Brand name: Samsung
SDR name: 63
energyRating: G
...
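Some feedback...
1. Absolute XPaths (paths that start with /html), overly long XPaths (many element levels), and indexed XPaths (/div[2], etc.) are all risky, because even the smallest change to the page will break your locator. I'm sure you're new to this and that's a fine place to start, but if you plan to keep writing scripts, learning to write your own locators will be very valuable.
2. Each product on the page is contained in an <app-search-result-item> element. Put those elements in a list that you can iterate over; it makes tasks like this much easier. In each iteration you search starting from that anchor element, so you only find the data that applies to that product. That's why you see so many result.findElement() calls in my code: result is the product from the list that I'm currently looping over.
3. Using WebDriverWait is good practice, but waits aren't always necessary.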
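One caveat about the second point when the anchored search uses XPath: an expression that begins with // is evaluated from the document root even if you call it on an element, so it can return another product's node. Starting the expression with .// keeps it inside the anchor. The sketch below shows this with the JDK's javax.xml.xpath on a made-up two-item fragment; the markup and brand names are assumptions for illustration, and Selenium's element.findElement(By.xpath(...)) follows the same rule.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ScopedXPathDemo {
    // Made-up stand-in for a results page with two products.
    static final String PAGE =
          "<results>"
        + "<app-search-result-item><span>Brand A</span></app-search-result-item>"
        + "<app-search-result-item><span>Brand B</span></app-search-result-item>"
        + "</results>";

    /** Evaluates spanXPath once per result item, with that item as the context node. */
    static List<String> brandPerItem(String spanXPath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(PAGE)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList items = (NodeList) xpath.evaluate(
                    "//app-search-result-item", doc, XPathConstants.NODESET);
            List<String> brands = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                // String result = text of the first matching node in document order
                brands.add(xpath.evaluate(spanXPath, items.item(i)));
            }
            return brands;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(brandPerItem("//span"));  // [Brand A, Brand A]
        System.out.println(brandPerItem(".//span")); // [Brand A, Brand B]
    }
}
```

CSS selectors passed to result.findElement() don't have this pitfall, since they are always resolved relative to the element they are called on.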