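I have a Spring Boot application that scrapes a website. I can get it to run, but I can't figure out how to reach the inner HTML elements so that I can iterate over them; I only ever seem to get the outer HTML. Here is my code: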
private void fetchData() {
    try {
        Document doc = Jsoup.connect("https://www.i-90motorsports.com/default.asp?page=xPreOwnedInventory")
                .userAgent("Mozilla/17.0")
                .get();
        Elements elementList = doc.select("doc.info");
        Elements vehBody = doc.select("div#VehBody");
        if (doc.getAllElements().isEmpty()) {
            aStringBuilder.append("Nothing found for ");
            addLineBreak();
            return;
        }
        for (Element anElement : vehBody) {
            //if(isElementValid(anElement)) {
            aStringBuilder.append(anElement.getElementsByTag("a").first().text());
            addLineBreak();
            //}
        }
        aStringBuilder.append("Finished Scraping websites. Found " + elementList.size() + " elements");
        addLineBreak();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
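This is the HTML I am trying to get. I want to iterate over each vehicle row, but I am not sure which value to pass to select. Any suggestions?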
<div id="VehBody">
<div class="vehicle_row dspYear-2020 dspCondition-PREOWNED dspBodyType-SNOWMOBILE dspSubType-MOUNTAIN dspMake-SKI-DOO dspModel-SUMMIT-X-850-E-TEC-175-SS-POWDERMAX-LIGHT-3-0-S-LEV-BLUE images-1" rel="8382204"><div class="unitImage"><div class="imageRow">
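In this case, <div id="VehBody"> is an empty element that is populated by JavaScript, so jsoup cannot read it (jsoup reads the page source, not the DOM as modified by JavaScript). Instead, you can get the data you need by parsing the JavaScript file that is included at the end of the header: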
<script id='jsCachedFile' src='/imglib/Inventory/cache/2366/UVehInv.js?v=1892194' type='text/javascript' ></script></head>
Note that the v parameter changes every time the HTML page is requested.
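Because v changes on every request, the script URL is probably best read from a freshly fetched page rather than hard-coded. The following is only a rough sketch of that lookup using the JDK alone; the class name CachedScriptLocator and its methods are made up for illustration, and the regex assumes the id attribute comes before src, as in the tag shown above. With jsoup already on the classpath, doc.selectFirst("script#jsCachedFile").absUrl("src") should give the same URL. The format of UVehInv.js is not shown here, so parsing its contents is left out.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original post.
class CachedScriptLocator {

    // Captures the src of <script id='jsCachedFile' ... src='...'>.
    // Assumption: id appears before src inside the tag.
    private static final Pattern SCRIPT_SRC = Pattern.compile(
            "<script[^>]*\\bid=['\"]jsCachedFile['\"][^>]*\\bsrc=['\"]([^'\"]+)['\"]",
            Pattern.CASE_INSENSITIVE);

    /** Returns the absolute URL of the cached script, or empty if the tag is missing. */
    static Optional<URI> findScriptUrl(String html, URI pageUrl) {
        Matcher m = SCRIPT_SRC.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        // Resolve the relative src (including the changing v parameter) against the page URL.
        return Optional.of(pageUrl.resolve(m.group(1)));
    }

    /** Downloads a URL and returns the response body as text (Java 11+). */
    static String fetch(URI url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(url)
                .header("User-Agent", "Mozilla/5.0")
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) {
        URI page = URI.create("https://www.i-90motorsports.com/default.asp?page=xPreOwnedInventory");
        // Sample input copied from the page source above; for the live page use fetch(page).
        String html = "<script id='jsCachedFile' src='/imglib/Inventory/cache/2366/UVehInv.js?v=1892194'"
                + " type='text/javascript' ></script></head>";
        Optional<URI> scriptUrl = findScriptUrl(html, page);
        // With a real request, fetch(scriptUrl.get()) would return the JavaScript text to parse.
        System.out.println(scriptUrl.map(URI::toString).orElse("jsCachedFile script not found"));
    }
}
```

In fetchData(), the downloaded script text would then replace the doc.select("div#VehBody") lookup as the source of the vehicle rows.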