Connecting to product page URLs with Jsoup

Question (votes: 2, answers: 1)

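I have a website that I need to parse data from. I need to run a keyword search and process the results. However, not all fields are visible in the product preview. It seems that some fields (product color, description, old price) can only be scraped from each individual product page. A product page URL looks like this: https://www.aboutyou.de/p/new-look/basecap-in-satin-optik-3649077, so I don't know how to construct it in a generic way without browsing to each product by hand. I can find an item's name and brand, but I don't know how to build the URL: convert all letters to lowercase and put dashes between the words? I can get the brand name and product name like this: NEW LOOK Basecap in Satin-Optik.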

So how do I determine the URL of each product?

Here is my code so far:

String url = "https://www.aboutyou.de/frauen/accessoires/huete-und-muetzen/caps";
Document doc = Jsoup.connect(url).get();

System.out.println("Title: " + doc.title());

String mainPath = "section.layout_11glwo1-o_O-stretchLayout_1jug6qr > " +
        "div.content_1jug6qr > " +
        "div.container > " +
        "div.mainContent_10ejhcu > " +
        "div.productStream_6k751k > " +
        "div > " +
        "div.wrapper_8yay2a > " +
        "div.col-sm-6.col-md-4 > " +
        "div.wrapper_1eu800j > " +
        "div > " +
        "div.categoryTileWrapper_e296pg";

String searchPath = mainPath + " > a.anchor_wgmchy > " +
        "div.details_197iil9 > " +
        "div.meta_1ihynio";
String linksPath = mainPath + " > a.anchor_wgmchy";
String brandPath = mainPath + " > a.anchor_wgmchy > " +
        "div.details_197iil9 > " +
        "div.meta_1ihynio > " +
        "div.description_ya0ltb > " +
        "strong.brand_ke66rm";

Elements result = doc.body().select("main#app");
for(Element element : result) {
    Elements products = element.select(searchPath);
    Elements links = element.select(linksPath);

    Elements brands = element.select(brandPath);
    for(Element product : products){
      System.out.println(product.text());
    }

    String[] linksText = null;
    for(Element link : links){
        String linkHref = link.attr("href");
        String linkText = link.text();
        linksText = linkHref.split("[\\-]");
        String id = linksText[linksText.length-1];
        System.out.println("id: " + id);
        System.out.print("link attr:" + linkHref + ", ");
    }
    System.out.print("\nbrands" + brands.text());
}

Maybe there is a library for this? I would appreciate any advice!

java web-scraping jsoup
1 Answer
0
votes

Most of the required details can be scraped from divs like this:

<div class="details_..." ...>

Grabbing the text of these divs gives you something like:

-10%9,90€ -10 % EXTRA8,90€ NEW LOOK Basecap in Satin-Optik 8,01€
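If only the flattened text of the tile is available, the euro amounts can still be pulled out of it with a regular expression. The following is a minimal sketch using only the standard library; the class name `PriceText` and the method `extractPrices` are hypothetical, and it assumes German number formatting (comma as decimal separator, optional dot as thousands separator) as seen in the sample text above.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup or of the original answer.
final class PriceText {
    // Matches amounts such as "9,90€" or "1.299,00 €" (assumed German formatting).
    private static final Pattern EURO_AMOUNT =
            Pattern.compile("(\\d{1,3}(?:\\.\\d{3})*|\\d+),(\\d{2})\\s*€");

    // Returns every euro amount in the text, in the order it appears.
    static List<BigDecimal> extractPrices(String text) {
        List<BigDecimal> prices = new ArrayList<>();
        Matcher m = EURO_AMOUNT.matcher(text);
        while (m.find()) {
            String whole = m.group(1).replace(".", ""); // drop thousands separators
            prices.add(new BigDecimal(whole + "." + m.group(2)));
        }
        return prices;
    }

    public static void main(String[] args) {
        String text = "-10%9,90€ -10 % EXTRA8,90€ NEW LOOK Basecap in Satin-Optik 8,01€";
        System.out.println(extractPrices(text)); // prints [9.90, 8.90, 8.01]
    }
}
```

In the sample text the last amount is the final price and the earlier ones are the previous prices. Selecting the individual price elements, as the code further down does, is more robust than parsing concatenated text, because a discount percentage that sits directly in front of a price can make the boundaries ambiguous.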

Sample code that separates out some of the details and makes a sub-request to each product page for the color details:

// Imports needed at the top of the file; the statements below belong inside a method.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String url = "https://www.aboutyou.de/frauen/accessoires/huete-und-muetzen/caps";
String userAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36";

try {
    Document doc = Jsoup.connect(url).userAgent(userAgent).get();
    // The class names end in hash-like suffixes (e.g. anchor_wgmchy), so match on the prefix only.
    Elements elements = doc.select("div[class^='categoryTileWrapper_']");

    for (Element element : elements) {

        // Note: first() returns null when nothing matches, so a changed page layout
        // will surface as a NullPointerException on these lines.
        String brand = element.select("strong[class^='brand_']").first().text();
        String name = element.select("p[class^='name_']").first().text();
        System.out.println(brand + " - " + name);

        // Read the product URL from the tile's link instead of constructing it,
        // then fetch the product page to get the color details.
        String href = element.select("a[class^='anchor_']").first().absUrl("href");
        Document subDoc = Jsoup.connect(href).userAgent(userAgent).get();
        String color = subDoc.select("div[class^='attributeWrapper_']").first().text();
        System.out.println("\t" + href);
        System.out.println("\t" + color);

        String finalPrice = element.select("div[class^='finalPrice_']").first().text();

        // Earlier prices, if any, are listed in a <ul> inside the tile.
        if (element.select("ul").size() > 0) {
            for (Element listItem : element.select("ul").first().select("li")) {
                System.out.println("\tprice was: " + listItem.select("span[class^='price_']").first().text());
            }
        }
        System.out.println("\tfinal price: " + finalPrice);
    }
} catch (IOException e) {
    e.printStackTrace();
}

Output:

NEW LOOK - Basecap in Satin-Optik
    https://www.aboutyou.de/p/new-look/basecap-in-satin-optik-3649077
    Textil Unifarben
    price was: 9,90€
    price was: 8,90€
    final price: 8,01€
WOOD WOOD - Weiche 'Baseball cap'
    https://www.aboutyou.de/p/wood-wood/weiche-baseball-cap-3687779
    Logoprint
    price was: 39,90€
    price was: 29,90€
    final price: 20,93€
[... truncated]
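
The product URLs in this output are read from each tile's link rather than built from the brand and product name, which matters because the trailing numeric id cannot be derived from the name. As a rough illustration of what resolving such a link involves, here is a standard-library sketch. The class `ProductLinks` and both method names are hypothetical, and it assumes that the listing page may contain site-relative `href` values and that product URLs end in `-<digits>`, as the two URLs above do.

```java
import java.net.URI;
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup or of the original answer.
final class ProductLinks {
    // Assumption: a product URL path ends in "-<digits>", e.g. "...-3649077".
    private static final Pattern TRAILING_ID = Pattern.compile("-(\\d{1,18})/?$");

    // Resolves a possibly relative href against the URL of the page it was found on.
    static String toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    // Extracts the trailing numeric id, or returns an empty result if there is none.
    static OptionalLong productId(String productUrl) {
        String path = URI.create(productUrl).getPath();
        Matcher m = TRAILING_ID.matcher(path);
        return m.find() ? OptionalLong.of(Long.parseLong(m.group(1))) : OptionalLong.empty();
    }

    public static void main(String[] args) {
        String page = "https://www.aboutyou.de/frauen/accessoires/huete-und-muetzen/caps";
        String abs = toAbsolute(page, "/p/new-look/basecap-in-satin-optik-3649077");
        System.out.println(abs);
        System.out.println(productId(abs).getAsLong());
    }
}
```

Anchoring the id pattern to the end of the path avoids the problem with splitting the whole link on dashes, where a URL without a numeric suffix would yield an arbitrary last word instead of an id.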