Web scraping with Scanner in Java

Question (votes: 0, answers: 2)

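I am supposed to do web scraping with the URL and Scanner classes and pick out only the energy consumption figures for the past 8 days from the site's HTML. So I have a 24x8 array to hold all the numbers. I use .findInLine to identify the hour. For example, this is the first part, which I use to locate the block of numbers for the first hour.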

while (in.findInLine("00-01") == null) in.nextLine();
in.nextLine(); // skip rest of the line containing "00-01"

<td>00-01</td>
<td align="right"> 11872</td>
<td align="right"> 12146</td>
<td align="right"> 12861</td>
<td align="right"> 12561</td>
<td align="right"> 13493</td>
<td align="right"> 13386</td>
<td align="right"> 12732</td>
<td align="right"> <b>12249</b></td>

My problem is that I can't figure out how to extract these numbers and put them into the array, given that I have 24 sections like this one.

java java.util.scanner
2 Answers
1
vote

Given the input, the following will extract the numbers from each line.

  Pattern pattern = Pattern.compile("\\d+");
  while (in.hasNextLine())
  {
    String str = in.nextLine();
    Matcher m = pattern.matcher(str);
    while (m.find())
    {
      // Replace this with code that stores the value in your array
      System.out.println(m.group());
    }
  }
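Note that `\d+` also matches the `00` and `01` inside an hour label such as `00-01`, so the label line has to be handled separately before the values are stored. A minimal sketch of filling the 24x8 array that way follows. The class name `ConsumptionTableSketch`, the `parse` helper, and both patterns are assumptions based only on the sample markup in the question, not on the real page.

```java
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConsumptionTableSketch {
    // Hour label cell such as <td>00-01</td> (assumed from the sample markup).
    private static final Pattern HOUR = Pattern.compile("<td>(\\d{2})-\\d{2}</td>");
    // Digits directly followed by a closing tag, e.g. "11872</td>" or "12249</b>".
    private static final Pattern VALUE = Pattern.compile("(\\d+)\\s*</");

    // Reads line by line; each hour label selects a row, each value line fills the next column.
    static int[][] parse(Scanner in, int hours, int days) {
        int[][] table = new int[hours][days];
        int row = -1; // no hour label seen yet
        int col = 0;
        while (in.hasNextLine()) {
            String line = in.nextLine();
            Matcher hour = HOUR.matcher(line);
            if (hour.find()) {
                row = Integer.parseInt(hour.group(1)); // "00" -> row 0
                col = 0;
                continue;
            }
            Matcher value = VALUE.matcher(line);
            if (row >= 0 && row < hours && col < days && value.find()) {
                table[row][col++] = Integer.parseInt(value.group(1));
            }
        }
        return table;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
                "<td>00-01</td>",
                "<td align=\"right\"> 11872</td>",
                "<td align=\"right\"> <b>12249</b></td>");
        int[][] table = parse(new Scanner(sample), 24, 8);
        System.out.println(table[0][0] + " " + table[0][1]); // prints "11872 12249"
    }
}
```

Taking the row index from the label itself, rather than counting labels, keeps rows aligned even if the page skips or repeats an hour.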

0
votes

Given your limited input, I would use just the plain Scanner interface:

import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class Scrap {

    private static final String HOUR_PATTERN = "<td>\\d{2}-\\d{2}</td>";
    private static final String TD_DELIMITER = "\\s|<|>";

    public static void main(String[] args) {
        Scanner in = new Scanner(Scrap.class.getResourceAsStream("/input"));
        List<Integer> res = new ArrayList<>();
        while (in.hasNext()) {
            if (!in.hasNext(HOUR_PATTERN)) {
                in.next(); // skip tokens that are not an hour cell
                continue;
            }
            in.next(HOUR_PATTERN); // consume the hour cell, e.g. <td>00-01</td>
            Pattern delim = in.delimiter();
            in.useDelimiter(TD_DELIMITER); // split on whitespace and angle brackets
            int count = 0;
            while (count < 8 && in.hasNext()) { // you wrote it is going to be 8 entries
                if (in.hasNextInt()) {
                    res.add(in.nextInt());
                    count++;
                } else {
                    in.next(); // tag fragments such as "td" or "align=..."
                }
            }
            in.useDelimiter(delim); // restore the whitespace delimiter
        }
        System.out.println(res);
    }
}

Given the input

blelblebll

<td>00-01</td>
<td align="right"> 11872</td>
<td align="right"> 12146</td>
<td align="right"> 12861</td>
<td align="right"> 12561</td>
<td align="right"> 13493</td>
<td align="right"> 13386</td>
<td align="right"> 12732</td>
<td align="right"> <b>12249</b></td>

<td>00-01</td>
<td align="right"> 11872</td>
<td align="right"> 12146</td>
<td align="right"> 12861</td>
<td align="right"> 12561</td>
<td align="right"> 13493</td>
<td align="right"> 13386</td>
<td align="right"> 12732</td>
<td align="right"> <b>12249</b></td>


<td>00-01</td>
<td align="right"> 11872</td>
<td align="right"> 12146</td>
<td align="right"> 12861</td>
<td align="right"> 12561</td>
<td align="right"> 13493</td>
<td align="right"> 13386</td>
<td align="right"> 12732</td>
<td align="right"> <b>12249</b></td>


<td>00-01</td>
<td align="right"> 11872</td>
<td align="right"> 12146</td>
<td align="right"> 12861</td>
<td align="right"> 12561</td>
<td align="right"> 13493</td>
<td align="right"> 13386</td>
<td align="right"> 12732</td>
<td align="right"> <b>12249</b></td>

it produces

[11872, 12146, 12861, 12561, 13493, 13386, 12732, 12249, 11872, 12146, 12861, 12561, 13493, 13386, 12732, 12249, 11872, 12146, 12861, 12561, 13493, 13386, 12732, 12249, 11872, 12146, 12861, 12561, 13493, 13386, 12732, 12249]

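It is based on your input sample, so it may not work as-is on the live markup.

Alternatively, you could use <.*?> as the delimiter and focus only on the number pattern.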
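A rough sketch of that tag-as-delimiter idea is below. It makes two assumptions that go beyond the suggestion: the delimiter is written as `<[^>]*>` instead of `<.*?>`, and whitespace is folded into the same delimiter, because otherwise a token such as " 11872" keeps its leading space and `hasNextInt()` rejects it. The class name `TagDelimiterSketch` and the `parse` helper are hypothetical.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

class TagDelimiterSketch {
    // Any run of whitespace and/or tags counts as a single delimiter (assumption, see lead-in).
    private static final Pattern TAGS_AND_SPACE = Pattern.compile("(?:\\s|<[^>]*>)+");
    // Hour label token such as 00-01, as it appears once the tags are stripped.
    private static final Pattern HOUR = Pattern.compile("\\d{2}-\\d{2}");

    static int[][] parse(String html, int hours, int days) {
        int[][] table = new int[hours][days];
        int row = -1; // no hour label seen yet
        int col = 0;
        try (Scanner in = new Scanner(html).useDelimiter(TAGS_AND_SPACE)) {
            while (in.hasNext()) {
                if (in.hasNext(HOUR)) {
                    // "00-01" -> row 0; the values that follow belong to this row
                    row = Integer.parseInt(in.next(HOUR).substring(0, 2));
                    col = 0;
                } else if (in.hasNextInt()) {
                    int value = in.nextInt();
                    if (row >= 0 && row < hours && col < days) {
                        table[row][col++] = value;
                    }
                } else {
                    in.next(); // any other text between tags is ignored
                }
            }
        }
        return table;
    }

    public static void main(String[] args) {
        String sample = "<td>00-01</td>\n"
                + "<td align=\"right\"> 11872</td>\n"
                + "<td align=\"right\"> <b>12249</b></td>\n";
        int[][] table = parse(sample, 24, 8);
        System.out.println(table[0][0] + " " + table[0][1]); // prints "11872 12249"
    }
}
```

With the tags consumed by the delimiter, the scanner only ever sees hour labels, numbers, and stray text, so no delimiter switching is needed.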
