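I'm supposed to do web scraping with the URL and Scanner classes and pick out, from a website's HTML, only the energy consumption figures for the past 8 days. So I have a 24x8 array to hold all the numbers. I use .findInLine to identify the hour. For example, here I use the first part to locate the block of numbers for the first hour: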
while (in.findInLine("00-01") == null) in.nextLine();
in.nextLine(); // skip rest of the line containing "00-01"
<td>00-01</td>
<td align="right"> 11872</td>
<td align="right"> 12146</td>
<td align="right"> 12861</td>
<td align="right"> 12561</td>
<td align="right"> 13493</td>
<td align="right"> 13386</td>
<td align="right"> 12732</td>
<td align="right"> <b>12249</b></td>
My problem is that I can't figure out how to extract these numbers and put them into the array, given that I have 24 sections like this.
Given that input, the following will extract the numbers from each line.
// Requires java.util.regex.Matcher and java.util.regex.Pattern
Pattern pattern = Pattern.compile("\\d+");
while (in.hasNextLine())
{
    String str = in.nextLine();
    Matcher m = pattern.matcher(str);
    while (m.find())
    {
        // Change this to add the value to an array
        System.out.println(m.group());
    }
}
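To connect the regex approach above to the 24x8 array from the question, here is a minimal sketch. The class name EnergyTable, the parse method and the row/column bookkeeping are my own illustration, not something from the original page. It assumes every hour row looks like <td>00-01</td> and that each value sits on its own line, as in the sample.

```java
import java.util.Arrays;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: fills a 24x8 table from markup shaped like the sample.
class EnergyTable {
    static final int HOURS = 24;
    static final int DAYS = 8;
    // An hour label cell such as <td>00-01</td> (assumed format)
    private static final Pattern HOUR_ROW = Pattern.compile("<td>\\d{2}-\\d{2}</td>");
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    static int[][] parse(Scanner in) {
        int[][] table = new int[HOURS][DAYS];
        int hour = -1; // no hour row seen yet
        int day = 0;
        while (in.hasNextLine()) {
            String line = in.nextLine();
            if (HOUR_ROW.matcher(line).find()) {
                hour++;   // start the next row of the table
                day = 0;
                if (hour >= HOURS) break;
                continue;
            }
            // Ignore lines before the first hour row and after the 8th value
            if (hour < 0 || day >= DAYS) continue;
            Matcher m = NUMBER.matcher(line);
            if (m.find()) {
                table[hour][day++] = Integer.parseInt(m.group());
            }
        }
        return table;
    }

    public static void main(String[] args) {
        String sample = "<td>00-01</td>\n"
                + "<td align=\"right\"> 11872</td>\n"
                + "<td align=\"right\"> <b>12249</b></td>\n";
        int[][] table = parse(new Scanner(sample));
        System.out.println(Arrays.toString(table[0]));
    }
}
```

Cells that never receive a value stay at 0, so a short or malformed row shows up as trailing zeros rather than an exception.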
Given your limited input, I used only the plain Scanner interface:
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class Scrap {
    private final static String HOUR_PATTERN = "<td>\\d{2}-\\d{2}</td>";
    private final static String TD_DELIMETER = "\\s|<|>";

    public static void main(String[] args) {
        Scanner in = new Scanner(Scrap.class.getResourceAsStream("/input"));
        List<Integer> res = new ArrayList<>();
        while (in.hasNext()) {
            // Skip whitespace-separated tokens until an hour cell such as <td>00-01</td>
            if (!in.hasNext(HOUR_PATTERN)) {
                in.next();
                continue;
            }
            in.next(HOUR_PATTERN);
            Pattern delim = in.delimiter();
            // Split on whitespace and angle brackets so each number is its own token
            in.useDelimiter(TD_DELIMETER);
            int read = 0;
            while (read < 8 && in.hasNext()) { // you wrote it is going to be 8 entries
                if (in.hasNextInt()) {
                    res.add(in.nextInt());
                    read++;
                } else {
                    in.next(); // tag names, attributes and empty tokens
                }
            }
            in.useDelimiter(delim); // restore the original delimiter for the next hour row
        }
        System.out.println(res);
    }
}
Given the input
blelblebll
<td>00-01</td>
<td align="right"> 11872</td>
<td align="right"> 12146</td>
<td align="right"> 12861</td>
<td align="right"> 12561</td>
<td align="right"> 13493</td>
<td align="right"> 13386</td>
<td align="right"> 12732</td>
<td align="right"> <b>12249</b></td>
<td>00-01</td>
<td align="right"> 11872</td>
<td align="right"> 12146</td>
<td align="right"> 12861</td>
<td align="right"> 12561</td>
<td align="right"> 13493</td>
<td align="right"> 13386</td>
<td align="right"> 12732</td>
<td align="right"> <b>12249</b></td>
<td>00-01</td>
<td align="right"> 11872</td>
<td align="right"> 12146</td>
<td align="right"> 12861</td>
<td align="right"> 12561</td>
<td align="right"> 13493</td>
<td align="right"> 13386</td>
<td align="right"> 12732</td>
<td align="right"> <b>12249</b></td>
<td>00-01</td>
<td align="right"> 11872</td>
<td align="right"> 12146</td>
<td align="right"> 12861</td>
<td align="right"> 12561</td>
<td align="right"> 13493</td>
<td align="right"> 13386</td>
<td align="right"> 12732</td>
<td align="right"> <b>12249</b></td>
it produces
[11872, 12146, 12861, 12561, 13493, 13386, 12732, 12249, 11872, 12146, 12861, 12561, 13493, 13386, 12732, 12249, 11872, 12146, 12861, 12561, 13493, 13386, 12732, 12249, 11872, 12146, 12861, 12561, 13493, 13386, 12732, 12249]
It is based on your input sample, so it may not work as-is on the live markup.
Alternatively, you can use <.*?> as the delimiter and focus only on the number pattern.
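A rough sketch of that delimiter idea follows. One assumption of mine: the delimiter is widened to (?:<.*?>|\s)+ so that whitespace between tags is swallowed as well, because with <.*?> alone the tokens would still carry leading spaces and newlines. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical sketch: treat tags and whitespace as delimiters, keep only integers.
class TagDelimiterSketch {
    // Any run of HTML tags and/or whitespace counts as one delimiter (my assumption)
    static final String TAGS_OR_SPACE = "(?:<.*?>|\\s)+";

    static List<Integer> readNumbers(Scanner in) {
        in.useDelimiter(TAGS_OR_SPACE);
        List<Integer> numbers = new ArrayList<>();
        while (in.hasNext()) {
            if (in.hasNextInt()) {
                numbers.add(in.nextInt());
            } else {
                in.next(); // skip non-numeric text such as the "00-01" hour label
            }
        }
        return numbers;
    }

    public static void main(String[] args) {
        String sample = "<td>00-01</td>\n"
                + "<td align=\"right\"> 11872</td>\n"
                + "<td align=\"right\"> <b>12249</b></td>\n";
        System.out.println(readNumbers(new Scanner(sample)));
    }
}
```

This collects every integer in the input, so on a full page you would still need to limit it to the table of interest, for example by skipping ahead to the first hour row before switching the delimiter.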