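I am building an email scraper. However, for one particular URL, matcher.find() never returns a boolean result; as far as I can tell, it freezes. For some other URLs the same code works fine.
Here is my code: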
private Matcher matcher;
private Pattern pattern = null;
private final String emailPattern = "([\\w\\-]([\\.\\w])+[\\w]+@([\\w\\-]+\\.)+[A-Za-z]{2,4})";

public void scrape() {
    pattern = Pattern.compile(emailPattern);
    Document documentTwo = null;
    try {
        documentTwo = Jsoup.connect("https://www.mercurynews.com/2020/03/21/how-can-i-get-tested-for-covid-19-in-the-bay-area/")
                .ignoreHttpErrors(true)
                .userAgent(RandomUserAgent.getRandomUserAgent())
                .header("Content-Language", "en-US")
                .get();
    } catch (IOException ex) {
        // break is only valid inside a loop or switch, so leave the method instead
        return;
    }
    String pageBody = documentTwo.toString();
    matcher = pattern.matcher(pageBody);
    while (matcher.find()) {
        // this will never execute for the above web address
    }
}
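To check, I added System.out.println(matcher.find()); above the while loop, and it hangs there without printing any value. Am I doing something wrong here? I have tried many different email regex patterns, and the pattern above is a valid one. Can someone help me? I would greatly appreciate it. Thanks.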
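One way to confirm that the time is being spent inside matcher.find() itself, rather than in the download, is to run the pattern against an in-memory string under a time budget. The sketch below is only an illustration under stated assumptions: DeadlineCharSequence, RegexTimeoutException and finishesWithin are hypothetical names, not part of java.util.regex or jsoup, and the 20,000-character run of 'a' is an assumed stand-in for whatever long token the real page contains. It relies on the regex engine reading its input through CharSequence.charAt.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexDeadlineDemo {

    /** Hypothetical exception used to signal that the time budget ran out. */
    static final class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException(String message) {
            super(message);
        }
    }

    /** Hypothetical wrapper: a CharSequence view that throws once a deadline has passed. */
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // The matcher reads the input through charAt, so this check runs throughout matching.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException("regex matching exceeded its time budget");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Returns true if scanning the whole input finished in time, false if the budget ran out. */
    static boolean finishesWithin(Pattern pattern, CharSequence input, long budgetMillis) {
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        Matcher matcher = pattern.matcher(new DeadlineCharSequence(input, deadline));
        try {
            while (matcher.find()) {
                // Matches are ignored; only the running time matters here.
            }
            return true;
        } catch (RegexTimeoutException timedOut) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The pattern from the question.
        Pattern original = Pattern.compile(
                "([\\w\\-]([\\.\\w])+[\\w]+@([\\w\\-]+\\.)+[A-Za-z]{2,4})");

        // Assumed stand-in for the page: a long run of word characters with no '@'.
        char[] run = new char[20_000];
        Arrays.fill(run, 'a');
        String longRun = new String(run);

        System.out.println("short input finished: "
                + finishesWithin(original, "write to bob.smith@example.com", 1_000));
        System.out.println("long run finished:    "
                + finishesWithin(original, longRun, 200));
    }
}
```

If the second call reports false while the first reports true, the time is going into matching rather than into the network request. The same wrapper could be applied to pageBody in scrape() so that a slow page raises an exception instead of hanging.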
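Your regular expression is the problem. In ([\.\w])+[\w]+ two adjacent unbounded quantifiers can match the same characters, so on a long run of word characters that is not followed by @ the engine has many ways to split the run and tries them all before giving up; that backtracking is most likely what looks like a freeze. Below is code with a regex that puts an upper bound on the repetition of each character class: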
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {
    public static void main(String[] args) {
        Document documentTwo = null;
        try {
            documentTwo = Jsoup
                    .connect(
                            "https://www.mercurynews.com/2020/03/21/how-can-i-get-tested-for-covid-19-in-the-bay-area/")
                    .header("Content-Language", "en-US").get();
        } catch (IOException e) {
            e.printStackTrace();
            // Without a document there is nothing to scan; returning avoids a NullPointerException below
            return;
        }
        String pageBody = documentTwo.toString();
        // Bounded repetitions ({1,256}, {0,64}, {0,25}) limit how far the matcher can backtrack
        Pattern pattern = Pattern.compile(
                "([a-zA-Z0-9\\+\\.\\_\\%\\-\\+]{1,256}\\@[a-zA-Z0-9][a-zA-Z0-9\\-]{0,64}(\\.[a-zA-Z0-9][a-zA-Z0-9\\-]{0,25})+)");
        Matcher matcher = pattern.matcher(pageBody);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}
Output:
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
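The pattern can also be tried without any network access, which makes it easier to check against known input. The following is a minimal sketch, assuming a hypothetical helper named extractEmails and a made-up sample string; the regex itself is copied unchanged from the code above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailExtractDemo {
    // Same pattern as in the answer's code, compiled once and reused.
    private static final Pattern EMAIL = Pattern.compile(
            "([a-zA-Z0-9\\+\\.\\_\\%\\-\\+]{1,256}\\@[a-zA-Z0-9][a-zA-Z0-9\\-]{0,64}(\\.[a-zA-Z0-9][a-zA-Z0-9\\-]{0,25})+)");

    /** Hypothetical helper: collects every match of EMAIL in the given text, in order. */
    static List<String> extractEmails(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = EMAIL.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sample input standing in for a fetched page body.
        String sample = "<p>Contact alice@example.com or bob.smith@mail.example.org for details.</p>";
        System.out.println(extractEmails(sample));
        // prints [alice@example.com, bob.smith@mail.example.org]
    }
}
```

Keeping the matching in a method that takes a plain String separates the regex from the jsoup call, so the pattern can be tested on small inputs before it is run on a whole page.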