I'm writing a web scraper to grab something I'll probably only use once. As an example, I want to download this image (https://thebarchive.com/b/full_image/1707085883033680.jpg) using the WebClient DownloadFile method, but it just returns a 403 error. As you can see below, I added a lot of headers. I threw out some of them, but I tried to replicate most of the headers that get sent when I open the image normally (I found them with Fiddler). I'm getting desperate; maybe someone can help me figure out what's wrong. One important detail: just yesterday both HttpClient and WebClient worked fine without adding any headers, but today they get rejected.
wc.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36");
wc.Headers.Add("Host", "thebarchive.com");
wc.Headers.Add("Content-Type","application/x-www-form-urlencoded");
wc.Headers.Add("Cache-Control","max-age=0");
wc.Headers.Add("Content-Length","0");
wc.Headers.Add("origin", "thebarchive.com");
wc.Headers.Add("upgrade-insecure-requests","1");
wc.Headers.Add("accept-encoding","gzip, deflate, br");
wc.Headers.Add("cookie","cf_chl_3=0e3bd5e051a1c24");
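For comparison, here is roughly how the same headers could be attached to an HttpClient instead; a minimal sketch, assuming you want the handler to negotiate compression itself rather than sending Accept-Encoding by hand (note that HttpClient validates header placement, so content headers like Content-Type cannot go on DefaultRequestHeaders without bypassing validation):

```csharp
using System;
using System.Net;
using System.Net.Http;

class HeaderSetup
{
    static void Main()
    {
        var handler = new HttpClientHandler
        {
            // Let the handler decompress gzip/deflate/br instead of
            // adding the Accept-Encoding header manually.
            AutomaticDecompression = DecompressionMethods.All
        };
        var client = new HttpClient(handler);

        // TryAddWithoutValidation skips HttpClient's strict header checks,
        // which would otherwise reject some of these names.
        client.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36");
        client.DefaultRequestHeaders.TryAddWithoutValidation("Upgrade-Insecure-Requests", "1");
        client.DefaultRequestHeaders.TryAddWithoutValidation("Cookie", "cf_chl_3=0e3bd5e051a1c24");

        Console.WriteLine(client.DefaultRequestHeaders.Contains("Cookie"));
    }
}
```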
// I pass both the WebClient and the HttpClient into the method where this code downloads the picture.
// Below is just how I navigate to the image; that part works perfectly fine, the only problem is downloading the picture.
var html = httpClient.GetStringAsync(ThreadLink).Result;
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
var piclinks = htmlDocument.DocumentNode.Descendants("div")
    .Where(node => node.GetAttributeValue("class", "")
    .Contains("thread_image_box")).ToList();
foreach (var imagelink in piclinks)
{
    string link = imagelink.InnerHtml;
    // Get the picture name, because the link is somehow stored not as a direct thebarchive link
    // but as an archived.moe reference to it,
    // which isn't even how it appears on the site if you inspect the picture with Ctrl+Shift+C.
    string linkfull = link.Substring(link.IndexOf("t/") + 2, link.IndexOf("target=") - 2 - (link.IndexOf("t/") + 2));
    string downloadlink = "https://thebarchive.com/b/full_image/" + linkfull;
    try
    {
        wc.DownloadFile(downloadlink, Path.Combine(folder, linkfull));
    }
    catch (Exception ex)
    {
        Console.WriteLine("Couldn't download file, most likely a video");
        Console.WriteLine(ex);
    }
}
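The Substring arithmetic in the loop above is easy to get wrong, so as a sketch it can be pulled into a small helper and checked against sample markup. The sample InnerHtml below is an assumption about what the anchor looks like, inferred from the "t/" and "target=" indices the original code relies on:

```csharp
using System;

class LinkExtraction
{
    // Extract the file name between the first "t/" and the quote
    // preceding "target=" (same arithmetic as the original Substring call).
    static string ExtractFileName(string innerHtml)
    {
        int start = innerHtml.IndexOf("t/") + 2;
        int end = innerHtml.IndexOf("target=") - 2; // step back over the space and closing quote
        return innerHtml.Substring(start, end - start);
    }

    static void Main()
    {
        // Hypothetical anchor markup, assumed for illustration.
        string sample = "<a href=\"https://archived.moe/b/redirect/1707085883033680.jpg\" target=\"_blank\">pic</a>";
        Console.WriteLine(ExtractFileName(sample)); // → 1707085883033680.jpg
    }
}
```

A more robust alternative would be to select the `<a>` node with HtmlAgilityPack and read its `href` attribute directly instead of slicing InnerHtml.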
Update: I added the headers. Fetching the page with httpClient works and I get the image links, but downloading the picture itself still makes zero progress.
The problem wasn't the headers or anything like that. The site had suddenly gained Cloudflare anti-bot protection. To bypass it, I used FlareSolverrSharp together with a private proxy.
To work with FlareSolverrSharp, instead of wc.DownloadFile() I switched to an httpClient-based download:
wc.DownloadFile(downloadlink, Path.Combine(folder,linkfull));
// became
byte[] imageBytes = await httpClient.GetByteArrayAsync(downloadlink);
File.WriteAllBytes(Path.Combine(folder,linkfull), imageBytes);
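For completeness, the FlareSolverrSharp side can be wired up roughly like this; a sketch assuming a FlareSolverr instance is running locally on its default port 8191, and using the library's ClearanceHandler to route Cloudflare-challenged requests through that instance:

```csharp
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using FlareSolverrSharp;

class Downloader
{
    static async Task Main()
    {
        // ClearanceHandler forwards requests that hit a Cloudflare challenge
        // to the FlareSolverr service, which solves it in a headless browser.
        var handler = new ClearanceHandler("http://localhost:8191/")
        {
            MaxTimeout = 60000 // milliseconds FlareSolverr may spend on the challenge
        };
        var httpClient = new HttpClient(handler);

        string downloadlink = "https://thebarchive.com/b/full_image/1707085883033680.jpg";
        byte[] imageBytes = await httpClient.GetByteArrayAsync(downloadlink);
        File.WriteAllBytes(Path.Combine("downloads", "1707085883033680.jpg"), imageBytes);
    }
}
```

The folder name and image URL here are just placeholders; in the scraper above, the handler-backed httpClient would simply replace the plain one passed into the download method.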