WebClient returns 403 when trying to use DownloadFile()

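I'm starting a web scraper to grab some things I will probably only use once. As an example, I want to download this image (https://thebarchive.com/b/full_image/1707085883033680.jpg) using WebClient's DownloadFile method. It just returns a 403 error. As you can see below, I added a lot of headers. Some of them I simply threw in, but I tried to replicate most of the headers that are sent when I access the image normally in a browser (I found them with Fiddler). I'm getting desperate; maybe someone can help me figure out what's wrong. One important point: just yesterday both httpClient and WebClient worked fine without any added headers, but today the requests are refused.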

            wc.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36");
            wc.Headers.Add("Host", "thebarchive.com");
            wc.Headers.Add("Content-Type","application/x-www-form-urlencoded");
            wc.Headers.Add("Cache-Control","max-age=0");
            wc.Headers.Add("Content-Length","0");
            wc.Headers.Add("origin", "thebarchive.com");
            wc.Headers.Add("upgrade-insecure-requests","1");
            wc.Headers.Add("accept-encoding","gzip, deflate, br");
            wc.Headers.Add("cookie","cf_chl_3=0e3bd5e051a1c24");


            // I pass both wc and httpClient into the method where I use this code to download the picture.
            // Below is just how I get to the image; that part works perfectly fine. The only problem is downloading the picture.


            var html = httpClient.GetStringAsync(ThreadLink).Result;
            var htmlDocument = new HtmlDocument();
            htmlDocument.LoadHtml(html);
            var piclinks = htmlDocument.DocumentNode.Descendants("div")
                .Where(node => node.GetAttributeValue("class", "")
                .Contains("thread_image_box")).ToList();

            foreach(var imagelink in piclinks)
            {
                string link = imagelink.InnerHtml;

                // Get the picture name, because the link is somehow stored not as a direct thebarchive link
                // but as an archived.moe reference to that link,
                // which is not even how it is stored on the website if you navigate to the picture using Ctrl+Shift+C.

                string linkfull = link.Substring(link.IndexOf("t/")+2,link.IndexOf("target=")-2-(link.IndexOf("t/")+2));
                string downloadlink = "https://thebarchive.com/b/full_image/" + linkfull;

                try 
                {
                    wc.DownloadFile(downloadlink, Path.Combine(folder,linkfull));
                }
                catch (Exception ex)
                {
                    Console.WriteLine("Couldn't load file, most likely a video");
                    Console.WriteLine(ex);
                }
            }

What I tried: adding headers. It works with httpClient in that I get the image links, but I have made zero progress on downloading the pictures.

c# web-scraping httpclient webclient
1 Answer

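The problem was not the headers or anything like that. The website had suddenly been put behind Cloudflare anti-bot protection. To bypass it, I used FlareSolverrSharp together with a private proxy.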
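
To tell this situation apart from an ordinary 403 sent by the site itself, it can help to inspect the response headers before tweaking request headers any further. The following is a minimal sketch, not part of the original solution: the class and method names are hypothetical, and the check is only a heuristic based on the `cf-mitigated: challenge` header that Cloudflare documents for challenge pages, with the `Server: cloudflare` header as a weaker fallback.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Net.Http;

public static class CloudflareCheck
{
    // Heuristic only: returns true when a response looks like a Cloudflare
    // challenge or block rather than a normal reply from the origin site.
    public static bool LooksLikeCloudflareBlock(HttpResponseMessage response)
    {
        // Challenge and block pages are usually served with 403 or 503.
        bool blockedStatus = response.StatusCode == HttpStatusCode.Forbidden
                          || response.StatusCode == HttpStatusCode.ServiceUnavailable;
        if (!blockedStatus)
        {
            return false;
        }

        // Cloudflare sets "cf-mitigated: challenge" on challenge pages.
        if (response.Headers.TryGetValues("cf-mitigated", out var mitigated)
            && mitigated.Any(v => v.Equals("challenge", StringComparison.OrdinalIgnoreCase)))
        {
            return true;
        }

        // Weaker fallback: the response was at least served through Cloudflare.
        return response.Headers.TryGetValues("Server", out var server)
            && server.Any(v => v.IndexOf("cloudflare", StringComparison.OrdinalIgnoreCase) >= 0);
    }
}
```

If this returns true for the image URL, changing User-Agent or cookie headers by hand is unlikely to help, because the challenge has to be solved by something that behaves like a real browser.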

FlareSolverrSharp plugs into HttpClient as a message handler, so in order to use it I replaced wc.DownloadFile() with a download through httpClient:

wc.DownloadFile(downloadlink, Path.Combine(folder,linkfull));

// became

byte[] imageBytes = await httpClient.GetByteArrayAsync(downloadlink);
File.WriteAllBytes(Path.Combine(folder,linkfull), imageBytes); 
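
For larger files, the same replacement can stream the response to disk instead of buffering the whole image in a byte array. This is only a sketch under assumptions: `ImageDownloader` and `DownloadToFileAsync` are hypothetical names, and the `HttpClient` passed in is assumed to be already configured by the caller (for example, built on top of the FlareSolverrSharp handler).

```csharp
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

public static class ImageDownloader
{
    // Streams the body of a GET request for `url` into the file at `path`.
    // Throws HttpRequestException if the server answers with a non-success
    // status such as 403, so a blocked download is not silently saved.
    public static async Task DownloadToFileAsync(HttpClient client, string url, string path)
    {
        // ResponseHeadersRead avoids buffering the whole body in memory.
        using var response = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead);
        response.EnsureSuccessStatusCode();

        using var source = await response.Content.ReadAsStreamAsync();
        using var target = File.Create(path);
        await source.CopyToAsync(target);
    }
}
```

Inside the loop it would be called as `await ImageDownloader.DownloadToFileAsync(httpClient, downloadlink, Path.Combine(folder, linkfull));`. Checking the status before writing means an error page is never stored under an image file name.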