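A while ago I found some R code edited by Greg (here) that worked well for a long time. Unfortunately, it stopped working some time ago (at least for me), and I would like to know whether anyone can help fix it, if that is possible.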
library(rvest)
webpage <- read_html("https://www.macrotrends.net/stocks/charts/azo/autozone/pe-ratio")
html <- rvest::html_nodes(webpage, "thead+ thead th , #style-1 td")
results <- rvest::html_text(html)
Date <- results[seq(5, length(results), 4)]
`Stock Price` <- results[seq(6, length(results), 4)]
`TTM Net EPS` <- results[seq(7, length(results), 4)]
`PE Ratio` <- results[seq(8, length(results), 4)]
results <- data.frame(Date, `Stock Price`, `TTM Net EPS`, `PE Ratio`, stringsAsFactors = FALSE)
This used to return:
head(results)
        Date Stock.Price TTM.Net.EPS PE.Ratio
1 2020-04-27     1060.52                16.25
2 2020-02-29     1032.51      $65.27    15.82
3 2019-11-30     1177.92      $64.37    18.30
4 2019-08-31     1101.69      $63.54    17.34
5 2019-05-31     1027.11      $55.97    18.35
6 2019-02-28      938.97      $53.40    17.58
But as I said, it unfortunately no longer seems to work. If anyone can help, that would be great for the R community.
At least in R and RStudio running on Windows, the User-Agent request header sent by rvest::read_html() looks something like this:
RStudio Desktop (2023.6.2.529); R (4.2.3 x86_64-w64-mingw32 x86_64 mingw32)
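To see what your own session would send, one option is to inspect R's HTTPUserAgent option. This is a minimal sketch that assumes read_html() takes its header from that option (the string above suggests it does, but I have not verified it for every setup). The helper with_user_agent() and the agent string "example-agent/0.1" are hypothetical names used only for illustration.

```r
# Show the User-Agent string stored in the current R session's options.
print(getOption("HTTPUserAgent"))

# Hypothetical helper: evaluate an expression with a temporary
# HTTPUserAgent option, restoring the previous value afterwards.
with_user_agent <- function(user_agent, expr) {
  old <- options(HTTPUserAgent = user_agent)
  on.exit(options(old), add = TRUE)
  # expr is a lazily evaluated promise, so it runs with the new option set
  force(expr)
}

# Offline check that the option is swapped only inside the call.
inside <- with_user_agent("example-agent/0.1", getOption("HTTPUserAgent"))
```

If the assumption holds, something like with_user_agent("example-agent/0.1", rvest::read_html(url)) might avoid the problematic header, though whether the site accepts any particular string is untested here.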
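Apparently the site really dislikes the RStudio Desktop part of that header. For convenience, the demonstration below uses httr2. When the User-Agent contains RStudio Desktop, the request fails: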
library(httr2)
url_ <- "https://www.macrotrends.net/stocks/charts/azo/autozone/pe-ratio"
request(url_) |>
req_user_agent("RStudio Desktop") |>
req_perform(verbosity = 1)
#> -> GET /stocks/charts/azo/autozone/pe-ratio HTTP/1.1
#> -> Host: www.macrotrends.net
#> -> User-Agent: RStudio Desktop
#> -> Accept: */*
#> -> Accept-Encoding: deflate, gzip
#> ->
#> <- HTTP/1.1 403 Forbidden
#> <- Connection: close
#> <- Content-Length: 420
#> <- Server: Varnish
#> <- Retry-After: 0
#> <- Content-Type: text/html; charset=utf-8
#> <- Accept-Ranges: bytes
#> <- Date: Fri, 22 Sep 2023 13:16:06 GMT
#> <- Via: 1.1 varnish
#> <- X-Served-By: cache-hel1410024-HEL
#> <- X-Cache: MISS
#> <- X-Cache-Hits: 0
#> <- X-Timer: S1695388567.783692,VS0,VE0
#> <-
#> Error in `req_perform()`:
#> ! HTTP 403 Forbidden.
#> Backtrace:
#> ▆
#> 1. └─httr2::req_perform(req_user_agent(request(url_), "RStudio Desktop"), verbosity = 1)
#> 2. └─httr2:::resp_abort(resp, error_body(req, resp), call = error_call)
#> 3. └─rlang::abort(...)
Change a single character in the User-Agent and we get status 200 along with the content:
request(url_) |>
req_user_agent("RStudio.Desktop") |>
req_perform(verbosity = 1)
#> -> GET /stocks/charts/azo/autozone/pe-ratio HTTP/1.1
#> -> Host: www.macrotrends.net
#> -> User-Agent: RStudio.Desktop
#> -> Accept: */*
#> -> Accept-Encoding: deflate, gzip
#> ->
#> <- HTTP/1.1 200 OK
#> <- Connection: keep-alive
#> <- Content-Length: 14962
#> <- Server: Apache/2.4.18 (Ubuntu)
#> <- Cache-Control: no-cache, no-store, must-revalidate
#> <- Pragma: no-cache
#> <- Expires: 0
#> <- Content-Encoding: gzip
#> <- Content-Type: text/html; charset=UTF-8
#> <- Via: 1.1 varnish, 1.1 varnish
#> <- Accept-Ranges: bytes
#> <- Date: Fri, 22 Sep 2023 13:16:07 GMT
#> <- Age: 1763
#> <- X-Served-By: cache-iad-kjyo7100051-IAD, cache-hel1410025-HEL
#> <- X-Cache: MISS, HIT
#> <- X-Cache-Hits: 0, 2
#> <- X-Timer: S1695388567.044209,VS0,VE0
#> <- Vary: Accept-Encoding
#> <-
#> <httr2_response>
#> GET https://www.macrotrends.net/stocks/charts/azo/autozone/pe-ratio
#> Status: 200 OK
#> Content-Type: text/html
#> Body: In memory (60149 bytes)
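The two requests above differ only in the User-Agent string. To try several strings without each 403 aborting the script, a minimal sketch could disable httr2's error conversion with req_error() and collect the status codes. The helpers ua_status() and compare_user_agents() are hypothetical names, not part of httr2.

```r
library(httr2)

# Hypothetical helper: return the HTTP status code for one User-Agent string.
ua_status <- function(url, user_agent) {
  request(url) |>
    req_user_agent(user_agent) |>
    # Never treat a response as an error, so a 403 is returned, not thrown
    req_error(is_error = function(resp) FALSE) |>
    req_perform() |>
    resp_status()
}

# Hypothetical helper: probe several User-Agent strings against one URL and
# return a named integer vector of status codes.
compare_user_agents <- function(url, user_agents) {
  vapply(
    user_agents,
    function(ua) as.integer(ua_status(url, ua)),
    integer(1)
  )
}
```

Calling compare_user_agents(url_, c("RStudio Desktop", "RStudio.Desktop")) should then show the two statuses side by side. Going by the output above that would be 403 and 200, but the site's filtering may have changed since.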
When switching to rvest::session(), things magically work, because the default user agent of session() requests is different:
libcurl/7.84.0 r-curl/5.0.2 httr/1.4.7
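A minimal sketch of the session() route, assuming the site still accepts the httr default user agent shown above. The helper name read_cells_via_session() is hypothetical, and the CSS selector is the one from the question.

```r
library(rvest)

# Hypothetical helper: fetch a page through rvest::session() (which sends
# httr's default User-Agent rather than the RStudio one) and return the
# text of the selected table cells.
read_cells_via_session <- function(url,
                                   css = "thead+ thead th , #style-1 td") {
  session(url) |>
    read_html() |>        # parse the page held by the session
    html_elements(css) |>
    html_text2()
}
```

Calling read_cells_via_session("https://www.macrotrends.net/stocks/charts/azo/autozone/pe-ratio") should return a character vector with the four header cells followed by the data cells, provided the request is not blocked.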
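So either use session() instead of read_html(), as suggested in the comments, or request the page content some other way with a different user agent. Either way, you can still parse the response with rvest. Here is an example with httr2: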
library(rvest)
library(httr2)
# make a request with httr2 ..
request("https://www.macrotrends.net/stocks/charts/azo/autozone/pe-ratio") |>
req_user_agent("libcurl") |>
req_perform() |>
resp_body_html() |>
#... and parse with rvest:
html_elements("thead+ thead th , #style-1 td")
#> {xml_nodeset (232)}
#> [1] <th style="text-align:center;">Date</th>
#> [2] <th style="text-align:center;">Stock Price</th>
#> [3] <th style="text-align:center;">TTM Net EPS</th>
#> [4] <th style="text-align:center;">PE Ratio</th>
#> [5] <td style="text-align:center;">2023-09-21</td>
#> [6] <td style="text-align:center;">2530.76</td>
#> [7] <td style="text-align:center;"></td>
#> [8] <td style="text-align:center;">29.36</td>
#> [9] <td style="text-align:center;">2023-08-31</td>
#> [10] <td style="text-align:center;">2531.33</td>
#> [11] <td style="text-align:center;">$86.21</td>
#> [12] <td style="text-align:center;">29.36</td>
#> [13] <td style="text-align:center;">2023-05-31</td>
#> [14] <td style="text-align:center;">2386.84</td>
#> [15] <td style="text-align:center;">$126.72</td>
#> [16] <td style="text-align:center;">18.84</td>
#> [17] <td style="text-align:center;">2023-02-28</td>
#> [18] <td style="text-align:center;">2486.54</td>
#> [19] <td style="text-align:center;">$121.63</td>
#> [20] <td style="text-align:center;">20.44</td>
#> ...
Created on 2023-09-22 with reprex v2.0.2
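To get from the node set back to a data frame like the one in the question, the cell texts can be reshaped four at a time. This is a minimal sketch, not the original author's code. The helper cells_to_df() is a hypothetical name, and the sample vector reuses the first cells from the output above so the block runs offline. With a live response you would pass html_text2() of the node set instead.

```r
# Hypothetical helper: reshape a flat character vector of cell texts
# (n_col header cells followed by rows of n_col values) into a data frame.
cells_to_df <- function(cells, n_col = 4) {
  stopifnot(length(cells) %% n_col == 0, length(cells) > n_col)
  header <- cells[seq_len(n_col)]
  body <- matrix(cells[-seq_len(n_col)], ncol = n_col, byrow = TRUE)
  df <- as.data.frame(body, stringsAsFactors = FALSE)
  names(df) <- make.names(header)   # "Stock Price" becomes "Stock.Price"
  df$Date <- as.Date(df$Date)
  # Strip "$" and thousands separators; empty cells become NA
  num_cols <- setdiff(names(df), "Date")
  df[num_cols] <- lapply(df[num_cols], function(x) {
    x <- gsub("[$,]", "", x)
    x[x == ""] <- NA
    as.numeric(x)
  })
  df
}

# Sample cells copied from the node set printed above.
cells <- c("Date", "Stock Price", "TTM Net EPS", "PE Ratio",
           "2023-09-21", "2530.76", "", "29.36",
           "2023-08-31", "2531.33", "$86.21", "29.36",
           "2023-05-31", "2386.84", "$126.72", "18.84")
results <- cells_to_df(cells)
```

Unlike the seq()-based indexing in the question, this converts the numeric columns to numbers and turns the empty TTM Net EPS cell of the latest row into NA.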