How can I scrape Google Finance with R when the data spans multiple views but the page URL does not change?


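I want to use R to scrape a stock's financial tables for different periods. I can get the table for the latest period, which is displayed by default, but I also want the data for earlier periods. How can I do that? Here is the code I used: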

# Load libraries 

library(tidyverse)
library(rvest)
library(readxl)
library(magrittr)

google_finance <- read_html("https://www.google.com/finance/quote/AAPL:NASDAQ?") |> 
  html_node(".UulDgc") |> 
  html_table()

The result is:

> google_finance |> 
+   head(5)
# A tibble: 5 × 3
  `(USD)`          Mar 2024infoFiscal Q…¹ `Y/Y change`
  <chr>            <chr>                  <chr>       
1 "RevenueThe tot… 90.75B                 -4.31%      
2 "Operating expe… 14.37B                 5.22%       
3 "Net incomeComp… 23.64B                 -2.17%      
4 "Net profit mar… 26.04                  2.20%       
5 "Earnings per s… 1.53                   0.66% 

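As you can see, only the financial table for the latest period (March 2024) is returned. What can we do to scrape the financial tables for the earlier periods as well?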

r web-scraping html-table rvest google-finance
1 Answer

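I think you need to use RSelenium for this. It launches a browser and clicks the buttons for you. Here I use Firefox as the browser; you may need to change some of the defaults to get the browser set up correctly. You will also need to have the Java SDK installed.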
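Before starting a session, it may help to confirm that the prerequisites are in place. The following is only a minimal sketch: it assumes Java is expected on the PATH, and the variable names `has_java` and `has_rselenium` are made up for illustration.

```r
# Sketch only: check the prerequisites before launching a browser session.
# Assumption: the Java executable is discoverable on the PATH.
has_java <- unname(nzchar(Sys.which("java")))

# requireNamespace() returns TRUE/FALSE without attaching the package.
has_rselenium <- requireNamespace("RSelenium", quietly = TRUE)

if (!has_java) {
  message("Java was not found on the PATH; install a Java SDK first.")
}
if (!has_rselenium) {
  message("RSelenium is not installed; try install.packages(\"RSelenium\").")
}
```

If either message appears, the `rsDriver()` call is unlikely to succeed until that item is fixed.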

library(RSelenium)
library(rvest)
library(glue)

# Initiate a Remote Driver using Firefox; on first run this step may also
# download some supporting binary files.
rd <- rsDriver(browser = "firefox", chromever = NULL)

# Assign client
remDr <- rd$client

url <- "https://www.google.com/finance/quote/AAPL:NASDAQ"

# Extract names of buttons
aapl_html <- read_html(url)

btn_names <- aapl_html %>% 
  html_node(".zsnTKc") %>% 
  html_attr("aria-owns") %>% 
  strsplit(., split = " ") %>% 
  unlist()

# Using the Remote Driver, navigate to url of interest  
remDr$navigate(url)

# In a loop, find button of interest by its xpath, click and extract table

df_ls <- lapply(
  X = btn_names
  ,FUN = function(x) {
    
    # Find button using xPath
    btn <- remDr$findElement(using = "xpath", glue("//*[@id='{x}']"))
    
    # Nifty trick to visually see which button is being clicked
    btn$highlightElement()  
    
    # Click the button
    btn$clickElement()
    
    # Wait for elements to complete loading
    Sys.sleep(1)
    
    # Read HTML after each button is clicked
    rem_aapl_html <- remDr$getPageSource()[[1]]
    
    # Extract the table and return it as the result of this iteration
    rem_aapl_html %>% 
      read_html() %>% 
      html_node(".slpEwd") %>% 
      html_table()
  }
)
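The loop yields a list with one table per button. One possible way to stack them into a single data frame is sketched below. `combine_tables()` is a hypothetical helper, not part of RSelenium or rvest, and it assumes each table has the same three columns as the output shown in the question (metric, value, year-over-year change). The demo values are made-up placeholders, not real figures.

```r
# Hypothetical helper: stack per-button tables into one long data frame.
# Assumption: every table has exactly three columns, as in the question's output.
combine_tables <- function(tbl_list, ids) {
  stopifnot(length(tbl_list) == length(ids))
  pieces <- Map(function(tbl, id) {
    tbl <- as.data.frame(tbl, stringsAsFactors = FALSE)
    stopifnot(ncol(tbl) == 3)
    # The second column's header carries the period label, so keep it.
    period_header <- names(tbl)[2]
    names(tbl) <- c("metric", "value", "yoy_change")
    tbl$period_header <- period_header
    tbl$button_id <- id
    tbl
  }, tbl_list, ids)
  out <- do.call(rbind, unname(pieces))
  rownames(out) <- NULL
  out
}

# Mock input with placeholder values, standing in for df_ls and btn_names.
demo <- list(
  data.frame(`(USD)` = c("Revenue", "Net income"),
             `Mar 2024` = c("1.00B", "2.00B"),
             `Y/Y change` = c("1.00%", "2.00%"),
             check.names = FALSE),
  data.frame(`(USD)` = c("Revenue", "Net income"),
             `Dec 2023` = c("3.00B", "4.00B"),
             `Y/Y change` = c("3.00%", "4.00%"),
             check.names = FALSE)
)
combined <- combine_tables(demo, c("btn-1", "btn-2"))
```

With the scraped data this would be called as `combine_tables(df_ls, btn_names)`. Once scraping is finished, `remDr$close()` and `rd$server$stop()` shut down the browser session and the Selenium server.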