Empty variables when trying to scrape a table from a web page with rvest

Question

I'm new to coding in R and am trying to scrape the following table into a data frame:

https://www.zyxware.com/articles/5363/list-of-fortune-500-companies-and-their-websites-2015

It should be fairly simple, but my variables have 0 observations and I'm not sure what I'm doing wrong. The code I'm using is:

library(tidyverse)
library(rvest)

#set the url of the website
url <- read_html("https://www.zyxware.com/articles/5363/list-of-fortune-500-companies-and-their-websites-2015")

#Scrape variables
rank <- url %>% html_nodes(".td:nth-child(1)") %>% html_text()
company <- url %>% html_nodes(".td:nth-child(2)") %>% html_text()
website <- url %>% html_nodes(".td~ td+ td") %>% html_text()

#Create dataframe
fortune500 <- data.frame(company,rank,website)

I was trying to follow this walkthrough. Any help is much appreciated :)

Answer 1

You can do this by calling html_table() on url and selecting the first element.

library(tidyverse)
library(rvest)
url <- read_html("https://www.zyxware.com/articles/5363/list-of-fortune-500-companies-and-their-websites-2015")
url %>% html_table() %>% pluck(1)
#> # A tibble: 500 × 3
#>     Rank Company            Website                  
#>    <int> <chr>              <chr>                    
#>  1     1 Walmart            www.walmart.com          
#>  2     2 Exxon Mobil        www.exxonmobil.com       
#>  3     3 Chevron            www.chevron.com          
#>  4     4 Berkshire Hathaway www.berkshirehathaway.com
#>  5     5 Apple              www.apple.com            
#>  6     6 General Motors     www.gm.com               
#>  7     7 Phillips 66        www.phillips66.com       
#>  8     8 General Electric   www.ge.com               
#>  9     9 Ford Motor         www.ford.com             
#> 10    10 CVS Health         www.cvshealth.com        
#> # … with 490 more rows

Created on 2023-03-01 with reprex v2.0.2
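As a minimal offline sketch of why the first element has to be selected: html_table() called on a whole document returns a list with one data frame per table on the page. The two-row page below is a hypothetical stand-in built with minimal_html(), not the real article, and the names page, tables and fortune are chosen only for illustration.

```r
library(rvest)

# Hypothetical miniature page (an assumption for illustration, not the real article)
page <- minimal_html('
  <table>
    <tr><th>Rank</th><th>Company</th><th>Website</th></tr>
    <tr><td>1</td><td>Walmart</td><td>www.walmart.com</td></tr>
    <tr><td>2</td><td>Exxon Mobil</td><td>www.exxonmobil.com</td></tr>
  </table>
')

# html_table() on a document returns a list, one entry per <table> found
tables <- html_table(page)
length(tables)

# Take the first (here: only) table; equivalent to pluck(1) without needing purrr
fortune <- tables[[1]]
names(fortune)
```

If a page contains several tables, the index would need to point at the one you actually want, so it can be worth checking length(tables) before relying on the first entry.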

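Alternatively, your original code also works; you only need to remove the period in front of td. In a CSS selector, a leading . identifies a class, so .td was looking for elements with the class td. The page has none, which is why every variable came back empty. Without the leading ., the selector looks for tags named td, which is what you want.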

library(tidyverse)
library(rvest)
url <- read_html("https://www.zyxware.com/articles/5363/list-of-fortune-500-companies-and-their-websites-2015")
rank <- url %>% html_nodes("td:nth-child(1)") %>% html_text()
company <- url %>% html_nodes("td:nth-child(2)") %>% html_text()
website <- url %>% html_nodes("td~ td+ td") %>% html_text()
fortune500 <- data.frame(company,rank,website)
head(fortune500)
#>              company rank                   website
#> 1            Walmart    1           www.walmart.com
#> 2        Exxon Mobil    2        www.exxonmobil.com
#> 3            Chevron    3           www.chevron.com
#> 4 Berkshire Hathaway    4 www.berkshirehathaway.com
#> 5              Apple    5             www.apple.com
#> 6     General Motors    6                www.gm.com

Created on 2023-03-01 with reprex v2.0.2
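The difference between the class selector and the tag selector can be reproduced without hitting the website. This is a small sketch on a hypothetical inline table built with minimal_html(); the names page, by_class and by_tag are illustrative only, and the cells deliberately carry no class attribute, as is assumed for the real page.

```r
library(rvest)

# Hypothetical miniature table (an assumption for illustration); no cell has class="td"
page <- minimal_html('
  <table>
    <tr><td>1</td><td>Walmart</td><td>www.walmart.com</td></tr>
    <tr><td>2</td><td>Exxon Mobil</td><td>www.exxonmobil.com</td></tr>
  </table>
')

# ".td" means "any element with class td": nothing matches
by_class <- html_nodes(page, ".td:nth-child(1)")
length(by_class)
#> [1] 0

# "td" means "any <td> tag": the first cell of each row matches
by_tag <- html_nodes(page, "td:nth-child(1)")
html_text(by_tag)
#> [1] "1" "2"
```

A selector that matches nothing does not raise an error; it quietly returns an empty set, which is why the original data frame ended up with 0 observations rather than failing outright.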
