I'm new to coding in R and am trying to scrape the following table into a data frame:
https://www.zyxware.com/articles/5363/list-of-fortune-500-companies-and-their-websites-2015
It should be fairly simple, but my variables end up with 0 observations and I'm not sure what I'm doing wrong. The code I'm using is:
library(tidyverse)
library(rvest)
#set the url of the website
url <- read_html("https://www.zyxware.com/articles/5363/list-of-fortune-500-companies-and-their-websites-2015")
#Scrape variables
rank <- url %>% html_nodes(".td:nth-child(1)") %>% html_text()
company <- url %>% html_nodes(".td:nth-child(2)") %>% html_text()
website <- url %>% html_nodes(".td~ td+ td") %>% html_text()
#Create dataframe
fortune500 <- data.frame(company,rank,website)
I've been trying to follow this walkthrough. Any help is much appreciated :)
You can do this by calling html_table() on url and selecting the first element.
library(tidyverse)
library(rvest)
url <- read_html("https://www.zyxware.com/articles/5363/list-of-fortune-500-companies-and-their-websites-2015")
url %>% html_table() %>% pluck(1)
#> # A tibble: 500 × 3
#> Rank Company Website
#> <int> <chr> <chr>
#> 1 1 Walmart www.walmart.com
#> 2 2 Exxon Mobil www.exxonmobil.com
#> 3 3 Chevron www.chevron.com
#> 4 4 Berkshire Hathaway www.berkshirehathaway.com
#> 5 5 Apple www.apple.com
#> 6 6 General Motors www.gm.com
#> 7 7 Phillips 66 www.phillips66.com
#> 8 8 General Electric www.ge.com
#> 9 9 Ford Motor www.ford.com
#> 10 10 CVS Health www.cvshealth.com
#> # … with 490 more rows
Created on 2023-03-01 with reprex v2.0.2
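To see why selecting the first element is needed, here is a minimal offline sketch. It assumes rvest 1.0 or later and uses a small hypothetical HTML fragment in place of the real page, so the two rows are illustrative only. Called on a whole document, html_table() returns a list with one entry per table on the page, which is why the answer picks the first entry with pluck(1).

```r
library(rvest)

# Hypothetical fragment standing in for the real page (illustrative rows only)
page <- minimal_html('
  <table>
    <tr><th>Rank</th><th>Company</th><th>Website</th></tr>
    <tr><td>1</td><td>Walmart</td><td>www.walmart.com</td></tr>
    <tr><td>2</td><td>Exxon Mobil</td><td>www.exxonmobil.com</td></tr>
  </table>
')

# On a document, html_table() returns a list: one element per <table>
tables <- html_table(page)
length(tables)

# Take the first table; equivalent to purrr::pluck(tables, 1)
first_table <- tables[[1]]
names(first_table)
nrow(first_table)
```

If a page has several tables, it is worth checking length(tables) and inspecting each entry before assuming the one you want is first.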
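Alternatively, your original code will also work; you just need to remove the period in front of td. In a CSS selector, a leading . identifies a class, so you were asking for elements with the class td, and no element on the page has that class. Without the leading ., the selector looks for tags named td, which is what you want.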
library(tidyverse)
library(rvest)
url <- read_html("https://www.zyxware.com/articles/5363/list-of-fortune-500-companies-and-their-websites-2015")
rank <- url %>% html_nodes("td:nth-child(1)") %>% html_text()
company <- url %>% html_nodes("td:nth-child(2)") %>% html_text()
website <- url %>% html_nodes("td~ td+ td") %>% html_text()
fortune500 <- data.frame(company,rank,website)
head(fortune500)
#> company rank website
#> 1 Walmart 1 www.walmart.com
#> 2 Exxon Mobil 2 www.exxonmobil.com
#> 3 Chevron 3 www.chevron.com
#> 4 Berkshire Hathaway 4 www.berkshirehathaway.com
#> 5 Apple 5 www.apple.com
#> 6 General Motors 6 www.gm.com
Created on 2023-03-01 with reprex v2.0.2
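The difference between the class selector and the tag selector can be checked without hitting the network. The sketch below assumes rvest 1.0 or later and uses a hypothetical two-row fragment rather than the real page. None of its cells carry a class attribute, which I am assuming mirrors the table in question.

```r
library(rvest)

# Hypothetical fragment; the <td> cells have no class attribute
page <- minimal_html('
  <table>
    <tr><td>1</td><td>Walmart</td><td>www.walmart.com</td></tr>
    <tr><td>2</td><td>Exxon Mobil</td><td>www.exxonmobil.com</td></tr>
  </table>
')

# ".td" means class="td": nothing matches, hence 0 observations
class_hits <- page %>% html_nodes(".td:nth-child(1)") %>% html_text()
length(class_hits)

# "td" means the <td> tag: the first cell of every row matches
tag_hits <- page %>% html_nodes("td:nth-child(1)") %>% html_text()
tag_hits

# "td~ td+ td" matches a cell with at least two <td> siblings before it,
# which in a three-column row is only the last cell
site_hits <- page %>% html_nodes("td~ td+ td") %>% html_text()
site_hits
```

An empty result from html_nodes() is not an error, so a mistyped selector fails silently; checking length() on each vector before building the data frame catches this early.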