Scraping in R creates 18 tibbles

Question

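I am trying to learn how to scrape data in R. Using other resources and some help from ChatGPT, I have code that scrapes the NAIA baseball stats tables, but it creates 18 tibbles. It does grab all of the stats, which is great!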

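When I use bind_rows() to put everything into one df, I get 3582 observations (199 teams * 18 tibbles) and several NA values in the columns. It looks like bind_rows() creates 199 rows with the stats from the first tibble, then 199 rows with the stats from the second tibble, and so on until all 18 tibbles have been stacked.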
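The stacking behaviour described here can be reproduced on toy data. This is a minimal sketch; the team names, the column names, and the tibbles `batting` and `pitching` are made up for illustration and are not taken from the scraped page.

```r
library(dplyr)

# Hypothetical stand-ins for two of the scraped tables
batting  <- tibble(Team = c("A", "B"), avg = c(0.300, 0.280))
pitching <- tibble(Team = c("A", "B"), era = c(3.5, 4.1))

# bind_rows() appends rows; columns missing from a table are filled with NA
stacked <- bind_rows(batting, pitching)

nrow(stacked)            # 4 rows: 2 teams * 2 tables
sum(is.na(stacked$era))  # 2: the batting rows have no era value
```

Each team appears once per table, which matches the 199 * 18 row count seen with the real data.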

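I would like to create a df with 199 rows and all of the stats. I have attached an image showing that the df holds the first set of stats, and then the list of teams starts over with the second set of stats.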

library(dplyr)
library(rvest)

url <- "https://naiastats.prestosports.com/sports/bsb/2022-23/teams"

page <- read_html(url)

data <- page %>% html_nodes("table") %>% html_table()

combined_data <- bind_rows(data)

Answer 1

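You probably want to use left_join() from "dplyr" to join tables 2 through N onto the first table, rather than using bind_rows(). The problem you are running into (as MrFlick mentioned) is that there are 2 different sets of stats, and some column headers are used in more than one table.

In this code I use only the first 9 tables, which should be the overall records. For the conference records, use tables 10 through 18.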
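To see why repeated column headers matter, here is a small sketch on made-up data. The tibbles `first` and `second` and their columns are hypothetical. When both tables carry a non-key column with the same name, left_join() keeps both copies and adds the default `.x` and `.y` suffixes.

```r
library(dplyr)

# Hypothetical tables that share the non-key column "gp"
first  <- tibble(Team = c("A", "B"), gp = c(50, 48), avg = c(0.300, 0.280))
second <- tibble(Team = c("A", "B"), gp = c(50, 48), hr = c(55, 40))

joined <- left_join(first, second, by = "Team")
names(joined)  # "Team" "gp.x" "avg" "gp.y" "hr"

# Keep one copy of the shared column and restore its original name
cleaned <- select(joined, !ends_with(".y"))
names(cleaned) <- sub("\\.x$", "", names(cleaned))
names(cleaned)  # "Team" "gp" "avg" "hr"
```

The row count stays at one row per team because the join matches on Team instead of appending rows.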

library(dplyr)
library(rvest)

url <- "https://naiastats.prestosports.com/sports/bsb/2022-23/teams"

page <- read_html(url)

data <- page %>% html_nodes("table") %>% html_table()

# Get the first table and remove the ranking column
output <- data[[1]][, -1]
# For tables 2 to 9 - season records only
for (i in 2:9) {
   # Join the next table to the master table;
   # duplicate columns are renamed with .x and .y suffixes
   output <- left_join(output, data[[i]][, -1], by = join_by(Team == Team))
   # Remove the duplicate columns (the .y copies)
   output <- select(output, !ends_with(".y"))
   # Restore the original column names by stripping the trailing .x
   # (the dot is escaped and anchored so only the suffix is removed)
   colnames(output) <- sub("\\.x$", "", colnames(output))
}

The code above will produce a data frame 199 rows long and 46 variables wide.
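A possible variation, sketched on toy data, drops columns that are already present before each join, so no `.x`/`.y` suffixes are created. The helper `join_new_columns` and the tibbles `t1` to `t3` are hypothetical names introduced for this example. It assumes every table has a Team column and that repeated columns carry the same values in each table.

```r
library(dplyr)

# Hypothetical helper: join only the columns that acc does not have yet
join_new_columns <- function(acc, tbl) {
  new_cols <- setdiff(names(tbl), names(acc))
  left_join(acc, select(tbl, Team, all_of(new_cols)), by = "Team")
}

# Made-up tables; t2 repeats "gp" and lists the teams in a different order
t1 <- tibble(Team = c("A", "B"), gp = c(50, 48), avg = c(0.300, 0.280))
t2 <- tibble(Team = c("B", "A"), gp = c(48, 50), hr = c(40, 55))
t3 <- tibble(Team = c("A", "B"), era = c(3.5, 4.1))

# Fold the list of tables into one wide table, one row per team
combined <- Reduce(join_new_columns, list(t1, t2, t3))
names(combined)  # "Team" "gp" "avg" "hr" "era"
```

With the scraped data, the list passed to Reduce() would be the first 9 tables with their ranking column removed, as in the loop above.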
