distinct() 不适用于 R 中的数据帧

问题描述 投票:0回答:1

我正在尝试从有关 COVID 19 病例的表格中获取具体值,并且我想找到每个月(有 4 个月)的最高病例数。

我创建了几个月的新列,然后重新组织了数据框,以便每个月的所有行均按案例降序排列,但是,当我尝试对月份使用不同的每月仅获取一行时,它确实如此没有什么。我是否使用了错误的功能?My Code

r dataframe dplyr distinct
1个回答
0
投票
# since you didn't provide a sample dataset, we download Our World in Data's COVID-19 dataset and transform it into something similar
country_date_cases <- read_csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/cases_deaths/new_cases.csv") |>
  pivot_longer(-date, names_to = "Country.Region", values_to = "Cases") |> 
  filter(year(date) == 2020 & between(month(date), 1, 4))

   date       Country.Region      Cases
   <date>     <chr>               <dbl>
 1 2020-01-03 World                   0
 2 2020-01-03 Afghanistan             0
 3 2020-01-03 Africa                  0
 4 2020-01-03 Albania                 0
 5 2020-01-03 Algeria                 0
 6 2020-01-03 American Samoa          0
 7 2020-01-03 Andorra                 0
 8 2020-01-03 Angola                  0
 9 2020-01-03 Anguilla                0
10 2020-01-03 Antigua and Barbuda     0
# ℹ 29,264 more rows

country_date_cases |> 
  mutate(Month = month(date)) |> 
  arrange(Month, desc(Cases)) |> 
  distinct(Country.Region, Month, .keep_all = TRUE) # just using Month gets the largest value for each month, which in my dataset, which includes the world total, is obviously the world total. If we include Country.Region in there too, we get the largest value for each country in each month.

# A tibble: 984 × 4
   date       Country.Region      Cases Month
   <date>     <chr>               <dbl> <dbl>
 1 2020-01-31 World                2015     1
 2 2020-01-31 Asia                 2010     1
 3 2020-01-31 Upper middle income  1990     1
 4 2020-01-31 China                1984     1
 5 2020-01-31 High income            20     1
 6 2020-01-27 Thailand                7     1
 7 2020-01-25 North America           5     1
 8 2020-01-25 United States           5     1
 9 2020-01-26 Europe                  5     1
10 2020-01-26 European Union          5     1
# ℹ 974 more rows
# ℹ Use `print(n = ...)` to see more rows

# so this is equivalent to slice_max. However, with that said, as Seth has already commented, slice_max is the preferred option, as it is shorter, and more readable

country_date_cases |> 
  mutate(Month = month(date)) |>
  slice_max(Cases, n = 1, by = c(Country.Region, Month), with_ties = FALSE) # with slice_max, we also have the option of including ties, which I didn't do here. If you wanted the Country.Region with the largest number of cases in each month, you could change this to `by = Month`
© www.soinside.com 2019 - 2024. All rights reserved.