R aggregate() and distinct() functions only clean some of my data


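I'm currently struggling to impute or remove rows in R that are mostly duplicates... except for three columns. I'm working with a mortality dataset where country/age group/sex/year/other categorical variables are all identical... but multiple rows still show different death counts/percentages/deaths per capita. (The data I'm trying to use is here: https://www.kaggle.com/ds/4597596)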

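I tried using dplyr functions to select only specific columns. I also tried using the aggregate function to average the death counts and eliminate the duplicate rows. I still don't seem to be getting it right, so I'm not sure what I'm doing wrong here.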

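(For example, after doing all of this, I get two rows along the lines of "United States of America, Male, Age Group 25-34, Year=2020", but one row has a death count of 3465 and the other has a count of 3417. All other information is identical except for that one column. It seems to work fine for data near the top of the data frame, so I'm not sure why this is happening.)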

#dplyr    
Death %>% distinct(Death$RegionCode, Death$RegionName, Death$CountryCode, Death$CountryName, Death$Year, Death$Sex, Death$AgeGroup,.keep_all = TRUE)

#aggregate function
unalived1 <- aggregate(Death$SuicideCount,by=list(RegionName=Death$RegionName, CountryName=Death$CountryName, Year=Death$Year, Sex=Death$Sex, AgeGroup=Death$AgeGroup, CauseSpecificDeathPercentage=Death$CauseSpecificDeathPercentage, DeathRatePer100K=Death$DeathRatePer100K, Population=Death$Population, GDP=Death$GDP, GDPPerCapita=Death$GDPPerCapita, GrossNationalIncome=Death$GrossNationalIncome, GNIPerCapita=Death$GNIPerCapita, InflationRate=Death$InflationRate, EmploymentPopulationRatio=Death$EmploymentPopulationRatio),FUN=mean)

unalived2 <- aggregate(unalived1$CauseSpecificDeathPercentage,by=list(RegionName=unalived1$RegionName, CountryName=unalived1$CountryName, Year=unalived1$Year, Sex=unalived1$Sex, AgeGroup=unalived1$AgeGroup, SuicideCount=unalived1$x, DeathRatePer100K=unalived1$DeathRatePer100K, Population=unalived1$Population, GDP=unalived1$GDP, GDPPerCapita=unalived1$GDPPerCapita, GrossNationalIncome=unalived1$GrossNationalIncome, GNIPerCapita=unalived1$GNIPerCapita, InflationRate=unalived1$InflationRate, EmploymentPopulationRatio=unalived1$EmploymentPopulationRatio),FUN=mean)

unalived3 <- aggregate(unalived2$DeathRatePer100K,by=list(RegionName=unalived2$RegionName, CountryName=unalived2$CountryName, Year=unalived2$Year, Sex=unalived2$Sex, AgeGroup=unalived2$AgeGroup, SuicideCount=unalived2$SuicideCount, CauseSpecificDeathPercentage=unalived2$x, Population=unalived2$Population, GDP=unalived2$GDP, GDPPerCapita=unalived2$GDPPerCapita, GrossNationalIncome=unalived2$GrossNationalIncome, GNIPerCapita=unalived2$GNIPerCapita, InflationRate=unalived2$InflationRate, EmploymentPopulationRatio=unalived2$EmploymentPopulationRatio),FUN=mean)

unalived4 <- na.omit(unalived3)

unalived <- unalived4


US_Deaths <- unalived[unalived$CountryName %in% c("United States of America"),]
US_Deaths_Male <- US_Deaths[US_Deaths$Sex %in% c("Male"),]
US_Deaths_Male_2534 <- US_Deaths_Male[US_Deaths_Male$AgeGroup %in% c("25-34 years"),]

After doing all of the above, I still get duplicate rows where only a few columns have different values.

I'd appreciate any insight into how to approach this properly. I'd prefer to impute the data, but eliminating rows is fine too.

r dplyr data-cleaning imputation
1 Answer

First, remove the exact duplicates:

library(dplyr)
Death <- Death %>% distinct()
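
To illustrate how the two forms of `distinct()` differ, here is a minimal sketch on a made-up data frame (`toy` and its values are invented for this example and are not taken from the Kaggle data). Note that column names are passed bare, without a `Death$` prefix.

```r
library(dplyr)

# Hypothetical toy data: rows 1 and 2 are exact duplicates,
# row 3 shares the same keys but has a different SuicideCount.
toy <- data.frame(
  CountryName  = c("A", "A", "A", "B"),
  Year         = c(2020, 2020, 2020, 2020),
  SuicideCount = c(10, 10, 12, 5)
)

# Drops only rows that are identical in every column
exact <- toy %>% distinct()

# Keeps the first row for each CountryName/Year combination
first_per_key <- toy %>% distinct(CountryName, Year, .keep_all = TRUE)

nrow(exact)
nrow(first_per_key)
```

With `.keep_all = TRUE`, the values kept for the non-key columns come from whichever row appears first, so the differing counts are discarded rather than averaged.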

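Then summarise: group by variables similar to the ones you chose, and take the mean of the remaining columns. I'm not sure whether this approach makes sense, but it does remove the duplicates.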

Death <- Death %>%
  group_by(
    RegionName,
    CountryName,
    Year,
    Sex,
    AgeGroup,
    Population,
    GDP,
    GDPPerCapita,
    GrossNationalIncome,
    GNIPerCapita,
    InflationRate,
    EmploymentPopulationRatio
  ) %>%
  summarise(
    SuicideCount = mean(SuicideCount),
    CauseSpecificDeathPercentage = mean(CauseSpecificDeathPercentage),
    DeathRatePer100K = mean(DeathRatePer100K),
    .groups = "drop"
  )
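
One likely reason the `aggregate()` calls in the question left duplicates behind is that measurement columns such as `CauseSpecificDeathPercentage` and `DeathRatePer100K` were included in the `by` list; rows that differ in a grouping column end up in separate groups and cannot be merged. The sketch below shows this on a made-up data frame. The two counts are the ones quoted in the question, while `toy` and the other values are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data: same keys for "A", different measurements
toy <- data.frame(
  CountryName      = c("A", "A", "B"),
  Year             = c(2020, 2020, 2020),
  SuicideCount     = c(3465, 3417, 100),
  DeathRatePer100K = c(15.2, 15.0, 4.0)
)

# Grouping by a measurement column as well: the two "A" rows differ in
# DeathRatePer100K, so they fall into separate groups and are not merged.
too_many_keys <- toy %>%
  group_by(CountryName, Year, DeathRatePer100K) %>%
  summarise(SuicideCount = mean(SuicideCount), .groups = "drop")

# Grouping by identifier columns only and averaging every measurement
keys_only <- toy %>%
  group_by(CountryName, Year) %>%
  summarise(
    across(c(SuicideCount, DeathRatePer100K), ~ mean(.x, na.rm = TRUE)),
    .groups = "drop"
  )

nrow(too_many_keys)
nrow(keys_only)
```

Whether `na.rm = TRUE` is appropriate depends on how missing values should be treated in the real data; it is used here only as one possible choice.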

Checking the example you mentioned:

Death %>%
  filter(
    CountryName == "United States of America",
    AgeGroup == "25-34 years",
    Year == 2020,
    Sex == "Male"
  ) %>%
  nrow()

The result is 1, which shows that this particular case has no duplicates.
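
To check every key combination rather than a single case, one option is to count rows per combination and keep those that occur more than once. This is a sketch on a made-up stand-in for the data (`toy`, its values, and the `n_rows` column name are assumptions); on the real data the same `count()` call would be applied to `Death`.

```r
library(dplyr)

# Hypothetical stand-in for the Death data, with one repeated key combination
toy <- data.frame(
  CountryName  = c("A", "A", "B"),
  Year         = c(2020, 2020, 2020),
  Sex          = c("Male", "Male", "Male"),
  AgeGroup     = c("25-34 years", "25-34 years", "25-34 years"),
  SuicideCount = c(3465, 3417, 100)
)

# Count rows per key combination, then keep combinations seen more than once
dupes <- toy %>%
  count(CountryName, Year, Sex, AgeGroup, name = "n_rows") %>%
  filter(n_rows > 1)

dupes
```

An empty result means no key combination is repeated.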
