How to keep rows that meet a condition and remove the others

Question (votes: 2, answers: 1)

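I am working with taxonomic data, and the data goes through a second processing step before I plot it. For that step I need to keep only the rows that match a condition, and this is where I am stuck, because I do not want to do it by hand. My data: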

x <- data.frame("Phylum" = c("Chordata", "Chordata", "Chordata", "Chordata", "Chordata", "Chordata"),
                "Class" = c("NA", "Actinopterygii", "Actinopterygii", "Actinopterygii", "Actinopterygii", "Actinopterygii"),
                "Order" = c("NA", "NA", "Gadiformes", "Gadiformes", "Gadiformes", "Gadiformes"), 
                "Family" = c("NA", "NA", "NA", "Moridae", "Moridae", "Moridae"), 
                "Genus" = c("NA", "NA", "NA", "NA", "Notophycis", "Notophycis"), 
                "Species" = c("NA", "NA", "NA", "NA", "NA", "Notophycis marginata"),
                 Number = c(21616, 12123, 1497, 730,730,730))

Desired final result:

y <- data.frame("Phylum" = c("Chordata", "Chordata", "Chordata", "Chordata"), 
                "Class" = c("NA", "Actinopterygii", "Actinopterygii", "Actinopterygii"), 
                "Order" = c("NA", "NA", "Gadiformes", "Gadiformes"), "Family" = c("NA", "NA", "NA", "Moridae"), 
                "Genus" = c("NA", "NA", "NA", "Notophycis"), "Species" = c("NA", "NA", "NA", "Notophycis marginata"), 
                 Number = c(9493, 10626, 767, 730))

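This is a simple subset taken from a much larger and more complex dataset. So I am hoping to put something like this into code:

  • sum of Number(Phylum == "P1" & Class == "NA") - sum of Number(Class == "C1" & Order == "NA"), if the Phylum matches; this would be the new Number for P1
  • sum of Number(Class == "C1" & Order == "NA") - sum of Number(Order == "O1" & Family == "NA"), if the Class matches; this would be the new Number for C1, and so on...

But if the Number is the same across several rows, I need the code to evaluate those rows, pick the one with the fewest NAs, and keep that row with its Number...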
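
The rule can be checked by hand on the small example. The sketch below is only an illustration and assumes, as in x, that the rows form a single chain ordered from the shallowest rank to the deepest, so each row's only child is the next row. The names number, remainder and kept are made up for this example.

```r
# Number column of x, ordered from Phylum-only down to Species
number <- c(21616, 12123, 1497, 730, 730, 730)

# Subtract the next (one rank deeper) row from each row; the last row has
# nothing below it, so nothing is subtracted from it
remainder <- number - c(number[-1], 0)

# Rows whose reads are fully explained by a deeper row end up at 0 and are
# dropped, which leaves the deepest row of each tie (the one with fewest NAs)
kept <- remainder[remainder > 0]
kept
```

The kept values agree with the Number column of y. A real dataset has many branches, so the shift by one row would have to be replaced by a match on the shared taxonomy.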

I think I want to write a function to do this, but I have no idea where to start!

Thanks for the help :)

UPDATE

Test data (Tester):

Phylum         Class           Order              Family            Genus          Species                     Reads_sum
Chordata       Elasmobranchii  Carcharhiniformes  NA                NA             NA                          31
Chordata       Actinopterygii  Perciformes        Scombridae        NA             NA                          589
Chordata       Elasmobranchii  Carcharhiniformes  Pentanchidae      NA             NA                          31
Chordata       Actinopterygii  Myctophiformes     Myctophidae       Notoscopelus   NA                          208
Chordata       Actinopterygii  Perciformes        Scombridae        Katsuwonus     NA                          589
Chordata       Actinopterygii  Myctophiformes     Myctophidae       Notoscopelus   Notoscopelus caudispinosus  178
Chordata       Actinopterygii  Perciformes        Scombridae        Katsuwonus     Katsuwonus pelamis          589
Cnidaria       Hydrozoa        Leptothecata       Plumulariidae     NA             NA                          69
Cnidaria       Hydrozoa        Leptothecata       Plumulariidae     Plumularia     NA                          69
Echinodermata  Ophiuroidea     NA                 NA                NA             NA                          146
Echinodermata  Ophiuroidea     Ophiurida          NA                NA             NA                          137
Echinodermata  Ophiuroidea     Ophiurida          Ophiuridae        NA             NA                          137
Echinodermata  Ophiuroidea     Ophiurida          Ophiuridae        Ophioplinthus  NA                          137
Echinodermata  Ophiuroidea     Ophiurida          Ophiuridae        Ophioplinthus  Ophioplinthus accomodata    137
Mollusca       Cephalopoda     Oegopsida          Ommastrephidae    NA             NA                          34311
Ochrophyta     Phaeophyceae    Ectocarpales       Acinetosporaceae  NA             NA                          29

This code does what I want, but I have to change the variables every time:

Tester$Reads_sum[Tester$Class == "Ophiuroidea" & Tester$Order == "NA"] - sum(Tester$Reads_sum[Tester$Class == "Ophiuroidea" & Tester$Order != "NA" & Tester$Family == "NA"])

So I was hoping something like this would work, where I would only need to swap Class for whichever other taxonomic rank I select:

for (i in unique(Tester$Class)){
  Tester$Test.1 <- ifelse(Tester$Class != "NA" & Tester$Order == "NA", 
                           Tester$Reads_sum[Tester$Class == i & Tester$Order == "NA"] - sum(Tester$Reads_sum[Tester$Class == i & Tester$Order != "NA" & Tester$Family == "NA"]), 0)
  }

But it gives me NA instead of 9.
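
One likely cause, assuming the missing ranks are stored as the string "NA": the loop overwrites the whole Test.1 column on every pass, and for any class that has no Order == "NA" row the subtraction has length zero, which ifelse() recycles to NA. The toy sketch below reproduces this on made-up data; the data frame toy and the column Fixed are hypothetical names used only for illustration.

```r
# Toy data: class A has a parent row (Order == "NA"), class B does not
toy <- data.frame(Class = c("A", "A", "B"),
                  Order = c("NA", "O1", "O2"),
                  Reads_sum = c(10, 4, 7),
                  stringsAsFactors = FALSE)

# Same pattern as the loop above: the column is rebuilt on every iteration
for (i in unique(toy$Class)) {
  toy$Test.1 <- ifelse(toy$Order == "NA",
                       toy$Reads_sum[toy$Class == i & toy$Order == "NA"] -
                         sum(toy$Reads_sum[toy$Class == i & toy$Order != "NA"]),
                       0)
}
toy$Test.1  # the value computed for class A is overwritten by the pass for B

# One possible fix: assign only into the rows that belong to the current class
toy$Fixed <- 0
for (i in unique(toy$Class)) {
  parent <- toy$Class == i & toy$Order == "NA"
  child  <- toy$Class == i & toy$Order != "NA"
  if (any(parent)) {
    toy$Fixed[parent] <- toy$Reads_sum[parent] - sum(toy$Reads_sum[child])
  }
}
toy$Fixed
```

If the missing ranks are real NA values rather than strings, comparisons such as Order == "NA" return NA themselves, which would also produce NA in the result.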

The final data should look like this:

Phylum         Class           Order              Family            Genus          Species                     Reads_sum
Chordata       Elasmobranchii  Carcharhiniformes  Pentanchidae      NA             NA                          31
Chordata       Actinopterygii  Myctophiformes     Myctophidae       Notoscopelus   NA                          30
Chordata       Actinopterygii  Myctophiformes     Myctophidae       Notoscopelus   Notoscopelus caudispinosus  178
Chordata       Actinopterygii  Perciformes        Scombridae        Katsuwonus     Katsuwonus pelamis          589
Cnidaria       Hydrozoa        Leptothecata       Plumulariidae     Plumularia     NA                          69
Echinodermata  Ophiuroidea     NA                 NA                NA             NA                          9
Echinodermata  Ophiuroidea     Ophiurida          Ophiuridae        Ophioplinthus  Ophioplinthus accomodata    137
Mollusca       Cephalopoda     Oegopsida          Ommastrephidae    NA             NA                          34311
Ochrophyta     Phaeophyceae    Ectocarpales       Acinetosporaceae  NA             NA                          29

Tags: r, function, loops, for-loop, split

1 answer (0 votes)

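Thanks for the update. I have come up with something that I think does what you are asking for, but I need you to confirm a few things.

Am I right in thinking of this as a data tree, in the order c("Phylum", "Class", "Order", "Family", "Genus", "Species")? And that for each level of the tree, you want to subtract the values of the layer below?

I hope my code isn't too confusing; I found the data hard to work with in its current format. I prefer to split it up by level of the tree, i.e. from the rows that only have Phylum data all the way to the rows that have every level of the tree. For this I am most comfortable using the data.table package.

I have used lapply calls, since I find them easy to interpret once you have used them a lot. I am sure there is a more efficient solution, but as a starter I think it is more important to see and understand the steps required.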
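
To make "splitting by level of the tree" concrete before the data.table version, here is a minimal base R sketch. It is not the approach used in the code that follows; it assumes missing ranks are stored as the string "NA", and the data frame toy is a made-up three-row excerpt of Tester.

```r
ranks <- c("Phylum", "Class", "Order", "Family", "Genus", "Species")

# Hypothetical three-row excerpt of Tester
toy <- data.frame(
  Phylum    = c("Echinodermata", "Echinodermata", "Mollusca"),
  Class     = c("Ophiuroidea", "Ophiuroidea", "Cephalopoda"),
  Order     = c("NA", "Ophiurida", "Oegopsida"),
  Family    = c("NA", "NA", "Ommastrephidae"),
  Genus     = c("NA", "NA", "NA"),
  Species   = c("NA", "NA", "NA"),
  Reads_sum = c(146, 137, 34311),
  stringsAsFactors = FALSE
)

# Depth of each row = number of ranks that are filled in
depth <- rowSums(toy[, ranks] != "NA")

# One data frame per tree level, named by depth
by_level <- split(toy, depth)
names(by_level)
```

Each element of by_level then holds the rows that sit at the same depth, which is the same grouping the melt() and split() steps build with data.table.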

# Using the data.table package, as I find it quicker and easier to work with
# for complex problems. Uncomment and run the line below if you don't have it:
# install.packages("data.table")
library(data.table)

# Turn it into a data.table: similar to a data.frame, with some differences.
dt <- as.data.table(Tester)
# Make an id, which I will use to split up this data. Different rows have
# different structures, as it's a tree structure, so I am going to break
# the data up
dt[, id := 1:.N]

# To do so I need to know the order of significance of the tree. I believe
# the ranks go in this order:
col_structure <- c("Phylum", "Class", "Order", "Family", "Genus", "Species")

# I want to find out at which level of the tree each row is, so I am going
# to change the shape from wide to long, and then do some row aggregation on
# the single value column, grouped by id
melt_dt <- melt(dt, id.vars = "id", 
                measure.vars = col_structure)
# Tip: prefer a real NA over the string "NA"; they are different values, and
# built-in functions such as is.na() only recognise a real NA
melt_dt[value == "NA", value := NA]
melt_dt <- melt_dt[!is.na(value)]
melt_dt[]
# using a data.table command .N, grouped by id, to find out how many non NA
# values there are, this will tell me where it is in the tree
group_ids <- melt_dt[, .N, by = id]

# Ok, so now I will split up each row in to where it sits in the tree
split_ids <- split(group_ids, group_ids$N)
split_ids
# pull out the number of levels of tree for easy use
levels <- seq_along(split_ids)

# merge back in the original data, so we have the same data at the start, but
# split up in to new sets. Makes it easier to think about the problem
split_dt <- lapply(levels, function(x){
  out <- merge(split_ids[[x]], dt, by = "id")
  N <- as.numeric(names(split_ids)[x])
  # Set keys on the data to make extraction easy. Rather than writing
  # Phylum == "a" & Class == "b" later on, if Phylum and Class are the keys
  # I can use J("a", "b"). See the next stage
  setkeyv(out, col_structure[1:N])
  out
})

# Now add the new value. For each row, look at the next level of the tree and
# subtract that level's matching values from Reads_sum. To follow along, try
# running the body with x = 1.
# The bottom level of the tree is left out here, as I don't know what to do
# with it
split_dt_with_value <- lapply(levels[1:(length(levels)-1)], function(x){
  # similar to for loop, but using data.table keys to extract data
  out <- split_dt[[x]]
  out$Test.1 <- out$Reads_sum - sapply(1:nrow(out), function(i){
    sum(split_dt[[(x+1)]][J(out[i, key(out), with = FALSE])]$Reads_sum,
        na.rm = TRUE)
  })
  out
})

# combine results, and with the bottom tree level
combined <- rbindlist(c(split_dt_with_value,
                        split_dt[max(levels)]), 
                        fill = TRUE)
# Turn it back into a data frame
combined <- as.data.frame(combined)
combined
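
As a possible follow-up, the subtraction and the removal of rows that end up at zero can also be sketched in base R. This is an alternative written for illustration, not part of the code above. It assumes that missing ranks are the string "NA" (or a real NA), that NAs only appear after the last known rank, and that every row has at least a Phylum. The function name subtract_children and the data frame tester are hypothetical.

```r
ranks <- c("Phylum", "Class", "Order", "Family", "Genus", "Species")

# Hypothetical helper: subtract from every row the reads of the rows exactly
# one rank deeper on the same branch, then drop rows left with zero reads
subtract_children <- function(df, ranks, count_col = "Reads_sum") {
  tax <- as.matrix(df[, ranks])
  tax[!is.na(tax) & tax == "NA"] <- NA   # treat the string "NA" as missing
  depth <- rowSums(!is.na(tax))          # number of ranks filled in per row

  # Key built from the first d[i] ranks of row i, e.g. "Chordata|Actinopterygii"
  key_at <- function(d) {
    vapply(seq_len(nrow(tax)),
           function(i) paste(tax[i, seq_len(d[i])], collapse = "|"),
           character(1))
  }
  own_key    <- key_at(depth)
  parent_key <- key_at(pmax(depth - 1, 0))

  counts <- df[[count_col]]
  # Reads already accounted for by the direct children of each row
  below <- vapply(own_key, function(k) sum(counts[parent_key == k]),
                  numeric(1), USE.NAMES = FALSE)

  df[[count_col]] <- counts - below
  df[df[[count_col]] > 0, , drop = FALSE]
}

# Made-up excerpt of Tester covering two branches
tester <- data.frame(
  Phylum    = c("Chordata", "Chordata", "Echinodermata", "Echinodermata", "Echinodermata"),
  Class     = c("Actinopterygii", "Actinopterygii", "Ophiuroidea", "Ophiuroidea", "Ophiuroidea"),
  Order     = c("Myctophiformes", "Myctophiformes", "NA", "Ophiurida", "Ophiurida"),
  Family    = c("Myctophidae", "Myctophidae", "NA", "NA", "Ophiuridae"),
  Genus     = c("Notoscopelus", "Notoscopelus", "NA", "NA", "NA"),
  Species   = c("NA", "Notoscopelus caudispinosus", "NA", "NA", "NA"),
  Reads_sum = c(208, 178, 146, 137, 137),
  stringsAsFactors = FALSE
)

result <- subtract_children(tester, ranks)
result
```

Unlike the data.table code, this keeps the bottom level of the tree with its own Reads_sum, since those rows have nothing below them to subtract. The pairwise key comparison is quadratic in the number of rows, so a keyed join may be preferable on a large dataset.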

Please take a look, and let me know if any step is confusing or any of the logic is incorrect :)

Cheers, Johnny
