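I want to take a data frame in R and expand it based on what I see in two columns, V1 and V2. In short, I have stages S1-S6, which are stored as strings.
For every row where there is a gap between the stages, I need to add rows. Looking at the data frames below: if I see 'S 3' and 'S 3' in the same row, I don't need to do anything. Likewise, if I see 'S 3' and 'S 4' in the same row, I don't need to do anything either.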
Input:
----------------------------------
|Var1 | V1 | V2 |
----------------------------------
|0060a00000fUbAnAAK |'S 2' |'S 5'|
----------------------------------
Output:
----------------------------------
|Var1 | V1 | V2 |
----------------------------------
|0060a00000fUbAnAAK |'S 2' |'S 3'|
----------------------------------
|0060a00000fUbAnAAK |'S 3' |'S 4'|
----------------------------------
|0060a00000fUbAnAAK |'S 4' |'S 5'|
----------------------------------
Input:
----------------------------------
|Var1 | V1 | V2 |
----------------------------------
|0060a00000fUbAnAAK |'S 5' |'S 3'|
----------------------------------
Output:
----------------------------------
|Var1 | V1 | V2 |
----------------------------------
|0060a00000fUbAnAAK |'S 5' |'S 4'|
----------------------------------
|0060a00000fUbAnAAK |'S 4' |'S 3'|
----------------------------------
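Using `tidyverse`, the idea is to convert to long format, separate the number from the `S`, and complete the sequence. Once we have that, we paste the two columns back together (the `S` and the values) and convert back to wide format. Finally, we take the lag of `V1` and remove the `NA`s, i.e.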
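library(tidyverse)

df %>%
  # long format: one row per V1/V2 value
  gather(var, val, -1) %>%
  # split "S 2" into "S" and "2"
  separate(val, into = c('char', 'number'), sep = ' ') %>%
  mutate(number = as.numeric(number)) %>%
  # fill in every stage between the smallest and largest number
  complete(nesting(var, Var1, char), number = full_seq(min(number):max(number), 1)) %>%
  # rebuild the "S 2", "S 3", ... labels
  unite('V1_2', c('char', 'number'), sep = ' ') %>%
  group_by(var) %>%
  mutate(new = row_number()) %>%
  # back to wide format
  spread(var, V1_2) %>%
  # pair each stage with the one before it
  mutate(V1 = lag(V1)) %>%
  na.omit() %>%
  select(-new)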
which gives,
# A tibble: 3 x 3
  Var1  V1    V2
  <chr> <chr> <chr>
1 xxx   S 2   S 3
2 xxx   S 3   S 4
3 xxx   S 4   S 5
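The last two steps of that pipeline (lag `V1`, then drop the `NA`s) can be hard to picture, so here is a minimal base R sketch of just that idea. It assumes the sequence has already been completed to stages 2 through 5. The names `stages` and `pairs` are made up for illustration, and `c(NA, head(stages, -1))` is only a stand-in for what `lag()` does in the answer.

```r
# Completed sequence of stage labels, assumed here to run from 2 to 5
stages <- paste("S", 2:5)

# V2 is the sequence itself; V1 is the same sequence shifted down by one
# position, with NA in the first slot (the effect of lagging by one)
V2 <- stages
V1 <- c(NA, head(stages, -1))

# Dropping the row that contains NA leaves one row per consecutive pair
pairs <- na.omit(data.frame(V1, V2, stringsAsFactors = FALSE))
print(pairs)
```

Each remaining row pairs a stage with the one that follows it, which is the shape the question asks for.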
This update also takes descending stages into account.
**sample data**
library(data.table)
DT <- fread("Var1 | V1 | V2
0060a00000fUbAnAAK |S 2 |S 5
0060a00000fUbAnAAK_ |S 5 |S 3")
# Var1 V1 V2
# 1: 0060a00000fUbAnAAK S 2 S 5
# 2: 0060a00000fUbAnAAK_ S 5 S 3
**code**
#determine order of stages
DT[ as.numeric( gsub("[^0-9]", "", V2 ) ) < as.numeric( gsub("[^0-9]", "", V1 ) ), order := "desc" ]
DT[ is.na( order) , order := "asc" ]
#melt DT to long format
DT <- melt( DT, id.vars = c("Var1","order"), value.name = "stage")
#get stage as numeric and clean up unwanted columns
DT[, `:=`(stage = as.numeric( gsub("[^0-9]", "", stage)))]
#create new stages based on minimum and maximum stage per Var1-value
#use different methods for ascending and descending stages, then bind the rows together
rbind(
  DT[order == "asc", .( V1 = paste0( "S ", min(stage):(max(stage) - 1) ),
                        V2 = paste0( "S ", (min(stage) + 1):max(stage) ) ), by = .(Var1)],
  DT[order == "desc", .( V1 = paste0( "S ", max(stage):(min(stage) + 1) ),
                         V2 = paste0( "S ", (max(stage) - 1):min(stage) ) ), by = .(Var1)]
)
**output**
# Var1 V1 V2
# 1: 0060a00000fUbAnAAK S 2 S 3
# 2: 0060a00000fUbAnAAK S 3 S 4
# 3: 0060a00000fUbAnAAK S 4 S 5
# 4: 0060a00000fUbAnAAK_ S 5 S 4
# 5: 0060a00000fUbAnAAK_ S 4 S 3
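One case the code above does not seem to cover is a row where V1 and V2 are the same stage: there `min(stage):(max(stage) - 1)` counts backwards (for example `3:2`) and would produce extra rows, whereas the question says such a row should be left alone. Below is a rough base R sketch that handles ascending, descending, and unchanged rows in one pass. `expand_stages()` is a hypothetical helper written for this illustration, not a function from `data.table` or any other package, and the sample IDs `a`, `b`, `c` are made up.

```r
# Hypothetical helper: expand a single V1/V2 pair into consecutive steps
expand_stages <- function(id, v1, v2) {
  from <- as.numeric(gsub("[^0-9]", "", v1))
  to   <- as.numeric(gsub("[^0-9]", "", v2))
  # Same or adjacent stage: nothing to fill in, keep the row unchanged
  if (abs(to - from) <= 1) {
    return(data.frame(Var1 = id, V1 = v1, V2 = v2, stringsAsFactors = FALSE))
  }
  stages <- seq(from, to)  # seq() counts down when from > to
  data.frame(
    Var1 = id,
    V1 = paste("S", head(stages, -1)),  # every stage except the last
    V2 = paste("S", tail(stages, -1)),  # every stage except the first
    stringsAsFactors = FALSE
  )
}

# Made-up sample: one ascending, one descending, one unchanged row
df <- data.frame(
  Var1 = c("a", "b", "c"),
  V1 = c("S 2", "S 5", "S 3"),
  V2 = c("S 5", "S 3", "S 3"),
  stringsAsFactors = FALSE
)

# Apply the helper row by row and stack the pieces
result <- do.call(rbind, Map(expand_stages, df$Var1, df$V1, df$V2))
rownames(result) <- NULL
print(result)
```

Because this works row by row rather than grouping by `Var1`, it should also cope with an ID that appears on more than one row, at the cost of being slower than the grouped `data.table` approach on large inputs.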
`data.table` solution
**sample data**
library(data.table)
DT <- fread("Var1 | V1 | V2
0060a00000fUbAnAAK |S 2 |S 5")
**code**
#melt DT to long format
DT <- melt( DT, id.vars = "Var1", value.name = "stage")
#get stage as numeric and clean up unwanted columns
DT[, `:=`(variable = NULL, stage = as.numeric( gsub("[^0-9]", "", stage)))]
#create new stages based on minimum and maximum stage per Var1-value
DT[, .( V1 = paste0( "S ", min(stage):(max(stage) - 1) ),
        V2 = paste0( "S ", (min(stage) + 1):max(stage) ) ), by = .(Var1)][]
**output**
# Var1 V1 V2
# 1: 0060a00000fUbAnAAK S 2 S 3
# 2: 0060a00000fUbAnAAK S 3 S 4
# 3: 0060a00000fUbAnAAK S 4 S 5