Reshaping a huge data.table from wide to long (1,000,000 × 4,000, i.e. about 8 GB)

Question

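I have this 8 GB CSV file on disk. It has one "match" per row.

A "match" includes some data such as id, date, and winner. But it also holds 10 players and all of their data. Those are stored in participants.0.stats.visionScore, participants.1.stats.visionScore, ..., participants.0.stats.assists, ..., participants.9.stats.assists, ... I think you get the pattern. It is just participants.{number}.stats.{variable_name}. Every participant has several hundred stats; that's why I have about 4,000 columns in total.

I read the data like this: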

> d <- fread("Matches.csv")
> head(d)
   participants.1.stats.totalDamageDealt
1:                                118504
2:                                 20934
3:                                 76639
4:                                123932
5:                                160561
6:                                237046
   participants.8.stats.totalDamageTaken participants.9.stats.totalPlayerScore
1:                                 18218                                     0
2:                                 12378                                     0
3:                                 46182                                     0
4:                                 19340                                     0
5:                                 30808                                     0
6:                                 36194                                     0
... [there are thousands of lines I omit here] ...

Of course, I now want a representation of the data in which one row corresponds to one participant. I imagine a result like this:

> [magic]
> head(d)
   participant             stats.totalDamageDealt
1:           1                             118504
2:           2                             190143
3:           3                              46700
4:           4                              60787
5:           5                              78108
6:           6                             124761
                  stats.totalDamageTaken                stats.totalPlayerScore
1:                                 18218                                     0
2:                                 15794                                     0
3:                                 34578                                     0
4:                                 78771                                     0
5:                                 16749                                     0
6:                                 11540                                     0
...

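But all of those methods, such as melt, cast, and reshape, require me to name all the columns by hand. Even with patterns for melt, I end up having to name all the several hundred columns per participant. Is there no way to make this thing long in R?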

Answer 1

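I'm not 100% sure I understand how the data is laid out, but I think I've got it. From the example data, it looks as if participant 1 has multiple rows of totalDamageDealt from the original data, and the result does not need to be aggregated. If that is not the case, different steps may be needed. I had to make up my own sample data to try this out. If you'd like to post a minimal dataset that covers all the cases, that would be helpful.

Otherwise, here is one approach: make the data completely long so that the participant information can be extracted, then widen it again into the format you want. If any aggregation is needed, it would happen in the dcast step, where the data is widened.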

library(data.table)
library(stringr)

# Create example data
dt <- data.table(participant.1.stats.visionScore = c(1,1.1,1.2,1.3,1.4,1.5),
           participant.1.stats.totalDamageDealt = c(7.1,8.1,9.1,10.1,11.1,12.1),
           participant.2.stats.visionScore = c(2,2.1,2.2,2.3,2.4,2.5),
           participant.2.stats.totalDamageDealt = c(7.2,8.2,9.2,10.2,11.2,12.2))

# Make data totally long (not wide at all)
dt <- melt(dt,measure.vars = names(dt))

# Separate participant and stat details into columns
dt[,participant := variable %>% str_extract("(?<=^participant\\.)\\d+")]
dt[,stat := variable %>% str_extract("(?<=\\.stats\\.).+")]

# Remove variable for cleanup
dt[,variable := NULL]

# Create an index to create a unique key in order to be able to dcast without aggregating
dt[,index := 1:.N, by = list(participant,stat)]

# dcast to make the data wide again
dt <- dcast(dt,index + participant ~ stat, value.var = "value")

# Sort to make it easier for a human to view the table
dt <- dt[order(participant)]

#     index participant totalDamageDealt visionScore
# 1:      1           1              7.1         1.0
# 2:      2           1              8.1         1.1
# 3:      3           1              9.1         1.2
# 4:      4           1             10.1         1.3
# 5:      5           1             11.1         1.4
# 6:      6           1             12.1         1.5
# 7:      1           2              7.2         2.0
# 8:      2           2              8.2         2.1
# 9:      3           2              9.2         2.2
# 10:     4           2             10.2         2.3
# 11:     5           2             11.2         2.4
# 12:     6           2             12.2         2.5
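
A possible variation on the same idea, sketched on made-up data: melt() in data.table also accepts a list of column-name vectors for measure.vars, so that list can be built from the column names instead of being typed by hand. This is only a sketch. The toy columns, the id column, and the helper names cols, stat, part, and measure_list are assumptions for illustration, and it assumes every participant has the same set of stats.

```r
library(data.table)

# Toy stand-in for the real table: 3 matches, 2 participants, 2 stats (made-up values)
dt <- data.table(
  id = 1:3,
  participants.0.stats.visionScore      = c(1, 2, 3),
  participants.0.stats.totalDamageDealt = c(10, 20, 30),
  participants.1.stats.visionScore      = c(4, 5, 6),
  participants.1.stats.totalDamageDealt = c(40, 50, 60)
)

# All per-participant columns, sorted so participants appear in the same order for every stat
cols <- sort(grep("^participants\\.\\d+\\.", names(dt), value = TRUE))
stat <- sub("^participants\\.\\d+\\.", "", cols)           # e.g. "stats.visionScore"
part <- sub("^participants\\.(\\d+)\\..*$", "\\1", cols)   # e.g. "0"

# One list element per stat, holding that stat's column for each participant
measure_list <- split(cols, stat)

long <- melt(dt, id.vars = "id", measure.vars = measure_list,
             value.name = names(measure_list), variable.name = "participant")

# melt() numbers the positions inside each list element; map them back to participant ids
participant_ids <- unique(part)
long[, participant := participant_ids[as.integer(participant)]]
```

Each stat keeps its own column and therefore its own type, and no fully long intermediate table is created. On the real data, id.vars would list the match-level columns that should be carried along.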

Answer 2

OK, using the data sample you provided:

library(data.table)

setDT(d) 

d <- melt(d, measure = patterns("^participants"), value.name = "value")

d <- d[,  `:=` (ID = gsub(".*?\\.(\\d+)\\..*","\\1", variable),
                stats = gsub(".*?(stats\\..*)$","\\1", variable))
  ][, .(variable, value, ID, stats)]
d <- dcast(d, ID ~ stats, value.var= "value", fun.aggregate = sum)

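Edit: rewrote this as a data.table-only solution for speed.

Note that your source data has some additional columns, such as participantIdentities.6.player.accountId, that you did not address, so I simply left them out. If they need to be included, you can add them either to patterns or to id.vars in melt.

One caveat: all the values you cast must be numeric, otherwise dcast will fail. I believe this will be a problem with your full dataset. It means you need to either correctly identify columns like participants.1.highestAchievedSeasonTier as id.vars in melt, or exclude them from the dcast.

This results in (I am only pasting the first 4 columns of many):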

  ID participants.4.timeline.xpPerMinDeltas.20-30 stats.goldEarned stats.perk3Var1
1  1                                            0                0               0
2  4                                           NA                0            3475
3  7                                            0                0               0
4  8                                            0                0               0
5  9                                            0           105872               0
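
To illustrate the caveat about non-numeric columns, here is a minimal sketch of one way to separate the numeric participant columns from the rest before melting. The toy table and the names part_cols, is_num, num_cols, and other_cols are made up for illustration; the real dataset may need a different rule.

```r
library(data.table)

# Toy data: one numeric and one character column per participant (made-up values)
d <- data.table(
  id = 1:2,
  participants.0.stats.kills               = c(3, 5),
  participants.0.highestAchievedSeasonTier = c("GOLD", "SILVER"),
  participants.1.stats.kills               = c(7, 1),
  participants.1.highestAchievedSeasonTier = c("BRONZE", "GOLD")
)

part_cols <- grep("^participants\\.", names(d), value = TRUE)

# TRUE for the participant columns that hold numeric data
is_num <- vapply(part_cols, function(col) is.numeric(d[[col]]), logical(1))
num_cols   <- part_cols[is_num]
other_cols <- part_cols[!is_num]   # handle these separately or treat them as id.vars

# Melt only the numeric columns; columns in neither id.vars nor measure.vars are dropped
long_num <- melt(d, id.vars = "id", measure.vars = num_cols)
```

With the non-numeric columns listed in other_cols, they can either be passed as id.vars or reshaped in a separate step.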

Answer 3

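I found an answer that works very efficiently even for this huge amount of data. In fact, I think it is about as efficient as it gets in R for this scenario: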

cn <- names(d)
# Per-participant columns, sorted so each participant's stats sit together in the same order
pc <- sort(grep("^participants\\.", cn, value = TRUE))
# Stat names with the 15-character "participants.N." prefix stripped
ppcn <- substring(pc[seq_len(length(pc) / 10)], 16)
d_long <- reshape(d, direction = 'long', varying = pc, timevar = 'participant',
                  times = paste0('participants.', 0:9), v.names = ppcn)

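The idea behind it is to use a few extra lines of code to build the arguments for the reshape function, so that R knows which columns I am actually talking about.

With this solution I can (1) create a long d in a single step (no pun intended), without a potentially large temporary table, and (2) avoid any type conversion, so columns of all types are included.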
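
As a small worked illustration of building the reshape arguments from the column names, here is a sketch on a made-up data.frame with two participants and two stats of different types. The column names, n_part, and stat_names are assumptions for the example only.

```r
# Toy data: 3 matches, 2 participants, one numeric and one character stat each
d <- data.frame(
  id = 1:3,
  participants.0.kills = c(1, 2, 3),
  participants.0.role  = c("a", "b", "c"),
  participants.1.kills = c(4, 5, 6),
  participants.1.role  = c("d", "e", "f"),
  stringsAsFactors = FALSE
)
n_part <- 2

# Sorted participant columns: all stats of participant 0 first, then participant 1,
# which is the order reshape() expects when `varying` is a plain vector
pc <- sort(grep("^participants\\.", names(d), value = TRUE))

# Stat names taken from the first participant's block, prefix removed
stat_names <- sub("^participants\\.\\d+\\.", "", pc[seq_len(length(pc) / n_part)])

d_long <- reshape(d, direction = "long", varying = pc, v.names = stat_names,
                  timevar = "participant",
                  times = paste0("participants.", seq_len(n_part) - 1),
                  idvar = "id")
```

In the result, kills stays numeric and role stays character, which matches the point about there being no type conversion.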
