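I followed Hadley's thread, Issue in Loading multiple .csv files into single dataframe in R using rbind, to read multiple CSV files and then combine them into a single data frame. I also experimented with lapply vs. sapply, as discussed in Grouping functions (tapply, by, aggregate) and the *apply family.
Here is my first CSV file: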
dput(File1)
structure(list(First.Name = structure(c(1L, 2L, 1L, 1L, 1L), .Label = c("A",
"C"), class = "factor"), Last.Name = structure(c(1L, 2L, 2L,
2L, 2L), .Label = c("B", "D"), class = "factor"), Income = c(55L,
23L, 34L, 45L, 44L), Tax = c(23L, 21L, 22L, 24L, 25L), Location = structure(c(3L,
3L, 1L, 4L, 2L), .Label = c("Americas", "AP", "EMEA", "LATAM"
), class = "factor")), .Names = c("First.Name", "Last.Name",
"Income", "Tax", "Location"), class = "data.frame", row.names = c(NA,
-5L))
Here is my second CSV file:
dput(File2)
structure(list(First.Name = structure(c(1L, 2L, 1L, 1L, 1L), .Label = c("A",
"C"), class = "factor"), Last.Name = structure(c(1L, 2L, 2L,
2L, 2L), .Label = c("B", "D"), class = "factor"), Income = c(55L,
55L, 55L, 55L, 55L), Tax = c(24L, 24L, 24L, 24L, 24L), Location = structure(c(3L,
3L, 1L, 4L, 2L), .Label = c("Americas", "AP", "EMEA", "LATAM"
), class = "factor")), .Names = c("First.Name", "Last.Name",
"Income", "Tax", "Location"), class = "data.frame", row.names = c(NA,
-5L))
Here is my code:
dat1 <-",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,23,EMEA\n2,C,D,23,21,EMEA\n3,A,D,34,22,Americas\n4,A,D,45,24,LATAM\n5,A,D,44,25,AP"
dat2 <-",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,24,EMEA\n2,C,D,55,24,EMEA\n3,A,D,55,24,Americas\n4,A,D,55,24,LATAM\n5,A,D,55,24,AP"
tc1 <- textConnection(dat1)
tc2 <- textConnection(dat2)
merged_file <- do.call(rbind, lapply(list(tc1,tc2), read.csv))
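While this works well, I wanted to change lapply to sapply. From the thread above, I understand that sapply will turn the factors read from the CSV files into a matrix, but I am not sure why the fields are getting flipped. For example, the Income field occupies row 3 and row 8 instead of sitting in one column.
Here is the code: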
tc1 <- textConnection(dat1)
tc2 <- textConnection(dat2)
# change lapply to sapply
merged_file <- do.call(rbind, sapply(list(tc1,tc2), read.csv))
Here is the output:
      [,1] [,2] [,3] [,4] [,5]
 [1,]    1    2    1    1    1
 [2,]    1    2    2    2    2
 [3,]   55   23   34   45   44
 [4,]   23   21   22   24   25
 [5,]    3    3    1    4    2
 [6,]    1    2    1    1    1
 [7,]    1    2    2    2    2
 [8,]   55   55   55   55   55
 [9,]   24   24   24   24   24
[10,]    3    3    1    4    2
I'd appreciate any help. I am fairly new to R and not sure what is going on.
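This issue has nothing to do with factors; it is the generic sapply vs. lapply difference. Why does sapply get it wrong while lapply gets it right? Remember that in R a data frame is a list of columns, and each column can have a different type.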
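To make the "list of columns" point concrete, here is a minimal sketch. The data frame df and its values are made up for illustration and are not taken from the question's files.

```r
# A small, hypothetical data frame with columns of different types.
df <- data.frame(
  name   = c("A", "C"),
  income = c(55L, 23L),
  stringsAsFactors = TRUE  # keep strings as factors, as in the question
)

is.list(df)        # a data frame is a list under the hood
length(df)         # one list element per column
sapply(df, class)  # each column keeps its own class
```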
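lapply returns a list of data frames to rbind, which concatenates them correctly. It keeps corresponding columns together, so your factors come out right.
sapply, on the other hand, tries to simplify what it gets back. It cannot build a data frame, so it unpacks your two 5x6 input data frames and arranges their columns as the cells of a matrix, one cell per column, and the data frame structure is lost. Worse still, the result is effectively transposed: when do.call hands that matrix to rbind, each of the 12 columns arrives as a separate vector and becomes a row. rbind also discards classes, so the factors are reduced to their integer codes (garbage!), because a matrix, unlike a data frame, can only hold one type.
The result is one very garbage 12x5 numeric matrix. Since the columns have been turned into rows, Income ends up spread across rows instead of sitting in one column, and obviously your factors get messed up. Summary: just use lapply.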
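Inspecting the intermediate objects is a quick way to see the difference for yourself. The sketch below reuses the dat1 and dat2 strings from the question. The helper read_one is a hypothetical name, and stringsAsFactors = TRUE is an assumption added so that newer R versions (4.0 and later, where strings are no longer read as factors by default) behave like the version in the question.

```r
dat1 <- ",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,23,EMEA\n2,C,D,23,21,EMEA\n3,A,D,34,22,Americas\n4,A,D,45,24,LATAM\n5,A,D,44,25,AP"
dat2 <- ",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,24,EMEA\n2,C,D,55,24,EMEA\n3,A,D,55,24,Americas\n4,A,D,55,24,LATAM\n5,A,D,55,24,AP"

# Hypothetical helper: parse one CSV string into a data frame.
read_one <- function(txt) read.csv(text = txt, stringsAsFactors = TRUE)

# lapply keeps each result as a data frame inside a plain list.
as_list <- lapply(list(dat1, dat2), read_one)
class(as_list[[1]])

# sapply simplifies: the data frames are unpacked into a matrix of
# mode list, one cell per original column.
simplified <- sapply(list(dat1, dat2), read_one)
dim(simplified)
is.list(simplified)

good <- do.call(rbind, as_list)     # rows stacked, columns kept together
bad  <- do.call(rbind, simplified)  # every original column becomes a row

dim(good)
dim(bad)
str(good)
```

Comparing dim(good) with dim(bad), and looking at str(good), shows that only the lapply version still has one column per field with the factor columns intact.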