First of all, this is NOT the problem of calculating the Euclidean distance between two matrices.
Suppose I have two matrices x and y, for example:
set.seed(1)
x <- matrix(rnorm(15), ncol=5)
y <- matrix(rnorm(20), ncol=5)
where
> x
[,1] [,2] [,3] [,4] [,5]
[1,] -0.6264538 1.5952808 0.4874291 -0.3053884 -0.6212406
[2,] 0.1836433 0.3295078 0.7383247 1.5117812 -2.2146999
[3,] -0.8356286 -0.8204684 0.5757814 0.3898432 1.1249309
> y
[,1] [,2] [,3] [,4] [,5]
[1,] -0.04493361 0.59390132 -1.98935170 -1.4707524 -0.10278773
[2,] -0.01619026 0.91897737 0.61982575 -0.4781501 0.38767161
[3,] 0.94383621 0.78213630 -0.05612874 0.4179416 -0.05380504
[4,] 0.82122120 0.07456498 -0.15579551 1.3586796 -1.37705956
Then I want to obtain a 3-by-4 distance matrix distmat, where the element distmat[i,j] is the value of norm(x[i,]-y[j,],"2") or, equivalently, dist(rbind(x[i,],y[j,])).
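To pin down the definition, here is a minimal reference sketch using a plain double loop. The helper name pairwise_dist_ref is hypothetical (it does not come from any package), and it is meant only as a slow but easy-to-check baseline for validating faster versions.

```r
set.seed(1)
x <- matrix(rnorm(15), ncol = 5)
y <- matrix(rnorm(20), ncol = 5)

# Hypothetical helper: distance from every row of x to every row of y.
pairwise_dist_ref <- function(x, y) {
  stopifnot(ncol(x) == ncol(y))
  out <- matrix(NA_real_, nrow = nrow(x), ncol = nrow(y))
  for (i in seq_len(nrow(x))) {
    for (j in seq_len(nrow(y))) {
      # Euclidean distance between row i of x and row j of y
      out[i, j] <- sqrt(sum((x[i, ] - y[j, ])^2))
    }
  }
  out
}

distmat_ref <- pairwise_dist_ref(x, y)
```

Any faster approach can then be compared against distmat_ref with all.equal().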
My code is as follows:
distmat <- as.matrix(unname(unstack(within(idx<-expand.grid(seq(nrow(x)),seq(nrow(y))), d <-sqrt(rowSums((x[Var1,]-y[Var2,])**2))), d~Var2)))
which gives
> distmat
[,1] [,2] [,3] [,4]
[1,] 3.016991 1.376622 2.065831 2.857002
[2,] 4.573625 3.336707 2.698124 1.412811
[3,] 3.764925 2.235186 2.743056 3.358577
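However, when x and y have a large number of rows, I don't think my code is elegant or efficient enough.
I am looking for faster and more elegant base R code that achieves this. Thanks in advance!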
Benchmark template
For convenience, you can use the following as a benchmark to check whether your code is faster:
set.seed(1)
x <- matrix(rnorm(15000), ncol=5)
y <- matrix(rnorm(20000), ncol=5)
# my customized approach
method_ThomasIsCoding <- function() {
as.matrix(unname(unstack(within(idx<-expand.grid(seq(nrow(x)),seq(nrow(y))), d <-sqrt(rowSums((x[Var1,]-y[Var2,])**2))), d~Var2)))
}
# your approach
method_XXX <- function() {
# fill with your approach
}
microbenchmark::microbenchmark(
method_ThomasIsCoding(),
method_XXX(),
unit = "relative",
check = "equivalent",
times = 10
)
The proxy package provides a dist() function that accepts two matrices.
library(proxy)
dist(x, y)
[,1] [,2] [,3] [,4]
[1,] 3.016991 1.376622 2.065831 2.857002
[2,] 4.573625 3.336707 2.698124 1.412811
[3,] 3.764925 2.235186 2.743056 3.358577
Solution (elegant, but roughly 5 times slower according to the benchmark below)
euclidean_distance <- function(p,q){
sqrt(sum((p - q)^2))
}
distmat = outer(
as.data.frame(t(x)),
as.data.frame(t(y)),
Vectorize(euclidean_distance)
)
Output:
> distmat
V1 V2 V3 V4
V1 3.016991 1.376622 2.065831 2.857002
V2 4.573625 3.336707 2.698124 1.412811
V3 3.764925 2.235186 2.743056 3.358577
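As far as I can tell, outer() combined with Vectorize() ends up calling euclidean_distance once for every pair of rows, which keeps the code readable but does a lot of small function calls. A possible variant, sketched here under the assumption that x and y are numeric matrices with the same number of columns, loops only over the rows of y and handles all rows of x at once with sweep() and rowSums(). The name dist_by_row is made up for illustration.

```r
set.seed(1)
x <- matrix(rnorm(15), ncol = 5)
y <- matrix(rnorm(20), ncol = 5)

# Hypothetical helper: one vectorized pass over x per row of y.
dist_by_row <- function(x, y) {
  out <- vapply(
    seq_len(nrow(y)),
    # sweep() subtracts y[j, ] from every row of x (its default FUN is "-")
    function(j) sqrt(rowSums(sweep(x, 2, y[j, ])^2)),
    numeric(nrow(x))
  )
  # vapply() returns a plain vector when nrow(x) is 1, so restore the shape
  matrix(out, nrow = nrow(x), ncol = nrow(y))
}

distmat2 <- dist_by_row(x, y)
```

The result is an unnamed nrow(x)-by-nrow(y) matrix, so it should compare cleanly against the question's distmat.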
Benchmark:
set.seed(1)
x <- matrix(rnorm(1500), ncol=5)
y <- matrix(rnorm(2000), ncol=5)
# my customized approach
method_ThomasIsCoding <- function() {
as.matrix(unname(unstack(within(idx<-expand.grid(seq(nrow(x)),seq(nrow(y))), d <-sqrt(rowSums((x[Var1,]-y[Var2,])**2))), d~Var2)))
}
# your approach
method_Jet <- function() {
# fill with your approach
outer(as.data.frame(t(x)),as.data.frame(t(y)),Vectorize(euclidean_distance))
}
microbenchmark::microbenchmark(
method_ThomasIsCoding(),
method_Jet(),
unit = "relative",
check = "equivalent",
times = 1
)
Output:
expr time
1 method_ThomasIsCoding() 68785152
2 method_Jet() 368550933
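Given the timings above, where the outer()/Vectorize() version takes roughly five times as long as the original one-liner, a fully vectorized base R formulation may be worth trying. The sketch below relies on the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, so all pairwise squared distances come from a single tcrossprod() call. The name method_crossprod is hypothetical, and I have not benchmarked it here. Rounding error can make very small squared distances slightly negative, which is why the result is clamped with pmax() before taking the square root.

```r
set.seed(1)
x <- matrix(rnorm(1500), ncol = 5)
y <- matrix(rnorm(2000), ncol = 5)

# Hypothetical alternative: expand the squared distance algebraically.
method_crossprod <- function() {
  # outer() builds the |x_i|^2 + |y_j|^2 part, tcrossprod(x, y) is x %*% t(y)
  sq <- outer(rowSums(x^2), rowSums(y^2), "+") - 2 * tcrossprod(x, y)
  # clamp tiny negative values caused by floating-point cancellation
  sqrt(pmax(sq, 0))
}

distmat_fast <- method_crossprod()
```

It can be dropped into the benchmark template in place of method_XXX. One trade-off to keep in mind: this formulation can lose some precision for pairs of rows that are nearly identical.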