在曲线上找到最佳权衡点

问题描述 投票:46回答:11

假设我有一些数据,我想在其上安装一个参数化模型。我的目标是为此模型参数找到最佳值。

我正在使用AIC / BIC / MDL类型的标准进行模型选择,该模型奖励具有低误差的模型,但也惩罚具有高复杂度的模型(我们正在寻找对这些数据的最简单但最有说服力的解释,可以这么说,la Occam's razor )。

按照上面的说明,这是我得到的三种不同标准的例子(两个要最小化,一个要最大化):

在视觉上你可以很容易地看到肘部形状,你会在该区域的某处选择一个参数值。问题是我正在为大量实验做这件事,我需要一种方法来找到这个值而不需要干预。

我的第一个直觉是尝试从角落以45度角绘制一条直线并继续移动它直到它与曲线相交,但这说起来容易做起来:)如果曲线有些偏斜,它也会错过感兴趣的区域。

关于如何实现这个或更好的想法的任何想法?

以下是重现上述一个图表所需的样本:

curve = [8.4663 8.3457 5.4507 5.3275 4.8305 4.7895 4.6889 4.6833 4.6819 4.6542 4.6501 4.6287 4.6162 4.585 4.5535 4.5134 4.474 4.4089 4.3797 4.3494 4.3268 4.3218 4.3206 4.3206 4.3203 4.2975 4.2864 4.2821 4.2544 4.2288 4.2281 4.2265 4.2226 4.2206 4.2146 4.2144 4.2114 4.1923 4.19 4.1894 4.1785 4.178 4.1694 4.1694 4.1694 4.1556 4.1498 4.1498 4.1357 4.1222 4.1222 4.1217 4.1192 4.1178 4.1139 4.1135 4.1125 4.1035 4.1025 4.1023 4.0971 4.0969 4.0915 4.0915 4.0914 4.0836 4.0804 4.0803 4.0722 4.065 4.065 4.0649 4.0644 4.0637 4.0616 4.0616 4.061 4.0572 4.0563 4.056 4.0545 4.0545 4.0522 4.0519 4.0514 4.0484 4.0467 4.0463 4.0422 4.0392 4.0388 4.0385 4.0385 4.0383 4.038 4.0379 4.0375 4.0364 4.0353 4.0344];
plot(1:100, curve)

编辑

我接受了Jonas给出的解决方案。基本上,对于曲线上的每个点p,我们找到具有最大距离d的那个:

algorithm matlab data-modeling model-fitting
11个回答
40
投票

找到弯头的一种快速方法是从曲线的第一个点到最后一个点绘制一条直线,然后找到离该直线最远的数据点。

这当然在某种程度上取决于线条平坦部分的点数,但如果每次测试相同数量的参数,它应该可以合理地确定。

curve = [8.4663 8.3457 5.4507 5.3275 4.8305 4.7895 4.6889 4.6833 4.6819 4.6542 4.6501 4.6287 4.6162 4.585 4.5535 4.5134 4.474 4.4089 4.3797 4.3494 4.3268 4.3218 4.3206 4.3206 4.3203 4.2975 4.2864 4.2821 4.2544 4.2288 4.2281 4.2265 4.2226 4.2206 4.2146 4.2144 4.2114 4.1923 4.19 4.1894 4.1785 4.178 4.1694 4.1694 4.1694 4.1556 4.1498 4.1498 4.1357 4.1222 4.1222 4.1217 4.1192 4.1178 4.1139 4.1135 4.1125 4.1035 4.1025 4.1023 4.0971 4.0969 4.0915 4.0915 4.0914 4.0836 4.0804 4.0803 4.0722 4.065 4.065 4.0649 4.0644 4.0637 4.0616 4.0616 4.061 4.0572 4.0563 4.056 4.0545 4.0545 4.0522 4.0519 4.0514 4.0484 4.0467 4.0463 4.0422 4.0392 4.0388 4.0385 4.0385 4.0383 4.038 4.0379 4.0375 4.0364 4.0353 4.0344];

%# get coordinates of all the points
nPoints = length(curve);
allCoord = [1:nPoints;curve]';              %'# SO formatting

%# pull out first point
firstPoint = allCoord(1,:);

%# get vector between first and last point - this is the line
lineVec = allCoord(end,:) - firstPoint;

%# normalize the line vector
lineVecN = lineVec / sqrt(sum(lineVec.^2));

%# find the distance from each point to the line:
%# vector between all points and first point
vecFromFirst = bsxfun(@minus, allCoord, firstPoint);

%# To calculate the distance to the line, we split vecFromFirst into two 
%# components, one that is parallel to the line and one that is perpendicular 
%# Then, we take the norm of the part that is perpendicular to the line and 
%# get the distance.
%# We find the vector parallel to the line by projecting vecFromFirst onto 
%# the line. The perpendicular vector is vecFromFirst - vecFromFirstParallel
%# We project vecFromFirst by taking the scalar product of the vector with 
%# the unit vector that points in the direction of the line (this gives us 
%# the length of the projection of vecFromFirst onto the line). If we 
%# multiply the scalar product by the unit vector, we have vecFromFirstParallel
scalarProduct = dot(vecFromFirst, repmat(lineVecN,nPoints,1), 2);
vecFromFirstParallel = scalarProduct * lineVecN;
vecToLine = vecFromFirst - vecFromFirstParallel;

%# distance to line is the norm of vecToLine
distToLine = sqrt(sum(vecToLine.^2,2));

%# plot the distance to the line
figure('Name','distance from curve to line'), plot(distToLine)

%# now all you need is to find the maximum
[maxDist,idxOfBestPoint] = max(distToLine);

%# plot
figure, plot(curve)
hold on
plot(allCoord(idxOfBestPoint,1), allCoord(idxOfBestPoint,2), 'or')

0
投票

如果你愿意,我已将它翻译为R作为我自己的练习(原谅我的非优化编码风格)。 *应用它来找到k-means上的最佳簇数 - 工作得非常好。

elbow.point = function(x){
elbow.curve = c(x)
nPoints = length(elbow.curve);
allCoord = cbind(c(1:nPoints),c(elbow.curve))
# pull out first point
firstPoint = allCoord[1,]
# get vector between first and last point - this is the line
lineVec = allCoord[nPoints,] - firstPoint;
# normalize the line vector
lineVecN = lineVec / sqrt(sum(lineVec^2));
# find the distance from each point to the line:
# vector between all points and first point
vecFromFirst = lapply(c(1:nPoints), function(x){
  allCoord[x,] - firstPoint
})
vecFromFirst = do.call(rbind, vecFromFirst)
rep.row<-function(x,n){
  matrix(rep(x,each=n),nrow=n)
}
scalarProduct = matrix(nrow = nPoints, ncol = 2)
scalarProduct[,1] = vecFromFirst[,1] * rep.row(lineVecN,nPoints)[,1]
scalarProduct[,2] = vecFromFirst[,2] * rep.row(lineVecN,nPoints)[,2]
scalarProduct = as.matrix(rowSums(scalarProduct))
vecFromFirstParallel = matrix(nrow = nPoints, ncol = 2)
vecFromFirstParallel[,1] = scalarProduct * lineVecN[1]
vecFromFirstParallel[,2] = scalarProduct * lineVecN[2]
vecToLine = lapply(c(1:nPoints), function(x){
  vecFromFirst[x,] - vecFromFirstParallel[x,]
})
vecToLine = do.call(rbind, vecToLine)
# distance to line is the norm of vecToLine
distToLine = as.matrix(sqrt(rowSums(vecToLine^2)))
##
which.max(distToLine)
}

函数的输入x应该是带有值的列表/向量


0
投票

不要忽视模型选择的k折交叉验证,这是AIC / BIC的绝佳替代品。还要考虑您正在建模的基本情况,并允许您使用领域知识来帮助选择模型。


18
投票

如果有人需要上面由Jonas发布的Matlab代码的Python版本。

import numpy as np
curve = [8.4663, 8.3457, 5.4507, 5.3275, 4.8305, 4.7895, 4.6889, 4.6833, 4.6819, 4.6542, 4.6501, 4.6287, 4.6162, 4.585, 4.5535, 4.5134, 4.474, 4.4089, 4.3797, 4.3494, 4.3268, 4.3218, 4.3206, 4.3206, 4.3203, 4.2975, 4.2864, 4.2821, 4.2544, 4.2288, 4.2281, 4.2265, 4.2226, 4.2206, 4.2146, 4.2144, 4.2114, 4.1923, 4.19, 4.1894, 4.1785, 4.178, 4.1694, 4.1694, 4.1694, 4.1556, 4.1498, 4.1498, 4.1357, 4.1222, 4.1222, 4.1217, 4.1192, 4.1178, 4.1139, 4.1135, 4.1125, 4.1035, 4.1025, 4.1023, 4.0971, 4.0969, 4.0915, 4.0915, 4.0914, 4.0836, 4.0804, 4.0803, 4.0722, 4.065, 4.065, 4.0649, 4.0644, 4.0637, 4.0616, 4.0616, 4.061, 4.0572, 4.0563, 4.056, 4.0545, 4.0545, 4.0522, 4.0519, 4.0514, 4.0484, 4.0467, 4.0463, 4.0422, 4.0392, 4.0388, 4.0385, 4.0385, 4.0383, 4.038, 4.0379, 4.0375, 4.0364, 4.0353, 4.0344]
nPoints = len(curve)
allCoord = np.vstack((range(nPoints), curve)).T
np.array([range(nPoints), curve])
firstPoint = allCoord[0]
lineVec = allCoord[-1] - allCoord[0]
lineVecNorm = lineVec / np.sqrt(np.sum(lineVec**2))
vecFromFirst = allCoord - firstPoint
scalarProduct = np.sum(vecFromFirst * np.matlib.repmat(lineVecNorm, nPoints, 1), axis=1)
vecFromFirstParallel = np.outer(scalarProduct, lineVecNorm)
vecToLine = vecFromFirst - vecFromFirstParallel
distToLine = np.sqrt(np.sum(vecToLine ** 2, axis=1))
idxOfBestPoint = np.argmax(distToLine)

8
投票

信息理论模型选择的要点是它已经考虑了参数的数量。因此,没有必要找到肘部,只需要找到最小的肘部。

只有在使用拟合时才能找到曲线的肘部。即使这样,您选择弯头的方法在某种意义上也会对参数的数量设置一个惩罚。要选择肘部,您需要最小化从原点到曲线的距离。距离计算中两个维度的相对权重将产生固有的惩罚项。信息理论标准基于参数的数量和用于估计模型的数据样本的数量来设置该度量。

底线建议:使用BIC并采取最低限度。


7
投票

首先,快速计算回顾:每个图的一阶导数f'表示绘制函数f的速率正在变化。二阶导数f''代表f'变化的速率。如果f''很小,则意味着图表以适度的速度改变方向。但如果f''很大,则意味着图表正在迅速改变方向。

您希望隔离f''在图域中最大的点。这些将是您的最佳模型选择的候选点。你选择哪一点必须取决于你,因为你还没有明确指出你对健身与复杂性的重视程度。


5
投票

所以解决这个问题的一种方法是两条适合你手肘L的两条线。但由于曲线的一部分只有几个点(正如我在评论中所提到的),除非你检测到哪些点间隔并在它们之间进行插值以制造更均匀的系列,然后使用RANSAC,否则线条拟合会受到影响。找到两条适合L的线 - 有点复杂但并非不可能。

所以这里有一个更简单的解决方案 - 由于MATLAB的缩放(显然),你提出的图表看起来就像它们一样。所以我所做的就是使用比例信息最小化从“原点”到您的点的距离。

请注意:原点估计可以大大改善,但我会留给你。

这是代码:

%% Order
curve = [8.4663 8.3457 5.4507 5.3275 4.8305 4.7895 4.6889 4.6833 4.6819 4.6542 4.6501 4.6287 4.6162 4.585 4.5535 4.5134 4.474 4.4089 4.3797 4.3494 4.3268 4.3218 4.3206 4.3206 4.3203 4.2975 4.2864 4.2821 4.2544 4.2288 4.2281 4.2265 4.2226 4.2206 4.2146 4.2144 4.2114 4.1923 4.19 4.1894 4.1785 4.178 4.1694 4.1694 4.1694 4.1556 4.1498 4.1498 4.1357 4.1222 4.1222 4.1217 4.1192 4.1178 4.1139 4.1135 4.1125 4.1035 4.1025 4.1023 4.0971 4.0969 4.0915 4.0915 4.0914 4.0836 4.0804 4.0803 4.0722 4.065 4.065 4.0649 4.0644 4.0637 4.0616 4.0616 4.061 4.0572 4.0563 4.056 4.0545 4.0545 4.0522 4.0519 4.0514 4.0484 4.0467 4.0463 4.0422 4.0392 4.0388 4.0385 4.0385 4.0383 4.038 4.0379 4.0375 4.0364 4.0353 4.0344];
x_axis = 1:numel(curve);
points = [x_axis ; curve ]'; %' - SO formatting

%% Get the scaling info
f = figure(1);
plot(points(:,1),points(:,2));
ticks = get(get(f,'CurrentAxes'),'YTickLabel');
ticks = str2num(ticks);
aspect = get(get(f,'CurrentAxes'),'DataAspectRatio');
aspect = [aspect(2) aspect(1)];    
close(f);   

%% Get the "origin"
O = [x_axis(1) ticks(1)];

%% Scale the data - now the scaled values look like MATLAB''s idea of
% what a good plot should look like
scaled_O = O.*aspect;
scaled_points = bsxfun(@times,points,aspect);

%% Find the closest point
del = sum((bsxfun(@minus,scaled_points,scaled_O).^2),2);
[val ind] = min(del);
best_ROC = [ind curve(ind)];

%% Display
plot(x_axis,curve,'.-');
hold on;
plot(O(1),O(2),'r*');
plot(best_ROC(1),best_ROC(2),'k*');

结果:

另外,对于Fit(maximize)曲线,您必须将原点更改为[x_axis(1) ticks(end)]


5
投票

以下是Jonas在R中实现的解决方案:

elbow_finder <- function(x_values, y_values) {
  # Max values to create line
  max_x_x <- max(x_values)
  max_x_y <- y_values[which.max(x_values)]
  max_y_y <- max(y_values)
  max_y_x <- x_values[which.max(y_values)]
  max_df <- data.frame(x = c(max_y_x, max_x_x), y = c(max_y_y, max_x_y))

  # Creating straight line between the max values
  fit <- lm(max_df$y ~ max_df$x)

  # Distance from point to line
  distances <- c()
  for(i in 1:length(x_values)) {
    distances <- c(distances, abs(coef(fit)[2]*x_values[i] - y_values[i] + coef(fit)[1]) / sqrt(coef(fit)[2]^2 + 1^2))
  }

  # Max distance point
  x_max_dist <- x_values[which.max(distances)]
  y_max_dist <- y_values[which.max(distances)]

  return(c(x_max_dist, y_max_dist))
}

3
投票

我们可以用简单直观的方式说出来

如果我们从曲线上的任意点到曲线的两个端点绘制两条线,则这两条线以度为单位的最小角度的点是所需的点。

在这里,两条线可以看作是手臂,点可以看作肘点!


2
投票

我一直在研究膝关节/肘关节检测。绝不是我是专家。一些可能与此问题相关的方法。

DFDT代表Dynamic First Derivate Threshold。它计算一阶导数并使用阈值算法来检测膝/肘点。 DSDT类似但使用二阶导数,我的评价表明他们有相似的表现。

S方法是L方法的扩展。 L方法在曲线上插入两条直线,两条线之间的截距是拐点/肘点。通过循环整个点,拟合线并评估MSE(均方误差)来找到最佳拟合。 S方法适合3条直线,这提高了精度,但也需要更多的计算。

我的所有代码都在GitHub上公开发布。此外,这个article可以帮助您找到有关该主题的更多信息。它只有四页长,所以应该很容易阅读。您可以使用代码,如果您想讨论任何方法,请随意这样做。


1
投票

双重派生方法。但是,对于噪声数据而言似乎并不好用。对于输出,您只需找到d2的最大值来识别弯头。这个实现在R.

elbow_finder <- function(x_values, y_values) {
  i_max <- length(x_values) - 1

  # First and second derived
  first_derived <- list()
  second_derived <- list()

  # First derived
  for(i in 2:i_max){
    slope1 <- (y_values[i+1] - y_values[i]) / (x_values[i+1] - x_values[i])
    slope2 <- (y_values[i] - y_values[i-1]) / (x_values[i] - x_values[i-1])
    slope_avg <- (slope1 + slope2) / 2
    first_derived[[i]] <- slope_avg 
  }
  first_derived[[1]] <- NA
  first_derived[[i_max+1]] <- NA
  first_derived <- unlist(first_derived)

  # Second derived
  for(i in 3:i_max-1){
    d1 <- (first_derived[i+1] - first_derived[i]) / (x_values[i+1] - x_values[i])
    d2 <- (first_derived[i] - first_derived[i-1]) / (x_values[i] - x_values[i-1])
    d_avg <- (d1 + d2) / 2
    second_derived[[i]] <- d_avg 
  }
  second_derived[[1]] <- NA
  second_derived[[2]] <- NA
  second_derived[[i_max]] <- NA
  second_derived[[i_max+1]] <- NA
  second_derived <- unlist(second_derived)

  return(list(d1 = first_derived, d2 = second_derived))
}
© www.soinside.com 2019 - 2024. All rights reserved.