Android semantic segmentation postprocessing is too slow

Question

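I would be very grateful if someone could offer advice on a task I have been failing to complete for the past week. I have a semantic segmentation model (MobileNetV3 + Lite ASPP). In short: the input is 1024x1024 and the output is the same size with 2 classes (bg and vehicle), so my output shape is (1, 1048576, 2). I am not a mobile developer or an expert in the Java world, so I tested with a couple of complete Android image segmentation examples. One is from Google: https://github.com/tensorflow/examples/tree/master/lite/examples/image_segmentation and the other is open source: https://github.com/pillarpond/image-segmenter-android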

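I successfully converted the model to tflite format, and its inference time on a OnePlus 7 with GPU enabled and 10 threads is 105-140ms. But here I ran into a problem: the overall execution time in both of these Android examples, or in any other semantic segmentation example you can find, is about 1050-1300ms (less than 1 FPS). The slowest part of the pipeline is image postprocessing (~900-1150ms). You can see that part in the Deeplab#segment method. Since I have only 1 class besides bg, I don't have this third loop, but everything else is unchanged and it is still slow. The output size is not small compared to other common mobile sizes such as 128/226/512, but still: I would not expect processing a 1024x1024 matrix and drawing rectangles on a canvas to take that long on a modern smartphone. I tried different solutions, such as splitting the matrix operations across threads, or creating all the objects (like RectF and Recognition) once and then only filling their attributes with new data inside the nested loop, but neither helped. On desktop I can handle this easily with numpy and opencv, and I don't even understand how to do the same on Android, or whether it would be efficient. Here is the code I use in Python: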

import cv2
import numpy as np
import tensorflow as tf

CLASS_COLORS = [(0, 0, 0), (255, 255, 255)]  # black for bg and white for mask


def get_image_array(image_input, width, height):
    img = cv2.imread(image_input, 1)
    img = cv2.resize(img, (width, height))
    img = img.astype(np.float32)
    img[:, :, 0] -= 128.0
    img[:, :, 1] -= 128.0
    img[:, :, 2] -= 128.0
    img = img[:, :, ::-1]
    return img

def get_segmentation_array(seg_arr, n_classes):
    output_height = seg_arr.shape[0]
    output_width = seg_arr.shape[1]
    seg_img = np.zeros((output_height, output_width, 3))
    for c in range(n_classes):
        seg_arr_c = seg_arr[:, :] == c
        seg_img[:, :, 0] += ((seg_arr_c)*(CLASS_COLORS[c][0])).astype('uint8')
        seg_img[:, :, 1] += ((seg_arr_c)*(CLASS_COLORS[c][1])).astype('uint8')
        seg_img[:, :, 2] += ((seg_arr_c)*(CLASS_COLORS[c][2])).astype('uint8')

    return seg_img


interpreter = tf.lite.Interpreter(model_path="my_model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()


img_arr = get_image_array("input.png", 1024, 1024)
interpreter.set_tensor(input_details[0]['index'], np.array([img_arr]))  # add the batch dimension
interpreter.invoke()

output = interpreter.get_tensor(output_details[0]['index'])
output = output.reshape((1024,  1024, 2)).argmax(axis=2)
seg_img = get_segmentation_array(output, 2)
cv2.imwrite("output.png", seg_img)

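Maybe there is something more powerful than the current postprocessing solution. I would really appreciate any help. I'm sure something can improve the postprocessing and reduce its time to ~100ms, so that I get ~5 FPS overall.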
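
For reference, the NumPy `argmax(axis=2)` plus color lookup above can be sketched in plain Java as a single pass that writes one int per pixel and allocates no per-pixel objects. This is only a minimal sketch, not code from either example app: the class name `ArgmaxColorizer`, the method `colorize`, and the opaque ARGB color values are assumptions made for illustration, and the scores are assumed to be interleaved per pixel, as the (1, 1048576, 2) output shape suggests.

```java
import java.nio.FloatBuffer;

// Hypothetical helper, not part of TensorFlow Lite or the linked examples.
class ArgmaxColorizer {
    // Assumed ARGB colors mirroring CLASS_COLORS: opaque black for bg, opaque white for vehicle.
    static final int[] CLASS_COLORS = {0xFF000000, 0xFFFFFFFF};

    /** For each pixel, picks the class with the highest score and stores its color. */
    static void colorize(FloatBuffer scores, int nClasses, int[] classColors, int[] pixels) {
        float[] pixelScores = new float[nClasses]; // reused for every pixel
        for (int i = 0; i < pixels.length; i++) {
            scores.get(pixelScores); // reads the next nClasses floats
            int best = 0;
            for (int c = 1; c < nClasses; c++) {
                // Strict '>' keeps the first class on ties, like numpy's argmax.
                if (pixelScores[c] > pixelScores[best]) {
                    best = c;
                }
            }
            pixels[i] = classColors[best];
        }
    }

    public static void main(String[] args) {
        // Two pixels with two class scores each: the first is bg, the second is vehicle.
        FloatBuffer scores = FloatBuffer.wrap(new float[] {0.9f, 0.1f, 0.2f, 0.8f});
        int[] pixels = new int[2];
        colorize(scores, 2, CLASS_COLORS, pixels);
        for (int p : pixels) {
            System.out.println(Integer.toHexString(p));
        }
    }
}
```

On Android, the resulting `pixels` array could then be handed to `Bitmap.setPixels` in one call instead of drawing a rectangle per pixel.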

Answer 1

New update. Thanks to Farmaker, I used a piece of code found in his repository (see the comments above), and now the pipeline looks like this:

int channels = 3;
int n_classes = 2;
int float_byte_size = 4;
int width = model.inputWidth;
int height = model.inputHeight;

int[] intValues = new int[width * height];
ByteBuffer inputBuffer = ByteBuffer.allocateDirect(width * height * channels * float_byte_size).order(ByteOrder.nativeOrder());
ByteBuffer outputBuffer = ByteBuffer.allocateDirect(width * height * n_classes * float_byte_size).order(ByteOrder.nativeOrder());

Bitmap input = textureView.getBitmap(width, height);
input.getPixels(intValues, 0, width, 0, 0, width, height);

inputBuffer.rewind();
outputBuffer.rewind();

// Unpack each ARGB pixel into R, G, B floats centered around zero.
for (final int value : intValues) {
    inputBuffer.putFloat(((value >> 16) & 0xff) - 128.0f);
    inputBuffer.putFloat(((value >> 8) & 0xff) - 128.0f);
    inputBuffer.putFloat((value & 0xff) - 128.0f);
}

tfLite.run(inputBuffer, outputBuffer);

final Bitmap output = Bitmap.createBitmap(width, height, Bitmap.Config.ARGB_8888);
outputBuffer.flip();
int[] pixels = new int[width * height];
for (int i = 0; i < width * height; i++) {
    float max = outputBuffer.getFloat();
    float val = outputBuffer.getFloat();
    int id = val > max ? 1 : 0;
    pixels[i] = id == 0 ? 0x00000000 : 0x990000ff;
}
output.setPixels(pixels, 0, width, 0, 0, width, height);
resultView.setImageBitmap(ImageUtils.resizeBitmap(output, resultView.getWidth(), resultView.getHeight()));

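At the moment the postprocessing time is about 70-130ms, with a 95th percentile of about 90ms. Add about 60ms for image preprocessing, about 140ms for inference, and about 30-40ms for everything else (with GPU enabled and 10 threads), and the overall execution time comes to about 330ms, i.e. 3 FPS! And this is for a large model at 1024x1024. At this point I am more than happy, and I want to try different configurations for my model, including MobilenetV3 Small as the backbone.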
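
The ~60ms preprocessing step might be worth a look next. The sketch below fills a plain `float[]` and copies it into the direct buffer with one bulk `put`, instead of calling `putFloat` three times per pixel. Whether this is actually faster on a given device is an assumption that would need measuring; the class name `InputPacker` and the method `toCenteredRgb` are hypothetical names used only for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of TensorFlow Lite or the linked examples.
class InputPacker {
    /** Converts packed ARGB pixels to interleaved R, G, B floats centered around zero. */
    static void toCenteredRgb(int[] argb, float[] out) {
        int j = 0;
        for (int value : argb) {
            out[j++] = ((value >> 16) & 0xff) - 128.0f; // red
            out[j++] = ((value >> 8) & 0xff) - 128.0f;  // green
            out[j++] = (value & 0xff) - 128.0f;         // blue
        }
    }

    public static void main(String[] args) {
        int width = 2, height = 1, channels = 3;
        int[] argb = {0xFF80FF00, 0xFF000080}; // stand-in for Bitmap.getPixels output
        float[] floats = new float[width * height * channels];
        toCenteredRgb(argb, floats);

        ByteBuffer inputBuffer = ByteBuffer
                .allocateDirect(floats.length * Float.BYTES)
                .order(ByteOrder.nativeOrder());
        // One bulk copy through a float view; inputBuffer's own position stays at 0.
        inputBuffer.asFloatBuffer().put(floats);

        System.out.println(inputBuffer.getFloat(0)); // red channel of the first pixel
    }
}
```

Both `floats` and `inputBuffer` can be allocated once and reused across frames, so the per-frame work is the conversion loop plus one copy.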
