How to combine the results of multiple OCR tools to get better text recognition

Question

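Imagine you have several OCR tools for reading text from images, but none of them gives you 100% accurate output. Combined, however, the results could come very close to the ground truth. What is the best technique for "fusing" the texts together to get a good result?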

Example:

Actual text

§ 5.1: The contractor is obliged to announce the delay by 01.01.2019 at the latest. The identification-number to be used is OZ-771LS.

OCR tool 1

5 5.1 The contractor is obliged to announce the delay by O1.O1.2019 at the latest. The identification-number to be used is OZ77lLS.

OCR tool 2

§5.1: The contract or is obliged to announce theedelay by 01.O1. 2O19 at the latest. The identification number to be used is O7-771LS

OCR tool 3

§ 5.1: The contractor is oblige to do announced he delay by 01.01.2019 at the latest. T he identification-number ti be used is OZ-771LS.

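What would be a promising algorithm for fusing OCR 1, 2 and 3 to recover the actual text?

My first idea was to create a "rolling window" of arbitrary length, compare the words inside the window, and for each position take the word that 2 of the 3 tools agree on.

For example, with a window size of 3: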

[5 5.1 The]

[§5.1: The contract]

[§ 5.1: The]

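As you can see, this algorithm does not work, because all three tools produce a different candidate for position one (5, §5.1: and §).

Of course one could add some tricks, such as a Levenshtein distance to allow for some deviation, but I am afraid this would not be robust enough.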
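To make the "allow some deviation" idea concrete, here is a minimal sketch in plain Python. It assumes the candidate words for one position have already been collected, and it picks the candidate with the smallest total edit distance to all the others instead of requiring exact agreement. The names `levenshtein` and `closest_to_all` are illustrative, not taken from any OCR library.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))  # distances from "" to every prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def closest_to_all(candidates):
    """Return the candidate with the smallest summed distance to the others.

    min() keeps the first candidate on ties, so the tool order matters.
    """
    return min(candidates,
               key=lambda c: sum(levenshtein(c, other) for other in candidates))


# Candidates for the identification number from the three tools in the question.
print(closest_to_all(["OZ77lLS.", "O7-771LS", "OZ-771LS."]))
```

This only softens the comparison inside one position. It does not solve the underlying problem that the tools split the text into different numbers of words, so the positions themselves can drift apart.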

Answer 1

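To me, this looks like a nice ensemble inference problem.

There is more than one way to merge the predictions of several models. It is simplest for classification problems, where you can intuitively treat each model's prediction as a vote. It is then up to you how to handle the votes: do you want to weight certain models more heavily (for example, because they perform better), do you want to average the predictions (which does not make much sense for your NLP use case), or do you want to pick the class (here, the character) that gets the most votes?

The last option is called max voting. I will use it as an example.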

from collections import Counter

from sklearn.base import BaseEstimator, TransformerMixin


class MaxVotingEnsemble(BaseEstimator, TransformerMixin):
    """Merge several predicted strings by a per-position majority vote."""

    def fit(self, X, y=None):
        # Nothing to learn; defined so that fit_transform() works.
        return self

    def transform(self, X):
        # X is a list of strings, one per OCR tool.
        print("\nTransforming predictions with MaxVotingEnsemble...")

        # Group the characters that share a position across all strings.
        # zip() stops at the shortest string, so longer tails are dropped.
        zipped_predictions = zip(*X)

        merged_predictions = []
        for position, predictions in enumerate(zipped_predictions, start=1):
            print(f"\nProcessing position {position}: {predictions}")

            # Most frequent character; ties go to the first string's character.
            mode_prediction = Counter(predictions).most_common(1)[0][0]
            print(f"Mode prediction for position {position}: {mode_prediction}")

            merged_predictions.append(mode_prediction)

        return merged_predictions

I ran this in Python 3.11 and got:

Merged Predictions:
§ 5.1: The contractor is obliged to announce the delay by 01.01.2019 at the latest. The identifiiation-nnmber to be used is OZ-771LS.

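As you can see, this works quite well out of the box. However, that is mostly because the first string's prediction is used whenever there is no majority (and, on closer inspection, the first string is already a good approximation of the desired result).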
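The tie-breaking behavior is easy to check in isolation. In this small sketch, `mode` is an illustrative helper that repeats the one-line vote from the class above. `Counter.most_common` lists elements with equal counts in the order they were first encountered, so when every tool disagrees, the character from the first string wins.

```python
from collections import Counter


def mode(votes):
    """Most frequent vote; on a tie, the vote that was seen first."""
    return Counter(votes).most_common(1)[0][0]


print(mode(("5", "§", "§")))  # clear majority for "§"
print(mode(("5", "§", "$")))  # three-way tie: the first tool's "5" is returned
```

In other words, the order in which the OCR outputs are passed in acts as an implicit ranking of the tools.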

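The low-hanging fruit has been picked here, and getting better results will take more effort. Here are some ideas for next steps:

Adding more models will usually give you better results, because it becomes less likely that falling back to the first string causes a problem.
Weighting the models according to their performance can also make your predictions more robust.
Currently, characters are simply compared by index/position. We could instead use a sequence alignment algorithm to find the best alignment of the characters. One such algorithm is Needleman-Wunsch, which is widely used in bioinformatics for sequence alignment. In Python, the pairwise2 module of the Bio package (Biopython) has you covered (newer Biopython releases recommend Bio.Align.PairwiseAligner instead).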
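The weighting and alignment ideas can be combined in a short sketch. To stay self-contained, it uses a hand-written Needleman-Wunsch in plain Python rather than Biopython. The first text serves as a pivot, every other text is aligned to it, and each pivot character (plus each gap between pivot characters) is put to a weighted vote. The function names, the scoring values (`match=1`, `mismatch=-1`, `gap=-1`) and the pivot strategy are all assumptions made for illustration.

```python
def align(ref, other, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment of `other` against `ref`.

    Returns (chars, inserts):
      chars[i]   is the character of `other` aligned to ref[i] ("" if none),
      inserts[i] is the text of `other` that falls before ref[i]
                 (inserts[len(ref)] holds text after the end of ref).
    """
    n, m = len(ref), len(other)
    # score[i][j] = best score for aligning ref[:i] with other[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if ref[i - 1] == other[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner, preferring the diagonal.
    chars, inserts = [""] * n, [""] * (n + 1)
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and ref[i - 1] == other[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            chars[i - 1] = other[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # this ref character has no counterpart in `other`
        else:
            inserts[i] = other[j - 1] + inserts[i]  # extra character in `other`
            j -= 1
    return chars, inserts


def vote(options, weights):
    """Option with the largest total weight; ties go to the option seen first."""
    totals = {}
    for option, weight in zip(options, weights):
        totals[option] = totals.get(option, 0.0) + weight
    return max(totals, key=totals.get)


def merge(texts, weights=None):
    """Vote slot by slot after aligning every text to the pivot texts[0]."""
    if weights is None:
        weights = [1.0] * len(texts)
    ref = texts[0]
    # The pivot aligned with itself: every character in place, no inserts.
    aligned = [(list(ref), [""] * (len(ref) + 1))]
    aligned += [align(ref, text) for text in texts[1:]]
    out = []
    for i in range(len(ref) + 1):
        out.append(vote([ins[i] for _, ins in aligned], weights))
        if i < len(ref):
            out.append(vote([chars[i] for chars, _ in aligned], weights))
    return "".join(out)


# Fragments from OCR tools 3, 1 and 2 in the question.
print(merge(["he delay", "the delay", "theedelay"]))  # the delay
```

In this example the pivot is missing its first character, and the other two tools vote it back in, which a purely index-based comparison cannot do. The sketch has clear limits: optimal alignments are not always unique, so the choice of pivot and of scoring values can change the result, and the quadratic alignment table becomes expensive for long documents unless the text is processed line by line.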

This is where I leave you, with the first steps toward setting up a solution.
