How to use Tesseract to extract text from a batch of widely varying receipts

Question

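I have to OCR a batch of receipts/invoices that vary in angle, quality, and language (French, English, and Spanish); some are scanned, some are not. The script I wrote handles about 30% of the batch perfectly, but for some receipts, like the example provided, it is hard to get any result.

I am using Python and pytesseract. The images are on disk, so pytesseract should not do any preprocessing.

Here is a minimal example of the code I use to detect the angle, get a bird's-eye view of the receipt, and extract the text with pytesseract.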

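    import cv2
    import imutils
    import pytesseract
    from imutils.perspective import four_point_transform
    from numpy import asarray
    from PIL import Image

    transform_ok = True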

    orig_img = Image.open(my_file)
    # convert from RGB (PIL) to BGR (CV2)
    orig = cv2.cvtColor(asarray(orig_img), cv2.COLOR_RGB2BGR)

    image = orig.copy()
    image = imutils.resize(image, width=500)
    ratio = orig.shape[1] / float(image.shape[1])

    # apply image transformations: color -> gray -> slight blur -> edge detection
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Get rid of noise with a Gaussian blur filter
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    # edged = cv2.Canny(blurred, 75, 200)
    edged = imutils.auto_canny(blurred)

    # find the contours and sort them by area to locate the receipt (largest contour)
    # cnts = cv2.findContours(edged.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnts = cv2.findContours(edged.copy(), cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    cnts = imutils.grab_contours(cnts)
    cnts = sorted(cnts, key=cv2.contourArea, reverse=True)[:5]
    receiptCnt = None
    # loop over the contours
    for c in cnts:
        # approximate the contour
        peri = cv2.arcLength(c, True)
        approx = cv2.approxPolyDP(c, 0.02 * peri, True)
        # if our approximated contour has four points, then we can
        # assume we have found the outline of the receipt
        if len(approx) == 4:
            receiptCnt = approx
            break
    # Receipt contour not found
    if receiptCnt is None:
        print("Could not find receipt outline, using raw document")
        # transform failed, fall back to the original image
        transform_ok = False
    if transform_ok:
        # top-down bird's-eye view of the receipt
        receipt = four_point_transform(orig, receiptCnt.reshape(4, 2) * ratio)
        cv2.imwrite("receipt.jpeg", receipt)
    else:
        orig_img.save("receipt.jpeg")

    raw_text = pytesseract.image_to_data(
        "receipt.jpeg",
        config=f"--tessdata-dir {tessdata_dir_best} --oem 1 --psm 4",
        lang='script/Latin+fra+eng+spa',
        output_type='dict'
    )

Note: as you can see, I also use the more accurate tessdata_best models.
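For checking what Tesseract actually recognized, the dictionary returned by `image_to_data` can be condensed into text lines while dropping low-confidence words. The following is only a sketch: the helper name `words_to_lines` and the cutoff `min_conf=50.0` are assumptions made for illustration, and it relies on the standard `text`, `conf`, `page_num`, `block_num`, `par_num`, and `line_num` keys of that dictionary.

```python
from collections import defaultdict


def words_to_lines(data, min_conf=50.0):
    """Group recognized words into lines, skipping low-confidence entries.

    `data` is a dict shaped like the output of
    pytesseract.image_to_data(..., output_type='dict').
    `min_conf` is an arbitrary illustrative cutoff, not a recommended value.
    """
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        # conf may be a string, int, or float depending on the version;
        # structural (non-word) entries carry a confidence of -1.
        conf = float(data["conf"][i])
        if conf < min_conf or not word.strip():
            continue
        key = (
            data["page_num"][i],
            data["block_num"][i],
            data["par_num"][i],
            data["line_num"][i],
        )
        lines[key].append(word)
    # Sort by (page, block, paragraph, line) to keep reading order.
    return [" ".join(words) for _, words in sorted(lines.items())]
```

Calling `words_to_lines(raw_text)` on the result above gives a list of strings, one per detected line, which makes it easier to compare runs with different preprocessing or `--psm` settings.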

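Here is an example of one of my receipts. It is in French, and I cannot get any text out of it; the bird's-eye view script fails...

How can I improve this result?

I tried this, but it had no effect on this receipt.

Any help or pointers are welcome!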

Answer 1

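You should use a black-and-white image. Adjust the size and the grayscale threshold. The result depends heavily on the image quality: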

    import cv2
    import pytesseract

    img = cv2.imread('receipt.jpeg', cv2.IMREAD_UNCHANGED)

    print('Original Dimensions : ', img.shape)

    scale_percent = 100  # percent of original size (100 = unchanged)
    width = int(img.shape[1] * scale_percent / 100)
    height = int(img.shape[0] * scale_percent / 100)
    dim = (width, height)

    # resize image
    resized = cv2.resize(img, dim, interpolation=cv2.INTER_AREA)

    print('Resized Dimensions : ', resized.shape)

    grayImage = cv2.cvtColor(resized, cv2.COLOR_BGR2GRAY)
    # pixels brighter than the threshold (85) become white (255), the rest black (0)
    (thresh, blackAndWhiteImage) = cv2.threshold(grayImage, 85, 255, cv2.THRESH_BINARY)

    # ocr text
    text = pytesseract.image_to_string(blackAndWhiteImage)
    print(text)

    # show the binarized image until a key is pressed
    cv2.imshow('image', blackAndWhiteImage)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

Output: not 100% accurate, but you can experiment with the parameters yourself

VANIKORO

Contre cia) Paridis
“4 Nantes
Ter : 02 28 23 64 97

mera = Dette Heure Etab
DUPLICATA

x
3000001982180 2E8-TAGBLU a

renise Promotion -20,00%
sur 92,00 soit -18,40

x44% 73,60 EUR #4

1 article
eee Reglenent =”
Carte bleue 73,60 €
Taxe “Wontant. Taux Base HT
TWA 12,27 20,00% 61,33
Co ease temas Wall ag tac baylen ca

° sb. Merch” a varre compréhansion
Vous avez 6té consei11é par
SVLVIE
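Since a fixed threshold such as 85 will not suit every image in a mixed batch, one option to experiment with is letting the threshold be chosen per image. The sketch below illustrates Otsu's method in plain Python on a grayscale image given as nested lists of 0..255 values; the helper names (`histogram`, `otsu_threshold`, `binarize`) are made up for this example. In OpenCV the same idea is available by passing `cv2.THRESH_BINARY + cv2.THRESH_OTSU` as the type to `cv2.threshold` with a threshold argument of 0. Whether it helps on a given receipt is something to test, not a guarantee.

```python
def histogram(pixels):
    """Count how often each gray level (0..255) occurs in a 2D list of pixels."""
    counts = [0] * 256
    for row in pixels:
        for value in row:
            counts[value] += 1
    return counts


def otsu_threshold(counts):
    """Return the gray level that maximizes the between-class variance.

    Pixels <= threshold form the dark class, pixels > threshold the bright one.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("empty histogram")
    sum_all = sum(level * n for level, n in enumerate(counts))
    weight_dark = 0
    sum_dark = 0
    best_level, best_variance = 0, -1.0
    for level, n in enumerate(counts):
        weight_dark += n
        sum_dark += level * n
        if weight_dark == 0:
            continue  # no dark pixels yet
        weight_bright = total - weight_dark
        if weight_bright == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_bright = (sum_all - sum_dark) / weight_bright
        variance = weight_dark * weight_bright * (mean_dark - mean_bright) ** 2
        if variance > best_variance:
            best_variance, best_level = variance, level
    return best_level


def binarize(pixels, threshold):
    """Same rule as cv2.THRESH_BINARY: above the threshold -> 255, else 0."""
    return [[255 if value > threshold else 0 for value in row] for row in pixels]
```

A typical use would be `t = otsu_threshold(histogram(gray_rows))` followed by `binarize(gray_rows, t)`, where `gray_rows` is a hypothetical name for the grayscale image converted to nested lists.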
