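I have to OCR a batch of receipts/invoices that vary in angle, quality and language (French, English and Spanish); some are scanned, some are not. The script I wrote handles about 30% of the batch perfectly, but for some receipts, like the example provided, I struggle to get any result.
I am using Python and pytesseract. The images are on disk, so pytesseract should not be doing any preprocessing.
Here is a minimal example of the code I use to detect the angle, get a bird's-eye view of the receipt and extract the text with pytesseract.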
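import cv2
import imutils
import pytesseract
from imutils.perspective import four_point_transform
from numpy import asarray
from PIL import Image

# my_file and tessdata_dir_best are defined elsewhere in the script
transform_ok = True
orig_img = Image.open(my_file)
# convert from RGB (PIL) to BGR (CV2)
orig = cv2.cvtColor(asarray(orig_img), cv2.COLOR_RGB2BGR)
image = orig.copy()
image = imutils.resize(image, width=500)
# scale factor to map contour coordinates back onto the full-size image
ratio = orig.shape[1] / float(image.shape[1])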
# apply image transformation, color -> gray -> add a little blur -> edged focus
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Get rid of noise with Gaussian Blur filter
blurred = cv2.GaussianBlur(gray, (5, 5,), 0)
# edged = cv2.Canny(blurred, 75, 200)
edged = imutils.auto_canny(blurred)
# find the contours and sort them by area to locate the receipt (biggest outline)
# cnts = cv2.findContours(edged.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cv2.findContours(edged.copy(), cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
cnts = imutils.grab_contours(cnts)
cnts = sorted(cnts, key=cv2.contourArea, reverse=True)[:5]
receiptCnt = None
# loop over the contours
for c in cnts:
    # approximate the contour
    peri = cv2.arcLength(c, True)
    approx = cv2.approxPolyDP(c, 0.02 * peri, True)
    # if our approximated contour has four points, then we can
    # assume we have found the outline of the receipt
    if len(approx) == 4:
        receiptCnt = approx
        break
# Receipt contour not found
if receiptCnt is None:
    print("Could not find receipt outline, "
          "Used raw document")
    # transform failed...
    transform_ok = False
if transform_ok:
    # top-down bird's-eye view of the receipt
    receipt = four_point_transform(orig, receiptCnt.reshape(4, 2) * ratio)
    cv2.imwrite("receipt.jpeg", receipt)
else:
    orig_img.save("receipt.jpeg")
raw_text = pytesseract.image_to_data(
    "receipt.jpeg",
    config=f"--tessdata-dir {tessdata_dir_best} --oem 1 --psm 4",
    lang='script/Latin+fra+eng+spa',
    output_type='dict'
)
Note: as you can see, I also use the more accurate tessdata_best models.
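To judge how good a result is, it can help to look at the per-word confidence in the dict that `image_to_data` returns, rather than only at the raw text. Below is a minimal sketch, assuming the usual keys of that dict (`text`, `conf`, `page_num`, `block_num`, `par_num`, `line_num`). The helper name `lines_from_ocr_data` and the cut-off of 40 are made up for illustration and would need tuning on the real batch.

```python
from collections import defaultdict


def lines_from_ocr_data(data, min_conf=40.0):
    """Rebuild text lines from an image_to_data dict, dropping weak words.

    min_conf is an arbitrary illustrative cut-off, not a recommended value.
    """
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        # conf may be a str, int or float depending on the pytesseract
        # version; non-word rows (blocks, lines) carry a confidence of -1
        conf = float(data["conf"][i])
        if conf < min_conf or not word.strip():
            continue
        # words sharing page/block/paragraph/line numbers form one line
        key = (
            data["page_num"][i],
            data["block_num"][i],
            data["par_num"][i],
            data["line_num"][i],
        )
        lines[key].append(word)
    # sort by the key so lines come back in reading order
    return [" ".join(words) for _, words in sorted(lines.items())]
```

With the code above this would be called as `lines_from_ocr_data(raw_text)`. If most words on a receipt fall below the cut-off, that is a hint the preprocessing, not the language model, is the weak point.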
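Here is my example receipt. It is in French; I cannot get any text out of it, and the bird's-eye-view script fails...
How can I improve this result?
I tried this, but it had no effect on this receipt.
Any help or pointers are welcome!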
You should use a black-and-white picture. Adjust the size and the grayscale threshold. A lot depends on the picture quality:
import cv2
import pytesseract
img = cv2.imread('receipt.jpeg', cv2.IMREAD_UNCHANGED)
print('Original Dimensions : ',img.shape)
scale_percent = 100 # percent of original size
width = int(img.shape[1] * scale_percent / 100)
height = int(img.shape[0] * scale_percent / 100)
dim = (width, height)
# resize image
resized = cv2.resize(img, dim, interpolation = cv2.INTER_AREA)
print('Resized Dimensions : ',resized.shape)
grayImage = cv2.cvtColor(resized, cv2.COLOR_BGR2GRAY)
# pixels brighter than the threshold (85 here) become white (255), the rest black (0)
thresh, blackAndWhiteImage = cv2.threshold(grayImage, 85, 255, cv2.THRESH_BINARY)
# ocr text
text = pytesseract.image_to_string(blackAndWhiteImage)
print(text)
cv2.imshow('image', blackAndWhiteImage)
cv2.waitKey(0)
cv2.destroyAllWindows()
Output: not 100% accurate, but you can experiment with it yourself:
VANIKORO
Contre cia) Paridis
“4 Nantes
Ter : 02 28 23 64 97
mera = Dette Heure Etab
DUPLICATA
x
3000001982180 2E8-TAGBLU a
renise Promotion -20,00%
sur 92,00 soit -18,40
x44% 73,60 EUR #4
1 article
eee Reglenent =”
Carte bleue 73,60 €
Taxe “Wontant. Taux Base HT
TWA 12,27 20,00% 61,33
Co ease temas Wall ag tac baylen ca
° sb. Merch” a varre compréhansion
Vous avez 6té consei11é par
SVLVIE