Getting wrong numbers from pytesseract

Question

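I am trying to use pytesseract to read data from an online image, but the results are very poor, and I would like to know whether there is a way to improve them.

Here is my code: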

import io
import requests
import pytesseract
from PIL import Image

# Download the chart image and load it into PIL from memory
response = requests.get("https://port.jpx.co.jp/jpx/chart/chart21.exe?template=ini/DayIndexCSV&basequote=151_2024&begin=2024/4/2&end=2024/04/02&mode=D")
img = Image.open(io.BytesIO(response.content))

# Run OCR on the whole image with default settings
text = pytesseract.image_to_string(img)
print(text)
img.show()

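As you can see, the output is quite different from the actual text.

Many digits seem to be turned into "8", whether the original is 2, 4, 5, or 6. The separator also comes out sometimes as "." and sometimes as ",".

Even when I crop the image down to just the part I care about, the result is no better: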

w, h = img.size
img2 = img.crop((240, 185, w - 220, h - 220))
text = pytesseract.image_to_string(img2)
print(text)

The actual value in the cropped image is "2,714.45", whereas this code returns "eT AS".

text = pytesseract.image_to_string(img2, config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
print(text)

This code returns only "1".

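I don't really understand how it works. I also tried changing the colors as suggested in "Use pytesseract OCR to recognize text from an image", but that did not work either.

If anyone knows how I can get this working, I would be very grateful.

Thanks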

Answer 1

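A common problem for beginner Tesseract users is expecting it to OCR text straight from screen pixels, but Tesseract is designed to OCR scanned text at 300 dpi, supplied as a black-and-white image. In other words, you need to upscale the image to the size it would have at 300 dpi, and you need to threshold it to a pure black/white image.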
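The two steps described here (upscaling, then thresholding) might look roughly like the sketch below. It is only an illustration, not a definitive implementation: the function names prepare_for_ocr and ocr_number are hypothetical, and the scale factor of 4 (roughly 72 to 96 dpi screen pixels brought up to about 300 dpi), the threshold of 128, and the "--psm 7" single-line setting are assumptions that would need tuning for the actual chart image. If the text is light on a dark background, the comparison in the threshold would have to be inverted.

```python
from PIL import Image


def prepare_for_ocr(img, scale=4, threshold=128):
    """Upscale a screen-resolution image and binarize it to black/white.

    scale and threshold are assumed starting values, not tuned ones.
    """
    gray = img.convert("L")  # 8-bit grayscale
    big = gray.resize((gray.width * scale, gray.height * scale), Image.LANCZOS)
    # Pixels brighter than the threshold become white (255), the rest black (0).
    return big.point(lambda p: 255 if p > threshold else 0)


def ocr_number(img):
    """Run Tesseract on the preprocessed image, treating it as one text line."""
    import pytesseract  # imported here so the preprocessing works without it

    # Whitelist digits plus both separators; "--psm 7" is an assumption that
    # the crop contains a single line of text.
    config = "--psm 7 -c tessedit_char_whitelist=0123456789.,"
    return pytesseract.image_to_string(prepare_for_ocr(img), config=config).strip()
```

With the cropped image from the question, this would be called as print(ocr_number(img2)). Note that the whitelist in the question contains digits only, so "," and "." could never appear in its output.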

With that upscaling and thresholding in place, Tesseract gives mostly correct results (except where the font used has odd glyphs).
