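I'm doing OCR in Python with pytesseract. What I'm trying to do is read the text in an image, extract it, and store the extracted text in a txt or csv file using file handling. I want to read multiple files, store their text, and check whether the text of the image being read is already present in the txt file. Here is my code; it runs without errors. The last few lines are what I'm trying to do, but they don't seem to work. Can anyone help? Thanks in advance.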
import cv2
import pytesseract,csv,re,os
from PIL import Image
from ast import literal_eval
img = pytesseract.image_to_string(Image.open("test1.png"), lang="eng")
print(img)
with open('C:\\Users\\Hasan\\Videos\\Captures\\saved.csv', "w") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(img)
string = open('C:\\Users\\Hasan\\Videos\\Captures\\saved.csv').read()
new_str = re.sub('[^a-zA-z0-9\n\.]', ' ', string)
open('C:\\Users\\Hasan\\Videos\\Captures\\saved.csv', "w").write(new_str)
# f = open("saved.csv", "r")
# read = f.readline()
# print("\n" + f.read())
with open('C:\\Users\\Hasan\\Videos\\Captures\\saved.csv') as sv:
    for line in sv:
        if img in line:
            print("Data already exists")
        else:
            print("file saved successfully")
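Replace the '\n' characters when writing to the CSV file, and strip '\n' from img before doing the comparison. The file is read back one line at a time, so OCR text that still contains newlines can never be found inside a single line.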
import cv2
import pytesseract,csv,re,os
from PIL import Image
from ast import literal_eval

img_path = "example_01.png"
out_csv_path = "saved.csv"

img = pytesseract.image_to_string(Image.open(img_path), lang="eng")
print(img)

# writerow() on a plain string writes one character per column; the regex
# below strips the separators, quotes and newlines again
with open(out_csv_path, "w") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(img)

# Keep only letters, digits, dots and spaces, so the file holds a single line
with open(out_csv_path) as f:
    string = f.read()
new_str = re.sub(r'[^a-zA-Z0-9. ]', '', string)
with open(out_csv_path, "w") as f:
    f.write(new_str)

# f = open("saved.csv", "r")
# read = f.readline()
# print("\n" + f.read())
with open(out_csv_path,newline='') as sv:
    # Strip the same characters from the OCR text before comparing
    img = re.sub(r'[^a-zA-Z0-9. ]', '', img)
    for line in sv:
        print("Line text is: {}\nExtracted Text is: {}".format(line,img))
        if img in line:
            print("Data already exists")
        else:
            print("file saved successfully")
Sample output:
Noisyimage
to test
Tesseract OCR
Line text is: Noisyimageto testTesseract OCR
Extracted Text is: Noisyimageto testTesseract OCR
Data already exists
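
Note that the code above overwrites saved.csv on every run, so the comparison always finds the text that was just written. For the multi-file case in the question, one option is to check first and append only when the text is new. The following is a minimal sketch under that assumption, not the answer's own method: the helper names `normalize` and `save_if_new` and the one-entry-per-line file layout are hypothetical choices made for illustration. It takes the OCR text as a plain string, so it uses only the standard library.

```python
import re
from pathlib import Path


def normalize(text: str) -> str:
    """Collapse every run of whitespace (newlines included) into one space."""
    return re.sub(r"\s+", " ", text).strip()


def save_if_new(text: str, store: Path) -> bool:
    """Append the normalized text as one line unless it is already stored.

    Returns True if the text was written, False if it was empty or
    already present in the file. (Hypothetical helper for illustration.)
    """
    entry = normalize(text)
    if not entry:
        return False

    # Each stored entry occupies exactly one line, so a set lookup suffices
    existing = set()
    if store.exists():
        existing = set(store.read_text(encoding="utf-8").splitlines())
    if entry in existing:
        return False

    # Append instead of overwriting, so earlier images stay in the file
    with store.open("a", encoding="utf-8") as f:
        f.write(entry + "\n")
    return True
```

With pytesseract, each image would then be handled by passing `pytesseract.image_to_string(Image.open(path), lang="eng")` as `text` and printing "Data already exists" when `save_if_new` returns False.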