Pytesseract with PySpark raises an error: pytesseract module not found

I am trying to write OCR code using Spark and pytesseract. Even though the pytesseract module is installed, I get a "pytesseract module not found" error.

import pytesseract
from PIL import Image

path = '/XXXX/JupyterLab/notebooks/testdir'
rdd = sc.binaryFiles(path)  # sc is the SparkContext provided by the notebook

rdd.keys().collect()
# --> ['file:XXX/JupyterLab/notebooks/testdir/copy.png']

# Strip the "file:" scheme so PIL can open the path directly
paths = rdd.keys().map(lambda s: s.replace("file:", ""))

def read(x):
    import pytesseract  # imported on the executor, which is where the error is raised
    image = Image.open(x)
    text = pytesseract.image_to_string(image)
    return text

newRdd = paths.map(read)
newRdd.collect()

On newRdd.collect(), I get the following error:

ModuleNotFoundError: No module named 'pytesseract'
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
    at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
    at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$16.apply(RDD.scala:960)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$16.apply(RDD.scala:960)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2111)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2111)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:420)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

I am not sure how to pass rdd.keys(), which holds the image paths, through Image.open() to pytesseract.image_to_string().

Thanks.

pyspark ocr rdd tesseract python-tesseract
1 Answer

My error was resolved by adding:

sc.addPyFile('/pathto........../pytesseract/pytesseract.py')
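For context, sc.addPyFile accepts a .py file or a .zip archive and makes it available on the Python path of the executors, which is why the import inside read() starts working. If a single file is not enough, one option is to zip a whole pure-Python package and ship the archive instead. The sketch below is only an illustration under stated assumptions: zip_package and the throwaway package demo_ocr_pkg are hypothetical names, and the Spark call is left as a comment so the example runs without a cluster. Note that pytesseract is only a wrapper, so the tesseract binary (and Pillow) would presumably still need to be installed on every worker node.

```python
import importlib
import sys
import tempfile
import zipfile
from pathlib import Path


def zip_package(package_dir, zip_path):
    """Zip a pure-Python package so the package folder sits at the archive root.

    Hypothetical helper for illustration; not part of Spark or pytesseract.
    """
    package_dir = Path(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for py_file in sorted(package_dir.rglob("*.py")):
            # Store names relative to the parent, e.g. "demo_ocr_pkg/__init__.py"
            zf.write(py_file, py_file.relative_to(package_dir.parent).as_posix())
    return str(zip_path)


# Build a throwaway package (hypothetical name) to stand in for a real one.
work_dir = Path(tempfile.mkdtemp())
pkg_dir = work_dir / "src" / "demo_ocr_pkg"
pkg_dir.mkdir(parents=True)
(pkg_dir / "__init__.py").write_text("def greet():\n    return 'hello from the zip'\n")

archive = zip_package(pkg_dir, work_dir / "demo_ocr_pkg.zip")

# A zip on sys.path is importable, which is the mechanism the archive relies on.
sys.path.insert(0, archive)
importlib.invalidate_caches()
import demo_ocr_pkg

print(demo_ocr_pkg.greet())

# In the Spark job, before running the map (assumes an existing SparkContext sc):
# sc.addPyFile(archive)
```

Installing the package into the Python environment used by every worker is the more common long-term setup; shipping a file or archive with addPyFile is convenient when the cluster environment cannot be changed.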
