Pytesseract with PySpark throws error: pytesseract module not found - pyspark

I am trying to write OCR code using Spark and pytesseract, and I am running into a "pytesseract module not found" error even though the pytesseract module is installed.
import pytesseract
from PIL import Image
path='/XXXX/JupyterLab/notebooks/testdir'
rdd = sc.binaryFiles(path)
rdd.keys().collect()
-->['file:XXX/JupyterLab/notebooks/testdir/copy.png']
input=rdd.keys().map(lambda s: s.replace("file:",""))
def read(x):
    import pytesseract
    image = Image.open(x)
    text = pytesseract.image_to_string(image)
    return text

newRdd = input.map(lambda x: read(x))
newRdd.collect()
"On newRdd.collect() I get following error"
ModuleNotFoundError: No module named 'pytesseract'
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$16.apply(RDD.scala:960)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$16.apply(RDD.scala:960)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2111)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2111)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:420)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I am not sure how I can pass the rdd.keys() value, which holds the image path, to pytesseract.image_to_string() using Image.open().
Thank you.

My error was resolved by adding
sc.addPyFile('/pathto........../pytesseract/pytesseract.py')
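For reference, here is a minimal sketch of the whole flow with that fix applied. It assumes pytesseract is shipped to the executors via sc.addPyFile (as above) and that the Tesseract binary itself is installed on every worker node; reading the images from the (path, bytes) pairs returned by sc.binaryFiles, instead of re-opening the paths, is my own suggestion and not from the original post.
import io
from PIL import Image

# Ship the single-file pytesseract module to the executors (path elided as in the post).
sc.addPyFile('/pathto........../pytesseract/pytesseract.py')

def ocr_image(path_and_bytes):
    # Import inside the function so the module is resolved on the executor, not just the driver.
    import pytesseract
    path, raw = path_and_bytes
    image = Image.open(io.BytesIO(raw))  # decode the image from the bytes Spark already read
    return path, pytesseract.image_to_string(image)

# binaryFiles yields (path, bytes) pairs, so no worker-side file access is needed.
rdd = sc.binaryFiles('/XXXX/JupyterLab/notebooks/testdir')
results = rdd.map(ocr_image).collect()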

Related

oracle sql developer sdcli utility import error

I'm trying to run some imports from an Excel file using the sdcli utility import command, but I get an error. When I import from the SQL Developer GUI, the import works fine.
This is the command I run:
sdcli utility import -config import_file.sdimp
This is the command output:
java.lang.NullPointerException
at oracle.dbtools.raptor.data.readers.DataReaderRegistry.getReader(DataReaderRegistry.java:45)
at oracle.dbtools.raptor.data.core.ImportXMLUtil.reconcileConfig(ImportXMLUtil.java:1378)
at oracle.dbtools.raptor.data.core.ImportXMLUtil.reconcileConfig(ImportXMLUtil.java:1076)
at oracle.dbtools.raptor.data.core.ImportXMLUtil.reconcileConfig(ImportXMLUtil.java:1068)
at oracle.dbtools.raptor.data.headless.ImportCommand.doCommand(ImportCommand.java:177)
at oracle.dbtools.raptor.data.headless.ImportProcessor$ImportHeadlessTask.doWork(ImportProcessor.java:37)
at oracle.dbtools.raptor.data.headless.ImportProcessor$ImportHeadlessTask.doWork(ImportProcessor.java:27)
at oracle.dbtools.raptor.backgroundTask.RaptorTask.call(RaptorTask.java:199)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at oracle.dbtools.raptor.backgroundTask.RaptorTaskManager$RaptorFutureTask.run(RaptorTaskManager.java:702)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Run:
sdcli.exe utility import -c C:\temp\isd.sdimp -conn prod_db
The C:\temp\isd.sdimp file is created by SQL Developer; its XML can be edited, and the connection must be given on the command line. This lets you script the import of data into a table that you cannot "see", since you are missing rights in "prod_db" and the DBA is "untouchable".

GraphFrames with pySpark

I want to use GraphFrames with PySpark (currently using Spark v2.3.3, on Google Dataproc).
After installing GraphFrames with
pip install graphframes
I try to run the following code:
from graphframes import *
localVertices = [(1,"A"), (2,"B"), (3, "C")]
localEdges = [(1,2,"love"), (2,1,"hate"), (2,3,"follow")]
v = sqlContext.createDataFrame(localVertices, ["id", "name"])
e = sqlContext.createDataFrame(localEdges, ["src", "dst", "action"])
g = GraphFrame(v, e)
but I get this error:
Py4JJavaError: An error occurred while calling o301.loadClass.
: java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Any ideas how to fix this issue?
To use GraphFrames with Spark, you should install it as a Spark package, not a pip package:
pyspark --packages graphframes:graphframes:0.7.0-spark2.3-s_2.11
If you are using Jupyter for development, start it from pyspark rather than directly or from Anaconda. That is, open a terminal and then run:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
This starts Jupyter with the correct PySpark packages loaded in the background. If you then import it in your script with from graphframes import *, it will be picked up correctly and run.
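An alternative, in case you build the SparkSession yourself instead of launching through the pyspark command: the same package can be requested via the spark.jars.packages configuration, the programmatic equivalent of the --packages flag above. This is a minimal sketch under that assumption; the setting only takes effect if no SparkContext has been created yet, and the app name is arbitrary.
from pyspark.sql import SparkSession

# Request the graphframes package when the session is built
# (equivalent to pyspark --packages graphframes:graphframes:0.7.0-spark2.3-s_2.11).
spark = (SparkSession.builder
         .appName("graphframes-example")
         .config("spark.jars.packages", "graphframes:graphframes:0.7.0-spark2.3-s_2.11")
         .getOrCreate())

from graphframes import GraphFrame

localVertices = [(1, "A"), (2, "B"), (3, "C")]
localEdges = [(1, 2, "love"), (2, 1, "hate"), (2, 3, "follow")]
v = spark.createDataFrame(localVertices, ["id", "name"])
e = spark.createDataFrame(localEdges, ["src", "dst", "action"])
g = GraphFrame(v, e)
g.edges.show()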

Add py file to spark scala

I am trying to execute a Python script from a Scala Spark job (Spark 2.3), like below:
val pyScript = "wasb://scripts#myAccount.blob.core.windows.net/print.py"
val pyScriptName = "print.py"
spark.sparkContext.addFile(pyScript)
val ipData = sc.parallelize(List("abc","def","ghi"))
val opData = ipData.pipe(org.apache.spark.SparkFiles.get(pyScriptName))
opData.foreach(println)
However, I get the exception below. Any ideas what could be wrong?
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 2.0 failed 4 times, most recent failure: Lost task 2.3 in stage 2.0 (TID 141, wn4-novakv.oetevw42cdoe3jls1dzdeclktg.ex.internal.cloudapp.net, executor 2): java.io.IOException: Cannot run program "/mnt/resource/hadoop/yarn/local/usercache/livy/appcache/application_1539905356890_0003/spark-5da59a08-a6a4-443d-b3b1-c31643e195c5/userFiles-e3ca8a1b-44f6-4804-9c95-625b1742fb77/print.py": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:111)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: error=2, No such file or directory

javaws name too long

I am trying to use IcedTea/javaws to authenticate as a user at the USPTO (patent/trademark office). I download the JNLP file and try to use it with javaws, but hit a "name too long" error - does anyone know how to work around this?
jeremy#jeremy-Blade ~/Downloads $ javaws uspto-auth.authenticate.jnlp
java.io.FileNotFoundException: /home/jeremy/.cache/icedtea-web/cache/0/https/efs.uspto.gov/TruePassWebStart/uspto-auth.authenticate.jnlp.q_SlNFU1NJT05JRD1GN0Q5ODBCNzhCODI3QTQwMkZEOEI0ODU4M0M5OTYzMi5wcm9kX3RwdG9tY2F0MjE1X2p2bTsgbXl1c3B0b19zaWduaW49aW5pdGlhdGVkOyBmc3Iucj0lN0IlMjJkJTIyJTNBOTAlMkMlMjJpJTIyJTNBJTIyZDU1NDUwNS01MTYyNjUxMy04NTMwLTJkNmItYTJmMmUlMjIlMkMlMjJlJTIyJTNBMTUxNTk5ODY2NDIxNCU3RDsgX191dG1hPTIxNzEyMjUwNS4yMDQ3MDM2NjA0LjE1MTUzOTM2MjUuMTUxNTU3NjYzOS4xNTE2ODAwMTU2LjI7IF9fdXRtej0yMTcxMjI1MDUuMTUxNTU3NjYzOS4xLjEudXRtY3NyPShkaXJlY3QpfHV0bWNjbj0oZGlyZWN0KXx1dG1jbWQ9KG5vbmUpOyBmc3Iucz0lN0IlMjJ2MSUyMiUzQS0xJTJDJTIydjIlMjIlM0EtMiUyQyUyMnJpZCUyMiUzQSUyMmQwNDgwMTItNTcwODgwODUtOTVhYy1lMmE4LTE0Njg3JTIyJTJDJTIycnUlMjIlM0ElMjJodHRwcyUzQSUyRiUyRnd3dy51c3B0by5nb3YlMkYlMjIlMkMlMjJyJTIyJTNBJTIyd3d3LnVzcHRvLmdvdiUyMiUyQyUyMnN0JTIyJTNBJTIyJTIyJTJDJTIyY3AlMjIlM0ElN0IlMjJQYXRlbnRzJTIyJTNBJTIyWSUyMiUyQyUyMlRyYWRlbWFya3MlMjIlM0ElMjJOJTIyJTJDJTIySVBfTGF3X1BvbGljeV9JbmZvJTIyJTNBJTIyTiUyMiUyQyUyMlZlbmRvcl9JbmZvJTIyJTNBJTIyTiUyMiUyQyUyMkF2YWlsX0VsZWN0cm9uaWNfQml6X1N5cyUyMiUzQSUyMk4lMjIlMkMlMjJJbnRsX0FjdGl2aXRpZXMlMjIlM0ElMjJOJTIyJTJDJTIyVGVzdGltb255X1NwZWVjaGVzJTIyJTNBJTIyTiUyMiU3RCUyQyUyMnRvJTIyJTNBMy4yJTJDJTIyYyUyMiUzQSUyMmh0dHBzJTNBJTJGJTJGd3d3LnVzcHRvLmdvdiUyRnBhdGVudHMtYXBwbGljYXRpb24tcHJvY2VzcyUyRmFwcGx5aW5nLW9ubGluZSUyRmFib3V0LWVmcy13ZWIlMjIlMkMlMjJwdiUyMiUzQTIlMkMlMjJsYyUyMiUzQSU3QiUyMmQxJTIyJTNBJTdCJTIydiUyMiUzQTIlMkMlMjJzJTIyJTNBdHJ1ZSU3RCU3RCUyQyUyMmNkJTIyJTNBMSUyQyUyMmYlMjIlM0ExNTE3NTg2MTE1MzQ2JTJDJTIyc2QlMjIlM0ExJTdEOyBfZ2E9R0ExLjIuMjA0NzAzNjYwNC4xNTE1MzkzNjI1OyBfZ2lkPUdBMS4yLjE2MDM3NTE1MzkuMTUxNzU4NjExODsgRW50cnVzdFRydWVQYXNzUmVkaXJlY3RVcmw9Imh0dHBzOi8vZWZzLnVzcHRvLmdvdi9FRlNXZWJVSVJlZ2lzdGVyZWQvRUZTV2ViUmVnaXN0ZXJlZCI_.info (File name too long)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
at net.sourceforge.jnlp.util.PropertiesFile.store(PropertiesFile.java:167)
at net.sourceforge.jnlp.cache.CacheEntry.store(CacheEntry.java:225)
at net.sourceforge.jnlp.cache.ResourceDownloader.initializeFromURL(ResourceDownloader.java:193)
at net.sourceforge.jnlp.cache.ResourceDownloader.initializeOnlineResource(ResourceDownloader.java:128)
at net.sourceforge.jnlp.cache.ResourceDownloader.initializeResource(ResourceDownloader.java:118)
at net.sourceforge.jnlp.cache.ResourceDownloader.run(ResourceDownloader.java:107)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

pyspark cython gets file.so too short

I am following the documentation here
https://docs.databricks.com/user-guide/faq/cython.html
I can get this short sample code to work, but when I incorporate it into my longer code, I get this:
File "./jobs.zip/jobs/util.py", line 51, in wrapped
cython_function_ = getattr(__import__(module), method)
File "/usr/local/lib64/python2.7/site-packages/pyximport/pyximport.py", line 458, in load_module
language_level=self.language_level)
File "/usr/local/lib64/python2.7/site-packages/pyximport/pyximport.py", line 233, in load_module
exec("raise exc, None, tb", {'exc': exc, 'tb': tb})
File "/usr/local/lib64/python2.7/site-packages/pyximport/pyximport.py", line 216, in load_module
mod = imp.load_dynamic(name, so_path)
ImportError: Building module cython_util failed: ['ImportError: /home/.pyxbld/lib.linux-x86_64-2.7/cython_util.so: file too short\n']
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Yet after this error the Spark program seems to keep running. Does anyone know what this complaint means? How can the .so file be too short? Can I ignore it and move on, since the program continues to run?