I am trying to write OCR code using Spark and pytesseract, and I am running into a "ModuleNotFoundError: No module named 'pytesseract'" error even though the pytesseract module is installed.
import pytesseract
from PIL import Image
path = '/XXXX/JupyterLab/notebooks/testdir'
rdd = sc.binaryFiles(path)
rdd.keys().collect()
# ['file:XXX/JupyterLab/notebooks/testdir/copy.png']
input = rdd.keys().map(lambda s: s.replace("file:", ""))

def read(x):
    import pytesseract
    image = Image.open(x)
    text = pytesseract.image_to_string(image)  # image_to_string, not image_to_open
    return text

newRdd = input.map(lambda x: read(x))
newRdd.collect()
"On newRdd.collect() I get following error"
ModuleNotFoundError: No module named 'pytesseract'
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$16.apply(RDD.scala:960)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$16.apply(RDD.scala:960)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2111)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2111)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:420)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I am not sure how I can pass rdd.keys(), which holds the image path, to pytesseract.image_to_string() via Image.open().
Thank you.
My error was resolved by adding
sc.addPyFile('/pathto........../pytesseract/pytesseract.py')
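Putting it together, a minimal sketch of the working flow (the pytesseract.py path below is a placeholder for the elided one, and it assumes Pillow and the Tesseract binary are installed on every executor):

from PIL import Image

sc.addPyFile('/path/to/pytesseract/pytesseract.py')  # placeholder path; ships the module to executors

def read(path):
    import pytesseract  # now resolvable on the executors thanks to addPyFile
    return pytesseract.image_to_string(Image.open(path))

texts = rdd.keys().map(lambda s: s.replace("file:", "")).map(read)
texts.collect()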
I am running a sqoop import command via Java but am getting an error:
failed to import: parameter directory is not a directory
Why am I getting this error?
Please check the directory path passed in the --target-dir parameter.
Thanks
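For reference, a typical sqoop import looks roughly like the following (the connection string, credentials, table, and path are placeholders); make sure the value passed to --target-dir is an HDFS directory path rather than a file:

sqoop import \
  --connect jdbc:mysql://db-host/source_db \
  --username etl_user \
  --table customers \
  --target-dir /user/etl_user/customers_import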
I am new to PySpark and have done some initial tutorials. When I try to load a CSV file on my local machine into Spark using a Jupyter Notebook, the error below pops up. My Java version is 8.0.
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('sql based spark data analysis') \
    .config('spark.some.config.option', 'some-value') \
    .getOrCreate()

df = spark.read.csv('C:/Users/sitaram/Downloads/creditcardfraud/creditcard.csv')
My error is as follows:
Py4JJavaError: An error occurred while calling o55.csv.
: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException:
java.lang.RuntimeException: Error while running command to get file
permissions : java.io.IOException: (null) entry in command string: null ls -F C:\tmp\hive
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:770)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:866)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:849)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:65
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Error
while running command to get file permissions : java.io.IOException: (null) entry in command string: null ls -F C:\tmp\hive
Please try the path C://Users//sitaram//Downloads//creditcardfraud//creditcard.csv instead.
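Spelled out, the suggested change would look something like this; header and inferSchema are extra options I'm assuming are wanted for this dataset, not part of the original answer:

df = spark.read.csv(
    'C://Users//sitaram//Downloads//creditcardfraud//creditcard.csv',
    header=True,       # assuming the credit card CSV has a header row
    inferSchema=True   # assuming numeric columns should be typed automatically
)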
I'm running Jupyter (v4.2.1) with Apache Toree - PySpark. When I try to invoke plotly's init_notebook_mode function, I run into the following error:
import numpy as np
import pandas as pd
import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode()
Error:
Name: org.apache.toree.interpreter.broker.BrokerException
Message: Traceback (most recent call last):
File "/tmp/kernel-PySpark-6415c581-01c4-4c90-b4d9-81773c2bc03f/pyspark_runner.py", line 134, in <module>
eval(compiled_code)
File "<string>", line 7, in <module>
File "/usr/local/lib/python3.4/dist-packages/plotly/offline/offline.py", line 151, in init_notebook_mode
display(HTML(script_inject))
File "/usr/local/lib/python3.4/dist-packages/IPython/core/display.py", line 158, in display
format = InteractiveShell.instance().display_formatter.format
File "/usr/local/lib/python3.4/dist-packages/traitlets/config/configurable.py", line 412, in instance
inst = cls(*args, **kwargs)
File "/usr/local/lib/python3.4/dist-packages/IPython/core/interactiveshell.py", line 499, in __init__
self.init_io()
File "/usr/local/lib/python3.4/dist-packages/IPython/core/interactiveshell.py", line 658, in init_io
io.stdout = io.IOStream(sys.stdout)
File "/usr/local/lib/python3.4/dist-packages/IPython/utils/io.py", line 34, in __init__
raise ValueError("fallback required, but not specified")
ValueError: fallback required, but not specified
StackTrace: org.apache.toree.interpreter.broker.BrokerState$$anonfun$markFailure$1.apply(BrokerState.scala:140)
org.apache.toree.interpreter.broker.BrokerState$$anonfun$markFailure$1.apply(BrokerState.scala:140)
scala.Option.foreach(Option.scala:236)
org.apache.toree.interpreter.broker.BrokerState.markFailure(BrokerState.scala:139)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
py4j.Gateway.invoke(Gateway.java:259)
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
py4j.commands.CallCommand.execute(CallCommand.java:79)
py4j.GatewayConnection.run(GatewayConnection.java:209)
java.lang.Thread.run(Thread.java:745)
I'm unable to find any information about this on the web. When I dug into the code where this fails (io.py in IPython utils), I saw that the stream being passed must have both attributes: write as well as flush. But for some reason the stream passed in this case, sys.stdout, has only the "write" attribute and not the "flush" attribute.
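A quick illustration of the duck-typing check that fails (run inside the Toree PySpark interpreter to reproduce what's described above; this is just a probe, not part of plotly or Toree):

import sys

# IPython's io.IOStream wants the wrapped stream to provide both write() and
# flush(); under Toree's PySpark, sys.stdout reportedly lacks flush().
print(hasattr(sys.stdout, 'write'), hasattr(sys.stdout, 'flush'))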
I believe this happens because plotly's notebook mode assumes that it is running inside an IPython Jupyter kernel that handles the notebook communication; you can see in the stack trace that it's trying to call into IPython packages.
Toree, however, is a different jupyter kernel and has its own protocol handling for communicating with the notebook server. Even when you use toree to run a PySpark interpreter, you get a "plain" PySpark (just like when you start it from a shell) and toree drives the input/output of that interpreter.
So the IPython machinery is not set up, and calling init_notebook_mode() in that environment will fail, just as it would if you ran it in a PySpark session started directly from the shell, which knows nothing about notebooks.
To my knowledge, there is currently no way to get plotting output from a PySpark session run via toree -- we recently faced the same problem. Instead of running python via toree, you need to run an IPython kernel, import the PySpark libs there and connect to your Spark cluster. See https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook for a dockerized example to do that.
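A rough sketch of that setup (the master URL and app name are placeholders, and it assumes a Spark 2.x-style SparkSession plus plotly installed in the IPython kernel's environment):

from pyspark.sql import SparkSession
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go

# Connect to the cluster from a regular IPython kernel instead of Toree.
spark = SparkSession.builder \
    .master('spark://your-spark-master:7077') \
    .appName('plotly-from-ipython') \
    .getOrCreate()

init_notebook_mode()  # works here because IPython's display machinery is present

# Toy round trip: aggregate in Spark, collect to pandas, plot with plotly.
df = spark.range(100)
counts = df.groupBy((df['id'] % 10).alias('bucket')).count().toPandas()
iplot([go.Bar(x=counts['bucket'], y=counts['count'])])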
I'm new to Spark and trying to figure out how the pipe method works. I have the following code in Scala
sc.textFile(hdfsLocation).pipe("preprocess.py").saveAsTextFile(hdfsPreprocessedLocation)
The values hdfsLocation and hdfsPreprocessedLocation are fine. As proof, the following code works from the command line
hadoop fs -cat hdfsLocation/* | ./preprocess.py | head
When I run the above Spark code I get the following errors
14/11/25 09:41:50 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.IOException: Cannot run program "preprocess.py": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
at org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:119)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:135)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1022)
... 12 more
To solve this for Hadoop streaming I would just use the --files option, so I tried the same thing for Spark. I start Spark with the following command:
bin/spark-shell --files ./preprocess.py
but that gave the same error.
I couldn't find a good example of using Spark with an external process via pipe, so I'm not sure if I'm doing this correctly. Any help would be greatly appreciated.
Thanks
I'm not sure if this is the correct answer, so I won't mark it as final, but it appears that the file paths differ between running Spark in local mode and in cluster mode. When running Spark without --master, the paths passed to the pipe command are relative to the local machine. When running Spark with --master, the paths passed to the pipe command are ./
UPDATE:
This actually isn't correct. I was using SparkFiles.get() to get the file name. It turns out that when calling .pipe() on an RDD, the command string is evaluated on the driver and then passed to the workers. Because of this, SparkFiles.get() is not the appropriate way to get the file name. The file name should be ./ because SparkContext.addFile() should put the file at ./ relative to where each worker runs. But I'm so sour on .pipe now that I've taken it out of my code entirely in favor of .mapPartitions in combination with a PipeUtils object that I wrote here. This is actually more efficient because I only incur the script startup cost once per partition instead of once per example.
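The PipeUtils helper itself isn't shown here (and it's Scala), but the idea translates to something like this hypothetical PySpark sketch, where pipe_partition and the script handling are my own illustration rather than the original code:

import subprocess
from pyspark import SparkFiles

def pipe_partition(records):
    # Launch the external script once per partition and stream all of the
    # partition's records through its stdin, one record per line.
    # Inside mapPartitions this runs on the worker, so SparkFiles.get() is
    # fine here (unlike in the driver-evaluated .pipe() command string).
    script = SparkFiles.get("preprocess.py")
    proc = subprocess.Popen(["python", script],  # invoke via python in case the exec bit wasn't preserved
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                            universal_newlines=True)
    out, _ = proc.communicate("\n".join(records) + "\n")  # buffers the partition in memory
    return out.splitlines()

# sc.addFile("preprocess.py")                                   # ship the script to every worker
# processed = rdd.mapPartitions(lambda part: pipe_partition(list(part)))
# processed.saveAsTextFile(hdfsPreprocessedLocation)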