I am in my Anaconda environment and started the cmd shell. I typed in pyspark and it loaded the interactive PySpark shell.
Then I tried the following command:
l = [('Alice', 1)]
spark.createDataFrame(l).collect()
And got the following error (the German message below translates to "The system cannot find the file specified"):
22/02/14 19:16:03 ERROR Executor: Exception in task 5.0 in stage 2.0 (TID 21)
java.io.IOException: Cannot run program "python3": CreateProcess error=2, Das System kann die angegebene Datei nicht finden
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:166)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:108)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:121)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:162)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: CreateProcess error=2, Das System kann die angegebene Datei nicht finden
at java.lang.ProcessImpl.create(Native Method)
at java.lang.ProcessImpl.<init>(ProcessImpl.java:453)
at java.lang.ProcessImpl.start(ProcessImpl.java:140)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 30 more
I think something is wrong with my environment variables, but I don't know what.
Try running this before creating the SparkSession/SparkContext:
import os
import sys
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
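Setting these makes the executors launch their Python workers with the same interpreter that runs the driver, instead of looking for a python3 binary that is not on the Windows PATH. A minimal end-to-end sketch with the fix applied (local mode assumed):
import os
import sys

# point both the driver and the executor workers at the current interpreter
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()
print(spark.createDataFrame([('Alice', 1)]).collect())  # [Row(_1='Alice', _2=1)]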
I am trying to write OCR code using Spark and pytesseract, and I am running into a pytesseract module not found error even though the pytesseract module is installed.
import pytesseract
from PIL import Image
path='/XXXX/JupyterLab/notebooks/testdir'
rdd = sc.binaryFiles(path)
rdd.keys().collect()
-->['file:XXX/JupyterLab/notebooks/testdir/copy.png']
input=rdd.keys().map(lambda s: s.replace("file:",""))
def read(x):
    import pytesseract
    image = Image.open(x)
    text = pytesseract.image_to_string(image)
    return text
newRdd = input.map(lambda x: read(x))
newRdd.collect()
"On newRdd.collect() I get following error"
ModuleNotFoundError: No module named 'pytesseract'
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$16.apply(RDD.scala:960)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$16.apply(RDD.scala:960)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2111)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2111)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:420)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I am not sure how I can pass the rdd.keys() value, which holds the image path, to pytesseract.image_to_string() using Image.open().
Thank you.
My error was resolved by adding
sc.addPyFile('/pathto........../pytesseract/pytesseract.py')
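For what it is worth, addPyFile ships the file to every executor and adds it to the workers' PYTHONPATH, which is why the import inside read() then succeeds. The call just has to happen before the action runs (a sketch reusing the elided path from above):
sc.addPyFile('/pathto........../pytesseract/pytesseract.py')  # distribute before any action
newRdd = input.map(read)  # executors can now resolve "import pytesseract"
newRdd.collect()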
I am trying to execute a Python script from a Scala Spark job (Spark 2.3), like below:
val pyScript = "wasb://scripts#myAccount.blob.core.windows.net/print.py"
val pyScriptName = "print.py"
spark.sparkContext.addFile(pyScript)
val ipData = sc.parallelize(List("abc","def","ghi"))
val opData = ipData.pipe(org.apache.spark.SparkFiles.get(pyScriptName))
opData.foreach(println)
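For context, pipe() starts the script as a subprocess on each executor, writes every partition element to its stdin as one line, and reads its stdout lines back as the elements of the result RDD. So a pipe-compatible script just has to emit something per input line; a hypothetical print.py in that shape (the real script is not shown here):
#!/usr/bin/env python
import sys

# one output line per input element keeps the element counts aligned
for line in sys.stdin:
    sys.stdout.write(line.upper())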
However, I get the exception below. Any ideas what could be wrong?
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 2.0 failed 4 times, most recent failure: Lost task 2.3 in stage 2.0 (TID 141, wn4-novakv.oetevw42cdoe3jls1dzdeclktg.ex.internal.cloudapp.net, executor 2): java.io.IOException: Cannot run program "/mnt/resource/hadoop/yarn/local/usercache/livy/appcache/application_1539905356890_0003/spark-5da59a08-a6a4-443d-b3b1-c31643e195c5/userFiles-e3ca8a1b-44f6-4804-9c95-625b1742fb77/print.py": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:111)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: error=2, No such file or directory
I am trying to use IcedTea / javaws to authenticate as a user at the USPTO (patent/trademark office). I download the JNLP file and try to use it with javaws, but I hit a 'name too long' error. Does anyone know how to work around this?
jeremy#jeremy-Blade ~/Downloads $ javaws uspto-auth.authenticate.jnlp
java.io.FileNotFoundException: /home/jeremy/.cache/icedtea-web/cache/0/https/efs.uspto.gov/TruePassWebStart/uspto-auth.authenticate.jnlp.q_SlNFU1NJT05JRD1GN0Q5ODBCNzhCODI3QTQwMkZEOEI0ODU4M0M5OTYzMi5wcm9kX3RwdG9tY2F0MjE1X2p2bTsgbXl1c3B0b19zaWduaW49aW5pdGlhdGVkOyBmc3Iucj0lN0IlMjJkJTIyJTNBOTAlMkMlMjJpJTIyJTNBJTIyZDU1NDUwNS01MTYyNjUxMy04NTMwLTJkNmItYTJmMmUlMjIlMkMlMjJlJTIyJTNBMTUxNTk5ODY2NDIxNCU3RDsgX191dG1hPTIxNzEyMjUwNS4yMDQ3MDM2NjA0LjE1MTUzOTM2MjUuMTUxNTU3NjYzOS4xNTE2ODAwMTU2LjI7IF9fdXRtej0yMTcxMjI1MDUuMTUxNTU3NjYzOS4xLjEudXRtY3NyPShkaXJlY3QpfHV0bWNjbj0oZGlyZWN0KXx1dG1jbWQ9KG5vbmUpOyBmc3Iucz0lN0IlMjJ2MSUyMiUzQS0xJTJDJTIydjIlMjIlM0EtMiUyQyUyMnJpZCUyMiUzQSUyMmQwNDgwMTItNTcwODgwODUtOTVhYy1lMmE4LTE0Njg3JTIyJTJDJTIycnUlMjIlM0ElMjJodHRwcyUzQSUyRiUyRnd3dy51c3B0by5nb3YlMkYlMjIlMkMlMjJyJTIyJTNBJTIyd3d3LnVzcHRvLmdvdiUyMiUyQyUyMnN0JTIyJTNBJTIyJTIyJTJDJTIyY3AlMjIlM0ElN0IlMjJQYXRlbnRzJTIyJTNBJTIyWSUyMiUyQyUyMlRyYWRlbWFya3MlMjIlM0ElMjJOJTIyJTJDJTIySVBfTGF3X1BvbGljeV9JbmZvJTIyJTNBJTIyTiUyMiUyQyUyMlZlbmRvcl9JbmZvJTIyJTNBJTIyTiUyMiUyQyUyMkF2YWlsX0VsZWN0cm9uaWNfQml6X1N5cyUyMiUzQSUyMk4lMjIlMkMlMjJJbnRsX0FjdGl2aXRpZXMlMjIlM0ElMjJOJTIyJTJDJTIyVGVzdGltb255X1NwZWVjaGVzJTIyJTNBJTIyTiUyMiU3RCUyQyUyMnRvJTIyJTNBMy4yJTJDJTIyYyUyMiUzQSUyMmh0dHBzJTNBJTJGJTJGd3d3LnVzcHRvLmdvdiUyRnBhdGVudHMtYXBwbGljYXRpb24tcHJvY2VzcyUyRmFwcGx5aW5nLW9ubGluZSUyRmFib3V0LWVmcy13ZWIlMjIlMkMlMjJwdiUyMiUzQTIlMkMlMjJsYyUyMiUzQSU3QiUyMmQxJTIyJTNBJTdCJTIydiUyMiUzQTIlMkMlMjJzJTIyJTNBdHJ1ZSU3RCU3RCUyQyUyMmNkJTIyJTNBMSUyQyUyMmYlMjIlM0ExNTE3NTg2MTE1MzQ2JTJDJTIyc2QlMjIlM0ExJTdEOyBfZ2E9R0ExLjIuMjA0NzAzNjYwNC4xNTE1MzkzNjI1OyBfZ2lkPUdBMS4yLjE2MDM3NTE1MzkuMTUxNzU4NjExODsgRW50cnVzdFRydWVQYXNzUmVkaXJlY3RVcmw9Imh0dHBzOi8vZWZzLnVzcHRvLmdvdi9FRlNXZWJVSVJlZ2lzdGVyZWQvRUZTV2ViUmVnaXN0ZXJlZCI_.info (File name too long)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
at net.sourceforge.jnlp.util.PropertiesFile.store(PropertiesFile.java:167)
at net.sourceforge.jnlp.cache.CacheEntry.store(CacheEntry.java:225)
at net.sourceforge.jnlp.cache.ResourceDownloader.initializeFromURL(ResourceDownloader.java:193)
at net.sourceforge.jnlp.cache.ResourceDownloader.initializeOnlineResource(ResourceDownloader.java:128)
at net.sourceforge.jnlp.cache.ResourceDownloader.initializeResource(ResourceDownloader.java:118)
at net.sourceforge.jnlp.cache.ResourceDownloader.run(ResourceDownloader.java:107)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I am following the documentation here
https://docs.databricks.com/user-guide/faq/cython.html
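The pattern in that guide is essentially a wrapper that installs pyximport on each executor before importing the compiled module; a condensed sketch of that approach (not the verbatim Databricks sample):
def spark_cython(module, method):
    def wrapped(*args, **kwargs):
        import pyximport
        pyximport.install()  # compiles the .pyx on first import, per executor
        cython_function_ = getattr(__import__(module), method)
        return cython_function_(*args, **kwargs)
    return wrapped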
I can get this short sample code to work, but when incorporating it into my longer code, I get this:
File "./jobs.zip/jobs/util.py", line 51, in wrapped
cython_function_ = getattr(__import__(module), method)
File "/usr/local/lib64/python2.7/site-packages/pyximport/pyximport.py", line 458, in load_module
language_level=self.language_level)
File "/usr/local/lib64/python2.7/site-packages/pyximport/pyximport.py", line 233, in load_module
exec("raise exc, None, tb", {'exc': exc, 'tb': tb})
File "/usr/local/lib64/python2.7/site-packages/pyximport/pyximport.py", line 216, in load_module
mod = imp.load_dynamic(name, so_path)
ImportError: Building module cython_util failed: ['ImportError: /home/.pyxbld/lib.linux-x86_64-2.7/cython_util.so: file too short\n']
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Yet after this error the Spark program seems to keep running. Does anyone know what this complaint is about? How can the .so file be too short? And can I ignore it and move on, since the program continues to run?
I am trying to integrate PowerShell with Jenkins. I am finding it hard to resolve this error, as my PowerShell script job is failing as shown below.
[own-machine-powershell-job] $ powershell.exe -NonInteractive -ExecutionPolicy ByPass "& 'C:\Windows\TEMP\hudson3566059468296731803.ps1'"
The system cannot find the file specified
FATAL: command execution failed
java.io.IOException: Cannot run program "powershell.exe" (in directory "C:\jenkins\workspace\own-machine-powershell-job"): CreateProcess error=2, The system cannot find the file specified
at java.lang.ProcessBuilder.start(Unknown Source)
at hudson.Proc$LocalProc.<init>(Proc.java:244)
at hudson.Proc$LocalProc.<init>(Proc.java:216)
at hudson.Launcher$LocalLauncher.launch(Launcher.java:816)
at hudson.Launcher$ProcStarter.start(Launcher.java:382)
at hudson.Launcher$RemoteLaunchCallable.call(Launcher.java:1149)
at hudson.Launcher$RemoteLaunchCallable.call(Launcher.java:1114)
at hudson.remoting.UserRequest.perform(UserRequest.java:120)
at hudson.remoting.UserRequest.perform(UserRequest.java:48)
at hudson.remoting.Request$2.run(Request.java:326)
at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at hudson.remoting.Engine$1$1.run(Engine.java:62)
at java.lang.Thread.run(Unknown Source)
at ......remote call to Powershell-Job(Native Method)
at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1416)
at hudson.remoting.UserResponse.retrieve(UserRequest.java:220)
at hudson.remoting.Channel.call(Channel.java:781)
at hudson.Launcher$RemoteLauncher.launch(Launcher.java:929)
at hudson.Launcher$ProcStarter.start(Launcher.java:382)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:97)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:66)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:785)
at hudson.model.Build$BuildExecution.build(Build.java:205)
at hudson.model.Build$BuildExecution.doRun(Build.java:162)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:537)
at hudson.model.Run.execute(Run.java:1741)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:98)
at hudson.model.Executor.run(Executor.java:408)
Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified
at java.lang.ProcessImpl.create(Native Method)
at java.lang.ProcessImpl.<init>(Unknown Source)
at java.lang.ProcessImpl.start(Unknown Source)
at java.lang.ProcessBuilder.start(Unknown Source)
at hudson.Proc$LocalProc.<init>(Proc.java:244)
at hudson.Proc$LocalProc.<init>(Proc.java:216)
at hudson.Launcher$LocalLauncher.launch(Launcher.java:816)
at hudson.Launcher$ProcStarter.start(Launcher.java:382)
at hudson.Launcher$RemoteLaunchCallable.call(Launcher.java:1149)
at hudson.Launcher$RemoteLaunchCallable.call(Launcher.java:1114)
at hudson.remoting.UserRequest.perform(UserRequest.java:120)
at hudson.remoting.UserRequest.perform(UserRequest.java:48)
at hudson.remoting.Request$2.run(Request.java:326)
at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at hudson.remoting.Engine$1$1.run(Engine.java:62)
at java.lang.Thread.run(Unknown Source)
Build step 'Windows PowerShell' marked build as failure
Finished: FAILURE
My recommendation is to check this locally before trying it in Jenkins.
If you tried to use the plugin or to execute an external file, re-check the file path:
The system cannot find the file specified FATAL: command execution failed java.io.IOException: Cannot run program "powershell.exe"
The powershell.exe path is wrong.
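A quick sanity check that the interpreter is actually resolvable on the machine the job runs on: run "where powershell.exe" in cmd, or the equivalent one-liner from Python (a sketch, nothing Jenkins-specific):
import shutil

# prints the resolved path, or None if powershell.exe is not on PATH
print(shutil.which("powershell.exe"))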
I got it resolved finally.
Basically, in my system environment variables the powershell.exe path was incorrect, which I corrected. Then I ran my Jenkins job, but still got the same error.
So I restarted my system, then ran my Jenkins PowerShell job successfully.
Thanks
Not sure if this will help anyone else, but I'm writing it down for posterity. Make sure your slave is running and connected. I saw this error when my Windows Jenkins slave had been disconnected, but I hadn't set "Restrict where this project can be run", so the job was running on my master (a Linux machine). Obviously it couldn't find powershell.exe there. :D