Add a py file to a Scala Spark job

I am trying to execute a Python script from a Scala Spark job (Spark 2.3) like below:
val pyScript = "wasb://scripts@myAccount.blob.core.windows.net/print.py"
val pyScriptName = "print.py"
spark.sparkContext.addFile(pyScript)
val ipData = sc.parallelize(List("abc","def","ghi"))
val opData = ipData.pipe(org.apache.spark.SparkFiles.get(pyScriptName))
opData.foreach(println)
However, I get the exception below. Any ideas what could be wrong?
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 2.0 failed 4 times, most recent failure: Lost task 2.3 in stage 2.0 (TID 141, wn4-novakv.oetevw42cdoe3jls1dzdeclktg.ex.internal.cloudapp.net, executor 2): java.io.IOException: Cannot run program "/mnt/resource/hadoop/yarn/local/usercache/livy/appcache/application_1539905356890_0003/spark-5da59a08-a6a4-443d-b3b1-c31643e195c5/userFiles-e3ca8a1b-44f6-4804-9c95-625b1742fb77/print.py": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:111)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: error=2, No such file or directory

Related

Why does pyspark throw 'Cannot run program "python3"'?

I am in my Anaconda environment and started the cmd shell. I typed pyspark and it loaded the interactive Spark shell.
Then I tried the following command:
l = [('Alice', 1)]
spark.createDataFrame(l).collect()
And got the following error:
22/02/14 19:16:03 ERROR Executor: Exception in task 5.0 in stage 2.0 (TID 21)
java.io.IOException: Cannot run program "python3": CreateProcess error=2, Das System kann die angegebene Datei nicht finden
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:166)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:108)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:121)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:162)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: CreateProcess error=2, Das System kann die angegebene Datei nicht finden
at java.lang.ProcessImpl.create(Native Method)
at java.lang.ProcessImpl.<init>(ProcessImpl.java:453)
at java.lang.ProcessImpl.start(ProcessImpl.java:140)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 30 more
I think something is wrong with my environment variables, but I don't know what.
Try running the following before creating the SparkSession/SparkContext:
import os
import sys
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
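For context, a minimal sketch of where those lines would go in a standalone script, assuming a plain local PySpark session (the app name is just a placeholder):
import os
import sys

# Point both the driver and the executors at the interpreter running this
# script, so Spark can spawn Python workers without looking up "python3".
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("env-check").getOrCreate()
print(spark.createDataFrame([('Alice', 1)]).collect())
The key point is that the environment variables are set before the SparkSession (and hence the SparkContext) is created; setting them after the context exists generally has no effect on the workers.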

Pytesseract with PySpark throws error: pytesseract module not found

I am trying to write OCR code using Spark and pytesseract, and I am running into a "pytesseract module not found" error even though the pytesseract module is installed.
import pytesseract
from PIL import Image

path = '/XXXX/JupyterLab/notebooks/testdir'
rdd = sc.binaryFiles(path)
rdd.keys().collect()
--> ['file:XXX/JupyterLab/notebooks/testdir/copy.png']

input = rdd.keys().map(lambda s: s.replace("file:", ""))

def read(x):
    import pytesseract
    image = Image.open(x)
    text = pytesseract.image_to_string(image)
    return text

newRdd = input.map(lambda x: read(x))
newRdd.collect()
"On newRdd.collect() I get following error"
ModuleNotFoundError: No module named 'pytesseract'
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$16.apply(RDD.scala:960)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$16.apply(RDD.scala:960)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2111)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2111)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:420)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I am not sure how I can pass the rdd.keys() values, which hold the image paths, to pytesseract.image_to_string() via Image.open().
Thank you.
My error was resolved by adding
sc.addPyFile('/pathto........../pytesseract/pytesseract.py')
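As a variation on that fix, a rough sketch of shipping the whole package instead of a single file (the site-packages path is a placeholder; adjust it to where pytesseract is actually installed on the driver):
import shutil

# Hypothetical location of the installed package on the driver.
pkg_parent = '/usr/local/lib/python3.7/site-packages'

# Zip the pytesseract package and distribute it to every executor;
# addPyFile also puts the archive on the executors' sys.path.
zip_path = shutil.make_archive('/tmp/pytesseract', 'zip',
                               root_dir=pkg_parent, base_dir='pytesseract')
sc.addPyFile(zip_path)
Note that this only makes the Python module importable on the workers; pytesseract still needs Pillow and the tesseract binary installed on each worker node to actually run OCR.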

PySpark with Cython fails with "file.so: file too short"

I am following the documentation here
https://docs.databricks.com/user-guide/faq/cython.html
I can get this short sample code to work, but when incorporating it into my longer code, I get this:
File "./jobs.zip/jobs/util.py", line 51, in wrapped
cython_function_ = getattr(__import__(module), method)
File "/usr/local/lib64/python2.7/site-packages/pyximport/pyximport.py", line 458, in load_module
language_level=self.language_level)
File "/usr/local/lib64/python2.7/site-packages/pyximport/pyximport.py", line 233, in load_module
exec("raise exc, None, tb", {'exc': exc, 'tb': tb})
File "/usr/local/lib64/python2.7/site-packages/pyximport/pyximport.py", line 216, in load_module
mod = imp.load_dynamic(name, so_path)
ImportError: Building module cython_util failed: ['ImportError: /home/.pyxbld/lib.linux-x86_64-2.7/cython_util.so: file too short\n']
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Yet, after this error the Spark program seems to keep running. Does anyone know what this complaint is about? How can the .so file be too short? Can I ignore it and move on, since the program continues to run?

Spark Pipe example

I'm new to Spark and trying to figure out how the pipe method works. I have the following code in Scala
sc.textFile(hdfsLocation).pipe("preprocess.py").saveAsTextFile(hdfsPreprocessedLocation)
The values hdfsLocation and hdfsPreprocessedLocation are fine. As proof, the following code works from the command line
hadoop fs -cat hdfsLocation/* | ./preprocess.py | head
When I run the above Spark code I get the following errors
14/11/25 09:41:50 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.IOException: Cannot run program "preprocess.py": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
at org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:119)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:135)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1022)
... 12 more
In order to solve this for Hadoop streaming I would just use the --files attribute, so I tried the same thing for Spark. I start Spark with the following command
bin/spark-shell --files ./preprocess.py
but that gave the same error.
I couldn't find a good example of using Spark with an external process via pipe, so I'm not sure if I'm doing this correctly. Any help would be greatly appreciated.
Thanks
I'm not sure if this is the correct answer, so I won't finalize it, but it appears that the file paths are different when running Spark in local versus cluster mode. When running Spark without --master, the paths passed to the pipe command are relative to the local machine. When running Spark with --master, the paths passed to the pipe command are relative to ./
UPDATE:
This actually isn't correct. I was using SparkFiles.get() to get the file name. It turns out that when calling .pipe() on an RDD the command string is evaluated on the driver and then passed to the workers. Because of this, SparkFiles.get() is not the appropriate way to get the file name. The file name should be relative to ./ because SparkContext.addFile() should put the file at ./ relative to where each worker runs. But I'm so sour on .pipe now that I've taken it out of my code entirely, in favor of .mapPartitions combined with a PipeUtils object that I wrote here. This is actually more efficient because I only incur the script startup cost once per partition instead of once per example.
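To make the per-partition idea concrete, here is a rough PySpark sketch of the same approach (this is not the author's PipeUtils; the script name is a placeholder, and it assumes the script was shipped to the workers with --files or sc.addFile):
import subprocess

def pipe_partition(records):
    # Launch the external script once for the whole partition, feed it all
    # records on stdin, and yield whatever it writes to stdout line by line.
    proc = subprocess.run(
        ["./preprocess.py"],                      # hypothetical script name
        input="\n".join(records).encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,
    )
    for line in proc.stdout.decode("utf-8").splitlines():
        yield line

# processed = sc.textFile(hdfsLocation).mapPartitions(pipe_partition)
This pays the script startup cost once per partition rather than once per record, which is the trade-off described above; a more careful version would stream records to the subprocess instead of buffering the whole partition in memory.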

java.lang.ClassNotFoundException: Class scala.runtime.Nothing when running the scoobi WordCount example

I'm trying to run the word count example from the Quick start page
import com.nicta.scoobi.Scoobi._
import Reduction._

object WordCount extends ScoobiApp {
  def run() {
    val lines = fromTextFile(args(0))
    val counts = lines.mapFlatten(_.split(" "))
      .map(word => (word, 1))
      .groupByKey
      .combine(Sum.int)
    counts.toTextFile(args(1)).persist
  }
}
It works fine when I use in-memory mode, but when I try local mode (or cluster mode) it fails with these errors:
[WARN] LocalJobRunner - job_local_0001 <java.lang.RuntimeException: java.lang.ClassNotFoundException: Class scala.runtime.Nothing$ not found>java.lang.RuntimeException: java.lang.ClassNotFoundException: Class scala.runtime.Nothing$ not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1439)
at com.nicta.scoobi.impl.mapreducer.ChannelOutputFormat.com$nicta$scoobi$impl$mapreducer$ChannelOutputFormat$$mkTaskContext$1(ChannelOutputFormat.scala:63)
at com.nicta.scoobi.impl.mapreducer.ChannelOutputFormat$$anonfun$getContext$1.apply(ChannelOutputFormat.scala:75)
at com.nicta.scoobi.impl.mapreducer.ChannelOutputFormat$$anonfun$getContext$1.apply(ChannelOutputFormat.scala:75)
at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189)
at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91)
at com.nicta.scoobi.impl.mapreducer.ChannelOutputFormat.getContext(ChannelOutputFormat.scala:75)
at com.nicta.scoobi.impl.mapreducer.ChannelOutputFormat.write(ChannelOutputFormat.scala:43)
at com.nicta.scoobi.impl.plan.mscr.MscrOutputChannel$$anon$5$$anonfun$write$1.apply(OutputChannel.scala:137)
at com.nicta.scoobi.impl.plan.mscr.MscrOutputChannel$$anon$5$$anonfun$write$1.apply(OutputChannel.scala:135)
at scala.collection.immutable.List.foreach(List.scala:318)
at com.nicta.scoobi.impl.plan.mscr.MscrOutputChannel$$anon$5.write(OutputChannel.scala:135)
at com.nicta.scoobi.impl.plan.mscr.GbkOutputChannel$$anonfun$reduce$1.apply$mcV$sp(OutputChannel.scala:201)
at com.nicta.scoobi.impl.plan.mscr.GbkOutputChannel$$anonfun$reduce$1.apply(OutputChannel.scala:201)
at com.nicta.scoobi.impl.plan.mscr.GbkOutputChannel$$anonfun$reduce$1.apply(OutputChannel.scala:201)
at scala.Option.getOrElse(Option.scala:120)
at com.nicta.scoobi.impl.plan.mscr.GbkOutputChannel.reduce(OutputChannel.scala:200)
at com.nicta.scoobi.impl.mapreducer.MscrReducer$$anonfun$reduce$1.apply(MscrReducer.scala:55)
at com.nicta.scoobi.impl.mapreducer.MscrReducer$$anonfun$reduce$1.apply(MscrReducer.scala:52)
at scala.Option.foreach(Option.scala:236)
at com.nicta.scoobi.impl.mapreducer.MscrReducer.reduce(MscrReducer.scala:52)
at com.nicta.scoobi.impl.mapreducer.MscrReducer.reduce(MscrReducer.scala:33)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:164)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:572)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:414)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:309)
Caused by: java.lang.ClassNotFoundException: Class scala.runtime.Nothing$ not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1350)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1437)
... 25 more
[INFO] TrackerDistributedCacheManager - Deleted path /tmp/scoobi-root/WordCount$-1124-035523--1298047809/step1 of 1/archive/3757970833182018747_-1642337927_156373685/file/tmp/scoobi-root/WordCount$-1124-035523--1298047809/dist-objs/scoobi.combiners-step1 of 1
[INFO] TrackerDistributedCacheManager - Deleted path /tmp/scoobi-root/WordCount$-1124-035523--1298047809/step1 of 1/archive/1307074498433974065_910223079_156373685/file/tmp/scoobi-root/WordCount$-1124-035523--1298047809/dist-objs/scoobi.mappers-step1 of 1
[INFO] TrackerDistributedCacheManager - Deleted path /tmp/scoobi-root/WordCount$-1124-035523--1298047809/step1 of 1/archive/-624792843022440048_-470268278_156372685/file/tmp/scoobi-root/WordCount$-1124-035523--1298047809/dist-objs/scoobi.metadata.TG23
[INFO] TrackerDistributedCacheManager - Deleted path /tmp/scoobi-root/WordCount$-1124-035523--1298047809/step1 of 1/archive/-7527273518266336656_-470264434_156372685/file/tmp/scoobi-root/WordCount$-1124-035523--1298047809/dist-objs/scoobi.metadata.TK23
[INFO] TrackerDistributedCacheManager - Deleted path /tmp/scoobi-root/WordCount$-1124-035523--1298047809/step1 of 1/archive/-7162952586058180219_-470259629_156372685/file/tmp/scoobi-root/WordCount$-1124-035523--1298047809/dist-objs/scoobi.metadata.TP23
[INFO] TrackerDistributedCacheManager - Deleted path /tmp/scoobi-root/WordCount$-1124-035523--1298047809/step1 of 1/archive/-1228551315878554095_-470253863_156372685/file/tmp/scoobi-root/WordCount$-1124-035523--1298047809/dist-objs/scoobi.metadata.TV23
[INFO] TrackerDistributedCacheManager - Deleted path /tmp/scoobi-root/WordCount$-1124-035523--1298047809/step1 of 1/archive/6598684265640022340_1943382592_156373685/file/tmp/scoobi-root/WordCount$-1124-035523--1298047809/dist-objs/scoobi.reducers-step1 of 1
[INFO] TrackerDistributedCacheManager - Deleted path /tmp/scoobi-root/WordCount$-1124-035523--1298047809/step1 of 1/archive/1699308645513763631_1905624154_156371685/file/tmp/scoobi-root/WordCount$-1124-035523--1298047809/env/a88809af-334b-499e-bafc-1a2ebeffdfbd
[INFO] MapReduceJob - Map 100% Reduce 0%
[error] (run-main) com.nicta.scoobi.impl.exec.JobExecException: MapReduce job 'job_local_0001' failed! Please see http://localhost:8080/ for more info.
com.nicta.scoobi.impl.exec.JobExecException: MapReduce job 'job_local_0001' failed! Please see http://localhost:8080/ for more info.
at com.nicta.scoobi.impl.exec.MapReduceJob.report(MapReduceJob.scala:80)
at com.nicta.scoobi.impl.exec.HadoopMode$Execution$$anonfun$reportMscr$1.apply(HadoopMode.scala:157)
at com.nicta.scoobi.impl.exec.HadoopMode$Execution$$anonfun$reportMscr$1.apply(HadoopMode.scala:154)
at scala.Function2$$anonfun$tupled$1.apply(Function2.scala:54)
at scala.Function2$$anonfun$tupled$1.apply(Function2.scala:53)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at com.nicta.scoobi.impl.exec.HadoopMode$Execution.runMscrs(HadoopMode.scala:133)
at com.nicta.scoobi.impl.exec.HadoopMode$Execution.execute(HadoopMode.scala:115)
at com.nicta.scoobi.impl.exec.HadoopMode$$anonfun$executeLayer$1.apply(HadoopMode.scala:105)
at com.nicta.scoobi.impl.exec.HadoopMode$$anonfun$executeLayer$1.apply(HadoopMode.scala:104)
at org.kiama.attribution.AttributionCore$CachedAttribute.apply(AttributionCore.scala:61)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at com.nicta.scoobi.impl.exec.HadoopMode.com$nicta$scoobi$impl$exec$HadoopMode$$executeLayers$1(HadoopMode.scala:68)
at com.nicta.scoobi.impl.exec.HadoopMode$$anonfun$executeNode$1.apply(HadoopMode.scala:91)
at com.nicta.scoobi.impl.exec.HadoopMode$$anonfun$executeNode$1.apply(HadoopMode.scala:84)
at org.kiama.attribution.AttributionCore$CachedAttribute.apply(AttributionCore.scala:61)
at scalaz.syntax.IdOps$class.$bar$greater(IdOps.scala:15)
at scalaz.syntax.ToIdOps$$anon$1.$bar$greater(IdOps.scala:78)
at com.nicta.scoobi.impl.exec.HadoopMode.execute(HadoopMode.scala:52)
at com.nicta.scoobi.impl.exec.HadoopMode.execute(HadoopMode.scala:48)
at com.nicta.scoobi.impl.Persister.persist(Persister.scala:44)
at com.nicta.scoobi.impl.ScoobiConfigurationImpl.persist(ScoobiConfigurationImpl.scala:355)
at com.nicta.scoobi.application.Persist$class.persist(Persist.scala:33)
at p.WordCount$.persist(scoobi-test.scala:6)
at com.nicta.scoobi.application.Persist$PersistableList.persist(Persist.scala:151)
at p.WordCount$.run(scoobi-test.scala:14)
at com.nicta.scoobi.application.ScoobiApp$$anonfun$main$1.apply$mcV$sp(ScoobiApp.scala:80)
at com.nicta.scoobi.application.ScoobiApp$$anonfun$main$1.apply(ScoobiApp.scala:75)
at com.nicta.scoobi.application.ScoobiApp$$anonfun$main$1.apply(ScoobiApp.scala:75)
at com.nicta.scoobi.application.LocalHadoop$class.runOnLocal(LocalHadoop.scala:41)
at p.WordCount$.runOnLocal(scoobi-test.scala:6)
at com.nicta.scoobi.application.LocalHadoop$class.executeOnLocal(LocalHadoop.scala:35)
at p.WordCount$.executeOnLocal(scoobi-test.scala:6)
at com.nicta.scoobi.application.LocalHadoop$$anonfun$onLocal$1.apply(LocalHadoop.scala:29)
at com.nicta.scoobi.application.InMemoryHadoop$class.withTimer(InMemory.scala:71)
at p.WordCount$.withTimer(scoobi-test.scala:6)
at com.nicta.scoobi.application.InMemoryHadoop$class.showTime(InMemory.scala:79)
at p.WordCount$.showTime(scoobi-test.scala:6)
at com.nicta.scoobi.application.LocalHadoop$class.onLocal(LocalHadoop.scala:29)
at p.WordCount$.onLocal(scoobi-test.scala:6)
at com.nicta.scoobi.application.Hadoop$class.onHadoop(Hadoop.scala:60)
at p.WordCount$.onHadoop(scoobi-test.scala:6)
at com.nicta.scoobi.application.ScoobiApp$class.main(ScoobiApp.scala:75)
at p.WordCount$.main(scoobi-test.scala:6)
at p.WordCount.main(scoobi-test.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
[trace] Stack trace suppressed: run last compile:runMain for the full output.
[INFO] Task - Communication exception: java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:648)
at java.lang.Thread.run(Thread.java:679)
java.lang.RuntimeException: Nonzero exit code: 1
at scala.sys.package$.error(package.scala:27)
[trace] Stack trace suppressed: run last compile:runMain for the full output.
[error] (compile:runMain) Nonzero exit code: 1
[error] Total time: 9 s, completed Nov 24, 2013 3:55:30 AM
Running the very similar example from GitHub (https://github.com/NICTA/scoobi/tree/SCOOBI-0.7.3/examples/wordCount) does work.
Any ideas?
EDIT
I ran the sample according to the explanations in the Scoobi quick start.
Running the sample is done using the following sbt commands:
sbt compile
sbt "run-main mypackage.myapp.WordCount input-files output"
There is no reference to how or where to supply parameters such as the location of external JARs.
The installed Scala version and the one specified in build.sbt must be the same. Once they match, scala-library is automatically included in the classpath.
We've faced the same issue recently and solved it by adding fork := true to the sbt settings.