Scalding tutorial: com.twitter.scalding.InvalidSourceException: Data is missing from one or more paths - scala

With Hadoop 2.2 installed on a single node, I am trying to run the Scalding tutorial, part 1, with this command:
$ yarn jar target/scalding-tutorial-0.8.11.jar Tutorial0 --hdfs
https://github.com/Cascading/scalding-tutorial/
Before running the tutorial I have copied the required file hello.txt to HDFS:
$ hdfs dfs -ls /data
Found 2 items
drwxr-xr-x - hdfs hdfs 0 2014-02-04 16:35 /data/10gsort
-rw-r--r-- 3 hdfs hdfs 26 2014-07-03 15:07 /data/hello.txt
It looks like the tutorial cannot find the input file:
Exception in thread "main" com.twitter.scalding.InvalidSourceException:[TextLine(data/hello.txt)] Data is missing from one or more paths in: List(data/hello.txt)
at com.twitter.scalding.FileSource.validateTaps(FileSource.scala:102)
at com.twitter.scalding.Job$$anonfun$validateSources$1.apply(Job.scala:158)
at com.twitter.scalding.Job$$anonfun$validateSources$1.apply(Job.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1156)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at com.twitter.scalding.Job.validateSources(Job.scala:153)
at com.twitter.scalding.Job.buildFlow(Job.scala:91)
at com.twitter.scalding.Job.run(Job.scala:126)
at com.twitter.scalding.Tool.start$1(Tool.scala:109)
at com.twitter.scalding.Tool.run(Tool.scala:125)
at com.twitter.scalding.Tool.run(Tool.scala:72)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at JobRunner$.main(JobRunner.scala:27)
at JobRunner.main(JobRunner.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Any ideas how to make it work?

TextLine builds a Hadoop Path from the given path string and the configuration.
The Hadoop Path API states that "A path string is absolute if it begins with a slash."
The tutorial hard-codes the input as "data/hello.txt", which is a relative path, so the current working directory is prepended to form the absolute path. On HDFS that working directory is the user's home directory (typically /user/<username>), so the job looks for /user/<username>/data/hello.txt rather than the /data/hello.txt you copied the file to.
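A minimal sketch of the fix (assuming the fields-API style the tutorial uses; the class name Tutorial0Fixed and the output path are hypothetical): give TextLine an absolute path so resolution no longer depends on the working directory.
import com.twitter.scalding._

class Tutorial0Fixed(args: Args) extends Job(args) {
  // The leading slash makes the path absolute, so HDFS will not
  // prepend /user/<username> to it.
  TextLine("/data/hello.txt")
    .read
    .write(TextLine("/data/output0.txt"))
}
Alternatively, leave the tutorial code unchanged and copy the file to where the relative path resolves:
$ hdfs dfs -mkdir -p /user/$USER/data
$ hdfs dfs -put hello.txt /user/$USER/data/hello.txt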

Related

Spark Submit: Class Not Found Exception

I am trying to submit a job to Spark on my machine like so:
$ spark-submit --master local --class ai.affable.flint.Foo target/scala-2.11/flint.jar
However, this fails with the following error:
java.lang.ClassNotFoundException: ai.affable.flint.Foo
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
I have verified that the JAR file exists and has a class called Foo:
$ jar tvf ./target/scala-2.11/flint.jar | grep Foo
2003 Fri Dec 14 20:53:40 MYT 2018 ai/affable/flint/Foo.class
...
This baffles me because:
a) the JAR exists, b) the class exists in the JAR, and c) I have specified the fully qualified class name and double-checked for any path errors or misspellings.
Does anyone know what I am missing?
EDIT:
I got it to work by recreating the project in a fresh directory. I literally copy-pasted the code and repeated the steps.
I would still like to know what I can do in situations like this, short of recreating the project.
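One thing worth trying first in such situations (an assumption on my part, since the fresh checkout is what fixed it): the old target/ directory may have held a stale or corrupted build artifact, so forcing a clean rebuild can reproduce the fresh-directory fix in place:
$ sbt clean package
$ spark-submit --master local --class ai.affable.flint.Foo target/scala-2.11/flint.jar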

Scala FileNotFoundException

While doing I/O in Scala under Cygwin, I have copied data to this location:
/cygdrive/c/DataResearch/retail_db/order_items/part-00000
but when I try to access the file from the Scala prompt with the following command, I get this error:
val orderItems = Source
.fromFile("/cygdrive/c/DataResearch/retail_db/order_items/part-00000")
Error:
java.io.FileNotFoundException: \c\DataResearch\retail_db\order_items\part-00000
(The system cannot find the path specified)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(Unknown Source)
at java.io.FileInputStream.<init>(Unknown Source)
at scala.io.Source$.fromFile(Source.scala:91)
at scala.io.Source$.fromFile(Source.scala:76)
at scala.io.Source$.fromFile(Source.scala:54)
What can I try to resolve this?
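One thing worth checking (an assumption, based on the mangled \c\... path in the error): the JVM is a native Windows process and does not resolve Cygwin's /cygdrive mount points, so passing the equivalent Windows-style path may work. A sketch, where the drive-letter path mirrors the Cygwin location above:
import scala.io.Source

// C:/DataResearch/... is the Windows view of /cygdrive/c/DataResearch/...
val orderItems = Source.fromFile("C:/DataResearch/retail_db/order_items/part-00000")
orderItems.getLines().take(5).foreach(println)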

Cannot find NetLogo model through command line

I'm trying to get to grips with command line operations of NetLogo on a Windows 10 machine. I want to run the Fire.nlogo model provided.
I set the directory with cd C:\Program Files\NetLogo 6.0.2
Then I try to run a simple experiment called experiment1, which I've written beforehand in BehaviorSpace:
netlogo-headless --model Fire.nlogo --experiment experiment1
This gives me the following error:
Exception in thread "main" java.io.FileNotFoundException: C:\Program Files\NetLogo 6.0.2\Fire.nlogo (The system cannot find the file specified)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:91)
at scala.io.Source$.fromFile(Source.scala:76)
at scala.io.Source$.fromURI(Source.scala:121)
at org.nlogo.fileformat.AbstractNLogoFormat.$anonfun$sections$1(NLogoFormat.scala:37)
at scala.util.Try$.apply(Try.scala:209)
at org.nlogo.fileformat.AbstractNLogoFormat.sections(NLogoFormat.scala:36)
at org.nlogo.fileformat.AbstractNLogoFormat.sections$(NLogoFormat.scala:34)
at org.nlogo.fileformat.NLogoFormat.sections(NLogoFormat.scala:16)
at org.nlogo.api.ModelFormat.load(ModelFormat.scala:53)
at org.nlogo.api.ModelFormat.load$(ModelFormat.scala:51)
at org.nlogo.fileformat.NLogoFormat.load(NLogoFormat.scala:16)
at org.nlogo.api.FormatterPair.load(ModelLoader.scala:26)
at org.nlogo.api.ModelLoader.readModel(ModelLoader.scala:60)
at org.nlogo.api.ModelLoader.readModel$(ModelLoader.scala:57)
at org.nlogo.api.ConfigurableModelLoader.readModel(ModelLoader.scala:90)
at org.nlogo.headless.HeadlessWorkspace.open(HeadlessWorkspace.scala:491)
at org.nlogo.headless.Main$.newWorkspace$1(Main.scala:18)
at org.nlogo.headless.Main$.runExperiment(Main.scala:21)
at org.nlogo.headless.Main$.$anonfun$main$1(Main.scala:12)
at org.nlogo.headless.Main$.$anonfun$main$1$adapted(Main.scala:12)
at scala.Option.foreach(Option.scala:257)
at org.nlogo.headless.Main$.main(Main.scala:12)
at org.nlogo.headless.Main.main(Main.scala)
I notice that the output gives the path as C:\Program Files\NetLogo 6.0.2\Fire.nlogo, but the model is actually located at C:\Program Files\NetLogo 6.0.2\app\models\Sample Models\Earth Science\Fire.nlogo.
I seem to be following the tutorial as it's written here: https://ccl.northwestern.edu/netlogo/docs/behaviorspace.html
Any ideas where I'm going wrong here? Thanks.
A quick look suggests that you need to give the full file path to the --model argument. So the command would look like:
netlogo-headless --model "C:\Program Files\NetLogo 6.0.2\app\models\Sample Models\Earth Science\Fire.nlogo" --experiment experiment1
Since you have already set cd C:\Program Files\NetLogo 6.0.2, you can probably go with the relative path:
netlogo-headless --model "app\models\Sample Models\Earth Science\Fire.nlogo" --experiment experiment1
Alternatively, you can go to the directory that contains the model you want to run and instead provide the path (again with quotes) to the .bat file:
"c:\Program Files\NetLogo 6.0.2\netlogo-headless.bat" --model Fire.nlogo --experiment experiment1

"ERROR 6000, Output location validation failed" using PIG MongoDB-Hadoop Connector on EMR

I get an "output location validation failed" exception in my Pig script on EMR.
It fails when saving data back to S3.
I use this simple script to narrow down the problem:
REGISTER /home/hadoop/lib/mongo-java-driver-2.13.0.jar
REGISTER /home/hadoop/lib/mongo-hadoop-core-1.3.2.jar
REGISTER /home/hadoop/lib/mongo-hadoop-pig-1.3.2.jar
example = LOAD 's3://xxx/example-full.bson'
USING com.mongodb.hadoop.pig.BSONLoader();
STORE example INTO 's3n://xxx/out/example.bson' USING com.mongodb.hadoop.pig.BSONStorage();
This is the stack trace produced:
================================================================================
Pig Stack Trace
---------------
ERROR 6000:
<line 8, column 0> Output Location Validation Failed for: 's3://xxx/out/example.bson More info to follow:
Output directory not set.
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias example
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1637)
at org.apache.pig.PigServer.registerQuery(PigServer.java:577)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1091)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:543)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: org.apache.pig.impl.plan.VisitorException: ERROR 6000:
<line 8, column 0> Output Location Validation Failed for: 's3://xxx/out/example.bson More info to follow:
Output directory not set.
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:95)
at org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator.validate(InputOutputFileValidator.java:45)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:317)
at org.apache.pig.PigServer.compilePp(PigServer.java:1382)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1307)
at org.apache.pig.PigServer.execute(PigServer.java:1299)
at org.apache.pig.PigServer.access$400(PigServer.java:124)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1632)
... 13 more
Caused by: org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:138)
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:80)
... 26 more
To set up the MongoDB-Hadoop connector I used this bootstrap script:
#!/bin/sh
wget -P /home/hadoop/lib http://central.maven.org/maven2/org/mongodb/mongo-java-driver/2.13.0/mongo-java-driver-2.13.0.jar
wget -P /home/hadoop/lib https://github.com/mongodb/mongo-hadoop/releases/download/r1.3.2/mongo-hadoop-core-1.3.2.jar
wget -P /home/hadoop/lib https://github.com/mongodb/mongo-hadoop/releases/download/r1.3.2/mongo-hadoop-pig-1.3.2.jar
wget -P /home/hadoop/lib https://github.com/mongodb/mongo-hadoop/releases/download/r1.3.2/mongo-hadoop-hive-1.3.2.jar
cp /home/hadoop/lib/mongo* /home/hadoop/hive/lib
cp /home/hadoop/lib/mongo* /home/hadoop/pig/lib
The error suggests that the output directory is not set or does not exist.
The straightforward solution is to create the output directory before running the script.
As a quick check, you can also make the output directory the same as the (existing) input directory. If the directory actually does exist, it may be a permissions issue.
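A quick way to pre-create the output prefix from the master node (a sketch; the bucket and prefix mirror the script above, and this assumes the cluster's s3n filesystem is configured, as it is by default on EMR):
$ hadoop fs -mkdir s3n://xxx/out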

Spark Pipe example

I'm new to Spark and trying to figure out how the pipe method works. I have the following code in Scala:
sc.textFile(hdfsLocation).pipe("preprocess.py").saveAsTextFile(hdfsPreprocessedLocation)
The values hdfsLocation and hdfsPreprocessedLocation are fine. As proof, the following works from the command line:
hadoop fs -cat hdfsLocation/* | ./preprocess.py | head
When I run the above Spark code I get the following errors
14/11/25 09:41:50 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.IOException: Cannot run program "preprocess.py": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
at org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:119)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:135)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1022)
... 12 more
To solve this for Hadoop streaming I would just use the --files option, so I tried the same thing for Spark. I start Spark with the following command:
bin/spark-shell --files ./preprocess.py
but that gave the same error.
I couldn't find a good example of using Spark with an external process via pipe, so I'm not sure if I'm doing this correctly. Any help would be greatly appreciated.
Thanks
I'm not sure if this is the correct answer, so I won't finalize it, but it appears that file paths are resolved differently when running Spark in local mode and in cluster mode. When running Spark without --master, the paths passed to the pipe command are relative to the local machine; when running Spark with --master, the paths to the pipe command are ./
UPDATE:
This actually isn't correct. I was using SparkFiles.get() to get the file name. It turns out that when you call .pipe() on an RDD, the command string is evaluated on the driver and then passed to the workers. Because of this, SparkFiles.get() is not the appropriate way to get the file name. The file name should be ./ because SparkContext.addFile() puts the file at ./ relative to where each worker runs. But I'm so sour on .pipe now that I've taken it out of my code entirely in favor of .mapPartitions combined with a PipeUtils object that I wrote here. This is actually more efficient because I only incur the script startup cost once per partition instead of once per example.
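For reference, a minimal sketch of the addFile-plus-pipe combination described above (assumptions: this runs in spark-shell where sc already exists, preprocess.py sits in the driver's current directory and is executable on the workers, and hdfsLocation / hdfsPreprocessedLocation are the same values as in the question):
// Ship the script to every executor's working directory.
sc.addFile("preprocess.py")

// The command string is built on the driver, but each worker resolves
// "./preprocess.py" against its own working directory, where addFile
// placed the script.
val processed = sc.textFile(hdfsLocation).pipe("./preprocess.py")
processed.saveAsTextFile(hdfsPreprocessedLocation)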