Spark Pipe example - scala

I'm new to Spark and trying to figure out how the pipe method works. I have the following code in Scala
sc.textFile(hdfsLocation).pipe("preprocess.py").saveAsTextFile(hdfsPreprocessedLocation)
The values hdfsLocation and hdfsPreprocessedLocation are fine. As proof, the following code works from the command line
hadoop fs -cat hdfsLocation/* | ./preprocess.py | head
When I run the above Spark code I get the following errors
14/11/25 09:41:50 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.IOException: Cannot run program "preprocess.py": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
at org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:119)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:135)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1022)
... 12 more
In order to solve this for Hadoop streaming I would just use the --files attribute, so I tried the same thing for Spark. I start Spark with the following command
bin/spark-shell --files ./preprocess.py
but that gave the same error.
I couldn't find a good example of using Spark with an external process via pipe, so I'm not sure if I'm doing this correctly. Any help would be greatly appreciated.
Thanks

I'm not sure if this is the correct answer, so I won't finalize this, but it appears that the file paths are different when running Spark in local mode and in cluster mode. When running Spark without --master, the paths to the pipe command are relative to the local machine. When running Spark with --master, the paths to the pipe command are relative to ./
UPDATE:
This actually isn't correct. I was using SparkFiles.get() to get the file name. It turns out that when calling .pipe() on an RDD, the command string is evaluated on the driver and then passed to the workers. Because of this, SparkFiles.get() is not the appropriate way to get the file name. The file name should start with ./ because SparkContext.addFile() should put that file in ./ relative to where each worker is run from. But I'm so sour on .pipe now that I've taken .pipe out of my code entirely in favor of .mapPartitions in combination with a PipeUtils object that I wrote here. This is actually more efficient because I only have to incur the script startup costs once per partition instead of once per example.
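The once-per-partition idea described above can be sketched roughly as follows. This is not the PipeUtils object mentioned in the answer: PipeSketch and pipeThrough are hypothetical names, and only the JDK's ProcessBuilder is used so that the snippet runs without Spark. It assumes a Unix-like system where cat is on the PATH.

```scala
import java.io.{BufferedReader, InputStreamReader, PrintWriter}

object PipeSketch {
  // Hypothetical helper: starts ONE external process, feeds it every line of
  // `lines` on stdin, and returns the lines the process writes to stdout.
  def pipeThrough(command: Seq[String], lines: Iterator[String]): Iterator[String] = {
    val process = new ProcessBuilder(command: _*)
      .redirectError(ProcessBuilder.Redirect.INHERIT) // let stderr go to the console
      .start()

    // Write stdin on a separate thread so that a full pipe buffer cannot
    // block the thread that reads stdout.
    val writer = new Thread(new Runnable {
      def run(): Unit = {
        val out = new PrintWriter(process.getOutputStream)
        try lines.foreach(line => out.println(line))
        finally out.close() // closing stdin signals end of input to the process
      }
    })
    writer.start()

    // Read stdout lazily, one line at a time, until the process closes it.
    val reader = new BufferedReader(new InputStreamReader(process.getInputStream))
    Iterator.continually(reader.readLine()).takeWhile(_ != null)
  }

  def main(args: Array[String]): Unit = {
    // `cat` stands in for a real script such as ./preprocess.py.
    pipeThrough(Seq("cat"), Iterator("first line", "second line")).foreach(println)
  }
}
```

In a Spark job, a helper like this could presumably be called as rdd.mapPartitions(part => PipeSketch.pipeThrough(Seq("./preprocess.py"), part)), assuming the script has been shipped to the workers with --files or SparkContext.addFile().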

Related

Spark Failing to write to hdfs because field With AVG

I'm running a Spark script in Scala from a .sh file. When running the same code in a Zeppelin notebook I had no problem, but running it from the script returns the following:
ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 2032, Column 28: Redefinition of parameter "agg_expr_51"
The cause of this is a column which has an average calculated. Why is this happening? Does it have a solution?
Thanks.

unable to start a Process in Scala/Play application

I have a ccm.py script which runs a Cassandra cluster on the local machine. I am able to run the command from the Windows command prompt, but I get an error if I try to do so using the Process class in Scala.
I want to run it before starting my test cases, so I am calling Process in beforeAll.
override def beforeAll(): Unit = {
  println("starting cassandra cluster locally")
  val ccmCommand = Process("ccm.py start").!
}
The error I get is
An exception or error caused a run to abort: Cannot run program "ccm.py": CreateProcess error=193, %1 is not a valid Win32 application
java.io.IOException: Cannot run program "ccm.py": CreateProcess error=193, %1 is not a valid Win32 application
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
I also tried using the cassandra -f command (which also works from the cmd prompt), but I got this error:
An exception or error caused a run to abort: Cannot run program "cassandra": CreateProcess error=2, The system cannot find the file specified
java.io.IOException: Cannot run program "cassandra": CreateProcess error=2, The system cannot find the file specified
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
How can I solve the issue?
Update
Thanks to Alex Ott, I have made some progress, but I think there has to be a better and more reliable way to test the code and also make the code more portable.
The code which seems to work so far is
override def beforeAll(): Unit = {
  println("starting cassandra cluster locally")
  val ccmCommand = Process("python", Seq("C:\\Users\\manu\\Documents\\manu\\ccm-3.1.4.tar\\dist\\ccm-3.1.4\\ccm.py", "start")).run // NOT GOOD. Using hardcoded path!!
  println(s"ccm returned ${ccmCommand}")
  Thread.sleep(30000) // the wait is too long and there is no guarantee that the cluster will be up properly
}
Is there a better way (more reliable, and without hard-coding the path) to bring up the cluster before the tests get executed?
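One possible direction for the fixed Thread.sleep(30000), offered only as a sketch and not taken from the thread: poll the port the cluster listens on until it accepts a TCP connection, and give up after a timeout. WaitForPort and waitForPort are hypothetical names; 9042 is Cassandra's default native transport port and is assumed here, since the question does not say which port the ccm cluster uses. An open port does not guarantee that the node is fully ready, so this only tightens the wait.

```scala
import java.io.IOException
import java.net.{InetSocketAddress, ServerSocket, Socket}

object WaitForPort {
  // Hypothetical helper: returns true as soon as host:port accepts a TCP
  // connection, or false if that has not happened within timeoutMs.
  def waitForPort(host: String, port: Int, timeoutMs: Long, pollMs: Long = 500L): Boolean = {
    val deadline = System.currentTimeMillis() + timeoutMs
    var up = false
    while (!up && System.currentTimeMillis() < deadline) {
      val socket = new Socket()
      try {
        socket.connect(new InetSocketAddress(host, port), 1000) // 1 s connect timeout
        up = true
      } catch {
        case _: IOException => Thread.sleep(pollMs) // not listening yet, retry
      } finally {
        socket.close()
      }
    }
    up
  }

  def main(args: Array[String]): Unit = {
    // A local ServerSocket stands in for the Cassandra node in this demo.
    val server = new ServerSocket(0)
    try println(waitForPort("127.0.0.1", server.getLocalPort, 5000L))
    finally server.close()
  }
}
```

In beforeAll, the sleep could then be replaced by something like assert(WaitForPort.waitForPort("127.0.0.1", 9042, 60000L)), with the port and timeout adjusted to the actual setup.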

Write Spark DF to a flat file on local pc from Eclipse

I need to write a Spark DF to a flat file on my local PC.
I'm executing my program on Scala IDE on Eclipse (again on my local PC)
This is the command I use:
df.coalesce(1).rdd.saveAsTextFile(s"file:///C:/myfile.csv")
It creates the C:\myfile.csv_temporary\0_temporary\attempt_20180208105406_0016_m_000000_819 folder and even a part-00000 file in it, but the file is empty.
This is the error message I'm getting on the console:
Exception in task 0.0 in stage 16.0 (TID 819)
java.io.IOException: (null) entry in command string: null chmod 0644 C:\myfile.csv_temporary\0_temporary\attempt_20180208105406_0016_m_000000_819\part-00000*
Try setting HADOOP_HOME to the directory whose bin subdirectory contains winutils.exe (that is, the parent of bin\winutils.exe).
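As a rough illustration of that suggestion, the check below verifies that winutils.exe is where Hadoop expects it and then sets the hadoop.home.dir system property, which Hadoop consults before falling back to the HADOOP_HOME environment variable. WinutilsCheck and hasWinutils are hypothetical names, and C:\hadoop is an assumed install location that does not come from the question.

```scala
import java.io.File

object WinutilsCheck {
  // Hypothetical helper: true if <hadoopHome>\bin\winutils.exe exists.
  def hasWinutils(hadoopHome: String): Boolean =
    new File(new File(hadoopHome, "bin"), "winutils.exe").isFile

  def main(args: Array[String]): Unit = {
    // C:\hadoop is an assumed default, used only when HADOOP_HOME is unset.
    val hadoopHome = sys.env.getOrElse("HADOOP_HOME", "C:\\hadoop")
    if (hasWinutils(hadoopHome)) {
      // Must be set before the SparkContext / SparkSession is created.
      System.setProperty("hadoop.home.dir", hadoopHome)
      println("Using Hadoop home: " + hadoopHome)
    } else {
      println("winutils.exe not found under " + new File(hadoopHome, "bin"))
    }
  }
}
```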

"ERROR 6000, Output location validation failed" using PIG MongoDB-Hadoop Connector on EMR

I get an "output location validation failed" exception in my Pig script on EMR.
It fails when saving data back to S3.
I use this simple script to narrow down the problem:
REGISTER /home/hadoop/lib/mongo-java-driver-2.13.0.jar
REGISTER /home/hadoop/lib/mongo-hadoop-core-1.3.2.jar
REGISTER /home/hadoop/lib/mongo-hadoop-pig-1.3.2.jar
example = LOAD 's3://xxx/example-full.bson'
USING com.mongodb.hadoop.pig.BSONLoader();
STORE example INTO 's3n://xxx/out/example.bson' USING com.mongodb.hadoop.pig.BSONStorage();
This is the stack trace produced:
================================================================================
Pig Stack Trace
---------------
ERROR 6000:
<line 8, column 0> Output Location Validation Failed for: 's3://xxx/out/example.bson More info to follow:
Output directory not set.
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias example
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1637)
at org.apache.pig.PigServer.registerQuery(PigServer.java:577)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1091)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:543)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: org.apache.pig.impl.plan.VisitorException: ERROR 6000:
<line 8, column 0> Output Location Validation Failed for: 's3://xxx/out/example.bson More info to follow:
Output directory not set.
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:95)
at org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator.validate(InputOutputFileValidator.java:45)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:317)
at org.apache.pig.PigServer.compilePp(PigServer.java:1382)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1307)
at org.apache.pig.PigServer.execute(PigServer.java:1299)
at org.apache.pig.PigServer.access$400(PigServer.java:124)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1632)
... 13 more
Caused by: org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:138)
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:80)
... 26 more
To set up the MongoConnector I used this bootstrap script:
#!/bin/sh
wget -P /home/hadoop/lib http://central.maven.org/maven2/org/mongodb/mongo-java-driver/2.13.0/mongo-java-driver-2.13.0.jar
wget -P /home/hadoop/lib https://github.com/mongodb/mongo-hadoop/releases/download/r1.3.2/mongo-hadoop-core-1.3.2.jar
wget -P /home/hadoop/lib https://github.com/mongodb/mongo-hadoop/releases/download/r1.3.2/mongo-hadoop-pig-1.3.2.jar
wget -P /home/hadoop/lib https://github.com/mongodb/mongo-hadoop/releases/download/r1.3.2/mongo-hadoop-hive-1.3.2.jar
cp /home/hadoop/lib/mongo* /home/hadoop/hive/lib
cp /home/hadoop/lib/mongo* /home/hadoop/pig/lib
The error suggests that the output directory does not exist.
If so, the solution would be to create the output directory.
As a quick check, it is also possible to make the output directory the same as the input directory. If the directory actually does exist, it may be a permissions issue.

Scalding tutorial: com.twitter.scalding.InvalidSourceException: Data is missing from one or more paths

With Hadoop 2.2 installed on a single node, I am trying to run part 1 of the Scalding tutorial with the command:
$ yarn jar target/scalding-tutorial-0.8.11.jar Tutorial0 --hdfs
The tutorial is at https://github.com/Cascading/scalding-tutorial/
Before running the tutorial I copied the required file hello.txt to HDFS:
$ hdfs dfs -ls /data
Found 2 items
drwxr-xr-x - hdfs hdfs 0 2014-02-04 16:35 /data/10gsort
-rw-r--r-- 3 hdfs hdfs 26 2014-07-03 15:07 /data/hello.txt
It looks like the tutorial cannot find the input file:
Exception in thread "main" com.twitter.scalding.InvalidSourceException:[TextLine(data/hello.txt)] Data is missing from one or more paths in: List(data/hello.txt)
at com.twitter.scalding.FileSource.validateTaps(FileSource.scala:102)
at com.twitter.scalding.Job$$anonfun$validateSources$1.apply(Job.scala:158)
at com.twitter.scalding.Job$$anonfun$validateSources$1.apply(Job.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1156)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at com.twitter.scalding.Job.validateSources(Job.scala:153)
at com.twitter.scalding.Job.buildFlow(Job.scala:91)
at com.twitter.scalding.Job.run(Job.scala:126)
at com.twitter.scalding.Tool.start$1(Tool.scala:109)
at com.twitter.scalding.Tool.run(Tool.scala:125)
at com.twitter.scalding.Tool.run(Tool.scala:72)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at JobRunner$.main(JobRunner.scala:27)
at JobRunner.main(JobRunner.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Any ideas how to make it work?
TextLine turns out to build a Hadoop Path from the given path string and the configuration.
The Hadoop Path API documentation says: "A path string is absolute if it begins with a slash."
The tutorial hard-codes the input as "data/hello.txt", which has no leading slash and is therefore a relative path. The current working directory is prepended to form the absolute path, so the file is not looked up at /data/hello.txt.
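The resolution rule can be modelled in a few lines. This is a simplified sketch of the behaviour described above, not Hadoop's actual Path implementation; PathSketch and qualify are hypothetical names, and /user/alice is an assumed working directory (on HDFS the working directory normally defaults to the user's home directory).

```scala
object PathSketch {
  // Simplified model: a path starting with "/" is absolute and used as is;
  // anything else is resolved against the working directory.
  def qualify(path: String, workingDir: String): String =
    if (path.startsWith("/")) path
    else workingDir.stripSuffix("/") + "/" + path

  def main(args: Array[String]): Unit = {
    val workingDir = "/user/alice" // assumed, for illustration only
    println(qualify("data/hello.txt", workingDir))  // relative: resolved under the working directory
    println(qualify("/data/hello.txt", workingDir)) // absolute: left unchanged
  }
}
```

Under this model, either the file has to be copied to data/hello.txt below the working directory, or the job has to be given the absolute path /data/hello.txt.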