Spark Pipe function throws No such file or directory - scala

I am running the spark pipe function on EMR master server in REPL just to test out the pipe functionality. I am using the following examples
https://stackoverflow.com/a/32978183/8876462
http://blog.madhukaraphatak.com/pipe-in-spark/
http://hadoop-makeitsimple.blogspot.com/2016/05/pipe-in-spark.html
This is my code ::
import org.apache.spark._
val distScript = "/home/hadoop/PipeEx.sh"
val distScriptName = "PipeEx.sh"
sc.addFile(distScript)
val ipData =
sc.parallelize(List("asd","xyz","zxcz","sdfsfd","Ssdfd","Sdfsf"))
val opData = ipData.pipe(SparkFiles.get(distScriptName))
opData.foreach(println)
I have tried different things like making the file executable, placed in file in /usr/lib/spark/bin as suggested in another post. I changed the distScript to say
"file:///home/hadoop/PipeEx.sh"
I always get no such file or directory in tmp/spark*/userFiles* location. I have tried to access and run the shell program from the tmp location and it runs fine.
My shell script is the same as http://blog.madhukaraphatak.com/pipe-in-spark/
Here is the first part of the log::
[Stage 9:> (0 + 2)
/ 2]18/03/19 19:58:22 WARN TaskSetManager: Lost task 1.0 in stage 9.0 (TID
72, ip-172-31-42-11.ec2.internal, executor 9): java.io.IOException: Cannot
run program "/mnt/tmp/spark-bdd582ec-a5ac-4bb1-874e-832cd5427b18/userFiles-
497f6051-6f49-4268-b9c5-a28c2ad5edc6/PipeEx.sh": error=2, No such file or
directory
Does any one have any idea? I am using Spark 2.2.1 and scala 2.11.8
Thanks

I was able to solve this , once I removed the
SparkFiles.get(distScriptName)
command.
So my final code looks like this
val distScript = "/home/hadoop/PipeEx.sh"
val distScriptName = "./PipeEx.sh"
sc.addFile(distScript)
val ipData = sc.parallelize(List("asd","xyz","zxcz","sdfsfd","Ssdfd","Sdfsf"))
val opData = ipData.pipe(distScriptName)
opData.collect().foreach(println)
I am not very sure why removing the SparkFiles.get() solved the problem

Related

How to provide a text file location in spark when file is on server

I am Learning spark to implement in my project. I want to run command in spark shell-
val rddFromFile = spark.sparkContext.textFile("abc");
where abc is file location. My file is on remote server and through that remote server I am opening spark shell, how should I specify file location.
I tried to put a text file in local C drive and provided the location to read that, it also did not worked. I am getting similar error for all the file location.
Error :
scala> val rddFromFile = spark.sparkContext.textFile("C:/Users/eee/Spark test/Testspark.txt")
rddFromFile: org.apache.spark.rdd.RDD[String] = C:/Users/eee/Spark test/Testspark.txt MapPartitionsRDD[1] at textFile at <console>:23
scala> rddFromFile.collect().foreach(f=>{
| println(f)
| })
org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "C"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3281)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:239)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
Spark is expecting the file to be present in the Hadoop FS, as it looks like that's the default file system set in your app.
To load a file from local FS, you need to put it like
val rddFromFile = spark.sparkContext.textFile("file:///C:/Users/eee/Spark test/Testspark.txt")
That will work when you run Spark in local mode.
If you run Spark in the cluster, then the file would have to be present on all executor nodes.

spark - hadoop argument

I am running both hadoop and spark and I want to use files from hdfs as an argument on spark-submit, so I made a folder in hdfs with the files
eg. /user/hduser/test/input
and I want to run spark-submit like this:
$SPARK_HOME/bin/spark-submit --master spark://admin:7077 ./target/scala-2.10/test_2.10-1.0.jar hdfs://user/hduser/test/input
but I cant make it work, what's the right way to do it?
the error I am getting is :
WARN FileInputDStream: Error finding new files
java.lang.NullPointerException
Check if you are able to access HDFS from Spark code, If yes then you need to add following line of code in your Scala import.
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkFiles
then in your code add following lines
var hadoopConf = new org.apache.hadoop.conf.Configuration()
var fileSystem = FileSystem.get(hadoopConf)
var path = new Path(args(0))
actually the problem was the path. I had to use hdfs://localhost:9000/user/hduser/...

What is correct directory path format on Windows for StreamingContext.textFileStream?

I am trying to execute a spark streaming application to process the stream of files data to perform word count.
The directory I am reading is from Windows. As shown I using the local directory like "Users/Name/Desktop/Stream".It is not HDFS.
I created a folder as "Stream" in desktop.
I started the Spark Streaming application and after that I added some text files into the folder 'Stream'. But my spark application is not able to read the files. It is always giving the empty results.
Here is my code.
//args(0) = local[2]
object WordCount {
def main(args: Array[String]) {
val ssc = new StreamingContext(args(0), "word_count",Seconds(5))
val lines = ssc.textFileStream("Users/name/Desktop/Stream")
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
}
Output: Getting empty data every 5 seconds
17/05/18 07:35:00 INFO Executor: Running task 0.0 in stage 71.0 (TID 35)
-------------------------------------------
Time: 1495107300000 ms
-------------------------------------------
I tried giving the path as C:/Users/name/Desktop/Stream as well - still the same issue and application could not read the files.
Can anyone please guide if I am giving the incorrect directory path ?
Your code's fine so the only issue is to use proper path to the directory. Please use file:// prefix to denote local file system that would give file://C:/Users/name/Desktop/Stream.
Please start one step at a time to confirm that our understanding is at the same level.
When you execute the Spark Streaming application, create the directory to be in the same directory where you start the application, say Stream. Once you confirm that the application works fine with the local directory we'll fix it globally to read from any directory on Windows (if that's still needed).
Please also make sure that you "move" your files as the operation to create a file in the monitored directory has to be atomic (partial writes will mark the file as processed - see StreamingContext).
Files must be written to the monitored directory by "moving" them from another location within the same file system.
As you can see in the code the directory path will eventually be "wrapped" using Hadoop's File so the issue is to convince it to accept your path:
if (_path == null) _path = new Path(directory)

Issue in accessing HDFS file inside spark map function

My use case requires to access the file stored in HDFS from inside the spark map function. ThIs use case uses custom input format that does not provide any data to the map function whereas the map function obtains the input split and access the data. I am using the below code to do this
val hConf: Configuration = sc.hadoopConfiguration
hConf.set("fs.hdfs.impl", classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
hConf.set("fs.file.impl", classOf[org.apache.hadoop.fs.LocalFileSystem].getName)
var job = new Job(hConf)
FileInputFormat.setInputPaths(job,new Path("hdfs:///user/bala/MyBinaryFile"));
var hRDD = new NewHadoopRDD(sc, classOf[RandomAccessInputFormat],
classOf[IntWritable],
classOf[BytesWritable],
job.getConfiguration()
)
val count = hRDD.mapPartitionsWithInputSplit{ (split, iter) => myfuncPart(split, iter)}.collect()
As of now, I am not doing anything inside the myfuncPart. This simple returns a map as below
iter.map { tpl ⇒ (tpl._1, tpl._2.getCapacity) }
When i submit the job along with the dependencies, I get the below error
15/10/30 11:11:39 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 40.221.94.235): java.io.IOException: No FileSystem for scheme: spark
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
At first glance, it seems a small error related to spark jars but could not crack. Any help will be greatly appreciated.
It turned out to be a mistake from my side with the way I was launching the job. The command I was using did not have proper option in it. Hence, the issue. I was using the command below
spark-submit --class org.myclass --jars myjar spark://myhost:7077 myjob.jar
Below is the correct one
spark-submit --class org.myclass --jars myjar --master spark://myhost:7077 myjob.jar
This is a small mistake but somehow I missed it. Now it is working

Test Spark with Tachyon

I have installed Tachyon and Spark according to instructions:
http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html
However, as a newbie I have no idea how to put file "X" into Tachyon File System as they said:
$ ./spark-shell
$ val s = sc.textFile("tachyon-ft://stanbyHost:19998/X")
$ s.count()
$ s.saveAsTextFile("tachyon-ft://activeHost:19998/Y")
What I did was to point to an existing file (that I find through the management UI):
scala> val s = sc.textFile("tachyon-ft://localhost:19998/root/default_tests_files/BasicFile_THROUGH")
s: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21
When I run count, I got this below error:
scala> s.count()
java.lang.NullPointerException: connectionString cannot be null
I assume my path was wrong. So two questions:
How to copy a file into Tachyon?
What is the proper path for its FS?
Sorry, very very newbie !!
UPDATE 1
I am not sure if tachyon-ft://localhost:19998/root/default_tests_files/BasicFile_THROUGH is correct path. I cannot get it either via the browser or wget
This is what I saw in the file system browser
I found out the issue. I didn't do this
sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")
After I went through this exercise http://ampcamp.berkeley.edu/5/exercises/tachyon.html#run-spark-on-tachyon, I found out the proper path is this:
val file = sc.textFile("tachyon://localhost:19998/LICENSE")
So my setup was fine afterall. The documentation here http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html was causing me a lot of confusion.