Test Spark with Tachyon - scala

I have installed Tachyon and Spark according to instructions:
http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html
However, as a newbie I have no idea how to put a file "X" into the Tachyon File System as they describe:
$ ./spark-shell
scala> val s = sc.textFile("tachyon-ft://stanbyHost:19998/X")
scala> s.count()
scala> s.saveAsTextFile("tachyon-ft://activeHost:19998/Y")
What I did was to point to an existing file (that I find through the management UI):
scala> val s = sc.textFile("tachyon-ft://localhost:19998/root/default_tests_files/BasicFile_THROUGH")
s: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21
When I ran count, I got the error below:
scala> s.count()
java.lang.NullPointerException: connectionString cannot be null
I assume my path was wrong. So two questions:
How to copy a file into Tachyon?
What is the proper path for its FS?
Sorry, very very newbie!!
UPDATE 1
I am not sure if tachyon-ft://localhost:19998/root/default_tests_files/BasicFile_THROUGH is the correct path. I cannot fetch it via the browser or wget either.
This is what I saw in the file system browser

I found the issue. I hadn't run this first:
sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")
After I went through this exercise, http://ampcamp.berkeley.edu/5/exercises/tachyon.html#run-spark-on-tachyon, I found that the proper path is:
val file = sc.textFile("tachyon://localhost:19998/LICENSE")
So my setup was fine after all. The documentation at http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html was causing me a lot of confusion.
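For later readers, here is the whole working sequence in one place as I understand it now. This is a sketch, assuming a local Tachyon master on port 19998 and that the file was first copied in with the Tachyon CLI; it needs a running spark-shell and Tachyon to actually execute:

```scala
// First, copy a local file into Tachyon from the Tachyon directory:
//   $ ./bin/tachyon tfs copyFromLocal LICENSE /LICENSE
// Then, inside spark-shell:
sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")
val file = sc.textFile("tachyon://localhost:19998/LICENSE")
file.count()
```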

Related

How to provide a text file location in spark when file is on server

I am learning Spark to use in my project. I want to run this command in the spark shell:
val rddFromFile = spark.sparkContext.textFile("abc")
where abc is the file location. My file is on a remote server, and I open the spark shell from that remote server; how should I specify the file location?
I also tried putting a text file on the local C drive and providing that location to read it, but that did not work either. I get a similar error for every file location I try.
Error :
scala> val rddFromFile = spark.sparkContext.textFile("C:/Users/eee/Spark test/Testspark.txt")
rddFromFile: org.apache.spark.rdd.RDD[String] = C:/Users/eee/Spark test/Testspark.txt MapPartitionsRDD[1] at textFile at <console>:23
scala> rddFromFile.collect().foreach(f=>{
| println(f)
| })
org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "C"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3281)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:239)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
Spark expects the file to be present in the Hadoop FS, since that appears to be the default file system configured in your app.
To load a file from the local FS, you need to reference it with an explicit file:// scheme:
val rddFromFile = spark.sparkContext.textFile("file:///C:/Users/eee/Spark test/Testspark.txt")
That will work when you run Spark in local mode.
If you run Spark in the cluster, then the file would have to be present on all executor nodes.
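The error message also hints at why: a Windows drive letter followed by a colon parses as a URI scheme, so Hadoop goes looking for a file system named "C". A quick illustration in plain Scala (using the question's path without the space, which a raw URI would reject):

```scala
import java.net.URI

// "C:" is parsed as the URI scheme, not as a drive letter.
val winStyle = new URI("C:/Users/eee/Testspark.txt")
println(winStyle.getScheme) // C

// With an explicit file: scheme there is no ambiguity.
val explicit = new URI("file:///C:/Users/eee/Testspark.txt")
println(explicit.getScheme) // file
```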

Spark-SQL: access file in current worker node directory

I need to read a file using spark-sql, and the file is in the current directory.
I use this command to decompress a list of files I have stored on HDFS.
val decompressCommand = Seq(laszippath, "-i", inputFileName , "-o", "out.las").!!
The file is written to the current worker node directory; I know this because executing "ls -a".!! from Scala shows that the file is there. I then try to access it with the following command:
val dataFrame = sqlContext.read.las("out.las")
I assumed that the SQL context would try to find the file in the current directory, but it doesn't. Also, it doesn't throw an error, only a warning stating that the file could not be found (so Spark continues to run).
I attempted to add the file using sparkContext.addFile("out.las") and then access the location using val location = SparkFiles.get("out.las"), but this didn't work either.
I even ran the command val locationPt = "pwd".!! and then did val fullLocation = locationPt + "/out.las" and attempted to use that value, but it didn't work either.
The actual exception that gets thrown is the following:
User class threw exception: org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: [];
org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: []
And this happens when I try to access column "x" from a dataframe. I know that column 'X' exists because I've downloaded some of the files from HDFS, decompressed them locally and ran some tests.
I need to decompress files one by one because I have 1.6TB of data and so I cannot decompress it at one go and access them later.
Can anyone tell me what I can do to access files which are being outputted to the worker node directory? Or maybe should I be doing it some other way?
So I managed to do it now. What I'm doing is saving the file to HDFS, and then retrieving it using the SQL context through HDFS. I overwrite "out.las" in HDFS each time so that it doesn't take up too much space.
I have used the Hadoop API before to get at files; I don't know if it will help you here.
import org.apache.hadoop.fs.{FSDataInputStream, FileSystem, Path}

val filePath = "/user/me/dataForHDFS/"
val fs: FileSystem = FileSystem.get(new java.net.URI(filePath + "out.las"), sc.hadoopConfiguration)
And I've not tested the below, but the idea is to size a byte array from the file's status and read the file fully into it:
val file = new Path(filePath + "out.las")
val readIn = new Array[Byte](fs.getFileStatus(file).getLen.toInt) // buffer sized to the file length
val fileIn: FSDataInputStream = fs.open(file)
fileIn.readFully(0, readIn)
fileIn.close()

java.io.FileNotFoundException in Spark

I'm new here, learning Spark and Scala using the Notebook and Cluster on Databricks.com. Here is my very simple code to load a file:
import sys.process._
val localpath="file:/tmp/myfile.json"
dbutils.fs.mkdirs("dbfs:/datasets/")
dbutils.fs.cp(localpath, "dbfs:/datasets/")
but I got error like this:
java.io.FileNotFoundException: File file:/tmp/myfile.json does not exist.
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:402)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:255)
at com.databricks.backend.daemon.dbutils.FSUtils$.cp(DBUtilsCore.scala:82)
at com.databricks.dbutils_v1.impl.DbfsUtilsImpl.cp(DbfsUtilsImpl.scala:40)
I'm using a Mac and I've made sure that the file exists at this absolute path. Is this a Spark error? Thanks!
The line:
val localpath="file:/tmp/myfile.json"
is a valid URI, but the unambiguous form is:
val localpath="file:///tmp/myfile.json"
Note the three slashes: URIs follow the scheme://authority/path format (see RFC 3986), so with only two slashes "tmp" would be parsed as the host rather than part of the path. Also note that on Databricks this path is resolved on the cluster, not on your Mac, so /tmp/myfile.json must exist on the cluster's driver node (e.g. be uploaded there first) or the cp will fail with FileNotFoundException.
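The difference between two and three slashes is easy to see with plain URI parsing:

```scala
import java.net.URI

// Three slashes: empty authority, the path is /tmp/myfile.json.
val three = new URI("file:///tmp/myfile.json")
println(three.getHost) // null
println(three.getPath) // /tmp/myfile.json

// Two slashes: "tmp" becomes the host, and the path loses its first segment.
val two = new URI("file://tmp/myfile.json")
println(two.getHost) // tmp
println(two.getPath) // /myfile.json
```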

going to specific path in scala using scala.sys.process

I have to go to the path of an application to deploy it. I tried using scala.sys.process and ran "cd /home/path/somepath".!
It throws an exception. Can anyone guide me on how to change to that directory? I cannot deploy using an absolute path because of a dependency the run file has.
Thanks in advance
Although this question is a couple of years old, it's a good question.
To use scala.sys.process to execute something from a specific working directory, pass the required directory as a parameter to ProcessBuilder, as in this working example:
import scala.sys.process._

val scriptPath = "/home/path/myShellScript.sh"
val command = Seq("/bin/bash", "-c", scriptPath)
// The second argument sets the working directory for the launched process.
val proc = Process(command, new java.io.File("/home/path/somepath"))
var output = Vector.empty[String]
val exitValue = proc ! ProcessLogger(
  (out) => if (out.trim.length > 0) output +:= out.trim,
  (err) => System.err.printf("e:%s\n", err) // can be quite noisy!
)
printf("exit value: %d\n", exitValue)
printf("output[%s]\n", output.mkString("\n"))
If the goal instead is to ensure that the environment of the caller defaults to a specific working directory, that can be accomplished by setting the required working directory before launching the JVM.
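You can verify that the directory argument to Process really takes effect with a self-contained check: `pwd` launched with its cwd set to /tmp reports /tmp (or its resolved physical path), not the caller's directory.

```scala
import scala.sys.process._
import java.io.File

// Run `pwd` with the working directory forced to /tmp.
val cwdOutput = Process(Seq("pwd"), new File("/tmp")).!!.trim
println(cwdOutput)
```

Incidentally, this is also why "cd /some/dir".! cannot work: cd is a shell builtin, and each Process is a separate child process whose working directory dies with it.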

Spark Tachyon: How to delete a file?

In Scala, as an experiment, I create a sequence file on Tachyon using Spark and read it back in. I also want to delete the file from Tachyon using the Spark script.
val rdd = sc.parallelize(Array(("a",2), ("b",3), ("c",1)))
rdd.saveAsSequenceFile("tachyon://127.0.0.1:19998/files/123.sf2")
val rdd2 = sc.sequenceFile[String,Int]("tachyon://127.0.0.1:19998/files/123.sf2")
I don't know the Scala language very well, and I cannot find a reference on file path manipulation. I did find a way of using Java from Scala to do this, but I cannot get it to work with Tachyon.
import java.io._
new File("tachyon://127.0.0.1:19998/files/123.sf2").delete()
There are different approaches, e.g.:
CLI:
./bin/tachyon tfs rm filePath
More info: http://tachyon-project.org/Command-Line-Interface.html
API:
TachyonFS sTachyonClient = TachyonFS.get(args[0]);
sTachyonClient.delete(filePath, true);
More info:
https://github.com/amplab/tachyon/blob/master/core/src/main/java/tachyon/examples/BasicOperations.java
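If you want to stay inside the Spark shell rather than shelling out or using the Tachyon client directly, a sketch using the Hadoop FileSystem API should also work (this assumes fs.tachyon.impl is set to tachyon.hadoop.TFS, as elsewhere on this page, and a running Tachyon master):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Resolve the Tachyon path through whatever FileSystem handles its scheme.
val path = new Path("tachyon://127.0.0.1:19998/files/123.sf2")
val fs = path.getFileSystem(sc.hadoopConfiguration)
fs.delete(path, true) // true = recursive, needed since saveAsSequenceFile writes a directory
```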