Error when reading a file in Spark - scala

I'm having a hard time figuring out why Spark is not accessing a file that I add to the context. Below is my code in the REPL:
scala> sc.addFile("/home/ubuntu/my_demo/src/main/resources/feature_matrix.json")
scala> val featureFile = sc.textFile(SparkFiles.get("feature_matrix.json"))
featureFile: org.apache.spark.rdd.RDD[String] = /tmp/spark/ubuntu/spark-d7a13d92-2923-4a04-a9a5-ad93b3650167/feature_matrix.json MappedRDD[1] at textFile at <console>:60
scala> featureFile.first()
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: cfs://172.30.26.95/tmp/spark/ubuntu/spark-d7a13d92-2923-4a04-a9a5-ad93b3650167/feature_matrix.json
The file does in fact exist at /tmp/spark/ubuntu/spark-d7a13d92-2923-4a04-a9a5-ad93b3650167/feature_matrix.json
Any help appreciated.

If you are using addFile, then you need to use SparkFiles.get to retrieve it. Also, textFile is lazy, so nothing actually tries to read the file until you call an action such as first(), which is why the failure only shows up at that point and you end up going in this kind of circle.
All that said, I don't think reaching for SparkFiles as the first step is ever going to be a smart idea. Use something like --files with spark-submit and the files will be placed in your working directory.
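For illustration, a rough sketch of that route (the class and JAR names below are placeholders, and this assumes the JSON only needs to be read as an ordinary local file rather than as an RDD):

// Launch with the file shipped to every node:
//   spark-submit --class com.example.MyApp \
//     --files /home/ubuntu/my_demo/src/main/resources/feature_matrix.json \
//     my_demo.jar

import scala.io.Source
import org.apache.spark.SparkFiles

// SparkFiles.get resolves the local copy that Spark placed in its scratch
// directory, so plain local I/O works instead of sc.textFile (which would
// resolve the path against the cluster's default filesystem, e.g. cfs://).
val localPath = SparkFiles.get("feature_matrix.json")
val featureJson = Source.fromFile(localPath).mkString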

Related

How to pass variable arguments to my scala program?

I am very new to Scala and Spark. Here I have a word-count program in which I pass the input file as an argument instead of hardcoding it. But when I run the program I get the error Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
I think it's because I have not supplied the argument that the main class expects, but I don't know how to do so.
I tried running the program as-is and also tried changing the run configurations. I do not know how to pass the filename as an argument to my main class.
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row

object First {
  def main(args: Array[String]): Unit = {
    val filename = args(0)
    val cf = new SparkConf().setAppName("Tutorial").setMaster("local")
    val sc = new SparkContext(cf)
    val input = sc.textFile(filename)
    val w = input.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
    w.collect.foreach(println)
    w.saveAsTextFile(args(1))
  }
}
I want to run this program by passing the right arguments (the input file and the output path) to my main class. I am using the Scala Eclipse IDE. I do not know what changes to make in my program; please help me out here, as I am new.
In the run configuration for the project, there is an option right next to main called '(x)=Arguments' where you can pass in arguments to main in the 'Program Arguments' section.
Additionally, you may print args.length to see the number of arguments your code is actually receiving after doing the above.
It appears you are running Spark on Windows, so I'm not sure if this will work exactly as-is, but you can definitely pass arguments like any normal command line application. The only difference is that you have to pass the arguments AFTER specifying the Spark-related parameters.
For example, if the JAR filename is the.jar and the main object is com.obrigado.MyMain, then you could run a spark-submit job like so: spark-submit --class com.obrigado.MyMain the.jar path/to/inputfile. I believe args(0) should then be path/to/inputfile.
However, like any command-line program, it's generally better to use POSIX-style (or at least named) arguments, and there are several good argument-parsing libraries out there. Personally, I love using Scallop, as it's easy to use and doesn't seem to interfere with Spark's own CLI parsing.
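For instance, a minimal Scallop sketch (this assumes the org.rogach.scallop dependency is on your classpath; the option names and object name are made up):

import org.rogach.scallop._

// Declarative definition of the expected options
class Conf(arguments: Seq[String]) extends ScallopConf(arguments) {
  val input  = opt[String](required = true)   // e.g. --input path/to/inputfile
  val output = opt[String](required = true)   // e.g. --output path/to/outputdir
  verify()
}

object ArgsDemo {
  def main(args: Array[String]): Unit = {
    val conf = new Conf(args.toIndexedSeq)
    println(s"input = ${conf.input()}, output = ${conf.output()}")
  }
}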
Hopefully this fixes your issue!

Source.fromFile not working for HDFS file path

I am trying to read file contents from HDFS, and for that I am using Source.fromFile(). It works fine when my file is on the local filesystem, but it throws an error when I try to read a file from HDFS.
import scala.io.Source

object CheckFile {
  def main(args: Array[String]): Unit = {
    for (line <- Source.fromFile("/user/cloudera/xxxx/File").getLines()) {
      println(line)
    }
  }
}
Error:
java.io.FileNotFoundException: hdfs:/quickstart.cloudera:8080/user/cloudera/xxxx/File (No such file or directory)
I searched but I am not able to find any solution to this.
Please help.
If you are using Spark, you should use the SparkContext to load files; Source.fromFile only reads from the local file system.
Say you have your SparkContext available as sc,
val fromFile = sc.textFile("hdfs://path/to/file.txt")
Should do the trick. You might have to specify the NameNode address, though.
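For example (the host below is taken from the error message in the question, and port 8020 is only the usual NameNode default, so treat both as assumptions):

// Fully qualified HDFS URI: hdfs://<namenode-host>:<port>/<path>
val fromFile = sc.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/xxxx/File")
fromFile.take(10).foreach(println)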
UPDATE:
To address the comment: you want to read some data from HDFS and store it as a Scala collection. This is bad practice, as the file might contain millions of lines and the job would crash from insufficient memory; you should use RDDs rather than built-in Scala collections. Nevertheless, if this is what you want, you could do:
val fromFile = sc.textFile("hdfs://path/to/file.txt").toLocalIterator.toArray
Which would produce a local collection of the desired type (an Array in this case).
sc.textFile("hdfs://path/to/file.txt").toLocalIterator.toArray.mkString will give the result as a single String.

Pass List[String] to function that takes f(args: String*) scala

I need to read specific Parquet files with Spark; I know this can be done like so:
sqlContext
.read
.parquet("s3://bucket/key", "s3://bucket/key")
Right now I have a List[String] with all of these S3 paths in it, but I don't know how I can pass it programmatically to the parquet function in Scala. There are way too many files to do it manually; any ideas on how to get the files into the parquet call programmatically?
I answered a similar question earlier concerning repeated parameters here.
As @Dima mentioned, you are looking for the splat operator, because .parquet expects repeated arguments:
sqlContext.read.parquet(listOfStrings:_*)
More on repeated arguments in the Scala Language Specification, section 4.6.2.
Although that is the specification for Scala 2.9, this part didn't change.
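Putting that together, a small sketch (the bucket and keys are made up):

// ": _*" expands the List into the repeated String* parameter that .parquet expects
val paths: List[String] = List("s3://bucket/key1", "s3://bucket/key2")
val df = sqlContext.read.parquet(paths: _*)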

How to write a program in Spark to replace a word

In Hadoop it is easy to use .replace(), for example
String[] valArray = value.toString().replace("\N", "")
But it doesn't work in Spark. I write Scala in the Spark shell like below:
val outFile=inFile.map(x=>x.replace("\N",""))
So, how do I deal with this?
For some reason your x is an Array[String]. How did you get it like that? You could call .toString.replace on it if you like, but that will probably not give you what you want (and it would give the wrong output in Java anyway); you probably want to add another layer of map: inFile.map(x => x.map(_.replace("\N","")))
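As a concrete sketch in the shell (the input path and the tab delimiter are assumptions; also note that inside a normal Scala string literal the backslash has to be escaped, so \N is written "\\N"):

// Hypothetical construction that mirrors the question: an RDD of Array[String]
val inFile = sc.textFile("input.txt").map(_.split("\t"))

// Replace Hive's \N null marker in every field of every row
val outFile = inFile.map(fields => fields.map(_.replace("\\N", "")))

outFile.take(5).foreach(fields => println(fields.mkString("\t")))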

Reading a two-dimensional array using Scala

Suppose I have a text file named "input.txt" and I want to use Scala to read it in. The dimensions of the file are not known in advance.
So, how do I construct such an Array[Array[Float]]? What I want is a simple and neat way, rather than Java-style code that iterates over the lines and parses each number. I think functional programming should be quite good at this, but I can't think of a solution so far.
Best Regards
If your input is well-formed, you can do it this way:
val source = io.Source.fromFile("input.txt")
val data = source.getLines().map(line => line.split(" ").map(_.toFloat)).toArray
source.close()
Update: for additional information about using Source, check this thread.
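If you also want the file handle closed even when a line fails to parse, here is a small variation (the sample contents in the comments are made up):

import scala.io.Source

// input.txt (hypothetical):
//   1.0 2.0 3.0
//   4.0 5.0 6.0
val source = Source.fromFile("input.txt")
val data: Array[Array[Float]] =
  try source.getLines().map(_.split(" ").map(_.toFloat)).toArray
  finally source.close()

println(data(1)(2)) // 6.0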