Load CSV file as dataframe from resources within an Uber Jar - scala

So, I made a Scala application to run in Spark, and created the uber JAR using sbt assembly.
The file I load is a lookup needed by the application, so the idea is to package them together. It works fine from within IntelliJ using the path "src/main/resources/lookup01.csv".
I am developing on Windows and testing locally, and will afterwards deploy to a remote test server.
But when I call spark-submit on the Windows machine, I get the error:
"org.apache.spark.sql.AnalysisException: Path does not exist: file:/H:/dev/Spark/spark-2.4.3-bin-hadoop2.7/bin/src/main/resources/"
It seems it tries to find the file relative to the Spark home location instead of inside the JAR file.
How can I express the path so that the file is looked up inside the JAR package?
Here is example code showing how I load the DataFrame. After loading it, I transform it into other structures such as Maps.
val v_lookup = sparkSession.read.option( "header", true ).csv( "src/main/resources/lookup01.csv")
What I would like to achieve is a way to express the path so that it works in every environment where I run the JAR, ideally also from within IntelliJ while developing.
Edit: scala version is 2.11.12
Update:
It seems that to get a handle on the file inside the JAR, I have to read it as a stream. The code below worked, but I can't figure out a safe way to extract the headers of the file, as sparkSession.read.option does.
val fileStream = scala.io.Source.getClass.getResourceAsStream("/lookup01.csv")
val inputDF = sparkSession.sparkContext.makeRDD(scala.io.Source.fromInputStream(fileStream).getLines().toList).toDF
When makeRDD is applied, I get the RDD and can then convert it to a DataFrame, but it seems I lose the ability to use the option from "read" that parses the header into the schema.
Is there any way around this when using makeRDD?
Another problem with this approach is that it seems I will have to parse the lines into columns manually.

You have to get the correct path from the classpath.
Considering that your file is under src/main/resources:
// getResource returns a java.net.URL, while csv() expects a String path
val path = getClass.getResource("/lookup01.csv").getPath
val v_lookup = sparkSession.read.option( "header", true ).csv(path)

So, everything points to the fact that once the file is inside the JAR, it can only be accessed as an input stream that reads the data from within the compressed archive.
I arrived at a solution that, even though it's not pretty, does what I need: read a CSV file, take the first two columns into a DataFrame, and then load them into a key-value structure (in this case I created a case class to hold the pairs).
I am considering migrating these lookups to a HOCON file, which may make loading them less convoluted.
import sparkSession.implicits._
val fileStream = scala.io.Source.getClass.getResourceAsStream("/lookup01.csv")
val input = sparkSession.sparkContext.makeRDD(scala.io.Source.fromInputStream(fileStream).getLines().toList).toDF()
val myRdd = input.map {
  line =>
    val col = utils.Utils.splitCSVString(line.getString(0))
    KeyValue(col(0), col(1))
}
val myDF = myRdd.rdd.map(x => (x.key, x.value)).collectAsMap()
fileStream.close()
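On the header question from the update: since Spark 2.2, DataFrameReader.csv also accepts a Dataset[String], so the lines read from the resource stream can be handed to Spark's own CSV parser instead of being split by hand. A minimal sketch, assuming Spark 2.2 or later; the names ResourceCsv, linesToDataFrame and readCsvResource are made up for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.io.Source

object ResourceCsv {
  // Parses in-memory CSV lines with Spark's CSV reader; the first line becomes the column names.
  def linesToDataFrame(spark: SparkSession, lines: Seq[String]): DataFrame = {
    import spark.implicits._
    spark.read.option("header", "true").csv(lines.toDS())
  }

  // Reads a classpath resource such as "/lookup01.csv" through a stream,
  // which works both from the IDE and from inside the JAR.
  def readCsvResource(spark: SparkSession, resource: String): DataFrame = {
    val stream = getClass.getResourceAsStream(resource)
    require(stream != null, s"resource not found on classpath: $resource")
    val lines =
      try Source.fromInputStream(stream, "UTF-8").getLines().toList
      finally stream.close()
    linesToDataFrame(spark, lines)
  }
}
```

It would be called as ResourceCsv.readCsvResource(sparkSession, "/lookup01.csv"). Because the file is split on line breaks before parsing, quoted fields that span several lines are not handled, and the whole file is held in driver memory, which should be acceptable for a small lookup.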

Related

Searching all file names recursively in hdfs using Spark

I’ve been looking for a while now for a way to get all filenames in a directory and its sub-directories in the Hadoop file system (HDFS).
I found out I can use these commands to get them:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
sc.wholeTextFiles(path).map(_._1)
Here is "wholeTextFiles" documentation:
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
Parameters:
path - Directory to the input data files, the path can be comma separated paths as the list of inputs.
minPartitions - A suggestion value of the minimal splitting number for input data.
Returns:
RDD representing tuples of file path and the corresponding file content
Note: Small files are preferred, large file is also allowable, but may cause bad performance. On some filesystems, .../path/* can be a more efficient way to read all files in a directory rather than .../path/ or .../path. Partitioning is determined by data locality. This may result in too few partitions by default.
As you can see, "wholeTextFiles" returns a pair RDD with both the filenames and their content. So I tried mapping it and taking only the file names, but I suspect it still reads the files.
The reason I suspect so: if I try to count (for example), I get the Spark equivalent of "out of memory" (losing executors and not being able to complete the tasks).
I would rather use Spark to achieve this goal the fastest way possible, however, if there are other ways with a reasonable performance I would be happy to give them a try.
EDIT:
To clarify: I want to do it using Spark. I know I can do it using HDFS commands and the like, but I would like to know how to do it with the existing tools provided with Spark, and maybe get an explanation of how I can make "wholeTextFiles" not read the text itself (much like how transformations only happen after an action and some of the "commands" never really run).
Thank you very much!
This is a way to list all the files down to the deepest subdirectory without using wholeTextFiles.
It calls itself recursively for each subdirectory.
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path, RemoteIterator}
import org.apache.spark.SparkContext
import scala.collection.mutable.ListBuffer

val lb = new ListBuffer[String] // variable to hold final list of files
def getAllFiles(path: String, sc: SparkContext): ListBuffer[String] = {
  val conf = sc.hadoopConfiguration
  val fs = FileSystem.get(conf)
  val files: RemoteIterator[LocatedFileStatus] = fs.listLocatedStatus(new Path(path))
  while (files.hasNext) { // iterate over the entries directly under path
    val status = files.next
    val filepath = status.getPath.toString
    //println(filepath)
    if (status.isDirectory) getAllFiles(filepath, sc) // recursive call, only for subdirectories
    else lb += filepath // plain file: record its path
  }
  lb
}
That's it. It was tested successfully; you can use it as is.
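Hadoop's FileSystem also offers listFiles(path, recursive), which performs the descent itself and only touches file metadata, never file contents. A minimal sketch, assuming the Hadoop client libraries that ship with Spark are on the classpath; the names HdfsListing and listFilesRecursively are made up for illustration:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import scala.collection.mutable.ListBuffer

object HdfsListing {
  // Returns the paths of all files under `root`, descending into subdirectories.
  def listFilesRecursively(root: String, conf: Configuration): List[String] = {
    val rootPath = new Path(root)
    val fs = rootPath.getFileSystem(conf) // picks hdfs://, file:/, etc. from the path itself
    val it = fs.listFiles(rootPath, true) // true = recursive
    val out = ListBuffer.empty[String]
    while (it.hasNext) out += it.next().getPath.toString
    out.toList
  }
}
```

From a Spark job it could be called as HdfsListing.listFilesRecursively(path, sc.hadoopConfiguration). The listing runs on the driver, so no executors are involved and nothing is loaded into an RDD.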

Scala Spark and Twitter feed

I am following some code that connects to twitter and then writes out that data to a local text file. Here is my code:
System.setProperty("twitter4j.oauth.consumerKey", "Mycode - Not going to put real one in for obvious reasons")
System.setProperty("twitter4j.oauth.consumerSecret", "Mycode")
System.setProperty("twitter4j.oauth.accessToken", "Mycode")
System.setProperty("twitter4j.oauth.accessTokenSecret", "Mycode")
val ssc = new StreamingContext(spark.sparkContext, Seconds(5))
val twitterStream = TwitterUtils.createStream(ssc, None)
twitterStream.saveAsTextFiles("streamouts/tweets", "txt")
ssc.start()
Thread.sleep(30000)
ssc.stop(false)
Now, the code is not complaining about any missing references or anything. I believe I have the correct SBT dependencies.
The code above seems to run. It creates the folder structure and the text files within it. However, ALL of the text files are completely blank, 0 KB in size.
What am I doing wrong? Does anyone have any idea why it looks like it is creating the output text files but not actually writing to them?
By the way:
I have triple checked the consumer keys, access tokens etc from the twitter app. I'm certain I have copied them over correctly.
Conor
The code looks fine in your case.
why it looks like it is creating the output text files but not actually writing to them?
Because of new StreamingContext(spark.sparkContext, Seconds(5)), the batch interval is 5 seconds.
For each 5-second interval, it collects the data that arrived and creates an RDD. Each RDD is then written using the prefix and suffix that you pass to saveAsTextFiles.
So your files may be empty when the RDD is empty. Otherwise, look inside the generated folders: the data should be in the files named part-00000, part-00001, part-00002, not in _SUCCESS or .part-00000.crc.
I hope this helps you.
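If the empty batch folders are a nuisance, one option is to write a batch only when it actually received records. A minimal sketch, assuming spark-streaming is on the classpath; NonEmptyBatches, batchPath and saveNonEmpty are hypothetical names, and the path format imitates the "prefix-TIME_IN_MS.suffix" naming used by saveAsTextFiles:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

object NonEmptyBatches {
  // Builds "<prefix>-<batch time in ms>.<suffix>" for one batch.
  def batchPath(prefix: String, suffix: String, timeMs: Long): String =
    s"$prefix-$timeMs.$suffix"

  // Registers an output operation that skips batches with no records.
  def saveNonEmpty[T](stream: DStream[T], prefix: String, suffix: String): Unit =
    stream.foreachRDD { (rdd: RDD[T], time: Time) =>
      if (!rdd.isEmpty()) rdd.saveAsTextFile(batchPath(prefix, suffix, time.milliseconds))
    }
}
```

It would replace the saveAsTextFiles call, for example NonEmptyBatches.saveNonEmpty(twitterStream, "streamouts/tweets", "txt"), and must be registered before ssc.start(). If no folders appear at all with this variant, the stream is receiving no tweets, which points at the connection or credentials rather than at the writing step.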

Spark - Get from a directory with nested folders all filenames of a particular data type

I have a directory with some subfolders which contain different parquet files. Something like this:
2017-09-05
  10-00
    part00000.parquet
    part00001.parquet
  11-00
    part00000.parquet
    part00001.parquet
  12-00
    part00000.parquet
    part00001.parquet
What I want is to pass the path to the directory 2017-09-05 and get a list of the names of all parquet files.
I was able to achieve it, but in a very inefficient way:
val allParquetFiles = sc.wholeTextFiles("C:/MyDocs/2017-09-05/*/*.parquet")
allParquetFiles.keys.foreach((k) => println("The path to the file is: "+k))
So each key is the name I am looking for, but this process requires me to load all files as well, which then I can't use, since I get them in binary (and I don't know how to convert them into a dataframe).
Once I have the keys (so the list of filePaths) I am planning to invoke:
val myParquetDF = sqlContext.read.parquet(filePath);
As you may have already understood I am quite new in Spark. So please if there is a faster or easier approach to read a list of parquet files located in different folders, please let me know.
My partial solution: I wasn't able to get the paths of all files in a folder, but I was able to load the content of all files of that type into the same DataFrame, which was my ultimate goal. In case someone needs it in the future, I used the following line:
val df = sqlContext.read.parquet("C:/MyDocs/2017-09-05/*/*.parquet")
Thanks for your time
You can do it using the HDFS API like this:
import org.apache.hadoop.fs._
import org.apache.hadoop.conf._
val fs = FileSystem.get(new Configuration())
// globStatus expands the wildcards; listStatus would treat the pattern as a literal path
val files = fs.globStatus(new Path("C:/MyDocs/2017-09-05/*/*.parquet")).map(_.getPath.toString)
First, it is better to avoid using wholeTextFiles. This method reads each whole file at once. Try to use the textFile method instead.
Second, if you need to get all files recursively in one directory, you can achieve it with the textFile method:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
This configuration enables recursive search (it works for Spark jobs as well as for MapReduce jobs). Then just invoke sc.textFile(path).

saveAsTextFile method in spark

In my project, I have three input files and pass their names as args(0) to args(2). I also have an output filename as args(3). In the source code, I use
val sc = new SparkContext()
var log = sc.textFile(args(0))
for(i <- 1 until args.size - 1) log = log.union(sc.textFile(args(i)))
I do nothing to the log but save it as a text file by using
log.coalesce(1, true).saveAsTextFile(args(args.size - 1))
but it still saves 3 files, part-00000, part-00001 and part-00002. So is there any way that I can save the three input files to a single output file?
Having multiple output files is a standard behavior of multi-machine clusters like Hadoop or Spark. The number of output files depends on the number of reducers.
How to "solve" it in Hadoop:
merge output files after reduce phase
How to "solve" in Spark:
how to make saveAsTextFile NOT split output into multiple file?
You can also find good information here:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-Spark-merge-the-output-file-td322.html
So, you were right about coalesce(1, true). However, it is very inefficient. Interestingly (as @climbage mentioned in his remark), your code works if you run it locally.
What you might try is to read the files first and then save the output.
...
val sc = new SparkContext()
// collect() brings the lines back to the driver; appending to a driver-side variable inside foreach would not work on a cluster
val lines = (0 until args.size - 1).flatMap(i => sc.textFile(args(i)).collect())
//and now you might save the content
sc.parallelize(lines, 1).saveAsTextFile("out")
Note: this code is also extremely inefficient and works for small files only! You need to come up with better code. I wouldn't try to reduce the number of files, but would process the multiple output files instead.
As mentioned, your problem is somewhat unavoidable via the standard APIs, as the assumption is that you are dealing with large quantities of data. However, if I assume your data is manageable, you could try the following:
import java.nio.file.{Paths, Files}
import java.nio.charset.StandardCharsets
Files.write(Paths.get("./test_file"), data.collect.mkString("\n").getBytes(StandardCharsets.UTF_8))
What I am doing here is converting the RDD into a String by performing a collect and then mkString. I would suggest not doing this in production. It works fine for local data analysis (working with ~5 GB of local data).
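Another option along the "merge output files afterwards" line is to let Spark write its part files as usual and concatenate them in a separate step. A minimal sketch for output on the local filesystem only (output on HDFS would need the Hadoop FileSystem API instead); MergeParts and mergePartFiles are made-up names:

```scala
import java.nio.file.{Files, Path, StandardOpenOption}
import scala.collection.JavaConverters._

object MergeParts {
  // Concatenates every part-* file in `outputDir`, in name order, into `target`.
  // Marker files such as _SUCCESS and hidden .crc files are skipped.
  def mergePartFiles(outputDir: Path, target: Path): Unit = {
    val listing = Files.list(outputDir)
    val parts =
      try listing.iterator().asScala
        .filter(_.getFileName.toString.startsWith("part-"))
        .toList
        .sortBy(_.getFileName.toString)
      finally listing.close()
    Files.deleteIfExists(target)
    Files.createFile(target)
    // Each part is read fully into memory before being appended.
    parts.foreach(p => Files.write(target, Files.readAllBytes(p), StandardOpenOption.APPEND))
  }
}
```

After log.saveAsTextFile(args(3)) it could be called as MergeParts.mergePartFiles(Paths.get(args(3)), Paths.get("merged.txt")), with java.nio.file.Paths imported. Unlike the collect-based approach, only one part file at a time is held in memory.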

Private assets in Play 2.1

In a Play 2.1 application, where is the proper place to store private assets?
By "private asset", I mean a data file that is used by the application but not accessible to the user.
For example, if I have a text file (Foo.json) that contains sample data that is parsed every time the application starts, what would be the proper directory in the project to store it?
Foo.json needs to be included in the deployment, and needs to be uniformly accessible from the code in both development and production.
Some options:
Usually such files go in the conf folder, i.e. conf/privatefiles/Foo.json
If they change often, you can consider adding to your application.conf the full path to an external folder somewhere in the filesystem. In that case you'll be able to edit the content easily without redeploying the app: /home/scrapdog/privatefiles/Foo.json
You can store them in database as well, benefits are the same as in previous option - easy editing.
In all cases consider using memory cache to avoid reading it from filesystem/database every time when required.
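The caching advice can be as simple as a lazy val, which Scala evaluates once on first access and then reuses. A minimal sketch; Cached and PrivateFiles are hypothetical names, and the file location just follows the conf/privatefiles/Foo.json example from the first option:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Runs the loader at most once, on first access, and keeps the result in memory.
final class Cached[A](loader: () => A) {
  lazy val value: A = loader()
}

object PrivateFiles {
  // Nothing is read here; the file is only opened when `fooJson.value` is first used.
  val fooJson = new Cached(() =>
    new String(Files.readAllBytes(Paths.get("conf/privatefiles/Foo.json")), StandardCharsets.UTF_8))
}
```

Code would then read PrivateFiles.fooJson.value wherever the content is needed. With this approach an edited file is only picked up after a restart, so it fits the "parsed when the application starts" case better than data that changes while the application is running.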
I simply use a folder called data at the application root. You can use the name you want or better, store the actual name in the configuration file.
To resolve its path, I use the following snippet:
lazy val rootPath = {
  import play.api.Play.current
  play.api.Play.application.path.getPath
}
lazy val dataPath = rootPath + "/data/"
You can do what I did. I got the answer from @Marius Soutier here. Please upvote his answer there if you like it:
You can put "internal" documents in the conf folder, it's the equivalent to resources in standard sbt projects.
Basically create a dir under conf called json and to access it, you'd use Play.resourceAsStream(). Note that this gives you a java.io.InputStream because your file will be part of the JAR created by activator dist.
My example is using it in a view but you can modify it as you want.
Play.resourceAsStream("json/Foo.json") map { inputStream =>
  Ok(views.html.xxx(XXX.do_something_with_stream(inputStream)))
} getOrElse (InternalServerError)
You can also use Play.resource(), which gives you a java.net.URL. Calling getFile() on it returns the file path as a String (not a java.io.File).
Play.resource("json/Foo.json") map { fileURL =>
  Ok(views.html.xxx(XXX.do_something_with_file(fileURL.getFile())))
} getOrElse (InternalServerError)