Finding the union of RDDs which may not exist - scala

I'm trying to get the union of a few RDDs. The RDDs are being read in via SparkContext.textFile, but some may not exist on the file system.
val rdd1 = Try(Repository.fetch(data1Path))
val rdd2 = Try(Repository.fetch(data2Path))
val rdd3 = Try(Repository.fetch(data3Path))
val rdd4 = Try(Repository.fetch(data4Path))
val all = Seq(rdd1, rdd2, rdd3, rdd4)
val union = sc.union(all.map {case Success(r) => r})
val results = union.filter(some-filter-logic).collect
However, due to lazy evaluation, all of those Try statements evaluate to Success regardless of whether the files are present, and I end up with a FileNotFoundException when collect is called.
Is there a way around this?

You can loop over the paths, check whether each file exists, and create an RDD only for the files that do, then take the union (see the sketch at the end of this answer).
OR
you can use the wholeTextFiles API to read all the files in a directory as (filename, content) key-value pairs:
val rdd = sc.wholeTextFiles(path, minPartitions)
Even if some of the files are empty, this will not cause any issue.
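For the first approach, here is a minimal sketch, assuming the paths point at HDFS (or whatever file system sc is configured for) and that data1Path .. data4Path are the same path values as in the question; it uses the Hadoop FileSystem API to keep only the paths that actually exist before building the union:
import org.apache.hadoop.fs.{FileSystem, Path}

// Candidate input paths (data1Path .. data4Path from the question).
val paths = Seq(data1Path, data2Path, data3Path, data4Path)

// Check existence eagerly on the driver, before any RDD is created.
val fs = FileSystem.get(sc.hadoopConfiguration)
val existing = paths.filter(p => fs.exists(new Path(p)))

// Build RDDs only for the files that are actually present and union them.
val union = sc.union(existing.map(p => sc.textFile(p)))
Because the existence check runs eagerly on the driver, no FileNotFoundException is deferred to the collect stage.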

Related

spark scala reducekey dataframe operation

I'm trying to do a count in scala with a dataframe. My data has 3 columns and I've already loaded it and split it by tab. So I want to do something like this:
val file = file.map(line=>line.split("\t"))
val x = file1.map(line=>(line(0), line(2).toInt)).reduceByKey(_+_,1)
I want to put the data in a dataframe, but I'm having some trouble with the syntax:
val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
.count()
Can someone help check if this is correct?
Spark needs to know the schema of the DataFrame.
There are many ways to specify the schema; here is one option:
// toDF and the 'a column syntax need the implicits in scope,
// e.g. import spark.implicits._ (or sqlContext.implicits._ on older versions)
val df = file
  .map(line => line.split("\t"))
  .map(l => (l(0), l(1).toInt)) // at this point Spark knows the number of columns and their types
  .toDF("a", "b")               // give the columns names for ease of use
df
  .groupBy('a)
  .count()
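For completeness, a minimal end-to-end sketch, assuming the input is a tab-separated text file at a hypothetical path input.tsv, that the key is in column 0 and the value in column 2 as in the question, and that a SparkSession named spark is available:
import spark.implicits._

val df = spark.sparkContext
  .textFile("input.tsv")        // hypothetical path
  .map(_.split("\t"))
  .map(l => (l(0), l(2).toInt))
  .toDF("a", "b")

df.groupBy("a").count().show()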

null pointer exception while converting dataframe to list inside udf

I am reading 2 different .csv files which each have only one column, as below:
val dF1 = sqlContext.read.csv("some.csv").select($"ID")
val dF2 = sqlContext.read.csv("other.csv").select($"PID")
I am trying to check whether dF2("PID") exists in dF1("ID"):
val getIdUdf = udf((x:String)=>{dF1.collect().map(_(0)).toList.contains(x)})
val dfFinal = dF2.withColumn("hasId", getIdUdf($"PID"))
This gives me null pointer exception.
but if I collect dF1 into a list outside the UDF and use that list inside it, it works:
val dF1 = sqlContext.read.csv("some.csv").select($"ID").collect().map(_(0)).toList
val getIdUdf = udf((x:String)=>{dF1.contains(x)})
val dfFinal = dF2.withColumn("hasId", getIdUdf($"PID"))
I know I can use a join to get this done, but I want to know the reason for the null pointer exception here.
Thanks.
Please check this question about accessing a dataframe inside the transformation of another dataframe. This is exactly what you are doing with your UDF, and it is not possible in Spark: the dataframe is a driver-side handle, so the reference captured by the UDF is not usable in code that runs on the executors, which is why you get the null pointer exception. The solution is either to use a join, or to collect the values outside the transformation and broadcast them.
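A minimal sketch of both suggested fixes, assuming dF1 and dF2 are read as in the question and that the SparkContext sc is in scope:
import org.apache.spark.sql.functions.{col, udf}

// Option 1: a left outer join, turned into the boolean "hasId" column from the question.
val dfJoined = dF2
  .join(dF1, dF2("PID") === dF1("ID"), "left_outer")
  .withColumn("hasId", dF1("ID").isNotNull)
  .select(dF2("PID"), col("hasId"))

// Option 2: collect the IDs on the driver, broadcast them, and use them in the UDF.
val ids = dF1.collect().map(_.getString(0)).toSet
val idsBroadcast = sc.broadcast(ids)
val hasIdUdf = udf((x: String) => idsBroadcast.value.contains(x))
val dfUdf = dF2.withColumn("hasId", hasIdUdf(col("PID")))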

Save two or more different RDDs in a single text file in scala

When I use saveAsTextFile like this:
rdd1.saveAsTextFile("../savefile")
rdd2.saveAsTextFile("../savefile")
I can't put two different RDDs into a single text file. Is there a way I can do so?
Besides, is there a way I can apply some format to the text I am writing to the text file? For example, add a \n or some other format.
A single text file is rather ambiguous in Spark. Each partition is saved individually, which means you get one file per partition. If you want a single file for an RDD you have to move your data to a single partition or collect it, and most of the time that is either too expensive or simply not feasible.
You can get a union of RDDs using the union method (or ++ as mentioned by lpiepiora in the comments), but it works only if both RDDs are of the same type:
val rdd1 = sc.parallelize(1 to 5)
val rdd2 = sc.parallelize(Seq("a", "b", "c", "d", "e"))
rdd1.union(rdd2)
// <console>:26: error: type mismatch;
// found : org.apache.spark.rdd.RDD[String]
// required: org.apache.spark.rdd.RDD[Int]
// rdd1.union(rdd2)
If the types are different the whole idea smells fishy, though.
If you want a specific format you have to apply it before calling saveAsTextFile, which simply calls toString on each element.
Putting all of the above together:
import org.apache.spark.rdd.RDD
val rddStr1: RDD[String] = rdd1.map(x => ???) // Map to RDD[String]
val rddStr2: RDD[String] = rdd2.map(x => ???)
rddStr1.union(rddStr2)
  .repartition(1) // Not recommended!
  .saveAsTextFile(some_path)

Spark Union inside a loop gives void

I'm trying to build an RDD by iteratively unioning another RDD inside a loop, but the result only works if I perform an action on the resulting RDD inside the loop.
var rdd: RDD[Int] = sc.emptyRDD
for (i <- 1 to 5) {
  val rdd1 = sc.parallelize(Array(1))
  rdd = rdd ++ rdd1
}
// rdd.foreach(println) => void
for (i <- 1 to 5) {
  val rdd1 = sc.parallelize(Array(1))
  rdd = rdd ++ rdd1
  rdd.foreach(x => x)
}
// rdd.foreach(println) => (1,1,1,1,1)
If I create rdd1 outside the loop everything works fine, but not inside.
Is there a specific lightweight action to solve this problem?
One thing to keep in mind is that when you apply the foreach action to your RDD, the action is applied on each individual worker. Therefore, in the first case, if you check the stdout of each executor, you will find the printed values from rdd. If you want these values to be printed to your console, you can aggregate the elements of the RDD (or a subset of them) at the driver and then apply your function (e.g. rdd.collect.foreach(println), rdd.take(3).foreach(println), etc.).
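A minimal sketch of the same loop with the printing moved to the driver (assuming sc is the SparkContext from the shell):
import org.apache.spark.rdd.RDD

var rdd: RDD[Int] = sc.emptyRDD
for (i <- 1 to 5) {
  rdd = rdd ++ sc.parallelize(Array(1))
}

// Bring the elements back to the driver before printing,
// so the output shows up in the console rather than in executor logs.
rdd.collect().foreach(println) // prints 1 five times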

Joining two HDFS files in Spark

I want to join two files from HDFS using the spark shell.
Both files are tab-separated and I want to join on the second column.
I tried the code below, but it is not giving any output.
val ny_daily= sc.parallelize(List("hdfs://localhost:8020/user/user/NYstock /NYSE_daily"))
val ny_daily_split = ny_daily.map(line =>line.split('\t'))
val enKeyValuePair = ny_daily_split.map(line => (line(0).substring(0, 5), line(3).toInt))
val ny_dividend= sc.parallelize(List("hdfs://localhost:8020/user/user/NYstock/NYSE_dividends"))
val ny_dividend_split = ny_dividend.map(line =>line.split('\t'))
val enKeyValuePair1 = ny_dividend_split.map(line => (line(0).substring(0, 4), line(3).toInt))
enKeyValuePair1.join(enKeyValuePair)
But I am not getting any information on how to join files on a particular column.
Please suggest
I am not getting any information on how to join files on a particular column
RDDs are joined on their keys, so you decided the column to join on when you wrote:
val enKeyValuePair = ny_daily_split.map(line => (line(0).substring(0, 5), line(3).toInt))
...
val enKeyValuePair1 = ny_dividend_split.map(line => (line(0).substring(0, 4), line(3).toInt))
Your RDDs will be joined on the values coming from line(0).substring(0, 5) and line(0).substring(0, 4).
You can find the join function (and many other useful functions) here and the Spark Programming Guide is a great reference to understand how Spark works.
I tried the code below, but it is not giving any output
In order to see the output, you have to ask Spark to print it:
enKeyValuePair1.join(enKeyValuePair).foreach(println)
Note: to load data from files you should use sc.textFile(): sc.parallelize() is only used to make RDDs out of Scala collections.
The following code should do the job:
val ny_daily_split = sc.textFile("hdfs://localhost:8020/user/user/NYstock/NYSE_daily").map(line =>line.split('\t'))
val ny_dividend_split = sc.textFile("hdfs://localhost:8020/user/user/NYstock/NYSE_dividends").map(line =>line.split('\t'))
val enKeyValuePair = ny_daily_split.map(line => line(0).substring(0, 5) -> line(3).toInt)
val enKeyValuePair1 = ny_dividend_split.map(line => line(0).substring(0, 4) -> line(3).toInt)
enKeyValuePair1.join(enKeyValuePair).foreach(println)
By the way, you mentioned that you want to join on the second column, but you are actually using line(0); is this intended?
Hope this helps!