Flattening an element of an RDD - Scala

I am using the Spark Scala API.
prods_grpd has the type RDD[(String, mutable.HashSet[String])]:
val prods_grpd = all_meds.aggregateByKey(initialSet)(addToSet, mergePartitionSets)
prods_grpd.saveAsTextFile("scratch/prods_grpdby_users.tsv")
When I save this RDD, I get the output below. The first value is the key, followed by the set of values grouped under it.
(8635214,Set(2013-01-01))
(3580112,Set(2013-01-01))
(146086,Set(2010-01-01, 2012-01-01))
(112220,Set(2013-01-01))
(2020,Set(2013-01-01))
(24218,Set(2013-01-01))
However, I want output like:
(8635214, 2013-01-01)
(3580112, 2013-01-01)
(146086, 2010-01-01, 2012-01-01)
(112220, 2013-01-01)
(2020, 2013-01-01)
(24218, 2013-01-01)
I would like to know how to unnest/flatten the second element of the RDD.

You cannot simply convert a Set to a Tuple, because tuples are not collections and don't support an arbitrary number of elements. Instead you can map entries to strings with the desired format:
prods_grpd.map { case (k, s) =>
  val sstr = s.mkString(",")
  s"($k,$sstr)"
}.saveAsTextFile(...)
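If you instead wanted one output record per (key, date) pair rather than a comma-joined string, a flatMapValues sketch could look like this (the output path is made up for illustration):
prods_grpd
  .flatMapValues(dates => dates)                        // one (key, date) pair per element of the set
  .saveAsTextFile("scratch/prods_flattened_by_user")    // hypothetical path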

Related

Converting literal to RDD for subsequent Cartesian Product

I cannot find in the documentation how the result of the line below:
val DIM_Key_Max = rddA.map(x => (x._1)).max
can subsequently be converted to a single-entry RDD for joining with another RDD, or rather for a cartesian product.
I can't find that anywhere. Can anyone help?
max returns a single object. To turn it into a single entry RDD, use parallelize:
sc.parallelize(List(DIM_Key_Max))
This returns an RDD with a single entry that can be used e.g. as an argument to cartesian.
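For context, a minimal end-to-end sketch (rddB and the sample data are made up for illustration):
val rddA = sc.parallelize(Seq((1, "a"), (5, "b"), (3, "c")))   // made-up sample data
val rddB = sc.parallelize(Seq("x", "y"))                       // made-up second RDD
val DIM_Key_Max = rddA.map(x => x._1).max                      // a plain Int on the driver (5)
val maxRdd = sc.parallelize(List(DIM_Key_Max))                 // single-entry RDD[Int]
val product = maxRdd.cartesian(rddB)                           // RDD[(Int, String)]: (5,x), (5,y)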
You are getting something wrong here: max will not return an RDD that can be joined with another RDD.
val rdd = sc.parallelize(Array((1, 2), (3, 4), (5, 6))).map(x => x._1).max
rdd
rdd: Int = 5
rdd.getClass
res2: Class[Int] = int

Spark: format of an rdd to convert to dataframe

Assume I have the following RDD:
val rdd = sc.parallelize(Seq(('a'.toString,1.1,Array(1.1,2.2),0),
('b'.toString,1.5,Array(1.4,4.2),3),
('d'.toString,2.1,Array(3.3,7.4),4)))
rdd: org.apache.spark.rdd.RDD[(String, Double, Array[Double], Int)]
I want to write the output in CSV format using .write.format("com.databricks.spark.csv"), which takes a DataFrame.
So first I need to convert the current schema to RDD[(String, String, String, String, String)] and then convert it to a DataFrame. I tried the following:
rdd.map { case (a, b, c, d) => (a, b, c.mkString(","), d) }
but this outputs:
RDD[(String, Double, String, Int)]
Any idea how to do it?
UPDATE
To work with tuples, you have to know how many elements you're going to put in them and define the type up front. Hence, to work with a variable number of elements, you'll probably need to work with some collection.
For your use case, something like this can work:
rdd.map { case (a, b, c, d) => a +: (b +: c) :+ d }.map(_.mkString(","))
This will result in an RDD[String], with one element per line of the CSV file.
You're prepending and appending the other elements to the array c, so you end up with a single Array.
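If the goal is the databricks CSV writer, a possible follow-up sketch (assuming a Spark 1.x sqlContext and the spark-csv package on the classpath; the column names and output path here are made up):
import sqlContext.implicits._

val df = rdd
  .map { case (a, b, c, d) => (a, b.toString, c.mkString(","), d.toString) }
  .toDF("id", "value", "scores", "count")    // hypothetical column names

df.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("output/rdd_as_csv")                 // hypothetical path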

Split and choose in scala

I found some explanations of how to do this, but I still can't do it.
I want to split val data = sc.textFile("hdfs://ncdc/isd-history.csv")
The data has the form: ("949999","00338","PORTLAND (CASHMORE)","AS","","","-38.320","+141.480","+0081.0","19690724","19781113")
I want to split the data and take only the 1st field (949999) and the 3rd (PORTLAND (CASHMORE)).
I have done this:
val RDD = (data.filter(s => (s.split(',')(0) , s.split(',')(2))))
But, it doesn't work.
RDD.filter filters records, not "columns" - it expects a function from the record type (String, I assume, in this case) to Boolean, and would filter out all records for which this function returned false.
You're trying to transform each record from a String into a tuple (while "filtering" out parts of that string), so you should use RDD.map instead of RDD.filter:
val RDD = data.map(s => (s.split(',')(0), s.split(',')(2)))
Or better yet:
val RDD = data.map(_.split(',')).map(arr => (arr(0), arr(2)))
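If some lines of the file could be malformed, a slightly more defensive sketch (the minimum field count of 3 is just an assumption about the data):
val pairs = data
  .map(_.split(','))
  .filter(_.length >= 3)               // skip rows that don't have at least 3 fields
  .map(arr => (arr(0), arr(2)))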
You should use split to split strings, not collections.
If this is an RDD of tuples, this should work:
val RDD = data map(row => (row._1, row._3))
If this is an RDD of Array/Seq[String], just substitute indexes 0 and 2 for _1 and _3.

How do I split a Spark rdd Array[(String, Array[String])]?

I'm practicing doing sorts in the Spark shell. I have an RDD with about 10 columns/variables. I want to sort the whole RDD on the values of column 7.
rdd
org.apache.spark.rdd.RDD[Array[String]] = ...
From what I gather, the way to do that is by using sortByKey, which in turn only works on pairs. So I mapped it so I'd have a pair consisting of column 7 (string values) and the full original record (array of strings):
rdd2 = rdd.map(c => (c(7),c))
rdd2: org.apache.spark.rdd.RDD[(String, Array[String])] = ...
I then apply sortByKey, still no problem...
rdd3 = rdd2.sortByKey()
rdd3: org.apache.spark.rdd.RDD[(String, Array[String])] = ...
But now how do I split off, collect and save that sorted original rdd from rdd3 (Array[String])? Whenever I try a split on rdd3 it gives me an error:
val rdd4 = rdd3.map(_.split(',')(2))
<console>:33: error: value split is not a member of (String, Array[String])
What am I doing wrong here? Are there other, better ways to sort an rdd on one of its columns?
What you did with rdd2 = rdd.map(c => (c(7),c)) is map each record to a tuple:
rdd2: org.apache.spark.rdd.RDD[(String, Array[String])]
exactly as it says :).
Now if you want to split the record, you need to get it out of this tuple.
You can map again, taking only the second part of the tuple (which is the Array[String]), like so: rdd3.map(_._2)
But I would strongly suggest trying rdd.sortBy(_(7)) or something of that sort. That way you do not need to bother with tuples and such.
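For instance, a sketch of the whole pipeline using sortBy (keeping the question's column index 7; the output path is made up):
val sorted = rdd.sortBy(_(7))                                  // sort the Array[String] records by column 7
sorted.map(_.mkString(",")).saveAsTextFile("sorted_output")    // hypothetical output path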
If you want to sort the RDD using the 7th string in the array, you can do it directly with
rdd.sortBy(_(6)) // array starts at 0 not 1
or
rdd.sortBy(arr => arr(6))
That will save you all the hassle of doing multiple transformations. The reason why rdd.sortBy(_._7) or rdd.sortBy(x => x._7) won't work is because that's not how you access an element inside an Array. To access the 7th element of an array, say arr, you should do arr(6).
To test this, I did the following:
val rdd = sc.parallelize(Array(Array("ard", "bas", "wer"), Array("csg", "dip", "hwd"), Array("asg", "qtw", "hasd")))
// I want to sort it using the 3rd String
val sorted_rdd = rdd.sortBy(_(2))
Here's the result, now ordered by the 3rd string:
Array(Array("asg", "qtw", "hasd"), Array("csg", "dip", "hwd"), Array("ard", "bas", "wer"))
Just do this:
val rdd4 = rdd3.map(_._2)
It seems you're not familiar with Scala, so the following should help you understand more:
rdd3.map(kv => {
  println(kv._1) // this is the String (the key)
  println(kv._2) // this is the Array[String] (the value)
})

transform rdd into pairRDD

This is a newbie question.
Is it possible to transform an RDD like (key,1,2,3,4,5,5,666,789,...) with a dynamic dimension into a pairRDD like (key, (1,2,3,4,5,5,666,789,...))?
I feel like it should be super easy, but I can't figure out how.
The point of doing it is that I would like to sum all the values, but not the key.
Any help is appreciated.
I am using Spark 1.2.0
EDIT: Enlightened by the answer, let me explain my use case in more depth. I have N (unknown at compile time) different pair RDDs of (key, value) that have to be joined and whose values must be summed up. Is there a better way than the one I was thinking of?
First of all, if you just want to sum all the integers but the first, the simplest way would be:
val rdd = sc.parallelize(List(1, 2, 3))
rdd.cache()
val first = rdd.first()
val result = rdd.sum() - first
On the other hand, if you want to have access to the index of the elements, you can use the RDD zipWithIndex method like this:
val indexed = rdd.zipWithIndex()
indexed.cache()
val result = (indexed.first()._1, indexed.filter(_._2 != 0))
But in your case this feels like overkill.
One more thing I would add: putting the key as the first element of your RDD looks like questionable design. Why not instead use pairs (key, rdd) in your driver program? It's quite hard to reason about the order of elements in an RDD, and I cannot think of a natural situation in which the key is computed as the first element of an RDD (of course I don't know your use case, so I can only guess).
EDIT
If you have one RDD of key-value pairs and you want to sum them by key, just do:
val result = rdd.reduceByKey(_ + _)
If you have many RDDs of key-value pairs, you can just union them all before reducing:
val list = List(pairRDD0, pairRDD1, pairRDD2)
// another pairRDD arrives at runtime
val newList = anotherPairRDD0 :: list
val pairRDD = newList.reduce(_ union _)
val resultSoFar = pairRDD.reduceByKey(_ + _)
// another pairRDD arrives at runtime
val result = resultSoFar.union(anotherPairRDD1).reduceByKey(_ + _)
EDIT
I edited the example. As you can see, you can add an additional RDD whenever it comes up at runtime. This is because reduceByKey returns an RDD of the same type, so you can repeat this operation (of course you will have to consider performance).
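To make that concrete, a self-contained sketch with toy data (keys and values invented for illustration):
val pairRDD0 = sc.parallelize(Seq(("a", 1), ("b", 2)))
val pairRDD1 = sc.parallelize(Seq(("a", 10), ("c", 3)))
val pairRDD2 = sc.parallelize(Seq(("b", 5)))

val summed = List(pairRDD0, pairRDD1, pairRDD2)
  .reduce(_ union _)     // one RDD with all the pairs
  .reduceByKey(_ + _)    // sum values per key

summed.collect()         // Array((a,11), (b,7), (c,3)) -- order may vary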