aggregate data for uniquely tagged values in a list in scala

I was wondering if somebody could help.
I'm trying to aggregate some data in a list based on id values. I have a ListBuffer which is updated from a foreach function, so each entry holds an id number and a value. Because the foreach applies a function to each id, often more than once, the list I end up with looks something like the following:
ListBuffer(3106;0, 3106;3, 3108;2, 3108;0, 3110;1, 3110;2, 3113;0, 3113;2, 3113;0)
What I want to do is apply a simple function to aggregate this data, so that I am left with:
List(3106;3, 3108;2, 3110;3, 3113;2)
I thought this could be done with foldLeft or groupBy; however, I'm not sure how to get it to distinguish the id values from the normal values.
Any help or pointers would be much appreciated.

First of all, you can't group key-value pairs in that form. In Scala you have tuples, which are written as
val pair: (Int, Int) = (3106, 3)
where
pair._1 == 3106
pair._2 == 3
are true statements.
So you have:
import scala.collection.mutable.ListBuffer

val l = ListBuffer((3106,0), (3106,3), (3108,2), (3108,0), (3110,1), (3110,2), (3113,0), (3113,2), (3113,0))
// group the tuples by id, keep only the values of each group, then sum them
val result = l.groupBy(x => x._1).map(x => (x._1, x._2.map(_._2))).map(x => (x._1, x._2.sum)).toList
println(result)
will give you
List((3106,3), (3108,2), (3110,3), (3113,2))
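Note that the result comes out of a Map, whose iteration order is not guaranteed, so if you want the ids in ascending order you can sort at the end. A compact sketch of the same aggregation (reusing l from above):
// group by id, sum the values in each group, then sort the resulting list by id
val sorted = l
  .groupBy(_._1)
  .map { case (id, pairs) => (id, pairs.map(_._2).sum) }
  .toList
  .sortBy(_._1)
// sorted: List((3106,3), (3108,2), (3110,3), (3113,2))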

Related

Scala - How to select the last element from an RDD?

At first I had a salesList: List[Sale], and in order to get the ID of the last Sale in the List I used lastOption:
val lastSaleId: Option[Any] = salesList.lastOption.map(_.saleId)
But now I've modified the method that used List[Sale] to work with salesListRdd: List[RDD[Sale]], so I've changed the way I get the ID of the last Sale:
val lastSaleId: Option[Any] = SparkContext
  .union(salesListRdd)
  .collect().toList
  .lastOption.map(_.saleId)
I'm not sure that this is the best way to go, because I'm still collecting the RDD to a List, which brings it to the driver node and may cause the driver to run out of memory.
Is there a way to get the ID of the last Sale from an RDD while preserving the initial order of the records? Not by any kind of sorting, but by the order in which the Sale objects were originally stored in the List?
There are at least two efficient solutions. You can either use top with zipWithIndex:
def lastValue[T](rdd: RDD[T]): Option[T] = {
  rdd.zipWithIndex().map(_.swap).top(1)(Ordering[Long].on(_._1)).headOption.map(_._2)
}
or top with custom key:
def lastValue[T](rdd: RDD[T]): Option[T] = {
  rdd.mapPartitionsWithIndex(
    (i, iter) => iter.zipWithIndex.map { case (x, j) => ((i, j), x) }
  ).top(1)(Ordering[(Int, Int)].on(_._1)).headOption.map(_._2)
}
The former requires an additional action for zipWithIndex (it triggers a Spark job to compute partition sizes), while the latter doesn't.
Before using either approach, please be sure you understand the limitation. Quoting the docs:
Note that some RDDs, such as those returned by groupBy(), do not guarantee order of elements in a partition. The unique ID assigned to each element is therefore not guaranteed, and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
In particular, depending on the exact input, the union might not preserve the input order at all.
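A hedged usage sketch, assuming the salesListRdd: List[RDD[Sale]] from the question, a SparkContext named sc, and that the order of the unioned RDD is the one you care about:
// Sketch only: union the RDDs, then take the element with the highest index
val lastSaleId: Option[Any] = lastValue(sc.union(salesListRdd)).map(_.saleId)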
You could use zipWithIndex and sort descending by it, so that the last record will be at the top, then take(1):
salesListRdd
  .zipWithIndex()
  .map({ case (x, y) => (y, x) })
  .sortByKey(ascending = false)
  .map({ case (x, y) => y })
  .take(1)
Solution is taken from here: http://www.swi.com/spark-rdd-getting-bottom-records/
However, it is highly inefficient, since it does a lot of partition shuffling.

Is there a way to filter out the elements of a List by checking them against elements of an Array in Scala?

I have a List in Scala:
val hdtList = hdt.split(",").toList
hdtList.foreach(println)
Output:
forecast_id bigint,period_year bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_system_name string,source_record_type string,gl_source_name string,gl_source_system_name string,year string,period string
There is an array which is obtained from a DataFrame by converting one of its columns to an array, as below:
val partition_columns = spColsDF.select("partition_columns").collect.flatMap(x => x.getAs[String](0).split(","))
partition_columns.foreach(println)
Output:
source_system_name
period_year
Is there a way to filter out the elements source_system_name string and period_year bigint from hdtList by checking them against the elements in the Array partition_columns, and put them into a new List?
I am confused about which collections to apply filter/map on and how to compare them.
Could anyone let me know how I can achieve that?
Unless I'm misunderstanding the question, I think this is what you need:
val filtered = hdtList.filter { x =>
  !partition_columns.exists { col => x.startsWith(col) }
}
In your case you need to use filter, because you need to remove elements from hdtList.
map is a function that transforms elements; there is no way to remove elements from a collection using map. If you have a List of X elements, after map runs you still have X elements, no fewer and no more.
val newList = hdtList.filter( x => partition_columns.exists(x.startsWith) )
Be aware that the filter + exists combination between two collections is an N×M algorithm. If your collections are big, you will have a performance problem.
One way to solve that problem is to use a Set.
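For instance, a minimal sketch of that idea, mirroring the filter above but assuming the column name is the first whitespace-delimited token of each hdtList entry:
// Set lookup is effectively O(1), so this pass is roughly O(N) instead of O(N*M)
val partitionSet = partition_columns.toSet
val matched = hdtList.filter(x => partitionSet.contains(x.takeWhile(_ != ' ')))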
It might be useful to have both lists: the hdt elements referenced in partition_columns, and the hdt elements that aren't.
val (pc, notPc) = hdtList.partition(w => partition_columns.contains(w.takeWhile(_ != ' ')))
//pc: List[String] = List(period_year bigint, source_system_name string)
//notPc: List[String] = List(forecast_id bigint, period_num bigint, ... etc.

Count operation in reduceByKey in spark

val temp1 = tempTransform.map { temp =>
  ((temp.getShort(0), temp.getString(1)), (USAGE_TEMP.getDouble(2), USAGE_TEMP.getDouble(3)))
}.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
Here I have performed a sum operation, but is it possible to do a count operation inside reduceByKey?
Something like what I have in mind:
reduceByKey((x, y) => (math.count(x._1), (x._2 + y._2)))
But this is not working. Any suggestions, please?
Well, counting is equivalent to summing 1s, so just map the first item in each value tuple into 1 and sum both parts of the tuple like you did before:
val temp1 = tempTransform.map { temp =>
  ((temp.getShort(0), temp.getString(1)), (1, USAGE_TEMP.getDouble(3)))
}.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
Result would be an RDD[((Short, String), (Int, Double))] where the first item in the value tuple (the Int) is the number of original records matching that key.
That's actually the classic map-reduce example - word count.
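For comparison, a minimal word-count sketch of the same pattern (assuming an RDD[String] named words):
// each word becomes (word, 1); reduceByKey then sums the 1s per word
val wordCounts = words.map(w => (w, 1)).reduceByKey(_ + _)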
No, you can't do that. RDDs provide an iterator model for lazy computation, so every element will be visited only once.
If you really want to do the sum as described, repartition your RDD first, then use mapPartitions and implement your calculation in the closure (keep in mind that the elements in an RDD are not in order).
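If you go that route, here is a rough sketch of the per-partition idea, assuming a keyed RDD[((Short, String), Double)] named keyed (note that toList materializes each partition in memory, so this is only a sketch):
// count and sum within each partition first, then combine the partials by key
val partial = keyed.mapPartitions { iter =>
  iter.toList
    .groupBy(_._1)
    .map { case (k, vs) => (k, (vs.size, vs.map(_._2).sum)) }
    .iterator
}
val counted = partial.reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))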

Comparing Subsets of an RDD

I’m looking for a way to compare subsets of an RDD intelligently.
Let's say I had an RDD with key/value pairs of type (Int -> T). I eventually need to say “compare all values of key 1 with all values of key 2 and compare values of key 3 to the values of key 5 and key 7”; how would I go about doing this efficiently?
The way I’m currently thinking of doing it is by creating a List of filtered RDDs and then using RDD.cartesian()
def filterSubset[T] = (b: Int, r: RDD[(Int, T)]) => r.filter { case (name, _) => name == b }

val keyPairs: List[(Int, Int)] = ??? // all key pairs
val rddPairs = keyPairs.map {
  case (a, b) => filterSubset(a, r).cartesian(filterSubset(b, r))
}
rddPairs.map { /* whatever I want to compare… */ }
I would then iterate the list and perform a map on each of the RDDs of pairs to gather the relational data that I need.
What I can’t tell about this idea is whether it would be extremely inefficient to set up possibly hundreds of map jobs and then iterate through them. In this case, would the lazy evaluation in Spark optimize the data shuffling between all of the maps? If not, can someone please recommend a possibly more efficient way to approach this issue?
Thank you for your help
One way you can approach this problem is to replicate and partition your data to reflect the key pairs you want to compare. Let's start by creating two maps from the actual keys to the temporary keys we'll use for replication and joins:
def genMap(keys: Seq[Int]) = keys
  .zipWithIndex.groupBy(_._1)
  .map { case (k, vs) => (k -> vs.map(_._2)) }

val left = genMap(keyPairs.map(_._1))
val right = genMap(keyPairs.map(_._2))
Next we can transform data by replicating with new keys:
def mapAndReplicate[T: ClassTag](rdd: RDD[(Int, T)], map: Map[Int, Seq[Int]]) = {
  rdd.flatMap { case (k, v) => map.getOrElse(k, Seq()).map(x => (x, (k, v))) }
}

// rdd is the original RDD[(Int, T)] of keyed values (called r in the question)
val leftRDD = mapAndReplicate(rdd, left)
val rightRDD = mapAndReplicate(rdd, right)
Finally we can cogroup:
val cogrouped = leftRDD.cogroup(rightRDD)
And compare / filter pairs:
cogrouped.values.flatMap{case (xs, ys) => for {
(kx, vx) <- xs
(ky, vy) <- ys
if cosineSimilarity(vx, vy) <= threshold
} yield ((kx, vx), (ky, vy)) }
Obviously, in its current form this approach is limited. It assumes that the values for an arbitrary pair of keys can fit into memory, and it requires a significant amount of network traffic. Still, it should give you some idea of how to proceed.
Another possible approach is to store the data in an external system (for example a database) and fetch the required key-value pairs on demand.
Since you're trying to find similarity between elements, I would also consider a completely different approach. Instead of naively comparing key by key, I would try to partition the data using a custom partitioner which reflects the expected similarity between documents. It is far from trivial in general, but should give much better results.
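A minimal sketch of such a partitioner, where bucketOf is a hypothetical, user-supplied function that maps keys expected to be similar to the same bucket:
import org.apache.spark.Partitioner

// Sketch only: keys that bucketOf sends to the same bucket land in the same partition
class SimilarityPartitioner(buckets: Int, bucketOf: Int => Int) extends Partitioner {
  override def numPartitions: Int = buckets
  override def getPartition(key: Any): Int =
    math.abs(bucketOf(key.asInstanceOf[Int]) % buckets)
}

// hypothetical usage: rdd.partitionBy(new SimilarityPartitioner(16, bucketOf))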
Using DataFrames you can easily do the cartesian operation using join:
dataframe1.join(dataframe2, dataframe1("key") === dataframe2("key"))
It will probably do exactly what you want, but efficiently.
If you don't know how to create a DataFrame, please refer to http://spark.apache.org/docs/latest/sql-programming-guide.html#creating-dataframes
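As a hypothetical sketch of that self-join, assuming a SparkSession named spark and an RDD of (Int, Double) pairs named rdd (any encodable value type would work):
import spark.implicits._

// each row pairs a value with every other value that shares the same key
val df = rdd.toDF("key", "value")
val pairs = df.as("a").join(df.as("b"), $"a.key" === $"b.key")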

transform rdd into pairRDD

This is a newbie question.
Is it possible to transform an RDD like (key,1,2,3,4,5,5,666,789,...) with a dynamic dimension into a pairRDD like (key, (1,2,3,4,5,5,666,789,...))?
I feel like it should be super-easy but I cannot get how to.
The point of doing it is that I would like to sum all the values, but not the key.
Any help is appreciated.
I am using Spark 1.2.0
EDIT: enlightened by the answer, let me explain my use case in more depth. I have N (unknown at compile time) different pairRDDs of (key, value) that have to be joined and whose values must be summed up. Is there a better way than the one I was thinking of?
First of all, if you just want to sum all the integers but the first, the simplest way would be:
val rdd = sc.parallelize(List(1, 2, 3))
rdd.cache()
// the first element is the key; subtract it from the total sum
val first = rdd.first()
val result = rdd.sum() - first
On the other hand, if you want to have access to the indices of the elements, you can use the RDD zipWithIndex method like this:
val indexed = rdd.zipWithIndex()
indexed.cache()
// the key is the element at index 0; keep everything else
val result = (indexed.first()._1, indexed.filter(_._2 != 0))
But in your case this feels like overkill.
One more thing I would add: it looks like questionable design to put the key as the first element of your RDD. Why not just use pairs (key, rdd) in your driver program instead? It's quite hard to reason about the order of elements in an RDD, and I can't think of a natural situation in which the key is computed as the first element of an RDD (of course, I don't know your use case, so I can only guess).
EDIT
If you have one RDD of key-value pairs and you want to sum them by key, then just do:
val result = rdd.reduceByKey(_ + _)
If you have many RDDs of key-value pairs, you can union them before reducing by key:
val list = List(pairRDD0, pairRDD1, pairRDD2)
// another pairRDD arrives at runtime
val newList = anotherPairRDD0 :: list
val pairRDD = newList.reduce(_ union _)
val resultSoFar = pairRDD.reduceByKey(_ + _)
// another pairRDD arrives at runtime
val result = resultSoFar.union(anotherPairRDD1).reduceByKey(_ + _)
EDIT
I edited the example. As you can see, you can add an additional RDD whenever one comes up at runtime. This is because reduceByKey returns an RDD of the same type, so you can iterate this operation (of course, you will have to consider performance).
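Coming back to the original question of turning a flat record like (key, 1, 2, 3, …) into a pair, a minimal sketch, assuming each record arrives as a Seq[Int] whose head is the key:
import org.apache.spark.rdd.RDD

// Sketch only: split each record into (head = key, tail = values) and sum the tail
val flat: RDD[Seq[Int]] = sc.parallelize(Seq(Seq(10, 1, 2, 3), Seq(20, 4, 5)))
val summed: RDD[(Int, Int)] = flat.map(xs => (xs.head, xs.tail.sum))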