Scala - How to select the last element from an RDD?

First I had a salesList: List[Sale] and in order to get an ID of the last Sale in the List I've used lastOption:
val lastSaleId: Option[Any] = salesList.lastOption.map(_.saleId)
But now I've modified a method with List[Sale] to work with salesListRdd: List[RDD[Sale]]. So I've changed the way I'm getting an ID of the last Sale:
val lastSaleId: Option[Any] = SparkContext
.union(salesListRdd)
.collect().toList
.lastOption.map(_.saleId)
I'm not sure this is the best way to go, because I'm still collecting the RDD into a List. That brings all the data to the driver node and may cause the driver to run out of memory.
Is there a way to get the ID of the last Sale from an RDD while preserving the initial order of records? I don't mean any kind of sorting, but the order in which the Sale objects were originally stored in the List.

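There are at least two efficient solutions. You can either use top with zipWithIndex:
import org.apache.spark.rdd.RDD
def lastValue[T](rdd: RDD[T]): Option[T] = {
  // zipWithIndex assigns indices in partition order, so the largest index is the last element
  rdd.zipWithIndex.map(_.swap).top(1)(Ordering[Long].on(_._1)).headOption.map(_._2)
}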
or top with a custom key:
def lastValue[T](rdd: RDD[T]): Option[T] = {
  rdd.mapPartitionsWithIndex(
    // key each element by (partition index, position within the partition)
    (i, iter) => iter.zipWithIndex.map { case (x, j) => ((i, j), x) }
  ).top(1)(Ordering[(Int, Int)].on(_._1)).headOption.map(_._2)
}
The former requires an additional action for zipWithIndex, while the latter doesn't.
Before using either approach, make sure you understand the limitation. Quoting the docs:
Note that some RDDs, such as those returned by groupBy(), do not guarantee order of elements in a partition. The unique ID assigned to each element is therefore not guaranteed, and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
In particular, depending on the exact input, union might not preserve the input order at all.
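To see why the (partition index, position) key picks out the last element, here is a minimal local sketch that needs no Spark. It models an RDD as an ordered sequence of partitions, which is an assumption made purely for illustration; LastValueSketch and its lastValue are hypothetical names, not part of any API.

```scala
object LastValueSketch {
  // Models an RDD as an ordered sequence of partitions (illustrative assumption).
  def lastValue[T](partitions: Seq[Seq[T]]): Option[T] = {
    val keyed = for {
      (part, i) <- partitions.zipWithIndex // i: partition index
      (x, j)    <- part.zipWithIndex       // j: position inside the partition
    } yield ((i, j), x)
    // Tuples compare by partition index first, then by position,
    // so the maximum key belongs to the last element overall.
    if (keyed.isEmpty) None else Some(keyed.maxBy(_._1)._2)
  }
}
```

Empty partitions contribute no keys, so a trailing empty partition does not affect the result.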

You could use zipWithIndex and sort descending by the index, so that the last record ends up on top, then take(1):
sc.union(salesListRdd) // sc: the active SparkContext (assumed)
  .zipWithIndex()
  .map { case (sale, index) => (index, sale) }
  .sortByKey(ascending = false)
  .map { case (_, sale) => sale }
  .take(1)
The solution is taken from here: http://www.swi.com/spark-rdd-getting-bottom-records/
However, it is highly inefficient, since sortByKey shuffles the whole dataset just to read a single record.

Related

Order Spark RDD based on ordering in another RDD

I have an RDD with strings like this (ordered in a specific way):
["A","B","C","D"]
And another RDD with lists like this:
["C","B","F","K"],
["B","A","Z","M"],
["X","T","D","C"]
I would like to order the elements in each list in the second RDD based on the order in which they appear in the first RDD. The order of the elements that do not appear in the first list is not of concern.
From the above example, I would like to get an RDD like this:
["B","C","F","K"],
["A","B","Z","M"],
["C","D","X","T"]
I know I am supposed to use a broadcast variable to broadcast the first RDD as I process each list in the second RDD. But I am very new to Spark/Scala (and functional programming in general) so I am not sure how to do this.

I am assuming that the first RDD is small since you talk about broadcasting it. In that case you are right, broadcasting the ordering is a good way to solve your problem.
// generating data
val ordering_rdd = sc.parallelize(Seq("A","B","C","D"))
val other_rdd = sc.parallelize(Seq(
Seq("C","B","F","K"),
Seq("B","A","Z","M"),
Seq("X","T","D","C")
))
// let's start by collecting the ordering onto the driver
val ordering = ordering_rdd.collect()
// Let's broadcast the list:
val ordering_br = sc.broadcast(ordering)
// Finally, let's use the ordering to sort your records:
val result = other_rdd
.map( _.sortBy(x => {
val index = ordering_br.value.indexOf(x)
if(index == -1) Int.MaxValue else index
}))
Note that indexOf returns -1 if the element is not found in the list. If we left it as is, all elements that are not found would end up at the beginning. I understand that you want them at the end, so I replace -1 with Int.MaxValue.
Printing the result:
scala> result.collect().foreach(println)
List(B, C, F, K)
List(A, B, Z, M)
List(C, D, X, T)
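Calling indexOf for every element scans the ordering each time. A possible variation, sketched here on local collections, is to precompute a Map from element to position once and look ranks up in it; in Spark that Map would be the broadcast value instead of the array. OrderBySketch, rankOf and sortRecord are hypothetical names used only for this illustration.

```scala
object OrderBySketch {
  // Precompute element -> position once (in Spark, this Map would be broadcast).
  def rankOf(ordering: Seq[String]): Map[String, Int] =
    ordering.zipWithIndex.toMap

  // Sort one record by rank; elements missing from the ordering get
  // Int.MaxValue and therefore go last. sortBy is stable, so those
  // elements keep their original relative order.
  def sortRecord(record: Seq[String], rank: Map[String, Int]): Seq[String] =
    record.sortBy(x => rank.getOrElse(x, Int.MaxValue))
}
```

With the ordering A, B, C, D this gives the same rows as the output printed above.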

Optimize Sorting Iterable Values after grouping in Spark

I have an RDD[(String, (Int, Int))] and I need to get the top 10 values (tuples) for each key after sorting. I tried:
val sortedRDD = rdd.groupByKey.mapValues( x => x.toList.sortWith((x,y) => <<sorting logic>>).take(10))
This throws OutOfMemoryException because the Iterable[(Int, Int)] is large for some keys. How should I handle this? Is there a way to do this without using groupByKey()?

You should use aggregateByKey instead of groupByKey to perform the sorting and "trimming" (that keeps only top 10) while grouping instead of grouping into potentially-huge groups and only then mapping the result.
Here's how this could look:
// your sorting logic:
val sortingFunction: ((Int, Int), (Int, Int)) => Boolean = ???
val N = 10
val sortedRDD = rdd.aggregateByKey(List[(Int, Int)]())(
// first function: seqOp, how to add another item of the group to the result
{
case (topSoFar, candidate) if topSoFar.size < N => candidate :: topSoFar
case (topTen, candidate) => (candidate :: topTen).sortWith(sortingFunction).take(N)
},
// second function: combOp, how to combine two partial results created by seqOp
{ case (list1, list2) => (list1 ++ list2).sortWith(sortingFunction).take(N) }
)
Notice that per group, we always create values that are 10 items or less.
NOTE: performance can possibly be improved by performing fewer sort operations (we sort the same list again and again whenever we add another item or list). To solve that, you can consider using a "sorted set" with a limited capacity (see Limited SortedSet) as the value, so that each addition efficiently adds or discards the new value without re-sorting.
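As a rough sketch of that idea, the two functions below keep each partial result sorted and insert a candidate at its position instead of re-sorting the whole list. This uses a plain sorted List rather than a real limited sorted set, and the ordering (larger first component first) is only an assumed example of the sorting logic; TopNSketch and its members are hypothetical names. In Spark they could be passed as rdd.aggregateByKey(List.empty[(Int, Int)])(seqOp, combOp).

```scala
object TopNSketch {
  val N = 3 // kept small for the demo; the question uses 10

  // Assumed example ordering: larger first component comes first.
  val byFirstDesc: Ordering[(Int, Int)] =
    Ordering.by[(Int, Int), Int](_._1).reverse

  // seqOp: insert the candidate into an already sorted list, keep at most N.
  def seqOp(top: List[(Int, Int)], candidate: (Int, Int)): List[(Int, Int)] = {
    val (before, after) = top.span(byFirstDesc.lteq(_, candidate))
    (before ::: candidate :: after).take(N)
  }

  // combOp: merge two sorted partial results by inserting one into the other.
  def combOp(a: List[(Int, Int)], b: List[(Int, Int)]): List[(Int, Int)] =
    b.foldLeft(a)(seqOp)
}
```

Because every intermediate list is sorted and holds at most N items, the final value per key is sorted too, even when a key has fewer than N records.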

Count operation in reduceByKey in spark

val temp1 = tempTransform.map({ temp => ((temp.getShort(0), temp.getString(1)), (USAGE_TEMP.getDouble(2), USAGE_TEMP.getDouble(3)))})
.reduceByKey((x, y) => ((x._1+y._1),(x._2+y._2)))
Here I have performed a sum operation, but is it possible to do a count operation inside reduceByKey?
Something like this is what I have in mind:
reduceByKey((x, y) => (math.count(x._1),(x._2+y._2)))
But this is not working. Any suggestions, please?

Well, counting is equivalent to summing 1s, so just map the first item in each value tuple into 1 and sum both parts of the tuple like you did before:
val temp1 = tempTransform.map { temp =>
((temp.getShort(0), temp.getString(1)), (1, USAGE_TEMP.getDouble(3)))
}
.reduceByKey((x, y) => ((x._1+y._1),(x._2+y._2)))
Result would be an RDD[((Short, String), (Int, Double))] where the first item in the value tuple (the Int) is the number of original records matching that key.
That's actually the classic map-reduce example - word count.
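The same idea can be checked on a local collection without Spark. In this sketch reduceByKeyLocal is a hypothetical stand-in for reduceByKey (group by key, then reduce each group's values), and the record shape (key, usage) is an assumption for illustration.

```scala
object CountSumSketch {
  // Local stand-in for reduceByKey: group by key, then reduce each group's values.
  def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  // Each record contributes (1, usage); summing the 1s yields the count,
  // summing the second component yields the total usage.
  def countAndSum[K](records: Seq[(K, Double)]): Map[K, (Int, Double)] =
    reduceByKeyLocal(records.map { case (k, usage) => (k, (1, usage)) }) {
      (x, y) => (x._1 + y._1, x._2 + y._2)
    }
}
```

For example, countAndSum(Seq(("a", 1.0), ("b", 2.0), ("a", 3.0))) counts two records for "a" and one for "b".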

No, you can't do that. RDDs provide an iterator model for lazy computation, so every element is visited only once.
If you really want to compute the sum as described, repartition your RDD first, then use mapPartitions and implement your calculation in the closure (keep in mind that the elements of an RDD are not ordered).

Comparing Subsets of an RDD

I’m looking for a way to compare subsets of an RDD intelligently.
Let's say I have an RDD with key/value pairs of type (Int->T). I eventually need to say “compare all values of key 1 with all values of key 2, and compare the values of key 3 to the values of key 5 and key 7”. How would I go about doing this efficiently?
The way I’m currently thinking of doing it is by creating a List of filtered RDDs and then using RDD.cartesian()
def filterSubset[T] = (b:Int, r:RDD[(Int, T)]) => r.filter{case(name, _) => name == b}
val keyPairs: Seq[(Int, Int)] = ??? // all key pairs
val rddPairs = keyPairs.map {
  case (a, b) =>
    filterSubset(a, r).cartesian(filterSubset(b, r))
}
rddPairs.map{whatever I want to compare…}
I would then iterate the list and perform a map on each of the RDDs of pairs to gather the relational data that I need.
What I can’t tell about this idea is whether it would be extremely inefficient to set up possibly hundreds of map jobs and then iterate through them. In this case, would the lazy evaluation in Spark optimize the data shuffling between all of the maps? If not, can someone please recommend a more efficient way to approach this issue?
Thank you for your help

One way you can approach this problem is to replicate and partition your data to reflect the key pairs you want to compare. Let's start by creating two maps from the actual keys to the temporary keys we'll use for replication and joins:
def genMap(keys: Seq[Int]) = keys
.zipWithIndex.groupBy(_._1)
.map{case (k, vs) => (k -> vs.map(_._2))}
val left = genMap(keyPairs.map(_._1))
val right = genMap(keyPairs.map(_._2))
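To make the temporary keys concrete, here is a small local sketch using the key pairs mentioned in the question, (1, 2), (3, 5) and (3, 7). Each temporary key is simply the position of a pair in keyPairs; GenMapSketch is a hypothetical wrapper used only for this illustration.

```scala
object GenMapSketch {
  // Same logic as genMap above, with an explicit result type.
  def genMap(keys: Seq[Int]): Map[Int, Seq[Int]] =
    keys.zipWithIndex.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }

  // Key pairs from the question: 1 vs 2, 3 vs 5, 3 vs 7.
  val keyPairs: Seq[(Int, Int)] = Seq((1, 2), (3, 5), (3, 7))

  val left: Map[Int, Seq[Int]]  = genMap(keyPairs.map(_._1)) // key 3 maps to temporary keys 1 and 2
  val right: Map[Int, Seq[Int]] = genMap(keyPairs.map(_._2)) // keys 5 and 7 map to 1 and 2
}
```

Values of key 3 are therefore replicated under temporary keys 1 and 2, where they meet the values of keys 5 and 7 respectively.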
Next we can transform data by replicating with new keys:
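import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
def mapAndReplicate[T: ClassTag](rdd: RDD[(Int, T)], map: Map[Int, Seq[Int]]) = {
  // emit one copy of (k, v) per temporary key assigned to k
  rdd.flatMap{case (k, v) => map.getOrElse(k, Seq()).map(x => (x, (k, v)))}
}
// r is the original RDD[(Int, T)] from the question
val leftRDD = mapAndReplicate(r, left)
val rightRDD = mapAndReplicate(r, right)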
Finally we can cogroup:
val cogrouped = leftRDD.cogroup(rightRDD)
And compare / filter pairs:
cogrouped.values.flatMap{case (xs, ys) => for {
(kx, vx) <- xs
(ky, vy) <- ys
if cosineSimilarity(vx, vy) <= threshold
} yield ((kx, vx), (ky, vy)) }
Obviously, in its current form this approach is limited. It assumes that the values for an arbitrary pair of keys can fit into memory, and it requires a significant amount of network traffic. Still, it should give you some idea of how to proceed.
Another possible approach is to store the data in an external system (for example a database) and fetch the required key-value pairs on demand.
Since you're trying to find similarity between elements, I would also consider a completely different approach. Instead of naively comparing key by key, I would try to partition the data using a custom partitioner which reflects the expected similarity between documents. It is far from trivial in general but should give much better results.

Using a DataFrame you can easily do the cartesian operation using join:
dataframe1.join(dataframe2, dataframe1("key")===dataframe2("key"))
It will probably do exactly what you want, but efficiently.
If you don't know how to create a DataFrame, please refer to http://spark.apache.org/docs/latest/sql-programming-guide.html#creating-dataframes

Spark closure argument binding

I am working with Apache Spark in Scala.
I have a problem when trying to manipulate one RDD with data from a second RDD. I am trying to pass the 2nd RDD as an argument to a function being 'mapped' against the first RDD, but seemingly the closure created on that function binds an uninitialized version of that value.
Following is a simpler piece of code that shows the type of problem I'm seeing. (My real example where I first had trouble is larger and less understandable).
I don't really understand the argument binding rules for Spark closures.
What I'm really looking for is a basic approach or pattern for how to manipulate one RDD using the content of another (which was previously constructed elsewhere).
In the following code, calling Test1.process(sc) will fail with a null pointer access in findSquare (as the 2nd arg bound in the closure is not initialized)
object Test1 {
def process(sc: SparkContext) {
val squaresMap = (1 to 10).map(n => (n, n * n))
val squaresRDD = sc.parallelize(squaresMap)
val primes = sc.parallelize(List(2, 3, 5, 7))
for (p <- primes) {
println("%d: %d".format(p, findSquare(p, squaresRDD)))
}
}
def findSquare(n: Int, squaresRDD: RDD[(Int, Int)]): Int = {
squaresRDD.filter(kv => kv._1 == n).first._1
}
}

The problem you are experiencing has nothing to do with closures or with RDDs, which, contrary to popular belief, are serializable.
It simply breaks a fundamental Spark rule, which states that you cannot trigger an action or transformation from another action or transformation*. Different variants of this question have been asked on SO multiple times.
To understand why that's the case, you have to think about the architecture:
SparkContext is managed on the driver.
Everything that happens inside transformations is executed on the workers. Each worker has access only to its own part of the data and doesn't communicate with other workers**.
If you want to use the content of multiple RDDs, you have to use one of the transformations which combine RDDs, like join, cartesian, zip or union.
Here you most likely want to (I am not sure why you pass a tuple and use only its first element) either use a broadcast variable:
val squaresMapBD = sc.broadcast(squaresMap)
def findSquare(n: Int): Seq[(Int, Int)] = {
squaresMapBD.value
.filter{case (k, v) => k == n}
.map{case (k, v) => (n, k)}
.take(1)
}
primes.flatMap(findSquare)
or Cartesian:
primes
.cartesian(squaresRDD)
.filter{case (n, (k, _)) => n == k}.map{case (n, (k, _)) => (n, k)}
Converting primes to dummy pairs (Int, null) and joining would be more efficient:
primes.map((_, null)).join(squaresRDD).map(...)
but based on your comments I assume you're interested in a scenario when there is natural join condition.
Depending on the context, you can also consider using a database or files to store common data.
On a side note, an RDD is not a local collection: a for loop over an RDD desugars to foreach, so its body runs on the executors rather than on the driver. To iterate on the driver you have to collect or use toLocalIterator first.
* To be precise you cannot access SparkContext.
** Torrent broadcast and tree aggregates involve communication between executors so it is technically possible.

You can't use an RDD inside another RDD's transformation, because RDD operations can only be triggered from the driver.
Also, I've never seen an RDD enumerated with a for statement; I usually use foreach, which is part of the RDD API.
To combine data from two RDDs, you can use join, union, or a broadcast variable (if one of the RDDs is small).