Optimize Sorting Iterable Values after grouping in Spark - scala

I have an RDD[(String, (Int, Int))] and I need to get the top 10 values (tuples) for each key after sorting. I tried:
val sortedRDD = rdd.groupByKey.mapValues(x => x.toList.sortWith((x, y) => <<sorting logic>>).take(10))
This throws an OutOfMemoryError because the Iterable[(Int, Int)] is very large for some keys. How should I handle this? Is there a way to do it without using .groupByKey()?

You should use aggregateByKey instead of groupByKey: it performs the sorting and "trimming" (keeping only the top 10) while grouping, instead of first grouping into potentially huge groups and only then mapping the result.
Here's how this could look:
// your sorting logic:
val sortingFunction: ((Int, Int), (Int, Int)) => Boolean = ???
val N = 10
val sortedRDD = rdd.aggregateByKey(List[(Int, Int)]())(
  // first function: seqOp, how to add another item of the group to the result
  {
    case (topSoFar, candidate) if topSoFar.size < N => candidate :: topSoFar
    case (topSoFar, candidate) => (candidate :: topSoFar).sortWith(sortingFunction).take(N)
  },
  // second function: combOp, how to combine two partial results created by seqOp
  { case (list1, list2) => (list1 ++ list2).sortWith(sortingFunction).take(N) }
)
Notice that per group, we only ever create values of 10 items or fewer.
NOTE: performance can likely be improved by performing fewer "sort" operations (we re-sort the same list whenever we add another item or list). To avoid that, you can use a "sorted set" with limited capacity (see Limited SortedSet) as the value, so that each addition efficiently inserts or discards the new value without re-sorting the whole list.
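For illustration, here is a minimal sketch of that idea using an immutable SortedSet as the aggregation value (reusing rdd and N from above). The Ordering stands in for your sorting logic; note that a SortedSet collapses duplicates, so equal tuples are kept only once:
import scala.collection.immutable.SortedSet

implicit val ord: Ordering[(Int, Int)] = ??? // your sorting logic, expressed as an Ordering

// keep at most N elements, discarding the smallest under `ord`
def insertBounded(top: SortedSet[(Int, Int)], candidate: (Int, Int)): SortedSet[(Int, Int)] =
  if (top.size < N) top + candidate
  else if (ord.gt(candidate, top.head)) top - top.head + candidate
  else top

val sortedRDD = rdd.aggregateByKey(SortedSet.empty[(Int, Int)])(
  insertBounded,
  (s1, s2) => s2.foldLeft(s1)(insertBounded)
)
Each insertion is then O(log N) instead of a full re-sort, and the combiner folds one partial result into the other.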

Related

Scala - check whether joining two vectors loops through all elements of a vector

I have 2 vectors as below.
val vecBase21=....sortBy(r=>(r._1,r._2))
vecBase21: scala.collection.immutable.Vector[(String, String, Double)] = Vector((036,20210624 0400,2.0), (036,20210624 0405,2.0), (036,20210624 0410,2.0), (036,20210624 0415,2.0), (036,20210624 0420,2.0),...)
val vecBase22=....sortBy(r=>(r._1,r._2))
vecBase22: scala.collection.immutable.Vector[(String, String, Double)] = Vector((036,20210625 0400,2.0), (036,20210625 0405,2.0), (036,20210625 0410,2.0), (036,20210625 0415,2.0), (036,20210625 0420,2.0),...)
Inside, x._1 is the ID, x._2 is the date time, and x._3 is the value. Then I did this to create a third vector as follows:
val vecBase30 = vecBase21.map(x => vecBase22.filter(y => x._1 == y._1 && x._2 == y._2).map(y => (x._1, x._2, x._3, y._3))).flatten
This is literally a SQL join: a join b on a.id=b.id and a.date_time=b.date_time. It loops over vecBase22 to search for each combination of ID and date_time from vecBase21. As each combination is unique within a vector and both are sorted, I want to find out whether the loop over vecBase22 stops once it finds a match or loops to the end of vecBase22 anyway. I tried this:
val vecBase30 = vecBase21.map(x => vecBase22.filter(y => x._1 == y._1 && x._2 == y._2).map { y =>
  println("x1=" + x._1 + " y1=" + y._1 + " x2=" + x._2 + " y2=" + y._2)
  (x._1, x._2, x._3, y._3)
}).flatten
But it apparently prints only the matched results. Is there a way of printing every combination from the two vectors that the machine evaluates when checking for a match?
As each combination is unique within a vector and both are sorted, I want to find out whether the loop over vecBase22 stops once it finds a match or loops to the end of vecBase22 anyway
When you call filter on vecBase22 you loop through every element of that collection to test it against the predicate; this returns a new collection, which is then passed to map. If you want to short-circuit the filtering process, you could consider using the method collectFirst (Scala 2.12):
def collectFirst[B](pf: PartialFunction[A, B]): Option[B]
Finds the first element of the traversable or iterator for which the given partial function is defined, and applies the partial function to it.
Note: may not terminate for infinite-sized collections.
Note: might return different results for different runs, unless the underlying collection type is ordered.
pf: the partial function
returns: an option value containing pf applied to the first value for which it is defined, or None if none exists.
Example:
Seq("a", 1, 5L).collectFirst({ case x: Int => x*10 }) = Some(10)
So you could do something like:
val vecBase30: Vector[(String, String, Double, Double)] = vecBase21
  .flatMap(x => vecBase22.collectFirst {
    case (id, time, v2) if x._1 == id && x._2 == time => (x._1, x._2, x._3, v2)
  })
First off: yes, it loops through all items of vecBase22 for each item of vecBase21. That's what map and filter do.
If the println doesn't work, it is probably because you are executing your code in an interpreter that loses stdout. Some notebook, maybe?
Also, if you want it to stop once it finds a match, use Seq.find.
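For example, a sketch with find, which stops at the first match (same tuple layout as in the question):
val vecBase30 = vecBase21.flatMap { x =>
  vecBase22
    .find(y => x._1 == y._1 && x._2 == y._2)
    .map(y => (x._1, x._2, x._3, y._3))
}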
Finally, you can improve readability. Here are a couple of ideas:
use case classes instead of tuples
add spaces around operators
add a new line before each chained operation if it doesn't fit on one line
use flatMap instead of map followed by flatten
add the val type (not necessary, but it helps when reading the code)
That gives:
case class Item(id: String, time: String, value: Double)
case class Joint(id: String, time: String, v1: Double, v2: Double)

val vecBase21: Vector[Item] = ....sortBy(item => (item.id, item.time))
val vecBase22: Vector[Item] = ....sortBy(item => (item.id, item.time))

val vecBase30: Vector[Joint] = vecBase21.flatMap(x =>
  vecBase22
    .filter(y => x.id == y.id && x.time == y.time)
    .map(y => Joint(x.id, x.time, x.value, y.value))
)

Scala - How to select the last element from an RDD?

First I had a salesList: List[Sale], and in order to get the ID of the last Sale in the List I used lastOption:
val lastSaleId: Option[Any] = salesList.lastOption.map(_.saleId)
But now I've modified the method that used List[Sale] to work with salesListRdd: List[RDD[Sale]], so I've changed the way I get the ID of the last Sale:
val lastSaleId: Option[Any] = SparkContext
  .union(salesListRdd)
  .collect().toList
  .lastOption.map(_.saleId)
I'm not sure this is the best way to go, because I'm still collecting the RDD to a List, which brings it to the driver node and may cause the driver to run out of memory.
Is there a way to get the ID of the last Sale from an RDD while preserving the initial order of records? Not any kind of sorting, but the order in which the Sale objects were originally stored in the List?
There are at least two efficient solutions. You can either use top with zipWithIndex:
import org.apache.spark.rdd.RDD

def lastValue[T](rdd: RDD[T]): Option[T] = {
  rdd.zipWithIndex.map(_.swap).top(1)(Ordering[Long].on(_._1)).headOption.map(_._2)
}
or top with a custom key:
def lastValue[T](rdd: RDD[T]): Option[T] = {
  rdd.mapPartitionsWithIndex(
    (i, iter) => iter.zipWithIndex.map { case (x, j) => ((i, j), x) }
  ).top(1)(Ordering[(Int, Int)].on(_._1)).headOption.map(_._2)
}
The former requires an additional Spark action, since zipWithIndex has to compute partition sizes first, while the latter doesn't.
Before using either, please be sure to understand the limitations. Quoting the docs:
Note that some RDDs, such as those returned by groupBy(), do not guarantee order of elements in a partition. The unique ID assigned to each element is therefore not guaranteed, and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
In particular, depending on the exact input, union might not preserve the input order at all.
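For completeness, a usage sketch tying this back to the question's setup (assuming sc is the SparkContext and salesListRdd: List[RDD[Sale]]):
val lastSaleId: Option[Any] = lastValue(sc.union(salesListRdd)).map(_.saleId)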
You could use zipWithIndex and sort descending by it, so that the last record will be on the top, then take(1):
salesListRdd
  .zipWithIndex()
  .map({ case (x, y) => (y, x) })
  .sortByKey(ascending = false)
  .map({ case (x, y) => y })
  .take(1)
The solution is taken from here: http://www.swi.com/spark-rdd-getting-bottom-records/
However, it is highly inefficient, since it does lots of partition shuffling.

How do I perform set theory minus operation between two lists in Scala?

I have the following case class
case class Cart(userId: Int, ProductId: Int, SellerId: Int, Qty: Int)
I have the following lists:
val mergedCart :List[Cart]= List(Cart(900,1,1,2),Cart(900,2,2,2),Cart(901,3,3,2),Cart(901,2,2,2),Cart(901,1,1,2),Cart(900,4,2,1))
val userCart:List[Cart] = List(Cart(900,1,1,2),Cart(900,2,2,2),Cart(900,4,2,1))
val guestCart:List[Cart] = List(Cart(901,3,3,2),Cart(901,2,2,2),Cart(901,1,1,2))
val commonCart = List(Cart(900,2,2,4), Cart(900,1,1,4))
My requirement is that I have to get the following list as the output:
List(Cart(900,2,2,4),Cart(900,1,1,4),Cart(901,3,3,2),Cart(900,4,2,1))
The final list should contain the common objects from userCart and guestCart, matched on the (ProductId, SellerId) combination, with the quantities of both objects added together. The other objects in userCart and guestCart that do not match any common object should also be present in the final output list.
I am new to Scala and I am not able to solve this, kindly help me with this code.
If you don't care about ordering in the resulting list (so your result is basically a Set), it's as simple as this:
def sum(a: Cart, b: Cart) = {
  //require(a.userId == b.userId)
  a.copy(Qty = a.Qty + b.Qty)
}

(userCart ++ guestCart)
  .groupBy(x => x.ProductId -> x.SellerId)
  .mapValues(_.reduce(sum _))
  .values
  .toList //toSet is more appropriate here
Results:
List(Cart(900,4,2,1), Cart(900,2,2,4), Cart(900,1,1,4), Cart(901,3,3,2))
(!) Be aware that I just took the first userId in case of a collision (see the sum function). However, it preserves the priority of users over guests, if that's what's implied.
Represented as a Set, this result equals your required output:
scala> val mRes = List(Cart(900,4,2,1), Cart(900,2,2,4), Cart(900,1,1,4), Cart(901,3,3,2))
mRes: List[Cart] = List(Cart(900,4,2,1), Cart(900,2,2,4), Cart(900,1,1,4), Cart(901,3,3,2))
scala> val req = List(Cart(900,2,2,4),Cart(900,1,1,4),Cart(901,3,3,2),Cart(900,4,2,1))
req: List[Cart] = List(Cart(900,2,2,4), Cart(900,1,1,4), Cart(901,3,3,2), Cart(900,4,2,1))
scala> mRes.toSet == req.toSet
res17: Boolean = true
Explanations:
++ concatenates two lists
groupBy groups values by a key function (x.ProductId -> x.SellerId, which is equivalent to the tuple (x.ProductId, x.SellerId), in your case). It preserves order inside each group, but the groups themselves aren't ordered - that's why the order in the resulting list is undefined. The operation returns a Map[Key, List[Value]], in your case Map[(Int, Int), List[Cart]] - see the worked example after this list
mapValues iterates over the lists of carts
reduce inside mapValues collapses each list of carts by summing the carts with the sum function
I didn't have to reattach objects with a unique (x.ProductId, x.SellerId), as they were represented as one-element lists, so the reduce function didn't touch them - it just returned the first (and only) element
a.copy(Qty = ...) makes a copy of a with a modified Qty field. In our case I take the left element as the template, so elements that come earlier in (userCart ++ guestCart) have higher priority when the userId is chosen
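For concreteness, here is the intermediate Map that the groupBy step produces for the question's data (entry order in a Map is unspecified):
scala> (userCart ++ guestCart).groupBy(x => x.ProductId -> x.SellerId)
res0: scala.collection.immutable.Map[(Int, Int),List[Cart]] = Map((1,1) -> List(Cart(900,1,1,2), Cart(901,1,1,2)), (2,2) -> List(Cart(900,2,2,2), Cart(901,2,2,2)), (4,2) -> List(Cart(900,4,2,1)), (3,3) -> List(Cart(901,3,3,2)))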
Answering the headline's question about subtracting two sets:
scala> Set(1,2,3,4) - 4
res16: scala.collection.immutable.Set[Int] = Set(1, 2, 3)
scala> Set(1,2,3,4) -- Set(3,4)
res15: scala.collection.immutable.Set[Int] = Set(1, 2)
If the sets' elements are instances of case classes (given that the hashCode/equals methods weren't overridden), all fields are compared to check equality between two elements.
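For example, with a hypothetical case class:
scala> case class P(a: Int, b: Int)
defined class P

scala> Set(P(1, 2), P(3, 4)) - P(1, 2)
res18: scala.collection.immutable.Set[P] = Set(P(3,4))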
There is a theoretical connection between the groupBy solution and set theory. You can easily notice that my solution is representable as SQL's GROUP BY + AGGREGATE (groupBy with a reduce catamorphism in Scala). SQL is mostly based on relational algebra, which in turn is partially based on set theory, so there is the connection.
P.S. Field/value/variable names in Scala should start with a lowercase letter by convention; an initial capital letter conventionally denotes a constant.

Count operation in reduceByKey in spark

val temp1 = tempTransform
  .map(temp => ((temp.getShort(0), temp.getString(1)), (USAGE_TEMP.getDouble(2), USAGE_TEMP.getDouble(3))))
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
Here I have performed a sum operation, but is it possible to do a count operation inside reduceByKey?
Something like:
reduceByKey((x, y) => (math.count(x._1), (x._2 + y._2)))
But this is not working. Any suggestions, please?
Well, counting is equivalent to summing 1s, so just map the first item in each value tuple into 1 and sum both parts of the tuple like you did before:
val temp1 = tempTransform
  .map(temp => ((temp.getShort(0), temp.getString(1)), (1, USAGE_TEMP.getDouble(3))))
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
Result would be an RDD[((Short, String), (Int, Double))] where the first item in the value tuple (the Int) is the number of original records matching that key.
That's actually the classic map-reduce example - word count.
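For reference, a sketch of that word-count pattern (assuming sc is a SparkContext and input.txt is a placeholder path):
val wordCounts = sc.textFile("input.txt")
  .flatMap(_.split("\\s+")) // split lines into words
  .map(word => (word, 1))   // counting = summing 1s
  .reduceByKey(_ + _)       // sum the 1s per word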
No, you can't do that. RDDs provide an iterator model for lazy computation, so every element is visited only once.
If you really want to do the sum as described, repartition your RDD first, then use mapPartitions and implement your calculation in the closure (keep in mind that the elements of an RDD are not ordered).
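A loose sketch of that per-partition idea, assuming a hypothetical keyed: RDD[((Short, String), Double)]; the reduceByKey answer above remains the simpler option:
val perKey = keyed
  .mapPartitions { iter =>
    // build (count, sum) per key within this partition
    val acc = scala.collection.mutable.Map.empty[(Short, String), (Int, Double)]
    iter.foreach { case (k, v) =>
      val (c, s) = acc.getOrElse(k, (0, 0.0))
      acc(k) = (c + 1, s + v)
    }
    acc.iterator
  }
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))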

Sort a list by an ordered index

Let us assume that I have the following two sequences:
val index = Seq(2,5,1,4,7,6,3)
val unsorted = Seq(7,6,5,4,3,2,1)
The first is the index by which the second should be sorted. My current solution is to traverse over the index and construct a new sequence with the found elements from the unsorted sequence.
val sorted = index.foldLeft(Seq[Int]()) { (s, num) =>
  s ++ Seq(unsorted.find(_ == num).get)
}
But this solution seems very inefficient and error-prone to me. On every iteration it searches the complete unsorted sequence. And if the index and the unsorted list aren't in sync, either an error is thrown or an element is omitted. In both cases, the out-of-sync elements should be appended to the ordered sequence.
Is there a more efficient and solid solution for this problem? Or is there a sort algorithm which fits into this paradigm?
Note: This is a constructed example. In reality I would like to sort a list of mongodb documents by an ordered list of document Id's.
Update 1
I've selected the answer from Marius Danila because it seems the fastest and most Scala-ish solution to my problem. It doesn't handle out-of-sync items, but that can easily be added.
So here is the updated solution:
import scala.collection.immutable.HashMap
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

def sort[T: ClassTag, Key](index: Seq[Key], unsorted: Seq[T], key: T => Key): Seq[T] = {
  val positionMapping = HashMap(index.zipWithIndex: _*)
  val inSync = new Array[T](unsorted.size)
  val notInSync = new ArrayBuffer[T]()
  for (item <- unsorted) {
    if (positionMapping.contains(key(item))) {
      inSync(positionMapping(key(item))) = item
    } else {
      notInSync.append(item)
    }
  }
  inSync.filterNot(_ == null) ++ notInSync
}
Update 2
The approach suggested by Bask.cc seems to be the correct answer. It also doesn't consider the out-of-sync issue, but that can likewise be added easily.
val index: Seq[String]
val entities: Seq[Foo]
val idToEntityMap = entities.map(e => e.id -> e).toMap
val sorted = index.map(idToEntityMap)
val result = sorted ++ entities.filterNot(sorted.toSet)
Why do you want to sort the collection when you already have a sorted index collection? You can just use map.
Concerning "In reality I would like to sort a list of mongodb documents by an ordered list of document Id's":
val ids: Seq[String]
val entities: Seq[Foo]
val idToEntityMap = entities.map(e => e.id -> e).toMap
ids.map(idToEntityMap)
This may not exactly map to your use case, but Googlers may find this useful:
scala> val ids = List(3, 1, 0, 2)
ids: List[Int] = List(3, 1, 0, 2)
scala> val unsorted = List("third", "second", "fourth", "first")
unsorted: List[String] = List(third, second, fourth, first)
scala> val sorted = ids map unsorted
sorted: List[String] = List(first, second, third, fourth)
I do not know the language that you are using, but irrespective of the language this is how I would solve the problem.
From the first list (here 'index'), create a hash table whose key is the document id and whose value is the position of the document in the sorted order.
Then, when traversing the list of documents, look up each document id in the hash table to get the position it should occupy in the sorted order, and place it there in a pre-allocated table.
Note: if the number of documents is small, then instead of using a hashtable you could use a pre-allocated table and index it directly by the document id.
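A Scala sketch of that idea, reusing the question's example data:
val position: Map[Int, Int] = index.zipWithIndex.toMap // document id -> target position
val placed = new Array[Int](unsorted.size)             // pre-allocated result table
for (doc <- unsorted) placed(position(doc)) = doc
// placed.toSeq == Seq(2, 5, 1, 4, 7, 6, 3)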
Flat-mapping the index over the unsorted list seems to be a safer version (if an index value isn't found it's just dropped, since find returns None):
index.flatMap(i => unsorted.find(_ == i))
It still has to traverse the unsorted list every time (worst case this is O(n^2)). With your example I'm not sure that there's a more efficient solution.
In this case you can use zip-sort-unzip:
(unsorted zip index).sortWith(_._2 < _._2).unzip._1
Btw, if you can, a better solution would be to sort the list on the db side using $orderBy.
Ok.
Let's start from the beginning.
Besides the fact that you're rescanning the unsorted list each time, the Seq companion creates a List by default. So in the foldLeft you're appending an element to the end of the list each time, which makes the whole fold an O(N^2) operation.
An improvement would be:
val sorted_rev = index.foldLeft(Seq[Int]()) { (s, num) =>
  unsorted.find(_ == num).get +: s
}
val sorted = sorted_rev.reverse
But that is still an O(N^2) algorithm. We can do better.
The following sort function should work:
import scala.collection.immutable.HashMap
import scala.reflect.ClassTag

def sort[T: ClassTag, Key](index: Seq[Key], unsorted: Seq[T], key: T => Key): Seq[T] = {
  val positionMapping = HashMap(index.zipWithIndex: _*) //1
  val arr = new Array[T](unsorted.size) //2
  for (item <- unsorted) { //3
    val position = positionMapping(key(item))
    arr(position) = item
  }
  arr //6
}
The function sorts a list of items, unsorted, by a sequence of indexes, index, where the key function extracts the id from the objects you're trying to sort.
Line 1 creates a reverse index - mapping each object id to its final position.
Line 2 allocates the array which will hold the sorted sequence. We're using an array since we need constant-time random-position set performance.
The loop that starts at line 3 traverses the sequence of unsorted items and places each item at its intended position using the positionMapping reverse index.
Line 6 will return the array converted implicitly to a Seq using the WrappedArray wrapper.
Since our reverse-index is an immutable HashMap, lookup should take constant-time for regular cases. Building the actual reverse-index takes O(N_Index) time where N_Index is the size of the index sequence. Traversing the unsorted sequence takes O(N_Unsorted) time where N_Unsorted is the size of the unsorted sequence.
So the complexity is O(max(N_Index, N_Unsorted)), which I guess is the best you can do in the circumstances.
For your particular example, you would call the function like so:
val sorted = sort(index, unsorted, identity[Int])
For the real case, it would probably be like this:
val sorted = sort(idList, unsorted, obj => obj.id)
The best I can do is to create a Map from the unsorted data, and use map lookups (basically the hashtable suggested by a previous poster). The code looks like:
val unsortedAsMap = unsorted.map(x => x -> x).toMap
index.map(unsortedAsMap)
Or, if there's a possibility of hash misses:
val unsortedAsMap = unsorted.map(x => x -> x).toMap
index.flatMap(unsortedAsMap.get)
It's O(n) in time*, but you're swapping time for space, as it uses O(n) space.
For a slightly more sophisticated version that handles missing values, try:
import scala.collection.JavaConversions._
import scala.collection.mutable.ListBuffer

val unsortedAsMap = new java.util.LinkedHashMap[Int, Integer]
for (i <- unsorted) unsortedAsMap.put(i, i)
val newBuffer = ListBuffer.empty[Int]
for (i <- index) {
  val r = unsortedAsMap.remove(i)
  if (r != null) newBuffer += r
  // Not sure what to do for "else"
}
for ((k, v) <- unsortedAsMap) newBuffer += v
newBuffer.result()
If it's a MongoDB database in the first place, you might be better off retrieving documents directly from the database by index, so something like:
index.map(lookupInDB)
*technically it's O(n log n), as Scala's standard immutable map is O(log n), but you could always use a mutable map, which is O(1)
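A sketch of that mutable-map variant, using the same unsorted and index values as above:
import scala.collection.mutable

val unsortedAsMutableMap = mutable.HashMap(unsorted.map(x => x -> x): _*)
index.flatMap(unsortedAsMutableMap.get)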