Unique pairs in cartesian product of an RDD with itself - scala

I have an RDD with 100 items and each item is of the format (String,Int,Option[Set[Int]])
I want to compute the cartesian product of this RDD with itself and keep only unique pairs. For example, without any filtering I would get something like this:
(a,a),(a,b),(a,c),(b,a),(b,b),(b,c),(c,a),(c,b),(c,c)
The output that I want is (a,b),(a,c),(b,c)
I have managed to remove the pairs where both elements are the same value, i.e. (a,a),(b,b),(c,c), but I am unsure how to remove the reversed pairs.
This is the code I used to do it:
val m = records.cartesian(records).filter{case (a,b) => a != b}.collect()
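One common way to also drop the reversed duplicates is to keep only one orientation of each pair, for instance by comparing the String field of the two items (this assumes the String keys are distinct). Here is a sketch on a plain Seq, since the filter predicate is exactly the one you would pass to the RDD's filter:

```scala
// Sketch: keep only one orientation of each pair by imposing an
// ordering on the String field (assumes the keys are distinct).
// A plain Seq stands in for the RDD; the filter predicate is identical.
val items = Seq(("a", 1), ("b", 2), ("c", 3))
val product = for (x <- items; y <- items) yield (x, y) // like records.cartesian(records)
val uniquePairs = product.filter { case (x, y) => x._1 < y._1 }
// uniquePairs: ((a,1),(b,2)), ((a,1),(c,3)), ((b,2),(c,3))
```

On the RDD the same predicate works: records.cartesian(records).filter { case (a, b) => a._1 < b._1 }. It removes both the identical pairs and the reversed ones in a single pass.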

Related

Is there a way in Scala I can form a map based on the content of two arrays

I have two Scala vectors:
value - represents values
occurrences - represents number of occurrences of a character
Ex:
values : (a,b,c,d)
occurrences : (4,2,1,5)
How can I merge these two vectors into a Map of the form Map[Char, Int] = Map(a -> 4, b -> 2, c -> 1, d -> 5)?
(values zip occurrences).toMap
zip pairs up matching elements in two collections, and toMap converts a collection of pairs into a Map.
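A minimal runnable illustration of that one-liner, using the values and occurrences from the question:

```scala
val values = Vector('a', 'b', 'c', 'd')
val occurrences = Vector(4, 2, 1, 5)
// zip pairs elements up positionally; toMap turns the pairs into a Map
val merged: Map[Char, Int] = (values zip occurrences).toMap
// merged: Map(a -> 4, b -> 2, c -> 1, d -> 5)
```

Note that if the two vectors have different lengths, zip silently truncates to the shorter one, so it is worth checking values.length == occurrences.length first.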

combine elements into arrays in rdd

How can I convert an RDD[(Int,Int)] to an RDD[Array[(Int,Int)]], where I combine elements that share a key?
Let's say I have
(0,0),(1,0),(1,1),(0,1)
and I want an array arr1 = ((0,0),(1,0)) and an array arr2 = ((1,1),(0,1)).
So the resulting RDD will have arr1 and arr2 as its elements.
What you're basically trying to do is group an RDD[TupleN] by its ith element. You can use
rdd.groupBy(_._1)
to create an
RDD[(T, Iterable[TupleN])]
where the key is the ith element (i.e., 0 or 1 in your example).
Then you can turn the grouped values into arrays with mapValues(_.toArray).
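Here is the same idea on a local collection, as a sketch without Spark: groupBy on a Seq returns a Map rather than an RDD, but the shape of the transformation is identical. Note that the question's example actually groups by the second tuple element:

```scala
val pairs = Seq((0, 0), (1, 0), (1, 1), (0, 1))
// group by the second element, then materialize each group as an Array
val grouped: Map[Int, Array[(Int, Int)]] =
  pairs.groupBy(_._2).map { case (k, v) => k -> v.toArray }
// grouped(0) = Array((0,0), (1,0)); grouped(1) = Array((1,1), (0,1))
```

Swap in `_._1` if the grouping key is the first element instead.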

How to get the n top elements of an rdd per value?

I created an RDD of key/value pairs (RDD[(String, Int)]) this way:
rdd.map(row => row.split(" ")(1) -> 1).reduceByKey(_ + _)
How can I get the top five elements based on values?
You can use rdd.top in order to avoid a full sort of the rdd:
rdd.top(5)(Ordering[Int].on(_._2))
This defines an ordering on the values and makes a single O(n) pass over the RDD to get the top 5 items by value.
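For intuition, here is what that Ordering does on a local collection (a sketch with invented sample data). rdd.top keeps the largest elements under the given Ordering, which a local sort-then-take emulates:

```scala
// invented sample data: (word, count) pairs
val counts = Seq(("a", 3), ("b", 10), ("c", 7), ("d", 1), ("e", 5), ("f", 8))
// same Ordering as in the answer: compare pairs by their Int value
val byValue = Ordering[Int].on[(String, Int)](_._2)
// rdd.top(5)(byValue) keeps the 5 largest; locally: reverse-sort and take 5
val top5 = counts.sorted(byValue.reverse).take(5)
// top5: (b,10), (f,8), (c,7), (e,5), (a,3)
```

The difference is that rdd.top does this in one distributed pass without fully sorting the data.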

Calculating derived value in Spark Streaming

I have two Key Value Pairs of the type org.apache.spark.streaming.dstream.DStream[Int].
First Key value pair is (word,frequency).
Second key value pair is (Number of rows,Value).
I would like to divide the frequency by the value for each word, but I am getting the error below:
value / is not a member of org.apache.spark.streaming.dstream.DStream[Int]
Sample Code :
f is frequency of the word and c is the total count
rdd has word and frequency
val cp = rdd.foreachRDD {
x => (x, f/c)
}
Apply the transform transformation on the DStream object: inside it you get each underlying RDD, and you can then apply a map transformation on that RDD, as follows:
dStream.transform { rdd =>
  rdd.map(x => (x, f / c))
}
If f comes from another DStream, collect its value first before using it inside an RDD or DStream closure.

Scala: Getting the index of an element in a Map

I have a Map[String, Object], where the key is an ID. That ID is now referenced within another object and I have to know what index (what position in the Map) that ID has. I can't believe there isn't an .indexOf
How can I accomplish that?
Do I really have to build myself a list with all the keys or another Map with ID1 -> 1, ID2 -> 2,... ?
I have to get the ID's indexes multiple times. Would a List or that Map be more efficient?
@Dora, as everyone mentioned, maps are unordered, so there is no way to index them in place and store an id with them.
It's hard to guess the use case for storing (K,V) pairs in a map and then getting a unique index for every K, so here are a few suggestions based on my understanding:
1. You could use a LinkedHashMap instead of a Map; it maintains insertion order, so you get a stable iteration order. Get a keysIterator on this map and convert it to a list, which gives you a unique index for every key in your map. Something like this:
import scala.collection.mutable.LinkedHashMap
val myMap = LinkedHashMap("a" -> 11, "b" -> 22, "c" -> 33)
val l = myMap.keysIterator.toList
l.indexOf("a") // get index of key a
myMap += ("d" -> 44) // insert a new element
val l2 = myMap.keysIterator.toList
l2.indexOf("a") // index of "a" still remains 0 thanks to LinkedHashMap
l2.indexOf("d") // get index of the newly inserted element
Obviously, inserting elements into a LinkedHashMap is more expensive than into a HashMap.
Deleting an element from the map automatically shifts the following indexes to the left:
myMap -= "b"
val l3 = myMap.keysIterator.toList
l3.indexOf("c") // index of "c" moves from 2 to 1
2. Change your map from (K -> V) to (K -> (index, V)) and generate the index manually while inserting new elements.
class ValueObject(val index: Int, val value: Int)
val myMap = scala.collection.mutable.Map[String, ValueObject]()
myMap += ("a" -> new ValueObject(myMap.size + 1, 11))
myMap("a").index // get index of key a
myMap += ("b" -> new ValueObject(myMap.size + 1, 22))
myMap += ("c" -> new ValueObject(myMap.size + 1, 33))
myMap("c").index // get index of key c
myMap("b").index // get index of key b
Deletion would be expensive if we need indexes with no gaps, since we would have to update all the following keys accordingly. However, key insertion and lookup will be faster.
This problem can be solved efficiently if we know exactly what you need, so please explain if the above solutions don't work for you. (Maybe you don't really need a map to solve your problem.)
Maps do not have indexes, by definition.
What you can do is enumerate the keys in a map, but this is not guaranteed to be stable/repeatable. Add a new element, and it may change the numbers of every other element randomly due to rehashing. If you want a stable mapping of key to index (or the opposite), you will have to store this mapping somewhere, e.g. by serializing the map into a list.
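One way to make that stored mapping concrete is to freeze a key-to-index assignment once and keep it next to the map. A sketch, where sorting the keys is just one way to get a deterministic order:

```scala
val m = Map("a" -> "x", "b" -> "y", "c" -> "z")
// freeze a stable key -> index assignment by sorting the keys once
val keyIndex: Map[String, Int] = m.keys.toList.sorted.zipWithIndex.toMap
// keyIndex("a") == 0, keyIndex("b") == 1, keyIndex("c") == 2
```

The point is that keyIndex is stored separately: adding entries to m later does not silently renumber the indexes you have already handed out.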
Convert the Map[A,B] to an ordered collection, for instance a Seq[(A,B)], which is equipped with an indexOf method. Note for
val m = Map("a" -> "x", "b" -> "y")
that
m.toSeq
Seq[(String, String)] = ArrayBuffer((a,x), (b,y))
As discussed extensively above, this conversion does not guarantee any specific ordering in the resulting collection; the collection can be sorted as necessary.
Alternatively you can index the set of keys in the map, namely for instance,
val idxm = for ( ((k,v),i) <- m.zipWithIndex ) yield ((k,i),v)
Map[(String, Int),String] = Map((a,0) -> x, (b,1) -> y)
where an equivalent to indexOf would be for example
idxm.find { case ((_,i),v) => i == 1 }
Option[((String, Int), String)] = Some(((b,1),y))