Functional opposite of subtractByKey - Scala

I have two RDDs of the form RDD1[K, V1] and RDD2[K, V2]. I want to remove the entries of RDD2 whose keys do not appear in RDD1. (Essentially an inner join on the RDDs' keys, but I don't want to copy RDD1's values into the result.)
I understand that there is a method subtractByKey which does the opposite of this: it keeps only the entries whose keys are not present in the other RDD.

You cannot avoid having some kind of value on the RDD1 side here, so joining and then mapping the values seems to be the way to go. You can use:
rdd2.join(rdd1.mapValues(_ => None)).mapValues(_._1)
This first replaces RDD1's values with dummies (you can usually skip that step, because there is not much to gain unless the values are large):
_.mapValues(_ => None)
then joins, and finally drops the placeholders:
_.mapValues(_._1)
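
For concreteness, a minimal sketch of that pipeline on toy data (sc is an existing SparkContext; the sample contents are made up):

val rdd1 = sc.parallelize(Seq(("a", 1), ("b", 2)))      // RDD1[K, V1]
val rdd2 = sc.parallelize(Seq(("a", "x"), ("c", "y")))  // RDD2[K, V2]
// Keep only the rdd2 entries whose key also occurs in rdd1,
// without carrying rdd1's values along.
val kept = rdd2.join(rdd1.mapValues(_ => None)).mapValues(_._1)
// kept contains ("a", "x") only.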

Related

Scala RDD mapping

So I have an RDD in Scala which is currently stored as a key-value mapping like the following:
(A, (B,C,D,E))
I was wondering if it is possible to somehow map this to an RDD which stores a key-value mapping like the following:
(A,B)
(A,C)
(A,D)
(A,E)
i.e. is it possible to make the key separately map to everything?
Found a way to do it. You can use flatMapValues(x => x) to turn the single key/collection pair into one key-value pair per element, rather than one key mapped to the whole collection.
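
A small sketch of what that looks like (sc is an existing SparkContext; the sample data mirrors the example above):

val rdd = sc.parallelize(Seq(("A", Seq("B", "C", "D", "E"))))
// flatMapValues expands each (key, collection) pair into one pair per element,
// keeping the key.
val expanded = rdd.flatMapValues(x => x)
// expanded: ("A","B"), ("A","C"), ("A","D"), ("A","E")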

How does scala's VectorMap work and how is it different than ListMap?

How does scala's VectorMap work? It says that it is constant time for look up.
I think ListMap has to iterate through everything to find an entry. Why would VectorMap be different?
Is it a hash table combined with a vector, where the hash table will map a key to an index in the vector, which has the entries?
Essentially, yes. It has a regular Map inside that maps keys to (index, value) tuples, where the index points into a Vector of keys, which is only used for in-order access (.head, .tail, .last, .iterator, etc.).
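
A small illustration (Scala 2.13+, scala.collection.immutable.VectorMap):

import scala.collection.immutable.VectorMap

val vm = VectorMap("a" -> 1, "b" -> 2, "c" -> 3)
vm("b")    // 2 - served by the internal map, no scan over the entries
vm.head    // ("a", 1) - iteration follows insertion order via the Vector
vm.toList  // List(("a", 1), ("b", 2), ("c", 3))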

Efficiently take one value for each key out of a RDD[(key,value)]

My starting point is an RDD[(key, value)] in Scala using Apache Spark. The RDD contains roughly 15 million tuples. Each key has roughly 50 ± 20 values.
Now I'd like to take one value (doesn't matter which one) for each key. My current approach is the following:
Hash-partition the RDD by the key. (There is no significant skew.)
Group the tuples by key, resulting in RDD[(key, array of values)]
Take the first of each value array
Basically looks like this:
...
candidates
  .groupByKey()
  .map(c => (c._1, c._2.head))
...
The grouping is the expensive part. It is still fast because there is no network shuffle and candidates is already in memory, but can I do it faster?
My idea was to work on the partitions directly, but I'm not sure what I get out of the hash partitioning. If I take the first tuple of each partition, I will get every key, but maybe multiple tuples for a single key depending on the number of partitions? Or will I miss keys?
Thank you!
How about reduceByKey with a function that returns the first argument? Like this:
candidates.reduceByKey((x, _) => x)
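
For illustration, a tiny example of what that produces (sc is an existing SparkContext; the data is made up). Unlike groupByKey, reduceByKey merges values pairwise within each partition before shuffling, so no per-key collection is ever materialized:

val sample = sc.parallelize(Seq((1, "a"), (1, "b"), (2, "c")))
// Keep an arbitrary value per key - whichever one the reduce sees first.
val onePerKey = sample.reduceByKey((x, _) => x)
// onePerKey: (1, "a") or (1, "b") for key 1, and (2, "c") for key 2.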

Spark in Scala: How to avoid linear scan for searching a key in each partition?

I have one huge key-value dataset named A, and a set of keys named B as queries. My task is, for each key in B, to return whether the key exists in A and, if it does, to return the value.
I partition A with HashPartitioner(100) first. Currently I can use A.join(B') to solve it, where B' = B.map(x=>(x,null)). Or we can use A.lookup() for each key in B.
However, the problem is that both join and lookup on a pair RDD do a linear scan within each partition. This is too slow. Ideally, each partition would hold a HashMap, so that a key could be found within its partition in O(1). The ideal strategy would be: when the master machine receives a bunch of keys, it assigns each key to its corresponding partition, then the partition uses its HashMap to find the keys and returns the results to the master machine.
Is there an easy way to achieve it?
One potential way:
Searching online, I found a similar question here:
http://mail-archives.us.apache.org/mod_mbox/spark-user/201401.mbox/%3CCAMwrk0kPiHoX6mAiwZTfkGRPxKURHhn9iqvFHfa4aGj3XJUCNg#mail.gmail.com%3E
Following it, I built a HashMap for each partition using the following code:
import scala.collection.mutable.HashMap

val hashpair = A.mapPartitions(iterator => {
  val hashmap = new HashMap[Long, Double]
  iterator.foreach { case (key, value) => hashmap.getOrElseUpdate(key, value) }
  Iterator(hashmap)
})
Now I get 100 HashMaps (if I have 100 partitions for data A). Here I'm lost. I don't know how to issue queries, i.e. how to use hashpair to look up the keys in B, since hashpair is not a regular pair RDD. Do I need to implement a new RDD and implement RDD methods for hashpair? If so, what is the easiest way to implement join or lookup for hashpair?
Thanks all.
You're probably looking for the IndexedRDD:
https://github.com/amplab/spark-indexedrdd
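
A rough usage sketch, loosely following the project's README (check the package and method names against the library version you use; A is the pair RDD from the question and is assumed to have Long keys):

import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD
import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD._

// Build an indexed, hash-partitioned copy of A once; point lookups are then
// served by a per-partition index instead of a linear scan.
val indexedA = IndexedRDD(A).cache()

indexedA.get(42L)               // Some(value) if the key exists in A, else None
indexedA.multiget(B.collect())  // batched lookup for the query keys in B

Alternatively, staying with the per-partition HashMaps built in the question: since A was partitioned with HashPartitioner(100), B can be routed through the same partitioner and each partition's map probed with zipPartitions. A sketch of that idea, not a drop-in solution:

import org.apache.spark.HashPartitioner

// Route each query key of B to the partition that holds its HashMap.
val bPart = B.map(k => (k, ())).partitionBy(new HashPartitioner(100))

// Probe the map of each partition; emits (key, value) only for keys found in A.
val results = hashpair.zipPartitions(bPart) { (mapIter, keyIter) =>
  if (mapIter.hasNext) {
    val hashmap = mapIter.next()
    keyIter.flatMap { case (k, _) => hashmap.get(k).map(v => (k, v)) }
  } else Iterator.empty
}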

getting number of values within reduceByKey RDD

When the reduceByKey operation is called, it receives the list of values of a particular key. My questions are:
1. Are the values it receives in sorted order?
2. Is it possible to know how many values it receives?
3. I'm trying to calculate the first quartile of the values of a key within reduceByKey. Is this possible to do within reduceByKey?
1. No, that would go totally against the whole point of a reduce operation - i.e. to parallelize an operation into an arbitrary tree of sub-operations by taking advantage of associativity and commutativity.
2. You'll need to define a new monoid by composing the integer monoid with whatever it is you're doing. Assuming your operation is op, then
yourRdd.map(kv => (kv._1, (kv._2, 1)))
  .reduceByKey((left, right) => (left._1 op right._1, left._2 + right._2))
will give you an RDD[(KeyType, (ReducedValueType, Int))] where the Int will be the number of values the reduce received for each key.
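For example, with op taken to be addition (so the reduce sums the values while also counting them; the names here are only illustrative):

// Sum the values per key and count how many values went into each sum.
val summedWithCounts = yourRdd
  .map(kv => (kv._1, (kv._2, 1)))
  .reduceByKey((left, right) => (left._1 + right._1, left._2 + right._2))
// summedWithCounts: RDD[(KeyType, (sum, count))]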
3. You'll have to be more specific about what you mean by first quartile. Given that the answer to 1 is no, you would need a bound that defines the first quartile; then the data doesn't need to be sorted, because you can filter the values against that bound.
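
For instance, if such a bound q1 is already known, a sketch of that idea is to count per key how many values fall at or below it, without any sorting (q1 and the numeric value type are assumptions, not part of the original answer):

val q1 = 10.0  // assumed, externally supplied bound for the first quartile
val withinBoundCounts = yourRdd
  .filter(_._2 <= q1)  // keep only the values at or below the bound
  .mapValues(_ => 1)
  .reduceByKey(_ + _)  // per-key count of values <= q1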