Scala RDD mapping

So I have an RDD in Scala which is currently stored as a key-value mapping like the following:
(A, (B,C,D,E))
I was wondering if it is possible to somehow map this to an RDD which stores a key-value mapping like the following:
(A,B)
(A,C)
(A,D)
(A,E)
i.e. is it possible to map the key separately to each of the values?

Found a way to do it. You can use flatMapValues(x => x) to turn them all into key-value pairs rather than one key/array-of-values pair.
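A minimal sketch of that transformation (assuming an existing SparkContext named sc):
// Start from one key paired with a collection of values.
val pairs = sc.parallelize(Seq(("A", Seq("B", "C", "D", "E"))))
// flatMapValues keeps the key and emits one pair per element of the value collection.
val flattened = pairs.flatMapValues(x => x)
flattened.collect().foreach(println)
// (A,B)
// (A,C)
// (A,D)
// (A,E)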

Related

How does Scala's VectorMap work, and how is it different from ListMap?

How does Scala's VectorMap work? It says that it is constant time for lookup.
I think ListMap has to iterate through everything to find an entry. Why would VectorMap be different?
Is it a hash table combined with a vector, where the hash table maps a key to an index in the vector, which holds the entries?
Essentially, yes. It has a regular Map inside that maps keys to (index, value) tuples, where the index points into a Vector of keys, which is only used for in-order access (.head, .tail, .last, .iterator, etc.).
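A small usage sketch (Scala 2.13+, where VectorMap lives in scala.collection.immutable):
import scala.collection.immutable.{ListMap, VectorMap}
// Both preserve insertion order; the difference is the cost of key lookup.
val vm = VectorMap("a" -> 1, "b" -> 2, "c" -> 3)
val lm = ListMap("a" -> 1, "b" -> 2, "c" -> 3)
// VectorMap resolves a key through its internal hash map (effectively constant time),
// while ListMap walks its list of entries (linear time).
println(vm("c")) // 3
println(lm("c")) // 3
// In-order access is served by the internal Vector of keys.
println(vm.keys.toList) // List(a, b, c)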

Spark: How to get the same result using reduceByKey as with groupByKey? Any alternative solution to avoid the shuffle?

I am new to Spark (using version 1.1) and Scala. I am converting my existing Hadoop MapReduce code to Spark using Scala and am a bit lost.
I want my mapped RDD to be grouped by key. When I read online, it's suggested that we should avoid groupByKey and use reduceByKey instead, but when I apply reduceByKey I am not getting the list of values for a given key as expected by my code. Example:
val rdd = sc.parallelize(List(("k1", "v11"), ("k1", "v21"), ("k2", "v21"), ("k2", "v22"), ("k3", "v31") ))
My "values" for actual task are huge, having 300 plus columns in key-values pair
And when I will do group by on common key it will result in shuffle which i want to avoid.
I want something like this as o/p (key, List OR Array of values) from my mapped RDD =>
rdd.groupByKey()
which gives me following Output
(k3,ArrayBuffer(v31))
(k2,ArrayBuffer(v21, v22))
(k1,ArrayBuffer(v11, v21))
But when I use
rdd.reduceByKey((x,y) => x+y)
I get the values concatenated together, like the following. If a pipe ('|') or some other separator character were inserted, e.g. (k2,v21|v22), my problem would be partly solved, but still, having a list would be better coding practice.
(k3,v31)
(k2,v21v22)
(k1,v11v21)
Please help
If you refer to the Spark documentation (http://spark.apache.org/docs/latest/programming-guide.html),
for groupByKey it says:
“When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.”
The word Iterable is very important here: when you get the value as (v21, v22), it is an Iterable.
Further it says:
“Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.”
So what I understand from this is: if you want the returned RDD to have iterable values, use groupByKey, and if you want a single aggregated value such as a sum, use reduceByKey.
Now, if instead of (String, String) tuples => (k1, v1) your RDD had (String, ListBuffer[String]) tuples => (k1, ListBuffer("v1")), then you could have done rdd.reduceByKey((x, y) => x ++= y), as sketched below.
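A minimal sketch of that idea, plus the aggregateByKey variant the documentation points to (assuming the sample data from the question and an existing SparkContext named sc):
import scala.collection.mutable.ListBuffer
val rdd = sc.parallelize(List(("k1", "v11"), ("k1", "v21"), ("k2", "v21"), ("k2", "v22"), ("k3", "v31")))
// Wrap each value in a ListBuffer so that reduceByKey can merge collections per key.
val grouped = rdd.mapValues(v => ListBuffer(v)).reduceByKey((x, y) => x ++= y)
grouped.collect().foreach(println)  // order may vary
// (k3,ListBuffer(v31))
// (k2,ListBuffer(v21, v22))
// (k1,ListBuffer(v11, v21))
// aggregateByKey avoids allocating a one-element buffer for every record:
val grouped2 = rdd.aggregateByKey(ListBuffer.empty[String])(
  (buf, v) => buf += v,   // fold a value into a partition-local buffer
  (b1, b2) => b1 ++= b2   // merge buffers coming from different partitions
)
Note that both variants still shuffle the data; the gain over groupByKey is that values are combined on the map side before the shuffle.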

Efficiently take one value for each key out of a RDD[(key,value)]

My starting point is an RDD[(key, value)] in Scala using Apache Spark. The RDD contains roughly 15 million tuples. Each key has roughly 50 ± 20 values.
Now I'd like to take one value (doesn't matter which one) for each key. My current approach is the following:
HashPartition the RDD by the key. (There is no significant skew)
Group the tuples by key, resulting in RDD[(key, array of values)]
Take the first of each value array
Basically looks like this:
...
candidates
.groupByKey()
.map(c => (c._1, c._2.head))
...
The grouping is the expensive part. It is still fast because there is no network shuffle and candidates is in memory, but can I do it faster?
My idea was to work on the partitions directly, but I'm not sure what I get out of the HashPartition. If I take the first tuple of each partition, will I get every key, possibly with multiple tuples per key depending on the number of partitions? Or will I miss keys?
Thank you!
How about reduceByKey with a function that returns the first argument? Like this:
candidates.reduceByKey((x, _) => x)
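A sketch of how that replaces the groupByKey step, assuming candidates is the RDD[(key, value)] from the question:
// Keep the first value seen for each key; unlike groupByKey, values are
// combined on the map side before any data moves between partitions.
val onePerKey = candidates.reduceByKey((x, _) => x)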

Convert RDD[Map[String,Double]] to RDD[(String,Double)]

I did some calculations and returned my values in an RDD containing Scala maps, and now I want to remove this map layer and collect all the key-value pairs in an RDD.
Any help will be appreciated.
You can call flatMap with the identity function to 'flatten' the structure of your RDD.
rdd.flatMap(identity)
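A minimal sketch (assuming an existing SparkContext named sc):
import org.apache.spark.rdd.RDD
val maps: RDD[Map[String, Double]] = sc.parallelize(Seq(Map("a" -> 1.0, "b" -> 2.0), Map("c" -> 3.0)))
// A Map[String, Double] is itself a collection of (String, Double) pairs,
// so flatMap(identity) flattens RDD[Map[String, Double]] into RDD[(String, Double)].
val pairs: RDD[(String, Double)] = maps.flatMap(identity)
pairs.collect().foreach(println)
// (a,1.0)
// (b,2.0)
// (c,3.0)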

How to test if a value is a key of an RDD

I am very new to Spark and Scala, and I want to test whether a value is a key of an RDD.
The data I have is like this:
RDD data: key -> value
RDD stat: key -> statistics
What I want to do is filter all the key-value pairs in data that have their key in stat.
My general idea is to convert the keys of an RDD into a set and then test whether a value belongs to this set.
Are there better approaches, and how do I convert the keys of an RDD into a set using Scala?
Thanks.
You can use lookup:
def lookup(key: K): List[V]
Return the list of values in the RDD for key key. This operation is
done efficiently if the RDD has a known partitioner by only searching
the partition that the key maps to.
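For example (the RDD name data and the key someKey are assumed purely for illustration):
// Returns every value stored in `data` under that single key.
val valuesForKey = data.lookup(someKey)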
You asked -
What I want to do is to filter all the key-value pairs in data that
has the key in stat.
I think you should join by key instead of doing a lookup.
join(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset
of (K, (V, W)) pairs with all pairs of elements for each key. Outer
joins are supported through leftOuterJoin, rightOuterJoin, and
fullOuterJoin.
You cannot "close over an RDD inside another RDD", i.e. use an RDD inside the transformations (in this case filter) of another RDD. Nesting of one RDD inside another is not allowed in Spark.
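A hedged sketch of the join-based approach (the names data and stat come from the question; sc is an assumed SparkContext):
// join keeps only the keys present in both RDDs, which effectively filters
// `data` down to the keys that also appear in `stat`.
val filtered = data.join(stat).mapValues { case (value, _) => value }
// Alternative: if `stat` is small, collect and broadcast its keys,
// then filter `data` with an ordinary closure (no RDD nesting involved).
val statKeys = sc.broadcast(stat.keys.collect().toSet)
val filtered2 = data.filter { case (k, _) => statKeys.value.contains(k) }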