Convert RDD[Map[String,Double]] to RDD[(String,Double)] - scala

I did some calculations and my results came back as an RDD containing Scala Maps; now I want to get rid of the Map wrapper and collect all the key-value pairs into an RDD.
Any help will be appreciated.

You can call flatMap with the identity function to 'flatten' the structure of your RDD.
rdd.flatMap(identity)
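For example, a minimal sketch (the sample values are illustrative, and sc is assumed to be a SparkContext):
// Each Map[String, Double] is an Iterable[(String, Double)], so
// flatMap(identity) unrolls it into individual key-value pairs.
val rdd = sc.parallelize(Seq(Map("a" -> 1.0, "b" -> 2.0), Map("c" -> 3.0)))
val pairs = rdd.flatMap(identity)   // RDD[(String, Double)]
pairs.collect().foreach(println)    // (a,1.0) (b,2.0) (c,3.0)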

Related

Split a tuple row into two rows in an RDD

I am trying to split a tuple of ints into two rows in an RDD.
vertices = edges.map(lambda x: (x[0],)).union(edges.map(lambda x: (x[1],))).distinct()
This code works, but I would like something that runs faster, without using the GraphFrames package.
You can use flatMap:
edges.flatMap(lambda x: x).distinct()
In Scala, a Tuple2 is not itself iterable, so the equivalent is something like .flatMap { case (a, b) => Seq(a, b) } (sketched below).
If you use the DataFrame API, you can just use explode on your only column, e.g. df.select(explode("edge")).
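A short Scala sketch of the flatMap approach (the edge values and the sc handle are illustrative):
// A Tuple2 has to be turned into a collection before flatMap can unroll it.
val edges = sc.parallelize(Seq((1, 2), (2, 3), (1, 3)))
val vertices = edges.flatMap { case (src, dst) => Seq(src, dst) }.distinct()
vertices.collect().foreach(println)   // 1, 2, 3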

Scala RDD mapping

So I have an RDD in Scala which is currently stored as a key-value mapping like the following:
(A, (B,C,D,E))
I was wondering if it was possible to somehow map this to an RDD which stores a key-value mapping like the following
(A,B)
(A,C)
(A,D)
(A,E)
i.e. is it possible to map the key separately to each element of the value?
Found a way to do it. You can use flatMapValues(x => x) to turn them all into key-value pairs rather than a single key with a collection of values.
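For example, a minimal sketch, assuming the value side is a collection (the tuple shown in the question would first need to become a Seq):
// flatMapValues keeps the key and emits one (key, element) pair per element.
val rdd = sc.parallelize(Seq(("A", Seq("B", "C", "D", "E"))))
val pairs = rdd.flatMapValues(x => x)
pairs.collect().foreach(println)   // (A,B) (A,C) (A,D) (A,E)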

How to make an RDD from the first n items of another RDD in Spark?

Given an RDD in pyspark, I would like to make a new RDD which only contains (a copy of) its first n items, something like:
n=100
rdd2 = rdd1.limit(n)
except RDD does not have a method limit(), like DataFrame does.
Note that I do not want to collect the result, the result must still be an RDD, therefore I cannot use RDD.take().
I am using pyspark 2.4.4.
You can convert the RDD to a DataFrame, apply limit, and convert it back:
rdd1.toDF().limit(n).rdd
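The same round trip in Scala might look like the sketch below (the column names and the spark/sc handles are illustrative). Note that after .rdd the elements are Row objects, so one extra map is usually needed to recover the original values:
import spark.implicits._   // needed for toDF on an RDD of tuples
val n = 100
val rdd1 = sc.parallelize(1 to 1000).map(i => (i, i.toString))
val rdd2 = rdd1.toDF("id", "label")
  .limit(n)
  .rdd                                       // RDD[Row]
  .map(r => (r.getInt(0), r.getString(1)))   // unwrap each Row into the original tuple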

Spark: How to get the same result using reduceByKey as with groupByKey? Any alternative solution to avoid the shuffle?

I am new to Spark (version 1.1) and Scala. I am converting my existing Hadoop MapReduce code to Spark using Scala and am a bit lost.
I want my mapped RDD to be grouped by key. Reading online, it is suggested that we should avoid groupByKey and use reduceByKey instead, but when I apply reduceByKey I do not get the list of values per key that my code expects. Example:
val rdd = sc.parallelize(List(("k1", "v11"), ("k1", "v21"), ("k2", "v21"), ("k2", "v22"), ("k3", "v31") ))
My "values" for actual task are huge, having 300 plus columns in key-values pair
And when I will do group by on common key it will result in shuffle which i want to avoid.
I want something like this as o/p (key, List OR Array of values) from my mapped RDD =>
rdd.groupByKey()
which gives me the following output:
(k3,ArrayBuffer(v31))
(k2,ArrayBuffer(v21, v22))
(k1,ArrayBuffer(v11, v21))
But when I use
rdd.reduceByKey((x,y) => x+y)
I get the values concatenated together, as shown below. If a pipe ('|') or some other separator character were there, e.g. (k2,v21|v22), my problem would be partly solved, but having a list would still be better coding practice:
(k3,v31)
(k2,v21v22)
(k1,v11v21)
Please help
If you refer to the Spark documentation http://spark.apache.org/docs/latest/programming-guide.html,
for groupByKey it says:
“When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.”
The word Iterable is very important here: when you get the value as (v21, v22), it is an iterable.
Further it says
“Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.”
So what I understand from this is: if you want the returned RDD to have iterable values, use groupByKey, and if you want a single aggregated value such as a sum, use reduceByKey.
Now, if your tuples were (String, ListBuffer[String]) => (K1, ListBuffer("V1")) instead of (String, String) => (K1, V1), then maybe you could do rdd.reduceByKey((x, y) => x ++= y), as sketched below.
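A minimal sketch of that idea, using the sample data from the question (the aggregateByKey variant at the end is an aside, not part of the original answer):
import scala.collection.mutable.ListBuffer
val rdd = sc.parallelize(List(("k1", "v11"), ("k1", "v21"), ("k2", "v21"), ("k2", "v22"), ("k3", "v31")))
// Wrap each value in a one-element buffer, then merge the buffers per key.
val grouped = rdd.mapValues(v => ListBuffer(v)).reduceByKey((x, y) => x ++= y)
grouped.collect().foreach(println)
// (k1,ListBuffer(v11, v21)) (k2,ListBuffer(v21, v22)) (k3,ListBuffer(v31))
// aggregateByKey builds the buffers without wrapping every record first:
val grouped2 = rdd.aggregateByKey(ListBuffer.empty[String])(_ += _, _ ++= _)
Bear in mind that collecting every value per key still moves roughly the same amount of data across the network as groupByKey; the shuffle only shrinks when values genuinely combine into something smaller, such as a sum.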

How to test if a value is a key of an RDD

I am very new to Spark and Scala, and I want to test if a value is a key from an RDD.
The data I have is like this:
RDD data: key -> value
RDD stat: key -> statistics
What I want to do is to filter all the key-value pairs in data that has the key in stat.
My general idea is to convert the keys of an RDD into a set and then test whether a value belongs to this set.
Are there better approaches, and how do I convert the keys of an RDD into a set using Scala?
Thanks.
You can use lookup:
def lookup(key: K): Seq[V]
Return the list of values in the RDD for key key. This operation is
done efficiently if the RDD has a known partitioner by only searching
the partition that the key maps to.
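For example (a small sketch; the stat RDD contents are illustrative):
val stat = sc.parallelize(Seq(("k1", "stats1"), ("k3", "stats3")))
val values = stat.lookup("k1")   // Seq("stats1"); an empty Seq if the key is absent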
You asked -
What I want to do is to filter all the key-value pairs in data that
has the key in stat.
I think you should join by key instead of doing a lookup.
join(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset
of (K, (V, W)) pairs with all pairs of elements for each key. Outer
joins are supported through leftOuterJoin, rightOuterJoin, and
fullOuterJoin.
Also note that you cannot "close over an RDD inside another RDD", i.e. use an RDD inside the transformations (in this case filter) of another RDD. Nesting one RDD inside another is not allowed in Spark.
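A hedged sketch of the join idea (the RDD names and contents are illustrative, following the data/stat layout in the question):
val data = sc.parallelize(Seq(("k1", 1.0), ("k2", 2.0), ("k3", 3.0)))
val stat = sc.parallelize(Seq(("k1", "stats1"), ("k3", "stats3")))
// join keeps only the keys present in both RDDs; mapValues drops the statistics.
val filtered = data.join(stat).mapValues { case (v, _) => v }
filtered.collect().foreach(println)   // (k1,1.0) (k3,3.0)
// If stat is small, the set idea from the question also works: collect its keys
// once and filter against a broadcast set.
val keySet = sc.broadcast(stat.keys.collect().toSet)
val filteredBySet = data.filter { case (k, _) => keySet.value.contains(k) }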