This seems like a simple problem, but I can't figure it out.
Given an RDD:
Input: [1, 5, 3, 2, 7]
Output: [(1,5), (5,3), (3,2), (2,7)]
I've tried this, but it fails with an obvious error:
rdd.map(lambda x,y: (x,y))
I'm assuming I need a helper function of some sort.
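One possible approach, sketched here in Scala (the same zipWithIndex-and-join idea carries over directly to PySpark), is to key each element by its position and join the RDD with a copy of itself shifted by one position:
val rdd = sc.parallelize(Seq(1, 5, 3, 2, 7))
// Key every element by its position in the RDD.
val indexed = rdd.zipWithIndex().map { case (v, i) => (i, v) }
// Shift the keys down by one so element i lines up with element i + 1.
val shifted = indexed.map { case (i, v) => (i - 1, v) }
// Joining on the index pairs each element with its successor.
indexed.join(shifted).sortByKey().values.collect()
// Array((1,5), (5,3), (3,2), (2,7))
If the spark.mllib package is on the classpath, its RDDFunctions.sliding(2) helper builds the same two-element windows more directly.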
Related
I have this contrived example:
val rdd = sc.parallelize(List(("A", List(1, 1)),
                              ("B", List(2, 2, 2, 200)),
                              ("C", List(3, 3))))
and can do this to tally the overall sum of the RDD:
rdd.map(_._2.sum).sum
or
rdd.flatMapValues(identity).values.sum
Can I compute the overall sum in a single step, taking into account that the values are a List, Array, etc.? Or are these two approaches the basics of overall summing, and does it necessarily have to be a two-step process?
As far as I understand, both of your solutions are right.
There are some other options, however. For instance, here is an elegant way of doing the same thing:
rdd.flatMap(_._2).sum
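For the sample data above, all three expressions return the same overall total, 214.0 (1 + 1 + 2 + 2 + 2 + 200 + 3 + 3; sum on an RDD of numbers yields a Double).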
I am programming Spark in Scala on a standalone machine (a PC running Windows 10). I am a newbie with no experience in programming in Scala and Spark, so I will be very thankful for the help.
Problem:
I have a HashMap, hMap1, whose values are HashSets of Integer entries. I then store its values (i.e., the many HashSet values) in an RDD. The code is below:
val rdd1 = sc.parallelize(Seq(hMap1.values()))
Now I have another HashMap, hMap2, of the same type. Its values are also stored in an RDD:
val rdd2 = sc.parallelize(Seq(hMap2.values()))
I want to know how I can intersect the values of hMap1 and hMap2.
For example:
Input:
data in rdd1 = [2, 3], [1, 109], [88, 17]
data in rdd2 = [2, 3], [1, 109], [5, 45]
Output:
[2, 3], [1, 109]
Problem statement
My understanding of your question is the following:
Given two RDDs of type RDD[Set[Integer]], how can I produce an RDD of their common records?
Sample data
Two RDDs generated by
val rdd1 = sc.parallelize(Seq(Set(2, 3), Set(1, 109), Set(88, 17)))
val rdd2 = sc.parallelize(Seq(Set(2, 3), Set(1, 109), Set(5, 45)))
Possible solution
If my understanding of the problem statement is correct and your RDDs are indeed of that type, you can use rdd1.intersection(rdd2). This is what I tried in a spark-shell with Spark 2.2.0:
rdd1.intersection(rdd2).collect
which yielded the output:
Array(Set(2, 3), Set(1, 109))
This works because Spark can compare elements of type Set[Integer], but note that it does not generalise to an arbitrary Set[MyObject] unless you have defined the equality contract for MyObject.
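As a side note, sc.parallelize(Seq(hMap1.values())) produces an RDD with a single element (the whole values collection) rather than one element per HashSet. Assuming hMap1 and hMap2 are java.util.HashMaps whose values are java.util.HashSet[Integer], here is a rough sketch of one way to build an RDD[Set[Int]] from each of them so that the intersection above applies:
import scala.collection.JavaConverters._
// Convert each java.util.HashSet value into a Scala Set[Int] and
// parallelize the resulting sequence, one set per RDD element.
def valuesToRdd[K](hMap: java.util.HashMap[K, java.util.HashSet[Integer]]) =
  sc.parallelize(hMap.values().asScala.toSeq.map(_.asScala.map(_.intValue).toSet))
val rdd1 = valuesToRdd(hMap1)
val rdd2 = valuesToRdd(hMap2)
rdd1.intersection(rdd2).collect()   // Array(Set(2, 3), Set(1, 109)), order may vary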
I'm trying to generate undirected edges from the elements of a Set. Given Set(1, 4, 5), for example, the result must look like this:
(1,4)
(1,5)
(4,5)
Any solution will be much appreciated.
Here is a simple example using subsets and filter:
val set = Set(1,4,5)
val result = set.subsets().filter(_.size == 2).map(_.toList).toList
Output:
List(1, 4)
List(1, 5)
List(4, 5)
If you want the result as a list of tuples, you can convert it as follows:
result.map(list => (list(0), list(1)))
Output:
(1,4)
(1,5)
(4,5)
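Alternatively, combinations(2) on a list of the elements yields the pairs directly, without enumerating every subset (a small sketch using the same set):
// combinations(2) enumerates each 2-element selection exactly once.
Set(1, 4, 5).toList.combinations(2).map { case List(a, b) => (a, b) }.toList
// List((1,4), (1,5), (4,5)); element order follows the set's iteration order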
Hope this helps!
I have an RDD with the values:
a,b
a,c
a,d
b,a
c,d
d,c
d,e
What I need is an RDD that contains the reciprocal pairs, but keeps only one of each. It would have to be:
a,b or b,a
c,d or d,c
I was thinking they could be added to a list and looped through to find the opposite pair; if one exists, filter the first value out and delete the reciprocal pair. I am thinking there must be a way of using Scala functions like join or case, but I am having difficulty understanding them.
If you don't mind the order within each pair changing (e.g., (a,b) becoming (b,a)), there is a simple solution that is easy to parallelise. The examples below use numbers, but the pairs can be anything, as long as the values are comparable.
In vanilla Scala:
List(
  (2, 1),
  (3, 2),
  (1, 2),
  (2, 4),
  (4, 2)).map { case (a, b) => if (a > b) (a, b) else (b, a) }.toSet
This will result in:
res1: Set[(Int, Int)] = Set((2, 1), (3, 2), (4, 2))
With a Spark RDD, the same thing can be expressed as:
sc.parallelize((2, 1) :: (3, 2) :: (1, 2) :: (2, 4) :: (4, 2) :: Nil)
  .map { case (a, b) => if (a > b) (a, b) else (b, a) }
  .distinct()
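Calling collect() on this RDD returns the same three pairs as the vanilla Scala Set above, Array((2,1), (3,2), (4,2)), although the ordering of the result after distinct() is not guaranteed.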
The following program fails in the zip step.
x = sc.parallelize([1, 2, 3, 1, 2, 3])
y = sc.parallelize([1, 2, 3])
z = x.distinct()
print(z.zip(y).collect())
The error that is produced depends on whether multiple partitions have been specified or not.
I understand that
the two RDDs [must] have the same number of partitions and the same number of elements in each partition.
What is the best way to work around this restriction?
I have been performing the operation with the following code, but I am hoping to find something more efficient.
def safe_zip(left, right):
    # Key both RDDs by position, join on the index, then restore the original order.
    ix_left = left.zipWithIndex().map(lambda row: (row[1], row[0]))
    ix_right = right.zipWithIndex().map(lambda row: (row[1], row[0]))
    return ix_left.join(ix_right).sortByKey().values()
I think this could be accomplished by using cartesian() on your RDDs:
import pyspark
x = sc.parallelize([1, 2, 3, 1, 2, 3])
y = sc.parallelize([1, 2, 3])
x.distinct().cartesian(y.distinct()).collect()