I am looking for example code that implements a nested loop in Spark, with the following functionality.
Given an RDD data1 = sc.parallelize(range(10)) and another dataset data2 = sc.parallelize(['a', 'b', 'c']), I want to pick each 'key' from data2 and append each 'value' from data1 to create a list of key-value pairs that, perhaps in internal memory, looks something like [(a, 1), (a, 2), (a, 3), ..., (c, 8), (c, 9)], and then do a reduceByKey using a simple reducer function, say lambda x, y: x + y.
From the logic described above, the expected output is
(a, 45)
(b, 45)
(c, 45)
My attempt
data1 = sc.parallelize(range(10))
data2 = sc.parallelize(['a', 'b', 'c'])
f = lambda x: data2.map(lambda y: (y, x))
data1.map(f).reduceByKey(lambda x, y: x+y)
The error I get:
Exception: It appears that you are attempting to broadcast an RDD or
reference an RDD from an action or transformation. RDD transformations
and actions can only be invoked by the driver, not inside of other
transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x)
is invalid because the values transformation and count action cannot be
performed inside of the rdd1.map transformation. For more information,
see SPARK-5063.
I am a complete newbie at this, so any help is highly appreciated!
OS Information
I am running this on a standalone Spark installation on Linux. Details are available if relevant.
Here is a potential solution. I am not too happy with it, though, because it doesn't represent a true for loop.
data1 = sc.parallelize(range(10))
data2 = sc.parallelize(['a', 'b', 'c'])
data2.cartesian(data1).reduceByKey(lambda x, y: x+y).collect()
gives
[('a', 45), ('c', 45), ('b', 45)]
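If you want something closer to an explicit nested loop, another option (a sketch, assuming data2 is small enough to collect to the driver) is to materialize its elements locally and flatMap over data1:
keys = data2.collect()                                    # ['a', 'b', 'c'], assumed small
pairs = data1.flatMap(lambda v: [(k, v) for k in keys])   # the inner "for" loop
pairs.reduceByKey(lambda x, y: x + y).collect()
# [('a', 45), ('b', 45), ('c', 45)]  (order may vary)
This keeps data1 distributed and only ships the small list of keys to the executors.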
Related
case class Tomato(name:String, rank:Int)
case class Potato(..)
I have Spark 2.4 and a Dataset[(Tomato, Potato)] that I want to groupBy name and get the topK ranks from.
Issue is that groupBy produces an iterator which is not sortable and iterator.toList explodes on large datasets.
Iterator solution:
data.groupByKey{ case (tomato,_) => tomato.name }
.flatMapGroups((k,it)=>it.toList.sortBy(_.rank).take(topK))
I've also tried aggregation functions, but I could not find a topK or firstK, only first and last.
Another thing I hate about aggregation functions is that they convert the dataset to a dataframe (yuck) so all the types are gone.
Aggregation Fn solution syntax made up by me:
data.agg(row_number.over(Window.partitionBy("_1.name").orderBy("_1.rank").take(topK))
There are already several questions on SO that ask for groupBy followed by some other operation, but none want to sort by a key different from the groupBy key and then get the topK.
You could go the iterator route without having to create a full list, which indeed explodes with big datasets. Something like:
import spark.implicits._
import scala.util.Sorting
case class Tomato(name:String, rank:Int)
case class Potato(taste: String)
case class MyClass(tomato: Tomato, potato: Potato)
val ordering = Ordering.by[MyClass, Int](_.tomato.rank)
val ds = Seq(
(MyClass(Tomato("tomato1", 1), Potato("tasty"))),
(MyClass(Tomato("tomato1", 2), Potato("tastier"))),
(MyClass(Tomato("tomato2", 2), Potato("tastiest"))),
(MyClass(Tomato("tomato3", 2), Potato("yum"))),
(MyClass(Tomato("tomato3", 4), Potato("yummier"))),
(MyClass(Tomato("tomato3", 50), Potato("yummiest"))),
(MyClass(Tomato("tomato7", 50), Potato("yam")))
).toDS
val k = 2
val output = ds
.groupByKey{
case MyClass(tomato, potato) => tomato.name
}
.mapGroups(
(name, iterator)=> {
val topK = iterator.foldLeft(Seq.empty[MyClass]){
(accumulator, element) => {
val newAccumulator = accumulator :+ element
if (newAccumulator.length > k)
newAccumulator.sorted(ordering).drop(1)
else
newAccumulator
}
}
(name, topK)
}
)
output.show(false)
+-------+--------------------------------------------------------+
|_1 |_2 |
+-------+--------------------------------------------------------+
|tomato7|[[[tomato7, 50], [yam]]] |
|tomato2|[[[tomato2, 2], [tastiest]]] |
|tomato1|[[[tomato1, 1], [tasty]], [[tomato1, 2], [tastier]]] |
|tomato3|[[[tomato3, 4], [yummier]], [[tomato3, 50], [yummiest]]]|
+-------+--------------------------------------------------------+
So as you can see, for each Tomato.name key, we're keeping the k elements with the largest Tomato.rank values. You get a Dataset[(String, Seq[MyClass])] as a result.
This is not really optimized for performance: for each group, we iterate over all of its elements and repeatedly sort the sequence, which can become computationally expensive. But it all depends on the size of your actual case classes, the size of your data, your requirements, ...
Hope this helps!
Issue is that groupBy produces an iterator which is not sortable and iterator.toList explodes on large datasets.
What you could do is come up with a topK() method that takes parameters k, an Iterator[A] and an A => B mapping, and returns an Iterator[A] of the top k elements (sorted by the value of type B) -- all without having to sort the entire iterator:
def topK[A, B : Ordering](k: Int, iter: Iterator[A], f: A => B): Iterator[A] = {
val orderer = implicitly[Ordering[B]]
import orderer._
val listK = iter.take(k).toList
iter.foldLeft(listK.sortWith(f(_) > f(_))){ (lsK, x) =>
if (f(x) < f(lsK.head))
(x :: lsK.tail).sortWith(f(_) > f(_))
else
lsK
}.reverse.iterator
}
Note that topK() only involves iterative sorting of lists of size k, with the assumption k is small compared with the size of the input iterator. If necessary, it could be further optimized to eliminate the sorting of the k-elements lists by only making its first element the largest element while leaving the rest of the lists unsorted.
Using your groupByKey approach, method topK() can be plugged in within flatMapGroups as shown below:
case class T(name: String, rank: Int)
case class P(name: String, rank: Int)
val ds = Seq(
(T("t1", 4), P("p1", 1)),
(T("t1", 5), P("p2", 2)),
(T("t1", 1), P("p3", 3)),
(T("t1", 3), P("p4", 4)),
(T("t1", 2), P("p5", 5)),
(T("t2", 4), P("p6", 6)),
(T("t2", 2), P("p7", 7)),
(T("t2", 6), P("p8", 8))
).toDF("tomato", "potato").as[(T, P)]
val k = 3
ds.
groupByKey{ case (tomato, _) => tomato.name }.
flatMapGroups((_, it) => topK[(T, P), Int](k, it, { case (t, p) => t.rank })).
show
/*
+-------+-------+
| _1| _2|
+-------+-------+
|{t1, 1}|{p3, 3}|
|{t1, 2}|{p5, 5}|
|{t1, 3}|{p4, 4}|
|{t2, 2}|{p7, 7}|
|{t2, 4}|{p6, 6}|
|{t2, 6}|{p8, 8}|
+-------+-------+
*/
I was practicing foldByKey, trying to generate tuples in the output.
I have some input in the form:
x = sc.parallelize([[1,2],[3,4],[5,6],[1,1],[1,3],[3,2],[3,6]])
Converting it to a paired rdd:
x2 = x.map(lambda y: (y[0],y[1]))
I want two values for each key in the input: one is the sum of all values belonging to each key, and the other is the count of elements for each key.
So, the output should be something like this:
[(1,(6,3)),(3,(12,3)),(5,(6,1))]
I have tried code for this as:
x3 = x2.foldByKey((0,0), lambda acc,x: (acc[0] + x,acc[1] + 1))
But, I am getting this error:
TypeError: unsupported operand type(s) for +: 'int' and 'tuple'
I don't understand how acc[0] and acc[1] are tuples. They should be integers.
I was getting this error because foldByKey's return type must be the same as the value type of the input RDD (by definition). I passed an RDD with integer values to foldByKey but wanted a tuple back. What I was trying to achieve can instead be done with aggregateByKey(), because it can return a different type than its input value type.
If I first turn the values into (value, 1) tuples, foldByKey gives the correct output:
x2 = x.map(lambda y: (y[0], (y[1], 1)))
x3 = x2.foldByKey((0, 0), lambda acc, x: (acc[0] + x[0], acc[1] + x[1]))
x3.collect()
[(1, (6, 3)), (5, (6, 1)), (3, (12, 3))]
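For completeness, here is a minimal aggregateByKey sketch (the variable names are my own) that produces the desired (sum, count) pairs directly from the plain (key, value) RDD:
pairs = x.map(lambda y: (y[0], y[1]))             # plain (key, value) pairs
sums_and_counts = pairs.aggregateByKey(
    (0, 0),                                       # zero value: (sum, count)
    lambda acc, v: (acc[0] + v, acc[1] + 1),      # fold one value into an accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1])       # merge two accumulators
)
sums_and_counts.collect()
# [(1, (6, 3)), (3, (12, 3)), (5, (6, 1))]  (order may vary)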
Please feel free to provide suggestions.
I have the following Map after doing a groupBy and then partition/sliding on a List of Lists. Now I'm only interested in the values of the map; the keys are irrelevant. Basically, I'm trying to extract the subset of Lists after the groupBy and sliding/partition and perform additional map and reduce functions on them.
var sectionMap : Map[Int,List[List[Any]]] = Map(
1 -> List(List(1,20,"A"), List(1,40,"B")),
2 -> List(List(2,30,"A"), List(2,80,"F")),
3 -> List(List(3,80,"B"))
)
I used sectionMap.values, but it returned an Iterable[List[List[Any]]], whereas I want a List[List[Any]]. Is there a one-step function I can apply to achieve the following result?
List(
List(1,20,"A"),
List(1,40,"B"),
List(2,30,"A"),
List(2,80,"F"),
List(3,80,"B")
)
You can use sectionMap.values.flatten.toList.
flatten converts types like Seq[Seq[T]] to Seq[T], and toList converts the Iterable to a List.
You need to do map.values, which gives you the values as an Iterable. Since the values are Lists of Lists, you get Iterable(List(List(1,20,"A")), ...): Iterable[List[List[Any]]], so you can apply flatten to turn it into Iterable(List(1,20,"A"), ...): Iterable[List[Any]].
If you want it to be a List[List[Any]], call .toList after flatten.
you can use:
sectionMap.values.flatten
//output List(List(1, 20, A), List(1, 40, B), List(2, 30, A), List(2, 80, F), List(3, 80, B))
Use the map, flatMap, or collect method on sectionMap as below:
sectionMap.map(_._2).flatten.toList
sectionMap.flatMap(_._2).toList
sectionMap.collect{case (x,y) => y}.flatten.toList
You can also use flatMap:
sectionMap.flatMap{ case (_, x) => x }.toList
It combines flattening and extraction into the same iteration.
I have an RDD like:
[(1, "Western"),
(1, "Western")
(1, "Drama")
(2, "Western")
(2, "Romance")
(2, "Romance")]
I wish to count, per userID, the occurrences of each movie genre, resulting in
1, { ("Western", 2), ("Drama", 1) } ...
After that, my intention is to pick the genre with the largest count, thus obtaining the most popular genre per user.
I have tried userGenre.sortByKey().countByValue(), but to no avail; I have no clue how to perform this task. I'm using PySpark in a Jupyter notebook.
EDIT:
I have tried the following and it seems to have worked; could someone confirm?
userGenreRDD.map(lambda x: (x, 1)).aggregateByKey(
    0,                       # initial value for an accumulator
    lambda r, v: r + v,      # function that adds a value to an accumulator
    lambda r1, r2: r1 + r2   # function that merges/combines two accumulators
)
Here is one way of doing it.
rdd = sc.parallelize([('u1', "Western"),('u2', "Western"),('u1', "Drama"),('u1', "Western"),('u2', "Romance"),('u2', "Romance")])
The occurrences of each movie genre per user:
>>> genre_counts = sc.parallelize(rdd.countByValue().items())
>>> genre_counts.map(lambda kv: (kv[0][0], (kv[0][1], kv[1]))).groupByKey().mapValues(list).collect()
[('u1', [('Western', 2), ('Drama', 1)]), ('u2', [('Western', 1), ('Romance', 2)])]
Most popular genre:
>>> genre_counts.map(lambda kv: (kv[0][0], (kv[0][1], kv[1]))).reduceByKey(lambda a, b: a if a[1] >= b[1] else b).collect()
[('u1', ('Western', 2)), ('u2', ('Romance', 2))]
Now one could ask: what should the most popular genre be if more than one genre has the same popularity count?
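One simple option (purely my own choice of tie-break) is to make the reduce deterministic, preferring the higher count and, on equal counts, the alphabetically first genre name:
genre_counts.map(lambda kv: (kv[0][0], (kv[0][1], kv[1]))) \
    .reduceByKey(lambda a, b: min(a, b, key=lambda t: (-t[1], t[0]))) \
    .collect()
# [('u1', ('Western', 2)), ('u2', ('Romance', 2))]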
Suppose I have this RDD:
val r = sc.parallelize(Array(1,4,2,3))
What I want to do is create a mapping. e.g:
r.map(val => val + func(all other elements in r)).
Is this even possible?
It's very likely that you will get an exception, e.g. the one below.
rdd = sc.parallelize(range(100))
rdd = rdd.map(lambda x: x + sum(rdd.collect()))
i.e. you are trying to broadcast the RDD / reference it from within its own transformation, therefore:
Exception: It appears that you are attempting to broadcast an RDD or
reference an RDD from an action or transformation. RDD transformations
and actions can only be invoked by the driver, not inside of other
transformations; for example, rdd1.map(lambda x: rdd2.values.count() *
x) is invalid because the values transformation and count action
cannot be performed inside of the rdd1.map transformation. For more
information, see SPARK-5063.
To achieve this you would have to do something like this:
res = sc.broadcast(rdd.reduce(lambda a,b: a + b))
rdd = rdd.map(lambda x: x + res.value)
Spark already supports gradient descent. Maybe you can take a look at how they implemented it.
I don't know if there is a more efficient alternative, but I would first create some structure like:
rdd = sc.parallelize([(1, [4, 2, 3]), (4, [1, 2, 3]), (2, [1, 4, 3]), (3, [1, 4, 2])])
rdd = rdd.map(lambda pair: pair[0] + func(pair[1]))
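If you don't want to hard-code that structure, here is a sketch of one way to build it from the RDD itself (my own approach, and only reasonable when the RDD is small, since it uses a cartesian product); func is the function from the question:
r = sc.parallelize([1, 4, 2, 3])
others = (r.cartesian(r)
           .filter(lambda pair: pair[0] != pair[1])  # drop the element itself (assumes distinct values)
           .groupByKey()
           .mapValues(list))                         # (element, [all other elements])
result = others.map(lambda kv: kv[0] + func(kv[1]))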