I was practicing with foldByKey to generate tuples in the output.
I have some input in the form:
x = sc.parallelize([[1,2],[3,4],[5,6],[1,1],[1,3],[3,2],[3,6]])
Converting it to a paired rdd:
x2 = x.map(lambda y: (y[0],y[1]))
I want two values for each key in the input: one is the sum of all the values belonging to that key, and the other is the count of values for that key.
So, the output should be something like this:
[(1,(6,3)),(3,(12,3)),(5,(6,1))]
I have tried code for this as:
x3 = x2.foldByKey((0,0), lambda acc,x: (acc[0] + x,acc[1] + 1))
But, I am getting this error:
TypeError: unsupported operand type(s) for +: 'int' and 'tuple'
I don't understand how acc[0] and acc[1] are tuples. They should be integers.
I was getting this error because foldByKey's return type must be the same as the RDD's value type (by definition). My values are integers, but I wanted a tuple back. What I was trying to achieve can be done with aggregateByKey(), because it can return a different type than the value type of its input RDD.
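For example, a minimal aggregateByKey sketch (reusing the paired RDD x2 from above) should produce the (sum, count) pairs I wanted:
x3 = x2.aggregateByKey(
    (0, 0),                                   # zero value: (sum, count)
    lambda acc, v: (acc[0] + v, acc[1] + 1),  # fold one value into an accumulator within a partition
    lambda a, b: (a[0] + b[0], a[1] + b[1])   # merge accumulators across partitions
)
x3.collect()  # [(1, (6, 3)), (3, (12, 3)), (5, (6, 1))] (order may vary)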
If I instead map each value to a (value, 1) tuple, foldByKey works, because the accumulator and the values now have the same type:
x2 = x.map(lambda y: (y[0], (y[1], 1)))
x3 = x2.foldByKey((0, 0), lambda acc, v: (acc[0] + v[0], acc[1] + v[1]))
x3.collect() then gives [(1, (6, 3)), (3, (12, 3)), (5, (6, 1))].
Please feel free to provide suggestions.
I'm a beginner in Scala and I'm trying to build a function that calculates the total sum of all the numbers contained in a matrix.
I tried this code:
val pixels = Vector(
Vector(0, 1, 2),
Vector(1, 2, 3)
)
def sum(matrix: Vector[Vector[Int]]): Int = {
matrix.reduce((a, b) => a + b)
}
println(sum(pixels))
But I get the following error: "value + is not a member of Vector[Int]"
I would like to have the sum total of all the numbers contained in the matrix as an integer.
Can you help me solve this problem?
Thank you in advance !
You defined matrix as a Vector of Vectors, so the arguments to reduce are two Vectors, not Ints. If you want to sum them all, you need to flatten first to pull the actual Ints out of the inner vectors. Also make sure the function's last expression is the sum itself, not a variable assignment, or it won't return anything:
def sum(matrix: Vector[Vector[Int]]) = matrix.flatten.reduce(_ + _)
I have an RDD like:
[(1, "Western"),
(1, "Western")
(1, "Drama")
(2, "Western")
(2, "Romance")
(2, "Romance")]
I wish to count, per userID, the occurrences of each movie genre, resulting in something like:
(1, { ("Western", 2), ("Drama", 1) }), ...
After that my intention is to pick the one with the largest count and thus obtain the most popular genre per user.
I have tried userGenre.sortByKey().countByValue(), but to no avail; I have no clue how to perform this task. I'm using PySpark in a Jupyter notebook.
EDIT:
I have tried the following and it seems to have worked, could someone confirm?
userGenreRDD.map(lambda x: (x, 1)).aggregateByKey(
    0,                      # initial value for an accumulator
    lambda r, v: r + v,     # function that adds a value to an accumulator
    lambda r1, r2: r1 + r2  # function that merges/combines two accumulators
)
Here is one way of doing it.
rdd = sc.parallelize([('u1', "Western"),('u2', "Western"),('u1', "Drama"),('u1', "Western"),('u2', "Romance"),('u2', "Romance")])
The occurrences of each movie genre per user can be computed like this:
>>> rdd = sc.parallelize(list(rdd.countByValue().items()))
>>> rdd.map(lambda kv: (kv[0][0], (kv[0][1], kv[1]))).groupByKey().mapValues(list).collect()
[('u1', [('Western', 2), ('Drama', 1)]), ('u2', [('Western', 1), ('Romance', 2)])]
Most popular genre per user:
>>> rdd.map(lambda kv: (kv[0][0], (kv[0][1], kv[1]))).reduceByKey(lambda a, b: a if a[1] >= b[1] else b).collect()
[('u1', ('Western', 2)), ('u2', ('Romance', 2))]
Now one could ask: what should the most popular genre be if more than one genre has the same top count?
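One option, sketched below (variable names are illustrative), is to keep every genre that ties for the top count per user, building on the (user, (genre, count)) pairs from above:
genre_counts = rdd.map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))        # (user, (genre, count))
top_counts = genre_counts.mapValues(lambda gc: gc[1]).reduceByKey(max)  # (user, highest count)
tied = (genre_counts.join(top_counts)                                   # (user, ((genre, count), top))
        .filter(lambda kv: kv[1][0][1] == kv[1][1])                     # keep only genres at the top count
        .mapValues(lambda v: v[0])
        .groupByKey().mapValues(list))
tied.collect()  # e.g. [('u1', [('Western', 2)]), ('u2', [('Romance', 2)])]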
I'm rather new to Spark and Scala and have a Java background. I have done some programming in Haskell, so I'm not completely new to functional programming.
I'm trying to accomplish some form of a nested for-loop. I have an RDD which I want to manipulate based on every two elements in the RDD. The pseudocode (Java-like) would look like this:
// some RDD named rdd is available before this
List list = new ArrayList();
for(int i = 0; i < rdd.length; i++){
list.add(rdd.get(i)._1);
for(int j = 0; j < rdd.length; j++){
if(rdd.get(i)._1 == rdd.get(j)._1){
list.add(rdd.get(j)._1);
}
}
}
// Then now let ._1 of the rdd be this list
My scala solution (that does not work) looks like this:
val aggregatedTransactions = joinedTransactions.map( f => {
var list = List[Any](f._2._1)
val filtered = joinedTransactions.filter(t => f._1 == t._1)
for(i <- filtered){
list ::= i._2._1
}
(f._1, list, f._2._2)
})
I'm trying to put item ._2._1 into a list if ._1 of both items is equal.
I am aware that I cannot use a filter or a map function within another map function. I've read that you could achieve something like this with a join, but I don't see how I could actually get these items into a list or any structure that can be used as a list.
How do you achieve an effect like this with RDDs?
Assuming your input has the form RDD[(A, (A, B))] for some types A, B, and that the expected result should have the form RDD[A] - not a List (because we want to keep data distributed) - this would seem to do what you need:
rdd.join(rdd.values).keys
Details:
It's hard to understand the exact input and expected output, as the data structure (type) of neither is explicitly stated, and the requirement is not well explained by the code example. So I'll make some assumptions and hope that it will help with your specific case.
For the full example, I'll assume:
Input RDD has type RDD[(Int, (Int, Int))]
Expected output has the form RDD[Int], and would contain a lot of duplicates - if the original RDD has the "key" X multiple times, each match (in ._2._1) would appear once per occurrence of X as a key
If that's the case we're trying to solve - this join would solve it:
// Some sample data, assuming all ints
val rdd = sc.parallelize(Seq(
(1, (1, 5)),
(1, (2, 5)),
(2, (1, 5)),
(3, (4, 5))
))
// joining the original RDD with an RDD of the "values" -
// so the joined RDD will have "._2._1" as key
// then we get the keys only, because they equal the values anyway
val result: RDD[Int] = rdd.join(rdd.values).keys
// result contains one copy of the original key for every matching ._2._1 occurrence
println(result.collect.toList) // List(1, 1, 1, 1, 2)
I'm trying to perform an aggregation using mapGroups that returns a SparseMatrix as one of the columns, and then sum that column.
I created a case class schema for the mapped rows in order to provide column names. The matrix column is typed org.apache.spark.mllib.linalg.Matrix. If I don't run toDF before performing the aggregation (select(sum("mycolumn"))), I get one type mismatch error (required: org.apache.spark.sql.TypedColumn[MySchema,?]). If I include toDF, I get another type mismatch error: cannot resolve 'sum(mycolumn)' due to data type mismatch: function sum requires numeric types, not org.apache.spark.mllib.linalg.MatrixUDT. So what's the right way to do it?
It looks like you're struggling with at least two distinct problems here. Let's assume you have a Dataset like this:
import org.apache.spark.mllib.linalg.Matrices  // assumes spark.implicits._ is already imported for toDS
val ds = Seq(
("foo", Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))),
("foo", Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0)))
).toDS
Selecting a TypedColumn:
using implicit conversions with $:
ds.select($"_1".as[String])
using o.a.s.sql.functions.col:
ds.select(col("_1").as[String])
Adding matrices:
MLlib's Matrix and MatrixUDT don't implement addition. That means you won't be able to use the sum function or reduce with +.
You can use a third-party linear algebra library, but that is not supported in Spark SQL / Spark Dataset.
If you really want to do it with Datasets you can try to do something like this:
ds.groupByKey(_._1).mapGroups(
(key, values) => {
val matrices = values.map(_._2.toArray)
val first = matrices.next
val sum = matrices.foldLeft(first)(
(acc, m) => acc.zip(m).map { case (x, y) => x + y }
)
(key, sum)
})
and map back to matrices, but personally I would just convert to an RDD and use Breeze.
I am on my first week with Scala and struggling with the way the code is written in this language.
I am trying to write a function that determines the average number in a list and removes the values below that average. For example:
I have this list:
List[(String, Int)] = List(("A", 1), ("B", 1), ("C", 3), ("D", 2), ("E", 4))
The result should be 2.2.
So the function should also remove the entries ("A", 1), ("B", 1) and ("D", 2) because they are below the average.
Can anyone help me?
You can calculate the average of the second elements of the list of tuples. You don't need to do the sum yourself, because Scala has a built-in function for that. First we need to transform the list of tuples into a list of Int values; we can do that using the map function as shown below:
val average = list.map(_._2).sum / list.size.toDouble
Now that you have the average, you can filter your list based on it:
val newList = list.filter(_._2 >= average)
Note that we didn't remove anything from the original list; we created a new one with the filtered elements.
val average = list.map(_._2).sum / list.size.toDouble
list.filter(p => p._2 >= average)
You need to convert to Double, otherwise the division is an integer division and the average would be imprecise. The filter only keeps the elements greater than or equal to the average.
You can do:
val sum = list.map(_._2).sum
val avg: Double = sum / list.size.toDouble
val filtered = list.filter(_._2 > avg)
Note this is traversing the list twice, once for summing and once for filtering. Another thing to note is that Scala List[T] is immutable. When you filter, you're creating a new List object with the filtered data.