Error: Value min is not a member of (Int, Int) - scala

I am trying to produce an RDD of tuples whose first element is a country name and whose second element is the minimum of the two integers in the original tuple.
I have this code here.
val test = sc.parallelize(Array(("US", (4,2)), ("France", (1,2)), ("Italy", (2,3))))
I want to end up with a value that looks like this:
Array( ("US", 2), ("France", 1), ("Italy", 2) )
I tried to use this code, but it produced a 'Value min is not a member of (Int, Int)' error.
val test1 = test.map(x => (x._1, x._2.min))
How do I get the minimum of a Tuple2[Int, Int]?

To compute the minimum of numeric elements in a Tuple (x, y), you could use x min y:
val test = sc.parallelize(Array(("US", (4,2)), ("France", (1,2)), ("Italy", (2,3))))
test.map(t => (t._1, t._2._1 min t._2._2)).collect
// res1: Array[(String, Int)] = Array((US,2), (France,1), (Italy,2))
For readability, an alternative is to use a partial function with a case pattern, as follows:
test.map{ case (country, (t1, t2)) => (country, t1 min t2) }

Related

Create a SparseVector from the elements of RDD

Using Spark, I have a data structure val rdd of type RDD[((x: Int, y: Int), cov: Double)] in Scala, where each element of the RDD represents an element of a matrix, with x the row index, y the column index and cov the value of the element.
I need to create SparseVectors from the rows of this matrix. So I decided to first convert the rdd to RDD[(x: Int, (y: Int, cov: Double))] and then use groupByKey to put all elements of a specific row together, like this:
val rdd2 = rdd.map{case ((x,y),cov) => (x, (y, cov))}.groupByKey()
Now I need to create the SparseVectors:
val N = 7 //Vector Size
val spvec = {(x: Int,y: Iterable[(Int, Double)]) => new SparseVector(N.toLong, Array(y.map(el => el._1.toInt)), Array(y.map(el => el._2.toDouble)))}
val vecs = rdd2.map(spvec)
However, this is the error that pops up.
type mismatch; found :Iterable[Int] required:Int
type mismatch; found :Iterable[Double] required:Double
I am guessing that y.map(el => el._1.toInt) returns an Iterable, which wrapping in Array(...) does not turn into the Array the constructor expects. I would appreciate it if someone could help with how to do this.
The simplest solution is to convert to RowMatrix:
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
val rdd: RDD[((Int, Int), Double)] = ???
val vs: RDD[org.apache.spark.mllib.linalg.SparseVector] = new CoordinateMatrix(
  rdd.map {
    case ((x, y), cov) => MatrixEntry(x, y, cov)
  }
).toRowMatrix.rows.map(_.toSparse)
If you want to preserve row indices you can use toIndexedRowMatrix instead:
import org.apache.spark.mllib.linalg.distributed.IndexedRow
new CoordinateMatrix(
  rdd.map {
    case ((x, y), cov) => MatrixEntry(x, y, cov)
  }
).toIndexedRowMatrix.rows.map { case IndexedRow(i, vs) => (i, vs.toSparse) }
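If you would rather keep the groupByKey approach from the question, here is a minimal sketch that fixes the original type mismatch (assuming the fixed vector size N = 7 from the question, and unique column indices per row):
import org.apache.spark.mllib.linalg.SparseVector

val N = 7 // vector size, as in the question
val vecs = rdd
  .map { case ((x, y), cov) => (x, (y, cov)) }
  .groupByKey()
  .map { case (x, entries) =>
    // SparseVector wants plain arrays of indices and values, with indices in increasing order
    val sorted = entries.toArray.sortBy(_._1)
    (x, new SparseVector(N, sorted.map(_._1), sorted.map(_._2)))
  }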

ReduceLeft with Vector of pairs?

I have a Vector of Pairs
Vector((9,1), (16,2), (21,3), (24,4), (25,5), (24,6), (21,7), (16,8), (9,9), (0,10))
and I want to return the pair with the maximum first element.
I've tried it to do like this:
data reduceLeft[(Int, Int)]((y:(Int, Int),z:(Int,Int))=>y._1 max z._1)
and
data reduceLeft((y:(Int, Int),z:(Int,Int))=>y._1 max z._1)
but both give a type mismatch error, and I can't understand what is wrong with this code.
Why use reduceLeft?
The default max method works very well here:
scala> val v = Vector((9,1), (16,2), (21,3), (24,4), (25,5), (24,6), (21,7), (16,8), (9,9), (0,10))
v: scala.collection.immutable.Vector[(Int, Int)] = Vector((9,1), (16,2), (21,3), (24,4), (25,5), (24,6), (21,7), (16,8), (9,9), (0,10))
scala> v.max
res1: (Int, Int) = (25,5)
If you want reduceLeft instead:
v.reduceLeft( (x, y) => if (x._1 >= y._1) x else y )
Your error is that the function you pass has to return a tuple, not an Int:
y._1 max z._1
Here max applied to two Ints returns an Int, so the accumulator no longer has the expected type (Int, Int).
max works great in this example. However, if you are wondering how to do this using reduceLeft here it is:
val v = Vector((9,1), (16,2), (21,3), (24,4), (25,5), (24,6), (21,7), (16,8), (9,9), (0,10))
v.reduceLeft( ( x:(Int, Int), y:(Int,Int) ) => if(y._1 > x._1) y else x)
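Another concise option, if you only care about comparing the first elements, is maxBy:
v.maxBy(_._1)
// (Int, Int) = (25,5)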

Summing items within a Tuple

Below is a data structure, a List of tuples, of type List[(String, String, Int)]:
val data3 = (List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1)) )
//> data3 : List[(String, String, Int)] = List((id1,a,1), (id1,a,1), (id1,a,1),
//| (id2,a,1))
I'm attempting to count the occurrences of each Int value associated with each id. So the above data structure should be converted to List((id1,a,3), (id2,a,1)).
This is what I have come up with, but I'm unsure how to group similar items within a Tuple:
data3.map( { case (id,name,num) => (id , name , num + 1)})
//> res0: List[(String, String, Int)] = List((id1,a,2), (id1,a,2), (id1,a,2), (i
//| d2,a,2))
In practice data3 is a Spark RDD; I'm using a List here for local testing purposes, but the same solution should be compatible with an RDD.
Update: based on the following code provided by maasg:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
I needed to amend it slightly to get it into the format I expect, which is of type
RDD[(String, Seq[(String, Int)])]
corresponding to RDD[(id, Seq[(name, count-of-names)])]:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => ((id1),(id2,values.sum))}
val counted = result.groupByKey
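For reference, a rough sketch of what counted holds for the sample data (the exact Iterable implementation printed by collect can vary across Spark versions):
counted.collect()
// e.g. Array((id1,CompactBuffer((a,3))), (id2,CompactBuffer((a,1))))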
In Spark, you would do something like this: (using Spark Shell to illustrate)
val l = List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1))
val rdd = sc.parallelize(l)
val grouped = rdd.groupBy{case (id1,id2,v) => (id1,id2)}
val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}
Another option would be to map the rdd into a PairRDD and use groupByKey:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
Option 2 is a slightly better option when handling large sets, as it does not replicate the ids in the accumulated value.
This seems to work when I use scala-ide:
data3
  .groupBy(tupl => (tupl._1, tupl._2))
  .mapValues(v => (v.head._1, v.head._2, v.map(_._3).sum))
  .values.toList
And the result is the same as required by the question:
res0: List[(String, String, Int)] = List((id1,a,3), (id2,a,1))
You should look into List.groupBy.
You can use the id as the key, and then use the length of your values in the map (i.e. all the items sharing the same id) to know the count.
#vptheron has the right idea.
As can be seen in the docs
def groupBy[K](f: (A) ⇒ K): Map[K, List[A]]
Partitions this list into a map of lists according to some discriminator function.
Note: this method is not re-implemented by views. This means when applied to a view it will always force the view and return a new list.
K the type of keys returned by the discriminator function.
f the discriminator function.
returns
A map from keys to lists such that the following invariant holds:
(xs partition f)(k) = xs filter (x => f(x) == k)
That is, every key k is bound to a list of those elements x for which f(x) equals k.
So something like the function below, when used with groupBy, will give you a Map with the ids as keys.
(Sorry, I don't have access to a Scala compiler, so I can't test.)
def f(t: (String, String, Int)): String = t._1
Then you will have to iterate through the List for each id in the Map and sum up the number of integer occurrences. That is straightforward, but if you still need help, ask in the comments.
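A minimal sketch along those lines, using the plain List from the question rather than an RDD, could be:
data3
  .groupBy { case (id, _, _) => id }                                    // Map[String, List[(String, String, Int)]]
  .map { case (id, items) => (id, items.head._2, items.map(_._3).sum) } // keep the name, sum the counts
  .toList
// List((id1,a,3), (id2,a,1)) -- Map iteration order is not guaranteed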
The following is the most readable, efficient, and scalable approach:
data
  .map { case (key1, key2, value) => ((key1, key2), value) }
  .reduceByKey(_ + _)
which will give an RDD[((String, String), Int)]. By using reduceByKey the summation is parallelized, i.e. for very large groups it will be distributed and the summation will happen on the map side. Think about the case where there are only 10 groups but billions of records: using .sum won't scale, as it will only be able to distribute across 10 cores.
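For completeness, a sketch of the full pipeline in the Spark shell, reusing the sample data from the question (the order of the collected results is not guaranteed):
val rdd = sc.parallelize(List(("id1", "a", 1), ("id1", "a", 1), ("id1", "a", 1), ("id2", "a", 1)))
val summed = rdd
  .map { case (id, name, n) => ((id, name), n) } // key by (id, name)
  .reduceByKey(_ + _)                            // partial sums are combined map-side
  .map { case ((id, name), total) => (id, name, total) }
summed.collect()
// Array((id1,a,3), (id2,a,1))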
A few more notes about the other answers:
Using head here is unnecessary: instead of .mapValues(v => (v.head._1, v.head._2, v.map(_._3).sum)) you can reuse the grouping key, which already carries those values, e.g. .map { case ((k1, k2), v) => (k1, k2, v.map(_._3).sum) }
Using a foldLeft here is really horrible when the above shows that .map(_._3).sum will do: val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}

Scala Map Integer to (Integer => Integer)

How can I map over a List from 1 to 100 (inclusive) to turn each item into a Function1 that partially applies multiplication by itself, i.e. (itself * _)?
I tried this:
scala> val xs: List[Integer => Integer] = List.range(1,101).map { x => _ * x }
<console>:13: error: missing parameter type for expanded function ((x$1) =>
x$1.$times(x))
val xs: List[Integer => Integer] = List.range(1,101).map { x => _ * x }
Desired output:
val xs: List[Integer] = List[1,2,3,4, .., 100]
val desired: List[Integer => Integer] = List[(*1), (*2), ...]
Then, this would be expected too:
desired.get(0).apply(2) = 2 // 1 * 2
The compiler is telling you exactly what you need to fix here: "Missing parameter type". Change _ * x to (_: Integer) * x.
Incidentally, is there some reason you're using java.lang.Integer here? You probably mean scala.Int:
val fns: List[Int => Int] =
  List.range(1, 101).map { x => (_: Int) * x }
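A quick REPL check (note that Scala Lists are indexed with apply rather than get):
fns(0)(2) // 2, i.e. 1 * 2
fns(9)(3) // 30, i.e. 10 * 3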
Explanation:
The compiler can't infer the type of _.
That is because there is no need for _ to be an Integer just because you used a List of Integers to create these functions.
In other words, at this point it is not clear what type should be bound to the free parameter of the partially applied function.
As stated before, this solution works for your case if the inputs are Ints:
(1 to 100) map ( x => x * (_:Int) )
An example:
val functions = (1 to 100) map ( x => x * (_:Int) )
This will work
functions map (_(2))
res13: scala.collection.immutable.IndexedSeq[Int] = Vector(2, 4 ...
This would fail:
functions map (_(2.5))
<console>:9: error: type mismatch;
found : Double(2.5)
required: Int
functions map (_(2.5))

Map a variable of type of Pair -- impossible

This does not seem logical to me:
scala> val a = Map((1, "111"), (2, "222"))
a: scala.collection.immutable.Map[Int,String] = Map(1 -> 111, 2 -> 222)
scala> val b = a.map((key, value) => value)
<console>:8: error: wrong number of parameters; expected = 1
val b = a.map((key, value) => value)
^
scala> val c = a.map(x => x._2)
c: scala.collection.immutable.Iterable[String] = List(111, 222)
I know that I can say val d = a.map({ case(key, value) => value })
But why isn't it possible to say a.map((key, value) => value)? There is only one argument there, of type Tuple2[Int, String], i.e. a Pair of Int and String. What's the difference between a.map((key, value) => value) and a.map(x => x._2)?
UPDATE:
val myTuple2 = (1, 2) -- this is one variable, correct?
for ( (k, v) <- a ) yield v -- (k, v) is also only one variable, correct?
map((key, value) => value) -- 2 variables. weird.
So how do I specify a variable of type Tuple2 (or any other type) in map without using case?
UPDATE2:
What's wrong with that?
Map((1, "111"), (2, "222")).map( ((x,y):Tuple2[Int, String]) => y) -- wrong
Map((1, "111"), (2, "222")).map( ((x):Tuple2[Int, String]) => x._2) -- ok
Okay, you're still not convinced. In cases like this it is pretty reasonable to fall back to the source of truth (well, kinda): the Holy Specification (a.k.a. the Scala Language Specification).
So, in an anonymous function, parameters are treated on an individual basis, not as a whole tuple (and that is pretty sensible; otherwise, how would you write an anonymous function with 2, ..., n parameters?).
At the same time
val x = (1, 2)
is a single item of type Tuple2[Int, Int] (if you're interested, you can find the corresponding section of the spec as well).
for ( (k, v) <- a ) yield v
In this case you have one variable unpacked to two variables. It is similar to
val x = (1, 2) // one variable -- tuple
val (y,z) = x // two integer variables unpacked from one
Some call this a destructuring assignment; it is a particular case of pattern matching. And you've already provided another example of pattern matching in action:
a.map({ case(key, value) => value })
We can read this as: map accepts a function produced by a partial-function literal, which enables the use of pattern matching.
You're basically asking the same question as:
Scala - can a lambda parameter match a tuple?
You've already listed most of the options they listed there, including the accepted answer of using a PartialFunction.
However, since you're using your lambda in a map function, you could use a for comprehension instead:
for ( (k, v) <- a ) yield v
Alternatively, you can use the Function2.tupled method to fix your lambda's type:
scala> val a = Map((1, "111"), (2, "222"))
a: scala.collection.immutable.Map[Int,String] = Map(1 -> 111, 2 -> 222)
scala> a.map( ((k:Int,v:String) => v).tupled )
res1: scala.collection.immutable.Iterable[String] = List(111, 222)
To answer your question in your thread with om-nom-nom above, look at this output:
scala> ( (x:Int,y:String) => y ).getClass.getSuperclass
res0: Class[?0] forSome { type ?0 >: ?0; type ?0 <: (Int, String) => String } = class scala.runtime.AbstractFunction2
Notice that the superclass of the anonymous function (x:Int,y:String) => y is Function2[Int, String, String], not Function1[(Int, String), String].
You can use pattern matching (or a partial function, which in this instance is the same thing); notice the curly braces:
val b = a.map{ case (key, value) => value }
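In the REPL this gives the same result as the x._2 version:
scala> val b = a.map{ case (key, value) => value }
b: scala.collection.immutable.Iterable[String] = List(111, 222)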