Scala Spark reduceByKey use custom function

I want to use reduceByKey, but when I try to use it, it shows an error:
type mismatch; required: Nothing
Question: how can I create a custom function for reduceByKey?
{(key, value)}
key: String
value: Map
Example:
rdd = {("a", "weight"->1), ("a", "weight"->2)}
expected: {("a" -> 3)}
def combine(x: mutable.map[string,Int],y:mutable.map[string,Int]):mutable.map[String,Int]={
x.weight = x.weithg+y.weight
x
}
rdd.reducebykey((x,y)=>combine(x,y))

Let's say you have an RDD[(K, V)] (or PairRDD[K, V], to be more accurate) and you want to somehow combine values with the same key. Then you can use reduceByKey, which expects a function (V, V) => V and gives you the modified RDD[(K, V)] (or PairRDD[K, V]).
Here, your rdd = {("a", "weight"->1), ("a", "weight"->2)} is not real Scala, and similarly the whole combine function is wrong both syntactically and logically (it will not compile). But I am guessing that what you have is something like the following,
val rdd = sc.parallelize(List(
("a", "weight"->1),
("a", "weight"->2)
))
Which means that your rdd is of type RDD[(String, (String, Int))] or PairRDD[String, (String, Int)] which means that reduceByKey wants a function of type ((String, Int), (String, Int)) => (String, Int).
def combine(x: (String, Int), y: (String, Int)): (String, Int) =
  (x._1, x._2 + y._2)
val rdd2 = rdd.reduceByKey(combine)
If your problem is something else then please update the question to share your problem with real code, so that others can actually understand it.
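If the values really are mutable maps, as the pseudocode suggests, here is a minimal sketch of a combine function that type-checks (assuming an existing SparkContext sc and that every map carries a "weight" key):
import scala.collection.mutable
val weights = sc.parallelize(List(
  ("a", mutable.Map("weight" -> 1)),
  ("a", mutable.Map("weight" -> 2))
))
def combineMaps(x: mutable.Map[String, Int],
                y: mutable.Map[String, Int]): mutable.Map[String, Int] = {
  // add y's weight into x and reuse x as the combined value,
  // mirroring the mutation in the question's code
  x("weight") = x("weight") + y("weight")
  x
}
weights.reduceByKey(combineMaps).collect()
// Array((a, Map(weight -> 3)))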

type mismatch in scala while using reduceByKey

I have separately tested my error code in the Scala shell:
scala> val p6 = sc.parallelize(List( ("a","b"),("b","c")))
p6: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[10] at parallelize at <console>:24
scala> val p7 = p6.map(a => ((a._1+a._2), (a._1, a._2, 1)))
p7: org.apache.spark.rdd.RDD[(String, (String, String, Int))] = MapPartitionsRDD[11] at map at <console>:26
scala> val p8 = p7.reduceByKey( (a,b) => (a._1,(a._2, a._3+b._3)))
<console>:28: error: type mismatch;
found : (String, (String, Int))
required: (String, String, Int)
val p8 = p7.reduceByKey( (a,b) => (a._1,(a._2, a._3+b._3)))
I want to use a._1 as the key so that I can use the join operator later, and it is required to be (key, value) pairs. But my question is: why is there a required type when I am writing the reduce function myself? I think the format is set by us rather than being something regulated. Am I wrong?
Also, if I am wrong, why is (String, String, Int) required? Why is it not something else?
PS: I know (String, String, Int) is the value type in ((a._1+a._2), (a._1, a._2, 1)), which comes from the map function, but the official example shows that the reduce function (a, b) => (a._1 + b._1, a._2 + b._2) is valid. And I think all of these, including my code above, should be valid.
Take a look at the types. reduceByKey is a method on RDD[(K, V)] with the signature:
def reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]
In other words, both input arguments and the return value have to be of the same type.
In your case p7 is
RDD[(String, (String, String, Int))]
where K is String and V is (String, String, Int), so the function used with reduceByKey must be
((String, String, Int), (String, String, Int)) => (String, String, Int)
A valid function would be:
p7.reduceByKey( (a,b) => (a._1, a._2, a._3 + b._3))
which would give you
(bc,(b,c,1))
(ab,(a,b,1))
as a result.
If you want to change the value type in a *ByKey method, you have to use aggregateByKey or combineByKey instead.
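For instance, here is a sketch with aggregateByKey (keeping only the count is just an illustration): its accumulator type may differ from the value type, so you can collapse the (String, String, Int) values down to an Int:
val p9 = p7.aggregateByKey(0)(
  (acc, v) => acc + v._3, // fold one (String, String, Int) value into the Int accumulator
  (a, b) => a + b         // merge two partial accumulators
)
// p9: org.apache.spark.rdd.RDD[(String, Int)]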
Your p7 is of type org.apache.spark.rdd.RDD[(String, (String, String, Int))], but in your reduceByKey you have used (a._1,(a._2, a._3+b._3)), which is of type (String, (String, Int)).
The output type of p8 should also be p8: org.apache.spark.rdd.RDD[(String, (String, String, Int))], so defining it like the following should work for you:
val p8 = p7.reduceByKey( (a,b) => (a._1, a._2, a._3+b._3))
You can read my answer in pyspark for more detail on how reduceByKey works, and this one should help too.

How does flatMap work on Maps in Scala? map can be used as mapValues on Maps, but how does the flatMap function work on Map objects?

I am not able to understand how the flatMap function works on Map objects.
You use flatMap if you want to flatten the result of your map function.
keep in mind:
flatMap(something)
is identical to
map(something).flatten
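A quick sketch of that identity on a Map (pre-2.13 collections, which the examples below also use):
val m = Map("a" -> List(1, 2), "b" -> List(3))
m.flatMap(_._2)     // List(1, 2, 3), statically an Iterable[Int]
m.map(_._2).flatten // List(1, 2, 3), the same result in two steps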
I think it is a good question, because a Map cannot be flattened like other collections. First of all, we should look at the signature of this method:
def flatMap[B](f: (A) ⇒ GenTraversableOnce[B]): Map[B]
So the documentation says that it should return a Map, but that is not true: it can return any GenTraversableOnce, not just a Map. We can see this in the provided examples:
def getWords(lines: Seq[String]): Seq[String] = lines flatMap (line => line split "\\W+")
// lettersOf will return a Seq[Char] of likely repeated letters, instead of a Set
def lettersOf(words: Seq[String]) = words flatMap (word => word.toSet)
// lettersOf will return a Set[Char], not a Seq
def lettersOf(words: Seq[String]) = words.toSet flatMap (word => word.toSeq)
// xs will be an Iterable[Int]
val xs = Map("a" -> List(11,111), "b" -> List(22,222)).flatMap(_._2)
// ys will be a Map[Int, Int]
val ys = Map("a" -> List(1 -> 11,1 -> 111), "b" -> List(2 -> 22,2 -> 222)).flatMap(_._2)
So let's look at the full signature:
def flatMap[B, That](f: ((K, V)) ⇒ GenTraversableOnce[B])(implicit bf: CanBuildFrom[Map[K, V], B, That]): That
Now we see that it returns That, something the implicit CanBuildFrom can provide for us.
You can find many explanations of how CanBuildFrom works.
But the main idea is: you pass a function from your key -> value pair to some GenTraversableOnce (it can be a Map, a Seq or even an Option), and it will be mapped and flattened. You can also provide your own CanBuildFrom, as sketched below.
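For instance (a sketch, pre-2.13 collections), collection.breakOut supplies an explicit CanBuildFrom that steers the result into the type you ask for:
val pairs: List[(Int, Int)] =
  Map("a" -> List(1 -> 11), "b" -> List(2 -> 22)).flatMap(_._2)(collection.breakOut)
// List((1,11), (2,22)); the default CanBuildFrom would have built a Map here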
If a value or a key in the Map holds a list, then it can be flatMapped.
Example:
val a = Map(1 -> List(1,2), 2 -> List(2,3))
a.map(_._2) gives List(List(1,2), List(2,3)).
You can flatten this using flatMap: either a.flatMap(_._2) or a.map(_._2).flatten gives List(1,2,2,3).
src: http://www.scala-lang.org/old/node/12158.html
Not sure about any other way of using flatMap on a Map though.

Inferring generic type parameters in Scala

Hi, I've been trying to unify a collection of nested maps.
So I want to implement a method with signature:
def unifyMaps(seq: Seq[Map[String, Map[WordType, Int]]]): Map[String, Map[WordType, Int]]
(WordType is a Java Enum.) The first approach was to do manual map-merging.
def unifyMapsManually(seq: IndexedSeq[Map[String, Map[WordType, Int]]]): Map[String, Map[WordType, Int]] = {
  seq reduce { (acc, newMap) =>
    acc ++ newMap.map { case (k, v) =>
      val nestedMap = acc.getOrElse(k, Map.empty)
      k -> (nestedMap ++ v.map { case (k2, v2) => k2 -> (nestedMap.getOrElse(k2, 0) + v2) })
    }
  }
}
It works, but what I'm doing here is recursively applying the exact same pattern, so I thought I'd make a recursive-generic version.
Second approach:
def unifyTwoMapsRecursively(m1: Map[String, Map[WordType, Int]], m2: Map[String, Map[WordType, Int]]): Map[String, Map[WordType, Int]] = {
  def unifyTwoMaps[K, V](nestedMapOps: (V, (V, V) => V))(m1: Map[K, V], m2: Map[K, V]): Map[K, V] = {
    nestedMapOps match {
      case (zero, add) =>
        m1 ++ m2.map { case (k, v) => k -> add(m1.getOrElse(k, zero), v) }
    }
  }
  val intOps = (0, (a: Int, b: Int) => a + b)
  val mapOps = (Map.empty[WordType, Int], unifyTwoMaps(intOps) _)
  unifyTwoMaps(mapOps)(m1, m2)
}
But it fails with:
Error:(90, 18) type mismatch;
found : (scala.collection.immutable.Map[pjn.wierzba.DictionaryCLP.WordType,Int], (Map[Nothing,Int], Map[Nothing,Int]) => Map[Nothing,Int])
required: (scala.collection.immutable.Map[_ <: pjn.wierzba.DictionaryCLP.WordType, Int], (scala.collection.immutable.Map[_ <: pjn.wierzba.DictionaryCLP.WordType, Int], scala.collection.immutable.Map[_ <: pjn.wierzba.DictionaryCLP.WordType, Int]) => scala.collection.immutable.Map[_ <: pjn.wierzba.DictionaryCLP.WordType, Int])
unifyTwoMaps(mapOps)(m1, m2)
^
So OK, I have no idea where the upper bound on the map key comes from, but the curried function is clearly not inferred correctly. I had a similar error with intOps, so I tried to provide exact types:
val mapOps = (Map.empty[WordType, Int], unifyTwoMaps(intOps)(_: Map[String, Map[WordType, Int]], _: Map[String, Map[WordType, Int]]))
But this time it fails with:
Error:(89, 67) type mismatch;
found : Map[String,Map[pjn.wierzba.DictionaryCLP.WordType,Int]]
required: Map[?,Int]
val mapOps = (Map.empty[WordType, Int], unifyTwoMaps(intOps)(_: Map[String, Map[WordType, Int]], _: Map[String, Map[WordType, Int]]))
^
And this time I have absolutely no idea what to try next to get it working.
EDIT: I've found a solution to my problem, but I'm still wondering why I get the type mismatch error in this code snippet:
val mapOps = (Map.empty[WordType, Int], unifyTwoMaps(intOps) _)
According to this answer, Scala type inference works per parameter list; this is exactly what I've been relying on here for currying purposes. My unifyTwoMaps function takes two parameter lists, and I'm trying to infer just the second one.
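A minimal toy sketch of what I believe is going on: when you eta-expand with _, any type parameter that is not constrained by the parameter lists applied so far is fixed to its lower bound, Nothing, on the spot:
def pair[A, B](a: A)(b: B): (A, B) = (a, b)
val f = pair(1) _ // f: Nothing => (Int, Nothing); B was never constrained
That would explain the Map[Nothing, Int] pieces in the first error: unifyTwoMaps(intOps) _ fixes V to Int but leaves K unconstrained.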
Solution to the generic-recursive approach
OK, so after spending the morning on it, I've finally understood that I've been providing the wrong exact types.
val mapOps = (Map.empty[WordType, Int], unifyTwoMaps(intOps)(_: Map[String, Map[WordType, Int]], _: Map[String, Map[WordType, Int]]))
Should've been
val mapOps = (Map.empty[WordType, Int], unifyTwoMaps(intOps)(_: Map[WordType, Int], _: Map[WordType, Int]))
Because I needed to pass the type of the map's V, which is Map[WordType, Int], and not the type of the whole outer map. And now it works!
Solution to the underlying problem of nested map merging
Well, abstracting over the nested maps' zero and add should ring a bell: I've been reinventing Monoid. So I thought I'd try the Scalaz |+| Semigroup operator solution from this answer.
import scalaz.Scalaz._
def unifyMapsWithScalaz(seq: Seq[Map[String, Map[WordType, Int]]]): Map[String, Map[WordType, Int]] = {
  seq reduce (_ |+| _)
}
And it works!
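A quick usage sketch (made-up data, assuming the enum has values NOUN and VERB):
val maps = Seq(
  Map("dog" -> Map(WordType.NOUN -> 1)),
  Map("dog" -> Map(WordType.NOUN -> 2, WordType.VERB -> 1)),
  Map("cat" -> Map(WordType.NOUN -> 5))
)
unifyMapsWithScalaz(maps)
// Map(dog -> Map(NOUN -> 3, VERB -> 1), cat -> Map(NOUN -> 5))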
What's interesting is that I had already seen that post before trying my solution, but I wasn't sure it would work for a nested data structure, especially with my map's keys being a Java enum. I thought I'd have to provide some custom implementation of the Semigroup typeclass.
But as it turned out during my reinventing-the-wheel implementation, the enum is only needed as a passed type and map key, and it works pretty well. Well done, Scalaz!
Well, that would've made a good blog post actually...
EDIT: But I still don't understand why I had this type inference problem in the first place; I've updated the question.

Higher order operations with flattened tuples in scala

I've recently come across a problem. I'm trying to flatten "tail-nested" tuples in a compiler-friendly way, and I've come up with the code below:
implicit def FS[T](x: T): List[T] = List(x)
implicit def flatten[T,V](x: (T,V))(implicit ft: T=>List[T], fv: V=>List[T]) =
ft(x._1) ++ fv(x._2)
The code above works well for flattening the tuples I am calling "tail-nested", like the ones below.
flatten((1,2)) -> List(1,2)
flatten((1,(2,3))) -> List(1,2,3)
flatten((1,(2,(3,4)))) -> List(1,2,3,4)
However, I seek to make my solution more robust. Consider a case where I have a list of these higher-kinded "tail-nested" tuples.
val l = List( (1,2), (1,(2,3)), (1,(2,(3,4))) )
The inferred type signature of this would be List[(Int, Any)] and this poses a problem for an operation such as map, which would fail with:
error: No implicit view available from Any => List[Int]
This error makes sense to me because of the nature of my recursive implicit chain in the flatten function. However, I was wondering: is there any way I can make my method of flattening the tuples more robust so that higher order functions such as map mesh well with it?
EDIT:
As Bask.ws pointed out, the Product trait offers the potential for a nice solution. The code below illustrates this:
def flatten(p: Product): List[_] = p.productIterator.toList.flatMap { x =>
  x match {
    case pr: Product => flatten(pr)
    case _ => List(x)
  }
}
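For example, mapping it over the list from above (a sketch; this is the output I would expect):
l.map(flatten)
// List(List(1, 2), List(1, 2, 3), List(1, 2, 3, 4))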
The result type of this new flatten call is always List[Any]. My problem would be solved if there was a way to have the compiler tighten this bound a bit. In parallel to my original question, does anyone know if it is possible to accomplish this?
UPD Compile-time fail solution added
I have one solution that may suit you. The types of your first three examples are resolved at compile time: Int, Tuple2[Int, Int], Tuple2[Int, Tuple2[Int, Int]]. For your example with the list, you have a heterogeneous list with the actual type List[(Int, Any)], and you have to resolve the second type at runtime (or maybe it can be done with a macro). So you may actually want to write implicit def flatten[T](x: (T, Any)), as your error advises you.
Here is a quick solution. It gives a couple of warnings, but it works nicely:
implicit def FS[T](x: T): List[T] = List(x)
implicit def FP[T](x: Product): List[T] = {
  val res = (0 until x.productArity).map(i => x.productElement(i) match {
    case p: Product => FP[T](p)
    case e: T => FS(e)
    case _ => sys.error("incorrect element")
  })
  res.toList.flatten
}
implicit def flatten[T](x: (T, Any))(implicit ft: T => List[T], fp: Product => List[T]) =
  ft(x._1) ++ (x._2 match {
    case p: Product => fp(p)
    case t: T => ft(t)
  })
val l = List( (1,2), (1,(2,3)), (1,(2,(3,4))) )
scala> l.map(_.flatten)
res0: List[List[Int]] = List(List(1, 2), List(1, 2, 3), List(1, 2, 3, 4))
UPD
I have researched the problem a little bit more, and I have found a simple solution for making a homogeneous list, one that can fail at compile time. It is fully typed, without Any or match, and it looks like the compiler now correctly resolves the nested implicits:
case class InfiniteTuple[T](head: T, tail: Option[InfiniteTuple[T]] = None) {
  def flatten: List[T] = head +: tail.map(_.flatten).getOrElse(Nil)
}
implicit def toInfiniteTuple[T](x: T): InfiniteTuple[T] = InfiniteTuple(x)
implicit def toInfiniteTuple2[T, V](x: (T, V))(implicit ft: V => InfiniteTuple[T]): InfiniteTuple[T] =
  InfiniteTuple(x._1, Some(ft(x._2)))
def l: List[InfiniteTuple[Int]] = List( (1,2), (1,(2,3)), (1,(2,(3,4)))) //OK
def c: List[InfiniteTuple[Int]] = List( (1,2), (1,(2,3)), (1,(2,(3,"44"))))
//Compile-time error
//<console>:11: error: No implicit view available from (Int, (Int, java.lang.String)) => InfiniteTuple[Int]
Then you can implement any flatten you want. For example, the one above:
scala> l.map(_.flatten)
res0: List[List[Int]] = List(List(1, 2), List(1, 2, 3), List(1, 2, 3, 4))

Convert a Map of Tuple into a Map of Sets

In my DAO I receive (String, String) tuples, of which _1 is non-unique and _2 is unique. I groupBy on _1 to get this:
val someCache : Map[String, List[(String, String)]]
This is obviously wasteful, since _1 is repeated for all values of the Map. Since _2 is unique, what I want is something like:
val someCache : Map[String, Set[String]]
i.e. group by _1 and use it as the key, with the paired _2s as the value, of type Set[String].
def foo(ts: Seq[(String, String)]): Map[String, Set[String]] = {
  ts.foldLeft(Map[String, Set[String]]()) { (agg, t) =>
    agg + (t._1 -> (agg.getOrElse(t._1, Set()) + t._2))
  }
}
scala> foo(List(("1","2"),("1","3"),("2","3")))
res4: Map[String,Set[String]] = Map(1 -> Set(2, 3), 2 -> Set(3))
A straightforward solution is to map over all elements and convert each list to a set:
someCache.map{ case (a, l) => a -> l.map{ _._2 }.toSet }
You could also use mapValues, but note that it creates a lazy view and performs the transformation on every access to a value.
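For instance (a sketch; .map(identity) is the usual idiom for forcing the pre-2.13 mapValues view into a strict map):
val someCache = Map("1" -> List(("1","2"), ("1","3")), "2" -> List(("2","3")))
someCache.map { case (a, l) => a -> l.map(_._2).toSet } // Map(1 -> Set(2, 3), 2 -> Set(3))
someCache.mapValues(_.map(_._2).toSet).map(identity)    // same result; map(identity) forces the lazy view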