Spark: aggregateByKey into a pair of lists - scala

I have a keyed set of records that contain book id as well as reader id fields.
case class Book(book: Int, reader: Int)
How can I use aggregateByKey to combine all records with the same key into one record of the following format:
(key: Int, (books: List[Int], readers: List[Int]))
where books is a list of all books and readers is a list of all readers from records with the given key?
My code (below) results in compilation errors:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkContext, SparkConf}
object Aggr {
case class Book(book: Int, reader: Int)
val bookArray = Array(
(2,Book(book = 1, reader = 700)),
(3,Book(book = 2, reader = 710)),
(4,Book(book = 3, reader = 710)),
(2,Book(book = 8, reader = 710)),
(3,Book(book = 1, reader = 720)),
(4,Book(book = 2, reader = 720)),
(4,Book(book = 8, reader = 720)),
(3,Book(book = 3, reader = 730)),
(4,Book(book = 8, reader = 740))
)
def main(args: Array[String]) {
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
// set up environment
val conf = new SparkConf()
.setMaster("local[5]")
.setAppName("Aggr")
.set("spark.executor.memory", "2g")
val sc = new SparkContext(conf)
val books = sc.parallelize(bookArray)
val aggr = books.aggregateByKey((List()[Int], List()[Int]))
({case
((bookList:List[Int],readerList:List[Int]), Book(book, reader)) =>
(bookList ++ List(book), readerList ++ List(reader))
},
{case ((bookLst1:List[Int], readerLst1:List[Int]),
(bookLst2:List[Int], readerLst2:List[Int])
) => (bookLst1 ++ bookLst2, readerLst1 ++ readerLst2) })
}
}
Errors:
Error:(36, 44) object Nil does not take type parameters.
val aggr = books.aggregateByKey((List()[Int], List()[Int]))
Error:(37, 6) missing parameter type for expanded function The argument types of an anonymous function must be fully known. (SLS 8.5) Expected type was: ?
({case
^
^
Update
When I initialize the accumulator with (List(0), List(0)), everything compiles, but extra zeros are inserted into the result:
val aggr : RDD[(Int, (List[Int], List[Int]))] = books.aggregateByKey((List(0), List(0))) (
{case
((bookList:List[Int],readerList:List[Int]), Book(book, reader)) =>
(bookList ++ List(book), readerList ++ List(reader))
},
{case ((bookLst1:List[Int], readerLst1:List[Int]),
(bookLst2:List[Int], readerLst2:List[Int])
) => (bookLst1 ++ bookLst2, readerLst1 ++ readerLst2) }
)
This results in the following output:
(2,(List(0, 1, 0, 8),List(0, 700, 0, 710)))
(3,(List(0, 2, 0, 1, 0, 3),List(0, 710, 0, 720, 0, 730)))
(4,(List(0, 3, 0, 2, 8, 0, 8),List(0, 710, 0, 720, 720, 0, 740)))
If I could use empty lists as initializers instead of lists with zeros, there would be no extra zeros and the lists would concatenate cleanly.
Can somebody please explain why the empty-list initializer (List(), List()) results in an error while (List(0), List(0)) compiles? Is it a Scala bug or a feature?

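Your approach is fine; the problem is only syntax. Two things need to change: the type parameter goes before the parentheses (List[Int](), not List()[Int]), and the opening parenthesis of the second argument list has to sit on the same line as the aggregateByKey call, otherwise the compiler treats the next line as a separate expression. So change this: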
val aggr = books.aggregateByKey((List()[Int], List()[Int]))
({case
Into this:
val aggr = books.aggregateByKey((List[Int](), List[Int]())) (
{case
These links might shed some light why this didn't work for you:
What are the precise rules for when you can omit parenthesis, dots, braces, = (functions), etc.? (first answer)
http://docs.scala-lang.org/style/method-invocation.html#suffix-notation

Answering your update: you misplaced the type parameter for your lists. If you had written List[Int]() instead of List()[Int], everything would have worked. The compiler error message does describe the problem, but it is not easy to understand. By putting [Int] at the end, you are passing a type parameter to the result of the List() call. The result of List() is Nil, a singleton object representing the empty list, and it does not take type parameters.
As for why List(0) also works: Scala performs type inference where it can. You supplied one element, 0, which is an integer, so the list was inferred to be a List[Int]. Note, however, that this is not an empty list but a list containing a single zero. You probably want List[Int]() instead.
Just using List() doesn't work because Scala cannot infer the element type of the empty list here.
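The point about typing the empty zero value can be tried without a Spark cluster. The following is a minimal sketch that uses plain Scala collections as a stand-in for the RDD; the names AggrSketch, records, zero and seqOp are made up for illustration. The zero value and the seqOp function have the same shape as the first two arguments of aggregateByKey, so they should carry over to the Spark version.

```scala
// Plain Scala collections stand in for the RDD here; no Spark is required.
case class Book(book: Int, reader: Int)

object AggrSketch {
  // A shortened version of the sample data from the question.
  val records: List[(Int, Book)] = List(
    (2, Book(book = 1, reader = 700)),
    (3, Book(book = 2, reader = 710)),
    (2, Book(book = 8, reader = 710))
  )

  // The empty zero value states its element type explicitly.
  // List[Int]() would work equally well; List()[Int] would not compile.
  val zero: (List[Int], List[Int]) = (List.empty[Int], List.empty[Int])

  // Same shape as the seqOp passed to aggregateByKey.
  def seqOp(acc: (List[Int], List[Int]), b: Book): (List[Int], List[Int]) =
    (acc._1 :+ b.book, acc._2 :+ b.reader)

  // groupBy plus foldLeft plays the role of aggregateByKey on a single partition.
  val aggregated: Map[Int, (List[Int], List[Int])] =
    records.groupBy(_._1).map { case (key, pairs) =>
      key -> pairs.map(_._2).foldLeft(zero)(seqOp)
    }
}
```

With an empty zero value, no placeholder elements end up in the result lists.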

Related

apply/get methods in Scala

If we go by the definition in "Programming in Scala" book:
When you apply parentheses surrounding one or more values to a
variable, Scala will transform the code into an invocation of a method
named apply on that variable
Then what about accessing the elements of an array? For example, is x(0) transformed to x.apply(0) (assuming x is an array)? I tried executing that line and it threw an error. I also tried x.get(0), which threw an error too.
Can anyone please help?
Yes, () implies apply().
Array example:
scala> val data = Array(1, 1, 2, 3, 5, 8)
data: Array[Int] = Array(1, 1, 2, 3, 5, 8)
scala> data.apply(0)
res0: Int = 1
scala> data(0)
res1: Int = 1
Unrelated, but a safer alternative is lift, which returns an Option:
scala> data.lift(0)
res4: Option[Int] = Some(1)
scala> data.lift(100)
res5: Option[Int] = None
Note: scala.Array can be mutated:
scala> data(0) = 100
scala> data
res7: Array[Int] = Array(100, 1, 2, 3, 5, 8)
You cannot use apply for this; think of apply as a getter, not a mutator:
scala> data.apply(0) = 100
<console>:13: error: missing argument list for method apply in class Array
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `apply _` or `apply(_)` instead of `apply`.
data.apply(0) = 100
^
Use .update if you want to mutate:
scala> data.update(0, 200)
scala> data
res11: Array[Int] = Array(200, 1, 2, 3, 5, 8)
User-defined apply method:
scala> object Test {
|
| case class User(name: String, password: String)
|
| object User {
| def apply(): User = User("updupd", "password")
| }
|
| }
defined object Test
scala> Test.User()
res2: Test.User = User(updupd,password)
If you add an apply method to an object, you can apply that object (just as you can apply functions).
To do that, apply the object as if it were a function, directly with () and without a dot.
val array:Array[Int] = Array(1,2,3,4)
array(0) == array.apply(0)
For
x(1)=200
which you mention in the comment, the answer is different. It also gets translated to a method call, but not to apply; instead it's
x.update(1, 200)
Just like apply, this will work with any type which defines a suitable update method.
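As a minimal sketch of that last point, here is a user-defined type that supports both forms. The class Scores is hypothetical and exists only for this illustration.

```scala
// Hypothetical class for illustration; it is not part of the standard library.
class Scores(size: Int) {
  private val values = new Array[Int](size)

  // Called for the read form: scores(i)
  def apply(i: Int): Int = values(i)

  // Called for the assignment form: scores(i) = v
  def update(i: Int, v: Int): Unit = { values(i) = v }
}

object ApplyUpdateSketch {
  val scores = new Scores(3)

  scores(1) = 200            // rewritten by the compiler to scores.update(1, 200)
  val read: Int = scores(1)  // rewritten by the compiler to scores.apply(1)
}
```

Neither method is special apart from its name; the compiler picks it based on whether the parenthesized expression appears on the left of an assignment.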

scala: list as input and output of function

I am new to scala. I have a very simple problem.
Given a list in python
x=[1, 100, "a1", "b1"]
I can write a function that will return the last two elements
def f(w):
if w[0]>=1 and w[1]<=100:
return ([w[2],w[3]])
How do I do the equivalent in scala
val v= List(1, 100, "a1", "b1")
def g(L:List[Any]): List[String] = {
if( L(0)>=1 & L(1)<=100 ) {return List(L(2), L(3))}
}
val w=g(v)
This gets me the error
List[Any] = List(1, 100, a, b)
Incomplete expression
You can't get a List[String] from a List[Any]. (Well, you can, but it's a really bad thing to do.)
Don't, don't, don't create a List[Any]. Unlike Python, Scala is a statically typed language, which means the compiler keeps track of the type of every variable and every collection. When the compiler loses track of the element type, the list becomes a List[Any], and you lose all the help the compiler offers for writing programs that don't crash.
To mix types in a collection you can use tuples. Here's the type-safe Scala way to write your g() method.
def g(tup: (Int,Int,String,String)): List[String] =
if (tup._1 >= 1 && tup._2 <= 100) List(tup._3, tup._4)
else List()
Usage:
val v = (1, 100, "a1", "b1")
val w = g(v) //w: List[String] = List(a1, b1)
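If a List[Any] really cannot be avoided (for example, because the data arrives that way from elsewhere), pattern matching can recover the types without casts. This is only a sketch under that assumption; the object name ListAnySketch is made up, and the tuple version above remains the safer design.

```scala
object ListAnySketch {
  // Typed patterns check each element at runtime instead of casting blindly.
  def g(l: List[Any]): List[String] = l match {
    case (lo: Int) :: (hi: Int) :: rest if lo >= 1 && hi <= 100 =>
      // Keep only the remaining elements that actually are Strings.
      rest.collect { case s: String => s }
    case _ =>
      // Wrong shape or values out of range: return an empty list.
      Nil
  }

  val w: List[String] = g(List(1, 100, "a1", "b1"))
}
```

Unlike the Python function, this version always returns a list, so a caller never has to handle a missing result.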
It seems like you have a typo here:
if(L(0)>=1 & L(1<=100)) {return List(L(2), L(3))}
Wouldn't it be like this?
if(L(0)>=1 & L(1)<=100) {return List(L(2), L(3))}
The error seems to point out there's something wrong with that extra bracket there.
Scala has a built-in method that does this, takeRight:
scala> List(1,2,3,4,5).takeRight(2)
res44: List[Int] = List(4, 5)

Using "next" in an "iterator" and getting type mismatch error

I am writing my code in Scala and need to loop over a vector of points in an image, but I am getting a type mismatch error. I can understand why the error occurs, but I don't know how to solve it. Here is my code:
val output= new Mat (image.rows, image.cols,CV_8UC3,new Scalar(0, 0, 0))
val it = Iterator(vect1)
var vect3=new Array[Byte](3)
vect3(0)=0
vect3(1)=255.toByte
vect3(2)=0
var e= new Point(0,0)
while(it.hasNext){
e = it.next();
output.put(e.x.toInt,e.y.toInt,vect3)
}
and I am getting this error:
...
type mismatch;
found : scala.collection.mutable.ArrayBuffer[org.opencv.core.Point]
required: org.opencv.core.Point
e = it.next()
By doing val it = Iterator(vect1), you are creating an iterator that iterates on vect1 itself, and not on vect1's elements. Thankfully, you don't need to create an iterator for that, because it already exists :
val vect1 = ArrayBuffer(1, 2, 3)
// vect1: scala.collection.mutable.ArrayBuffer[Int] = ArrayBuffer(1, 2, 3)
val it = vect1.iterator
// it: Iterator[Int] = non-empty iterator
while (it.hasNext) {
println(it.next)
}
// 1
// 2
// 3
// res0: Unit = ()
Note that, according to the Scala API documentation, ArrayBuffer inherits from IterableLike. IterableLike basically means that a collection is iterable, so it makes sense that it defines a method which returns an iterator.
By the way, you can also avoid accessing the iterator directly, using either the foreach method or a for comprehension, because IterableLike also defines the foreach method:
// foreach
vect1.foreach(p => output.put(p.x.toInt, p.y.toInt, vect3))
// for comprehension
for (p <- vect1) {
output.put(p.x.toInt, p.y.toInt, vect3)
}
Using the foreach method or the for comprehension is strictly equivalent: the compiler translates for comprehensions to one or more method calls; in this case, a call to foreach.
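The difference between wrapping a collection in an iterator and asking the collection for its iterator can be seen in a small self-contained sketch. Pairs of Int stand in for the OpenCV Point type here, and the name IteratorSketch is made up for illustration.

```scala
import scala.collection.mutable.ArrayBuffer

object IteratorSketch {
  // Tuples stand in for org.opencv.core.Point so the sketch has no dependencies.
  val points: ArrayBuffer[(Int, Int)] = ArrayBuffer((1, 2), (3, 4), (5, 6))

  // Iterator(points) yields one element: the buffer itself.
  val wrapped: Iterator[ArrayBuffer[(Int, Int)]] = Iterator(points)

  // points.iterator yields the buffer's elements one by one.
  val elements: Iterator[(Int, Int)] = points.iterator

  // size consumes an iterator, so each one is only counted once here.
  val wrappedCount: Int = wrapped.size
  val elementCount: Int = elements.size

  // The for comprehension form, collecting the x coordinates.
  val xs: List[Int] = (for (p <- points) yield p._1).toList
}
```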

HOMap implementation example

I was watching this video by Daniel Spiewak and tried to implement its sample about higher kinds. Here's what I have:
/* bad style */
val map: Map[Option[Any], List[Any]] = Map (
Some("foo") -> List("foo", "bar", "baz"),
Some(42) -> List(1, 1, 2, 3, 5, 8),
Some(true) -> List(true, false, true, false)
)
val xs: List[String] =
map(Some("foo")).asInstanceOf[List[String]] // ugly cast
val ys: List[Int] =
map(Some(42)).asInstanceOf[List[Int]] // another one
println(xs)
println(ys)
/* higher kinds usage */
// HOMAP :: ((* => *) x (* => *)) => *
class HOMap[K[_], V[_]](delegate: Map[K[Any], V[Any]]) {
def apply[A](key: K[A]): V[A] =
delegate(key.asInstanceOf[K[Any]]).asInstanceOf[V[A]]
}
object HOMap {
type Pair[K[_], V[_]] = (K[A], V[A]) forSome { type A }
def apply[K[_], V[_]](tuples: Pair[K, V]*) =
new HOMap[K, V](Map(tuples: _*))
}
val map_b: HOMap[Option, List] = HOMap[Option, List](
Some("foo") -> List("foo", "bar", "baz"),
Some(42) -> List(1, 1, 2, 3, 5, 8),
Some(true) -> List(true, false, true, false)
)
val xs_b: List[String] = map_b(Some("foo"))
val ys_b: List[Int] = map_b(Some(42))
println(xs_b)
println(ys_b)
Unfortunately launching this I get the type mismatch error:
username@host:~/workspace/scala/samples$ scala higher_kinds.scala
/home/username/workspace/scala/samples/higher_kinds.scala:30: error: type mismatch;
found : Main.$anon.HOMap.Pair[K,V]*
required: Seq[(K[Any], V[Any])]
new HOMap[K, V](Map(tuples: _*))
^
one error found
My questions:
How can I fix this? I understand that I just need to pass in the right type, but I have little experience with this kind of thing in Scala and can't figure it out.
Why does this happen? The expression tuples: _* is presumably widely used for passing arguments to Map, but here it has the strange type Main.$anon.HOMap.Pair[K,V]* rather than the one that is required.
Why does that example no longer work? Did some recent change to the Scala language alter the syntax?
Thanks for answers!
The problem is in the type variance conditions. In the line def apply[K[_], V[_]] you need to guarantee that the containers K[_] and V[_] can be cast to K[Any] and V[Any].
Just add a covariance annotation (+) to the K and V containers:
object HOMap {
def apply[K[+_], V[+_]](tuples: (Pair[K[A], V[A]] forSome { type A })*) =
new HOMap[K, V](Map(tuples: _*))
}
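What the covariance annotation buys can be shown in isolation. The sketch below is not the HOMap code itself; Box and widen are hypothetical names used only to show that a covariant container of A can be used where a container of Any is expected.

```scala
object VarianceSketch {
  // Hypothetical covariant container: the + makes Box[String] a subtype of Box[Any].
  class Box[+A](val value: A)

  val strings: Box[String] = new Box("foo")
  val anys: Box[Any] = strings // compiles only because Box is covariant in A

  // The same idea for a higher-kinded parameter: F[+_] promises covariance,
  // so F[A] can be returned where F[Any] is expected, with no cast.
  def widen[F[+_], A](fa: F[A]): F[Any] = fa

  val widened: List[Any] = widen[List, Int](List(1, 2, 3))
}
```

Without the + in F[+_], the body of widen would be rejected, because the compiler could not assume that F[A] conforms to F[Any].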

SortedSet map does not always preserve element ordering in result?

Given the following Scala 2.9.2 code:
Updated with non-working example
import collection.immutable.SortedSet
case class Bar(s: String)
trait Foo {
val stuff: SortedSet[String]
def makeBars(bs: Map[String, String])
= stuff.map(k => Bar(bs.getOrElse(k, "-"))).toList
}
case class Bazz(rawStuff: List[String]) extends Foo {
val stuff = SortedSet(rawStuff: _*)
}
// test it out....
val b = Bazz(List("A","B","C"))
b.makeBars(Map("A"->"1","B"->"2","C"->"3"))
// List[Bar] = List(Bar(1), Bar(2), Bar(3))
// Looks good?
// Make a really big list not in order. This is why we pass it to a SortedSet...
val data = Stream.continually(util.Random.shuffle(List("A","B","C","D","E","F"))).take(100).toList
val b2 = Bazz(data.flatten)
// And how about a sparse map...?
val bs = util.Random.shuffle(Map("A" -> "1", "B" -> "2", "E" -> "5").toList).toMap
b2.makeBars(bs)
// res24: List[Bar] = List(Bar(1), Bar(2), Bar(-), Bar(5))
I've discovered that, in some cases, the makeBars method of classes extending Foo does not return a sorted List. In fact, the list ordering does not reflect the ordering of the SortedSet.
What am I missing in the above code that explains why Scala does not always map a SortedSet to a List whose elements follow the SortedSet ordering?
You're being surprised by implicit resolution.
The map method requires a CanBuildFrom instance that's compatible with the target collection type (in simple cases, identical to the source collection type) and the mapper function's return type.
In the particular case of SortedSet, its implicit CanBuildFrom requires that an Ordering[A] (where A is the return type of the mapper function) be available. When your map function returns something that the compiler already knows how to find an Ordering for, you're good:
scala> val ss = collection.immutable.SortedSet(10,9,8,7,6,5,4,3,2,1)
ss: scala.collection.immutable.SortedSet[Int] = TreeSet(1, 2, 3, 4, 5,
6, 7, 8, 9, 10)
scala> val result1 = ss.map(_ * 2)
result1: scala.collection.immutable.SortedSet[Int] = TreeSet(2, 4, 6, 8, 10,
12, 14, 16, 18, 20)
// still sorted because Ordering[Int] is readily available
scala> val result2 = ss.map(_ + " is a number")
result2: scala.collection.immutable.SortedSet[String] = TreeSet(1 is a number,
10 is a number,
2 is a number,
3 is a number,
4 is a number,
5 is a number,
6 is a number,
7 is a number,
8 is a number,
9 is a number)
// The default Ordering[String] is an "asciibetical" sort,
// so 10 comes between 1 and 2. :)
However, when your mapper function turns out to return a type for which no Ordering is known, the implicit on SortedSet doesn't match (specifically, no value can be found for its implicit parameter), so the compiler looks "upward" for a compatible CanBuildFrom and finds the generic one from Set.
scala> case class Foo(i: Int)
defined class Foo
scala> val result3 = ss.map(Foo(_))
result3: scala.collection.immutable.Set[Foo] = Set(Foo(10), Foo(4), Foo(6), Foo(7), Foo(1), Foo(3), Foo(5), Foo(8), Foo(9), Foo(2))
// The default Set is a hash set, therefore ordering is not preserved
Of course, you can get around this by simply supplying an instance of Ordering[Foo] that does whatever you expect:
scala> implicit val fooIsOrdered: Ordering[Foo] = Ordering.by(_.i)
fooIsOrdered: Ordering[Foo] = scala.math.Ordering$$anon$9@7512dbf2
scala> val result4 = ss.map(Foo(_))
result4: scala.collection.immutable.SortedSet[Foo] = TreeSet(Foo(1), Foo(2),
Foo(3), Foo(4), Foo(5),
Foo(6), Foo(7), Foo(8),
Foo(9), Foo(10))
// And we're back!
Finally, note that toy examples often don't exhibit the problem, because the Scala collection library has special implementations for small (n <= 4) Sets and Maps, and those keep their elements in insertion order.
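For the makeBars method in the question, another way around the problem is to leave the set before mapping, so that map builds a List and no Ordering for the result type is needed. This is a minimal sketch with shortened sample data; the names SortedMapSketch and lookup are made up for illustration.

```scala
import scala.collection.immutable.SortedSet

object SortedMapSketch {
  case class Bar(s: String)

  // Inserted out of order; the SortedSet iterates alphabetically.
  val stuff: SortedSet[String] = SortedSet("F", "B", "E", "A", "D", "C")
  val lookup: Map[String, String] = Map("A" -> "1", "B" -> "2", "E" -> "5")

  // toList comes first, so map produces a List in the set's sorted order.
  val bars: List[Bar] = stuff.toList.map(k => Bar(lookup.getOrElse(k, "-")))
}
```

Note that this also keeps duplicates: every key without an entry yields its own Bar("-"), whereas mapping the set directly collapses equal results into one element.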
You're probably making an assumption about what SortedSet does based on Java. You need to specify what order you want the elements to be in. See http://www.scala-lang.org/docu/files/collections-api/collections_8.html