Running a function against every item in a collection - Scala

I have this data type:
counted: org.apache.spark.rdd.RDD[(String, Seq[(String, Int)])] = MapPartitionsRDD[24] at groupByKey at <console>:28
And I'm trying to apply the following to this type:
def func = 2
counted.flatMap { x => counted.map { y => ((x._1+","+y._1),func) } }
So each sequence is compared to every other sequence and a function is applied. For simplicity the function just returns 2. When I attempt the above, I receive this error:
scala> counted.flatMap { x => counted.map { y => ((x._1+","+y._1),func) } }
<console>:33: error: type mismatch;
found : org.apache.spark.rdd.RDD[(String, Int)]
required: TraversableOnce[?]
counted.flatMap { x => counted.map { y => ((x._1+","+y._1),func) } }
How can this function be applied using Spark?
I have tried
val dataArray = counted.collect
dataArray.flatMap { x => dataArray.map { y => ((x._1+","+y._1),func) } }
which converts the collection to an Array and applies the same function. But I run out of memory when I try this method. I think using an RDD is more efficient than using an Array? The maximum amount of memory I can allocate is 7g; is there a mechanism in Spark whereby I can use hard drive space to augment the available RAM?
The collection I'm running this function on contains 20,000 entries, so that is 20,000^2 comparisons (400,000,000), but in Spark terms this is quite small?

Short answer:
counted.cartesian(counted).map {
  case ((x, _), (y, _)) => (x + "," + y, func)
}
Please use pattern matching to extract the elements of nested tuples, to avoid unreadable chained underscore notation. Using _ for the second elements shows the reader that these values are being ignored.
Now, what would be even more readable (and maybe more efficient), if func doesn't use the second elements, would be to do this:
val projected = counted.map(_._1)
projected.cartesian(projected).map(x => (x._1 + "," + x._2, func))
Note that you do not need curly braces if your lambda fits on a single semantic line; this is a very common mistake in Scala.
I would like to know why you wish to have this Cartesian product; there are often ways to avoid it that are significantly more scalable. Please say what you're going to do with this Cartesian product and I will try to find a scalable way of doing what you want.
One final point: please put spaces around your operators.
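As a quick illustration on made-up data (a minimal sketch of my own, reusing func = 2 from the question):
val toy = sc.parallelize(Seq("a" -> Seq(("x", 1)), "b" -> Seq(("y", 2))))
val keys = toy.map(_._1)
keys.cartesian(keys).map(x => (x._1 + "," + x._2, func)).collect()
// keys "a,a", "a,b", "b,a", "b,b", each paired with func (2)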

@RexKerr pointed out to me that I was somewhat incorrect in the comment section, so I deleted my comments. But while doing that, I had the chance to read the post again and came up with an idea that might be of some use to you.
Since what you are trying to implement is actually some operation over a Cartesian product, you might want to try just calling RDD#cartesian. Here is a dumb example, but if you can give some real code, maybe I'll be able to do something like this in that case as well:
// get collection with the type corresponding to the type in question:
val v1 = sc.parallelize(List("q" -> (".", 0), "s" -> (".", 1), "f" -> (".", 2))).groupByKey
// try doing something
v1.cartesian(v1).map { x => (x._1._1 + "," + x._2._1, 2) }.foreach(println)

Related

How to define case class with a list of tuples and access the tuples in scala

I have a case class with a parameter a, which is a list of Int tuples. I want to iterate over a and define operations on it.
I have tried the following:
case class XType (a: List[(Int, Int)]) {
  for (x <- a) {
    assert(x._2 >= 0)
  }
  def op(): XType = {
    for (x <- XType(a))
      yield (x._1, x._2)
  }
}
However, I am getting the error:
"Value map is not a member of XType."
How can I access the integers of tuples and define operations on them?
You're running into an issue with for comprehensions, which are really another way of expressing things like foreach and map (and flatMap and withFilter/filter). See here and here for more explanation.
Your first for comprehension (the one with asserts) is equivalent to
a.foreach(x => assert(x._2 >= 0))
a is a List, x is an (Int, Int), everything's good.
However, the second one (in op) translates to
XType(a).map(x => x)
which doesn't make sense: XType doesn't know what to do with map, as the error said.
An instance of XType refers to its a as simply a (or this.a), so a.map(x => x) would be just fine in op (and then turn the result into a new XType).
As a general rule, for comprehensions are handy for nested maps (or flatMaps or whatever), rather than as a 1-1 equivalent of for loops in other languages; just use map instead.
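For instance (a minimal illustration of my own, not from the question), a nested for comprehension is just sugar for flatMap and map:
val pairs = for (x <- List(1, 2); y <- List("a", "b")) yield (x, y)
// is desugared to roughly:
val same = List(1, 2).flatMap(x => List("a", "b").map(y => (x, y)))
// both give List((1,a), (1,b), (2,a), (2,b))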
You can access the tuple list like this:
def op(): XType = {
  XType(a.map(...))
}
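Put together, a corrected version might look like this (a sketch with a made-up operation, here adding 1 to both elements of every tuple):
case class XType(a: List[(Int, Int)]) {
  a.foreach(x => assert(x._2 >= 0))

  // map over the plain list `a`, then wrap the result in a new XType
  def op(): XType = XType(a.map { case (p, q) => (p + 1, q + 1) })
}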

too many map keys causing out of memory exception in spark

I have an RDD 'inRDD' of the form RDD[(Vector[(Int, Byte)], Vector[(Int, Byte)])], i.e. a pair RDD (key, value) where the key is a Vector[(Int, Byte)] and the value is a Vector[(Int, Byte)].
For each element (Int, Byte) in the key vector and each element (Int, Byte) in the value vector, I would like to get a new (key, value) pair in the output RDD as ((Int, Int), (Byte, Byte)).
That should give me an RDD of the form RDD[((Int, Int), (Byte, Byte))].
For example, the contents of inRDD could be:
(Vector((3,2)),Vector((4,2))), (Vector((2,3), (3,3)),Vector((3,1))), (Vector((1,3)),Vector((2,1))), (Vector((1,2)),Vector((2,2), (1,2)))
which would become
((3,4),(2,2)), ((2,3),(3,1)), ((3,3),(3,1)), ((1,2),(3,1)), ((1,2),(2,2)), ((1,1),(2,2))
I have the following code for that.
val outRDD = inRDD.flatMap {
  case (left, right) =>
    for ((ll, li) <- left; (rl, ri) <- right) yield {
      (ll, rl) -> (li, ri)
    }
}
It works when the vectors in inRDD are small. But when there are a lot of elements in the vectors, I get an out of memory exception. Increasing the memory available to Spark only solves it for smaller inputs, and the error appears again for even larger inputs.
It looks like I am trying to assemble a huge structure in memory. I am unable to rewrite this code in any other way.
I have implemented similar logic in Java with Hadoop as follows.
for (String fromValue : fromAssetVals) {
  fromEntity = fromValue.split(":")[0];
  fromAttr = fromValue.split(":")[1];
  for (String toValue : toAssetVals) {
    toEntity = toValue.split(":")[0];
    toAttr = toValue.split(":")[1];
    oKey = new Text(fromEntity.trim() + ":" + toEntity.trim());
    oValue = new Text(fromAttr + ":" + toAttr);
    outputCollector.collect(oKey, oValue);
  }
}
But when I try something similar in Spark, I get nested RDD exceptions.
How do I do this efficiently with Spark using Scala?
Well, if the Cartesian product is the only option, you can at least make it a little bit lazier:
inRDD.flatMap { case (xs, ys) =>
  xs.toIterator.flatMap(x => ys.toIterator.map(y => (x, y)))
}
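If you also want the ((Int, Int), (Byte, Byte)) shape from the question, the same lazy formulation can carry the rearrangement (a sketch following the question's types):
inRDD.flatMap { case (xs, ys) =>
  xs.toIterator.flatMap { case (ll, li) =>
    ys.toIterator.map { case (rl, ri) => ((ll, rl), (li, ri)) }
  }
}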
You can also handle this at the Spark level:
import org.apache.spark.RangePartitioner
val indexed = inRDD.zipWithUniqueId.map(_.swap)
val partitioner = new RangePartitioner(indexed.partitions.size, indexed)
val partitioned = indexed.partitionBy(partitioner)
val lefts = partitioned.flatMapValues(_._1)
val rights = partitioned.flatMapValues(_._2)
lefts.join(rights).values
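If needed, the joined values can then be reshaped into the requested form (a small follow-up of my own, not part of the original answer):
lefts.join(rights).values
  .map { case ((ll, li), (rl, ri)) => ((ll, rl), (li, ri)) }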

How to iterate values of map in Scala?

For the value val m = Map(2 -> (3, 2), 1 -> (2, 1))
I want to add up the elements belonging to the same key, so the result would be: Map(2 -> 5, 1 -> 3). Please help me solve this problem; I'll appreciate any help!
Consider
m.mapValues { case (x, y) => x + y }
which creates a new Map with same keys and computed values. Also consider
def f(t: (Int, Int)) = t._1 + t._2
and so a more concise approach includes this
m.mapValues(f)
Note Decomposing tuples in function arguments for details on declaring a function that can take the tuples from the Map.
Update: Following an important note by @KevinMeredith (see link in comment below), mapValues provides a view onto the collection and the transformation needs to be referentially transparent; hence, as a standard (intuitive) approach, consider pattern matching on the entire key-value pair using map, for instance like this:
m.map { case (x, (t1, t2)) => x -> (t1 + t2) }
or
m.map { case (k, v) => (k, f(v)) }
or
for ((x, (t1, t2)) <- m) yield x -> (t1 + t2)
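To see why the mapValues view matters, here is a minimal sketch of my own (Scala 2.12 semantics): the transformation is re-run on every access, whereas map computes the result once.
val m = Map(2 -> (3, 2), 1 -> (2, 1))                                  // the Map from the question
val view = m.mapValues { case (x, y) => println("computing"); x + y }
view(2) // prints "computing"
view(2) // prints "computing" again: the value is recomputed on each access
val strict = m.map { case (k, (x, y)) => k -> (x + y) }                // Map(2 -> 5, 1 -> 3), computed once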

scala + spark: serialize a task for recommendations

I've been working with Scala + Spark and the Movie Recommendation with MLlib tutorial.
After obtaining my predictions I need the top 3 items per user.
val predictions =
  model.predict(usersProducts).map { case Rating(user, product, rate) =>
    (user, product, rate)
  }
I've tried this:
def myPrint(x: (Int, Int, Double)) = println(x._1 + ":" + x._2 + " - " + x._3)
predictions.collect().sortBy(- _._3).groupBy(_._1).foreach( t2 => t2._2.take(3).foreach(myPrint) )
(_._1 is the user, _._2 is the item, _._3 is the rate)
I had to add the collect() method to make it work; without it I can't serialize this task.
By the way, I added the myPrint method because I don't know how to obtain a collection or map from the last line.
Any idea to make it serializable?
Any idea to get a collection/map from last line?
If I can't do better, in myPrint I will write to a database and commit after every 1000 inserts.
Thanks.
You could make sure that all the computations are done in RDDs by slightly modifying your approach:
predictions.sortBy(- _.rating).groupBy(_.user)
.flatMap(_._2.take(3)).foreach(println)
A task that calls a method has to serialize the object containing the method. Try using a function value instead:
val myPrint: ((Int, Int, Double)) => Unit = x => ...
You don't want the collect() at the start, that defeats the whole point of using Spark.
I don't understand what you're saying about "get a collection/map". .take(3) already returns a collection.
After reading lmm's answer and doing some research, I resolved my problem this way:
First, I began to work with Rating objects instead of tuples:
val predictions = model.predict(usersProducts)
Then I defined the function value as follows; now I do the "take" here:
def myPrint: ((Int, Iterable[Rating])) => Unit = x => x._2.take(3).foreach(println)
So, now I mix everything this way:
predictions.sortBy(- _.rating).groupBy(_.user).foreach(myPrint)
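If you prefer a local collection over printing (one of the original questions), something along these lines should work (a sketch of my own, assuming predictions: RDD[Rating] as above):
import org.apache.spark.mllib.recommendation.Rating

val top3PerUser: scala.collection.Map[Int, Seq[Rating]] =
  predictions
    .groupBy(_.user)                                 // (user, all Ratings for that user)
    .mapValues(_.toSeq.sortBy(-_.rating).take(3))    // keep the 3 highest-rated per user
    .collectAsMap()                                  // bring the small result to the driver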

Is there a way to handle the last case differently in a Scala for loop?

For example, suppose I have
for (line <- myData) {
println("}, {")
}
Is there a way to get the last line to print
println("}")
Can you refactor your code to take advantage of the built-in mkString?
scala> List(1, 2, 3).mkString("{", "}, {", "}")
res1: String = {1}, {2}, {3}
Before going any further, I'd recommend you avoid println in a for-comprehension. It can sometimes be useful for tracking down a bug that occurs in the middle of a collection, but otherwise leads to code that's harder to refactor and test.
More generally, life usually becomes easier if you can restrict where any sort of side-effect occurs. So instead of:
for (line <- myData) {
println("}, {")
}
You can write:
val lines = for (line <- myData) yield "}, {"
println(lines mkString "\n")
I'm also going to take a guess here that you wanted the content of each line in the output!
val lines = for (line <- myData) yield (line + "}, {")
println(lines mkString "\n")
Though you'd be better off still if you just used mkString directly - that's what it's for!
val lines = myData.mkString("{", "\n}, {", "}")
println(lines)
Note how we're first producing a String, then printing it in a single operation. This approach can easily be split into separate methods and used to implement toString on your class, or to inspect the generated String in tests.
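For example (a sketch with a hypothetical wrapper class), the same mkString call could back a toString:
class Block(myData: Seq[String]) {
  // build the full String once; printing, testing, or logging can all reuse it
  override def toString: String = myData.mkString("{", "\n}, {", "}")
}
println(new Block(Seq("a", "b", "c")))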
I fully agree with what has been said before about using mkString, and about distinguishing the first iteration rather than the last one. Should you still need to distinguish the last one, Scala collections have an init method, which returns all elements but the last.
So you can do
for(x <- coll.init) workOnNonLast(x)
workOnLast(coll.last)
(init and last being sort of the opposite of head and tail, which are the first element and all but the first). Note however that, depending on the structure, they may be costly. On Vector, all of them are fast. On List, while head and tail are basically free, init and last are both linear in the length of the list. headOption and lastOption may help you when the collection may be empty, replacing workOnLast(coll.last) with
for (x <- coll.lastOption) workOnLast(x)
You may take the addString function of the TraversableOnce trait as an example.
def addString(b: StringBuilder, start: String, sep: String, end: String): StringBuilder = {
  var first = true
  b append start
  for (x <- self) {
    if (first) {
      b append x
      first = false
    } else {
      b append sep
      b append x
    }
  }
  b append end
  b
}
In your case, the separator is "}, {" and the end is "}".
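Used directly (a brief sketch, reusing the delimiters from the mkString answers above), that would be:
val sb = new StringBuilder
myData.addString(sb, "{", "}, {", "}")
println(sb)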
If you don't want to use the built-in mkString function, you can do something like
for (line <- lines)
  if (line == lines.last) println("last")
  else println(line)
UPDATE: As didierd mentioned in the comments, this solution is wrong because the last value can occur several times; he provides a better solution in his answer.
It is fine for Vectors, because the last function takes "effectively constant time" for them; for Lists it takes linear time, so you can use pattern matching instead:
import scala.annotation.tailrec

@tailrec
def printLines[A](l: List[A]) {
  l match {
    case Nil =>
    case x :: Nil => println("last")
    case x :: xs => println(x); printLines(xs)
  }
}
Other answers rightfully point to mkString, and for a normal amount of data I would also use that.
However, mkString builds (accumulates) the end result in memory through a StringBuilder. This is not always desirable, depending on the amount of data we have.
In this case, if all we want is to "print", we don't need to build the big result first (and maybe we even want to avoid this).
Consider the implementation of this helper function:
def forEachIsLast[A](iterator: Iterator[A])(operation: (A, Boolean) => Unit): Unit = {
  while (iterator.hasNext) {
    val element = iterator.next()
    val isLast = !iterator.hasNext // if there is no "next", this is the last one
    operation(element, isLast)
  }
}
It iterates over all elements and invokes operation passing each element in turn, with a boolean value. The value is true if the element passed is the last one.
In your case it could be used like this:
forEachIsLast(myData) { (line, isLast) =>
  if (isLast)
    println("}")
  else
    println("}, {")
}
We have the following advantages here:
It operates on each element, one by one, without necessarily accumulating the result in memory (unless you want to).
Because it does not need to load the whole collection into memory to check its size, it's enough to ask the Iterator if it's exhausted or not. You could read data from a big file, or from the network, etc.
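One practical note (my addition): forEachIsLast expects an Iterator, so if myData is a regular collection rather than an Iterator, pass its iterator explicitly:
forEachIsLast(myData.iterator) { (line, isLast) =>
  println(if (isLast) "}" else "}, {")
}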