Scala + Spark: serialize a task for recommendations

I've been working with Scala + Spark and the Movie Recommendation with MLlib tutorial.
After obtaining my predictions I need the top 3 items per user.
val predictions =
  model.predict(usersProducts).map { case Rating(user, product, rate) =>
    (user, product, rate)
  }
I've tried this:
def myPrint(x: (Int, Int, Double)) = println(x._1 + ":" + x._2 + " - " + x._3)
predictions.collect().sortBy(- _._3).groupBy(_._1).foreach( t2 => t2._2.take(3).foreach(myPrint) )
(_._1 is the user, _._2 is the item, _._3 is the rate)
I had to add the collect() call to make it work, because otherwise the task cannot be serialized.
By the way, I added the myPrint method because I don't know how to obtain a collection or map from the last line.
Any idea how to make this serializable?
Any idea how to get a collection/map from the last line?
If I can't do better, I will write to a database inside myPrint and commit after every 1000 inserts.
Thanks.

You could make sure that all the computations are done in RDDs by slightly modifying your approach:
predictions.sortBy(- _.rating).groupBy(_.user)
  .flatMap(_._2.take(3)).foreach(println)

A task that calls a method has to serialize the object containing the method. Try using a function value instead:
val myPrint: ((Int, Int, Double)) => Unit = x => ...
You don't want the collect() at the start; that defeats the whole point of using Spark.
I don't understand what you're saying about "get a collection/map". .take(3) already returns a collection.
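For instance, a minimal sketch putting those two points together (reusing the tuple layout and the print logic from the question; an illustration only, not code from the original post):

val myPrint: ((Int, Int, Double)) => Unit =
  x => println(x._1 + ":" + x._2 + " - " + x._3)

predictions.sortBy(- _._3).groupBy(_._1)
  .flatMap(_._2.take(3)).foreach(myPrint)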

After reading lmm's answer and doing some research, I resolved my problem this way:
First, I started working with Rating objects instead of tuples:
val predictions = model.predict(usersProducts)
Then I defined the function value as follows; the take(3) now happens here:
def myPrint: ((Int, Iterable[Rating])) => Unit = x => x._2.take(3).foreach(println)
So now I put everything together this way:
predictions.sortBy(- _.rating).groupBy(_.user).foreach(myPrint)
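As an aside (not part of the original question or answers): if you still wanted the top 3 per user back as a collection/map on the driver instead of printing, one possible sketch is:

// Sketch only: collect the top 3 ratings per user into a Map on the driver.
// Only sensible if the number of users is small enough to fit in driver memory.
val top3PerUser: Map[Int, Seq[Rating]] =
  predictions.groupBy(_.user)
    .mapValues(_.toSeq.sortBy(-_.rating).take(3))
    .collect()
    .toMap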

Related

Unable to get println working in Grouped Integer Range map function

I am experimenting with the code below:
def TestRun(n: Int): Unit = {
  (1 to n)
    .grouped(4)
    .map(grp => { println("Group length is: " + grp.length) })
}
TestRun(100)
I am a bit surprised that I do not see any println output after executing the program. The code compiled and ran successfully, but produced none of the expected output.
Kindly point out what mistake I am making.
The reason there is no output is that grouped on a Range returns an Iterator, which is lazy. This means it won't produce any data until it is asked to. Likewise, calling map on an Iterator returns another lazy Iterator, so the result is an Iterator that will only yield values when asked. TestRun never asks for the data, so it is never generated.
One way around this is to use foreach rather than map, because foreach is eager (the opposite of lazy) and will pull each value from the Iterator in turn.
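For example, a minimal sketch of the foreach version of the code in question:

def TestRun(n: Int): Unit =
  (1 to n)
    .grouped(4)
    .foreach(grp => println("Group length is: " + grp.length))

TestRun(100)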
Another way would be to force the Iterator to become a concrete collection using something like toList:
def TestRun(n: Int): Unit = {
  (1 to n)
    .grouped(4)
    .map(grp => { println("Group length is: " + grp.length) })
    .toList
}
TestRun(100)

How to run two SparkSql queries in parallel in Apache Spark

First, let me show the part of the code I want to execute in a .scala file on Spark.
This is my source file; it has structured data with four fields:
val inputFile = sc.textFile("hdfs://Hadoop1:9000/user/hduser/test.csv")
I have declared a case class to store the data from the file in a table with four columns:
case class Table1(srcIp: String, destIp: String, srcPrt: Int, destPrt: Int)
val inputValue = inputFile.map(_.split(","))
  .map(p => Table1(p(0), p(1), p(2).trim.toInt, p(3).trim.toInt))
  .toDF()
inputValue.registerTempTable("inputValue")
Now, let's say I want to run the following two queries. How can I run them in parallel, given that they are mutually independent? I feel that running them in parallel could reduce the execution time. Right now they are executed serially.
val primaryDestValues = sqlContext.sql("SELECT distinct destIp FROM inputValue")
primaryDestValues.registerTempTable("primaryDestValues")
val primarySrcValues = sqlContext.sql("SELECT distinct srcIp FROM inputValue")
primarySrcValues.registerTempTable("primarySrcValues")
primaryDestValues.join(primarySrcValues, $"destIp" === $"srcIp")
  .select($"destIp", $"srcIp").show()
Maybe you can look in the direction of Futures/Promises. There is a method on SparkContext, submitJob, which returns a future with the results. So you could fire two jobs and then collect the results from the futures.
I have not tried this method yet; it is just an idea.
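As a rough, untested illustration of that idea, using plain Scala Futures rather than submitJob (the cache()/count() calls are just one way to force each query to actually execute inside its future):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Each future triggers an action (count), so the two Spark jobs are submitted
// concurrently; without an action the DataFrames stay lazy and nothing runs.
val destF = Future {
  val df = sqlContext.sql("SELECT distinct destIp FROM inputValue").cache()
  df.count()
  df
}
val srcF = Future {
  val df = sqlContext.sql("SELECT distinct srcIp FROM inputValue").cache()
  df.count()
  df
}

val primaryDestValues = Await.result(destF, Duration.Inf)
val primarySrcValues = Await.result(srcF, Duration.Inf)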
I don't see why you want to use sqlContext in the first place instead of keeping things simple.
val inputValue = inputFile.map(_.split(",")).map(p => (p(0), p(1), p(2).trim.toInt, p(3).trim.toInt))
Assuming p(0) = destIp and p(1) = srcIp:
val joinedValue = inputValue.map { case (destIp, srcIp, x, y) => (destIp, (x, y)) }
  .join(inputValue.map { case (destIp, srcIp, x, y) => (srcIp, (x, y)) })
  .map { case (ip, ((destX, destY), (srcX, srcY))) => (ip, destX, destY, srcX, srcY) }
Now it will be parallelized, and you can even control the number of partitions you want using coalesce.
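For example (the partition count below is only an illustrative value):

val compacted = joinedValue.coalesce(8) // 8 partitions chosen arbitrarily for illustration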
You can skip the two DISTINCTs and do a single distinct at the end:
inputValue.select($"srcIp").join(
  inputValue.select($"destIp"),
  $"srcIp" === $"destIp"
).distinct().show
That's a nice question. This can be executed in parallel using par on an array; you just have to adapt your code accordingly.
Declare an array with two items in it (you can name them however you wish), and write the code you need to execute in parallel inside each case statement.
Array("destIp","srcIp").par.foreach { i =>
{
i match {
case "destIp" => {
val primaryDestValues = sqlContext.sql("SELECT distinct destIp FROM inputValue")
primaryDestValues.registerTempTable("primaryDestValues")
}
case "srcIp" => {
val primarySrcValues = sqlContext.sql("SELECT distinct srcIp FROM inputValue")
primarySrcValues.registerTempTable("primarySrcValues")
}}}
}
Once both case branches have completed, the code below will be executed. Note that the vals above are local to their case branches, so the results are read back via the registered temp tables:
sqlContext.table("primaryDestValues")
  .join(sqlContext.table("primarySrcValues"), $"destIp" === $"srcIp")
  .select($"destIp", $"srcIp").show()
Note: if you remove par from the code, it will run sequentially.
The other option is to create another SparkSession inside the code and execute the SQL using that SparkSession variable. But this is a little risky and has to be used very carefully.

How to add line number into each line?

Suppose this is my data:
‘Maps‘ and ‘Reduces‘ are two phases of solving a query in HDFS.
‘Map’ is responsible to read data from input location.
it will generate a key value pair.
that is, an intermediate output in local machine.
’Reducer’ is responsible to process the intermediate.
output received from the mapper and generate the final output.
and I want to add a number to every line, like the output below:
1,‘Maps‘ and ‘Reduces‘ are two phases of solving a query in HDFS.
2,‘Map’ is responsible to read data from input location.
3,it will generate a key value pair.
4,that is, an intermediate output in local machine.
5,’Reducer’ is responsible to process the intermediate.
6,output received from the mapper and generate the final output.
Then save them to a file.
I've tried:
object DS_E5 {
  def main(args: Array[String]): Unit = {
    var i = 0
    val conf = new SparkConf().setAppName("prep").setMaster("local")
    val sc = new SparkContext(conf)
    val sample1 = sc.textFile("data.txt")
    for (sample <- sample1) {
      i = i + 1
      val ss = sample.map(l => (i, sample))
      println(ss)
    }
  }
}
but its output looks like this:
Vector((1,‘Maps‘ and ‘Reduces‘ are two phases of solving a query in HDFS.))
...
How can I edit my code to generate my desired output?
zipWithIndex is what you need here. It maps an RDD[T] to an RDD[(T, Long)] by adding the index in the second position of the pair.
sample1
.zipWithIndex()
.map { case (line, i) => i.toString + ", " + line }
or using string interpolation (see the comment by @DanielC.Sobral):
sample1
.zipWithIndex()
.map { case (line, i) => s"$i, $line" }
By calling val sample1 = sc.textFile("data.txt") you are creating a new RDD.
If you just need the output, you can try the following code:
sample1.zipWithIndex().foreach(f => println(f._2 + ", " + f._1))
Basically, this code does the following:
Using .zipWithIndex() returns a new RDD[(T, Long)], where (T, Long) is a tuple, T is the previous RDD's element type (java.lang.String, I believe), and Long is the index of the element in the RDD.
You have performed a transformation, so now you need an action. foreach suits this case very well: it applies your statement to every element of the current RDD, so we just call a quickly formatted println.
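As an aside (not from the original answers): the desired output in the question starts counting at 1 and should end up in a file, so a hedged sketch combining both could look like this (the output path is only an example):

sample1
  .zipWithIndex()
  .map { case (line, i) => s"${i + 1},$line" } // start numbering at 1 instead of 0
  .saveAsTextFile("numbered_output")           // example path; writes part-* files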

Running a function against every item in collection

I have this data type:
counted: org.apache.spark.rdd.RDD[(String, Seq[(String, Int)])] = MapPartitionsRDD[24] at groupByKey at <console>:28
And I'm trying to apply the following to this type:
def func = 2
counted.flatMap { x => counted.map { y => ((x._1+","+y._1),func) } }
So each sequence is compared to every other sequence and a function is applied. For simplicity the function just returns 2. When I attempt the above I receive this error:
scala> counted.flatMap { x => counted.map { y => ((x._1+","+y._1),func) } }
<console>:33: error: type mismatch;
found : org.apache.spark.rdd.RDD[(String, Int)]
required: TraversableOnce[?]
counted.flatMap { x => counted.map { y => ((x._1+","+y._1),func) } }
How can this function be applied using Spark?
I have tried
val dataArray = counted.collect
dataArray.flatMap { x => dataArray.map { y => ((x._1+","+y._1),func) } }
which converts the collection to an Array and applies the same function. But I run out of memory when I try this. I assume using an RDD is more efficient than using an Array? The maximum amount of memory I can allocate is 7g; is there a mechanism in Spark that lets me use hard drive space to augment the available RAM?
The collection I'm running this function on contains 20,000 entries, so 20,000^2 comparisons (400,000,000), but in Spark terms this is quite small?
Short answer:
counted.cartesian(counted).map {
  case ((x, _), (y, _)) => (x + "," + y, func)
}
Please use pattern matching to extract the elements of nested tuples, to avoid unreadable chained underscore notation. Using _ for the second elements shows the reader that these values are being ignored.
Now, what would be even more readable (and maybe more efficient), if func doesn't use the second elements, would be to do this:
val projected = counted.map(_._1)
projected.cartesian(projected).map(x => (x._1 + "," + x._2, func))
Note that you do not need curly braces if your lambda fits on a single semantic line; this is a very common mistake in Scala.
I would like to know why you want this Cartesian product; there are often ways to avoid it that are significantly more scalable. Please say what you're going to do with this Cartesian product and I will try to find a scalable way of doing what you want.
One final point: please put spaces between operators.
@RexKerr pointed out to me that I was somewhat incorrect in the comment section, so I deleted my comments. But while doing that, I had the chance to read the post again and came up with an idea that might be of some use to you.
Since what you are trying to implement is actually an operation over a Cartesian product, you might want to try just calling RDD#cartesian. Here is a dumb example, but if you can give some real code, maybe I'll be able to do something like this in that case as well:
// get a collection with a type corresponding to the type in question:
val v1 = sc.parallelize(List("q" -> (".", 0), "s" -> (".", 1), "f" -> (".", 2))).groupByKey
// try doing something with the cartesian product
v1.cartesian(v1).map { x => (x._1._1 + "," + x._2._1, 2) }.foreach(println)

better way of using reduceLeft/foldLeft on list of case class

My intention is simple: to get the comma-separated email ids from a list of User objects.
I have done this in Java with a for loop and if/else conditions.
Now I want to do it in Scala, so I tried this:
case class User(name: String, email: String)
val userList = List(User("aaa", "aaa#aaa.com"),
  User("bbb", "bbb#bbb.com"),
  User("ccc", "ccc#ccc.com"))
Now
val mailIds = userList.foldLeft("") { (a: String, b: User) => b.email + "," + a }
gives
ccc#ccc.com,bbb#bbb.com,aaa#aaa.com,
(note the comma at the end.)
and
val mailids = userList.map(x => x.email).reduceLeft(_+","+_)
gives
aaa#aaa.com,bbb#bbb.com,ccc#ccc.com
I tried using only reduceLeft like this:
val emailids = userList.reduceLeft((emails: String, user: User) => emails+", "+user.email)
but it throws a compilation error:
type mismatch; found : (String, User) => java.lang.String required: (java.io.Serializable, User) => java.io.Serializable
So, is there a better way of using reduceLeft without map in the above case?
No; reduceLeft is a foldLeft where the accumulator is the first element of the collection. In your case this first element is a User, and you have to pass an argument of the same type for the next step of the iteration. That is why you get a type mismatch.
You can do the map within a lazy collection and reduce it like the following:
val mailIds = userList.view map (_.email) reduceLeft (_ + "," + _)
This will give you aaa#aaa.com,bbb#bbb.com,ccc#ccc.com without creating a second collection. Jessie Eichar wrote a very good tutorial about that.
EDIT
A way without a lazy collection would be to delete the last comma:
val rawMailIds = userList.foldLeft("")((acc, elem) => acc + elem.email + ",")
val mailIds = rawMailIds.substring(0, rawMailIds.length - 1)
Why don't you stick with the combination of map and reduce? In any case, the issue with your reduceLeft-only version is that it needs a String as its initial accumulator value, and reduceLeft always starts from the first element of the list; for that you need foldLeft.
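For instance, a minimal sketch of that map-free foldLeft approach, using an empty String as the initial value and avoiding the trailing comma:

val mailIds = userList.foldLeft("") { (acc, user) =>
  if (acc.isEmpty) user.email else acc + "," + user.email
}
// mailIds == "aaa#aaa.com,bbb#bbb.com,ccc#ccc.com"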