RDD Spark Scala - compute the average over an RDD of tuples

I'm quite new to Scala and Spark, so I'm having issues with this simple task. I need to calculate an average deviation for some movie ratings. Here is my code:
def RDDcomputeItemAvgDev(train: RDD[Rating]) = {
  // WRONG
  val avg = RDDcomputeItemAvgRat(train: RDD[Rating])
  val pairs = train.map(i => (i.item, (i.rating - avg(i.item)) / scale(i.rating, avg(i.item))))
  val sumCountPair = pairs.combineByKey(
    (x: Double) => (x, 1),
    (pair1: (Double, Int), x: Double) => (pair1._1 + x, pair1._2 + 1),
    (pair1: (Double, Int), pair2: (Double, Int)) => (pair1._1 + pair2._1, pair1._2 + pair2._2))
  val average = sumCountPair.map(x => (x._1, x._2._1 / x._2._2))
  average.collectAsMap()
}
where the deviation function is the expression:
(i.rating - avg(i.item)) / scale(i.rating, avg(i.item))
The code runs, but the result is not correct. I cannot really spot my mistake; can someone help?
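For reference, a minimal sketch of just the per-item averaging step on its own, using aggregateByKey instead of combineByKey (assuming Int item ids; Rating, scale and RDDcomputeItemAvgRat are the question's own pieces and are not defined here):

import org.apache.spark.rdd.RDD

// Sketch: average a Double per key, e.g. pairs of (item, deviation).
def averageByKey(pairs: RDD[(Int, Double)]): scala.collection.Map[Int, Double] =
  pairs
    .aggregateByKey((0.0, 0))(                            // accumulator = (sum, count)
      { case ((sum, cnt), v) => (sum + v, cnt + 1) },     // fold a value into the accumulator
      { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) } // merge two partial accumulators
    )
    .mapValues { case (sum, cnt) => sum / cnt }
    .collectAsMap()

If this piece gives the expected per-item averages on its own, the problem is more likely in avg or scale than in the combineByKey step.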

Related

How to perform transformations on list/array of tuples in spark scala RDD?

I have a list of tuples. How can I perform a reduce on the integer values of each tuple?
val student=List((1,"akshay",60),(2,"salman",70),(3,"ranveer",50))
val student_rdd=sc.parallelize(student)
student_rdd.reduce((a,b)=>(a._3+b._3)).collect
error: type mismatch;
found: Int
required: (Int, String, Int)
You can map out the values before reducing: the other columns are not needed for the reduction, so drop them first.
student_rdd.map(_._3).reduce(_+_)
There are much better ways than using RDDs, but if you want to get sum, min, max and avg in one pass using reduce, then this would work:
val res = {
  val a = student_rdd
    .map(r => (r._3, r._3, r._3, 1))
    .reduce((a, b) => (a._1 + b._1, Math.min(a._2, b._2), Math.max(a._3, b._3), a._4 + b._4))
  a.copy(_4 = a._1 * 1.0 / a._4)
}
This gives you a tuple of (sum, min, max, avg).
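To make the "much better ways" remark concrete, here is a hedged sketch using the DataFrame API instead of RDDs (it assumes a SparkSession named spark is in scope, as in spark-shell; the column names are made up for the example):

import org.apache.spark.sql.functions.{avg, max, min, sum}
import spark.implicits._  // assumes a SparkSession called `spark`

val df = student.toDF("id", "name", "marks")
df.agg(sum("marks"), min("marks"), max("marks"), avg("marks")).show()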

How can we sample from a list of strings based on their probability of occurrence in the list in Scala?

I have a List[(String, Double)] variable where the second element of each tuple denotes the probability of the string in the first element appearing in a corpus. An example would be [(Apple, 0.2), (Banana, 0.3), (Lemon, 0.5)], where Apple appears with a probability of 0.2 in the list of strings. I want to randomly sample from the list of strings based on their probability of appearance, something along the lines of numpy's random.choice() method. What would be the correct way to do this in Scala?
Another solution:
def choice(samples: Seq[(String, Double)], n: Int): Seq[String] = {
  val (strings, probs) = samples.unzip
  val cumprobs = probs.scanLeft(0.0) { _ + _ }.init
  def p2s(p: Double): String = strings(cumprobs.lastIndexWhere(_ <= p))
  Seq.fill(n)(math.random).map(p2s)
}
A usage example (and a quick check):
>> val ss = choice(Seq(("Apple", 0.2), ("Banana", 0.3), ("Lemon", 0.5)), 10000)
>> ss.groupBy(identity).map{ case(k, v) => (k, v.size)}
Map[String, Int] = Map(Banana -> 3013, Lemon -> 4971, Apple -> 2016)
A very naive (and inefficient) solution would be to create a List of 100 elements that repeats each of the original elements the number of times needed to respect its probability. Then you can randomly shuffle that List and finally take the first element.
import scala.util.Random

final val percent_100 = BigDecimal(100)

def choice[T](data: List[(T, Double)]): T = {
  val distribution = data.flatMap {
    case (elem, probability) =>
      val scaledProbability = BigDecimal(probability).setScale(
        scale = 2,
        BigDecimal.RoundingMode.HALF_EVEN
      )
      val n = (scaledProbability * percent_100).toIntExact
      List.fill(n)(elem)
  }
  Random.shuffle(distribution).head
}
However, I am sure there should be better ways of solving this.
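For instance, a sketch of one such improvement: draw a single uniform number and scan the cumulative weights, instead of materialising a repeated list (choiceWeighted is a made-up name; it also tolerates weights that do not sum exactly to 1):

import scala.util.Random

def choiceWeighted[T](data: List[(T, Double)]): T = {
  val r = Random.nextDouble() * data.map(_._2).sum   // scale the draw by the total weight
  val cumulative = data.scanLeft(0.0)(_ + _._2).tail // running totals of the weights
  data.zip(cumulative)
    .collectFirst { case ((elem, _), cum) if cum >= r => elem }
    .getOrElse(data.last._1)                         // guard against floating-point round-off
}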

Multiplication of "double" values in scala

I want to multiply two sparse matrices in Spark using Scala. I pass these matrices as arguments and store the result in another argument.
The matrices are text files where each matrix element is represented as: row, column, value.
I am not able to multiply two Double values in Scala.
object MultiplySpark {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Multiply")
    conf.setMaster("local[2]")
    val sc = new SparkContext(conf)

    val M = sc.textFile(args(0)).flatMap(entry => {
      val rec = entry.split(",")
      val row = rec(0).toInt
      val column = rec(1).toInt
      val value = rec(2).toDouble
      for { pointer <- 1 until rec.length } yield ((row, column), value)
    })
    val N = sc.textFile(args(0)).flatMap(entry => {
      val rec = entry.split(",")
      val row = rec(0).toInt
      val column = rec(1).toInt
      val value = rec(2).toDouble
      for { pointer <- 1 until rec.length } yield ((row, column), value)
    })

    val Mmap = M.map(e => (e._2, e))
    val Nmap = N.map(d => (d._2, d))

    val MNjoin = Mmap.join(Nmap).map { case (k, (e, d)) => e._2.toDouble + "," + d._2.toDouble }

    val result = MNjoin.reduceByKey((a, b) => a * b)
      .map(entry => {
        ((entry._1._1, entry._1._2), entry._2)
      })
      .reduceByKey((a, b) => a + b)

    result.saveAsTextFile(args(2))
    sc.stop()
  }
}
How can I multiply Double values in Scala?
Please note:
I tried a.toDouble * b.toDouble
The error is: value * is not a member of (Double, Double)
This reduceByKey would work if you had an RDD[((Int, Int), Double)] (or, more generally, an RDD[(SomeType, Double)]), but join gives you an RDD[((Int, Int), (Double, Double))]. So you are trying to multiply pairs of type (Double, Double), not Doubles.
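To actually multiply the matrices you need to join on the shared dimension (M's column index against N's row index), not on the cell values. A hedged sketch of that pattern, assuming M and N are the RDD[((Int, Int), Double)] values built above (note that the posted code reads both matrices from args(0); presumably N should come from args(1)):

// Re-key M by its column index and N by its row index so the join pairs up
// exactly the entries that have to be multiplied, then sum the products per output cell.
val mByCol = M.map { case ((i, j), v) => (j, (i, v)) }
val nByRow = N.map { case ((j, k), v) => (j, (k, v)) }

val product = mByCol.join(nByRow)
  .map { case (_, ((i, mv), (k, nv))) => ((i, k), mv * nv) }
  .reduceByKey(_ + _)   // RDD[((Int, Int), Double)]: ((row, col), value) of M * N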

RDD/Scala Get one column from RDD

I have an RDD[Log] file with various fields (username, content, date, bytes) and I want to find different things for each field/column.
For example, I want to get the min/max and average bytes found in the RDD. When I do:
val q1 = cleanRdd.filter(x => x.bytes != 0)
I get the full lines of the RDD with bytes != 0. But how can I actually sum them, calculate the avg, find the min/max etc.? How can I take only one column from my RDD and apply transformations on it?
EDIT: Prasad suggested converting the type to a DataFrame, but he gave no instructions on how to do so, and I can't find a solid answer on the site. Any help would be great.
EDIT: The Log class:
case class Log (username: String, date: String, status: Int, content: Int)
Running cleanRdd.take(5).foreach(println) gives something like this:
Log(199.72.81.55 ,01/Jul/1995:00:00:01 -0400,200,6245)
Log(unicomp6.unicomp.net ,01/Jul/1995:00:00:06 -0400,200,3985)
Log(199.120.110.21 ,01/Jul/1995:00:00:09 -0400,200,4085)
Log(burger.letters.com ,01/Jul/1995:00:00:11 -0400,304,0)
Log(199.120.110.21 ,01/Jul/1995:00:00:11 -0400,200,4179)
Well... you have a lot of questions.
So... you have the following abstraction of a Log
case class Log (username: String, date: String, status: Int, content: Int, byte: Int)
Que - How can I take only one column from my RDD?
Ans - RDDs have a map function. So for an RDD[A], map takes a transform function of type A => B and turns it into an RDD[B].
val logRdd: RDD[Log] = ...
val byteRdd = logRdd
  .filter(l => l.byte != 0)
  .map(l => l.byte)
Que - How can I actually sum them?
Ans - You can do it by using reduce / fold / aggregate.
val sum = byteRdd.reduce((acc, b) => acc + b)
val sum = byteRdd.fold(0)((acc, b) => acc + b)
val sum = byteRdd.aggregate(0)(
  (acc, b) => acc + b,
  (acc1, acc2) => acc1 + acc2
)
Note: an important thing to notice here is that a sum of Ints can grow bigger than what an Int can hold. So in most real-life cases we should use at least a Long as our accumulator instead of an Int, which rules out reduce and fold (their accumulator must have the same type as the RDD's elements) and leaves us with aggregate only.
val sum = byteRdd.aggregate(0L)(
  (acc, b) => acc + b,
  (acc1, acc2) => acc1 + acc2
)
Now if you have to calculate multiple things like min, max and avg, then I suggest calculating them in a single aggregate instead of multiple passes, like this:
// (count, sum, min, max)
val accInit = (0, 0, Int.MaxValue, Int.MinValue)
val (count, sum, min, max) = byteRdd.aggregate(accInit)(
  { case ((count, sum, min, max), b) =>
      (count + 1, sum + b, Math.min(min, b), Math.max(max, b)) },
  { case ((count1, sum1, min1, max1), (count2, sum2, min2, max2)) =>
      (count1 + count2, sum1 + sum2, Math.min(min1, min2), Math.max(max1, max2)) }
)
val avg = sum.toDouble / count
Have a look at the DataFrame API. You need to convert your RDD to a DataFrame, and then you can use the min, max and avg functions like below:
val rdd = cleanRdd.filter(x => x.bytes != 0)
val df = sparkSession.sqlContext.createDataFrame(rdd, classOf[Log])
Assuming you want to operate on the bytes column, then:
import org.apache.spark.sql.functions._
df.select(avg("bytes")).show
df.select(min("bytes")).show
df.select(max("bytes")).show
Update:
Tried the following in spark-shell (the expected results for this sample data are noted after the code):
case class Log (username: String, date: String, status: Int, content: Int)
val inputRDD = sc.parallelize(Seq(
  Log("199.72.81.55", "01/Jul/1995:00:00:01 -0400", 200, 6245),
  Log("unicomp6.unicomp.net", "01/Jul/1995:00:00:06 -0400", 200, 3985),
  Log("199.120.110.21", "01/Jul/1995:00:00:09 -0400", 200, 4085),
  Log("burger.letters.com", "01/Jul/1995:00:00:11 -0400", 304, 0),
  Log("199.120.110.21", "01/Jul/1995:00:00:11 -0400", 200, 4179)
))
val rdd = inputRDD.filter(x => x.content != 0)
val df = rdd.toDF("username", "date", "status", "content")
df.printSchema
import org.apache.spark.sql.functions._
df.select(avg("content")).show
df.select(min("content")).show
df.select(max("content")).show
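// For this sample (the 0-content row is filtered out) the remaining values are
// 6245, 3985, 4085 and 4179, so these should show avg = 4623.5, min = 3985 and max = 6245.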

Applying operation to corresponding elements of Array

I want to sum the corresponding elements of the list and multiply the results while keeping the label associated with the array element, so
("a", Array((0.5,1.0), (0.667,2.0)))
becomes:
(a, (0.5 + 0.667) * (1.0 + 2.0))
Here is my code to express this for a single array element:
val data = Array(("a",Array((0.5,1.0),(0.667,2.0))), ("b",Array((0.6,2.0), (0.6,2.0))))
//> data : Array[(String, Array[(Double, Double)])] = Array((a,Array((0.5,1.0),
//| (0.667,2.0))), (b,Array((0.6,2.0), (0.6,2.0))))
val v1 = (data(0)._1, data(0)._2.map(m => m._1).sum)
//> v1 : (String, Double) = (a,1.167)
val v2 = (data(0)._1, data(0)._2.map(m => m._2).sum)
//> v2 : (String, Double) = (a,3.0)
val total = (v1._1 , (v1._2 * v2._2)) //> total : (String, Double) = (a,3.5010000000000003)
I just want to apply this function to all elements of the array, so val data above becomes:
Map[String, Double] = Map(a -> 3.5010000000000003, b -> 4.8)
But I'm not sure how to combine the above code into a single function which maps over all the array elements.
Update: the inner Array can be of variable length, so this is also valid:
val data = Array(("a",Array((0.5,1.0,2.0),(0.667,2.0,1.0))), ("b",Array((0.6,2.0), (0.6,2.0))))
Pattern matching is your friend! You can use it for tuples and arrays. If there are always two elements in the inner array, you can do it this way:
val data = Array(("a",Array((0.5,1.0),(0.667,2.0))), ("b",Array((0.6,2.0), (0.6,2.0))))
data.map {
case (s, Array((x1, x2), (x3, x4))) => s -> (x1 + x3) * (x2 + x4)
}
// Array[(String, Double)] = Array((a,3.5010000000000003), (b,4.8))
res6.toMap
// scala.collection.immutable.Map[String,Double] = Map(a -> 3.5010000000000003, b -> 4.8)
If the inner elements are variable length, you could do it this way (a for comprehension instead of explicit maps):
for {
  (s, tuples) <- data
  sum1 = tuples.map(_._1).sum
  sum2 = tuples.map(_._2).sum
} yield s -> sum1 * sum2
Note that while this is a very clear solution, it's not the most efficient possible, because we're iterating over the tuples twice. You could use a fold instead, but it would be much harder to read (for me anyway. :)
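For completeness, a sketch of that single-pass fold (same result, arguably harder to read):

data.map { case (s, tuples) =>
  val (sum1, sum2) = tuples.foldLeft((0.0, 0.0)) {      // accumulate both sums in one pass
    case ((acc1, acc2), (x, y)) => (acc1 + x, acc2 + y)
  }
  s -> sum1 * sum2
}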
Finally, note that .sum will produce zero on an empty collection. If that's not what you want, you could do this instead:
val emptyDefault = 1.0 // Or whatever, depends on your use case
for {
  (s, tuples) <- data
  sum1 = tuples.map(_._1).reduceLeftOption(_ + _).getOrElse(emptyDefault)
  sum2 = tuples.map(_._2).reduceLeftOption(_ + _).getOrElse(emptyDefault)
} yield s -> sum1 * sum2
You can use the Algebird numeric library:
val data = Array(("a", Array((0.5,1.0), (0.667,2.0))), ("b", Array((0.6,2.0), (0.6,2.0))))

import com.twitter.algebird.Operators._

def sumAndProduct(a: Array[(Double, Double)]) = {
  val sums = a.reduceLeft((m, n) => m + n)
  sums._1 * sums._2
}

data.map { case (x, y) => (x, sumAndProduct(y)) }
// Array((a,3.5010000000000003), (b,4.8))
It works fine for variable-size arrays as well:
val data = Array(("a",Array((0.5,1.0))), ("b",Array((0.6,2.0), (0.6,2.0))))
// Array((a,0.5), (b,4.8))
Like this? Does your array always have only 2 pairs?
val m = data.map { case (label, Array(a, b)) => (label, (a._1 + b._1) * (a._2 + b._2)) }
m.toMap
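// Map(a -> 3.5010000000000003, b -> 4.8)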