How to calculate standard deviation and average values of RDD[Long]? - scala

I have RDD[Long] called mod and I want to compute standard deviation and mean values for this RDD using Spark 2.2 and Scala 2.11.8.
How can I do it?
I tried to calculate the average value as follows, but is there any easier way to get these values?
val avg_val = mod.toDF("col").agg(
  avg($"col").as("avg")
).first().getDouble(0)

val stddev_val = mod.toDF("col").agg(
  stddev($"col").as("stddev")
).first().getDouble(0)

I have RDD[Long] called mod and I want to compute standard deviation and mean
Just use stats:
scala> val mod = sc.parallelize(Seq(1L, 3L, 5L))
mod: org.apache.spark.rdd.RDD[Long] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val stats = mod.stats
stats: org.apache.spark.util.StatCounter = (count: 3, mean: 3.000000, stdev: 1.632993, max: 5.000000, min: 1.000000)
scala> stats.mean
res0: Double = 3.0
scala> stats.stdev
res1: Double = 1.632993161855452
It uses the same internals as stdev and mean, but it only has to scan the data once.
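For reference, the same StatCounter also exposes the variance and the sample-based (Bessel-corrected) statistics, still without any extra pass over the data; the values in the comments assume the mod defined above:
stats.variance       // population variance: 8.0 / 3 ≈ 2.67
stats.sampleVariance // sample variance: 4.0
stats.sampleStdev    // sample standard deviation: 2.0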
With Dataset I'd recommend:
val (avg_val, stddev_val) = mod.toDS
  .agg(mean("value"), stddev("value"))
  .as[(Double, Double)].first
or
import org.apache.spark.sql.Row
val Row(avg_val: Double, stddev_val: Double) = mod.toDS
  .agg(mean("value"), stddev("value"))
  .first
but it is neither necessary nor useful here.

I think this is pretty simple:
mod.stdev()
mod.mean()
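A caveat worth noting (based on how DoubleRDDFunctions is implemented, not something stated in the answer): stdev() and mean() each build their own StatCounter internally, so calling both scans the data twice:
// convenient, but two separate passes over mod
val mu = mod.mean()   // first pass
val sd = mod.stdev()  // second pass
If you need both values, mod.stats from the previous answer computes them in a single pass.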

Related

The proper way to compute correlation between two Seq columns into a third column

I have a DataFrame where each row has 3 columns:
ID:Long, ratings1:Seq[Double], ratings2:Seq[Double]
For each row I need to compute the correlation between those Vectors.
I came up with the following solution, which seems inefficient (and, as Jarrod Roberson has mentioned, does not actually work), since I have to create RDDs for each Seq:
val similarities = ratingPairs.map(row => {
  val ratings1 = sc.parallelize(row.getAs[Seq[Double]]("ratings1"))
  val ratings2 = sc.parallelize(row.getAs[Seq[Double]]("ratings2"))
  val corr: Double = Statistics.corr(ratings1, ratings2)
  Similarity(row.getAs[Long]("ID"), corr)
})
Is there a way to compute such correlations properly?
Let's assume you have a correlation function for arrays:
def correlation(arr1: Array[Double], arr2: Array[Double]): Double
(for potential implementations of that function, which is completely independent of Spark, you can ask a separate question or search online; there are some close-enough resources, e.g. this implementation).
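For illustration only, a minimal Pearson correlation over two equal-length arrays could look like the sketch below (plain Scala, no Spark; this is not the linked implementation):
def correlation(arr1: Array[Double], arr2: Array[Double]): Double = {
  require(arr1.length == arr2.length && arr1.nonEmpty, "arrays must be non-empty and equal length")
  val n = arr1.length
  val mean1 = arr1.sum / n
  val mean2 = arr2.sum / n
  // covariance numerator and the two variance terms
  val cov  = arr1.zip(arr2).map { case (a, b) => (a - mean1) * (b - mean2) }.sum
  val var1 = arr1.map(a => (a - mean1) * (a - mean1)).sum
  val var2 = arr2.map(b => (b - mean2) * (b - mean2)).sum
  cov / math.sqrt(var1 * var2)
}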
Now, all that's left to do is to wrap this function with a UDF and use it:
import org.apache.spark.sql.functions._
import spark.implicits._
val corrUdf = udf {
  (arr1: Seq[Double], arr2: Seq[Double]) => correlation(arr1.toArray, arr2.toArray)
}
val result = df.select($"ID", corrUdf($"ratings1", $"ratings2") as "correlation")

How to subtract two doubles?

I want to get the IQR (interquartile range) in Spark. How can I subtract two values (double or int)?
I tried the code below but only got an error:
scala> Q1
res103: org.apache.spark.sql.Row = [11.09314]
scala> Q3
res104: org.apache.spark.sql.Row = [34.370419]
scala> val IRQ = Math.abs(Q3-Q1)
<console>:43: error: value - is not a member of org.apache.spark.sql.Row
val IRQ = Math.abs(Q3-Q1)
This should do the trick:
val IQR = Math.abs(Q3.getDouble(0) - Q1.getDouble(0))
See the docs for Row interface here.
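As a side note, if Q1 and Q3 come from a DataFrame in Spark 2.0+, DataFrameStatFunctions.approxQuantile returns plain Doubles and avoids the Row extraction altogether (a sketch; df and "value" are placeholders for your own DataFrame and column):
// quartiles as Doubles; the last argument is the relative error (0.0 = exact)
val Array(q1, q3) = df.stat.approxQuantile("value", Array(0.25, 0.75), 0.0)
val iqr = q3 - q1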

Spark RDD map internal object to Row

My initial data from a CSV file is:
1 ,21658392713 ,21626890421
1 ,21623461747 ,21626890421
1 ,21623461747 ,21626890421
The data I have after a few transformations and grouping based on business logic is:
scala> val sGrouped = grouped
sGrouped: org.apache.spark.rdd.RDD[(String, Iterable[(String,
(Array[String], String))])] = ShuffledRDD[85] at groupBy at <console>:51
scala> sGrouped.foreach(f=>println(f))
(21626890421,CompactBuffer((21626890421,
([Ljava.lang.String;@62ac8444,21626890421)),
(21626890421,([Ljava.lang.String;@59d80fe,21626890421)),
(21626890421,([Ljava.lang.String;@270042e8,21626890421)),
From this I want to get a map that yields something like the following format
[String, Row[String]]
so the data may look like:
[ 21626890421 , Row[(1 ,21658392713 ,21626890421)
, (1 ,21623461747 ,21626890421)
, (1 ,21623461747,21626890421)]]
I really appreciate any guidance on moving forward on this.
I found the answer, but I am not sure if this is an efficient way; any better approaches are appreciated, as this feels more like a hack.
scala> import org.apache.spark.sql.Row
scala> val grouped = cToP.groupBy(_._1)
grouped: org.apache.spark.rdd.RDD[(String, Iterable[(String,
(Array[String], String))])]
scala> val sGrouped = grouped.map(f => f._2.toList)
sGrouped: org.apache.spark.rdd.RDD[List[(String, (Array[String],
String))]]
scala> val tGrouped = sGrouped.map(f => f.map(_._2).map(c =>
         Row(c._1(0), c._1(12), c._1(18))))
tGrouped: org.apache.spark.rdd.RDD[List[org.apache.spark.sql.Row]] =
MapPartitionsRDD[42] a
scala> tGrouped.foreach(f => println(f))
yields
List([1,21658392713,21626890421], [1,21623461747,21626890421],
[1,21623461747,21626890421])
scala> tGrouped.count()
res6: Long = 1
The answer I am getting is correct, and even the count is correct. However, I do not understand why the count is 1.
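For what it's worth, the count is 1 because tGrouped has one element per group, and the sample data contains a single key (21626890421). A sketch that keeps the key next to its rows instead of discarding it (assuming the same tuple structure and column indices as the code above) could be:
import org.apache.spark.sql.Row
// RDD[(String, List[Row])]: one entry per key, with that key's rows collected into a list
val keyedRows = grouped.mapValues(_.toList.map {
  case (_, (arr, _)) => Row(arr(0), arr(12), arr(18))
})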

Spark: Summary statistics

I am trying to use Spark summary statistics as described at: https://spark.apache.org/docs/1.1.0/mllib-statistics.html
According to Spark docs :
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.mllib.linalg.DenseVector
val observations: RDD[Vector] = ... // an RDD of Vectors
// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
I have a problem building the observations: RDD[Vector] object. I tried:
scala> val data:Array[Double] = Array(1, 2, 3, 4, 5)
data: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0)
scala> val v = new DenseVector(data)
v: org.apache.spark.mllib.linalg.DenseVector = [1.0,2.0,3.0,4.0,5.0]
scala> val observations = sc.parallelize(Array(v))
observations: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.DenseVector] = ParallelCollectionRDD[3] at parallelize at <console>:19
scala> val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
<console>:21: error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.DenseVector]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
Note: org.apache.spark.mllib.linalg.DenseVector <: org.apache.spark.mllib.linalg.Vector, but class RDD is invariant in type T.
You may wish to define T as +T instead. (SLS 4.5)
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
Questions:
1) How should I cast DenseVector to Vector?
2) In the real program, instead of an array of doubles, I have to get statistics on a collection that I get from an RDD using:
def countByKey(): Map[K, Long]
//Count the number of elements for each key, and return the result to the master as a Map.
So I have to do:
myRdd.countByKey().values.map(_.toDouble)
This does not make much sense, because instead of working with RDDs I now have to work with regular Scala collections, which at some point stop fitting into memory. All the advantages of Spark's distributed computation are lost.
How can I solve this in a scalable manner?
Update
In my case I have:
val cnts: org.apache.spark.rdd.RDD[Int] = prodCntByCity.map(_._2) // get product counts only
val doubleCnts: org.apache.spark.rdd.RDD[Double] = cnts.map(_.toDouble)
How to convert doubleCnts into observations: RDD[Vector] ?
1) You don't need to cast, you just need to type:
val observations = sc.parallelize(Array(v: Vector))
2) Use aggregateByKey (map all the keys to 1, and reduce by summing) rather than countByKey, as sketched below.
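A sketch of that idea, staying entirely in RDD land and feeding the per-key counts straight into colStats (myRdd and its key type are placeholders for your own data):
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
// count per key without collecting anything to the driver
val counts = myRdd.map { case (k, _) => (k, 1L) }.reduceByKey(_ + _)
// one single-element Vector per key, so colStats summarizes the counts
val observations = counts.map { case (_, c) => Vectors.dense(c.toDouble) }
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)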
DenseVector has a compressed method, so you can convert the RDD[DenseVector] to RDD[Vector] like this:
val st = observations.map(x => x.compressed)

How to use priority queues in Scala?

I am trying to implement A* search in Scala (version 2.10), but I've run into a brick wall: I can't figure out how to use Scala's PriorityQueue.
I have a set of squares, represented by (Int, Int)s, and I need to insert them with priorities represented by Ints. In Python you just keep a list of (priority, value) pairs and use the heapq functions to maintain it as a heap.
So how do you do this?
There is actually a pre-defined lexicographic ordering for tuples, but you need to import it:
import scala.math.Ordering.Implicits._
Moreover, you can define your own ordering.
Suppose I want to order tuples based on the difference between the first and second members of the tuple:
scala> import scala.collection.mutable.PriorityQueue
// import scala.collection.mutable.PriorityQueue
scala> def diff(t2: (Int,Int)) = math.abs(t2._1 - t2._2)
// diff: (t2: (Int, Int))Int
scala> val x = new PriorityQueue[(Int, Int)]()(Ordering.by(diff))
// x: scala.collection.mutable.PriorityQueue[(Int, Int)] = PriorityQueue()
scala> x.enqueue(1 -> 1)
scala> x.enqueue(1 -> 2)
scala> x.enqueue(1 -> 3)
scala> x.enqueue(1 -> 4)
scala> x.enqueue(1 -> 0)
scala> x
// res5: scala.collection.mutable.PriorityQueue[(Int, Int)] = PriorityQueue((1,4), (1,3), (1,2), (1,1), (1,0))
Indeed, there is no implicit ordering on pairs of integers (a, b). What would it be? Perhaps they are both positive and you can use (a - 1.0/b)? Or they are not, and you can use, what, (a + atan(b/pi))? If you have an ordering in mind, you can consider wrapping your pairs in a type that has your ordering.
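To tie this back to the A* use case: one option (a sketch, assuming the Int priority is a cost you want to minimize) is to queue (priority, square) pairs and reverse the ordering, because Scala's PriorityQueue always dequeues the largest element first:
import scala.collection.mutable.PriorityQueue
// order by the priority component only, smallest first (A* expands the cheapest node)
val open = PriorityQueue.empty[(Int, (Int, Int))](Ordering.by[(Int, (Int, Int)), Int](_._1).reverse)
open.enqueue((7, (2, 3)), (2, (0, 1)), (5, (4, 4)))
open.dequeue() // (2,(0,1)), the square with the lowest cost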