Calculating kurtosis of an Array[Double] field in Spark Scala - scala

How do I calculate the kurtosis of an array field in Spark? The Spark built-in function fails on the array field with:
due to data type mismatch: argument 1 requires double type, however, 'SERIES' is of array<double> type.;;
Example in Python:
from scipy.stats import kurtosis
kurtosis([1, 2, 3, 4, 5])
-1.3
I used the Spark built-in function:
df.withColumn("newcolumn", when(col("SERIES").isNotNull, kurtosis(columnName)))

Using the Twitter Algebird package I can get the kurtosis value:
import com.twitter.algebird._
val y = List(1, 2, 3, 4, 5)
def getMoments(xs: List[Int]): Moments =
  xs.foldLeft(MomentsGroup.zero) { (m, x) =>
    MomentsGroup.plus(m, Moments(x))
  }
println(getMoments(y).kurtosis) // -1.3
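The built-in kurtosis is an aggregate over a plain Double column, so it cannot be applied directly to an array<double> field. One way around this is to wrap the Algebird moments computation in a UDF and apply it per row; the sketch below is illustrative only (the column name SERIES comes from the question, arrayKurtosis is a hypothetical name, and the Algebird jar is assumed to be available on the executors):
import com.twitter.algebird._
import org.apache.spark.sql.functions.{col, udf}

// Per-row excess kurtosis of an array<double> column via Algebird Moments
val arrayKurtosis = udf { xs: Seq[Double] =>
  if (xs == null || xs.size < 2) None   // not enough data points
  else Some(xs.map(Moments(_)).reduce(MomentsGroup.plus).kurtosis)
}

val withKurtosis = df.withColumn("newcolumn", arrayKurtosis(col("SERIES")))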

Related

Scala: How to find the minimum of more than 2 elements?

Since the Math.min() function only allows for the use of 2 elements, I was wondering if there is maybe another function which can calculate the minimum of more than 2 elements.
Thanks in advance!
If you have multiple elements you can just chain calls to the min method:
Math.min(x, Math.min(y, z))
Since Scala adds a min method to numbers via implicits, you could write the following, which looks much fancier:
x min y min z
If you have a list of values and want to find their minimum:
val someNumbers: List[Int] = ???
val minimum = someNumbers.min
Note that this throws an exception if the list is empty. From Scala 2.13.x onwards, there will be a minOption method to handle such cases gracefully. For older versions you could use the reduceOption method as a workaround:
someNumbers.reduceOption(_ min _)
someNumbers.reduceOption(Math.min)
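For example, the Option-based variants behave like this (a quick illustration; minOption assumes Scala 2.13+):
List.empty[Int].minOption              // None
List(2, 3, 7, 1).minOption             // Some(1)
List.empty[Int].reduceOption(_ min _)  // None, and works on older Scala versions too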
Put all the numbers into a collection, such as a List, and take its minimum:
scala> val list = List(2,3,7,1,9,4,5)
list: List[Int] = List(2, 3, 7, 1, 9, 4, 5)
scala> list.min
res0: Int = 1

dot product of dense vectors with nulls

I am using Spark with Scala and trying to do the following.
I have two dense vectors(created using Vectors.dense), and I need to find the dot product of these. How could I accomplish this?
Also, I am creating the vectors based on an input file which is comma separated. However, some values are missing. Is there an easy way to read these values as zero instead of null when I am creating the vectors?
For example:
input file: 3,1,,,2
created vector: 3,1,0,0,2
Spark vectors are just wrappers around arrays; internally they get converted to Breeze vectors for vector/matrix operations. You can do just that manually to get the dot product:
import org.apache.spark.mllib.linalg.{Vector, Vectors, DenseVector}
import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
val dv1: Vector = Vectors.dense(1.0, 0.0, 3.0)
val bdv1 = new BDV(dv1.toArray)
val dv2: Vector = Vectors.dense(2.0, 0.0, 0.0)
val bdv2 = new BDV(dv2.toArray)
scala> bdv1 dot bdv2
res3: Double = 2.0
For your second question, you can do something like this:
val v: String = "3,1,,,2"
scala> v.split("\\,").map(r => if (r == "") 0 else r.toInt)
res4: Array[Int] = Array(3, 1, 0, 0, 2)
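Putting the two parts together, a rough sketch (parseLine is a hypothetical helper, not part of the Spark API; split(",", -1) keeps trailing empty fields):
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import breeze.linalg.{DenseVector => BDV}

// Parse a comma-separated line, treating blank entries as 0.0
def parseLine(line: String): Vector =
  Vectors.dense(line.split(",", -1).map(s => if (s.trim.isEmpty) 0.0 else s.trim.toDouble))

val a = parseLine("3,1,,,2")   // [3.0,1.0,0.0,0.0,2.0]
val b = parseLine("1,2,3,4,5")
val bva = new BDV(a.toArray)
val bvb = new BDV(b.toArray)
bva dot bvb                    // 3*1 + 1*2 + 0*3 + 0*4 + 2*5 = 15.0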

How to create a TypedColumn in a Spark Dataset and manipulate it?

I'm trying to perform an aggregation using mapGroups that returns a SparseMatrix as one of the columns, and sum the columns.
I created a case class schema for the mapped rows in order to provide column names. The matrix column is typed org.apache.spark.mllib.linalg.Matrix. If I don't run toDF before performing the aggregation (select(sum("mycolumn"))) I get one type mismatch error (required: org.apache.spark.sql.TypedColumn[MySchema,?]). If I include toDF I get another type mismatch error: cannot resolve 'sum(mycolumn)' due to data type mismatch: function sum requires numeric types, not org.apache.spark.mllib.linalg.MatrixUDT. So what's the right way to do it?
It looks like you're struggling with at least two distinct problems here. Let's assume you have a Dataset like this:
import org.apache.spark.mllib.linalg.Matrices

val ds = Seq(
  ("foo", Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))),
  ("foo", Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0)))
).toDS
Selecting a TypedColumn:
using implicit conversions with $:
ds.select($"_1".as[String])
using o.a.s.sql.functions.col:
ds.select(col("_1").as[String])
Adding matrices:
MLlib Matrix and MatrixUDT don't implement addition. It means you won't be able to use the sum function or reduce with +.
You can use a third-party linear algebra library, but this is not supported in Spark SQL / Spark Datasets.
If you really want to do it with Datasets you can try something like this:
ds.groupByKey(_._1).mapGroups(
  (key, values) => {
    val matrices = values.map(_._2.toArray)
    val first = matrices.next
    val sum = matrices.foldLeft(first)(
      (acc, m) => acc.zip(m).map { case (x, y) => x + y }
    )
    (key, sum)
  })
and map back to matrices, but personally I would just convert to an RDD and use Breeze.
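A rough sketch of that RDD + Breeze route, assuming all matrices within a group share the same dimensions (the names here are illustrative):
import breeze.linalg.{DenseMatrix => BDM}

// Convert each MLlib Matrix to a Breeze DenseMatrix (both store data column-major),
// then sum per key using Breeze's matrix addition.
val summed = ds.rdd
  .map { case (k, m) => (k, new BDM(m.numRows, m.numCols, m.toArray)) }
  .reduceByKey(_ + _)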

Spark - correlation matrix from file of ratings

I'm pretty new to Scala and Spark and I'm not able to create a correlation matrix from a file of ratings. It's similar to this question but I have sparse data in the matrix form. My data looks like this:
<user-id>, <rating-for-movie-1-or-null>, ... <rating-for-movie-n-or-null>
123, , , 3, , 4.5
456, 1, 2, 3, , 4
...
The code that is most promising so far looks like this:
val corTest = sc.textFile("data/collab_filter_data.txt").map(_.split(","))
Statistics.corr(corTest, "pearson")
(I know the user_ids in there are a defect, but I'm willing to live with that for the moment)
I'm expecting output like:
1, .123, .345
.123, 1, .454
.345, .454, 1
It's a matrix showing how each user is correlated to every other user. Graphically, it would be a correlogram.
It's a total noob problem but I've been fighting with it for a few hours and can't seem to Google my way out of it.
I believe this code should accomplish what you want:
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.linalg._
...
val corTest = input.map { case (line: String) =>
  val split = line.split(",").drop(1)
  split.map(elem => if (elem.trim.isEmpty) 0.0 else elem.toDouble)
}.map(arr => Vectors.dense(arr))
val corrMatrix = Statistics.corr(corTest)
Here, we are splitting each input line into a String array, dropping the user-id element, converting blank entries to zero, and finally creating a dense vector from the resulting array. Also, note that Pearson's method is used by default if no method is supplied.
When run in the shell with some examples, I see the following:
scala> val input = sc.parallelize(Array("123, , , 3, , 4.5", "456, 1, 2, 3, , 4", "789, 4, 2.5, , 0.5, 4", "000, 5, 3.5, , 4.5, "))
input: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[18] at parallelize at <console>:16
scala> val corTest = ...
corTest: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[20] at map at <console>:18
scala> val corrMatrix = Statistics.corr(corTest)
...
corrMatrix: org.apache.spark.mllib.linalg.Matrix =
1.0 0.9037378388935388 -0.9701425001453317 ... (5 total)
0.9037378388935388 1.0 -0.7844645405527361 ...
-0.9701425001453317 -0.7844645405527361 1.0 ...
0.7709910794438823 0.7273340668525836 -0.6622661785325219 ...
-0.7513578452729373 -0.7560667258329613 0.6195855517393626 ...

Spark: Summary statistics

I am trying to use Spark summary statistics as described at: https://spark.apache.org/docs/1.1.0/mllib-statistics.html
According to the Spark docs:
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.mllib.linalg.DenseVector
val observations: RDD[Vector] = ... // an RDD of Vectors
// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
I have a problem building the observations: RDD[Vector] object. I try:
scala> val data:Array[Double] = Array(1, 2, 3, 4, 5)
data: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0)
scala> val v = new DenseVector(data)
v: org.apache.spark.mllib.linalg.DenseVector = [1.0,2.0,3.0,4.0,5.0]
scala> val observations = sc.parallelize(Array(v))
observations: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.DenseVector] = ParallelCollectionRDD[3] at parallelize at <console>:19
scala> val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
<console>:21: error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.DenseVector]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
Note: org.apache.spark.mllib.linalg.DenseVector <: org.apache.spark.mllib.linalg.Vector, but class RDD is invariant in type T.
You may wish to define T as +T instead. (SLS 4.5)
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
Questions:
1) How should I cast DenseVector to Vector?
2) In a real program, instead of an array of doubles, I have to get statistics on a collection that I get from an RDD using:
def countByKey(): Map[K, Long]
//Count the number of elements for each key, and return the result to the master as a Map.
So I have to do:
myRdd.countByKey().values.map(_.toDouble)
This does not make much sense, because instead of working with RDDs I now have to work with regular Scala collections, which at some point stop fitting into memory. All the advantages of Spark's distributed computation are lost.
How can I solve this in a scalable manner?
Update
In my case I have:
val cnts: org.apache.spark.rdd.RDD[Int] = prodCntByCity.map(_._2) // get product counts only
val doubleCnts: org.apache.spark.rdd.RDD[Double] = cnts.map(_.toDouble)
How do I convert doubleCnts into observations: RDD[Vector]?
1) You don't need to cast, you just need a type ascription:
val observations = sc.parallelize(Array(v: Vector))
2) Use aggregateByKey (map all the keys to 1, and reduce by summing) rather than countByKey, as sketched below.
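A hedged sketch of that approach (myRdd is the keyed RDD from the question; the other names are illustrative), which keeps the counts distributed and feeds them straight into colStats:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

// Count elements per key without collecting anything to the driver
val cntsByKey = myRdd.aggregateByKey(0L)((count, _) => count + 1L, _ + _)

// Wrap each count in a one-element Vector so colStats can consume it
val observations = cntsByKey.map { case (_, cnt) => Vectors.dense(cnt.toDouble) }

val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)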
DenseVector has a compressed function, so you can change the RDD[DenseVector] to RDD[Vector] like this:
val st = observations.map(x => x.compressed)