I have the following recursive function that detects outliers using the interquartile range (IQR) method:
def interQuartileRangeFiltering(df: DataFrame): DataFrame = {
  def inner(cols: List[String], acc: DataFrame): DataFrame = cols match {
    case Nil => acc
    case column :: xs =>
      val quantiles = acc.stat.approxQuantile(column, Array(0.25, 0.75), 0.0) // TODO: values should come from config
      println(s"$column ${quantiles.size}")
      val q1 = quantiles(0)
      val q3 = quantiles(1)
      val iqr = q1 - q3
      val lowerRange = q1 - 1.5 * iqr
      val upperRange = q3 + 1.5 * iqr
      val filtered = acc.filter(s"$column < $lowerRange or $column > $upperRange")
      inner(xs, filtered)
  }
  inner(df.columns.toList, df)
}

val outlierDF = interQuartileRangeFiltering(incomingDF)
So basically I'm recursively iterating over the columns and eliminating the outliers. Strangely, it throws an ArrayIndexOutOfBoundsException and prints the following:
housing_median_age 2
inland 2
island 2
population 2
total_bedrooms 2
near_bay 2
near_ocean 2
median_house_value 0
java.lang.ArrayIndexOutOfBoundsException: 0
at inner$1(<console>:75)
at interQuartileRangeFiltering(<console>:83)
... 54 elided
What is wrong with my approach?
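Two details in the function above stand out: the spread is computed as q1 - q3 (the IQR is normally q3 - q1), and the filter keeps the rows that fall outside the range instead of the rows inside it. A plausible reading of the printed sizes is that rows disappear pass after pass until a column has no data left and the quantile array comes back empty. The plain-Scala sketch below shows the intended arithmetic on an in-memory column. The names IqrSketch, quantile, iqrBounds and dropOutliers are hypothetical, and the nearest-rank quantile is a simplifying assumption, not what approxQuantile does internally.

```scala
object IqrSketch {
  // Nearest-rank quantile on a sorted copy (an assumption for illustration only).
  def quantile(xs: Seq[Double], p: Double): Double = {
    require(xs.nonEmpty, "quantile of an empty column is undefined")
    val sorted = xs.sorted
    val rank = math.ceil(p * sorted.size).toInt - 1
    sorted(math.min(sorted.size - 1, math.max(0, rank)))
  }

  // Fences at 1.5 * IQR, with IQR = q3 - q1 so that it is never negative.
  def iqrBounds(xs: Seq[Double]): (Double, Double) = {
    val q1 = quantile(xs, 0.25)
    val q3 = quantile(xs, 0.75)
    val iqr = q3 - q1
    (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
  }

  // Keeps the values inside the fences, i.e. drops the outliers.
  def dropOutliers(xs: Seq[Double]): Seq[Double] = {
    val (lower, upper) = iqrBounds(xs)
    xs.filter(x => x >= lower && x <= upper)
  }
}
```

In a Spark version of the same idea, the filter expression would select the range between the bounds, and guarding against an empty quantile array before indexing it avoids the exception.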
Here is what I came up with, and it works like a charm:
def outlierEliminator(df: DataFrame, colsToIgnore: List[String])(fn: (String, DataFrame) => (Double, Double)): DataFrame = {
  import spark.implicits._ // needed for the $"..." column syntax and for .map on a DataFrame

  val ID_COL_NAME = "id"
  val dfWithId = DataFrameUtils.addColumnIndex(spark, df, ID_COL_NAME)
  val dfWithIgnoredCols = dfWithId.drop(colsToIgnore: _*)

  // Walks the columns and accumulates the ids of every row that is an outlier in some column.
  def inner(
    cols: List[String],
    filterIdSeq: List[Long],
    dfWithId: DataFrame
  ): List[Long] = cols match {
    case Nil => filterIdSeq
    case column :: xs =>
      if (column == ID_COL_NAME) {
        inner(xs, filterIdSeq, dfWithId)
      } else {
        val (lowerBound, upperBound) = fn(column, dfWithId)
        val filteredIds =
          dfWithId
            .filter(s"$column < $lowerBound or $column > $upperBound")
            .map(r => r.getLong(0)) // assumes the id is the first column
            .collect()
            .toList
        inner(xs, filteredIds ++ filterIdSeq, dfWithId)
      }
  }

  val filteredIds = inner(dfWithIgnoredCols.columns.toList, List.empty[Long], dfWithIgnoredCols)
  // Remove every row whose id was flagged as an outlier.
  dfWithId.except(dfWithId.filter($"$ID_COL_NAME".isin(filteredIds: _*)))
}
I'm trying to fit a curve with SimpleCurveFitter from commons.math3.fitting in Scala, but I get an exception:
org.apache.commons.math3.exception.ConvergenceException: Unable to perform
Qr decomposition on jacobian
However, I have checked my gradient calculations and I still don't see why the exception is raised.
Here is the code:
import breeze.linalg.{DenseVector, linspace}
import org.apache.commons.math3.analysis.ParametricUnivariateFunction
import org.apache.commons.math3.fitting.{SimpleCurveFitter, WeightedObservedPoint}
import scala.collection.JavaConverters
import scala.math.{exp, log, pow}

def main(args: Array[String]): Unit = {
  // Sample points: y = 1 for x < 1, exponential decay for x >= 1
  val xv: DenseVector[Double] = linspace(0, 3, 300)
  val yv: DenseVector[Double] = DenseVector.zeros[Double](300)
  for (i <- xv.findAll(x => x < 1.0)) yv.update(i, 1)
  for (i <- xv.findAll(x => x >= 1.0)) yv.update(i, exp(-(xv(i) - 1.0)/1))
  val wop: Array[WeightedObservedPoint] = new Array[WeightedObservedPoint](xv.length)
  for (i <- 0 until xv.length) wop.update(i, new WeightedObservedPoint(1, xv(i), yv(i)))
  // Model: f(x) = 1 / (1 + a * x^(2b))
  val f: ParametricUnivariateFunction = new ParametricUnivariateFunction {
    override def value(x: Double, parameters: Double*): Double = {
      val a = parameters(0)
      val b = parameters(1)
      1.0 / (1.0 + a * pow(x, 2 * b))
    }
    override def gradient(x: Double, parameters: Double*): Array[Double] = {
      val a = parameters(0)
      val b = parameters(1)
      val ga = - pow(x, 2 * b) / pow(1 + a * pow(x, 2 * b), 2)
      val gb = - (2 * a * pow(x, 2 * b) * log(x)) / pow(1 + a * pow(x, 2 * b), 2)
      Array(ga, gb)
    }
  }
  val wopc = JavaConverters.asJavaCollection(wop)
  val cf = SimpleCurveFitter.create(f, Array(1.0, 1.0))
  val param = cf.fit(wopc)
  println(param(0), param(1))
}
Thank you for your help :)
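One thing worth checking, sketched here under assumptions: linspace(0, 3, 300) starts at x = 0, where log(x) is negative infinity and pow(x, 2 * b) is 0, so the product inside gb evaluates to NaN in floating point. A NaN entry in the Jacobian could plausibly be what makes the QR decomposition fail. The object and function names below are hypothetical, and returning 0 for x <= 0 (the limit of x^(2b) * log(x) as x approaches 0 from above when b > 0) is one possible workaround, not a confirmed fix.

```scala
import scala.math.{log, pow}

object GradientSketch {
  // Partial derivative with respect to b, written as in the question.
  def gbNaive(x: Double, a: Double, b: Double): Double = {
    val xp = pow(x, 2 * b)
    -(2 * a * xp * log(x)) / pow(1 + a * xp, 2)
  }

  // Same derivative, guarded at x <= 0 where 0 * log(0) would produce NaN.
  def gbGuarded(x: Double, a: Double, b: Double): Double =
    if (x <= 0.0) 0.0 else gbNaive(x, a, b)
}
```

Starting the sample grid slightly above zero would be another way to sidestep the same issue.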
I would like to find the fastest way to compute the Euclidean distance in Scala. After some attempts, this is where I am:
def euclidean[V <: Seq[Double]](dot1: V, dot2: V): Double = {
  var d = 0D
  var i = 0
  while( i < dot1.size ) {
    val toPow2 = dot1(i) - dot2(i)
    d += toPow2 * toPow2
    i += 1
  }
  sqrt(d)
}
The fastest results are obtained with mutable.ArrayBuffer[Double] as V, for vector sizes from 2 up to 10000. Parallel collections (collection.parallel._) are not allowed.
For those who want to test Breeze: it is slower with the following distance function:
def euclideanDV(v1: DenseVector[Double], v2: DenseVector[Double]) = norm(v1 - v2)
If anyone knows of any pure Scala code or library that could help improve speed, it would be greatly appreciated.
This is how I tested the speed:
var te1 = 0L
var te2 = 0L
val runNumber = 100000
val warmUp = 60000
(0 until runNumber).foreach{ x =>
  val t1 = System.nanoTime
  euclidean1(v1, v2)
  val t2 = System.nanoTime
  euclidean2(v1, v2)
  val t3 = System.nanoTime
  // Only accumulate timings once the warm-up runs are done
  if( x >= warmUp ) {
    te1 += t2 - t1
    te2 += t3 - t2
  }
}
Here are some of my attempts:
// Fast on ArrayBuffer, quadratic on List
def euclidean1[V <: Seq[Double]](v1: V, v2: V) = {
  var d = 0D
  var i = 0
  while( i < v1.size ){
    val toPow2 = v1(i) - v2(i)
    d += toPow2 * toPow2
    i += 1
  }
  sqrt(d)
}

// Breeze test
def euclideanDV(v1: DenseVector[Double], v2: DenseVector[Double]) = norm(v1 - v2)

// Slower than euclidean1
def euclidean2[V <: Seq[Double]](v1: V, v2: V) = {
  var d = 0D
  var i = 0
  while( i < v1.size ) {
    d += pow(v1(i) - v2(i), 2)
    i += 1
  }
  sqrt(d)
}
// Slower than euclidean1 for vector sizes below ~1000, slightly faster above 1000 on ArrayBuffer
def euclidean3[V <: Seq[Double]](v1: V, v2: V) = {
  var d = 0D
  (0 until v1.size).foreach{ i =>
    val toPow2 = v1(i) - v2(i)
    d += toPow2 * toPow2
  }
  sqrt(d)
}

// Slower than euclidean1 for vector sizes below ~1000, slightly faster above 1000 on ArrayBuffer
def euclidean3bis(dot1: Seq[Double], dot2: Seq[Double]): Double = {
  var sum = 0D
  dot1.indices.foreach{ id =>
    val toPow2 = dot1(id) - dot2(id)
    sum += toPow2 * toPow2
  }
  sqrt(sum)
}

// Slower than euclidean1
def euclidean4[V <: Seq[Double]](v1: V, v2: V) = {
  var d = 0D
  var i = 0
  val vz = v1.zip(v2)
  while( i < vz.size ) {
    val (a, b) = vz(i)
    val toPow2 = a - b
    d += toPow2 * toPow2
    i += 1
  }
  sqrt(d)
}
// Slower than euclidean1
def euclideanL1(v1: List[Double], v2: List[Double]) = sqrt(v1.zip(v2).map{ case (a, b) =>
  val toPow2 = a - b
  toPow2 * toPow2
}.sum)

// Slower than euclidean1
def euclidean5(dot1: Seq[Double], dot2: Seq[Double]): Double = {
  var sum = 0D
  dot1.zipWithIndex.foreach{ case (a, id) =>
    val toPow2 = a - dot2(id)
    sum += toPow2 * toPow2
  }
  sqrt(sum)
}

// super super slow
def euclidean6(v1: Seq[Double], v2: Seq[Double]) = sqrt(v1.zip(v2).map{ case (a, b) => pow(a - b, 2) }.sum)

// Slower than euclidean1
def euclidean7(dot1: Seq[Double], dot2: Seq[Double]): Double = {
  var sum = 0D
  dot1.zip(dot2).foreach{ case (a, b) => sum += pow(a - b, 2) }
  sqrt(sum)
}
// Slower than euclidean1
def euclidean8(v1: Seq[Double], v2: Seq[Double]) = {
  def inc(n: Int, v: Double) = {
    val toPow2 = v1(n) - v2(n)
    v + toPow2 * toPow2
  }
  def go(n: Int, v: Double): Double = {
    if( n < v1.size - 1 ) go(n + 1, inc(n, v))
    else inc(n, v)
  }
  sqrt(go(0, 0D))
}

// Slower than euclidean1
def euclideanL2(v1: List[Double], v2: List[Double]) = {
  def inc(vzz: List[(Double, Double)], v: Double): Double = {
    val (a, b) = vzz.head
    val toPow2 = a - b
    v + toPow2 * toPow2
  }
  def go(vzz: List[(Double, Double)], v: Double): Double = {
    if( vzz.isEmpty ) v
    else go(vzz.tail, inc(vzz, v))
  }
  sqrt(go(v1.zip(v2), 0D))
}
I tried tail recursion on List, but it was not efficient enough on ArrayBuffer. I totally agree that proper tools like JMH are needed to measure speed properly, but when one version is 10-50% faster than another, we can be reasonably confident that it is better.
Even though the signature is V <: Seq[Double], it is NOT appropriate for List, only for array-like structures.
Here is my proposal:
def euclideanF[V <: Seq[Double]](v1: V, v2: V) = {
  def go(d: Double, i: Int): Double = {
    if( i < v1.size ) {
      val toPow2 = v1(i) - v2(i)
      go(d + toPow2 * toPow2, i + 1)
    }
    else d
  }
  sqrt(go(0D, 0))
}
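As a point of comparison, here is a minimal sketch that takes Array[Double] directly. On the JVM an Array[Double] is a primitive double[], whereas generic collections such as ArrayBuffer[Double] hold boxed values, so an indexed loop over arrays may avoid boxing and the generic apply calls. Whether it is actually faster for a given vector size should be measured, ideally with JMH. The names EuclideanSketch and euclideanArr are hypothetical.

```scala
import scala.math.sqrt

object EuclideanSketch {
  // While loop over primitive arrays; both arrays must have the same length.
  def euclideanArr(v1: Array[Double], v2: Array[Double]): Double = {
    require(v1.length == v2.length, "vectors must have the same length")
    var d = 0D
    var i = 0
    while (i < v1.length) {
      val diff = v1(i) - v2(i)
      d += diff * diff
      i += 1
    }
    sqrt(d)
  }
}
```

If the data already lives in an ArrayBuffer, converting it with toArray on every call would add a copy, so this variant mainly makes sense when the vectors are stored as arrays from the start.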
How can I split the string T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0 into two columns using a Hive function?
For example:
T 32
P 1
A 420
H 60
R 0.30841494477846165
S 0
You can do this with a regex implementation:
import java.util.regex.Pattern

def main(args: Array[String]): Unit = {
  val s = "T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0"
  // One uppercase letter, an underscore, then an integer or decimal number
  val pattern = "[A-Z]_\\d+\\.?\\d*"
  val buff = new StringBuilder
  val m = Pattern.compile(pattern).matcher(s)
  while (m.find()) {
    buff.append(m.group(0)).append("\n")
  }
  // Replace the separator so that each line reads "key value"
  println("output:\n" + buff.toString.replaceAll("_", " "))
}
T 32
P 1
A 420
H 60
R 0.30841494477846165
S 0
If you need to collect the data for further processing, and you're guaranteed it's always paired correctly, you could do something like this.
scala> val str = "T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0"
str: String = T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0
scala> val data = str.split("_").sliding(2,2)
data: Iterator[Array[String]] = non-empty iterator
scala> data.toList // just to see it
res29: List[Array[String]] = List(Array(T, 32), Array(P, 1), Array(A, 420), Array(H, 60), Array(R, 0.30841494477846165), Array(S, 0))
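For further processing it can be handier to turn each two-element array into a tuple. The sketch below assumes the same underscore-separated layout; the names PairSketch and toPairs are hypothetical, and grouped(2) is used as an equivalent of sliding(2,2). The collect step silently drops a trailing token that has no partner.

```scala
object PairSketch {
  // Splits on '_' and pairs consecutive tokens as (key, value).
  def toPairs(s: String): List[(String, String)] =
    s.split("_").grouped(2).collect { case Array(k, v) => (k, v) }.toList
}
```

Calling toMap on the result gives lookup by key, for example PairSketch.toPairs(str).toMap("A").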
You can split your string into an array, apply zipWithIndex, and filter on the index to get two arrays, col1 and col2, which you can then use for printing:
val str = "T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0"
val tmp = str.split('_').zipWithIndex
val col1 = tmp.filter( p => p._2 % 2 == 0 ).map( p => p._1)
val col2 = tmp.filter( p => p._2 % 2 != 0 ).map( p => p._1)
//col1: Array[String] = Array(T, P, A, H, R, S)
//col2: Array[String] = Array(32, 1, 420, 60, ...
I have an array of words, each paired with the aggregated length of all its occurrences.
For example, if my sentence is
"I have a cat. The cat looks very cute"
My array looks like
Array((I,1), (have,4), (a,1), (cat,6), (The, 3), (looks, 5), (very ,4), (cute,4))
Now I want to compute the average length of each word, i.e. the aggregated length divided by the number of occurrences.
I tried to do the coding using Scala as follows:
val avglen = arr.reduceByKey( (x,y) => (x, y.toDouble / x.size.toDouble) )
I'm getting the following error at x.size:
error: value size is not a member of Int
Please help me understand where I'm going wrong.
After your comment I think I got it:
val words = sc.parallelize(Array(("i", 1), ("have", 4),
("a", 1), ("cat", 6),
("the", 3), ("looks", 5),
("very", 4), ("cute", 4)))
val avgs = words.map { case (word, count) => (word, count / word.length.toDouble) }
println("My averages are: ")
Suppose you have a paragraph with those words and you want to calculate the mean length of the words in the paragraph.
In two steps, with a map-reduce approach, in Spark 1.5.1:
val words = sc.parallelize(Array(("i", 1), ("have", 4),
("a", 1), ("cat", 6),
("the", 3), ("looks", 5),
("very", 4), ("cute", 4)))
val wordCount = words.map { case (word, count) => count}.reduce((a, b) => a + b)
val wordLength = words.map { case (word, count) => word.length * count}.reduce((a, b) => a + b)
println("The avg length is: " + wordLength / wordCount.toDouble)
I ran this code in an .ipynb notebook connected to a spark-kernel.
If I understand the problem correctly:
val rdd: RDD[(String, Int)] = ???
val ave: RDD[(String, Double)] =
  rdd.map { case (name, numOccurance) =>
    (name, name.length.toDouble / numOccurance)
  }
This is a slightly confusing question. If your data is already in an Array[(String, Int)] collection (presumably after a collect() to the driver), then you need not use any RDD transformations. In fact, there's a nifty trick you can run with fold*() to grab the average over a collection:
val average = arr.foldLeft(0.0) { case (sum: Double, (_, count: Int)) => sum + count } / arr.foldLeft(0.0) { case (sum: Double, (word: String, count: Int)) => sum + count / word.length }
Kind of long winded, but it essentially aggregates the total number of characters in the numerator and the number of words in the denominator. Run on your example, I see the following:
scala> val arr = Array(("I",1), ("have",4), ("a",1), ("cat",6), ("The", 3), ("looks", 5), ("very" ,4), ("cute",4))
arr: Array[(String, Int)] = Array((I,1), (have,4), (a,1), (cat,6), (The,3), (looks,5), (very,4), (cute,4))
scala> val average = ...
average: Double = 3.111111111111111
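The same computation can be sketched as a single pass that carries both totals. Like the answer above, it assumes the Int in each tuple is the total number of characters over all occurrences of the word, so the number of occurrences is that total divided by word.length. The names AverageSketch and averageLength are hypothetical, and toDouble is used so that the division does not truncate when the total is not an exact multiple of the word length.

```scala
object AverageSketch {
  // Total characters divided by total occurrences, accumulated in one fold.
  def averageLength(arr: Seq[(String, Int)]): Double = {
    val (chars, occurrences) = arr.foldLeft((0.0, 0.0)) {
      case ((c, n), (word, total)) => (c + total, n + total.toDouble / word.length)
    }
    chars / occurrences
  }
}
```

On the example array this divides 28 characters by 9 occurrences, which agrees with the value shown above.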
If you have your (String, Int) tuples distributed across an RDD[(String, Int)], you can use accumulators to solve this problem quite easily:
val chars = sc.accumulator(0.0)
val words = sc.accumulator(0.0)
wordsRDD.foreach { case (word: String, count: Int) =>
  chars += count; words += count / word.length
}
val average = chars.value / words.value
When running on the above example (placed in an RDD), I see the following:
scala> val arr = Array(("I",1), ("have",4), ("a",1), ("cat",6), ("The", 3), ("looks", 5), ("very" ,4), ("cute",4))
arr: Array[(String, Int)] = Array((I,1), (have,4), (a,1), (cat,6), (The,3), (looks,5), (very,4), (cute,4))
scala> val wordsRDD = sc.parallelize(arr)
wordsRDD: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:14
scala> val chars = sc.accumulator(0.0)
chars: org.apache.spark.Accumulator[Double] = 0.0
scala> val words = sc.accumulator(0.0)
words: org.apache.spark.Accumulator[Double] = 0.0
scala> wordsRDD.foreach { case (word: String, count: Int) =>
| chars += count; words += count / word.length
| }
scala> val average = chars.value / words.value
average: Double = 3.111111111111111