Outlier Elimination in Spark With InterQuartileRange Results in Error - scala

I have the following recursive function that detects outliers using the interquartile range (IQR) method:
def interQuartileRangeFiltering(df: DataFrame): DataFrame = {
  @scala.annotation.tailrec
  def inner(cols: List[String], acc: DataFrame): DataFrame = cols match {
    case Nil => acc
    case column :: xs =>
      val quantiles = acc.stat.approxQuantile(column, Array(0.25, 0.75), 0.0) // TODO: values should come from config
      println(s"$column ${quantiles.size}")
      val q1 = quantiles(0)
      val q3 = quantiles(1)
      val iqr = q1 - q3
      val lowerRange = q1 - 1.5 * iqr
      val upperRange = q3 + 1.5 * iqr
      val filtered = acc.filter(s"$column < $lowerRange or $column > $upperRange")
      inner(xs, filtered)
  }
  inner(df.columns.toList, df)
}
val outlierDF = interQuartileRangeFiltering(incomingDF)
So basically what I'm doing is recursively iterating over the columns and eliminating the outliers. Strangely, it results in an ArrayIndexOutOfBoundsException and prints the following:
housing_median_age 2
inland 2
island 2
population 2
total_bedrooms 2
near_bay 2
near_ocean 2
median_house_value 0
java.lang.ArrayIndexOutOfBoundsException: 0
at inner$1(<console>:75)
at interQuartileRangeFiltering(<console>:83)
... 54 elided
What is wrong with my approach?

Here is what I came up with, and it works like a charm:
def outlierEliminator(df: DataFrame, colsToIgnore: List[String])(fn: (String, DataFrame) => (Double, Double)): DataFrame = {
  val ID_COL_NAME = "id"
  val dfWithId = DataFrameUtils.addColumnIndex(spark, df, ID_COL_NAME)
  val dfWithIgnoredCols = dfWithId.drop(colsToIgnore: _*)

  @scala.annotation.tailrec
  def inner(
      cols: List[String],
      filterIdSeq: List[Long],
      dfWithId: DataFrame
  ): List[Long] = cols match {
    case Nil => filterIdSeq
    case column :: xs =>
      if (column == ID_COL_NAME) {
        inner(xs, filterIdSeq, dfWithId)
      } else {
        val (lowerBound, upperBound) = fn(column, dfWithId)
        val filteredIds =
          dfWithId
            .filter(s"$column < $lowerBound or $column > $upperBound")
            .select(col(ID_COL_NAME))
            .map(r => r.getLong(0))
            .collect
            .toList
        inner(xs, filteredIds ++ filterIdSeq, dfWithId)
      }
  }

  val filteredIds = inner(dfWithIgnoredCols.columns.toList, List.empty[Long], dfWithIgnoredCols)
  dfWithId.except(dfWithId.filter($"$ID_COL_NAME".isin(filteredIds: _*)))
}
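For reference, here is a minimal sketch of an IQR-based bounds function that could be plugged in as fn (hedged: the quantile probabilities, the zero relative error, and the 1.5 multiplier are the illustrative values from the question, and incomingDF plus the ignored column name are placeholders):

import org.apache.spark.sql.DataFrame

// Hypothetical bounds function for outlierEliminator; returns (lowerBound, upperBound).
val iqrBounds: (String, DataFrame) => (Double, Double) = (column, df) => {
  val quantiles = df.stat.approxQuantile(column, Array(0.25, 0.75), 0.0)
  val q1 = quantiles(0)
  val q3 = quantiles(1)
  val iqr = q3 - q1 // interquartile range is q3 - q1
  (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
}

// val cleanedDF = outlierEliminator(incomingDF, List("categorical_column"))(iqrBounds)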

Related

How to reduce multiple case when in scala-spark

Newbie question, how do you optimize/reduce expressions like these:
when(x1._1,x1._2).when(x2._1,x2._2).when(x3._1,x3._2).when(x4._1,x4._2).when(x5._1,x5._2)....
.when(xX._1,xX._2).otherwise(z)
The x1, x2, x3, ..., xX are maps where xN._1 is the condition and xN._2 is the "then" value.
I was trying to save the maps in a list and then use a map-reduce, but it was producing:
when(x1._1,x1._2).otherwise(z) && when(x2._1,x2._2).otherwise(z)...
Which is wrong. I have about 10 lines of pure when-cases and would like to reduce them so my code is clearer.
You can use foldLeft on the maplist:
val maplist = List(x1, x2) // add more x if needed
val new_col = maplist.tail
  .foldLeft(when(maplist.head._1, maplist.head._2))((x, y) => x.when(y._1, y._2))
  .otherwise(z)
An alternative is to use coalesce: a when without an otherwise returns null when its condition is not met, so coalesce evaluates the when expressions in order until a non-null result is obtained.
val new_col = coalesce((maplist.map(x => when(x._1, x._2)) :+ z):_*)
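A short, hedged demo of the coalesce variant (the struct columns x1/x2 and the literal default 99 are placeholders; note that lit is needed if the default is a plain value rather than a Column):

import org.apache.spark.sql.functions.{coalesce, col, lit, when}

// Each xN is assumed to be a struct column with _1 = condition and _2 = value.
val conds = Seq((col("x1._1"), col("x1._2")), (col("x2._1"), col("x2._2")))

// Every when(...) without otherwise yields null when its condition is false,
// so coalesce returns the first matching branch, falling back to the default.
val new_col = coalesce((conds.map { case (c, v) => when(c, v) } :+ lit(99)): _*)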
You could create a simple recursive method to assemble the nested-when/otherwise condition:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, when}

def nestedCond(cols: Array[String], default: String): Column = {
  def loop(ls: List[String]): Column = ls match {
    case Nil => col(default)
    case c :: tail => when(col(s"$c._1"), col(s"$c._2")).otherwise(loop(tail))
  }
  loop(cols.toList).as("nested-cond")
}
Testing the method:
val df = Seq(
  ((false, 1), (false, 2), (true, 3), 88),
  ((false, 4), (true, 5), (true, 6), 99)
).toDF("x1", "x2", "x3", "z")
val cols = df.columns.filter(_.startsWith("x"))
// cols: Array[String] = Array(x1, x2, x3)
df.select(nestedCond(cols, "z")).show
// +-----------+
// |nested-cond|
// +-----------+
// | 3|
// | 5|
// +-----------+
Alternatively, use foldRight to assemble the nested-condition:
def nestedCond(cols: Array[String], default: String): Column =
  cols.foldRight(col(default)) { (c, acc) =>
    when(col(s"$c._1"), col(s"$c._2")).otherwise(acc)
  }.as("nested-cond")
Another way is to pass the otherwise value as the initial value for foldLeft:
val maplist = Seq(Map(col("c1") -> "value1"), Map(col("c2") -> "value2"))
val newCol = maplist.flatMap(_.toSeq).foldLeft(lit("z")) {
  case (acc, (cond, value)) => when(cond, value).otherwise(acc)
}
// gives:
// newCol: org.apache.spark.sql.Column = CASE WHEN c2 THEN value2 ELSE CASE WHEN c1 THEN value1 ELSE z END END

Pattern Matching Function Call in Scala

I originally posted this question on CodeReview, but it seems it didn't fit there, so I'm re-asking it here. Please tell me if it doesn't fit here either, and where I should post this kind of question. Thanks.
I am a newbie in Scala and functional programming. I want to call a function several times, with combinations of parameters based on two boolean variables. Basically, what I am doing right now is this:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

def someFunction(a: Int, b: Int): Future[Int] = ???

val value1 = true
val value2 = false

(value1, value2) match {
  case (true, true) =>
    val res1 = someFunction(0, 0)
    val res2 = someFunction(0, 1)
    val res3 = someFunction(1, 0)
    val res4 = someFunction(1, 1)
    for {
      r1 <- res1
      r2 <- res2
      r3 <- res3
      r4 <- res4
    } yield r1 + r2 + r3 + r4
  case (true, false) =>
    val res1 = someFunction(0, 0)
    val res2 = someFunction(1, 0)
    for {
      r1 <- res1
      r2 <- res2
    } yield r1 + r2
  case (false, true) =>
    val res1 = someFunction(0, 0)
    val res2 = someFunction(0, 1)
    for {
      r1 <- res1
      r2 <- res2
    } yield r1 + r2
  case (false, false) =>
    for { r1 <- someFunction(0, 0) } yield r1
}
I am not satisfied with the above code, as it is repetitive and hard to read and maintain. Is there any better way I could do this? I've tried to search for how to combine function calls by pattern matching on values like this, but found nothing to work with. It looks like I don't know the right term for this.
Any help would be appreciated, and feel free to change the title if there's a better wording.
Thanks before :)
An easier way could be to pregenerate a sequence of argument tuples:
val arguments = for {
  arg1 <- 0 to (if (value1) 1 else 0)
  arg2 <- 0 to (if (value2) 1 else 0)
} yield (arg1, arg2)
Then you can combine function executions on the arguments with Future.traverse to get a Future of the sequence of results, and then sum the results:
Future.traverse(arguments)(Function.tupled(someFunction)).map(_.sum)
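Put together, a minimal self-contained sketch of this approach (hedged: someFunction is stubbed out here purely so the snippet runs, and the global execution context plus a blocking Await are used only for demonstration):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Stub standing in for the real someFunction from the question.
def someFunction(a: Int, b: Int): Future[Int] = Future.successful(a + b)

val value1 = true
val value2 = false

// Generate only the argument pairs that the two flags allow.
val arguments = for {
  arg1 <- 0 to (if (value1) 1 else 0)
  arg2 <- 0 to (if (value2) 1 else 0)
} yield (arg1, arg2)
// arguments == Vector((0,0), (1,0)) for value1 = true, value2 = false

// Run someFunction on every pair and sum the results.
val total: Future[Int] = Future.traverse(arguments)(Function.tupled(someFunction)).map(_.sum)

println(Await.result(total, 1.second)) // 1 with this stub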
I think this should solve your problem:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

def someFunction(x: Int, y: Int): Future[Int] = ???
def someFunctionTupled: ((Int, Int)) => Future[Int] = (someFunction _).tupled // same as someFunction, but takes a tuple

def genParamList(b: Boolean): List[Int] = if (b) List(0, 1) else List(0)

val value1 = true
val value2 = false

val l1 = genParamList(value1)
val l2 = genParamList(value2)

// Combine the two parameter lists by constructing the cartesian product
val allParams = l1.foldLeft(List[(Int, Int)]()) {
  case (acc, elem) => acc ++ l2.map((elem, _))
}

Future.sequence(allParams.map(someFunctionTupled)).map(_.sum)
The above code will result in a Future[Int] which is the sum of all results of someFunction applied to the elements of the allParams list.

RDD transformation inside a loop

So I have an RDD[Array[String]] named Adat, and I want to transform it within a loop and get a new RDD which I can use outside the loop scope. I tried this, but the result is not what I want.
val sharedA = {
  for {
    i <- 0 to shareA.toInt - 1
    j <- 0 to shareA.toInt - 1
  } yield {
    Adat.map(x => (x(1).toInt, i % shareA.toInt, j % shareA.toInt, x(2)))
  }
}
The above code makes sharedA an IndexedSeq[RDD[(Int, Int, Int, String)]], and when I try to print it the result is:
MapPartitionsRDD[12] at map at planet.scala:99
MapPartitionsRDD[13] at map at planet.scala:99 and so on.
How to transform sharedA to RDD[(Int, Int, Int, String)]?
If I do it like this, sharedA has the correct datatype, but I cannot use it outside the scope.
for {
  i <- 0 to shareA.toInt - 1
  j <- 0 to shareA.toInt - 1
} yield {
  val sharedA = Adat.map(x => (x(1).toInt, i % shareA.toInt, j % shareA.toInt, x(2)))
}
I don't exactly understand your description, but flatMap should do the trick:
val rdd = sc.parallelize(Seq(Array("", "0", "foo"), Array("", "1", "bar")))
val n = 2
val result = rdd.flatMap(xs => for {
  i <- 0 to n
  j <- 0 to n
} yield (xs(1).toInt, i, j, xs(2)))
result.take(5)
// Array[(Int, Int, Int, String)] =
// Array((0,0,0,foo), (0,0,1,foo), (0,0,2,foo), (0,1,0,foo), (0,1,1,foo))
A less common approach would be to call SparkContext.union on the results:
val resultViaUnion = sc.union(for {
  i <- 0 to n
  j <- 0 to n
} yield rdd.map(xs => (xs(1).toInt, i, j, xs(2))))
resultViaUnion.take(5)
// Array[(Int, Int, Int, String)] =
// Array((0,0,0,foo), (1,0,0,bar), (0,0,1,foo), (1,0,1,bar), (0,0,2,foo))

String Hive function for split key-value pair into two columns

How can I split this data, T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0, into two columns using a Hive function?
For example:
T 32
P 1
A 420
H 60
R 0.30841494477846165
S 0
You can do this with a regex implementation:
import java.util.regex.Pattern

def main(args: Array[String]): Unit = {
  val s = "T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0"
  val pattern = "[A-Z]\\_\\d+\\.?\\d*"
  var buff = new String()
  val r = Pattern.compile(pattern)
  val m = r.matcher(s)
  while (m.find()) {
    buff = buff + m.group(0)
    buff = buff + "\n"
  }
  buff = buff.replaceAll("\\_", " ")
  println("output:\n" + buff)
}
Output:
output:
T 32
P 1
A 420
H 60
R 0.30841494477846165
S 0
If you need to collect the data for further processing, and you're guaranteed it's always paired correctly, you could do something like this.
scala> val str = "T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0"
str: String = T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0
scala> val data = str.split("_").sliding(2,2)
data: Iterator[Array[String]] = non-empty iterator
scala> data.toList // just to see it
res29: List[Array[String]] = List(Array(T, 32), Array(P, 1), Array(A, 420), Array(H, 60), Array(R, 0.30841494477846165), Array(S, 0))
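If you specifically need the result as two DataFrame columns in Spark rather than a local collection, here is a hedged sketch building on the same grouped-pairs idea (it assumes a SparkSession named spark is in scope, and the column names key and value are just placeholders):

// Hedged sketch: spark is an existing SparkSession; import its implicits for toDF.
import spark.implicits._

val str = "T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0"

// Pair up the tokens: (T, 32), (P, 1), (A, 420), ...
val pairs = str.split("_").grouped(2).collect { case Array(k, v) => (k, v) }.toSeq

val kvDF = pairs.toDF("key", "value") // two columns, one row per pair
kvDF.show()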
You can split your string to get an array, zipWithIndex, and filter based on index to get two arrays, col1 and col2, and then use them for printing:
val str = "T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0"
val tmp = str.split('_').zipWithIndex
val col1 = tmp.filter( p => p._2 % 2 == 0 ).map( p => p._1)
val col2 = tmp.filter( p => p._2 % 2 != 0 ).map( p => p._1)
//col1: Array[String] = Array(T, P, A, H, R, S)
//col2: Array[String] = Array(32, 1, 420, 60, ...

Three Sum to N in Scala

Is there a better way than this example to find three numbers from a list that sum to a given number n in Scala? Right now, I feel like my functional approach may not be the most efficient, and it contains duplicate tuples. What is the most efficient way to get rid of the duplicate tuples in my current example?
def secondThreeSum(nums: List[Int], n: Int): List[(Int, Int, Int)] = {
  val sums = nums.combinations(2).map(combo => combo(0) + combo(1) -> (combo(0), combo(1))).toList.toMap
  nums.flatMap { num =>
    val tmp = n - num
    if (sums.contains(tmp) && sums(tmp)._1 != num && sums(tmp)._2 != num) Some((num, sums(tmp)._1, sums(tmp)._2)) else None
  }
}
This is pretty simple, and doesn't repeat any tuples:
def f(nums: List[Int], n: Int): List[(Int, Int, Int)] = {
  for {
    (a, i) <- nums.zipWithIndex
    (b, j) <- nums.zipWithIndex.drop(i + 1)
    c <- nums.drop(j + 1)
    if n == a + b + c
  } yield (a, b, c)
}
Use .combinations(3) to generate all distinct possible triplets of your start list, then keep only those that sum up to n:
scala> def secondThreeSum(nums:List[Int], n:Int):List[(Int,Int,Int)] = {
nums.combinations(3)
.collect { case List(a,b,c) if (a+b+c) == n => (a,b,c) }
.toList
}
secondThreeSum: (nums: List[Int], n: Int)List[(Int, Int, Int)]
scala> secondThreeSum(List(1,2,3,-5,2), 0)
res3: List[(Int, Int, Int)] = List((2,3,-5))
scala> secondThreeSum(List(1,2,3,-5,2), -1)
res4: List[(Int, Int, Int)] = List((1,3,-5), (2,2,-5))
Here is a solution that is O(n^2 * log(n)), so it's quite a lot faster for large lists. It also uses lower-level language features to increase the speed even further.
def f(nums: List[Int], n: Int): List[(Int, Int, Int)] = {
  val result = scala.collection.mutable.ArrayBuffer.empty[(Int, Int, Int)]
  val array = nums.toArray

  // For every value, remember the highest index at which it occurs.
  val mapValueToMaxIndex = scala.collection.mutable.Map.empty[Int, Int]
  nums.zipWithIndex.foreach {
    case (value, i) => mapValueToMaxIndex += (value -> math.max(i, mapValueToMaxIndex.getOrElse(value, i)))
  }

  val size = array.size
  var i = 0
  while (i < size) {
    val a = array(i)
    var j = i + 1
    while (j < size) {
      val b = array(j)
      val c = n - b - a
      // The third element must occur at an index after j, so a and b are not reused.
      mapValueToMaxIndex.get(c).foreach { maxIndex =>
        if (maxIndex > j) result += ((a, b, c))
      }
      j += 1
    }
    i += 1
  }
  result.toList
}
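A quick hedged check on the same input used for the combinations answer above (note that, unlike combinations(3), this version can report the same triple more than once in different element orders):

f(List(1, 2, 3, -5, 2), 0)
// List((2,3,-5), (3,-5,2)) -- the same triple, reported twice in different orders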