RDD transformation inside a loop - Scala

So I have an RDD of Array[String] named Adat, and I want to transform it within a loop and get a new RDD which I can use outside the loop scope. I tried this, but the result is not what I want.
val sharedA = {
  for {
    i <- 0 to shareA.toInt - 1
    j <- 0 to shareA.toInt - 1
  } yield {
    Adat.map(x => (x(1).toInt, i % shareA.toInt, j % shareA.toInt, x(2)))
  }
}
The above code makes sharedA an IndexedSeq[RDD[(Int, Int, Int, String)]], and when I try to print it the result is:
MapPartitionsRDD[12] at map at planet.scala:99
MapPartitionsRDD[13] at map at planet.scala:99
and so on. How can I transform sharedA to a single RDD[(Int, Int, Int, String)]?
If I do it like this, sharedA has the correct datatype, but I cannot use it outside the scope:
for {
  i <- 0 to shareA.toInt - 1
  j <- 0 to shareA.toInt - 1
} yield {
  val sharedA = Adat.map(x => (x(1).toInt, i % shareA.toInt, j % shareA.toInt, x(2)))
}

I don't exactly understand your description, but flatMap should do the trick:
val rdd = sc.parallelize(Seq(Array("", "0", "foo"), Array("", "1", "bar")))
val n = 2
val result = rdd.flatMap(xs => for {
  i <- 0 to n
  j <- 0 to n
} yield (xs(1).toInt, i, j, xs(2)))
result.take(5)
// Array[(Int, Int, Int, String)] =
// Array((0,0,0,foo), (0,0,1,foo), (0,0,2,foo), (0,1,0,foo), (0,1,1,foo))
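Applied to the variables from the question, the same idea would look like this (my own adaptation, assuming each row of Adat has at least three fields):

val sharedA = Adat.flatMap(x => for {
  i <- 0 to shareA.toInt - 1
  j <- 0 to shareA.toInt - 1
} yield (x(1).toInt, i % shareA.toInt, j % shareA.toInt, x(2)))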
A less common approach would be to call SparkContext.union on the intermediate results:
val resultViaUnion = sc.union(for {
  i <- 0 to n
  j <- 0 to n
} yield rdd.map(xs => (xs(1).toInt, i, j, xs(2))))
resultViaUnion.take(5)
// Array[(Int, Int, Int, String)] =
// Array((0,0,0,foo), (1,0,0,bar), (0,0,1,foo), (1,0,1,bar), (0,0,2,foo))

Related

Outlier Elimination in Spark With InterQuartileRange Results in Error

I have the following recursive function that determines the outliers using the interquartile range method:
def interQuartileRangeFiltering(df: DataFrame): DataFrame = {
  @scala.annotation.tailrec
  def inner(cols: List[String], acc: DataFrame): DataFrame = cols match {
    case Nil => acc
    case column :: xs =>
      val quantiles = acc.stat.approxQuantile(column, Array(0.25, 0.75), 0.0) // TODO: values should come from config
      println(s"$column ${quantiles.size}")
      val q1 = quantiles(0)
      val q3 = quantiles(1)
      val iqr = q1 - q3
      val lowerRange = q1 - 1.5 * iqr
      val upperRange = q3 + 1.5 * iqr
      val filtered = acc.filter(s"$column < $lowerRange or $column > $upperRange")
      inner(xs, filtered)
  }
  inner(df.columns.toList, df)
}
val outlierDF = interQuartileRangeFiltering(incomingDF)
So basically what I'm doing is recursively iterating over the columns and eliminating the outliers. Strangely, it results in an ArrayIndexOutOfBoundsException and prints the following:
housing_median_age 2
inland 2
island 2
population 2
total_bedrooms 2
near_bay 2
near_ocean 2
median_house_value 0
java.lang.ArrayIndexOutOfBoundsException: 0
at inner$1(<console>:75)
at interQuartileRangeFiltering(<console>:83)
... 54 elided
What is wrong with my approach?
Note first what goes wrong in your version: iqr is computed as q1 - q3 instead of q3 - q1, and the filter keeps the rows matching the outlier condition rather than removing them, so rows are thrown away at every step. By the time median_house_value is processed no usable rows remain, approxQuantile on an empty (or all-null) column returns an empty array (hence the printed 0), and quantiles(0) throws the ArrayIndexOutOfBoundsException. Here is what I came up with instead, and it works like a charm:
import scala.annotation.tailrec
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import spark.implicits._

def outlierEliminator(df: DataFrame, colsToIgnore: List[String])(fn: (String, DataFrame) => (Double, Double)): DataFrame = {
  val ID_COL_NAME = "id"
  // DataFrameUtils.addColumnIndex is a helper that adds a unique Long id column.
  val dfWithId = DataFrameUtils.addColumnIndex(spark, df, ID_COL_NAME)
  val dfWithIgnoredCols = dfWithId.drop(colsToIgnore: _*)

  @tailrec
  def inner(
    cols: List[String],
    filterIdSeq: List[Long],
    dfWithId: DataFrame
  ): List[Long] = cols match {
    case Nil => filterIdSeq
    case column :: xs =>
      if (column == ID_COL_NAME) {
        inner(xs, filterIdSeq, dfWithId)
      } else {
        val (lowerBound, upperBound) = fn(column, dfWithId)
        // Collect the ids of the rows that fall outside the bounds for this column.
        val filteredIds =
          dfWithId
            .filter(s"$column < $lowerBound or $column > $upperBound")
            .select(col(ID_COL_NAME))
            .map(r => r.getLong(0))
            .collect
            .toList
        inner(xs, filteredIds ++ filterIdSeq, dfWithId)
      }
  }

  val filteredIds = inner(dfWithIgnoredCols.columns.toList, List.empty[Long], dfWithIgnoredCols)
  dfWithId.except(dfWithId.filter($"$ID_COL_NAME".isin(filteredIds: _*)))
}
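For reference, here is a minimal sketch of a bounds function that could be passed in as fn, computing the interquartile range as the question intended (the name iqrBounds and the relativeError of 0.0 are illustrative choices, not part of the answer above):

// Hypothetical IQR-based bounds function; note iqr = q3 - q1.
def iqrBounds(column: String, df: DataFrame): (Double, Double) = {
  val quantiles = df.stat.approxQuantile(column, Array(0.25, 0.75), 0.0)
  val q1 = quantiles(0)
  val q3 = quantiles(1)
  val iqr = q3 - q1
  (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
}

val outlierFreeDF = outlierEliminator(incomingDF, List.empty[String])(iqrBounds)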

Matrix multiplication in Scala

I am trying to transpose a matrix of size 3*2 by defining an empty matrix of size 2*3. How can I create an empty matrix? I am missing something in the commented piece of code!
type Row = List[Int]
type Matrix = List[Row]
val m: Matrix = List(1 :: 2 :: Nil, 3 :: 4 :: Nil, 5 :: 6 :: Nil)
def transpose(m: Matrix): Matrix = {
  val rows = m.size
  val cols = m.head.size
  val trans = List(List())(rows(cols)) // Int doesn't take parameter
  for (i <- 0 until cols) {
    for (j <- 0 until rows) {
      trans(i)(j) = this(j)(i)
    }
  }
  return trans
}
When it is necessary to access elements by index, Vector or Array is more efficient than List.
Here is a Vector-based version of the solution:
type Row = Vector[Int]
type Matrix = Vector[Row]
val m: Matrix = Vector(Vector(1, 2), Vector(3, 4), Vector(5, 6))
def transpose(mat: Matrix) =
  for (i <- 0 until mat(0).size)
    yield for (j <- 0 until mat.size)
      yield mat(j)(i)
Test in REPL:
scala> transpose(m)
res12: scala.collection.immutable.IndexedSeq[scala.collection.immutable.IndexedSeq[Int]] = Vector(Vector(1, 3, 5), Vector(2, 4, 6))
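As an aside (not part of the original answer): the Scala standard library already provides a transpose method on nested collections, so the hand-written loop can be skipped entirely:

m.transpose
// Vector(Vector(1, 3, 5), Vector(2, 4, 6))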

String Hive function for split key-value pair into two columns

How can I split this data, T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0, into two columns using a Hive function?
For example
T 32
P 1
A 420
H 60
R 0.30841494477846165
S 0
You can do this with a regex-based implementation in Scala:
import java.util.regex.Pattern

def main(args: Array[String]) {
  val s = "T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0"
  // A capital letter, an underscore, then an integer or decimal value.
  val pattern = "[A-Z]\\_\\d+\\.?\\d*"
  val buff = new StringBuilder()
  val r = Pattern.compile(pattern)
  val m = r.matcher(s)
  while (m.find()) {
    buff.append(m.group(0))
    buff.append("\n")
  }
  val output = buff.toString.replaceAll("\\_", " ")
  println("output:\n" + output)
}
Output:
output:
T 32
P 1
A 420
H 60
R 0.30841494477846165
S 0
If you need to collect the data for further processing, and you're guaranteed it's always paired correctly, you could do something like this.
scala> val str = "T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0"
str: String = T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0
scala> val data = str.split("_").sliding(2,2)
data: Iterator[Array[String]] = non-empty iterator
scala> data.toList // just to see it
res29: List[Array[String]] = List(Array(T, 32), Array(P, 1), Array(A, 420), Array(H, 60), Array(R, 0.30841494477846165), Array(S, 0))
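If you need the pairs as a lookup structure rather than just for display, a natural continuation (my own addition, assuming every chunk really pairs up) is:

val pairs = str.split("_").sliding(2, 2).map { case Array(k, v) => k -> v }.toMap
// contains T -> 32, P -> 1, A -> 420, H -> 60, R -> 0.30841494477846165, S -> 0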
You can split your string, get an array, zipWithIndex, and filter based on the index to get two arrays, col1 and col2, which you can then use for printing:
val str = "T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0"
val tmp = str.split('_').zipWithIndex
val col1 = tmp.filter(p => p._2 % 2 == 0).map(p => p._1)
val col2 = tmp.filter(p => p._2 % 2 != 0).map(p => p._1)
//col1: Array[String] = Array(T, P, A, H, R, S)
//col2: Array[String] = Array(32, 1, 420, 60, ...
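The printing step itself could then be (my own addition):

col1.zip(col2).foreach { case (k, v) => println(s"$k $v") }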

Three Sum to N in Scala

Is there a better way than this example to find three numbers from a list that sum to a given number n in Scala? Right now, I feel like my functional way may not be the most efficient, and it contains duplicate tuples. What is the most efficient way to get rid of the duplicate tuples in my current example?
def secondThreeSum(nums: List[Int], n: Int): List[(Int, Int, Int)] = {
  val sums = nums.combinations(2).map(combo => combo(0) + combo(1) -> (combo(0), combo(1))).toList.toMap
  nums.flatMap { num =>
    val tmp = n - num
    if (sums.contains(tmp) && sums(tmp)._1 != num && sums(tmp)._2 != num) Some((num, sums(tmp)._1, sums(tmp)._2))
    else None
  }
}
This is pretty simple, and doesn't repeat any tuples:
def f(nums: List[Int], n: Int): List[(Int, Int, Int)] = {
  for {
    (a, i) <- nums.zipWithIndex
    (b, j) <- nums.zipWithIndex.drop(i + 1)
    c <- nums.drop(j + 1)
    if n == a + b + c
  } yield (a, b, c)
}
Use .combinations(3) to generate all distinct possible triplets of your start list, then keep only those that sum up to n:
scala> def secondThreeSum(nums: List[Int], n: Int): List[(Int, Int, Int)] = {
         nums.combinations(3)
           .collect { case List(a, b, c) if (a + b + c) == n => (a, b, c) }
           .toList
       }
secondThreeSum: (nums: List[Int], n: Int)List[(Int, Int, Int)]
scala> secondThreeSum(List(1,2,3,-5,2), 0)
res3: List[(Int, Int, Int)] = List((2,3,-5))
scala> secondThreeSum(List(1,2,3,-5,2), -1)
res4: List[(Int, Int, Int)] = List((1,3,-5), (2,2,-5))
Here is a solution that is O(n^2*log(n)), so it is quite a lot faster for large lists.
It also uses lower-level language features to increase the speed even further.
def f(nums: List[Int], n: Int): List[(Int, Int, Int)] = {
  val result = scala.collection.mutable.ArrayBuffer.empty[(Int, Int, Int)]
  val array = nums.toArray
  // For each value, remember the highest index at which it occurs.
  val mapValueToMaxIndex = scala.collection.mutable.Map.empty[Int, Int]
  nums.zipWithIndex.foreach { case (value, i) =>
    mapValueToMaxIndex += (value -> math.max(i, mapValueToMaxIndex.getOrElse(value, i)))
  }
  val size = array.size
  var i = 0
  while (i < size) {
    val a = array(i)
    var j = i + 1
    while (j < size) {
      val b = array(j)
      val c = n - b - a
      // Accept c only if it also occurs at some index after j,
      // so every emitted triple uses three distinct positions.
      mapValueToMaxIndex.get(c).foreach { maxIndex =>
        if (maxIndex > j) result += ((a, b, c))
      }
      j += 1
    }
    i += 1
  }
  result.toList
}
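A quick usage example (my own, not from the original answer):

f(List(1, 2, 3, 4, 5), 9)
// List((1,3,5), (2,3,4))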

Convert iterative two sum k to functional

I have this code in Python that finds all pairs of numbers in an array that sum to k:
def two_sum_k(array, k):
    seen = set()
    out = set()
    for v in array:
        if k - v in seen:
            out.add((min(v, k-v), max(v, k-v)))
        seen.add(v)
    return out
Can anyone help me convert this to Scala (in a functional style)? Also with linear complexity.
I think this is a classic case where a for-comprehension can provide additional clarity:
scala> def algo(xs: IndexedSeq[Int], target: Int) =
| for {
| i <- 0 until xs.length
| j <- (i + 1) until xs.length if xs(i) + xs(j) == target
| }
| yield xs(i) -> xs(j)
algo: (xs: IndexedSeq[Int], target: Int)scala.collection.immutable.IndexedSeq[(Int, Int)]
Using it:
scala> algo(1 to 20, 15)
res0: scala.collection.immutable.IndexedSeq[(Int, Int)] = Vector((1,14), (2,13), (3,12), (4,11), (5,10), (6,9), (7,8))
I think it also doesn't suffer from the problems that your algorithm has.
I'm not sure this is the clearest, but folds usually do the trick:
def two_sum_k(xs: Seq[Int], k: Int) = {
  xs.foldLeft((Set[Int](), Set[(Int, Int)]())) { case ((seen, out), v) =>
    (seen + v, if (seen contains k - v) out + ((v min k - v, v max k - v)) else out)
  }._2
}
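For example (my own quick check, not from the answer):

two_sum_k(Seq(1, 3, 4, 5, 7), 8)
// Set((3,5), (1,7))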
You could just filter with (k - x <= x), by only using as the first element those x which aren't bigger than k/2:
def two_sum_k (xs: List[Int], k: Int): List[(Int, Int)] =
  xs.filter (x => (x <= k/2)).
    filter (x => (xs contains k-x) && (xs.indexOf (x) != xs.lastIndexOf (x))).
    map (x => (x, k-x)).distinct
My first filter on line 3 was just filter (x => xs contains k-x), which failed, as found in the comment by Someone Else. Now it's more complicated, and it doesn't find (4, 4).
scala> li
res6: List[Int] = List(2, 3, 3, 4, 5, 5)
scala> two_sum_k (li, 8)
res7: List[(Int, Int)] = List((3,5))
def twoSumK(xs: List[Int], k: Int): List[(Int, Int)] = {
  val tuples = xs.iterator map { x => (x, k - x) }
  val potentialValues = tuples map { case (a, b) => (a min b) -> (a max b) }
  val values = potentialValues filter { xs contains _._2 }
  values.toSet.toList
}
Well, a direct translation would be this:
import scala.collection.mutable

def twoSumK[T: Numeric](array: Array[T], k: T) = {
  val num = implicitly[Numeric[T]]
  import num._
  val seen = mutable.HashSet[T]()
  val out: mutable.Set[(T, T)] = mutable.HashSet[(T, T)]()
  for (v <- array) {
    if (seen contains k - v) out += min(v, k - v) -> max(v, k - v)
    seen += v
  }
  out
}
One clever way of doing it would be this:
def twoSumK[T: Numeric](array: Array[T], k: T) = {
  val num = implicitly[Numeric[T]]
  import num._
  // One can write all the rest as a one-liner
  val s1 = array.toSet
  val s2 = s1 map (k - _)
  val s3 = s1 intersect s2
  s3 map (v => min(v, k - v) -> max(v, k - v))
}
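A quick check of the set-based version (my own example): note that it reports (4, 4) even when 4 occurs only once in the input, because sets discard multiplicity:

twoSumK(Array(1, 3, 4, 5, 7), 8)
// contains (1,7), (3,5) and (4,4)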
This does the trick:
def two_sum_k(xs: List[Int], k: Int): List[(Int, Int)] = {
  xs.map(a => xs.map(b => (b, a + b)).filter(_._2 == k).map(b => (b._1, a)))
    .flatten
    .collect { case (a, b) => if (a > b) (b, a) else (a, b) }
    .distinct
}