Spark: How to transform values of some selected features in LabeledPoint? - scala

I've got a LabeledPoint and a list of features that I want to transform:
scala> transformedData.collect()
res29: Array[org.apache.spark.mllib.regression.LabeledPoint] = Array((0.0,(400036,[7744],[2.0])), (0.0,(400036,[7744,8608],[3.0,3.0])), (0.0,(400036,[7744],[2.0])), (0.0,(400036,[133,218,2162,7460,7744,9567],[1.0,1.0,2.0,1.0,42.0,21.0])), (0.0,(400036,[133,218,1589,2162,2784,2922,3274,6914,7008,7131,7460,8608,9437,9567,199999,200021,200035,200048,200051,200056,200058,200064,200070,200072,200075,200087,400008,400011],[4.0,1.0,6.0,53.0,6.0,1.0,1.0,2.0,11.0,17.0,48.0,3.0,4.0,113.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,28.0,1.0,1.0,1.0,1.0,1.0,4.0])), (0.0,(400036,[1589,3585,4830,6935,6936,7744,400008,400011],[2.0,6.0,3.0,52.0,4.0,3.0,1.0,2.0])), (0.0,(400036,[1589,2162,2784,2922,4123,7008,7131,7792,8608],[23.0,70.0,1.0,2.0,2.0,1.0,1.0,2.0,2.0])), (0.0,(400036,[4830,6935,6936,400008,400011],[1.0,36.0...
val toTransform = List(124,443,543,211,...
The transformation I want to use looks like this:
Take the natural logarithm of (feature value + 1): new_val = log(val + 1)
Divide the new values by the maximum of the new values: new_val / max(new_val) (if the max is not equal to 0)
How can I perform this transformation for each feature in my toTransform list? (I don't want to create new features, just transform the old ones.)

It is possible but not exactly straightforward. If you can transform the values before you assemble the vectors and labeled points, then the answer provided by @eliasah should do the trick. Otherwise you have to do things the hard way. Let's assume your data looks like this:
import org.apache.spark.mllib.linalg.{Vector, Vectors, SparseVector, DenseVector}
import org.apache.spark.mllib.regression.LabeledPoint

val points = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.sparse(6, Array(1, 4, 5), Array(2.0, 6.0, 3.0))),
  LabeledPoint(2.0, Vectors.sparse(6, Array(2, 3), Array(0.1, 1.0)))
))
Next, let's define a small helper to convert Spark vectors to Breeze vectors:
import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}

def toBreeze(v: Vector): BV[Double] = v match {
  case DenseVector(values) => new BDV[Double](values)
  case SparseVector(size, indices, values) => new BSV[Double](indices, values, size)
}
and disassemble LabeledPoints as follows:
val pairs = points.map(lp => (lp.label, toBreeze(lp.features)))
Now we can define a transformation function:
def transform(indices: Seq[Int])(v: BV[Double]) = {
  for (i <- indices) v(i) = breeze.numerics.log(v(i) + 1.0)
  v
}
and transform pairs:
val indices = Array(2, 4)
val transformed = pairs.mapValues(transform(indices))
Finally, let's find the maximum values and divide by them:
val maxV = transformed.values.reduce(breeze.linalg.max(_, _))

def divideByMax(m: BV[Double], indices: Seq[Int])(v: BV[Double]) = {
  for (i <- indices) if (m(i) != 0) v(i) /= m(i)
  v
}

val divided = transformed.mapValues(divideByMax(maxV, indices))
and map back to LabeledPoints:
def toSpark(v: BV[Double]) = v match {
  case v: BDV[Double] => new DenseVector(v.toArray)
  case v: BSV[Double] => new SparseVector(v.length, v.index, v.data)
}

divided.map { case (l, v) => LabeledPoint(l, toSpark(v)) }
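Putting it all together, a minimal end-to-end sketch using the indices, toBreeze, transform, divideByMax and toSpark defined above (collect() is only for inspecting the small sample):
val rescaled = {
  val pairs   = points.map(lp => (lp.label, toBreeze(lp.features)))  // disassemble
  val logged  = pairs.mapValues(transform(indices))                  // log(x + 1) on selected indices
  val maxVals = logged.values.reduce(breeze.linalg.max(_, _))        // element-wise maxima
  logged
    .mapValues(divideByMax(maxVals, indices))                        // divide by the maxima
    .map { case (l, v) => LabeledPoint(l, toSpark(v)) }
}
rescaled.collect().foreach(println)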

@zero323 is right; if you can flatten your LabeledPoints first, then you can do the following:
// imports needed for udf, max, the $"..." syntax and Row
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import spark.implicits._

// create an UDF to transform
def transform(max: Double) = udf[Double, Double] { c => Math.log1p(c) / max }
// create dummy data
val df = sc.parallelize(Seq(1, 2, 3, 4, 5, 4, 3, 2, 1)).toDF("feature")
// get the max value of the feature
val maxFeat = df.agg(max($"feature")).rdd.map { case r: Row => r.getInt(0) }.max
// apply the transformation on your feature column
val newDf = df.withColumn("norm", transform(maxFeat)($"feature"))
newDf.show
// +-------+-------------------+
// |feature| norm|
// +-------+-------------------+
// | 1|0.13862943611198905|
// | 2|0.21972245773362192|
// | 3| 0.2772588722239781|
// | 4|0.32188758248682003|
// | 5| 0.358351893845611|
// | 4|0.32188758248682003|
// | 3| 0.2772588722239781|
// | 2|0.21972245773362192|
// | 1|0.13862943611198905|
// +-------+-------------------+
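Note that this divides by the maximum of the raw feature rather than the maximum of the log-transformed values. If you want to match the question's new_val / max(new_val) exactly, a minimal sketch reusing the same df and imports (log1p and max come from org.apache.spark.sql.functions) would be:
// divide by the max of log1p(feature) instead of the max of the raw feature
val logDf  = df.withColumn("logFeat", log1p($"feature"))
val maxLog = logDf.agg(max($"logFeat")).first().getDouble(0)
val normDf =
  if (maxLog != 0.0) logDf.withColumn("norm", $"logFeat" / maxLog)
  else logDf.withColumn("norm", $"logFeat")
normDf.show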

Related

How to reduce multiple case when in scala-spark

Newbie question, how do you optimize/reduce expressions like these:
when(x1._1,x1._2).when(x2._1,x2._2).when(x3._1,x3._2).when(x4._1,x4._2).when(x5._1,x5._2)....
.when(xX._1,xX._2).otherwise(z)
The x1, x2, x3, xX are maps where x1._1 is the condition and x1._2 is the "then".
I was trying to save the maps in a list and then use a map-reduce but it was producing a:
when(x1._1,x1._2).otherwise(z) && when(x2._1,x2._2).otherwise(z)...
which is wrong. I have about 10 lines of pure when cases and would like to reduce that so my code is clearer.
You can use foldLeft on the maplist:
val maplist = List(x1, x2) // add more x if needed
val new_col = maplist.tail
  .foldLeft(when(maplist.head._1, maplist.head._2))((x, y) => x.when(y._1, y._2))
  .otherwise(z)
An alternative is to use coalesce. If the condition is not met, null is returned by the when statement, and the next when statement will be evaluated until a non-null result is obtained.
val new_col = coalesce((maplist.map(x => when(x._1, x._2)) :+ z):_*)
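To illustrate why this works, here is a small sketch (assuming spark.implicits._ and org.apache.spark.sql.functions._ are in scope): a when without an otherwise yields null for non-matching rows, and coalesce picks the first non-null value.
val demo = Seq((false, 1), (true, 2)).toDF("cond", "v")
demo.select(coalesce(when($"cond", $"v"), lit(-1)).as("out")).show()
// row (false, 1) -> when() is null, so coalesce falls back to -1
// row (true, 2)  -> when() returns 2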
You could create a simple recursive method to assemble the nested-when/otherwise condition:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, when}

def nestedCond(cols: Array[String], default: String): Column = {
  def loop(ls: List[String]): Column = ls match {
    case Nil => col(default)
    case c :: tail => when(col(s"$c._1"), col(s"$c._2")).otherwise(loop(tail))
  }
  loop(cols.toList).as("nested-cond")
}
Testing the method:
val df = Seq(
  ((false, 1), (false, 2), (true, 3), 88),
  ((false, 4), (true, 5), (true, 6), 99)
).toDF("x1", "x2", "x3", "z")
val cols = df.columns.filter(_.startsWith("x"))
// cols: Array[String] = Array(x1, x2, x3)
df.select(nestedCond(cols, "z")).show
// +-----------+
// |nested-cond|
// +-----------+
// | 3|
// | 5|
// +-----------+
Alternatively, use foldRight to assemble the nested-condition:
def nestedCond(cols: Array[String], default: String): Column =
  cols.foldRight(col(default)) { (c, acc) =>
    when(col(s"$c._1"), col(s"$c._2")).otherwise(acc)
  }.as("nested-cond")
Another way is to pass the otherwise value as the initial value for foldLeft:
val maplist = Seq(Map(col("c1") -> "value1"), Map(col("c2") -> "value2"))
val newCol = maplist.flatMap(_.toSeq).foldLeft(lit("z")) {
  case (acc, (cond, value)) => when(cond, value).otherwise(acc)
}
// gives:
// newCol: org.apache.spark.sql.Column = CASE WHEN c2 THEN value2 ELSE CASE WHEN c1 THEN value1 ELSE z END END

How to update a global variable inside RDD map operation

I have an RDD[(Int, Array[Double])], and after creating it I call a class function:
val rdd = spark.sparkContext.parallelize(Seq(
  (1, Array(2.0, 5.0, 6.3)),
  (5, Array(1.0, 3.3, 9.5)),
  (1, Array(5.0, 4.2, 3.1)),
  (2, Array(9.6, 6.3, 2.3)),
  (1, Array(8.5, 2.5, 1.2)),
  (5, Array(6.0, 2.4, 7.8)),
  (2, Array(7.8, 9.1, 4.2))
))

val new_class = new ABC
new_class.demo(rdd)
Inside the class, a global variable value = 0 is declared. Inside demo(), a new variable new_value = 0 is declared. After the map operation, new_value gets updated and the updated value is printed inside the map.
class ABC extends Serializable {
  var value = 0
  def demo(data_new: RDD[(Int, Array[Double])]): Unit = {
    var new_value = 0
    data_new.coalesce(1).map(x => {
      if (x._1 == 1)
        new_value = new_value + 1
      println(new_value)
      value = new_value
    }).count()
    println("Outside-->" + value)
  }
}
OUTPUT:-
1
1
2
2
3
3
3
Outside-->0
How can I update the global variable value after the map operation?
I'm not sure exactly what you are doing, but you need to use accumulators to perform the type of operation where you need to add values like this.
Here is an example :
scala> val rdd = spark.sparkContext.parallelize(Seq(
| (1, Array(2.0,5.0,6.3)),
| (5, Array(1.0,3.3,9.5)),
| (1, Array(5.0,4.2,3.1)),
| (2, Array(9.6,6.3,2.3)),
| (1, Array(8.5,2.5,1.2)),
| (5, Array(6.0,2.4,7.8)),
| (2, Array(7.8,9.1,4.2))
| )
| )
rdd: org.apache.spark.rdd.RDD[(Int, Array[Double])] = ParallelCollectionRDD[83] at parallelize at <console>:24
scala> val accum = sc.longAccumulator("My Accumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 46181, name: Some(My Accumulator), value: 0)
scala> rdd.foreach { x => if(x._1 == 1) accum.add(1) }
scala> accum.value
res38: Long = 3
And as mentioned by @philantrovert, if you wish to count the number of occurrences of each key, you can do the following:
scala> rdd.mapValues(_ => 1L).reduceByKey(_ + _).take(3)
res41: Array[(Int, Long)] = Array((1,3), (2,2), (5,2))
You can also use countByKey but it is to be avoided with big datasets.
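For small key cardinalities it looks like this (countByKey collects a Map of all keys to the driver, which is why it should be avoided on big datasets):
rdd.countByKey()
// Map(1 -> 3, 5 -> 2, 2 -> 2) with the sample data above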
No, you can't change global variables from inside the map.
If you are trying to count the number of keys equal to 1, then you can use filter:
val value = data_new.filter(x => (x._1 == 1)).count
println("Outside-->" +value)
Output:
Outside-->3
Also, it is not recommended to use mutable variables (var). You should always try to use immutable values (val).
I hope this helps!
Or you can achieve this as follows, by returning the result instead of mutating state:
class ABC extends Serializable {
  // count the keys equal to 1 with reduceByKey and return the result
  def demo(data_new: RDD[(Int, Array[Double])]): RDD[(Int, Int)] = {
    data_new
      .filter(_._1 == 1)
      .map(x => (x._1, 1))
      .reduceByKey(_ + _)
  }
}
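Calling it then looks like this (a sketch using the rdd from the question; with that data the collected result contains the single pair (1,3)):
val new_class = new ABC
println("Outside-->" + new_class.demo(rdd).collect().mkString(", "))
// Outside-->(1,3)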

Scala map with dependent variables

In Scala I have a list of functions that return a value. The order in which the functions are executed is important, since the argument of function n is the output of function n-1.
This hints at using foldLeft, something like:
val base: A
val funcs: Seq[Function[A, A]]
funcs.foldLeft(base)((x, f) => f(x))
(detail: type A is actually a Spark DataFrame).
However, the results of the functions are mutually exclusive, and in the end I want the union of all the results of each function.
This hints at using a map, something like:
funcs.map(f => f(base)).reduce(_.union(_))
But here each function is applied to base which is not what I want.
In short: a variable-length list of ordered functions needs to produce a list of return values of equal length, where value n-1 is the input for function n (starting from base at n=0), so that the result values can be concatenated.
How can I achieve this?
EDIT
example:
case class X(id:Int, value:Int)
val base = spark.createDataset(Seq(X(1, 1), X(2, 2), X(3, 3), X(4, 4), X(5, 5))).toDF
def toA = (x: DataFrame) => x.filter('value.mod(2) === 1).withColumn("value", lit("a"))
def toB = (x: DataFrame) => x.withColumn("value", lit("b"))
val a = toA(base)
val remainder = base.join(a, Seq("id"), "leftanti")
val b = toB(remainder)
a.union(b)
+---+-----+
| id|value|
+---+-----+
| 1| a|
| 3| a|
| 5| a|
| 2| b|
| 4| b|
+---+-----+
This should work for an arbitrary number of functions (e.g. toA, toB, ..., toN), where each time the remainder of the previous result is calculated and passed into the next function. In the end a union is applied to all results.
Seq already has a method scanLeft that does this out-of-the-box:
funcs.scanLeft(base)((acc, f) => f(acc)).tail
Make sure to drop the first element of the result of scanLeft if you don't want base to be included.
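If, as in the question's toA/toB example, each function must only see the remainder left over from the previous step, here is a sketch of a fold that carries both the remaining rows and the accumulated results (it reuses base, toA, toB and the leftanti join on "id" from the question):
import org.apache.spark.sql.DataFrame

val funcs: Seq[DataFrame => DataFrame] = Seq(toA, toB)
val (_, results) = funcs.foldLeft((base, Seq.empty[DataFrame])) {
  case ((remaining, acc), f) =>
    val out  = f(remaining)                               // apply the next function
    val rest = remaining.join(out, Seq("id"), "leftanti") // rows not handled yet
    (rest, acc :+ out)
}
results.reduce(_ union _).show   // same id/value output as the manual a.union(b)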
Using only foldLeft it is possible too:
funcs.foldLeft((base, List.empty[A])) { case ((x, list), f) =>
  val res = f(x)
  (res, res :: list)
}._2.reverse.reduce(_.union(_))
Or:
funcs.foldLeft((base, Vector.empty[A])) { case ((x, list), f) =>
  val res = f(x)
  (res, list :+ res)
}._2.reduce(_.union(_))
The trick is to accumulate into a Seq inside the fold.
Example:
scala> val base = 7
base: Int = 7
scala> val funcs: List[Int => Int] = List(_ * 2, _ + 3)
funcs: List[Int => Int] = List($$Lambda$1772/1298658703#7d46af18, $$Lambda$1773/107346281#5470fb9b)
scala> funcs.foldLeft((base, Vector.empty[Int])){ case ((x, list), f) =>
| val res = f(x)
| (res, list :+ res)
| }._2
res8: scala.collection.immutable.Vector[Int] = Vector(14, 17)
scala> .reduce(_ + _)
res9: Int = 31
I've got a simplified solution using normal collections but the same principle applies.
val list: List[Int] = List(1, 2, 3, 4, 5)
val funcs: Seq[Function[List[Int], List[Int]]] = Seq(times2, by2)
funcs.foldLeft(list) { case(collection, func) => func(collection) } foreach println // prints 1 2 3 4 5
def times2(l: List[Int]): List[Int] = l.map(_ * 2)
def by2(l: List[Int]): List[Int] = l.map(_ / 2)
This solution does not hold if you want a single reduced value as your final output (e.g. a single Int); it works as F[B] -> F[B] -> F[B] and not as F[B] -> F[B] -> B, though I guess the former is what you need.

Average word length in Spark

I have a list of values and, for each, the aggregated length of all its occurrences, as an array.
Ex: If my sentence is
"I have a cat. The cat looks very cute"
My array looks like
Array((I,1), (have,4), (a,1), (cat,6), (The, 3), (looks, 5), (very ,4), (cute,4))
Now I want to compute the average length of each word, i.e. the length / number of occurrences.
I tried to do the coding using Scala as follows:
val avglen = arr.reduceByKey( (x,y) => (x, y.toDouble / x.size.toDouble) )
I'm getting an error as follows at x.size
error: value size is not a member of Int
Please help me where I'm going wrong here.
After your comment I think I got it:
val words = sc.parallelize(Array(("i", 1), ("have", 4),
("a", 1), ("cat", 6),
("the", 3), ("looks", 5),
("very", 4), ("cute", 4)))
val avgs = words.map { case (word, count) => (word, count / word.length.toDouble) }
println("My averages are: ")
avgs.take(100).foreach(println)
Supposing you have a paragraph with those words and you want to calculate the mean word length over the paragraph.
In two steps, with a map-reduce approach and in spark-1.5.1:
val words = sc.parallelize(Array(("i", 1), ("have", 4),
("a", 1), ("cat", 6),
("the", 3), ("looks", 5),
("very", 4), ("cute", 4)))
val wordCount = words.map { case (word, count) => count}.reduce((a, b) => a + b)
val wordLength = words.map { case (word, count) => word.length * count}.reduce((a, b) => a + b)
println("The avg length is: " + wordLength / wordCount.toDouble)
I ran this code using an .ipynb connected to a spark-kernel (the output screenshot is omitted here).
If I understand the problem correctly:
val rdd: RDD[(String, Int)] = ???
val ave: RDD[(String, Double)] =
  rdd.map { case (name, numOccurance) =>
    (name, name.length.toDouble / numOccurance)
  }
This is a slightly confusing question. If your data is already in an Array[(String, Int)] collection (presumably after a collect() to the driver), then you need not use any RDD transformations. In fact, there's a nifty trick you can run with fold*() to grab the average over a collection:
val average =
  arr.foldLeft(0.0) { case (sum: Double, (_, count: Int)) => sum + count } /
    arr.foldLeft(0.0) { case (sum: Double, (word: String, count: Int)) => sum + count / word.length }
Kind of long winded, but it essentially aggregates the total number of characters in the numerator and the number of words in the denominator. Run on your example, I see the following:
scala> val arr = Array(("I",1), ("have",4), ("a",1), ("cat",6), ("The", 3), ("looks", 5), ("very" ,4), ("cute",4))
arr: Array[(String, Int)] = Array((I,1), (have,4), (a,1), (cat,6), (The,3), (looks,5), (very,4), (cute,4))
scala> val average = ...
average: Double = 3.111111111111111
If you have your (String, Int) tuples distributed across an RDD[(String, Int)], you can use accumulators to solve this problem quite easily:
val chars = sc.accumulator(0.0)
val words = sc.accumulator(0.0)
wordsRDD.foreach { case (word: String, count: Int) =>
  chars += count; words += count / word.length
}
val average = chars.value / words.value
When running on the above example (placed in an RDD), I see the following:
scala> val arr = Array(("I",1), ("have",4), ("a",1), ("cat",6), ("The", 3), ("looks", 5), ("very" ,4), ("cute",4))
arr: Array[(String, Int)] = Array((I,1), (have,4), (a,1), (cat,6), (The,3), (looks,5), (very,4), (cute,4))
scala> val wordsRDD = sc.parallelize(arr)
wordsRDD: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:14
scala> val chars = sc.accumulator(0.0)
chars: org.apache.spark.Accumulator[Double] = 0.0
scala> val words = sc.accumulator(0.0)
words: org.apache.spark.Accumulator[Double] = 0.0
scala> wordsRDD.foreach { case (word: String, count: Int) =>
| chars += count; words += count / word.length
| }
...
scala> val average = chars.value / words.value
average: Double = 3.111111111111111

In Scala, how do I keep track of running totals without using var?

For example, suppose I wish to read in fat, carbs and protein and wish to print the running total of each variable. An imperative style would look like the following:
var totalFat = 0.0
var totalCarbs = 0.0
var totalProtein = 0.0
var lineNumber = 0

for (lineData <- allData) {
  totalFat += lineData...
  totalCarbs += lineData...
  totalProtein += lineData...
  lineNumber += 1
  printCSV(lineNumber, totalFat, totalCarbs, totalProtein)
}
How would I write the above using only vals?
Use scanLeft.
val zs = allData.scanLeft((0, 0.0, 0.0, 0.0)) { case (r, c) =>
  val lineNr = r._1 + 1
  val fat = r._2 + c...
  val carbs = r._3 + c...
  val protein = r._4 + c...
  (lineNr, fat, carbs, protein)
}

zs foreach Function.tupled(printCSV)
Recursion. Pass the sums from the previous row to a function that will add them to the values from the current row, print them to CSV and pass them to itself, as sketched below.
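A minimal tail-recursive sketch; the Line type and printCSV stand in for the question's elided lineData fields and printing helper:
// hypothetical row type and printCSV, replacing the elided parts of the question
case class Line(fat: Double, carbs: Double, protein: Double)
def printCSV(n: Int, fat: Double, carbs: Double, protein: Double): Unit =
  println(s"$n,$fat,$carbs,$protein")

@annotation.tailrec
def report(rows: List[Line], lineNumber: Int = 1,
           fat: Double = 0.0, carbs: Double = 0.0, protein: Double = 0.0): Unit =
  rows match {
    case Nil => ()
    case row :: rest =>
      val (f, c, p) = (fat + row.fat, carbs + row.carbs, protein + row.protein)
      printCSV(lineNumber, f, c, p)          // print the running totals so far
      report(rest, lineNumber + 1, f, c, p)  // pass the totals to the next call
  }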
You can transform your data with map and get the total result with sum:
val total = allData map { ... } sum
With scanLeft you get the particular sums of each step:
val steps = allData.scanLeft(0) { case (sum,lineData) => sum+lineData}
val result = steps.last
If you want to create several new values in one iteration step, I would prefer a class which holds the values:
case class X(i: Int, str: String)

object X {
  def empty = X(0, "")
}

(1 to 10).scanLeft(X.empty) { case (sum, data) => X(sum.i + data, sum.str + data) }
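For instance, the first few accumulated values look like this (a quick check, assuming the X defined above):
val steps = (1 to 4).scanLeft(X.empty) { case (sum, data) => X(sum.i + data, sum.str + data) }
// steps == Vector(X(0,""), X(1,"1"), X(3,"12"), X(6,"123"), X(10,"1234"))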
It's just a jump to the left,
and then a fold to the right /:
class Data(val a: Int, val b: Int, val c: Int)

val list = List(new Data(3, 4, 5), new Data(4, 2, 3),
  new Data(0, 6, 2), new Data(2, 4, 8))

val res = (new Data(0, 0, 0) /: list) ((acc, x) =>
  new Data(acc.a + x.a, acc.b + x.b, acc.c + x.c))
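With that list the fold accumulates each field; a quick check:
println(s"${res.a} ${res.b} ${res.c}")   // prints: 9 16 18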