Newbie question, how do you optimize/reduce expressions like these:
when(x1._1,x1._2).when(x2._1,x2._2).when(x3._1,x3._2).when(x4._1,x4._2).when(x5._1,x5._2)....
.when(xX._1,xX._2).otherwise(z)
The x1, x2, x3, ..., xX are pairs where x1._1 is the condition and x1._2 is the "then" value.
I was trying to save the pairs in a list and then use a map-reduce, but it was producing:
when(x1._1,x1._2).otherwise(z) && when(x2._1,x2._2).otherwise(z)...
which is wrong. I have about 10 lines of pure when cases and would like to reduce that so my code is clearer.
You can use foldLeft on the maplist:
val maplist = List(x1, x2) // add more x if needed
val new_col = maplist.tail.foldLeft(when(maplist.head._1, maplist.head._2))((x,y) => x.when(y._1, y._2)).otherwise(z)
An alternative is to use coalesce. A when without an otherwise returns null when its condition is not met, so coalesce moves on to the next when expression until a non-null result (or the default z) is obtained.
val new_col = coalesce((maplist.map(x => when(x._1, x._2)) :+ z):_*)
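For illustration, here is a minimal sketch of the coalesce approach with hypothetical condition/value pairs (the column names and values are made up):

import org.apache.spark.sql.functions.{coalesce, col, lit, when}

// hypothetical (condition, value) pairs, analogous to x1, x2, ... in the question
val maplist = Seq(
  (col("age") < 13, lit("child")),
  (col("age") < 20, lit("teen"))
)
val z = lit("adult")

// each when(...) without otherwise yields null when its condition fails,
// so coalesce falls through to the next branch and finally to the default z
val new_col = coalesce((maplist.map { case (cond, value) => when(cond, value) } :+ z): _*)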
You could create a simple recursive method to assemble the nested-when/otherwise condition:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, when}

def nestedCond(cols: Array[String], default: String): Column = {
  def loop(ls: List[String]): Column = ls match {
    case Nil => col(default)
    case c :: tail => when(col(s"$c._1"), col(s"$c._2")).otherwise(loop(tail))
  }
  loop(cols.toList).as("nested-cond")
}
Testing the method:
val df = Seq(
  ((false, 1), (false, 2), (true, 3), 88),
  ((false, 4), (true, 5), (true, 6), 99)
).toDF("x1", "x2", "x3", "z")
val cols = df.columns.filter(_.startsWith("x"))
// cols: Array[String] = Array(x1, x2, x3)
df.select(nestedCond(cols, "z")).show
// +-----------+
// |nested-cond|
// +-----------+
// | 3|
// | 5|
// +-----------+
Alternatively, use foldRight to assemble the nested-condition:
def nestedCond(cols: Array[String], default: String): Column =
  cols.foldRight(col(default)) { (c, acc) =>
    when(col(s"$c._1"), col(s"$c._2")).otherwise(acc)
  }.as("nested-cond")
Another way is to pass the otherwise value as the initial value for foldLeft:
val maplist = Seq(Map(col("c1") -> "value1"), Map(col("c2") -> "value2"))
val newCol = maplist.flatMap(_.toSeq).foldLeft(lit("z")) {
  case (acc, (cond, value)) => when(cond, value).otherwise(acc)
}
// gives:
// newCol: org.apache.spark.sql.Column = CASE WHEN c2 THEN value2 ELSE CASE WHEN c1 THEN value1 ELSE z END END
In Scala I have a list of functions that return a value. The order in which the functions are executed is important, since the argument of function n is the output of function n-1.
This hints at using foldLeft, something like:
val base: A
val funcs: Seq[Function[A, A]]
funcs.foldLeft(base)((x, f) => f(x))
(detail: type A is actually a Spark DataFrame).
However, the results of each function are mutually exclusive, and in the end I want the union of all the results for each function.
This hints at using a map, something like:
funcs.map(f => f(base)).reduce(_.union(_))
But here each function is applied to base which is not what I want.
In short: a variable-length list of ordered functions needs to return a list of return values of equal length, where each value n-1 is the input for function n (starting from base at n=0), such that the result values can be concatenated.
How can I achieve this?
EDIT
example:
case class X(id:Int, value:Int)
val base = spark.createDataset(Seq(X(1, 1), X(2, 2), X(3, 3), X(4, 4), X(5, 5))).toDF
def toA = (x: DataFrame) => x.filter('value.mod(2) === 1).withColumn("value", lit("a"))
def toB = (x: DataFrame) => x.withColumn("value", lit("b"))
val a = toA(base)
val remainder = base.join(a, Seq("id"), "leftanti")
val b = toB(remainder)
a.union(b)
+---+-----+
| id|value|
+---+-----+
| 1| a|
| 3| a|
| 5| a|
| 2| b|
| 4| b|
+---+-----+
This should work for an arbitrary number of functions (e.g. toA, toB, ..., toN), where each time the remainder of the previous result is calculated and passed into the next function. In the end, a union is applied to all results.
Seq already has a method scanLeft that does this out-of-the-box:
funcs.scanLeft(base)((acc, f) => f(acc)).tail
Make sure to drop the first element of the result of scanLeft if you don't want base to be included.
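A minimal sketch of the difference, with plain Ints standing in for DataFrames:

val base = 7
val funcs: List[Int => Int] = List(_ * 2, _ + 3)

funcs.scanLeft(base)((acc, f) => f(acc))      // List(7, 14, 17) -- base included
funcs.scanLeft(base)((acc, f) => f(acc)).tail // List(14, 17)    -- base dropped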
It is also possible using only foldLeft:
funcs.foldLeft((base, List.empty[A])){ case ((x, list), f) =>
val res = f(x)
(res, res :: list)
}._2.reverse.reduce(_.union(_))
Or:
funcs.foldLeft((base, Vector.empty[A])){ case ((x, list), f) =>
val res = f(x)
(res, list :+ res)
}._2.reduce(_.union(_))
The trick is to accumulate into a Seq inside the fold.
Example:
scala> val base = 7
base: Int = 7
scala> val funcs: List[Int => Int] = List(_ * 2, _ + 3)
funcs: List[Int => Int] = List($$Lambda$1772/1298658703#7d46af18, $$Lambda$1773/107346281#5470fb9b)
scala> funcs.foldLeft((base, Vector.empty[Int])){ case ((x, list), f) =>
| val res = f(x)
| (res, list :+ res)
| }._2
res8: scala.collection.immutable.Vector[Int] = Vector(14, 17)
scala> res8.reduce(_ + _)
res9: Int = 31
I've got a simplified solution using normal collections but the same principle applies.
def times2(l: List[Int]): List[Int] = l.map(_ * 2)
def by2(l: List[Int]): List[Int] = l.map(_ / 2)

val list: List[Int] = List(1, 2, 3, 4, 5)
val funcs: Seq[Function[List[Int], List[Int]]] = Seq(times2, by2)

funcs.foldLeft(list) { case (collection, func) => func(collection) } foreach println // prints 1 2 3 4 5
This solution does not hold if you want a single reduced value as your final output (e.g. a single Int); it works as F[B] -> F[B] -> F[B] rather than F[B] -> F[B] -> B, though I guess the former is what you need.
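If a single reduced value were ever needed instead, a trailing fold/reduce over the final collection would do it; a small sketch (not part of the original solution):

// thread the list through the functions, then collapse the result into one Int
val total = funcs.foldLeft(list) { case (collection, func) => func(collection) }.sum
// total: Int = 15, since times2 followed by by2 leaves the list unchanged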
According to this link: https://github.com/amplab/training/blob/ampcamp6/machine-learning/scala/solution/MovieLensALS.scala
I don't understand the point of:
val numUsers = ratings.map(_._2.user).distinct.count
val numMovies = ratings.map(_._2.product).distinct.count
_._2.[user|product], what does that mean?
That is accessing the tuple elements. The following example might explain it better:
val xs = List(
  (1, "Foo"),
  (2, "Bar")
)
xs.map(_._1) // => List(1,2)
xs.map(_._2) // => List("Foo", "Bar")
// An equivalent way to write this
xs.map(e => e._1)
xs.map(e => e._2)
// Perhaps a better way is
xs.collect {case (a, b) => a} // => List(1,2)
xs.collect {case (a, b) => b} // => List("Foo", "Bar")
ratings is a collection of tuples: (timestamp % 10, Rating(userId, movieId, rating)).
The first underscore in _._2.user refers to the current element being processed by the map function, so here it refers to one of those tuples (a pair of values). For a pair tuple t you can refer to its first and second elements with the shorthand t._1 and t._2.
So _._2 selects the second element (the Rating) of the tuple currently being processed by the map function, and .user then reads its user field.
val ratings = sc.textFile(movieLensHomeDir + "/ratings.dat").map { line =>
  val fields = line.split("::")
  // format: (timestamp % 10, Rating(userId, movieId, rating))
  (fields(3).toLong % 10, Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble))
}
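A small self-contained sketch (with a hypothetical local Rating class and data) showing the same access pattern on such tuples:

case class Rating(user: Int, product: Int, rating: Double)

val pairs = Seq(
  (3L, Rating(1, 10, 5.0)),
  (7L, Rating(2, 10, 3.0)),
  (3L, Rating(1, 20, 4.0))
)

pairs.map(_._2.user)               // List(1, 2, 1): the user of each Rating
pairs.map(_._2.user).distinct.size // 2: the number of distinct users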
I've got a LabeledPoint and a list of features that I want to transform:
scala> transformedData.collect()
res29: Array[org.apache.spark.mllib.regression.LabeledPoint] = Array((0.0,(400036,[7744],[2.0])), (0.0,(400036,[7744,8608],[3.0,3.0])), (0.0,(400036,[7744],[2.0])), (0.0,(400036,[133,218,2162,7460,7744,9567],[1.0,1.0,2.0,1.0,42.0,21.0])), (0.0,(400036,[133,218,1589,2162,2784,2922,3274,6914,7008,7131,7460,8608,9437,9567,199999,200021,200035,200048,200051,200056,200058,200064,200070,200072,200075,200087,400008,400011],[4.0,1.0,6.0,53.0,6.0,1.0,1.0,2.0,11.0,17.0,48.0,3.0,4.0,113.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,28.0,1.0,1.0,1.0,1.0,1.0,4.0])), (0.0,(400036,[1589,3585,4830,6935,6936,7744,400008,400011],[2.0,6.0,3.0,52.0,4.0,3.0,1.0,2.0])), (0.0,(400036,[1589,2162,2784,2922,4123,7008,7131,7792,8608],[23.0,70.0,1.0,2.0,2.0,1.0,1.0,2.0,2.0])), (0.0,(400036,[4830,6935,6936,400008,400011],[1.0,36.0...
val toTransform = List(124,443,543,211,...
The transformation that I want to use looks like this:
Take the natural logarithm of (feature value+1): new_val=log(val+1)
Divide new values by maximum of new values: new_val/max(new_val) (if max not equal to 0)
How can I perform this transformation for each feature from my toTransform list? (I don't want to create new features, just transform the old ones.)
It is possible but not exactly straightforward. If you can transform values before you assemble vectors and labeled points, then the answer provided by @eliasah should do the trick. Otherwise you have to do things the hard way. Let's assume your data looks like this:
import org.apache.spark.mllib.linalg.{Vector, Vectors, SparseVector, DenseVector}
import org.apache.spark.mllib.regression.LabeledPoint
val points = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.sparse(6, Array(1, 4, 5), Array(2.0, 6.0, 3.0))),
  LabeledPoint(2.0, Vectors.sparse(6, Array(2, 3), Array(0.1, 1.0)))
))
Next, let's define a small helper:
import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
def toBreeze(v: Vector): BV[Double] = v match {
  case DenseVector(values) => new BDV[Double](values)
  case SparseVector(size, indices, values) =>
    new BSV[Double](indices, values, size)
}
and disassemble LabeledPoints as follows:
val pairs = points.map(lp => (lp.label, toBreeze(lp.features)))
Now we can define a transformation function:
def transform(indices: Seq[Int])(v: BV[Double]) = {
  for (i <- indices) v(i) = breeze.numerics.log(v(i) + 1.0)
  v
}
and transform pairs:
val indices = Array(2, 4)
val transformed = pairs.mapValues(transform(indices))
Finally, let's find the maximum values:
val maxV = transformed.values.reduce(breeze.linalg.max(_, _))
def divideByMax(m: BV[Double], indices: Seq[Int])(v: BV[Double]) = {
  for (i <- indices) if (m(i) != 0) v(i) /= m(i)
  v
}
val divided = transformed.mapValues(divideByMax(maxV, indices))
and map back to LabeledPoints:
def toSpark(v: BV[Double]) = v match {
  case v: BDV[Double] => new DenseVector(v.toArray)
  case v: BSV[Double] => new SparseVector(v.length, v.index, v.data)
}
divided.map{case (l, v) => LabeledPoint(l, toSpark(v))}
@zero323 is right; you'd better flatten your LabeledPoints, then you can do the following:
// create a UDF to apply the transformation
def transform(max: Double) = udf[Double, Double] { c => Math.log1p(c) / max }
// create dummy data
val df = sc.parallelize(Seq(1, 2, 3, 4, 5, 4, 3, 2, 1)).toDF("feature")
// get the max value of the feature
val maxFeat = df.agg(max($"feature")).rdd.map { case r: Row => r.getInt(0) }.max
// apply the transformation on your feature column
val newDf = df.withColumn("norm", transform(maxFeat)($"feature"))
newDf.show
// +-------+-------------------+
// |feature| norm|
// +-------+-------------------+
// | 1|0.13862943611198905|
// | 2|0.21972245773362192|
// | 3| 0.2772588722239781|
// | 4|0.32188758248682003|
// | 5| 0.358351893845611|
// | 4|0.32188758248682003|
// | 3| 0.2772588722239781|
// | 2|0.21972245773362192|
// | 1|0.13862943611198905|
// +-------+-------------------+
I am cogrouping two RDDs and I want to process their values. That is:
rdd1.cogroup(rdd2)
As a result of this cogrouping I get results like the one below:
(ion,(CompactBuffer(100772C121, 100772C111, 6666666666),CompactBuffer(100772C121)))
Considering this result, I would like to obtain all distinct pairs for the key 'ion', e.g.:
100772C121 - 100772C111
100772C121 - 6666666666
100772C111 - 6666666666
How can I do this in Scala?
You could try something like the following:
(l1 ++ l2).distinct.combinations(2).map { case Seq(x, y) => (x, y) }.toList
You would need to replace l1 and l2 with your CompactBuffer fields. When I tried this locally, I got this (which is what I believe you want):
scala> val l1 = List("100772C121", "100772C111", "6666666666")
l1: List[String] = List(100772C121, 100772C111, 6666666666)
scala> val l2 = List("100772C121")
l2: List[String] = List(100772C121)
scala> val combine = (l1 ++ l2).distinct.combinations(2).map { case Seq(x, y) => (x, y) }.toList
combine: List[(String, String)] = List((100772C121,100772C111), (100772C121,6666666666), (100772C111,6666666666))
If you would like all of these pairs on separate rows, you can enclose this logic within a flatMap.
EDIT: Added steps per your example above.
scala> val rdd1 = sc.parallelize(Array(("ion", "100772C121"), ("ion", "100772C111"), ("ion", "6666666666")))
rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[0] at parallelize at <console>:12
scala> val rdd2 = sc.parallelize(Array(("ion", "100772C121")))
rdd2: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[1] at parallelize at <console>:12
scala> val cgrp = rdd1.cogroup(rdd2).flatMap {
| case (key: String, (l1: Iterable[String], l2: Iterable[String])) =>
| (l1.toSeq ++ l2.toSeq).distinct.combinations(2).map { case Seq(x, y) => (x, y) }.toList
| }
cgrp: org.apache.spark.rdd.RDD[(String, String)] = FlatMappedRDD[4] at flatMap at <console>:16
scala> cgrp.foreach(println)
...
(100772C121,100772C111)
(100772C121,6666666666)
(100772C111,6666666666)
EDIT 2: Updated again per your use case.
scala> val cgrp = rdd1.cogroup(rdd2).flatMap {
| case (key: String, (l1: Iterable[String], l2: Iterable[String])) =>
| for { e1 <- l1.toSeq; e2 <- l2.toSeq; if (e1 != e2) }
| yield if (e1 > e2) ((e1, e2), 1) else ((e2, e1), 1)
| }.reduceByKey(_ + _)
...
((6666666666,100772C121),2)
((6666666666,100772C111),1)
((100772C121,100772C111),1)
In ML, one can assign names for each element of a matched pattern:
fun findPair n nil = NONE
  | findPair n ((head as (n1, _)) :: rest) =
      if n = n1 then (SOME head) else (findPair n rest)
In this code, I defined an alias for the first pair of the list and matched the contents of the pair. Is there an equivalent construct in Scala?
You can do variable binding with the @ symbol, e.g.:
scala> val wholeList @ List(x, _*) = List(1,2,3)
wholeList: List[Int] = List(1, 2, 3)
x: Int = 1
I'm sure you'll get a more complete answer later as I'm not sure how to write it recursively like your example, but maybe this variation would work for you:
scala> val pairs = List((1, "a"), (2, "b"), (3, "c"))
pairs: List[(Int, String)] = List((1,a), (2,b), (3,c))
scala> val n = 2
n: Int = 2
scala> pairs find {e => e._1 == n}
res0: Option[(Int, String)] = Some((2,b))
OK, next attempt at direct translation. How about this?
scala> def findPair[A, B](n: A, p: List[Tuple2[A, B]]): Option[Tuple2[A, B]] = p match {
| case Nil => None
| case head::rest if head._1 == n => Some(head)
| case _::rest => findPair(n, rest)
| }
findPair: [A, B](n: A, p: List[(A, B)])Option[(A, B)]
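To mirror the ML `as` binding more directly, the head pair can also be aliased with @ inside the pattern; a sketch along the same lines:

def findPairAlias[A, B](n: A, p: List[(A, B)]): Option[(A, B)] = p match {
  case Nil => None
  case (head @ (n1, _)) :: _ if n1 == n => Some(head) // head is bound to the whole pair
  case _ :: rest => findPairAlias(n, rest)
}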