This is my DataFrame:
df.groupBy($"label").count.show
+-----+---------+
|label| count|
+-----+---------+
| 0.0|400000000|
| 1.0| 10000000|
+-----+---------+
I am trying to subsample the records with label == 0.0 with the following:
val r = scala.util.Random
val df2 = df.filter($"label" === 1.0 || r.nextDouble > 0.5) // keep 50% of 0.0
My output looks like this:
df2.groupBy($"label").count.show
+-----+--------+
|label| count|
+-----+--------+
| 1.0|10000000|
+-----+--------+
r.nextDouble is evaluated only once, on the driver, when the Column expression is built, so it contributes a constant and the actual evaluation is quite different from what you mean. Depending on the sampled value the expression is either
scala> r.setSeed(0)
scala> $"label" === 1.0 || r.nextDouble > 0.5
res0: org.apache.spark.sql.Column = ((label = 1.0) OR true)
or
scala> r.setSeed(4096)
scala> $"label" === 1.0 || r.nextDouble > 0.5
res3: org.apache.spark.sql.Column = ((label = 1.0) OR false)
so after simplification it is just:
true
(keeping all the records) or
label = 1.0
(keeping only the 1.0 records, the case you observed), respectively.
To generate random numbers per row you should use the corresponding SQL function:
scala> import org.apache.spark.sql.functions.rand
import org.apache.spark.sql.functions.rand
scala> $"label" === 1.0 || rand > 0.5
res1: org.apache.spark.sql.Column = ((label = 1.0) OR (rand(3801516599083917286) > 0.5))
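Applied to the original filter, this keeps all label == 1.0 rows and roughly half of the remaining ones (a minimal sketch; the seed 42 is arbitrary and can be omitted):

val df2 = df.filter($"label" === 1.0 || rand(42) > 0.5)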
That said, Spark already provides stratified sampling tools:
df.stat.sampleBy(
  "label",                     // column
  Map(0.0 -> 0.5, 1.0 -> 1.0), // fractions
  42                           // seed
)
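Note that sampleBy performs per-stratum Bernoulli sampling, so the fractions are approximate rather than exact. A quick sanity check (the counts should come out near 200000000 and 10000000):

df.stat.sampleBy("label", Map(0.0 -> 0.5, 1.0 -> 1.0), 42)
  .groupBy($"label").count.show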
I'm trying to fit a curve with SimpleCurveFitter from commons.math3.fitting in Scala, but I get an exception:
org.apache.commons.math3.exception.ConvergenceException: Unable to perform
QR decomposition on jacobian
However, I have checked my gradient calculations and I still don't see why the exception is raised.
See the code for yourself:
import breeze.linalg.{DenseVector, linspace}
import org.apache.commons.math3.analysis.ParametricUnivariateFunction
import org.apache.commons.math3.fitting.{SimpleCurveFitter, WeightedObservedPoint}
import scala.collection.JavaConverters
import scala.math.{exp, log, pow}

def main(args: Array[String]): Unit = {
  // sample data: 1 for x < 1, exponential decay afterwards
  val xv: DenseVector[Double] = linspace(0, 3, 300)
  val yv: DenseVector[Double] = DenseVector.zeros(300)
  for (i <- xv.findAll(x => x < 1.0)) yv.update(i, 1)
  for (i <- xv.findAll(x => x >= 1.0)) yv.update(i, exp(-(xv(i) - 1.0) / 1))

  val wop: Array[WeightedObservedPoint] = new Array[WeightedObservedPoint](xv.length)
  for (i <- 0 to xv.length - 1) wop.update(i, new WeightedObservedPoint(1, xv(i), yv(i)))

  // model: 1 / (1 + a * x^(2b)), with its analytic gradient
  val f: ParametricUnivariateFunction = new ParametricUnivariateFunction {
    override def value(x: Double, parameters: Double*): Double = {
      val a = parameters(0)
      val b = parameters(1)
      1.0 / (1.0 + a * pow(x, 2 * b))
    }

    override def gradient(x: Double, parameters: Double*): Array[Double] = {
      val a = parameters(0)
      val b = parameters(1)
      val ga = -pow(x, 2 * b) / pow(1 + a * pow(x, 2 * b), 2)
      val gb = -(2 * a * pow(x, 2 * b) * log(x)) / pow(1 + a * pow(x, 2 * b), 2)
      Array(ga, gb)
    }
  }

  val wopc = JavaConverters.asJavaCollection(wop)
  val cf = SimpleCurveFitter.create(f, Array(1.0, 1.0))
  val param = cf.fit(wopc)
  println(param(0), param(1))
}
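For reference, the gradient can be sanity-checked against a central finite difference; a minimal sketch with a hypothetical helper (not part of commons.math3) that agrees with the analytic formulas for x > 0:

// Hypothetical helper: central finite-difference gradient of 1 / (1 + a * x^(2b))
def numGradient(x: Double, a: Double, b: Double, eps: Double = 1e-6): Array[Double] = {
  def v(a: Double, b: Double): Double = 1.0 / (1.0 + a * math.pow(x, 2 * b))
  Array(
    (v(a + eps, b) - v(a - eps, b)) / (2 * eps), // d/da
    (v(a, b + eps) - v(a, b - eps)) / (2 * eps)  // d/db
  )
}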
Thank you for your help :)
I'm quite new to Scala and Spark, and have some questions about displaying results in an output file.
I actually have a Map in which each key is associated with a list of lists (Map[Int, List[List[Double]]]), such as:
(2, List(List(x1, x2, x3), List(y1, y2, y3), ...)).
I am supposed to display for each key the values inside the lists of lists, such as:
2 x1,x2,x3
2 y1,y2,y3
1 z1,z2,z3
and so on.
When I use the saveAsTextFile function, it doesn't give me what I want in the output. Does anybody know how I can do it?
EDIT :
This is one of my functions:
def PrintCluster(vectorsByKey: Map[Int, List[Double]], vectCentroidPairs: Map[Int, Int]): Map[Int, List[Double]] = {
  var vectorsByCentroid: Map[Int, List[Double]] = Map()
  val SortedCentroid = vectCentroidPairs.groupBy(_._2).mapValues(x => x.map(_._1).toList).toSeq.sortBy(_._1).toMap
  SortedCentroid.foreach { case (centroid, vect) =>
    val nbVectors = vect.length
    for (i <- 0 to nbVectors - 1) {
      val vectValues = vectorsByKey(vect(i))
      println(centroid + " " + vectValues)
      vectorsByCentroid += (centroid -> vectValues)
    }
  }
  return vectorsByCentroid
}
I know it's wrong, because I can only associate one unique key with a group of values. That is why it returns only the first list for each key in the Map. I thought that in order to use the saveAsTextFile function I necessarily had to use a Map structure, but I don't really know.
Create a sample RDD as per your input data:
import org.apache.spark.rdd.RDD

val rdd: RDD[Map[Int, List[List[Double]]]] = spark.sparkContext.parallelize(
  Seq(Map(
    2 -> List(List(-4.4, -2.0, 1.5), List(-3.3, -5.4, 3.9), List(-5.8, -3.3, 2.3), List(-5.2, -4.0, 2.8)),
    1 -> List(List(7.3, 1.0, -2.0), List(9.8, 0.4, -1.0), List(7.5, 0.3, -3.0), List(6.1, -0.5, -0.6), List(7.8, 2.2, -0.7), List(6.6, 1.4, -1.1), List(8.1, -0.0, 2.7)),
    3 -> List(List(-3.0, 4.0, 1.4), List(-4.0, 3.9, 0.8), List(-1.4, 4.3, -0.5), List(-1.6, 5.2, 1.0))
  ))
)
Transform the RDD[Map[Int, List[List[Double]]]] into an RDD[(Int, String)]:
val result: RDD[(Int, String)] = rdd.flatMap(i => {
  i.map {
    case (x, y) => y.map(list => (x, list.mkString(" ")))
  }
}).flatMap(z => z)
result.foreach(println)
result.saveAsTextFile("location")
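Note that saveAsTextFile writes each tuple via its toString, so the file will contain lines like (2,-4.4 -2.0 1.5). To get exactly key<TAB>values instead, map to a string first (a minimal sketch):

result
  .map { case (key, values) => s"$key\t$values" }
  .saveAsTextFile("location")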
If you have a plain Map[Int, List[List[Double]]], printing it in the wanted format is simple: first convert the map to a list, then apply flatMap. Using the data supplied in a comment:
val map: Map[Int, List[List[Double]]] = Map(
  2 -> List(List(-4.4, -2.0, 1.5), List(-3.3, -5.4, 3.9), List(-5.8, -3.3, 2.3), List(-5.2, -4.0, 2.8)),
  1 -> List(List(7.3, 1.0, -2.0), List(9.8, 0.4, -1.0), List(7.5, 0.3, -3.0), List(6.1, -0.5, -0.6), List(7.8, 2.2, -0.7), List(6.6, 1.4, -1.1), List(8.1, -0.0, 2.7)),
  3 -> List(List(-3.0, 4.0, 1.4), List(-4.0, 3.9, 0.8), List(-1.4, 4.3, -0.5), List(-1.6, 5.2, 1.0))
)
val list = map.toList.flatMap(t => t._2.map((t._1, _)))
val result = for (t <- list) yield t._1 + "\t" + t._2.mkString(",")
// Saving the result to file
import java.io._
val pw = new PrintWriter(new File("fileName.txt"))
result.foreach{ line => pw.println(line)}
pw.close
Will print out:
2 -4.4,-2.0,1.5
2 -3.3,-5.4,3.9
2 -5.8,-3.3,2.3
2 -5.2,-4.0,2.8
1 7.3,1.0,-2.0
1 9.8,0.4,-1.0
1 7.5,0.3,-3.0
1 6.1,-0.5,-0.6
1 7.8,2.2,-0.7
1 6.6,1.4,-1.1
1 8.1,-0.0,2.7
3 -3.0,4.0,1.4
3 -4.0,3.9,0.8
3 -1.4,4.3,-0.5
3 -1.6,5.2,1.0
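As a side note, if producing the lines can fail midway, a try/finally keeps the writer from leaking (a minimal sketch of the same write as above):

val pw = new PrintWriter(new File("fileName.txt"))
try result.foreach(pw.println)
finally pw.close()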
I am new to functional programming. I have a Seq[Double], and for each value I'd like to check whether it is higher (1), lower (-1), or equal (0) compared to the previous value, like:
val g = Seq(0.1, 0.3, 0.5, 0.5, 0.5, 0.3)
and I'd like to have a result like:
val result = Seq(1, 1, 0, 0, -1)
Is there a more concise way than:
val g = Seq(0.1, 0.3, 0.5, 0.5, 0.5, 0.3)
g.sliding(2).toList.map(xs =>
  if (xs(0) == xs(1)) {
    0
  } else if (xs(0) > xs(1)) {
    -1
  } else {
    1
  }
)
Use compare:
g.sliding(2).map{ case Seq(x, y) => y compare x }.toList
compare is added by an enrichment trait called OrderedProxy.
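For example, with the g above:
scala> g.sliding(2).map{ case Seq(x, y) => y compare x }.toList
res2: List[Int] = List(1, 1, 0, 0, -1)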
That's rather concise in my opinion, but I'd make it a function and pass it into map to make it more readable. I used pattern matching and guards.
//High, low, equal
scala> def hlo(x: Double, y: Double): Int = y - x match {
| case 0.0 => 0
| case x if x < 0.0 => -1
| case x if x > 0.0 => 1
| }
hlo: (x: Double, y: Double)Int
scala> g.sliding(2).map(xs => hlo(xs(0), xs(1))).toList
res9: List[Int] = List(1, 1, 0, 0, -1)
I agree with Travis Brown's comment from above, so I am proposing it as an answer.
Reversing the order of the values in the zip, just to match the order of g. This has the added benefit of using tuples instead of a sequence, so no pattern matching is needed.
(g, g.tail).zipped.toList.map(t => t._2 compare t._1)
res0: List[Int] = List(1, 1, 0, 0, -1)
I wonder if there is a simple way to do something like this in Scala:
case class Pot(width: Int, height: Int, flowers: Seq[FlowerInPot])
case class FlowerInPot(x: Int, y: Int, flower: String)
val flowers = Seq("tulip", "rose")
val height = 3
val width = 3
val res =
  for (flower <- flowers;
       h <- 0 to height;
       w <- 0 to width) yield {
    // ??
  }
and as output I'd like to have a Seq of Pots with all possible combinations of flowers placed in them. So in the following example, the output should be:
Seq(
Pot(3, 3, Seq(FlowerInPot(0, 0, "tulip"), FlowerInPot(0, 1, "rose"))),
Pot(3, 3, Seq(FlowerInPot(0, 0, "tulip"), FlowerInPot(0, 2, "rose"))),
Pot(3, 3, Seq(FlowerInPot(0, 0, "tulip"), FlowerInPot(1, 0, "rose"))),
Pot(3, 3, Seq(FlowerInPot(0, 0, "tulip"), FlowerInPot(1, 1, "rose"))),
...
Pot(3, 3, Seq(FlowerInPot(2, 2, "tulip"), FlowerInPot(2, 1, "rose")))
)
Any ideas?
Is this what you want?
case class FlowerInPot(x: Int, y: Int, flower: String)
case class Pot(width: Int, height: Int, flowers: Seq[FlowerInPot])

val flowers = Seq("tulip", "rose")
val height = 3
val width = 3

val res = for {
  h <- 0 to height
  w <- 0 to width
} yield Pot(width, height, flowers.map(flower => FlowerInPot(w, h, flower)))
I figured it out; for now this solution seems to work:
val res = for {
  h <- 0 to height
  w <- 0 to width
  flower <- flowers
} yield (h, w, flower)

val pots: Seq[Pot] = res
  .sliding(flowers.size)
  .map(l => Pot(width, height, l.map(f => FlowerInPot(f._1, f._2, f._3))))
  .toList
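Note that sliding produces overlapping windows (grouped(flowers.size) would give disjoint groups), and each group places all flowers at the same cell. If the intent is the expected output above, with every flower at a distinct position, here is a sketch using combinations and permutations (using 0 until so the coordinates stay in 0..2, matching the example):

case class FlowerInPot(x: Int, y: Int, flower: String)
case class Pot(width: Int, height: Int, flowers: Seq[FlowerInPot])

val flowers = Seq("tulip", "rose")
val height = 3
val width = 3

// every grid cell
val positions = for { h <- 0 until height; w <- 0 until width } yield (h, w)

// choose flowers.size distinct cells, in every order, and pair them with the flowers
val pots: Seq[Pot] =
  positions.combinations(flowers.size).flatMap(_.permutations).map { ps =>
    Pot(width, height, ps.zip(flowers).map { case ((h, w), f) => FlowerInPot(h, w, f) })
  }.toSeq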