How to subtract two doubles? - scala

I want to compute the IQR (interquartile range) in Spark. How can I subtract two values (double or int)?
I tried the code below but only got an error:
scala> Q1
res103: org.apache.spark.sql.Row = [11.09314]
scala> Q3
res104: org.apache.spark.sql.Row = [34.370419]
scala> val IRQ = Math.abs(Q3-Q1)
<console>:43: error: value - is not a member of org.apache.spark.sql.Row
val IRQ = Math.abs(Q3-Q1)

This should do the trick:
val IQR = Math.abs(Q3.getDouble(0) - Q1.getDouble(0))
See the docs for the Row interface here.
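If Q1 and Q3 came out of a DataFrame with a named column, you can also extract the values by name; a small sketch (the column name "value" is an assumption, not something from your snippet):
// getAs[T](fieldName) is the by-name counterpart of getDouble(index)
val iqr = Math.abs(Q3.getAs[Double]("value") - Q1.getAs[Double]("value"))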

Related

How to calculate standard deviation and average values of RDD[Long]?

I have an RDD[Long] called mod and I want to compute the standard deviation and mean for this RDD using Spark 2.2 and Scala 2.11.8.
How can I do it?
I tried to calculate the average value as follows, but is there an easier way to get these values?
import org.apache.spark.sql.functions.{avg, stddev}
val avg_val = mod.toDF("col").agg(
  avg($"col").as("avg")
).first().getDouble(0)
val stddev_val = mod.toDF("col").agg(
  stddev($"col").as("stddev")
).first().getDouble(0)
Just use stats:
scala> val mod = sc.parallelize(Seq(1L, 3L, 5L))
mod: org.apache.spark.rdd.RDD[Long] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val stats = mod.stats
stats: org.apache.spark.util.StatCounter = (count: 3, mean: 3.000000, stdev: 1.632993, max: 5.000000, min: 1.000000)
scala> stats.mean
res0: Double = 3.0
scala> stats.stdev
res1: Double = 1.632993161855452
It uses the same internals as stdev and mean, but scans the data only once.
With Dataset I'd recommend:
val (avg_val, stddev_val) = mod.toDS
  .agg(mean("value"), stddev("value"))
  .as[(Double, Double)].first
or
import org.apache.spark.sql.Row
val Row(avg_val: Double, stddev_val: Double) = mod.toDS
  .agg(mean("value"), stddev("value"))
  .first
but it is neither necessary nor useful here.
I think this is pretty simple:
mod.stdev()
mod.mean()
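Both calls work because RDD[Long] picks up mean() and stdev() through the implicit conversion to DoubleRDDFunctions; just note that each call scans the data again, whereas stats above computes everything in one pass. An illustrative sketch:
// each of these triggers its own pass over mod
val mu = mod.mean()
val sd = mod.stdev()
// stats() returns count, mean, stdev, min and max from a single pass
val all = mod.stats()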

Permutations of Models in Spark mllib

I am using Spark 1.6.1 with Scala 2.10.5. I am building a set of permutation models using binary numbers, but I've hit a wall for a few days now. Here is the code (partially taken from zero323 at RDD to LabeledPoint conversion):
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import scala.util.Random.{setSeed, nextDouble}
setSeed(123)
case class Record(
  target: Double, x1: Double, x2: Double, x3: Double, x4: Double, x5: Double, x6: Double)
val rows = sc.parallelize(
  (1 to 50).map(_ => Record(
    nextDouble, nextDouble, nextDouble, nextDouble, nextDouble, nextDouble, nextDouble)))
val df = sqlContext.createDataFrame(rows)
df.registerTempTable("df")
sqlContext.sql("""
SELECT ROUND(target, 2) target,
ROUND(x1, 2) x1,
ROUND(x2, 2) x2,
ROUND(x3, 2) x3,
ROUND(x4, 2) x4,
ROUND(x5, 2) x5,
ROUND(x6, 2) x6
FROM df""").show
Now, I want to build binary numbers with 6 digits; the largest one, 111111, equates to 63 in base 10:
val max_base10=63
val max_binary="%06d".format(max_base10.toBinaryString.toLong)
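To sanity-check the formatting trick by hand (worked examples, not REPL output):
63.toBinaryString                        // "111111"
"%06d".format(5.toBinaryString.toLong)   // 5 -> "101" -> left-padded to "000101"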
After that, I create a definition which allows column indexes to be permuted in ascending order:
def binary_permutationsSeg1 = for {
  a <- 1 to max_base10
} yield List(Array(0), "%06d".format(a.toBinaryString.toLong).split("").map(_.toInt)).flatten // Array(0) is added as target is not an explanatory variable
This is just to make sure I get what I'm after:
binary_permutationsSeg1(62)
Now I wish to build my set of permutation models, for which I MUST have an intercept. I've learned the hard way that train() does not fit an intercept:
var regression = new LinearRegressionWithSGD().setIntercept(true)
regression.optimizer.setStepSize(0.1)
regression.optimizer.setNumIterations(100)
This bit is not very elegant, as I create a double def. Also, I'd like to automatically create multiModel01, multiModel02, etc. in memory, but I'm not able to do it:
def multiModel = for {
  a <- 1 to binary_permutationsSeg1.size - 1
} yield df.rdd.map(r => LabeledPoint(
  r.getDouble(1),
  Vectors.dense(binary_permutationsSeg1(a).zipWithIndex.filter(_._1 == 1).unzip._2.map(r.getDouble(_)).toArray))).cache()
def run_multiModel = for {
  a <- 1 to binary_permutationsSeg1.size - 1
} yield regression.run(multiModel(a))
run_multiModel
When I run the line above, I don't get the 62 models I am after, which leaves me perplexed.
These last two paragraphs are to do with model evaluation (taken from zero323 again), but I cannot make it automatically construct valuesAndPreds01, valuesAndPreds02 and so forth, whilst bearing in mind that some permutations will give NaN.
val valuesAndPreds = df.rdd.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
Thanks for taking the time to look at this problem and thanks for any suggestions or tips in the right direction.
Regards,
Christian

How to convert spark DataFrame to RDD mllib LabeledPoints?

I tried to apply PCA to my data and then apply RandomForest to the transformed data. However, PCA.transform(data) gave me a DataFrame, but I need mllib LabeledPoints to feed my RandomForest. How can I do that?
My code:
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.ml.feature.PCA
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
val dataset = MLUtils.loadLibSVMFile(sc, "data/mnist/mnist.bz2")
val splits = dataset.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))
val trainingDf = trainingData.toDF()
val pca = new PCA()
.setInputCol("features")
.setOutputCol("pcaFeatures")
.setK(100)
.fit(trainingDf)
val pcaTrainingData = pca.transform(trainingDf)
val numClasses = 10
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 20
val maxBins = 32
val model = RandomForest.trainClassifier(pcaTrainingData, numClasses, categoricalFeaturesInfo,
numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
error: type mismatch;
found : org.apache.spark.sql.DataFrame
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint]
I tried the following two possible solutions but they didn't work:
scala> val pcaTrainingData = trainingData.map(p => p.copy(features = pca.transform(p.features)))
<console>:39: error: overloaded method value transform with alternatives:
(dataset: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame <and>
(dataset: org.apache.spark.sql.DataFrame,paramMap: org.apache.spark.ml.param.ParamMap)org.apache.spark.sql.DataFrame <and>
(dataset: org.apache.spark.sql.DataFrame,firstParamPair: org.apache.spark.ml.param.ParamPair[_],otherParamPairs: org.apache.spark.ml.param.ParamPair[_]*)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.mllib.linalg.Vector)
And:
val labeled = pca
.transform(trainingDf)
.map(row => LabeledPoint(row.getDouble(0), row(4).asInstanceOf[Vector[Int]]))
error: type mismatch;
found : scala.collection.immutable.Vector[Int]
required: org.apache.spark.mllib.linalg.Vector
(I have imported org.apache.spark.mllib.linalg.Vectors in the above case)
Any help?
The correct approach here is the second one you tried - mapping each Row into a LabeledPoint to get an RDD[LabeledPoint]. However, it has two mistakes:
The correct Vector class (org.apache.spark.mllib.linalg.Vector) does NOT take type arguments (e.g. Vector[Int]) - so even though you had the right import, the compiler concluded that you meant scala.collection.immutable.Vector which DOES.
The DataFrame produced by the PCA transformation has 3 columns, and you tried to extract column number 4. For example, showing the first 4 lines:
+-----+--------------------+--------------------+
|label| features| pcaFeatures|
+-----+--------------------+--------------------+
| 5.0|(780,[152,153,154...|[880.071111851977...|
| 1.0|(780,[158,159,160...|[-41.473039034112...|
| 2.0|(780,[155,156,157...|[931.444898405036...|
| 1.0|(780,[124,125,126...|[25.5114585648411...|
+-----+--------------------+--------------------+
To make this easier - I prefer using the column names instead of their indices.
So here's the transformation you need:
val labeled = pca.transform(trainingDf).rdd.map(row => LabeledPoint(
row.getAs[Double]("label"),
row.getAs[org.apache.spark.mllib.linalg.Vector]("pcaFeatures")
))
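Once labeled is an RDD[LabeledPoint] it can go straight into the trainer that rejected the DataFrame; a sketch reusing the parameters from your snippet:
val model = RandomForest.trainClassifier(labeled, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)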

Spark RDD map internal object to Row

My initial data from a CSV file is:
1 ,21658392713 ,21626890421
1 ,21623461747 ,21626890421
1 ,21623461747 ,21626890421
After a few transformations and grouping based on business logic, my data looks like this:
scala> val sGrouped = grouped
sGrouped: org.apache.spark.rdd.RDD[(String, Iterable[(String,
(Array[String], String))])] = ShuffledRDD[85] at groupBy at <console>:51
scala> sGrouped.foreach(f=>println(f))
(21626890421,CompactBuffer((21626890421,
([Ljava.lang.String;@62ac8444,21626890421)),
(21626890421,([Ljava.lang.String;@59d80fe,21626890421)),
(21626890421,([Ljava.lang.String;@270042e8,21626890421)),
From this, I want to get a map in something like the following format:
[String, Row[String]]
so the data may look like:
[ 21626890421 , Row[(1 ,21658392713 ,21626890421)
, (1 ,21623461747 ,21626890421)
, (1 ,21623461747,21626890421)]]
I really appreciate any guidance on moving forward on this.
I found the answer, but I am not sure if this is an efficient way; any better approaches are appreciated, as this feels more like a hack.
scala> import org.apache.spark.sql.Row
scala> val grouped = cToP.groupBy(_._1)
grouped: org.apache.spark.rdd.RDD[(String, Iterable[(String,
(Array[String], String))])]
scala> val sGrouped = grouped.map(f => f._2.toList)
sGrouped: org.apache.spark.rdd.RDD[List[(String, (Array[String],
String))]]
scala> val tGrouped = sGrouped.map(f =>f.map(_._2).map(c =>
Row(c._1(0), c._1(12), c._1(18))))
tGrouped: org.apache.spark.rdd.RDD[List[org.apache.spark.sql.Row]] =
MapPartitionsRDD[42] a
scala> tGrouped.foreach(f => println(f))
yields
List([1,21658392713,21626890421], [1,21623461747,21626890421],
[1,21623461747,21626890421])
scala> tGrouped.count()
res6: Long = 1
The answer I am getting is correct, and even the count is correct. However, I do not understand why the count is 1.
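The count is 1 because tGrouped is an RDD[List[Row]]: the map above produces one List per group key, and the sample data contains a single key (21626890421), so the RDD holds exactly one element. If one Row per record is wanted instead, the lists can be flattened; a small sketch against the tGrouped built above:
val flatRows = tGrouped.flatMap(identity)   // RDD[org.apache.spark.sql.Row]
flatRows.count()                            // would be 3 for the sample data above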

Spark: Summary statistics

I am trying to use Spark summary statistics as described at: https://spark.apache.org/docs/1.1.0/mllib-statistics.html
According to the Spark docs:
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.mllib.linalg.DenseVector
val observations: RDD[Vector] = ... // an RDD of Vectors
// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
I have a problem building the observations: RDD[Vector] object. I try:
scala> val data:Array[Double] = Array(1, 2, 3, 4, 5)
data: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0)
scala> val v = new DenseVector(data)
v: org.apache.spark.mllib.linalg.DenseVector = [1.0,2.0,3.0,4.0,5.0]
scala> val observations = sc.parallelize(Array(v))
observations: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.DenseVector] = ParallelCollectionRDD[3] at parallelize at <console>:19
scala> val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
<console>:21: error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.DenseVector]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
Note: org.apache.spark.mllib.linalg.DenseVector <: org.apache.spark.mllib.linalg.Vector, but class RDD is invariant in type T.
You may wish to define T as +T instead. (SLS 4.5)
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
Questions:
1) How should I cast DenseVector to Vector?
2) In my real program, instead of an array of doubles, I have to get statistics on a collection that I get from an RDD using:
def countByKey(): Map[K, Long]
//Count the number of elements for each key, and return the result to the master as a Map.
So I have to do:
myRdd.countByKey().values.map(_.toDouble)
This does not make much sense, because instead of working with RDDs I now have to work with regular Scala collections, which at some point stop fitting into memory. All the advantages of Spark's distributed computation are lost.
How can I solve this in a scalable manner?
Update
In my case I have:
val cnts: org.apache.spark.rdd.RDD[Int] = prodCntByCity.map(_._2) // get product counts only
val doubleCnts: org.apache.spark.rdd.RDD[Double] = cnts.map(_.toDouble)
How to convert doubleCnts into observations: RDD[Vector] ?
1) You don't need to cast, you just need to type:
val observations = sc.parallelize(Array(v: Vector))
2) Use aggregateByKey (map all the keys to 1, and reduce by summing) rather than countByKey.
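For example, a rough sketch of that idea (the pair RDD built here is just illustrative sample data, and reduceByKey plays the reduce-by-summing role): the per-key counts stay distributed and can be turned into an RDD[Vector] for colStats:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("b", 1)))    // stand-in for the real keyed RDD
val counts = pairs.mapValues(_ => 1L).reduceByKey(_ + _)         // like countByKey, but stays an RDD
val observations = counts.map { case (_, c) => Vectors.dense(c.toDouble) }
val summary = Statistics.colStats(observations)                  // stats over the counts, no collect to the driver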
DenseVector has a compressed method, so you can change the RDD[DenseVector] to an RDD[Vector] like this:
val st = observations.map(x => x.compressed)
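The resulting st is an RDD[Vector], so it can be passed straight to the summary call from the question:
val summary = Statistics.colStats(st)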