Flink linearRegression: how to load data (Scala)

I'm starting to train a Multiple Linear Regression algorithm in Flink.
I'm following the awesome official documentation and quickstart. I am using Zeppelin to develop this code.
If I load the data from a CSV file:
//Read the file:
val data = benv.readCsvFile[(Int, Double, Double, Double)]("/.../quake.csv")
val mapped = data.map {x => new org.apache.flink.ml.common.LabeledVector (x._4, org.apache.flink.ml.math.DenseVector(x._1,x._2,x._3)) }
//Data created:
mapped: org.apache.flink.api.scala.DataSet[org.apache.flink.ml.common.LabeledVector] = org.apache.flink.api.scala.DataSet#7cb37ad3
LabeledVector(6.7, DenseVector(33.0, -52.26, 28.3))
LabeledVector(5.8, DenseVector(36.0, 45.53, 150.93))
LabeledVector(5.8, DenseVector(57.0, 41.85, 142.78))
//Predict with the model created:
val predictions: DataSet[org.apache.flink.ml.common.LabeledVector] = mlr.predict(mapped)
If I load the data from a LIBSVM file:
val testingDS: DataSet[(Vector, Double)] = MLUtils.readLibSVM(benv, "/home/borja/Desktop/bbb/quake.libsvm").map(x => (x.vector, x.label))
But I got these errors:
->CSV:
res13: org.apache.flink.api.scala.DataSet[org.apache.flink.ml.common.LabeledVector] = org.apache.flink.api.scala.DataSet#7cb37ad3
<console>:89: error: type mismatch;
found : org.apache.flink.api.scala.DataSet[Any]
required: org.apache.flink.api.scala.DataSet[org.apache.flink.ml.common.LabeledVector]
Note: Any >: org.apache.flink.ml.common.LabeledVector, but class DataSet is invariant in type T.
You may wish to define T as -T instead. (SLS 4.5)
Error occurred in an application involving default arguments.
val predictions:DataSet[org.apache.flink.ml.common.LabeledVector] = mlr.predict(mapped)
->LIBSVM:
<console>:111: error: type Vector takes type parameters
val testingDS: DataSet[(Vector, Double)] = MLUtils.readLibSVM(benv, "/home/borja/Desktop/bbb/quake.libsvm").map(x => (x.vector, x.label))
Ok, so I wrote:
New Code:
val testingDS: DataSet[(Vector[org.apache.flink.ml.math.Vector], Double)] = MLUtils.readLibSVM(benv, "/home/borja/Desktop/bbb/quake.libsvm").map(x => (x.vector, x.label))
New Error:
<console>:111: error: type mismatch;
found : org.apache.flink.ml.math.Vector
required: scala.collection.immutable.Vector[org.apache.flink.ml.math.Vector]
val testingDS: DataSet[(Vector[org.apache.flink.ml.math.Vector], Double)] = MLUtils.readLibSVM(benv, "/home/borja/Desktop/bbb/quake.libsvm").map(x => (x.vector, x.label))
I would really appreciate your help! :)

You should not import and use the Scala Vector class here; FlinkML ships its own Vector type. This should work:
val testingDS: DataSet[(org.apache.flink.ml.math.Vector, Double)] = MLUtils.readLibSVM(benv, "/home/borja/Desktop/bbb/quake.libsvm").map(x => (x.vector, x.label))
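For completeness, here is a minimal end-to-end sketch in the spirit of the FlinkML quickstart, assuming the same benv and quake.libsvm path as above; the iteration and step-size values are just placeholders:
import org.apache.flink.api.scala._
import org.apache.flink.ml.MLUtils
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.Vector
import org.apache.flink.ml.regression.MultipleLinearRegression

// readLibSVM already yields a DataSet[LabeledVector]
val libsvmDS: DataSet[LabeledVector] =
  MLUtils.readLibSVM(benv, "/home/borja/Desktop/bbb/quake.libsvm")

// Split into (features, label) pairs using FlinkML's own Vector type
val testingDS: DataSet[(Vector, Double)] = libsvmDS.map(x => (x.vector, x.label))

// Placeholder hyperparameters; tune them for your data
val mlr = MultipleLinearRegression()
  .setIterations(10)
  .setStepsize(0.5)

mlr.fit(libsvmDS)                                      // train on the labeled data
val predictions = mlr.predict(libsvmDS.map(_.vector))  // DataSet[(Vector, Double)]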

Related

Why do I get a type mismatch in Scala Spark?

First, I read a text file and turn it into an RDD[(String, (String, Float))]:
val data = sc.textFile(dataInputPath)
val dataRDD: RDD[(String, (String, Float))] = data.map { f =>
  val temp = f.split("//x01")
  (temp(0), (temp(1), temp(2).toInt))
}
Then I run the following code to turn my data into the Rating type:
import org.apache.spark.mllib.recommendation.Rating
val imeiMap = dataRDD.reduceByKey((s1,s2)=>s1).collect().zipWithIndex.toMap;
val docidMap = dataRDD.map( f=>(f._2._1,1)).reduceByKey((s1,s2)=>s1).collect().zipWithIndex.toMap;
val ratings = dataRDD.map{case (imei, (doc_id,rating))=> Rating(imeiMap(imei),docidMap(doc_id),rating)};
But I got an error:
Error:(32, 77) type mismatch;
found : String
required: (String, (String, Float))
val ratings = dataRDD.map{case (imei, (doc_id,rating))=> Rating(imeiMap(imei),docidMap(doc_id),rating)};
Why does this happen? I thought the String had already been changed to (String, (String, Float)).
The key of docidMap is not a String; it is a tuple (String, Int).
This is because you have the zipWithIndex before the .toMap method:
With this rdd as input for a quick test:
(String1,( String2,32.0))
(String1,( String2,35.0))
scala> val docidMap = dataRDD.map( f=>(f._2._1,1)).reduceByKey((s1,s2)=>s1).collect().zipWithIndex.toMap;
docidMap: scala.collection.immutable.Map[(String, Int),Int] = Map((" String2",1) -> 0)
val docidMap = dataRDD.map( f=>(f._2._1,1)).reduceByKey((s1,s2)=>s1).collect().toMap;
docidMap: scala.collection.immutable.Map[String,Int] = Map(" String2" -> 1)
The same happens with your imeiMap; it seems you just need to remove the zipWithIndex there too:
val imeiMap = dataRDD.reduceByKey((s1,s2)=>s1).collect.toMap
It is not about your dataRDD; it is about imeiMap:
imeiMap: scala.collection.immutable.Map[(String, (String, Float)),Int]
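A plain Scala snippet (just the collected array, no Spark needed) shows the key-type difference the compiler is complaining about; the values here are made up:
// zipWithIndex before toMap makes the whole element the key
val collected = Array(("String2", 1))
val withIndex = collected.zipWithIndex.toMap // Map[(String, Int), Int] -- key is the tuple
val withoutIndex = collected.toMap           // Map[String, Int]        -- key is the String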

found: org.apache.spark.sql.Dataset[(Double, Double)] required: org.apache.spark.rdd.RDD[(Double, Double)]

I am getting the error below
found : org.apache.spark.sql.Dataset[(Double, Double)]
required: org.apache.spark.rdd.RDD[(Double, Double)]
val testMetrics = new BinaryClassificationMetrics(testScoreAndLabel)
On the following code:
val testScoreAndLabel = testResults.
select("Label","ModelProbability").
map{ case Row(l:Double,p:Vector) => (p(1),l) }
val testMetrics = new BinaryClassificationMetrics(testScoreAndLabel)
From the error it seems that testScoreAndLabel is of type sql.Dataset but BinaryClassificationMetrics expects an RDD.
How can I convert a sql.Dataset into an RDD?
I'd do something like this
val testScoreAndLabel = testResults.
select("Label","ModelProbability").
map{ case Row(l:Double,p:Vector) => (p(1),l) }
Now convert testScoreAndLabel to an RDD simply by calling testScoreAndLabel.rdd:
val testMetrics = new BinaryClassificationMetrics(testScoreAndLabel.rdd)
API Doc
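Putting it together, a minimal sketch assuming Spark 2.x, a SparkSession named spark, and a DataFrame testResults with a Double "Label" column and a Vector "ModelProbability" column:
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
import spark.implicits._ // supplies the Encoder for (Double, Double)

val testScoreAndLabel = testResults
  .select("Label", "ModelProbability")
  .map { case Row(l: Double, p: Vector) => (p(1), l) } // Dataset[(Double, Double)]

// BinaryClassificationMetrics belongs to the RDD-based API, hence .rdd
val testMetrics = new BinaryClassificationMetrics(testScoreAndLabel.rdd)
println(testMetrics.areaUnderROC())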

How to read probability Vector in Spark Dataframe LogisticRegression output

I am trying to read the first probability from the logistic regression output so that I can perform decile binning on it.
Below is some test code that just emulates the outputs with a vector.
val r = sqlContext.createDataFrame(Seq(("jane", Vectors.dense(.98)),
("tom", Vectors.dense(.34)),
("nancy", Vectors.dense(.93)),
("tim", Vectors.dense(.02)),
("larry", Vectors.dense(.033)),
("lana", Vectors.dense(.85)),
("jack", Vectors.dense(.84)),
("john", Vectors.dense(.09)),
("jill", Vectors.dense(.12)),
("mike", Vectors.dense(.21)),
("jason", Vectors.dense(.31)),
("roger", Vectors.dense(.76)),
("ed", Vectors.dense(.77)),
("alan", Vectors.dense(.64)),
("ryan", Vectors.dense(.52)),
("ted", Vectors.dense(.66)),
("paul", Vectors.dense(.67)),
("brian", Vectors.dense(.68)),
("jeff", Vectors.dense(.05)))).toDF(CSMasterCustomerID, MLProbability)
var result = r.select(CSMasterCustomerID, MLProbability)
val schema = StructType(Seq(StructField(CSMasterCustomerID, StringType, false), StructField(MLProbability, DoubleType, true)))
result = sqlContext.createDataFrame(result.map((r: Row) => {
r match {
case Row(mcid: String, probability: Vector) =>
RowFactory.create(mcid, probability(0))
}
}), schema)
This fails to compile saying:
<console>:56: error: type mismatch;
found : Double
required: Object
Note: an implicit exists from scala.Double => java.lang.Double, but
methods inherited from Object are rendered ambiguous. This is to avoid
a blanket implicit which would convert any scala.Double to any AnyRef.
You may wish to use a type ascription: `x: java.lang.Double`.
RowFactory.create(mcid, probability(0))
Any suggestions to fix this or another approach?
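One option that follows the compiler's own hint is to type-ascribe the Scala Double so it converts to java.lang.Double before reaching RowFactory.create, which takes Object varargs; a sketch of just that part:
result = sqlContext.createDataFrame(result.map((r: Row) => {
  r match {
    // The ascription triggers the implicit scala.Double => java.lang.Double conversion
    case Row(mcid: String, probability: Vector) =>
      RowFactory.create(mcid, probability(0): java.lang.Double)
  }
}), schema)
Alternatively, Scala's own Row(...) factory accepts Any*, which sidesteps the Object varargs issue altogether.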

Unexpected Type Mismatch While Using Scala breeze.optimize.linear.LinearProgram

Right now I am playing around with the LinearProgram class in Scala Breeze, and I've gotten to the point where I optimize my linear programming problem using the following code.
import breeze.stats.distributions
import breeze.stats._
import breeze.linalg._
val lp = new breeze.optimize.linear.LinearProgram()
val unif_dist = breeze.stats.distributions.Uniform(-1,1)
val U = DenseMatrix.rand(1, 3, unif_dist).toArray
val V = DenseMatrix.rand(2, 3, unif_dist).toArray.grouped(3).toArray
val B = Array.fill(3)(lp.Binary())
val Objective = V.map(vi => U.zip(vi).map(uv => uv._1 * uv._2)).map(uvi => B.zip(uvi).map(buv => buv._1 * buv._2)).map(x => x.reduce(_ + _)).reduce(_ + _)
val lpp = ( Objective subjectTo() )
lp.maximize(lpp)
I receive the following error
scala> lp.minimize(lpp)
<console>:45: error: type mismatch;
found : lp.Problem
required: lp.Problem
lp.minimize(lpp)
^
Has anyone here run into this before, and if so, did you come up with a way to fix it? Additionally, I am open to suggestions on a cleaner way to write the line where I assign Objective.

Apache Spark type mismatch of the same type (String)

EDIT: Answer: It was a JAR file that created a conflict!
The related post is: Must include log4J, but it is causing errors in Apache Spark shell. How to avoid errors?
Doing the following:
val numOfProcessors:Int = 2
val filePath:java.lang.String = "s3n://somefile.csv"
var rdd:org.apache.spark.rdd.RDD[java.lang.String] = sc.textFile(filePath, numOfProcessors)
I get
error: type mismatch;
found : org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.RDD[String]
required: org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.RDD[String]
var rdd:org.apache.spark.rdd.RDD[java.lang.String] = sc.textFile(filePath, numOfProcessors)
EDIT: Second case
val numOfProcessors = 2
val filePath = "s3n://somefile.csv"
var rdd = sc.textFile(filePath, numOfProcessors) //OK!
def doStuff(rdd: RDD[String]): RDD[String] = {rdd}
doStuff(rdd)
I get:
error: type mismatch;
found : org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.RDD[String]
required: org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.RDD[String]
doStuff(rdd)
^
No comment...
Any ideas why I get this error?
The problem was a JAR file that created a conflict.
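For anyone debugging a similar self-contradictory "found X, required X" message, a quick diagnostic sketch: print where the conflicting class is loaded from and compare the locations across your session.
// Prints the JAR (code source) that the RDD class currently resolves to;
// two different locations in one session point to a classpath conflict.
println(classOf[org.apache.spark.rdd.RDD[_]]
  .getProtectionDomain.getCodeSource.getLocation)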