How to read probability Vector in Spark DataFrame LogisticRegression output - Scala

I am trying to read the first probability from the logistic regression output so that I can perform decile binning on it.
Below is some test code that just emulates the output with a vector.
val r = sqlContext.createDataFrame(Seq(("jane", Vectors.dense(.98)),
("tom", Vectors.dense(.34)),
("nancy", Vectors.dense(.93)),
("tim", Vectors.dense(.02)),
("larry", Vectors.dense(.033)),
("lana", Vectors.dense(.85)),
("jack", Vectors.dense(.84)),
("john", Vectors.dense(.09)),
("jill", Vectors.dense(.12)),
("mike", Vectors.dense(.21)),
("jason", Vectors.dense(.31)),
("roger", Vectors.dense(.76)),
("ed", Vectors.dense(.77)),
("alan", Vectors.dense(.64)),
("ryan", Vectors.dense(.52)),
("ted", Vectors.dense(.66)),
("paul", Vectors.dense(.67)),
("brian", Vectors.dense(.68)),
("jeff", Vectors.dense(.05)))).toDF(CSMasterCustomerID, MLProbability)
var result = r.select(CSMasterCustomerID, MLProbability)
val schema = StructType(Seq(StructField(CSMasterCustomerID, StringType, false), StructField(MLProbability, DoubleType, true)))
result = sqlContext.createDataFrame(result.map((r: Row) => {
r match {
case Row(mcid: String, probability: Vector) =>
RowFactory.create(mcid, probability(0))
}
}), schema)
This fails to compile saying:
<console>:56: error: type mismatch;
found : Double
required: Object
Note: an implicit exists from scala.Double => java.lang.Double, but
methods inherited from Object are rendered ambiguous. This is to avoid
a blanket implicit which would convert any scala.Double to any AnyRef.
You may wish to use a type ascription: `x: java.lang.Double`.
RowFactory.create(mcid, probability(0))
Any suggestions to fix this or another approach?
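One possible way forward (sketched here, not taken from an accepted answer): follow the compiler's own hint, or sidestep RowFactory and the manual schema entirely and pull the first element out with a UDF. The names below reuse those from the question, and the Vector in the UDF is assumed to be the same Vector already imported in the snippet above (mllib.linalg in Spark 1.x, ml.linalg in Spark 2.x).
// Option 1: inside the pattern match, use Scala's Row.apply (which takes Any*) instead of
// RowFactory.create, or keep RowFactory and ascribe the boxed type as the compiler suggests:
//   RowFactory.create(mcid, probability(0): java.lang.Double)
case Row(mcid: String, probability: Vector) =>
  Row(mcid, probability(0))

// Option 2: skip the manual schema and pattern match; extract the first element with a UDF.
import org.apache.spark.sql.functions.{col, udf}
val firstProb = udf((v: Vector) => v(0))
val result = r.withColumn(MLProbability, firstProb(col(MLProbability)))
              .select(CSMasterCustomerID, MLProbability)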

Related

How to add one to every element of a SparseVector in Breeze?

Given a Breeze SparseVector object:
scala> val sv = new SparseVector[Double](Array(0, 4, 5), Array(1.5, 3.6, 0.4), 8)
sv: breeze.linalg.SparseVector[Double] = SparseVector(8)((0,1.5), (4,3.6), (5,0.4))
What is the best way to take the log of the values + 1?
Here is one way that works:
scala> new SparseVector(sv.index, log(sv.data.map(_ + 1)), sv.length)
res11: breeze.linalg.SparseVector[Double] = SparseVector(8)((0,0.9162907318741551), (4,1.5260563034950492), (5,0.3364722366212129))
I don't like this because it doesn't really make use of breeze to do the addition. We are using a breeze UFunc to take the log of an Array[Double], but that isn't much. I am concerned that in a distributed application with large SparseVectors, this will be slow.
Spark 1.6.3
You can define UDFs to do arbitrary vectorized addition in Spark. First, you need to set up the ability to convert Spark vectors to Breeze vectors; an example of doing that is here. Once you have the implicit conversions in place, you have a few options.
To add any two columns you can use:
def addVectors(v1Col: String, v2Col: String, outputCol: String): DataFrame => DataFrame = {
  // Error checking column names here
  df: DataFrame => {
    def add(v1: SparkVector, v2: SparkVector): SparkVector =
      (v1.asBreeze + v2.asBreeze).fromBreeze
    val func = udf((v1: SparkVector, v2: SparkVector) => add(v1, v2))
    df.withColumn(outputCol, func(col(v1Col), col(v2Col)))
  }
}
Note that the use of asBreeze and fromBreeze (as well as the alias for SparkVector) is established in the question linked above. A possible solution for adding a constant is to make a literal integer column with
df.withColumn(colName, lit(1))
and then add the columns.
The alternative for more complex mathematical functions is:
def applyMath(func: BreezeVector[Double] => BreezeVector[Double],
              inColName: String, outColName: String): DataFrame => DataFrame = {
  df: DataFrame => df.withColumn(outColName,
    udf((v1: SparkVector) => func(v1.asBreeze).fromBreeze).apply(col(inColName)))
}
You could also make this generic in the Breeze vector parameter.
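For concreteness, a hypothetical usage of the two helpers above (the DataFrame df and the column names are made up for illustration, and the asBreeze/fromBreeze conversions are assumed to be in scope):
// Add two vector columns element-wise into a new column.
val summed = addVectors("v1", "v2", "v1_plus_v2")(df)

// log(x + 1) on every element of a vector column, using breeze's log1p UFunc,
// which is the transformation asked about in the question.
val logged = applyMath(v => breeze.numerics.log1p(v), "features", "log1p_features")(df)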

Scala: GraphX: error: class Array takes type parameters

I am trying to build an Edge RDD for GraphX. I am reading a CSV file and converting it to a DataFrame, then trying to convert that to an Edge RDD:
val staticDataFrame = spark.
  read.
  option("header", true).
  option("inferSchema", true).
  csv("/projects/pdw/aiw_test/aiw/haris/Customers_DDSW-withDN$.csv")

val edgeRDD: RDD[Edge[(VertexId, VertexId, String)]] =
  staticDataFrame.select(
    "dealer_customer_number",
    "parent_dealer_cust_number",
    "dealer_code"
  ).map{ (row: Array) =>
    Edge((
      row.getAs[Long]("dealer_customer_number"),
      row.getAs[Long]("parent_dealer_cust_number"),
      row("dealer_code")
    ))
  }
But I am getting this error:
<console>:81: error: class Array takes type parameters
val edgeRDD: RDD[Edge[(VertexId, VertexId, String)]] = staticDataFrame.select("dealer_customer_number", "parent_dealer_cust_number", "dealer_code").map((row: Array) => Edge((row.getAs[Long]("dealer_customer_number"), row.getAs[Long]("parent_dealer_cust_number"), row("dealer_code"))))
^
The result for
staticDataFrame.select("dealer_customer_number", "parent_dealer_cust_number", "dealer_code").take(1)
is
res3: Array[org.apache.spark.sql.Row] = Array([0000101,null,B110])
First, Array takes type parameters, so you would have to write Array[Something]. But this is probably not what you want anyway.
A DataFrame is a Dataset[Row], not a Dataset[Array[_]], so you have to change
.map{ (row: Array) =>
to
.map{ (row: Row) =>
Or just omit the typing completely (it should be inferred):
.map{ row =>
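A minimal sketch of the corrected mapping, going through .rdd so that no Encoder for Edge is needed. It assumes the two customer-number columns are read as Long and dealer_code as String; since Edge takes the attribute as its third parameter, the element type becomes Edge[String]:
import org.apache.spark.graphx.Edge
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

val edgeRDD: RDD[Edge[String]] =
  staticDataFrame.select(
    "dealer_customer_number",
    "parent_dealer_cust_number",
    "dealer_code"
  ).rdd.map { row: Row =>
    // Rows with a null parent_dealer_cust_number (as in the sample output above)
    // would need to be filtered or defaulted before calling getAs[Long].
    Edge(
      row.getAs[Long]("dealer_customer_number"),
      row.getAs[Long]("parent_dealer_cust_number"),
      row.getAs[String]("dealer_code")
    )
  }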

Return a tuple of Map and String from UDF

I am trying to construct a temporary column from an expensive UDF that I need to run on each row of my Dataset[Row]. Currently it looks something like:
val myUDF = udf((values: Array[Byte], schema: String) => {
  val list = new MyDecoder(schema).decode(values)
  val myMap = list.map(
    (value: SomeStruct) => (value.field1, value.field2)
  ).toMap
  val field3 = list.head.field3
  return (myMap, field3)
})

val decoded = myDF.withColumn("decoded_tmp", myUDF(col("data"), lit(schema)))
  .withColumn("myMap", col("decoded_tmp._1"))
  .withColumn("field3", col("decoded_tmp._2"))
  .drop("decoded_tmp")
However, when I try to compile this, I get a type mismatch error:
type mismatch;
found : (scala.collection.immutable.Map[String,Double], String)
required: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
How can I get around this, or will I have to use two expensive UDFs, one to produce myMap and the other to produce the field3 column?
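One likely culprit, sketched here as a possible fix rather than a confirmed answer, is the return keyword: inside an anonymous function, return attempts a non-local return from the enclosing method, which is exactly the (Map[String,Double], String) vs. Dataset[Row] mismatch reported above. Dropping it lets the UDF return the tuple, which Spark exposes as a struct column with fields _1 and _2, so only one expensive UDF is needed:
// Same UDF as above, minus the `return` (MyDecoder and SomeStruct are the asker's own types).
val myUDF = udf((values: Array[Byte], schema: String) => {
  val list = new MyDecoder(schema).decode(values)
  val myMap = list.map((value: SomeStruct) => (value.field1, value.field2)).toMap
  val field3 = list.head.field3
  (myMap, field3)  // the last expression is the result; no `return` needed
})

val decoded = myDF.withColumn("decoded_tmp", myUDF(col("data"), lit(schema)))
  .withColumn("myMap", col("decoded_tmp._1"))
  .withColumn("field3", col("decoded_tmp._2"))
  .drop("decoded_tmp")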

How to properly map an Array of name and type to an Array of StructField

The simple problem I have is this:
val t = StructField("attName", DecimalType,true)
type mismatch, expected: DataType, actual: DecimalType.type
I want to create a case class which can be used to automatically generate an array of StructField. So I began by trying this.
case class MyClass1(attributeName: String, attributeType: DataType)
In order to create my test1:
val test1 = Array(
  MyClass1("att1", StringType),
  MyClass1("att2", IntegerType),
  MyClass1("att3", DecimalType)
)
This is wrong because the third line with DecimalType throws an error (DecimalType is not a DataType). So I tried a second case class:
case class MyClass2[T](attributeName: String, attributeType: T)

val test2 = Array(
  MyClass2("att1", StringType),
  MyClass2("att2", IntegerType),
  MyClass2("att3", DecimalType)
)
Now this compiles, but the lines below do not work because DecimalType (the companion object) is not a DataType.
val myStruct = test2.map(x =>
  StructField(x.attributeName, x.attributeType, true)
)
So here is my question: how do I create a StructField with DecimalType, and do you think my case class is a good approach? Thanks
DecimalType is not a singleton - it has to be initialized with a given precision and scale:
import org.apache.spark.sql.types.{DataType, DecimalType}
DecimalType(38, 10): DataType
org.apache.spark.sql.types.DataType = DecimalType(38,10)
That being said, StructField is already a case class and provides default arguments for nullable (true) and metadata, so another representation seems superfluous.
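Putting that together, a minimal sketch of the array without the extra case class (38 and 10 are just example values for precision and scale):
import org.apache.spark.sql.types._

// nullable defaults to true and metadata to Metadata.empty, so they can be omitted.
val fields = Array(
  StructField("att1", StringType),
  StructField("att2", IntegerType),
  StructField("att3", DecimalType(38, 10))
)
val schema = StructType(fields)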

Flink linearRegression: how to load data (Scala)

I'm starting to train a Multiple Linear Regression algorithm in Flink.
I'm following the awesome official documentation and quickstart. I am using Zeppelin to develop this code.
If I load the data from a CSV file:
//Read the file:
val data = benv.readCsvFile[(Int, Double, Double, Double)]("/.../quake.csv")
val mapped = data.map { x => new org.apache.flink.ml.common.LabeledVector(x._4, org.apache.flink.ml.math.DenseVector(x._1, x._2, x._3)) }
//Data created:
mapped: org.apache.flink.api.scala.DataSet[org.apache.flink.ml.common.LabeledVector] = org.apache.flink.api.scala.DataSet#7cb37ad3
LabeledVector(6.7, DenseVector(33.0, -52.26, 28.3))
LabeledVector(5.8, DenseVector(36.0, 45.53, 150.93))
LabeledVector(5.8, DenseVector(57.0, 41.85, 142.78))
//Predict with the model created:
val predictions: DataSet[org.apache.flink.ml.common.LabeledVector] = mlr.predict(mapped)
If I load the data from a LIBSVM file:
val testingDS: DataSet[(Vector, Double)] = MLUtils.readLibSVM(benv, "/home/borja/Desktop/bbb/quake.libsvm").map(x => (x.vector, x.label))
But I get these errors:
->CSV:
res13: org.apache.flink.api.scala.DataSet[org.apache.flink.ml.common.LabeledVector] = org.apache.flink.api.scala.DataSet#7cb37ad3
<console>:89: error: type mismatch;
found : org.apache.flink.api.scala.DataSet[Any]
required: org.apache.flink.api.scala.DataSet[org.apache.flink.ml.common.LabeledVector]
Note: Any >: org.apache.flink.ml.common.LabeledVector, but class DataSet is invariant in type T.
You may wish to define T as -T instead. (SLS 4.5)
Error occurred in an application involving default arguments.
val predictions:DataSet[org.apache.flink.ml.common.LabeledVector] = mlr.predict(mapped)
->LIBSVM:
<console>:111: error: type Vector takes type parameters
val testingDS: DataSet[(Vector, Double)] = MLUtils.readLibSVM(benv, "/home/borja/Desktop/bbb/quake.libsvm").map(x => (x.vector, x.label))
Ok, so I wrote:
New Code:
val testingDS: DataSet[(Vector[org.apache.flink.ml.math.Vector], Double)] = MLUtils.readLibSVM(benv, "/home/borja/Desktop/bbb/quake.libsvm").map(x => (x.vector, x.label))
New Error:
<console>:111: error: type mismatch;
found : org.apache.flink.ml.math.Vector
required: scala.collection.immutable.Vector[org.apache.flink.ml.math.Vector]
val testingDS: DataSet[(Vector[org.apache.flink.ml.math.Vector], Double)] = MLUtils.readLibSVM(benv, "/home/borja/Desktop/bbb/quake.libsvm").map(x => (x.vector, x.label))
I would really appreciate your help! :)
You should not import and use the Scala Vector class. Flink ML is shipped with its own Vector. This should work:
val testingDS: DataSet[(org.apache.flink.ml.math.Vector, Double)] = MLUtils.readLibSVM(benv, "/home/borja/Desktop/bbb/quake.libsvm").map(x => (x.vector, x.label))
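If the fully qualified name is too verbose, a rename on import is a common alternative (a sketch along the same lines, not part of the original answer):
// Rename Flink's Vector on import so it no longer clashes with scala.collection.immutable.Vector.
import org.apache.flink.ml.math.{Vector => FlinkVector}

val testingDS: DataSet[(FlinkVector, Double)] =
  MLUtils.readLibSVM(benv, "/home/borja/Desktop/bbb/quake.libsvm").map(x => (x.vector, x.label))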