Cannot pass arrays from MongoDB into Spark Machine Learning functions that require Vectors

My use case:
Read data from a MongoDB collection of the form:
{
"_id" : ObjectId("582cab1b21650fc72055246d"),
"label" : 167.517838916715,
"features" : [
10.0964787450654,
218.621137772497,
18.8833848806122,
11.8010251302327,
1.67037687829152,
22.0766170950477,
11.7122322171201,
12.8014773524475,
8.30441804118235,
29.4821268054137
]
}
And pass it to the org.apache.spark.ml.regression.LinearRegression class to create a model for predictions.
My problem:
The Spark connector reads in "features" as Array[Double].
LinearRegression.fit(...) expects a Dataset with a Label column and a Features column.
The Features column must be of type VectorUDT (so DenseVector or SparseVector will work).
I cannot .map features from Array[Double] to DenseVector because there is no relevant Encoder:
Error:(23, 11) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
.map{case Row(label: Double, features: Array[Double]) => Row(label, Vectors.dense(features))}
Custom Encoders cannot be defined.
My question:
Is there a way I can set the configuration of the Spark connector to read in the "features" array as a Dense/SparseVector?
Is there any other way I can achieve this (without, for example, using an intermediary .csv file and loading that using libsvm)?
My code:
import com.mongodb.spark.MongoSpark
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.{Row, SparkSession}
case class DataPoint(label: Double, features: Array[Double])
object LinearRegressionWithMongo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("LinearRegressionWithMongo")
      .master("local[4]")
      .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/LinearRegressionTest.DataPoints")
      .getOrCreate()
    import spark.implicits._
    val dataPoints = MongoSpark.load(spark)
      .map{case Row(label: Double, features: Array[Double]) => Row(label, Vectors.dense(features))}
    val splitData = dataPoints.randomSplit(Array(0.7, 0.3), 42)
    val training = splitData(0)
    val test = splitData(1)
    val linearRegression = new LinearRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setRegParam(0.0)
      .setElasticNetParam(0.0)
      .setMaxIter(100)
      .setTol(1e-6)
    // Train the model
    val startTime = System.nanoTime()
    val linearRegressionModel = linearRegression.fit(training)
    val elapsedTime = (System.nanoTime() - startTime) / 1e9
    println(s"Training time: $elapsedTime seconds")
    // Print the weights and intercept for linear regression.
    println(s"Weights: ${linearRegressionModel.coefficients} Intercept: ${linearRegressionModel.intercept}")
    val modelEvaluator = new ModelEvaluator()
    println("Training data results:")
    modelEvaluator.evaluateRegressionModel(linearRegressionModel, training, "label")
    println("Test data results:")
    modelEvaluator.evaluateRegressionModel(linearRegressionModel, test, "label")
    spark.stop()
  }
}
Any help would be ridiculously appreciated!

There is a quick fix for this. Suppose the data has been loaded into a DataFrame called df which has:
id - SQL double.
features - SQL array<double>.
like this one:
val df = Seq((1.0, Array(2.3, 3.4, 4.5))).toDF("id", "features")
You select the columns you need for downstream processing:
val idAndFeatures = df.select("id", "features")
convert to a statically typed Dataset:
val tuples = idAndFeatures.as[(Double, Seq[Double])]
map and convert back to Dataset[Row]:
val spark: SparkSession = ???
import spark.implicits._
import org.apache.spark.ml.linalg.Vectors
tuples.map { case (id, features) =>
  (id, Vectors.dense(features.toArray))
}.toDF("id", "features")
This works because spark.implicits._ provides an Encoder for tuples of doubles and Vectors, whereas no implicit Encoder exists for Row. You can find a detailed explanation of how this differs from your current approach here.
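As an alternative sketch (not part of the answer above), the same conversion can probably be done with a UDF, which keeps everything in the DataFrame API. The names toVector and withVectors are illustrative, the sample row stands in for the MongoDB data, and a local SparkSession is assumed.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session for illustration; reuse your own session in practice.
val spark = SparkSession.builder().master("local[2]").appName("ArrayToVector").getOrCreate()
import spark.implicits._

// Stand-in for the DataFrame loaded from MongoDB: label is a double, features is array<double>.
val df = Seq((167.5, Array(10.1, 218.6, 18.9))).toDF("label", "features")

// Spark passes array columns to Scala UDFs as Seq, so the UDF takes Seq[Double].
val toVector = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))

// Replace the array column with an ML vector column of the same name.
val withVectors = df.withColumn("features", toVector(col("features")))
withVectors.printSchema()

val first: Vector = withVectors.first().getAs[Vector]("features")
println(first)
```

The resulting DataFrame has the label and features columns that LinearRegression.fit expects, so it should be usable as a drop-in replacement for dataPoints in the question.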

Related

Type TimestampType of column not supported

I use Spark with Scala and have a problem with the TimestampType type.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{abs, col}
object regressionLinear {
  case class X(
    time: String, nodeID: Int, posX: Double, posY: Double,
    speed: Double, period: Int)
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)
    /**
     * Read the input data
     */
    var dataset = "C:\\spark\\A6-d07-h08.csv"
    if (args.length > 0) {
      dataset = args(0)
    }
    val spark = SparkSession
      .builder
      .appName("regressionsol")
      .master("local[4]")
      .getOrCreate()
    import spark.implicits._
    val data = spark.sparkContext.textFile(dataset)
      .map(line => line.split(","))
      .map(userRecord => (userRecord(0).trim.toString,
        userRecord(1).trim.toInt, userRecord(2).trim.toDouble, userRecord(3).trim.toDouble, userRecord(4).trim.toDouble, userRecord(5).trim.toInt))
      .toDF("time", "nodeID", "posX", "posY", "speed", "period").withColumn("time", $"time".cast("timestamp"))
    val assembler = new VectorAssembler()
      .setInputCols(Array(
        "time", "nodeID", "posX", "posY", "speed", "period"))
      .setOutputCol("features")
    val lr = new LinearRegression()
      .setLabelCol("period")
      .setFeaturesCol("features")
      .setRegParam(0.1)
      .setMaxIter(100)
      .setSolver("l-bfgs")
    val steps =
      Array(assembler, lr)
    val pipeline = new Pipeline()
      .setStages(steps)
    val Array(training, test) = data.randomSplit(Array(0.75, 0.25), seed = 12345)
    val model = pipeline.fit {
      training
    }
    val holdout = model.transform(test)
    holdout.show(20)
    val prediction = holdout.select("prediction", "period", "nodeID").orderBy(abs(col("prediction") - col("period")))
    prediction.show(20)
    val rm = new RegressionMetrics(prediction.rdd.map {
      x => (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double])
    })
    println(s"RMSE = ${rm.rootMeanSquaredError}")
    println(s"R-squared = ${rm.r2}")
    spark.stop()
  }
}
This is the error:
Exception in thread "main" java.lang.IllegalArgumentException: Data type TimestampType of column time is not supported.
at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:124)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:184)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:136)
at regressionLinear$.main(regressionLinear.scala:100)
at regressionLinear.main(regressionLinear.scala)
VectorAssembler accepts only numeric columns. Columns of other types have to be encoded first. And since you are applying LinearRegression, the data has to be encoded anyway.
The exact steps depend on domain-specific knowledge:
If you expect a linear trend over time, cast the field to a numeric type first.
If you expect some type of seasonal effect, you might have to extract individual components (day of week, hour / time of day, month and so on), and typically apply StringIndexer + OneHotEncoder.
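The two options above could look roughly like the following sketch. The sample rows and the column names timeSeconds and hourOfDay are made up for illustration, and a local SparkSession is assumed.

```scala
import java.sql.Timestamp
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, hour}

// Assumption: a local session for illustration.
val spark = SparkSession.builder().master("local[2]").appName("TimestampFeatures").getOrCreate()
import spark.implicits._

// Hypothetical sample rows with a timestamp column and one numeric column.
val data = Seq(
  (Timestamp.valueOf("2017-01-07 08:15:00"), 12.5),
  (Timestamp.valueOf("2017-01-07 09:45:00"), 30.0)
).toDF("time", "speed")

val encoded = data
  // Linear trend: seconds since the epoch as a double.
  .withColumn("timeSeconds", col("time").cast("long").cast("double"))
  // Seasonal component: hour of day as an integer in 0-23.
  .withColumn("hourOfDay", hour(col("time")))

// The assembler now sees only numeric inputs; the raw timestamp is left out.
val assembler = new VectorAssembler()
  .setInputCols(Array("timeSeconds", "hourOfDay", "speed"))
  .setOutputCol("features")

val assembled = assembler.transform(encoded)
assembled.select("features").show(false)
```

Here hourOfDay is fed in as a plain number to keep the sketch short; for a real seasonal effect it would more likely be one-hot encoded first, as the answer suggests.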

How to predict kmeans cluster with Spark org.apache.spark.ml.clustering.{KMeans, KMeansModel}

I have a problem with the two different MLlib implementations (org.apache.spark.ml and org.apache.spark.mllib) and KMeans. I am using the new org.apache.spark.ml implementation, which is based on DataFrames, but I'm struggling with the documentation and with how to predict a cluster index.
import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{Row, SparkSession}
/**
 * An example showcasing the use of kMeans
 */
object ExploreKMeans {
  // Spark configuration.
  // Retrieve sparkContext with spark.sparkContext.
  private val spark = SparkSession.builder()
    .appName("com.example.ml.exploration.kMeans")
    .master("local[*]")
    .getOrCreate()
  // This import, after the definition of a valid SQLContext defines implicits for converting RDDs to Dataframes over .toDF().
  import spark.implicits._
  def main(args: Array[String]): Unit = {
    val data = spark.sparkContext.parallelize(Array((5.0, 2.0, 1.5), (2.0, 2.5, 2.3), (1.0, 2.1, 4.2), (2.0, 5.5, 8.5)))
    val df = data.toDF().map { row =>
      val label = row(0).asInstanceOf[Double]
      val value1 = row(1).asInstanceOf[Double]
      val value2 = row(2).asInstanceOf[Double]
      LabeledPoint(label, Vectors.dense(value1, value2))
    }
    val kmeans = new KMeans().setK(3).setSeed(1L)
    val model: KMeansModel = kmeans.fit(df)
    // Evaluate clustering by computing Within Set Sum of Squared Errors.
    val WSSSE = model.computeCost(df)
    println(s"Within Set Sum of Squared Errors = $WSSSE")
    // Shows the result.
    println("Cluster Centers: ")
    model.clusterCenters.foreach(println)
    //TODO How to predict cluster index?
    //model.predict(???
  }
}
How do I use the model to predict the cluster index of new values? The model.predict function is not visible. This API is really confusing...
Well, an easier way to do this is:
model.summary.predictions.show
Ok, I got it. Predictions are now done with the transform method:
println("Transform ")
val transformed = model.transform(df)
transformed.collect().foreach(println)
Output:
Cluster Centers:
[2.25,1.9]
[5.5,8.5]
[2.1,4.2]
Transform:
[5.0,[2.0,1.5],0]
[2.0,[2.5,2.3],0]
[1.0,[2.1,4.2],2]
[2.0,[5.5,8.5],1]
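For values the model has not seen, the same transform call should apply. The following is a minimal sketch under that assumption: the points in newPoints are made up, and the DataFrames are built directly with a "features" column, which is the default input column of KMeans.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration.
val spark = SparkSession.builder().master("local[2]").appName("KMeansPredict").getOrCreate()
import spark.implicits._

// Training data: the same four feature vectors as in the question.
val train = Seq(
  Vectors.dense(2.0, 1.5), Vectors.dense(2.5, 2.3),
  Vectors.dense(2.1, 4.2), Vectors.dense(5.5, 8.5)
).map(Tuple1.apply).toDF("features")

val model = new KMeans().setK(3).setSeed(1L).fit(train)

// Hypothetical unseen points; they only need a matching "features" column.
val newPoints = Seq(
  Vectors.dense(2.2, 2.0), Vectors.dense(5.0, 8.0)
).map(Tuple1.apply).toDF("features")

// transform appends a "prediction" column holding the cluster index of each row.
val predicted = model.transform(newPoints)
predicted.show(false)
```

The cluster indices themselves depend on the seed and the Spark version, so no expected output is shown.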

(Array/ML Vector/MLlib Vector) RDD to ML Vector DataFrame column

I need to convert an RDD to a single column o.a.s.ml.linalg.Vector DataFrame, in order to use the ML algorithms, specifically K-Means for this case. This is my RDD:
val parsedData = sc.textFile("/digits480x.csv").map(s => Row(org.apache.spark.mllib.linalg.Vectors.dense(s.split(',').slice(0,64).map(_.toDouble))))
I tried doing what this answer suggests with no luck. I suppose that is because you end up with an MLlib Vector, which throws a mismatch error when running the algorithm. Now if I change this:
import org.apache.spark.mllib.linalg.{Vectors, VectorUDT}
val schema = new StructType()
.add("features", new VectorUDT())
to this:
import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
val parsedData = sc.textFile("/digits480x.csv").map(s => Row(org.apache.spark.ml.linalg.Vectors.dense(s.split(',').slice(0,64).map(_.toDouble))))
val schema = new StructType()
.add("features", new VectorUDT())
I would get an error because ML VectorUDT is private.
I also tried converting the RDD of double arrays to a DataFrame and then building the ML DenseVector like this:
var parsedData = sc.textFile("/home/pililo/Documents/Mi_Memoria/Codigo/Datasets/Digits/digits480x.csv").map(s => Row(s.split(',').slice(0,64).map(_.toDouble)))
parsedData: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
val schema2 = new StructType().add("features", ArrayType(DoubleType))
schema2: org.apache.spark.sql.types.StructType = StructType(StructField(features,ArrayType(DoubleType,true),true))
val df = spark.createDataFrame(parsedData, schema2)
df: org.apache.spark.sql.DataFrame = [features: array<double>]
val df2 = df.map{ case Row(features: Array[Double]) => Row(org.apache.spark.ml.linalg.Vectors.dense(features)) }
Which throws the following error, even though spark.implicits._ is imported:
error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
Any help is greatly appreciated, thanks!
Off the top of my head:
Use the csv source and VectorAssembler:
import scala.util.Try
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._ // for $"..." and .as[String]; assumes a SparkSession named spark
val path: String = ???
val n: Int = ???
val m: Int = ???
val raw = spark.read.csv(path)
val featureCols = raw.columns.slice(n, m)
val exprs = featureCols.map(c => col(c).cast("double"))
val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
assembler.transform(raw.select(exprs: _*)).select($"features")
Use the text source and a UDF:
def parse_(n: Int, m: Int)(s: String) = Try(
  Vectors.dense(s.split(',').slice(n, m).map(_.toDouble))
).toOption
def parse(n: Int, m: Int) = udf(parse_(n, m) _)
val raw = spark.read.text(path)
raw.select(parse(n, m)(col(raw.columns.head)).alias("features"))
Use the text source and drop the wrapping Row:
spark.read.text(path).as[String].map(parse_(n, m)).toDF

Can't run LDA on Dataset[(scala.Long, org.apache.spark.mllib.linalg.Vector)] in Spark 2.0

I am following this tutorial video on an LDA example and I'm getting the following issue:
<console>:37: error: overloaded method value run with alternatives:
(documents: org.apache.spark.api.java.JavaPairRDD[java.lang.Long,org.apache.spark.mllib.linalg.Vector])org.apache.spark.mllib.clustering.LDAModel <and>
(documents: org.apache.spark.rdd.RDD[(scala.Long, org.apache.spark.mllib.linalg.Vector)])org.apache.spark.mllib.clustering.LDAModel
cannot be applied to (org.apache.spark.sql.Dataset[(scala.Long, org.apache.spark.mllib.linalg.Vector)])
val model = run(lda_countVector)
^
So I want to convert this DataFrame to an RDD, but the result is always a Dataset. Can anyone please look into this issue?
// Convert DF to RDD
import org.apache.spark.mllib.linalg.Vector
val lda_countVector = countVectors.map { case Row(id: Long, countVector: Vector) => (id, countVector) }
// import org.apache.spark.mllib.linalg.Vector
// lda_countVector: org.apache.spark.sql.Dataset[(Long, org.apache.spark.mllib.linalg.Vector)] = [_1: bigint, _2: vector]
The Spark API changed between the 1.x and 2.x branches. In particular, DataFrame.map now returns a Dataset, not an RDD, so the result is not compatible with the old RDD-based MLlib API. You should convert the data to an RDD first, as follows:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
val a = Vectors.dense(Array(1.0, 2.0, 3.0))
val b = Vectors.dense(Array(3.0, 4.0, 5.0))
val df = Seq((1L, a), (2L, b), (2L, a)).toDF
val ldaDF = df.rdd.map {
  case Row(id: Long, countVector: Vector) => (id, countVector)
}
val model = new LDA().setK(3).run(ldaDF)
Or you can convert to a typed Dataset and then to an RDD:
val model = new LDA().setK(3).run(df.as[(Long, Vector)].rdd)

OneHotEncoder in Spark Dataframe in Pipeline

I've been trying to get an example running in Spark and Scala with the Adult dataset.
Using Scala 2.11.8 and Spark 1.6.1.
The problem (for now) lies in the number of categorical features in that dataset, which all need to be encoded to numbers before a Spark ML algorithm can do its job.
So far I have this:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.OneHotEncoder
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
object Adult {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Adult example").setMaster("local[*]")
    val sparkContext = new SparkContext(conf)
    val sqlContext = new SQLContext(sparkContext)
    val data = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true") // Use first line of all files as header
      .option("inferSchema", "true") // Automatically infer data types
      .load("src/main/resources/adult.data")
    val categoricals = data.dtypes filter (_._2 == "StringType")
    val encoders = categoricals map (cat => new OneHotEncoder().setInputCol(cat._1).setOutputCol(cat._1 + "_encoded"))
    val features = data.dtypes filterNot (_._1 == "label") map (tuple => if (tuple._2 == "StringType") tuple._1 + "_encoded" else tuple._1)
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)
    val pipeline = new Pipeline()
      .setStages(encoders ++ Array(lr))
    val model = pipeline.fit(training)
  }
}
However, this doesn't work. The DataFrame passed to pipeline.fit still contains the original string features, and so it throws an exception.
How can I remove these "StringType" columns in a pipeline?
Or maybe I'm doing it completely wrong, so if someone has a different suggestion I'm open to all input :).
The reason I chose this flow is that I have an extensive background in Python and Pandas, but am trying to learn both Scala and Spark.
There is one thing that can be rather confusing here if you're used to higher-level frameworks. You have to index the features before you can use the encoder. As explained in the API docs:
one-hot encoder (...) maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder}
val df = Seq((1L, "foo"), (2L, "bar")).toDF("id", "x")
val categoricals = df.dtypes.filter(_._2 == "StringType").map(_._1)
val indexers = categoricals.map(
  c => new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx")
)
val encoders = categoricals.map(
  c => new OneHotEncoder().setInputCol(s"${c}_idx").setOutputCol(s"${c}_enc")
)
val pipeline = new Pipeline().setStages(indexers ++ encoders)
val transformed = pipeline.fit(df).transform(df)
transformed.show
// +---+---+-----+-------------+
// | id|  x|x_idx|        x_enc|
// +---+---+-----+-------------+
// |  1|foo|  1.0|    (1,[],[])|
// |  2|bar|  0.0|(1,[0],[1.0])|
// +---+---+-----+-------------+
As you can see, there is no need to drop the string columns from the pipeline. In practice, OneHotEncoder will accept a numeric column with a NominalAttribute, a BinaryAttribute, or a missing type attribute.
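To connect this back to the question, here is a hedged sketch of how the indexed and encoded columns might be fed to an estimator: a VectorAssembler picks only the numeric and encoded columns, so the original string columns can stay in the DataFrame. The toy data and the age column are invented for illustration, and the sketch assumes a Spark version in which OneHotEncoder supports setInputCol and setOutputCol.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration.
val spark = SparkSession.builder().master("local[2]").appName("EncodeAndAssemble").getOrCreate()
import spark.implicits._

// Hypothetical toy data: one numeric column (age) and one string column (x).
val df = Seq((1L, 35.0, "foo"), (2L, 42.0, "bar"), (3L, 28.0, "baz")).toDF("id", "age", "x")

val categoricals = df.dtypes.filter(_._2 == "StringType").map(_._1).toSeq

// One indexer and one encoder per string column, as in the answer above.
val indexers: Seq[PipelineStage] =
  categoricals.map(c => new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx"))
val encoders: Seq[PipelineStage] =
  categoricals.map(c => new OneHotEncoder().setInputCol(s"${c}_idx").setOutputCol(s"${c}_enc"))

// Only the numeric and encoded columns go into the feature vector.
val assembler = new VectorAssembler()
  .setInputCols((Seq("age") ++ categoricals.map(c => s"${c}_enc")).toArray)
  .setOutputCol("features")

val stages: Array[PipelineStage] = ((indexers ++ encoders) :+ assembler).toArray
val assembled = new Pipeline().setStages(stages).fit(df).transform(df)
assembled.select("id", "features").show(false)
```

An estimator such as LogisticRegression could then be appended as a final stage, reading the features column.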