(Array/ML Vector/MLlib Vector) RDD to ML Vector DataFrame column - Scala

I need to convert an RDD to a single-column o.a.s.ml.linalg.Vector DataFrame in order to use the ML algorithms, specifically K-Means in this case. This is my RDD:
val parsedData = sc.textFile("/digits480x.csv").map(s => Row(org.apache.spark.mllib.linalg.Vectors.dense(s.split(',').slice(0,64).map(_.toDouble))))
I tried doing what this answer suggests with no luck; I suppose it's because you end up with an MLlib Vector, and it throws a mismatch error when running the algorithm. Now if I change this:
import org.apache.spark.mllib.linalg.{Vectors, VectorUDT}
val schema = new StructType()
.add("features", new VectorUDT())
to this:
import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
val parsedData = sc.textFile("/digits480x.csv").map(s => Row(org.apache.spark.ml.linalg.Vectors.dense(s.split(',').slice(0,64).map(_.toDouble))))
val schema = new StructType()
.add("features", new VectorUDT())
I would get an error because ML VectorUDT is private.
I also tried converting the RDD to a DataFrame of arrays of doubles, and then getting the ML DenseVector like this:
var parsedData = sc.textFile("/home/pililo/Documents/Mi_Memoria/Codigo/Datasets/Digits/digits480x.csv").map(s => Row(s.split(',').slice(0,64).map(_.toDouble)))
parsedData: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
val schema2 = new StructType().add("features", ArrayType(DoubleType))
schema2: org.apache.spark.sql.types.StructType = StructType(StructField(features,ArrayType(DoubleType,true),true))
val df = spark.createDataFrame(parsedData, schema2)
df: org.apache.spark.sql.DataFrame = [features: array<double>]
val df2 = df.map{ case Row(features: Array[Double]) => Row(org.apache.spark.ml.linalg.Vectors.dense(features)) }
Which throws the following error, even though spark.implicits._ is imported:
error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
Any help is greatly appreciated, thanks!

Off the top of my head:
Use csv source and VectorAssembler:
import scala.util.Try
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions._  // for col and udf
import spark.implicits._                 // for $"features"
val path: String = ???
val n: Int = ???
val m: Int = ???
val raw = spark.read.csv(path)
val featureCols = raw.columns.slice(n, m)
val exprs = featureCols.map(c => col(c).cast("double"))
val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
assembler.transform(raw.select(exprs: _*)).select($"features")
Use text source and UDF:
def parse_(n: Int, m: Int)(s: String) = Try(
  Vectors.dense(s.split(',').slice(n, m).map(_.toDouble))
).toOption
def parse(n: Int, m: Int) = udf(parse_(n, m) _)
val raw = spark.read.text(path)
raw.select(parse(n, m)(col(raw.columns.head)).alias("features"))
Use text source and drop the wrapping Row:
spark.read.text(path).as[String].map(parse_(n, m)).toDF
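Since the end goal is K-Means, here is a minimal, self-contained sketch that feeds the output of the first (csv + VectorAssembler) approach into the ml KMeans estimator; the path and slice bounds are taken from the question, while k = 10 is only an assumption for the digits data:
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.col
// path and slice bounds from the question; k = 10 is an assumption
val path = "/digits480x.csv"
val (n, m) = (0, 64)
val raw = spark.read.csv(path)
val featureCols = raw.columns.slice(n, m)
val features = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
  .transform(raw.select(featureCols.map(c => col(c).cast("double")): _*))
  .select("features")
// ml K-Means consumes the ml Vector column directly
val model = new KMeans().setK(10).setFeaturesCol("features").fit(features)
model.transform(features).show()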

Related

DoubleType to VectorUDT

I have this code written using Spark 2.1:
val mycolumns = originalFile.schema.fieldNames
mycolumns.map(cname => stddevPerColumnName(df.select(cname), cname))
def stddevPerColumnName(df: DataFrame, cname: String): DataFrame =
  new StandardScaler()
    .setInputCol(cname)
    .setOutputCol("stddev")
    .setWithStd(true)
    .fit(df)
    .transform(df)
Every single column has type DoubleType originally inferred from a CSV file.
When I run the code I get the Exception:
Column FirstColumn must be of type org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7 but was actually DoubleType.
How can I convert the column type Double to VectorUDT?
You need to pass a vector into the ML model: use an assembler to put the double values into a vector, then do your ML, then take the values out of the vector (back to double) if required.
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.sql.functions._
val assembler = new VectorAssembler().setInputCols(Array("yourDoubleValue")).setOutputCol("features")
val vectorToColumn = udf { (x: DenseVector, index: Int) => x(index) }
val scaler = new StandardScaler().setInputCol("features").setOutputCol("featuresScaled")
*use DenseVector or SparseVector depending on your data
full example:
val data = spark.read....
val data_assembled = assembler.transform(data)
val data_scaled = scaler.fit(data_assembled).transform(data_assembled)
  .withColumn("backToMyDouble", round(vectorToColumn(col("featuresScaled"), lit(0)), 2))

Cannot pass arrays from MongoDB into Spark Machine Learning functions that require Vectors

My use case:
Read data from a MongoDB collection of the form:
{
  "_id" : ObjectId("582cab1b21650fc72055246d"),
  "label" : 167.517838916715,
  "features" : [
    10.0964787450654,
    218.621137772497,
    18.8833848806122,
    11.8010251302327,
    1.67037687829152,
    22.0766170950477,
    11.7122322171201,
    12.8014773524475,
    8.30441804118235,
    29.4821268054137
  ]
}
And pass it to the org.apache.spark.ml.regression.LinearRegression class to create a model for predictions.
My problem:
The Spark connector reads in "features" as Array[Double].
LinearRegression.fit(...) expects a Dataset with a label column and a features column.
The features column must be of type VectorUDT (so DenseVector or SparseVector will work).
I cannot .map features from Array[Double] to DenseVector because there is no relevant Encoder:
Error:(23, 11) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
.map{case Row(label: Double, features: Array[Double]) => Row(label, Vectors.dense(features))}
Custom Encoders cannot be defined.
My question:
Is there a way I can set the configuration of the Spark connector to
read in the "features" array as a Dense/SparseVector?
Is there any
other way I can achieve this (without, for example, using an
intermediary .csv file and loading that using libsvm)?
My code:
import com.mongodb.spark.MongoSpark
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.{Row, SparkSession}

case class DataPoint(label: Double, features: Array[Double])

object LinearRegressionWithMongo {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder()
      .appName("LinearRegressionWithMongo")
      .master("local[4]")
      .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/LinearRegressionTest.DataPoints")
      .getOrCreate()

    import spark.implicits._

    val dataPoints = MongoSpark.load(spark)
      .map{case Row(label: Double, features: Array[Double]) => Row(label, Vectors.dense(features))}

    val splitData = dataPoints.randomSplit(Array(0.7, 0.3), 42)
    val training = splitData(0)
    val test = splitData(1)

    val linearRegression = new LinearRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setRegParam(0.0)
      .setElasticNetParam(0.0)
      .setMaxIter(100)
      .setTol(1e-6)

    // Train the model
    val startTime = System.nanoTime()
    val linearRegressionModel = linearRegression.fit(training)
    val elapsedTime = (System.nanoTime() - startTime) / 1e9
    println(s"Training time: $elapsedTime seconds")

    // Print the weights and intercept for linear regression.
    println(s"Weights: ${linearRegressionModel.coefficients} Intercept: ${linearRegressionModel.intercept}")

    val modelEvaluator = new ModelEvaluator()
    println("Training data results:")
    modelEvaluator.evaluateRegressionModel(linearRegressionModel, training, "label")
    println("Test data results:")
    modelEvaluator.evaluateRegressionModel(linearRegressionModel, test, "label")

    spark.stop()
  }
}
Any help would be ridiculously appreciated!
There is a quick fix for this. If the data has been loaded into a DataFrame called df which has:
id - SQL double.
features - SQL array<double>.
like this one
val df = Seq((1.0, Array(2.3, 3.4, 4.5))).toDF("id", "features")
you select columns you need for downstream processing:
val idAndFeatures = df.select("id", "features")
convert to statically typed Dataset:
val tuples = idAndFeatures.as[(Double, Seq[Double])]
map and convert back to Dataset[Row]:
val spark: SparkSession = ???
import spark.implicits._
import org.apache.spark.ml.linalg.Vectors
tuples.map { case (id, features) =>
(id, Vectors.dense(features.toArray))
}.toDF("id", "features")
You can find a detailed explanation of how this differs from your current approach here.
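Putting those steps together on a toy frame shaped like the MongoDB documents above (a label plus an array<double> features column; this is only a sketch with made-up numbers, and in the real job the frame would come from MongoSpark.load(spark)):
import org.apache.spark.ml.linalg.Vectors
// assumes the SparkSession and spark.implicits._ import from above
val raw = Seq(
  (167.5, Array(10.1, 218.6, 18.9)),
  (120.3, Array(11.8, 190.2, 22.1))
).toDF("label", "features")
val dataPoints = raw
  .as[(Double, Seq[Double])]
  .map { case (label, features) => (label, Vectors.dense(features.toArray)) }
  .toDF("label", "features")
// dataPoints now has a proper ml Vector column and can be passed to
// new LinearRegression().fit(dataPoints)
dataPoints.printSchema()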

Get Cluster_ID and the rest of the table using Spark MLlib KMeans

I have this dataset (I'm showing just a few rows):
11.97,1355,401
3.49,25579,12908
9.29,129186,10882
28.73,10153,22356
3.69,22872,9798
13.49,160371,2911
24.36,106764,867
3.99,163670,16397
19.64,132547,401
And I'm trying to assign all these rows to 4 clusters using K-Means. For that I'm using the code I saw in this post: Spark MLLib Kmeans from dataframe, and back again
val data = sc.textFile("/user/cloudera/TESTE1")
val idPointRDD = data.map(s => (s(0), Vectors.dense(s(1).toInt,s(2).toInt))).cache()
val clusters = KMeans.train(idPointRDD.map(_._2), 4, 20)
val clustersRDD = clusters.predict(idPointRDD.map(_._2))
val idClusterRDD = idPointRDD.map(_._1).zip(clustersRDD)
val idCluster = idClusterRDD.toDF("purchase","id","product","cluster")
I'm getting these outputs:
scala> import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors
scala> val data = sc.textFile("/user/cloudera/TESTE")
data: org.apache.spark.rdd.RDD[String] = /user/cloudera/TESTE MapPartitionsRDD[7] at textFile at <console>:29
scala> val idPointRDD = data.map(s => (s(0), Vectors.dense(s(1).toInt,s(2).toInt))).cache()
idPointRDD: org.apache.spark.rdd.RDD[(Char, org.apache.spark.mllib.linalg.Vector)] = MapPartitionsRDD[8] at map at <console>:31
But when I run it I'm getting the following error:
java.lang.UnsupportedOperationException: Schema for type Char is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:715)
How can I solve this problem?
Many thanks!
Here is the thing. You are actually reading a CSV of values into an RDD of String and not converting it properly to numeric values. Since a string is a collection of characters, calling s(0), for example, gives you a Char; converting that to an integer or a double does work, but it's not what you are actually looking for.
You need to split your val data: RDD[String]:
val data: RDD[String] = ???
val idPointRDD = data.map { s =>
  s.split(",") match {
    case Array(x, y, z) =>
      Vectors.dense(x.toDouble, Integer.parseInt(y).toDouble, Integer.parseInt(z).toDouble)
  }
}.cache()
This should work for you!
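If you also want to keep the purchase/id/product columns next to the cluster id (as in your toDF call), a sketch along these lines might work, assuming the three comma-separated fields are purchase, id and product:
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import spark.implicits._  // already available in spark-shell
val data = sc.textFile("/user/cloudera/TESTE1")
// keep the original fields together with the feature vector
val parsed = data.map { s =>
  val Array(purchase, id, product) = s.split(",")
  (purchase.toDouble, id.toInt, product.toInt, Vectors.dense(id.toDouble, product.toDouble))
}.cache()
val clusters = KMeans.train(parsed.map(_._4), 4, 20)
val idCluster = parsed
  .map { case (purchase, id, product, v) => (purchase, id, product, clusters.predict(v)) }
  .toDF("purchase", "id", "product", "cluster")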

Can't run LDA on Dataset[(scala.Long, org.apache.spark.mllib.linalg.Vector)] in Spark 2.0

I am following this tutorial video on an LDA example and I'm getting the following issue:
<console>:37: error: overloaded method value run with alternatives:
(documents: org.apache.spark.api.java.JavaPairRDD[java.lang.Long,org.apache.spark.mllib.linalg.Vector])org.apache.spark.mllib.clustering.LDAModel <and>
(documents: org.apache.spark.rdd.RDD[(scala.Long, org.apache.spark.mllib.linalg.Vector)])org.apache.spark.mllib.clustering.LDAModel
cannot be applied to (org.apache.spark.sql.Dataset[(scala.Long, org.apache.spark.mllib.linalg.Vector)])
val model = run(lda_countVector)
^
So I want to convert this DF to an RDD, but it always ends up as a Dataset for me. Can anyone please look into this issue?
// Convert DF to RDD
import org.apache.spark.mllib.linalg.Vector
val lda_countVector = countVectors.map { case Row(id: Long, countVector: Vector) => (id, countVector) }
// import org.apache.spark.mllib.linalg.Vector
// lda_countVector: org.apache.spark.sql.Dataset[(Long, org.apache.spark.mllib.linalg.Vector)] = [_1: bigint, _2: vector]
The Spark API changed between the 1.x and 2.x branches. In particular, DataFrame.map returns a Dataset, not an RDD, so the result is not compatible with the old MLlib RDD-based API. You should convert the data to an RDD first, as follows:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
val a = Vectors.dense(Array(1.0, 2.0, 3.0))
val b = Vectors.dense(Array(3.0, 4.0, 5.0))
val df = Seq((1L ,a), (2L, b), (2L, a)).toDF
val ldaDF = df.rdd.map {
  case Row(id: Long, countVector: Vector) => (id, countVector)
}
val model = new LDA().setK(3).run(ldaDF)
or you can convert to a typed Dataset and then to an RDD:
val model = new LDA().setK(3).run(df.as[(Long, Vector)].rdd)
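Once the model is trained you can sanity-check it, for example by looking at the top terms per topic:
// term indices and weights of the top terms for each of the 3 topics
val topics = model.describeTopics(maxTermsPerTopic = 3)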

How to vectorize DataFrame columns for ML algorithms?

I have a DataFrame with some categorical string values (e.g. uuid|url|browser).
I would like to convert them to doubles to execute an ML algorithm that accepts a double matrix.
As a conversion method I used StringIndexer (Spark 1.4), which maps my string values to double values, so I defined a function like this:
def str(arg: String, df: DataFrame): DataFrame = {
  val indexer = new StringIndexer().setInputCol(arg).setOutputCol(arg + "_index")
  val newDF = indexer.fit(df).transform(df)
  newDF
}
Now the issue is that I would like to iterate over each column of the df, call this function, and add (or convert) the original string column into the parsed double column, so the result would be:
Initial df:
[String: uuid|String: url| String: browser]
Final df:
[String: uuid|Double: uuid_index|String: url|Double: url_index|String: browser|Double: Browser_index]
Thanks in advance
You can simply foldLeft over the Array of columns:
val transformed: DataFrame = df.columns.foldLeft(df)((df, arg) => str(arg, df))
Still, I would argue that it is not a good approach. Since str discards the StringIndexerModel, it cannot be used when you get new data. Because of that I would recommend using a Pipeline:
import org.apache.spark.ml.Pipeline
val transformers: Array[org.apache.spark.ml.PipelineStage] = df.columns.map(
  cname => new StringIndexer()
    .setInputCol(cname)
    .setOutputCol(s"${cname}_index")
)
// Add the rest of your pipeline like VectorAssembler and algorithm
val stages: Array[org.apache.spark.ml.PipelineStage] = transformers ++ ???
val pipeline = new Pipeline().setStages(stages)
val model = pipeline.fit(df)
model.transform(df)
VectorAssembler can be included like this:
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
  .setInputCols(df.columns.map(cname => s"${cname}_index"))
  .setOutputCol("features")
val stages = transformers :+ assembler
You could also use RFormula, which is less customizable, but much more concise:
import org.apache.spark.ml.feature.RFormula
val rf = new RFormula().setFormula(" ~ uuid + url + browser - 1")
val rfModel = rf.fit(dataset)
rfModel.transform(dataset)