OneHotEncoder in Spark Dataframe in Pipeline - scala

I've been trying to get an example running in Spark and Scala with the adult dataset.
Using Scala 2.11.8 and Spark 1.6.1.
The problem (for now) lies in the number of categorical features in that dataset that all need to be encoded to numbers before a Spark ML algorithm can do its job.
So far I have this:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.OneHotEncoder
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object Adult {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Adult example").setMaster("local[*]")
    val sparkContext = new SparkContext(conf)
    val sqlContext = new SQLContext(sparkContext)

    val data = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true") // Use first line of all files as header
      .option("inferSchema", "true") // Automatically infer data types
      .load("src/main/resources/adult.data")

    val categoricals = data.dtypes filter (_._2 == "StringType")
    val encoders = categoricals map (cat => new OneHotEncoder().setInputCol(cat._1).setOutputCol(cat._1 + "_encoded"))
    val features = data.dtypes filterNot (_._1 == "label") map (tuple => if (tuple._2 == "StringType") tuple._1 + "_encoded" else tuple._1)

    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)

    val pipeline = new Pipeline()
      .setStages(encoders ++ Array(lr))

    val model = pipeline.fit(data)
  }
}
However, this doesn't work. The DataFrame passed to pipeline.fit still contains the original string features, and fitting therefore throws an exception.
How can I remove these "StringType" columns in a pipeline?
Or maybe I'm doing it completely wrong, so if someone has a different suggestion I'm happy to hear all input :).
The reason why I chose to follow this flow is that I have an extensive background in Python and Pandas, but am trying to learn both Scala and Spark.

There is one thing that can be rather confusing here if you're used to higher-level frameworks: you have to index the features before you can use the encoder. As explained in the API docs:
one-hot encoder (...) maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder}

val df = Seq((1L, "foo"), (2L, "bar")).toDF("id", "x")

val categoricals = df.dtypes.filter(_._2 == "StringType").map(_._1)

val indexers = categoricals.map(
  c => new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx")
)

val encoders = categoricals.map(
  c => new OneHotEncoder().setInputCol(s"${c}_idx").setOutputCol(s"${c}_enc")
)

val pipeline = new Pipeline().setStages(indexers ++ encoders)
val transformed = pipeline.fit(df).transform(df)
transformed.show
// +---+---+-----+-------------+
// | id| x|x_idx| x_enc|
// +---+---+-----+-------------+
// | 1|foo| 1.0| (1,[],[])|
// | 2|bar| 0.0|(1,[0],[1.0])|
// +---+---+-----+-------------+
As you can see, there is no need to drop the string columns from the pipeline. In practice OneHotEncoder will accept a numeric column with a NominalAttribute, BinaryAttribute or missing type attribute.
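To connect this back to the original question's flow, here is a minimal sketch, not a tested drop-in: it assumes the DataFrame is data and that the label column is already numeric and named label (the adult income column is a string, so in practice it would need its own StringIndexer too), and the names stringCols, numericCols and assembler are only illustrative. It shows how the indexers and encoders can be combined with a VectorAssembler and the LogisticRegression stage in a single pipeline:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// Index and encode every string column; keep the remaining numeric columns as-is.
val stringCols = data.dtypes.filter(_._2 == "StringType").map(_._1).filterNot(_ == "label")
val numericCols = data.dtypes.filterNot(_._2 == "StringType").map(_._1).filterNot(_ == "label")

val indexers = stringCols.map(c =>
  new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx"))
val encoders = stringCols.map(c =>
  new OneHotEncoder().setInputCol(s"${c}_idx").setOutputCol(s"${c}_enc"))

// Assemble the encoded vectors and the numeric columns into a single features column.
val assembler = new VectorAssembler()
  .setInputCols(stringCols.map(c => s"${c}_enc") ++ numericCols)
  .setOutputCol("features")

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)
  .setFeaturesCol("features")
  .setLabelCol("label")

val stages: Array[org.apache.spark.ml.PipelineStage] = indexers ++ encoders ++ Array(assembler, lr)
val pipeline = new Pipeline().setStages(stages)
val model = pipeline.fit(data)

The VectorAssembler is what removes the need to drop anything: the estimator only ever looks at the assembled features column, so the original string columns can simply stay in the DataFrame.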

Related

Spark ML throws exception for Decision Tree classification: Column features must be of type numeric but was actually of type struct

I am trying to create a Spark ML model with the Decision Tree Classifier to perform classification, but I am getting an error saying the features in my training set should be of a numeric type instead of a struct type.
Here is the minimal reproducible example that I tried:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.VectorUDT
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml._
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.DecisionTreeClassifier

val df8 = Seq(
  ("2022-08-22 10:00:00", 417.7, 419.97, 419.97, 417.31, "nothing"),
  ("2022-08-22 11:30:00", 417.35, 417.33, 417.46, 416.77, "buy"),
  ("2022-08-22 13:00:00", 417.55, 417.68, 418.04, 417.48, "sell"),
  ("2022-08-22 14:00:00", 417.22, 417.8, 421.13, 416.83, "sell")
)
val df77 = spark.createDataset(df8).toDF("30mins_date", "30mins_close", "30mins_open", "30mins_high", "30mins_low", "signal")

val assembler_features = new VectorAssembler()
  .setInputCols(Array("30mins_close", "30mins_open", "30mins_high", "30mins_low"))
  .setOutputCol("features")
val output2 = assembler_features.transform(df77)

val indexer = new StringIndexer()
  .setInputCol("signal")
  .setOutputCol("signalIndex")
val indexed = indexer.fit(output2).transform(output2)

val assembler_label = new VectorAssembler()
  .setInputCols(Array("signalIndex"))
  .setOutputCol("signalIndexV")
val output = assembler_label.transform(indexed)

val dt = new DecisionTreeClassifier()
  .setLabelCol("features")
  .setFeaturesCol("signalIndexV")

val Array(trainingData, testData) = output.select("features", "signalIndexV").randomSplit(Array(0.7, 0.3))
val model = dt.fit(trainingData)
Output error:
java.lang.IllegalArgumentException: requirement failed: Column features must be of type numeric but was actually of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:78)
at org.apache.spark.ml.PredictorParams.validateAndTransformSchema(Predictor.scala:54)
at org.apache.spark.ml.PredictorParams.validateAndTransformSchema$(Predictor.scala:47)
at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:73)
at org.apache.spark.ml.classification.ClassifierParams.validateAndTransformSchema(Classifier.scala:43)
at org.apache.spark.ml.classification.ClassifierParams.validateAndTransformSchema$(Classifier.scala:39)
at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:51)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema(ProbabilisticClassifier.scala:38)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema$(ProbabilisticClassifier.scala:34)
at org.apache.spark.ml.classification.DecisionTreeClassifier.org$apache$spark$ml$tree$DecisionTreeClassifierParams$$super$validateAndTransformSchema(DecisionTreeClassifier.scala:46)
at org.apache.spark.ml.tree.DecisionTreeClassifierParams.validateAndTransformSchema(treeParams.scala:245)
at org.apache.spark.ml.tree.DecisionTreeClassifierParams.validateAndTransformSchema$(treeParams.scala:241)
at org.apache.spark.ml.classification.DecisionTreeClassifier.validateAndTransformSchema(DecisionTreeClassifier.scala:46)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:177)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:133)
... 61 elided
I tried the above code in a spark-shell environment:
spark v 3.3.1
scala v 2.12.15
Here is what trainingData looks like
+-----------------------------+------------+
|features |signalIndexV|
+-----------------------------+------------+
|[417.7,419.97,419.97,417.31] |[2.0] |
|[417.35,417.33,417.46,416.77]|[1.0] |
|[417.55,417.68,418.04,417.48]|[0.0] |
|[417.22,417.8,421.13,416.83] |[0.0] |
+-----------------------------+------------+
So what did I do wrong? How can I convert the features column into a numeric type?
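For what it's worth, the exception appears to come from the label and features columns being swapped in the DecisionTreeClassifier setup: setLabelCol("features") points the label at the assembled vector, which is exactly the struct type the error complains about. A minimal sketch of the likely fix, reusing only the values defined above (the extra VectorAssembler around signalIndex is then unnecessary, since the classifier expects a plain numeric label):

// Features: the assembled vector; label: the numeric index produced by StringIndexer.
val dt = new DecisionTreeClassifier()
  .setLabelCol("signalIndex")
  .setFeaturesCol("features")

val Array(trainingData, testData) =
  indexed.select("features", "signalIndex").randomSplit(Array(0.7, 0.3))
val model = dt.fit(trainingData)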

String Matching Spark ML algorithm in Scala

I am new to Scala as well as Spark ML. I am trying to create a string matching algorithm based on the recommendation from PySpark String Matching. Based on it I was able to implement the following so far:
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql._
import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, RegexTokenizer}
import spark.implicits._

// Load vendor files into a Dataset
val vendorData = spark.read.option("header", "true").option("inferSchema", "true").json(path = "Data/*.json").as[vendorData]

// Load IMDB file into a Dataset
val imdbData = spark.read.option("header", "True").option("inferSchema", "True").option("sep", "\t").csv(path = "Data/title.basics.tsv").as[imdbData]

// Remove special characters
val newVendorData = vendorData.withColumn("newtitle", functions.regexp_replace(vendorData.col("title"), "[^A-Za-z0-9_]", ""))
val newImdbData = imdbData.withColumn("newprimaryTitle", functions.regexp_replace(imdbData.col("primaryTitle"), "[^A-Za-z0-9_]", ""))

// Algorithm to find match percentage
val tokenizer = new RegexTokenizer().setPattern("").setInputCol("text").setMinTokenLength(1).setOutputCol("tokens")
val ngram = new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams")
val vectorizer = new HashingTF().setInputCol("ngrams").setOutputCol("vectors")
val lsh = new MinHashLSH().setInputCol("vectors").setOutputCol("lsh")

val pipeline = new Pipeline().setStages(Array(tokenizer, ngram, vectorizer, lsh))
val model = pipeline.fit(newVendorData.select("newtitle"))

val vendorHashed = model.transform(newVendorData.select("newtitle"))
val imdbHashed = model.transform(newImdbData.select("newprimaryTitle"))

model.stages.last.asInstanceOf[ml.feature.MinHashLSHModel].approxSimilarityJoin(vendorHashed, imdbHashed, .85).show()
When running I am getting the error below. On further investigation I could see that the issue is at this line:
val model = pipeline.fit(newVendorData.select("newtitle"))
But I can't see what it is.
Exception in thread "main" java.lang.IllegalArgumentException: text does not exist. Available: newtitle
at org.apache.spark.sql.types.StructType.$anonfun$apply$1(StructType.scala:278)
at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:168)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:277)
at org.apache.spark.ml.UnaryTransformer.transformSchema(Transformer.scala:109)
at org.apache.spark.ml.Pipeline.$anonfun$transformSchema$4(Pipeline.scala:184)
at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:184)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:136)
at MatchingJob$.$anonfun$main$1(MatchingJob.scala:84)
at MatchingJob$.$anonfun$main$1$adapted(MatchingJob.scala:43)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at MatchingJob$.main(MatchingJob.scala:43)
at MatchingJob.main(MatchingJob.scala)
I am not sure what I am doing wrong.
My inputs are below:
+------------------+
| newtitle|
+------------------+
| BhaagMilkhaBhaag|
| Fukrey|
| DilTohBacchaHaiJi|
|IndiasJungleHeroes|
| HrudayaGeethe|
newprimaryTitle
BhaagMilkhaBhaag
Fukrey
Carmencita
Leclownetseschiens
PauvrePierrot
Unbonbock
BlacksmithScene
ChineseOpiumDen
DilTohBacchaHaiJi
IndiasJungleHeroes
CorbettandCourtne
EdisonKinetoscopi
MissJerry
LeavingtheFactory
AkrobatischesPotp
TheArrivalofaTrain
ThePhotographical
TheWatererWatered
Autourdunecabine
Barquesortantduport
ItalienischerBaue
DasboxendeKnguruh
TheClownBarber
TheDerby1895
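For reference, the exception message itself points at the mismatch: the pipeline is fitted on a DataFrame whose only column is newtitle, while the RegexTokenizer was configured with setInputCol("text"). A minimal sketch of the likely fix, based only on the code shown above (the IMDB side would also need the same column name before model.transform):

import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.sql.functions.col

// Tokenize the column that actually exists in the DataFrame passed to fit.
val tokenizer = new RegexTokenizer()
  .setPattern("")
  .setMinTokenLength(1)
  .setInputCol("newtitle")
  .setOutputCol("tokens")

// Give the IMDB titles the same column name so the fitted model can transform them.
val imdbInput = newImdbData.select(col("newprimaryTitle").alias("newtitle"))
val imdbHashed = model.transform(imdbInput)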

Cannot pass arrays from MongoDB into Spark Machine Learning functions that require Vectors

My use case:
Read data from a MongoDB collection of the form:
{
  "_id" : ObjectId("582cab1b21650fc72055246d"),
  "label" : 167.517838916715,
  "features" : [
    10.0964787450654,
    218.621137772497,
    18.8833848806122,
    11.8010251302327,
    1.67037687829152,
    22.0766170950477,
    11.7122322171201,
    12.8014773524475,
    8.30441804118235,
    29.4821268054137
  ]
}
And pass it to the org.apache.spark.ml.regression.LinearRegression class to create a model for predictions.
My problem:
The Spark connector reads in "features" as Array[Double].
LinearRegression.fit(...) expects a Dataset with a label column and a features column.
The Features column must be of type VectorUDT (so DenseVector or SparseVector will work).
I cannot .map features from Array[Double] to DenseVector because there is no relevant Encoder:
Error:(23, 11) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
.map{case Row(label: Double, features: Array[Double]) => Row(label, Vectors.dense(features))}
Custom Encoders cannot be defined.
My question:
Is there a way I can set the configuration of the Spark connector to read in the "features" array as a Dense/SparseVector?
Is there any other way I can achieve this (without, for example, using an intermediary .csv file and loading that using libsvm)?
My code:
import com.mongodb.spark.MongoSpark
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.{Row, SparkSession}

case class DataPoint(label: Double, features: Array[Double])

object LinearRegressionWithMongo {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder()
      .appName("LinearRegressionWithMongo")
      .master("local[4]")
      .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/LinearRegressionTest.DataPoints")
      .getOrCreate()

    import spark.implicits._

    val dataPoints = MongoSpark.load(spark)
      .map{case Row(label: Double, features: Array[Double]) => Row(label, Vectors.dense(features))}

    val splitData = dataPoints.randomSplit(Array(0.7, 0.3), 42)
    val training = splitData(0)
    val test = splitData(1)

    val linearRegression = new LinearRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setRegParam(0.0)
      .setElasticNetParam(0.0)
      .setMaxIter(100)
      .setTol(1e-6)

    // Train the model
    val startTime = System.nanoTime()
    val linearRegressionModel = linearRegression.fit(training)
    val elapsedTime = (System.nanoTime() - startTime) / 1e9
    println(s"Training time: $elapsedTime seconds")

    // Print the weights and intercept for linear regression.
    println(s"Weights: ${linearRegressionModel.coefficients} Intercept: ${linearRegressionModel.intercept}")

    val modelEvaluator = new ModelEvaluator()
    println("Training data results:")
    modelEvaluator.evaluateRegressionModel(linearRegressionModel, training, "label")
    println("Test data results:")
    modelEvaluator.evaluateRegressionModel(linearRegressionModel, test, "label")

    spark.stop()
  }
}
Any help would be ridiculously appreciated!
There is a quick fix for this. If the data has been loaded into a DataFrame called df which has:
id - SQL double.
features - SQL array<double>.
like this one
val df = Seq((1.0, Array(2.3, 3.4, 4.5))).toDF("id", "features")
You select the columns you need for downstream processing:
val idAndFeatures = df.select("id", "features")
Convert it to a statically typed Dataset:
val tuples = idAndFeatures.as[(Double, Seq[Double])]
Map it and convert back to Dataset[Row]:
val spark: SparkSession = ???
import spark.implicits._
import org.apache.spark.ml.linalg.Vectors
tuples.map { case (id, features) =>
(id, Vectors.dense(features.toArray))
}.toDF("id", "features")
You can find a detailed explanation of the difference compared to your current approach here.
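Putting those steps together for the schema in the question, a minimal end-to-end sketch (assuming the collection has already been loaded into df, for example via MongoSpark.load(spark) as in the question, with label as a double and features as array<double>):

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import spark.implicits._

// Convert the array<double> column into an ML Vector column.
val vectorized = df
  .select("label", "features")
  .as[(Double, Seq[Double])]
  .map { case (label, features) => (label, Vectors.dense(features.toArray)) }
  .toDF("label", "features")

val linearRegression = new LinearRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")

val model = linearRegression.fit(vectorized)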

Can I convert an incoming stream of data into an array?

I'm trying to learn streaming data and manipulating it with the telecom churn dataset provided here. I've written a method to calculate this in batch:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD, LogisticRegressionWithLBFGS, LogisticRegressionModel, NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

object batchChurn {
  def main(args: Array[String]): Unit = {
    // setting spark context
    val conf = new SparkConf().setAppName("churn")
    val sc = new SparkContext(conf)

    // loading and mapping data into RDD
    val csv = sc.textFile("file://filename.csv")
    val data = csv.map { line =>
      val parts = line.split(",").map(_.trim)
      val stringvec = Array(parts(1)) ++ parts.slice(4, 20)
      val label = parts(20).toDouble
      val vec = stringvec.map(_.toDouble)
      LabeledPoint(label, Vectors.dense(vec))
    }

    val splits = data.randomSplit(Array(0.7, 0.3))
    val (training, testing) = (splits(0), splits(1))

    val numClasses = 2
    val categoricalFeaturesInfo = Map[Int, Int]()
    val numTrees = 6
    val featureSubsetStrategy = "auto"
    val impurity = "gini"
    val maxDepth = 7
    val maxBins = 32

    val model = RandomForest.trainClassifier(training, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

    val labelAndPreds = testing.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
  }
}
I've had no problems with this. Now, I looked at the NetworkWordCount example provided on the spark website, and changed the code slightly to see how it would behave.
val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("127.0.0.1", 9999)
val data = lines.flatMap(_.split(","))
My question is: is it possible to convert this DStream to an array which I can input into my analysis code? Currently, when I try to convert the result of val data = lines.flatMap(_.split(",")) to an array, it clearly says: error: value toArray is not a member of org.apache.spark.streaming.dstream.DStream[String]
Your DStream contains many RDDs. You can get access to the RDDs using the foreachRDD function:
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/streaming/dstream/DStream.html#foreachRDD(scala.Function1)
Then each RDD can be converted to an array using the collect function.
This has already been shown here:
For each RDD in a DStream how do I convert this to an array or some other typical Java data type?
DStream.foreachRDD gives you an RDD[String] for each interval. Of course, you could collect them in an array:
import scala.collection.mutable.ArrayBuffer

val arr = new ArrayBuffer[String]()
data.foreachRDD { rdd =>
  arr ++= rdd.collect()
}
Also keep in mind you could end up having way more data than you want in your driver, since a DStream can be huge.
To limit the data for your analysis, I would do it this way:
data.slice(new Time(fromMillis), new Time(toMillis)).flatMap(_.collect()).toSet
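As a usage note (a sketch reusing the ssc and data values from the question), none of the above output operations run until the streaming context is started:

ssc.start()            // registers foreachRDD as an output operation and starts receiving
ssc.awaitTermination() // or awaitTerminationOrTimeout(...) while experimenting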
You cannot put all the elements of a DStream in an array because those elements will keep being read over the wire, and your array would have to be indefinitely extensible.
The adaptation of this decision tree model to a streaming mode, where training and testing data arrive continuously, is not trivial for algorithmic reasons; while the answers mentioning collect are technically correct, they're not the appropriate solution to what you're trying to do.
If you want to run decision trees on a Stream in Spark, you may want to look at Hoeffding trees.

How to vectorize DataFrame columns for ML algorithms?

I have a DataFrame with some categorical string values (e.g. uuid|url|browser).
I would like to convert them to doubles to execute an ML algorithm that accepts a double matrix.
As a conversion method I used StringIndexer (Spark 1.4), which maps my string values to double values, so I defined a function like this:
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.DataFrame

def str(arg: String, df: DataFrame): DataFrame = {
  val indexer = new StringIndexer().setInputCol(arg).setOutputCol(arg + "_index")
  indexer.fit(df).transform(df)
}
Now the issue is that I would like to iterate over each column of a df, call this function, and add (or convert) the original string column into the parsed double column, so the result would be:
Initial df:
[String: uuid|String: url| String: browser]
Final df:
[String: uuid|Double: uuid_index|String: url|Double: url_index|String: browser|Double: Browser_index]
Thanks in advance
You can simply foldLeft over the Array of columns:
val transformed: DataFrame = df.columns.foldLeft(df)((df, arg) => str(arg, df))
Still, I will argue that it is not a good approach. Since str discards the StringIndexerModel, it cannot be used when you get new data. Because of that I would recommend using a Pipeline:
import org.apache.spark.ml.Pipeline

val transformers: Array[org.apache.spark.ml.PipelineStage] = df.columns.map(
  cname => new StringIndexer()
    .setInputCol(cname)
    .setOutputCol(s"${cname}_index")
)

// Add the rest of your pipeline like VectorAssembler and algorithm
val stages: Array[org.apache.spark.ml.PipelineStage] = transformers ++ ???

val pipeline = new Pipeline().setStages(stages)
val model = pipeline.fit(df)
model.transform(df)
VectorAssembler can be included like this:
val assembler = new VectorAssembler()
.setInputCols(df.columns.map(cname => s"${cname}_index"))
.setOutputCol("features")
val stages = transformers :+ assembler
You could also use RFormula, which is less customizable, but much more concise:
import org.apache.spark.ml.feature.RFormula
val rf = new RFormula().setFormula(" ~ uuid + url + browser - 1")
val rfModel = rf.fit(dataset)
rfModel.transform(dataset)
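As a small usage note, the fitted RFormulaModel adds a single assembled vector column (named "features" by default), which can then be passed to an estimator via setFeaturesCol:

// Inspect the assembled feature vectors produced by the formula above.
rfModel.transform(dataset).select("features").show(false)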