Using Spark ML's OneHotEncoder on multiple columns - scala

I've been able to create a pipeline that lets me index multiple string columns at once, but I am getting stuck encoding them: unlike the indexer, the encoder is not an estimator, so I never call fit, per the OneHotEncoder example in the docs.
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler, OneHotEncoder}
import org.apache.spark.ml.Pipeline
val data = sqlContext.read.parquet("s3n://map2-test/forecaster/intermediate_data")
val df = data.select("win","bid_price","domain","size", "form_factor").na.drop()
//indexing columns
val stringColumns = Array("domain","size", "form_factor")
val index_transformers: Array[org.apache.spark.ml.PipelineStage] = stringColumns.map(
cname => new StringIndexer()
.setInputCol(cname)
.setOutputCol(s"${cname}_index")
)
// Add the rest of your pipeline like VectorAssembler and algorithm
val index_pipeline = new Pipeline().setStages(index_transformers)
val index_model = index_pipeline.fit(df)
val df_indexed = index_model.transform(df)
//encoding columns
val indexColumns = df_indexed.columns.filter(x => x contains "index")
val one_hot_encoders: Array[org.apache.spark.ml.PipelineStage] = indexColumns.map(
cname => new OneHotEncoder()
.setInputCol(cname)
.setOutputCol(s"${cname}_vec")
)
val one_hot_pipeline = new Pipeline().setStages(one_hot_encoders)
val df_encoded = one_hot_pipeline.transform(df_indexed)
The OneHotEncoder object doesn't have a fit method, so putting it in the same pipeline as the indexers doesn't work: it throws an error when I call fit on the pipeline. I also can't call transform on the pipeline I built from the array of pipeline stages, one_hot_encoders.
I have not found a good way to use the OneHotEncoder without individually creating each encoder and calling transform on it for every column I want to encode.

Spark >= 3.0:
In Spark 3.0, OneHotEncoderEstimator has been renamed to OneHotEncoder:
import org.apache.spark.ml.feature.{OneHotEncoder, OneHotEncoderModel}
val encoder = new OneHotEncoder()
.setInputCols(indexColumns)
.setOutputCols(indexColumns map (name => s"${name}_vec"))
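The 3.0 OneHotEncoder is still an estimator, so, assuming the df_indexed DataFrame from the question, it is fitted and applied the same way as in 2.3+:
encoder.fit(df_indexed).transform(df_indexed)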
Spark >= 2.3
Spark 2.3 introduced the new classes OneHotEncoderEstimator and OneHotEncoderModel, which require fitting even when used outside a Pipeline and operate on multiple columns at the same time.
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, OneHotEncoderModel}
val encoder = new OneHotEncoderEstimator()
.setInputCols(indexColumns)
.setOutputCols(indexColumns map (name => s"${name}_vec"))
encoder.fit(df_indexed).transform(df_indexed)
Spark < 2.3
Even if the transformers you use don't require fitting, you still have to call the fit method to create a PipelineModel, which can then be used to transform data.
one_hot_pipeline.fit(df_indexed).transform(df_indexed)
On a side note, you can combine indexing and encoding into a single Pipeline:
val pipeline = new Pipeline()
.setStages(index_transformers ++ one_hot_encoders)
val model = pipeline.fit(df)
model.transform(df)
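If you also want a single features column (the question imports VectorAssembler but never uses it), here is a minimal sketch that appends an assembler stage, assuming the encoded column names produced above and that bid_price is numeric:
val assembler = new VectorAssembler()
  .setInputCols(Array("bid_price") ++ stringColumns.map(c => s"${c}_index_vec"))
  .setOutputCol("features")
val full_pipeline = new Pipeline()
  .setStages(index_transformers ++ one_hot_encoders :+ assembler)
val df_features = full_pipeline.fit(df).transform(df)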
Edit:
The error you see means that one of your columns contains an empty string. It is accepted by the indexer but cannot be used for encoding. Depending on your requirements you can drop these rows or use a dummy label. Unfortunately you cannot use NULLs until SPARK-11569) is resolved.
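For example, a minimal sketch of both options for one of the string columns, assuming the df and column names from the question (the "missing" placeholder label is just an illustrative choice):
import org.apache.spark.sql.functions.{col, when}
// Option 1: drop rows where the string column is empty
val dfNoEmpty = df.filter("domain != ''")
// Option 2: replace empty strings with a dummy label before indexing
val dfDummy = df.withColumn("domain",
  when(col("domain") === "", "missing").otherwise(col("domain")))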

Related

Using map() and filter() in Spark instead of spark.sql

I have two datasets that I want to INNER JOIN to give me a whole new table with the desired data. I used SQL and managed to get it. But now I want to try it with map() and filter(); is it possible?
This is my code using Spark SQL:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
object hello {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local")
.setAppName("quest9")
val sc = new SparkContext(conf)
val spark = SparkSession.builder().appName("quest9").master("local").getOrCreate()
val zip_codes = spark.read.format("csv").option("header", "true").load("/home/hdfs/Documents/quest_9/doc/zip.csv")
val census = spark.read.format("csv").option("header", "true").load("/home/hdfs/Documents/quest_9/doc/census.csv")
census.createOrReplaceTempView("census")
zip_codes.createOrReplaceTempView("zip")
//val query = spark.sql("SELECT * FROM census")
val query = spark.sql("SELECT DISTINCT census.Total_Males AS male, census.Total_Females AS female FROM census INNER JOIN zip ON census.Zip_Code=zip.Zip_Code WHERE zip.City = 'Inglewood' AND zip.County = 'Los Angeles'")
query.show()
query.write.parquet("/home/hdfs/Documents/population/census/IDE/census.parquet")
sc.stop()
}
}
The only sensible way, in general, to do this would be to use the join() method of Dataset. I would urge you to question the need to use only map/filter for this, as it is not intuitive and will probably confuse any experienced Spark developer (or, simply put, make them roll their eyes). It may also lead to scalability issues should the dataset grow.
That said, in your use case it is pretty simple to avoid using join. Another possibility would be to issue two separate jobs to Spark:
fetch the zip code(s) that interest you
filter the census data on that (those) zip code(s)
Step 1: collect the zip codes of interest (I'm not sure of the exact syntax as I do not have a Spark shell at hand, but it should be trivial to find the right one).
import spark.implicits._ // needed for the typed map / .as[String] below
var codes: Seq[String] = zip_codes
// filter on the city
.filter(row => row.getAs[String]("City").equals("Inglewood"))
// filter on the county
.filter(row => row.getAs[String]("County").equals("Los Angeles"))
// map to zip code as a String
.map(row => row.getAs[String]("Zip_Code"))
.as[String]
// Collect on the driver side
.collect()
Then again, writing it this way instead of using select/where will look pretty strange to anyone used to Spark.
Still, the reason this works is that we can be sure the zip codes matching a given town and county will be very few, so it is safe to perform a driver-side collection of the result.
Now on to step 2:
census.filter(row => codes.contains(row.getAs[String]("Zip_Code")))
.map( /* whatever to get your data out */ )
What you need is a join; your query roughly translates to:
census.as("census")
.join(
broadcast(zip_codes
.where($"City"==="Inglewood")
.where($"County"==="Los Angeles")
.as("zip"))
,Seq("Zip_Code"),
"inner" // "leftsemi" would also be sufficient
)
.select(
$"census.Total_Males".as("male"),
$"census.Total_Females".as("female")
).distinct()

CrossValidator does not support VectorUDT as label in spark-ml

I have a problem with ml.CrossValidator in Scala Spark while using a one hot encoder.
This is my code:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{CountVectorizer, OneHotEncoder, StringIndexer, Tokenizer}
val tokenizer = new Tokenizer().
setInputCol("subjects").
setOutputCol("subject")
//CountVectorizer / TF
val countVectorizer = new CountVectorizer().
setInputCol("subject").
setOutputCol("features")
// convert string into numerical values
val labelIndexer = new StringIndexer().
setInputCol("labelss").
setOutputCol("labelsss")
// convert numerical to one hot encoder
val labelEncoder = new OneHotEncoder().
setInputCol("labelsss").
setOutputCol("label")
val logisticRegression = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array(tokenizer,countVectorizer,labelIndexer,labelEncoder,logisticRegression))
and it gives me an error like this:
cv: org.apache.spark.ml.tuning.CrossValidator = cv_8cc1ae985e39
java.lang.IllegalArgumentException: requirement failed: Column label must be of type NumericType but was actually of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.
I have no idea how to fix it.
I need the one hot encoder because my label is categorical.
Thanks for helping me :)
There is actually no need to use OneHotEncoder/OneHotEncoderEstimator for labels (target variables), and you actually shouldn't: it will create a vector (of type org.apache.spark.ml.linalg.VectorUDT), which is exactly what the error is complaining about.
StringIndexer is enough to define that your labels are categorical.
Let's check that in a small example:
val df = Seq((0, "a"),(1, "b"),(2, "c"),(3, "a"),(4, "a"),(5, "c")).toDF("category", "text")
// df: org.apache.spark.sql.DataFrame = [category: int, text: string]
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex").fit(df)
// indexer: org.apache.spark.ml.feature.StringIndexerModel = strIdx_cf691c087e1d
val indexed = indexer.transform(df)
// indexed: org.apache.spark.sql.DataFrame = [category: int, text: string ... 1 more field]
indexed.schema.map(_.metadata).foreach(println)
// {}
// {}
// {"ml_attr":{"vals":["4","5","1","0","2","3"],"type":"nominal","name":"categoryIndex"}}
As you can see, StringIndexer actually attaches metadata to that column (categoryIndex) and marks it as nominal, a.k.a. categorical.
You can also see that the column's attributes include the list of categories.
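Applied to the pipeline from the question, this means you can drop the OneHotEncoder stage for the label entirely and let StringIndexer write the label column directly. A minimal sketch, reusing the other stages defined in the question and LogisticRegression's default "label" column:
val labelIndexer = new StringIndexer().
setInputCol("labelss").
setOutputCol("label") // numeric label column; the attached metadata marks it as categorical
val pipeline = new Pipeline().setStages(
Array(tokenizer, countVectorizer, labelIndexer, logisticRegression))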
More on this in my other answer about How to handle categorical features with spark-ml?
Concerning data preparation and metadata with spark-ml, I strongly advise you to read the following entry:
https://github.com/awesome-spark/spark-gotchas/blob/5ad4c399ffd2821875f608be8aff9f1338478444/06_data_preparation.md
Disclaimer: I'm the co-author of the entry in the link.
Note (excerpt from the docs):
Because this existing OneHotEncoder is a stateless transformer, it is not usable on new data where the number of categories may differ from the training data. In order to fix this, a new OneHotEncoderEstimator was created that produces an OneHotEncoderModel when fitting. For more detail, please see SPARK-13030.
OneHotEncoder has been deprecated in 2.3.0 and will be removed in 3.0.0. Please use OneHotEncoderEstimator instead.

Spark DataFrame to RDD and back

I am writing an Apache Spark application using Scala. To handle and store data I use DataFrames. I have a nice pipeline with feature extraction and a MultiLayerPerceptron classifier, using the ML API.
I also want to use SVM (for comparison purposes). The thing is (and correct me if I am mistaken) that only MLlib provides SVM, and MLlib is not ready to handle DataFrames, only RDDs.
So I figured I can keep the core of my application using DataFrames and, to use SVM, 1) convert the DataFrame columns I need to an RDD[LabeledPoint] and 2) after classification add the SVM's predictions to the DataFrame as a new column.
The first part I handled with a small function:
private def dataFrameToRDD(dataFrame : DataFrame) : RDD[LabeledPoint] = {
val rddMl = dataFrame.select("label", "features").rdd.map(r => (r.getInt(0).toDouble, r.getAs[org.apache.spark.ml.linalg.SparseVector](1)))
rddMl.map(r => new LabeledPoint(r._1, Vectors.dense(r._2.toArray)))
}
I have to specify and convert the vector type, since the feature extraction uses the ML API and not MLlib.
Then this RDD[LabeledPoint] is fed to the SVM and classification goes smoothly, no issues. At the end, following Spark's example, I get an RDD[Double]:
val predictions = rdd.map(point => model.predict(point.features))
Now I want to add the prediction score as a column to the original DataFrame and return it. This is where I got stuck. I can convert the RDD[Double] to a DataFrame using
(SQLContext setup omitted)
import sqlContext.implicits._
val plDF = predictions.toDF("prediction")
But how do I join the two DataFrames so that the second DataFrame becomes a column of the original one? I tried the join and union methods but got SQL exceptions, as the DataFrames have no common columns to join or union on.
EDIT
I tried
data.withColumn("prediction", plDF.col("prediction"))
But I get an Analysis Exception :(
I haven't figured out how to do it without resorting to RDDs, but anyway here's how I solved it with an RDD. I've added the rest of the code so that anyone can understand the complete logic. Any suggestions are appreciated.
package stuff
import java.util.logging.{Level, Logger}
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
/**
* Created by camandros on 10-03-2017.
*/
class LinearSVMClassifier extends Classifier with Serializable{
@transient lazy val log: Logger = Logger.getLogger(getClass.getName)
private var model : SVMModel = _
override def train(data : DataFrame): Unit = {
val rdd = dataFrameToRDD(data)
// Run training algorithm to build the model
val numIter : Int = 100
val step = Osint.properties(Osint.SVM_STEPSIZE).toDouble
val c = Osint.properties(Osint.SVM_C).toDouble
log.log(Level.INFO, "Initiating SVM training with parameters: C="+c+", step="+step)
model = SVMWithSGD.train(rdd, numIterations = numIter, stepSize = step, regParam = c)
log.log(Level.INFO, "Model training finished")
// Clear the default threshold.
model.clearThreshold()
}
override def classify(data : DataFrame): DataFrame = {
log.log(Level.INFO, "Converting DataFrame to RDD")
val rdd = dataFrameToRDD(data)
log.log(Level.INFO, "Conversion finished; beginning classification")
// Compute raw scores on the test set.
val predictions = rdd.map(point => model.predict(point.features))
log.log(Level.INFO, "Classification finished; Transforming RDD to DataFrame")
val sqlContext : SQLContext = Osint.spark.sqlContext
val tupleRDD = data.rdd.zip(predictions).map(t => Row.fromSeq(t._1.toSeq ++ Seq(t._2)))
sqlContext.createDataFrame(tupleRDD, data.schema.add("predictions", "Double"))
//TODO this should work, but it doesn't: the "withColumn" method seems to only be able to add
// new columns using information from the same dataframe; therefore I am using the horrible rdd conversion
//val sqlContext : SQLContext = Osint.spark.sqlContext
//import sqlContext.implicits._
//val plDF = predictions.toDF("predictions")
//data.withColumn("prediction", plDF.col("predictions"))
}
private def dataFrameToRDD(dataFrame : DataFrame) : RDD[LabeledPoint] = {
val rddMl = dataFrame.select("label", "features").rdd.map(r => (r.getInt(0).toDouble, r.getAs[org.apache.spark.ml.linalg.SparseVector](1)))
rddMl.map(r => new LabeledPoint(r._1, Vectors.dense(r._2.toArray)))
}
}
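For completeness, a hypothetical usage sketch (the Classifier trait, the Osint configuration object, and the DataFrame names trainingDF/testDF are assumptions based on the question's codebase, which is not shown here):
val svm = new LinearSVMClassifier()
svm.train(trainingDF)                 // trainingDF: DataFrame with "label" and "features" columns
val withPreds = svm.classify(testDF)  // returns the input DataFrame with an added "predictions" column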

How to create a Spark Dataset from an RDD

I have an RDD[LabeledPoint] intended to be used within a machine learning pipeline. How do we convert that RDD to a Dataset? Note that the newer spark.ml APIs require inputs in the Dataset format.
Here is an answer that traverses an extra step - the DataFrame. We use the SQLContext to create a DataFrame and then create a Dataset of the desired object type - in this case a LabeledPoint:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // provides the LabeledPoint encoder for .as[LabeledPoint]
val pointsTrainDf = sqlContext.createDataFrame(training)
val pointsTrainDs = pointsTrainDf.as[LabeledPoint]
Update: Ever heard of SparkSession? (Neither had I until now...)
So apparently the SparkSession is the Preferred Way (TM) in Spark 2.0.0 and moving forward. Here is the updated code for the new (spark) world order:
Spark 2.0.0+ approaches
Notice that in both of the approaches below (the simpler of which is credited to @zero323) we achieve an important saving compared to the SQLContext approach: it is no longer necessary to first create a DataFrame.
val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._ // provides the LabeledPoint encoder
val pointsTrainDs = sparkSession.createDataset(training)
val model = new LogisticRegression()
.fit(pointsTrainDs)
Second way for Spark 2.0.0+ Credit to @zero323
val spark: org.apache.spark.sql.SparkSession = ???
import spark.implicits._
val trainDs = training.toDS()
Traditional Spark 1.X and earlier approach
val sqlContext = new SQLContext(sc) // Note this is *deprecated* in 2.0.0
import sqlContext.implicits._
val training = splits(0).cache() // splits: e.g. the result of randomSplit on the labeled data
val test = splits(1)
val trainDs = training.toDS()
See also: How to store custom objects in Dataset? by the esteemed @zero323.

How to Cache an Array of Dataframes/Values in Spark

I am trying to build a large number of random forest models by group using Spark. My approach is to cache a large input data file, split it into pieces based on the School_ID, cache each school's input data in memory, run a model on each of them, and then extract the labels and predictions.
model_input.cache()
val schools = model_input.select("School_ID").distinct.collect.flatMap(_.toSeq)
val bySchoolArray = schools.map(School_ID => model_input.where($"School_ID" <=> School_ID).cache)
import org.apache.spark.sql.DataFrame
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.{Pipeline, PipelineModel}
def trainModel(df: DataFrame): PipelineModel = {
val rf = new RandomForestClassifier()
//omit some parameters
val pipeline = new Pipeline().setStages(Array(rf))
pipeline.fit(df)
}
val bySchoolArrayModels = bySchoolArray.map(df => trainModel(df))
val preds = (0 to schools.length - 1).map { i =>
  val pred = bySchoolArrayModels(i).transform(bySchoolArray(i)).select("prediction", "label")
  pred.write.format("com.databricks.spark.csv").
    option("header", "true").
    save("predictions/pred" + schools(i))
}
The code works fine on a small subset, but it takes longer than I expected. It seems that every time I run an individual model, Spark reads the entire file, and it takes forever to complete all the model runs. I was wondering whether I did not cache the file correctly or whether something is wrong with the way I coded it.
Any suggestions would be useful. Thanks!
RDDs are immutable, so rdd.cache() returns a new RDD. You need to assign the cached RDD to another variable and then re-use that; otherwise you are not using the cached RDD.
val cachedModelInput = model_input.cache()
val schools = cachedModelInput.select("School_ID").distinct.collect.flatMap(_.toSeq)
....
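To make the pattern concrete, a minimal sketch of the per-school split and training from the question, reusing the cached DataFrame (trainModel, the $-column syntax, and the column name are taken from the question):
val bySchoolArray = schools.map(id =>
  cachedModelInput.where($"School_ID" <=> id).cache())
val bySchoolArrayModels = bySchoolArray.map(df => trainModel(df))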