String Matching Spark ML algorithm in Scala

I am new to Scala as well as Spark ML. I am trying to create a string matching algorithm based on the recommendation from PySpark String Matching. Based on it, I was able to implement the following so far:
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql._
import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, NGram, RegexTokenizer}
import spark.implicits._
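// Load vendor JSON files into a Dataset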
val vendorData = spark.read.option("header", "true").option("inferSchema", "true").json(path = "Data/*.json").as[vendorData]
// Load IMDB file into a Dataset
val imdbData = spark.read.option("header", "true").option("inferSchema", "true").option("sep", "\t").csv(path = "Data/title.basics.tsv").as[imdbData]
// Remove special characters
val newVendorData = vendorData.withColumn("newtitle", functions.regexp_replace(vendorData.col("title"), "[^A-Za-z0-9_]",""))
val newImdbData = imdbData.withColumn("newprimaryTitle", functions.regexp_replace(imdbData.col("primaryTitle"), "[^A-Za-z0-9_]", ""))
//Algo to find match percentage
val tokenizer = new RegexTokenizer().setPattern("").setInputCol("text").setMinTokenLength(1).setOutputCol("tokens")
val ngram = new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams")
val vectorizer = new HashingTF().setInputCol("ngrams").setOutputCol("vectors")
val lsh = new MinHashLSH().setInputCol("vectors").setOutputCol("lsh")
val pipeline = new Pipeline().setStages(Array(tokenizer, ngram, vectorizer, lsh))
val model = pipeline.fit(newVendorData.select("newtitle"))
val vendorHashed = model.transform(newVendorData.select("newtitle"))
val imdbHashed = model.transform(newImdbData.select("newprimaryTitle"))
model.stages.last.asInstanceOf[ml.feature.MinHashLSHModel].approxSimilarityJoin(vendorHashed, imdbHashed, .85).show()
When running, I am getting the error below. On further investigation, I found that the issue is at this line:
val model = pipeline.fit(newVendorData.select("newtitle"))
But I can't see what it is.
Exception in thread "main" java.lang.IllegalArgumentException: text does not exist. Available: newtitle
at org.apache.spark.sql.types.StructType.$anonfun$apply$1(StructType.scala:278)
at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:168)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:277)
at org.apache.spark.ml.UnaryTransformer.transformSchema(Transformer.scala:109)
at org.apache.spark.ml.Pipeline.$anonfun$transformSchema$4(Pipeline.scala:184)
at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:184)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:136)
at MatchingJob$.$anonfun$main$1(MatchingJob.scala:84)
at MatchingJob$.$anonfun$main$1$adapted(MatchingJob.scala:43)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at MatchingJob$.main(MatchingJob.scala:43)
at MatchingJob.main(MatchingJob.scala)
I am not sure what I am doing wrong.
My inputs are below:
+------------------+
| newtitle|
+------------------+
| BhaagMilkhaBhaag|
| Fukrey|
| DilTohBacchaHaiJi|
|IndiasJungleHeroes|
| HrudayaGeethe|
newprimaryTitle
BhaagMilkhaBhaag
Fukrey
Carmencita
Leclownetseschiens
PauvrePierrot
Unbonbock
BlacksmithScene
ChineseOpiumDen
DilTohBacchaHaiJi
IndiasJungleHeroes
CorbettandCourtne
EdisonKinetoscopi
MissJerry
LeavingtheFactory
AkrobatischesPotp
TheArrivalofaTrain
ThePhotographical
TheWatererWatered
Autourdunecabine
Barquesortantduport
ItalienischerBaue
DasboxendeKnguruh
TheClownBarber
TheDerby1895
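The exception itself names the mismatch: the RegexTokenizer stage is configured with setInputCol("text"), while the DataFrame passed to pipeline.fit only has newtitle. A minimal fix sketch, assuming the same fitted pipeline should apply to both datasets (the IMDB column is renamed to match the tokenizer's input column):
val tokenizer = new RegexTokenizer().setPattern("").setInputCol("newtitle").setMinTokenLength(1).setOutputCol("tokens")
// ... ngram, vectorizer, lsh, and pipeline stages unchanged ...
val model = pipeline.fit(newVendorData.select("newtitle"))
val vendorHashed = model.transform(newVendorData.select("newtitle"))
val imdbHashed = model.transform(newImdbData.select(functions.col("newprimaryTitle").as("newtitle")))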

Related

Spark ML throws exception for Decision Tree classification: Column features must be of type numeric but was actually of type struct

I am trying to create a Spark ML model with the Decision Tree Classifier to perform classification, but I am getting an error saying the features in my training set should be of type numeric instead of type struct.
Here is the minimal reproducible example that I tried:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.VectorUDT
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml._
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.DecisionTreeClassifier
val df8 = Seq(
("2022-08-22 10:00:00",417.7,419.97,419.97,417.31,"nothing"),
("2022-08-22 11:30:00",417.35,417.33,417.46,416.77,"buy"),
("2022-08-22 13:00:00",417.55,417.68,418.04,417.48,"sell"),
("2022-08-22 14:00:00",417.22,417.8,421.13,416.83,"sell")
)
val df77 = spark.createDataset(df8).toDF("30mins_date","30mins_close","30mins_open","30mins_high","30mins_low", "signal")
val assembler_features = new VectorAssembler()
.setInputCols(Array("30mins_close","30mins_open","30mins_high","30mins_low"))
.setOutputCol("features")
val output2 = assembler_features.transform(df77)
val indexer = new StringIndexer()
.setInputCol("signal")
.setOutputCol("signalIndex")
val indexed = indexer.fit(output2).transform(output2)
val assembler_label = new VectorAssembler()
.setInputCols(Array("signalIndex"))
.setOutputCol("signalIndexV")
val output = assembler_label.transform(indexed)
val dt = new DecisionTreeClassifier()
.setLabelCol("features")
.setFeaturesCol("signalIndexV")
val Array(trainingData, testData) = output.select("features", "signalIndexV").randomSplit(Array(0.7, 0.3))
val model = dt.fit(trainingData)
Output error:
java.lang.IllegalArgumentException: requirement failed: Column features must be of type numeric but was actually of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:78)
at org.apache.spark.ml.PredictorParams.validateAndTransformSchema(Predictor.scala:54)
at org.apache.spark.ml.PredictorParams.validateAndTransformSchema$(Predictor.scala:47)
at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:73)
at org.apache.spark.ml.classification.ClassifierParams.validateAndTransformSchema(Classifier.scala:43)
at org.apache.spark.ml.classification.ClassifierParams.validateAndTransformSchema$(Classifier.scala:39)
at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:51)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema(ProbabilisticClassifier.scala:38)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema$(ProbabilisticClassifier.scala:34)
at org.apache.spark.ml.classification.DecisionTreeClassifier.org$apache$spark$ml$tree$DecisionTreeClassifierParams$$super$validateAndTransformSchema(DecisionTreeClassifier.scala:46)
at org.apache.spark.ml.tree.DecisionTreeClassifierParams.validateAndTransformSchema(treeParams.scala:245)
at org.apache.spark.ml.tree.DecisionTreeClassifierParams.validateAndTransformSchema$(treeParams.scala:241)
at org.apache.spark.ml.classification.DecisionTreeClassifier.validateAndTransformSchema(DecisionTreeClassifier.scala:46)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:177)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:133)
... 61 elided
I tried the above code in a spark-shell environment:
spark v 3.3.1
scala v 2.12.15
Here is what trainingData looks like:
+-----------------------------+------------+
|features |signalIndexV|
+-----------------------------+------------+
|[417.7,419.97,419.97,417.31] |[2.0] |
|[417.35,417.33,417.46,416.77]|[1.0] |
|[417.55,417.68,418.04,417.48]|[0.0] |
|[417.22,417.8,421.13,416.83] |[0.0] |
+-----------------------------+------------+
So what did I do wrong? How can I convert the features column into a numeric type?
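For context, the exception comes from the label check: setLabelCol points at "features" (a vector) while a classifier's label must be numeric, so the two setters appear swapped. A hedged sketch of the likely fix, assuming the Double column signalIndex produced by StringIndexer is used directly as the label:
val dt = new DecisionTreeClassifier()
  .setLabelCol("signalIndex")  // numeric label produced by StringIndexer
  .setFeaturesCol("features")  // assembled feature vector
val Array(trainingData, testData) = indexed.select("features", "signalIndex").randomSplit(Array(0.7, 0.3))
val model = dt.fit(trainingData)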

type mismatch; found : org.apache.spark.sql.DataFrame required: org.apache.spark.rdd.RDD

I am new to Scala and MLlib and I have been getting the following error. Please let me know if anyone has been able to resolve something similar.
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors
...
val conf = new SparkConf().setMaster("local").setAppName("SampleApp")
val sContext = new SparkContext(conf)
val sc = SparkSession.builder().master("local").appName("SampleApp").getOrCreate()
val sampleData = sc.read.json("input/sampleData.json")
val clusters = KMeans.train(sampleData, 10, 10)
val WSSSE = clusters.computeCost(sampleData)
clusters.save(sc, "target/org/apache/spark/KMeansExample/KMeansModel")
val sameModel = KMeansModel.load(sContext, "target/org/apache/spark/KMeansExample/KMeansModel")
The above line gives this error:
type mismatch; found : org.apache.spark.sql.DataFrame (which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
So I tried:
import org.apache.spark.ml.clustering.KMeans
val kmeans = new KMeans().setK(20)
val model = kmeans.fit(sampleData)
val predictions = model.transform(sampleData)
val evaluator = new ClusteringEvaluator()
val silhouette = evaluator.evaluate(predictions)
This gives the error:
Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
Available fields: address, attributes, business_id
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:267)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:267)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:266)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnTypes(SchemaUtils.scala:58)
at org.apache.spark.ml.util.SchemaUtils$.validateVectorCompatibleColumn(SchemaUtils.scala:119)
at org.apache.spark.ml.clustering.KMeansParams$class.validateAndTransformSchema(KMeans.scala:96)
at org.apache.spark.ml.clustering.KMeans.validateAndTransformSchema(KMeans.scala:285)
at org.apache.spark.ml.clustering.KMeans.transformSchema(KMeans.scala:382)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.clustering.KMeans$$anonfun$fit$1.apply(KMeans.scala:341)
at org.apache.spark.ml.clustering.KMeans$$anonfun$fit$1.apply(KMeans.scala:340)
at org.apache.spark.ml.util.Instrumentation$$anonfun$11.apply(Instrumentation.scala:183)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:183)
at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:340)
I have been referring to https://spark.apache.org/docs/latest/ml-clustering.html and https://spark.apache.org/docs/latest/mllib-clustering.html
Edit
Using setFeaturesCol()
import org.apache.spark.ml.clustering.KMeans
val assembler = new VectorAssembler()
.setInputCols(Array("is_open", "review_count", "stars"))
.setOutputCol("features")
val output = assembler.transform(sampleData).select("features")
val kmeans = new KMeans().setK(20).setFeaturesCol("features")
val model = kmeans.fit(output)
val predictions = model.transform(sampleData)
val evaluator = new ClusteringEvaluator()
val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")
This gives a different error still:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.util.Utils$.getSimpleName(Ljava/lang/Class;)Ljava/lang/String;
at org.apache.spark.ml.util.Instrumentation.logPipelineStage(Instrumentation.scala:52)
at org.apache.spark.ml.clustering.KMeans$$anonfun$fit$1.apply(KMeans.scala:350)
at org.apache.spark.ml.clustering.KMeans$$anonfun$fit$1.apply(KMeans.scala:340)
at org.apache.spark.ml.util.Instrumentation$$anonfun$11.apply(Instrumentation.scala:183)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:183)
at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:340)
Thanks.
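As an aside, a NoSuchMethodError like the one above usually means the Spark version the job was compiled against differs from the one on the runtime classpath, rather than a bug in the code. A hedged build.sbt sketch (the version number is illustrative) that keeps the modules aligned:
// Hypothetical build.sbt: keep all Spark modules on one version and mark them
// "provided" so the cluster's own Spark version is used at runtime
val sparkVersion = "2.4.8"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"   % sparkVersion % "provided",
  "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
)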
Use the Scala pipeline:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
val assembler = new VectorAssembler()
  .setInputCols(Array("feature1", "feature2", "feature3"))
  .setOutputCol("assembled_features")
val scaler = new StandardScaler()
  .setInputCol("assembled_features")
  .setOutputCol("features")
  .setWithStd(true)
  .setWithMean(false)
val kmeans = new KMeans().setK(2).setSeed(1L)
// create the pipeline
val pipeline = new Pipeline()
  .setStages(Array(assembler, scaler, kmeans))
// Fit the model (train is your input DataFrame with feature1..feature3 columns)
val clusterModel = pipeline.fit(train)
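A possible usage follow-up (test is a hypothetical held-out DataFrame with the same feature columns):
// Applying the fitted pipeline adds a "prediction" column with the cluster id
val predictions = clusterModel.transform(test)
predictions.select("features", "prediction").show()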

OneHotEncoder in Spark Dataframe in Pipeline

I've been trying to get an example running in Spark and Scala with the adult dataset.
Using Scala 2.11.8 and Spark 1.6.1.
The problem (for now) lies in the number of categorical features in that dataset that all need to be encoded to numbers before a Spark ML algorithm can do its job.
So far I have this:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.OneHotEncoder
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
object Adult {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("Adult example").setMaster("local[*]")
val sparkContext = new SparkContext(conf)
val sqlContext = new SQLContext(sparkContext)
val data = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("src/main/resources/adult.data")
val categoricals = data.dtypes filter (_._2 == "StringType")
val encoders = categoricals map (cat => new OneHotEncoder().setInputCol(cat._1).setOutputCol(cat._1 + "_encoded"))
val features = data.dtypes filterNot (_._1 == "label") map (tuple => if(tuple._2 == "StringType") tuple._1 + "_encoded" else tuple._1)
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.01)
val pipeline = new Pipeline()
.setStages(encoders ++ Array(lr))
val model = pipeline.fit(training)
}
}
However, this doesn't work. Calling pipeline.fit fails because the data still contains the original string features, and thus throws an exception.
How can I remove these "StringType" columns in a pipeline?
Or maybe I'm doing it completely wrong, so if someone has a different suggestion I'm happy to hear all input :).
The reason why I chose to follow this flow is that I have an extensive background in Python and Pandas, but am trying to learn both Scala and Spark.
There is one thing that can be rather confusing here if you're used to higher-level frameworks: you have to index the features before you can use the encoder. As explained in the API docs:
one-hot encoder (...) maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder}
val df = Seq((1L, "foo"), (2L, "bar")).toDF("id", "x")
val categoricals = df.dtypes.filter (_._2 == "StringType") map (_._1)
val indexers = categoricals.map (
c => new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx")
)
val encoders = categoricals.map (
c => new OneHotEncoder().setInputCol(s"${c}_idx").setOutputCol(s"${c}_enc")
)
val pipeline = new Pipeline().setStages(indexers ++ encoders)
val transformed = pipeline.fit(df).transform(df)
transformed.show
// +---+---+-----+-------------+
// | id| x|x_idx| x_enc|
// +---+---+-----+-------------+
// | 1|foo| 1.0| (1,[],[])|
// | 2|bar| 0.0|(1,[0],[1.0])|
// +---+---+-----+-------------+
As you can see, there is no need to drop the string columns from the pipeline. In practice, OneHotEncoder will accept a numeric column with a NominalAttribute, BinaryAttribute, or missing type attribute.
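To complete the original flow, a hedged sketch, assuming a Double label column named "label" (LogisticRegression's default) already exists: after indexing and encoding, a VectorAssembler can gather the encoded vectors together with any numeric columns, so the raw string columns never reach the estimator:
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
// Hypothetical continuation of the pipeline above
val assembler = new VectorAssembler()
  .setInputCols(categoricals.map(c => s"${c}_enc")) // plus any numeric columns
  .setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val fullPipeline = new Pipeline().setStages(indexers ++ encoders ++ Array(assembler, lr))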

How to create a DataFrame from a text file in Spark

I have a text file on HDFS and I want to convert it to a Data Frame in Spark.
I am using the Spark Context to load the file and then try to generate individual columns from that file.
val myFile = sc.textFile("file.txt")
val myFile1 = myFile.map(x=>x.split(";"))
After doing this, I am trying the following operation.
myFile1.toDF()
I am getting an issue since the elements in the myFile1 RDD are now of array type.
How can I solve this issue?
Update - as of Spark 2.0, you can simply use the built-in csv data source:
spark: SparkSession = // create the Spark Session
val df = spark.read.csv("file.txt")
You can also use various options to control the CSV parsing, e.g.:
val df = spark.read.option("header", "false").csv("file.txt")
For Spark versions < 2.0:
The easiest way is to use spark-csv - include it in your dependencies and follow the README, it allows setting a custom delimiter (;), can read CSV headers (if you have them), and it can infer the schema types (with the cost of an extra scan of the data).
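A hedged sketch of that spark-csv usage (the format name and options follow the spark-csv README):
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")     // file.txt has no header row
  .option("delimiter", ";")      // custom delimiter
  .option("inferSchema", "true") // extra pass over the data to infer types
  .load("file.txt")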
Alternatively, if you know the schema you can create a case-class that represents it and map your RDD elements into instances of this class before transforming into a DataFrame, e.g.:
case class Record(id: Int, name: String)
val myFile1 = myFile.map(x=>x.split(";")).map {
case Array(id, name) => Record(id.toInt, name)
}
myFile1.toDF() // DataFrame will have columns "id" and "name"
I have given different ways to create a DataFrame from a text file.
val conf = new SparkConf().setAppName(appName).setMaster("local")
val sc = new SparkContext(conf)
raw text file
val file = sc.textFile("C:\\vikas\\spark\\Interview\\text.txt")
val fileToDf = file.map(_.split(",")).map { case Array(a, b, c) => (a, b.toInt, c) }.toDF("name", "age", "city")
fileToDf.foreach(println(_))
spark session without schema
import org.apache.spark.sql.SparkSession
val sparkSess = SparkSession.builder().appName("SparkSessionZipsExample").config(conf).getOrCreate()
val df = sparkSess.read.option("header", "false").csv("C:\\vikas\\spark\\Interview\\text.txt")
df.show()
spark session with schema
import org.apache.spark.sql.types._
val schemaString = "name age city"
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
val dfWithSchema = sparkSess.read.option("header", "false").schema(schema).csv("C:\\vikas\\spark\\Interview\\text.txt")
dfWithSchema.show()
using sql context
import org.apache.spark.sql.SQLContext
val sqlCtx = new SQLContext(sc)
val fileRdd = sc.textFile("C:\\vikas\\spark\\Interview\\text.txt").map(_.split(",")).map(x => org.apache.spark.sql.Row(x: _*))
val sqlDf = sqlCtx.createDataFrame(fileRdd, schema)
sqlDf.show()
If you want to use the toDF method, you have to convert your RDD of Array[String] into an RDD of a case class. For example, you have to do:
case class Test(id:String,filed2:String)
val myFile = sc.textFile("file.txt")
val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()
You will not be able to convert it into a data frame until you use the implicit conversion.
val sqlContext = new SQLContext(new SparkContext())
import sqlContext.implicits._
After this only you can convert this to data frame
case class Test(id:String,filed2:String)
val myFile = sc.textFile("file.txt")
val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()
val df = spark.read.textFile("abc.txt")
case class Abc(amount: Int, types: String, id: Int) // columns and data types
val df2 = df.map { line =>
  val rec = line.split(",")
  Abc(rec(0).toInt, rec(1), rec(2).toInt)
}
df2.printSchema
root
|-- amount: integer (nullable = true)
|-- types: string (nullable = true)
|-- id: integer (nullable = true)
A text file with PIPE (|) delimited fields can be read as:
val df = spark.read.option("sep", "|").option("header", "true").csv("s3://bucket_name/folder_path/file_name.txt")
I know I am quite late to answer this, but I have come up with a different answer:
val rdd = sc.textFile("/home/training/mydata/file.txt")
val text = rdd.map(lines => lines.split(",")).map(arrays => (arrays(0), arrays(1))).toDF("id", "name")
text.show()
You can read a file to get an RDD and then assign a schema to it. Two common ways of creating a schema are using a case class or a Schema object [my preferred one]. The following are quick snippets of code that you may use.
Case Class approach
case class Test(id:String,name:String)
val myFile = sc.textFile("file.txt")
val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()
Schema Approach
import org.apache.spark.sql.types._
val schemaString = "id name"
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable=true))
val schema = StructType(fields)
val dfWithSchema = sparkSess.read.option("header","false").schema(schema).csv("file.txt")
dfWithSchema.show()
The second one is my preferred approach, since a case class is limited to a maximum of 22 fields (in Scala 2.10; the limit was lifted in Scala 2.11), and this will be a problem if your file has more than 22 fields!

How to create correct data frame for classification in Spark ML

I am trying to run random forest classification using the Spark ML API, but I am having issues with creating the right data frame input for the pipeline.
Here is sample data:
age,hours_per_week,education,sex,salaryRange
38,40,"hs-grad","male","A"
28,40,"bachelors","female","A"
52,45,"hs-grad","male","B"
31,50,"masters","female","B"
42,40,"bachelors","male","B"
age and hours_per_week are integers, while the other features, including the label salaryRange, are categorical (String).
Loading this csv file (let's call it sample.csv) can be done with the Spark csv library like this:
val data = sqlContext.csvFile("/home/dusan/sample.csv")
By default all columns are imported as strings, so we need to change "age" and "hours_per_week" to Int:
val toInt = udf[Int, String]( _.toInt)
val dataFixed = data.withColumn("age", toInt(data("age"))).withColumn("hours_per_week",toInt(data("hours_per_week")))
Just to check how the schema looks now:
scala> dataFixed.printSchema
root
|-- age: integer (nullable = true)
|-- hours_per_week: integer (nullable = true)
|-- education: string (nullable = true)
|-- sex: string (nullable = true)
|-- salaryRange: string (nullable = true)
Then let's set up the cross-validator and pipeline:
val rf = new RandomForestClassifier()
val pipeline = new Pipeline().setStages(Array(rf))
val cv = new CrossValidator().setNumFolds(10).setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator)
The error shows up when running this line:
val cmModel = cv.fit(dataFixed)
java.lang.IllegalArgumentException: Field "features" does not exist.
It is possible to set the label column and feature column in RandomForestClassifier, however I have 4 columns as predictors (features), not only one.
How should I organize my data frame so it has the label and features columns organized correctly?
For your convenience, here is the full code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.CrossValidator
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.mllib.linalg.{Vector, Vectors}
object SampleClassification {
def main(args: Array[String]): Unit = {
//set spark context
val conf = new SparkConf().setAppName("Simple Application").setMaster("local");
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import com.databricks.spark.csv._
//load data by using databricks "Spark CSV Library"
val data = sqlContext.csvFile("/home/dusan/sample.csv")
//by default all columns are imported as string so we need to change "age" and "hours_per_week" to Int
val toInt = udf[Int, String]( _.toInt)
val dataFixed = data.withColumn("age", toInt(data("age"))).withColumn("hours_per_week",toInt(data("hours_per_week")))
val rf = new RandomForestClassifier()
val pipeline = new Pipeline().setStages(Array(rf))
val cv = new CrossValidator().setNumFolds(10).setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator)
// this fails with error
//java.lang.IllegalArgumentException: Field "features" does not exist.
val cmModel = cv.fit(dataFixed)
}
}
Thanks for the help!
As of Spark 1.4, you can use the Transformer org.apache.spark.ml.feature.VectorAssembler.
Just provide the column names you want to be features.
val assembler = new VectorAssembler()
.setInputCols(Array("col1", "col2", "col3"))
.setOutputCol("features")
and add it to your pipeline.
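A hedged sketch of how this might look for the question's sample data (names are illustrative; a StringIndexer is assumed to produce the Double label column the classifier expects, and the categorical predictors would need indexing/encoding of their own):
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
// Index the string label into a Double column named "label"
val labelIndexer = new StringIndexer().setInputCol("salaryRange").setOutputCol("label")
// Assemble the numeric predictors into a single "features" vector
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "hours_per_week"))
  .setOutputCol("features")
val rf = new RandomForestClassifier().setLabelCol("label").setFeaturesCol("features")
val pipeline = new Pipeline().setStages(Array(labelIndexer, assembler, rf))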
You simply need to make sure that you have a "features" column in your dataframe that is of type VectorUDT, as shown below:
scala> val df2 = dataFixed.withColumnRenamed("age", "features")
df2: org.apache.spark.sql.DataFrame = [features: int, hours_per_week: int, education: string, sex: string, salaryRange: string]
scala> val cmModel = cv.fit(df2)
java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.mllib.linalg.VectorUDT#1eef but was actually IntegerType.
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:37)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:50)
at org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:118)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:164)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:164)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:164)
at org.apache.spark.ml.tuning.CrossValidator.transformSchema(CrossValidator.scala:142)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:59)
at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:107)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:67)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:72)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:74)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:76)
EDIT1
Essentially, there need to be two fields in your data frame: "features" for the feature vector and "label" for the instance labels. The label must be of type Double.
To create a "features" field of Vector type, first create a udf as shown below:
val toVec4 = udf[Vector, Int, Int, String, String] { (a,b,c,d) =>
val e3 = c match {
case "hs-grad" => 0
case "bachelors" => 1
case "masters" => 2
}
val e4 = d match {case "male" => 0 case "female" => 1}
Vectors.dense(a, b, e3, e4)
}
Now to also encode the "label" field, create another udf as shown below:
val encodeLabel = udf[Double, String]( _ match { case "A" => 0.0 case "B" => 1.0} )
Now we transform the original dataframe using these two udfs:
val df = dataFixed.withColumn(
"features",
toVec4(
dataFixed("age"),
dataFixed("hours_per_week"),
dataFixed("education"),
dataFixed("sex")
)
).withColumn("label", encodeLabel(dataFixed("salaryRange"))).select("features", "label")
Note that there can be extra columns / fields present in the dataframe, but in this case I have selected only features and label:
scala> df.show()
+-------------------+-----+
| features|label|
+-------------------+-----+
|[38.0,40.0,0.0,0.0]| 0.0|
|[28.0,40.0,1.0,1.0]| 0.0|
|[52.0,45.0,0.0,0.0]| 1.0|
|[31.0,50.0,2.0,1.0]| 1.0|
|[42.0,40.0,1.0,0.0]| 1.0|
+-------------------+-----+
Now it's up to you to set the correct parameters for your learning algorithm to make it work.
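For instance, a hedged sketch of wiring this back into the question's cross-validator (the grid values are illustrative; the classifier's defaults labelCol = "label" and featuresCol = "features" now match the columns created above):
import org.apache.spark.ml.tuning.ParamGridBuilder
val rf = new RandomForestClassifier()
val paramGrid = new ParamGridBuilder().addGrid(rf.maxDepth, Array(5, 10)).build()
val cv = new CrossValidator()
  .setNumFolds(10)
  .setEstimator(new Pipeline().setStages(Array(rf)))
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(new BinaryClassificationEvaluator)
val cmModel = cv.fit(df)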
According to the Spark documentation on MLlib random trees, it seems to me that you should define the feature map that you are using, and the points should be LabeledPoints.
This will tell the algorithm which column should be used as the prediction target and which ones are the features.
https://spark.apache.org/docs/latest/mllib-decision-tree.html
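A hedged sketch of that RDD-based approach, using the question's dataFixed (only the two numeric predictors, for brevity; categoricalFeaturesInfo maps a categorical feature's index to its number of categories, empty here):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
val points = dataFixed.rdd.map { row =>
  val label = if (row.getAs[String]("salaryRange") == "A") 0.0 else 1.0
  LabeledPoint(label, Vectors.dense(row.getAs[Int]("age").toDouble, row.getAs[Int]("hours_per_week").toDouble))
}
val model = DecisionTree.trainClassifier(points, numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](), impurity = "gini", maxDepth = 5, maxBins = 32)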