Apache Flink - Prediction Handling - scala

I am currently working with Apache Flink's SVM class to classify some text data.
The class provides a predict function which takes a DataSet[Vector] as input and returns a DataSet[Prediction]. So far so good.
My problem is that I lose the context of which prediction belongs to which text, and I can't pass the text through the predict() function to have it available afterwards.
Code:
val tweets: DataSet[(SparseVector, String)] =
source.flatMap(new SelectEnglishTweetWithCreatedAtFlatMapper)
.map(tweet => (featureVectorService.transform(tweet._2))
model.predict(tweets).print
result example:
(SparseVector((462,8.73165920153676), (10844,8.508515650222549), (15656,2.931052542245018)),-1.0)
Is there a way to keep other data next to the prediction so that everything stays together? Without context the prediction does not help me.
Or maybe there is a way to predict a single vector instead of a DataSet, so that I could call the function inside the map function above.

The SVM predictor expects a subtype of Vector as input. Hence there are two options to solve this problem:
Create a subtype of Vector which carries the tweet text as a tag. The tag is then passed through the predictor along with the vector. This approach has the advantage that no additional operation is needed. However, one needs to define new classes and utilities to represent the different vector types with tags:
val env = ExecutionEnvironment.getExecutionEnvironment
val input = env.fromElements("foobar", "barfo", "test")
val vectorizedInput = input.map(word => {
val value = word.chars().sum()
new DenseVectorWithTag(Array(value), word)
})
val svm = SVM().setBlocks(env.getParallelism)
val weights = env.fromElements(DenseVector(1.0))
svm.weightsOption = Option(weights) // skipping the training here
val predictionResult: DataSet[(DenseVectorWithTag, Double)] = svm.predict(vectorizedInput)
class DenseVectorWithTag(override val data: Array[Double], tag: String)
extends DenseVector(data) {
override def toString: String = "(" + super.toString + ", " + tag + ")"
}
Join the prediction DataSet with the input DataSet on the vectorized representation of the tweets. This approach has the advantage that we don't need to introduce new classes. The price we pay for this is an additional join operation which might be expensive:
val input = env.fromElements("foobar", "barfo", "test")
val vectorizedInput = input.map(word => {
val value = word.chars().sum()
(DenseVector(value), word)
})
val svm = SVM().setBlocks(env.getParallelism)
val weights = env.fromElements(DenseVector(1.0))
svm.weightsOption = Option(weights) // skipping the training here
val predictionResult = svm.predict(vectorizedInput.map(a => a._1))
val inputWithPrediction: DataSet[(String, Double)] = vectorizedInput
.join(predictionResult)
.where(0)
.equalTo(0)
.apply((t, p) => (t._2, p._2))
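One thing to keep in mind with the join approach is that the vector is the join key, so two texts that map to the same vector will match each other's predictions. The following is only a sketch of that behaviour on plain Scala collections (Seq stands in for DataSet, and the predict function is a made-up placeholder, not Flink's SVM):

```scala
// Plain-Scala illustration of joining on the vectorized representation.
// Seq stands in for DataSet; predict is a hypothetical placeholder model.
object JoinSketch {
  def predict(v: Vector[Double]): Double = if (v.sum >= 600.0) 1.0 else -1.0

  // Equi-join of (vector, text) pairs with (vector, prediction) pairs on the vector.
  def joinOnVector(
      inputs: Seq[(Vector[Double], String)],
      predictions: Seq[(Vector[Double], Double)]): Seq[(String, Double)] =
    for {
      (vec, text) <- inputs
      (pVec, label) <- predictions
      if vec == pVec
    } yield (text, label)

  // Same toy vectorization as in the answer: the sum of the character codes.
  def vectorize(word: String): Vector[Double] = Vector(word.map(_.toDouble).sum)

  def main(args: Array[String]): Unit = {
    val inputs = Seq("foobar", "barfo", "test").map(w => (vectorize(w), w))
    // De-duplicate the predictions so identical vectors do not multiply rows.
    val predictions = inputs.map(_._1).distinct.map(v => (v, predict(v)))
    joinOnVector(inputs, predictions).foreach(println)
  }
}
```

If the input can contain identical vectors, de-duplicating the prediction side before the join (as done above with distinct) keeps one output row per input text.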

Related

Parsing CSV file for decision tree classifier in spark

I have a csv file like this :
0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.
My goal is to use Decision trees in order to predict the last column (either normal or something else)
As you can see, not all the fields from my csv file are the same type, there are strings, int and double.
At first I wanted to create a RDD and use it like this :
def load_part1(file: String): RDD[(Int, String, String,String,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int, Int, Int, Double, Double, Double, Double, Double, Double, Double, Int, Int, Double, Double, Double, Double, Double, Double, Double, Double, String)] = {
val data = context.textFile(file)
val res = data.map(x => {
val s = x.split(",")
(s(0).toInt, s(1), s(2), s(3), s(4).toInt, s(5).toInt, s(6).toInt, s(7).toInt, s(8).toInt, s(9).toInt, s(10).toInt, s(11).toInt, s(12).toInt, s(13).toInt, s(14).toInt, s(15).toInt, s(16).toInt, s(17).toInt, s(18).toInt, s(19).toInt, s(20).toInt, s(21).toInt, s(22).toInt, s(23).toInt, s(24).toDouble, s(25).toDouble, s(26).toDouble, s(27).toDouble, s(28).toDouble, s(29).toDouble, s(30).toDouble, s(31).toInt, s(32).toInt, s(33).toDouble, s(34).toDouble, s(35).toDouble, s(36).toDouble, s(37).toDouble, s(38).toDouble, s(39).toDouble, s(40).toDouble, s(41))
})
.persist(StorageLevel.MEMORY_AND_DISK)
return res
}
But it won't accept it because a tuple cannot have more than 22 fields in Scala.
And now I am stuck because I don't know how to load and parse my CSV file to use it as training and test data for the decision tree.
When I look at the decision tree examples in the Spark docs, they use the libsvm format: is this the only format I can use? The thing is that:
not all my features have the same type: do I need to convert all the features into the same type?
my labels are not integers but strings, so do I need to convert my labels to integers in order to use the decision tree classifier?
I tried to look at some topics like this one or this one, but they are quite different: in the first link all of the features have the same format (double), and for the second I tried to load and parse my data like this:
val csv = context.textFile("/home/hvfd8529/Datasets/KDDCup99/kddcup.data_10_percent_corrected") // original file
val data = csv.map(line => line.split(",").map(elem => elem.trim))
But it took almost 2 minutes for my computer to do it, and it even made it crash.
I am thinking about writing a little Python script to change all the string values into integers, so that I could apply a CSV2LibSVM Python script and then use the decision tree classifier like the example in the Spark documentation. But is it really necessary? Can't I use my CSV file directly?
I am a newbie at Scala and Spark :)
Thank you
Here is how you can do it in Spark 2.1.
First define the schema for your CSV and pass it to the reader:
StructType schema = new StructType(new StructField[]{
new StructField("col1", DataTypes.StringType, true, Metadata.empty()),
new StructField("col2", DataTypes.DoubleType, true, Metadata.empty())});
// Apply the schema when reading; the later stages refer to this Dataset as `data`.
Dataset<Row> data = spark.read().format("csv").schema(schema).load("data.csv");
StringIndexerModel indexer = new StringIndexer()
.setInputCol("col1")
.setOutputCol("col1Indexed").setHandleInvalid("skip").fit(data);
VectorAssembler assembler = new VectorAssembler()
.setInputCols(new String[]{"col1Indexed","col2"})
.setOutputCol("features");
//Prepare data
Dataset<Row>[] splits = data.randomSplit(new double[]{0.7, 0.3});
Dataset<Row> trainingData = splits[0];
Dataset<Row> testData = splits[1];
DecisionTreeRegressor dt = new DecisionTreeRegressor().setFeaturesCol("features").setLabelCol("commission").setPredictionCol("prediction");
Pipeline pipeline = new Pipeline()
.setStages(new PipelineStage[]{indexer,assembler, dt});
// Train model. This also runs the indexer.
PipelineModel model = pipeline.fit(trainingData);
// Make predictions.
Dataset<Row> predictions = model.transform(testData);
Basically, you have to index your string features using StringIndexer and use VectorAssembler to merge the resulting columns into a single feature vector.
(The code is in Java, but I think it's pretty straightforward.)
You could use a List[Any]:
def load_part1(file: String): RDD[List[Any]] = {
val data = context.textFile(file)
val res = data.map(x => {
val s = x.split(",")
List(s(0).toInt, s(1), s(2), s(3), s(4).toInt, s(5).toInt, s(6).toInt, s(7).toInt, s(8).toInt, s(9).toInt, s(10).toInt, s(11).toInt, s(12).toInt, s(13).toInt, s(14).toInt, s(15).toInt, s(16).toInt, s(17).toInt, s(18).toInt, s(19).toInt, s(20).toInt, s(21).toInt, s(22).toInt, s(23).toInt, s(24).toDouble, s(25).toDouble, s(26).toDouble, s(27).toDouble, s(28).toDouble, s(29).toDouble, s(30).toDouble, s(31).toInt, s(32).toInt, s(33).toDouble, s(34).toDouble, s(35).toDouble, s(36).toDouble, s(37).toDouble, s(38).toDouble, s(39).toDouble, s(40).toDouble, s(41))
})
.persist(StorageLevel.MEMORY_AND_DISK)
return res
}
If you know up front that the text fields have low cardinality - if you see what I mean - you could encode them numerically using something like one-hot encoding, and cast your ints to doubles, so you will return RDD[List[Double]].
Here is some information on one-hot encoding and similar methods of representing categorical data for machine learning models: http://www.kdnuggets.com/2015/12/beyond-one-hot-exploration-categorical-variables.html
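As a rough illustration of the one-hot idea, here is a minimal sketch on plain Scala collections. The object and function names are made up for this example, and the sample values are only loosely modelled on the protocol column of the CSV above; on an RDD the index would be built from the distinct values of the column and the encoding applied inside a map.

```scala
// Minimal one-hot encoding sketch; names and sample values are illustrative only.
object OneHotSketch {
  // Build a value -> position index from the distinct values of one categorical column.
  def buildIndex(values: Seq[String]): Map[String, Int] =
    values.distinct.sorted.zipWithIndex.toMap

  // Encode one value as an array with a single 1.0 at the value's position.
  def oneHot(value: String, index: Map[String, Int]): Array[Double] = {
    val encoded = Array.fill(index.size)(0.0)
    index.get(value).foreach(i => encoded(i) = 1.0) // unseen values stay all-zero
    encoded
  }

  def main(args: Array[String]): Unit = {
    val protocols = Seq("tcp", "udp", "tcp", "icmp")
    val index = buildIndex(protocols)
    // Numeric fields are cast to Double and concatenated with the encoded column.
    val row = Array(0.0) ++ oneHot("tcp", index) ++ Array(181.0, 5450.0)
    println(row.mkString(","))
  }
}
```

Each categorical column adds as many positions as it has distinct values, which is why this only pays off for low-cardinality fields.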

How to deal with more than one categorical feature in a decision tree?

I read a piece of code for a binary decision tree in a book. The raw data has only one categorical feature, field(3), which is converted to one-of-k (one-hot) encoding.
def PrepareData(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint], RDD[LabeledPoint], Map[String, Int]) = {
val rawDataWithHeader = sc.textFile("data/train.tsv")
val rawData = rawDataWithHeader.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
val lines = rawData.map(_.split("\t"))
val categoriesMap = lines.map(fields => fields(3)).distinct.collect.zipWithIndex.toMap
val labelpointRDD = lines.map { fields =>
val trFields = fields.map(_.replaceAll("\"", ""))
val categoryFeaturesArray = Array.ofDim[Double](categoriesMap.size)
val categoryIdx = categoriesMap(fields(3))
categoryFeaturesArray(categoryIdx) = 1
val numericalFeatures = trFields.slice(4, fields.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
val label = trFields(fields.size - 1).toInt
LabeledPoint(label, Vectors.dense(categoryFeaturesArray ++ numericalFeatures))
}
val Array(trainData, validationData, testData) = labelpointRDD.randomSplit(Array(8, 1, 1))
return (trainData, validationData, testData, categoriesMap)
}
I wonder how to revise the code if there are several categorical features in the raw data, let's say field(3), field(5), field(7) are all categorical features.
I revised the first line:
def PrepareData(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint], RDD[LabeledPoint], Map[String, Int], Map[String, Int], Map[String, Int]) =......
Then I converted the other two fields into 1-of-k encoding in the same way:
val categoriesMap5 = lines.map(fields => fields(5)).distinct.collect.zipWithIndex.toMap
val categoriesMap7 = lines.map(fields => fields(7)).distinct.collect.zipWithIndex.toMap
val categoryFeaturesArray5 = Array.ofDim[Double](categoriesMap5.size)
val categoryFeaturesArray7 = Array.ofDim[Double](categoriesMap7.size)
val categoryIdx5 = categoriesMap5(fields(5))
val categoryIdx7 = categoriesMap7(fields(7))
categoryFeaturesArray5(categoryIdx5) = 1
categoryFeaturesArray7(categoryIdx7) = 1
Finally, I revised LabeledPoint and return like:
LabeledPoint(label, Vectors.dense(categoryFeaturesArray ++ categoryFeaturesArray5 ++ categoryFeaturesArray7 ++ numericalFeatures))
return (trainData, validationData, testData, categoriesMap, categoriesMap5, categoriesMap7)
Is it correct?
==================================================
The second problem I encountered is in the trainModel code from that book, which uses
DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
Here is the code:
def trainModel(trainData: RDD[LabeledPoint], impurity: String, maxDepth: Int, maxBins: Int): (DecisionTreeModel, Double) = {
val startTime = new DateTime()
val model = DecisionTree.trainClassifier(trainData, 2, Map[Int, Int](), impurity, maxDepth, maxBins)
val endTime = new DateTime()
val duration = new Duration(startTime, endTime)
(model, duration.getMillis())
}
The question is: how do I pass the categoricalFeaturesInfo into this method if it has three categorical features mentioned previously?
I just want to follow the steps in the book to build a prediction system of my own using a decision tree. To be more specific, the data set I chose has several categorical features like:
Gender: male, female
Education: HS-grad, Bachelors, Master, PH.D, ......
Country: US, Canada, England, Australia, ......
But I don't know how to merge them into one single categoryFeatures ++ numericalFeatures to put into Vectors.dense(), and one single categoricalFeaturesInfo to put into DecisionTree.trainRegressor().
It is not clear to me what exactly you're doing here, but it looks like it is wrong from the beginning.
Ignoring the fact that you're reinventing the wheel by implementing one-hot encoding from scratch, the whole point of encoding is to convert categorical variables to numerical ones. This is required for linear models, but arguably it doesn't make sense when working with decision trees.
Keeping that in mind you have two choices:
Index categorical fields without encoding and pass indexed features to categoricalFeaturesInfo.
One-hot-encode categorical features and treat these as numerical variables.
I believe that the former is the right approach. The latter should work in practice, but it artificially increases dimensionality without providing any benefit. It may also conflict with some heuristics used by the Spark implementation.
One way or another you should consider using ML Pipelines which provide all required indexing, encoding, and merging tools.
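To make the first option more concrete, here is a small sketch of indexing several categorical fields and deriving the categoricalFeaturesInfo map from them. It uses plain Scala collections and made-up sample rows (gender, education, age, label), so the column positions and values are assumptions for illustration only.

```scala
// Sketch: index categorical columns and derive categoricalFeaturesInfo.
// The rows and column layout are hypothetical: gender, education, age, label.
object CategoricalInfoSketch {
  val rows: Seq[Array[String]] = Seq(
    Array("male", "Bachelors", "39", "0"),
    Array("female", "HS-grad", "50", "1"),
    Array("female", "Bachelors", "28", "1"))

  // Positions of the categorical fields within the feature vector.
  val categoricalColumns: Seq[Int] = Seq(0, 1)

  // One value -> index map per categorical column.
  val indexers: Map[Int, Map[String, Int]] = categoricalColumns.map { c =>
    c -> rows.map(_(c)).distinct.sorted.zipWithIndex.toMap
  }.toMap

  // Feature position -> number of categories, i.e. the Map[Int, Int] shape
  // that the categoricalFeaturesInfo argument of MLlib's DecisionTree takes.
  val categoricalFeaturesInfo: Map[Int, Int] =
    indexers.map { case (c, idx) => c -> idx.size }

  // Replace each categorical value by its index; parse the other fields as doubles.
  // The last field is the label and is dropped here.
  def features(row: Array[String]): Array[Double] =
    row.init.zipWithIndex.map { case (v, i) =>
      indexers.get(i).map(_(v).toDouble).getOrElse(v.toDouble)
    }

  def main(args: Array[String]): Unit = {
    rows.foreach(r => println(features(r).mkString(",")))
    println(categoricalFeaturesInfo)
  }
}
```

With Spark, the resulting arrays would go into Vectors.dense(...) inside a LabeledPoint, and the map would replace the empty Map[Int, Int]() in the trainClassifier call. Note that maxBins has to be at least as large as the biggest category count.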

How to convert a RDD<String> to a RDD<Vector> in Spark?

I have a file where each line is in this way
info1,info2
info3,info4
...
After scanning it, I want to run the k-means algorithm:
val rawData = sc.textFile(myFile)
val converted = convertToVector(rawData)
val kmeans = new KMeans()
kmeans.setK(10)
kmeans.setRuns(10)
kmeans.setEpsilon(1.0e-6)
val model = kmeans.run(rawData) // problem: k-means accepts only RDD<Vector>
Because k-means only accepts an RDD<Vector>, I started writing a function that converts my RDD<String> rawData to an RDD<Vector>, but I am stuck on how to do this. The function below is a work in progress:
def convertToVector(rawData: RDD[String]): RDD[Vector] = {
//TODO...
val toConvert = rawData.collect().toVector
val map = rawData.map {
line => line.split(",").toVector
}
map
}
Any suggestions on how to achieve this?
Thanks in advance.
This is a very basic operation, considering that each line of your input file is a hypothetical vector represented by a comma-separated string.
You just need to map each string entry, split it on the separator, and then create a dense Vector from it:
val parsedData = rawData.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
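Note that toDouble throws a NumberFormatException on any non-numeric field, so placeholder values like info1 from the question would fail. A more defensive variant of the per-line parsing could look like the sketch below; it runs on a plain Seq so it does not need a cluster, and parseLine is a name made up for this example.

```scala
import scala.util.Try

// Defensive per-line parsing sketch; parseLine is a hypothetical helper.
object ParseSketch {
  // Parse one comma-separated line into doubles; None if any field is not numeric.
  def parseLine(line: String): Option[Array[Double]] =
    Try(line.split(',').map(_.trim.toDouble)).toOption

  def main(args: Array[String]): Unit = {
    val lines = Seq("1.0,2.5", "3,4", "info1,info2")
    // Keep only the lines that parsed cleanly.
    val parsed = lines.flatMap(l => parseLine(l).toList)
    parsed.foreach(a => println(a.mkString(" ")))
  }
}
```

On an RDD the same function could be used as rawData.flatMap(l => parseLine(l).toList).map(a => Vectors.dense(a)), which silently drops malformed lines; whether dropping them is acceptable depends on your data.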

MLlib: Saving and loading a model

I'm using LinearRegressionWithSGD and then I save the model weights and intercept.
File that contains weights has this format:
1.20455
0.1356
0.000456
The intercept is 0 because I am calling train without setting the intercept, so it can be ignored for the moment. I would now like to initialize a new model object using the saved weights from the above file. We are using CDH 5.1.
Something along these lines:
// Here is the code to load the data and train the model on it.
val weights = sc.textFile("linear-weights");
val model = new LinearRegressionWithSGD(weights);
then use it as:
// Here is where I want to use the trained model to predict on new data.
val valuesAndPreds = testData.map { point =>
// Predicting on new data.
val prediction = model.predict(point.features)
(point.label, prediction)
}
Any pointers on how to do that?
It appears you are duplicating the training portion of LinearRegressionWithSGD, which in the standard example takes a LibSVM file as input.
Are you certain that you want to provide your own weights instead of letting the library do its job in the training phase?
If so, you can create your own subclass of LinearRegressionWithSGD and override createModel.
Here are the steps, given that you have already calculated your desired weights or performed the training your own way:
// Stick in your weights below ..
var model = algorithm.createModel(weights, 0.0)
// Now you can run the last steps of the 'normal' process
val prediction = model.predict(test.map(_.features))
val predictionAndLabel = prediction.zip(test.map(_.label))
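For reference, what a linear regression model computes per example is just the dot product of the weights and the features plus the intercept, so the saved file is enough to reproduce predictions. The sketch below shows this in plain Scala without Spark; the function names and the sample feature values are made up for illustration.

```scala
// Sketch: rebuild predictions from saved weights without retraining.
// parseWeights and predict are hypothetical helpers, not MLlib API.
object LinearModelSketch {
  // One weight per line, as in the saved weights file from the question.
  def parseWeights(lines: Seq[String]): Array[Double] =
    lines.map(_.trim).filter(_.nonEmpty).map(_.toDouble).toArray

  // Linear model prediction: w . x + intercept.
  def predict(weights: Array[Double], intercept: Double, features: Array[Double]): Double = {
    require(weights.length == features.length, "dimension mismatch")
    weights.zip(features).map { case (w, x) => w * x }.sum + intercept
  }

  def main(args: Array[String]): Unit = {
    val weights = parseWeights(Seq("1.20455", "0.1356", "0.000456"))
    println(predict(weights, 0.0, Array(1.0, 2.0, 3.0)))
  }
}
```

With Spark, the lines would come from something like sc.textFile("linear-weights").collect(), and the weights must be in the same order as the features. The resulting array is what you would hand to createModel, in whatever weight type your MLlib version expects.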
BTW for reference here is the more 'standard' approach that includes the training steps:
val data = MLUtils.loadLibSVMFile(sc, inputFile).cache()
val splits = data.randomSplit(Array(0.8, 0.2))
val training = splits(0).cache()
val test = splits(1).cache()
val updater = params.regType match {
case NONE => new SimpleUpdater()
case L1 => new L1Updater()
case L2 => new SquaredL2Updater()
}
val algorithm = new LinearRegressionWithSGD()
algorithm.optimizer
.setNumIterations(params.numIterations)
.setStepSize(params.stepSize)
.setUpdater(updater)
.setRegParam(params.regParam)
val model = algorithm.run(training)
val prediction = model.predict(test.map(_.features))
val predictionAndLabel = prediction.zip(test.map(_.label))

scala.MatchError: null on spark RDDs

I am relatively new to both spark and scala.
I was trying to implement collaborative filtering using scala on spark.
Below is the code
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
val data = sc.textFile("/user/amohammed/CB/input-cb.txt")
val distinctUsers = data.map(x => x.split(",")(0)).distinct().map(x => x.toInt)
val distinctKeywords = data.map(x => x.split(",")(1)).distinct().map(x => x.toInt)
val ratings = data.map(_.split(',') match {
case Array(user, item, rate) => Rating(user.toInt,item.toInt, rate.toDouble)
})
val model = ALS.train(ratings, 1, 20, 0.01)
val keywords = distinctKeywords collect
distinctUsers.map(x => {(x, keywords.map(y => model.predict(x,y)))}).collect()
It throws a scala.MatchError: null at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:571) on the last line.
The code works fine if I collect the distinctUsers RDD into an array and execute the same code:
val users = distinctUsers collect
users.map(x => {(x, keywords.map(y => model.predict(x, y)))})
Where am I getting it wrong when dealing with RDDs?
Spark Version : 1.0.0
Scala Version : 2.10.4
Going one call further back in the stack trace, line 43 of the MatrixFactorizationModel source says:
val userVector = new DoubleMatrix(userFeatures.lookup(user).head)
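Note that the userFeatures field of model is itself another RDD; I believe it isn't getting serialized properly when the anonymous function block closes over model, and thus the lookup method on it is failing. More generally, Spark does not support invoking operations on one RDD from inside a transformation of another, which is what calling model.predict(x, y) inside distinctUsers.map amounts to. I also tried placing both model and keywords into broadcast variables, but that didn't work either.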
Instead of falling back to Scala collections and losing the benefits of Spark, it's probably better to stick with RDDs and take advantage of other ways of transforming them.
I'd start with this:
val ratings = data.map(_.split(',') match {
case Array(user, keyword, rate) => Rating(user.toInt, keyword.toInt, rate.toDouble)
})
// instead of parsing the original RDD's strings three separate times,
// you can map the "user" and "product" fields of the Rating case class
val distinctUsers = ratings.map(_.user).distinct()
val distinctKeywords = ratings.map(_.product).distinct()
val model = ALS.train(ratings, 1, 20, 0.01)
Then, instead of calculating each prediction one by one, we can obtain the Cartesian product of all possible user-keyword pairs as an RDD and use the other predict method in MatrixFactorizationModel, which takes an RDD of such pairs as its argument.
val userKeywords = distinctUsers.cartesian(distinctKeywords)
val predictions = model.predict(userKeywords).map { case Rating(user, keyword, rate) =>
(user, Map(keyword -> rate))
}.reduceByKey { _ ++ _ }
Now predictions has an immutable map for each user that can be queried for the predicted rating of a particular keyword. If you specifically want arrays as in your original example, you can do:
val keywords = distinctKeywords.collect() // add .sorted if you want them in order
val predictionArrays = predictions.mapValues(keywords.map(_))
Caveat: I tested this with Spark 1.0.1 as it's what I had installed, but it should work with 1.0.0 as well.
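If the cartesian-product-then-merge step is hard to picture, the same logic can be mimicked on plain Scala collections. This is only a sketch: score is a made-up stand-in for the model's prediction, and groupBy plus reduce plays the role of reduceByKey.

```scala
// Plain-Scala sketch of the cartesian product and per-user merge.
// score is a hypothetical stand-in for the model's predicted rating.
object MergeSketch {
  def score(user: Int, keyword: Int): Double = user * 0.1 + keyword

  def predictionsByUser(users: Seq[Int], keywords: Seq[Int]): Map[Int, Map[Int, Double]] = {
    // All (user, keyword) pairs, like distinctUsers.cartesian(distinctKeywords).
    val pairs = for (u <- users; k <- keywords) yield (u, k)
    // One single-entry map per pair, then merge the maps of each user.
    pairs
      .map { case (u, k) => (u, Map(k -> score(u, k))) }
      .groupBy(_._1)
      .map { case (u, entries) => u -> entries.map(_._2).reduce(_ ++ _) }
  }

  def main(args: Array[String]): Unit = {
    val keywords = Seq(10, 20)
    val byUser = predictionsByUser(Seq(1, 2), keywords)
    // A Map is also a function from key to value, which is why keywords.map(m) works.
    byUser.foreach { case (u, m) => println(s"$u -> ${keywords.map(m).mkString(",")}") }
  }
}
```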