I am currently working with Apache Flink's SVM class to predict some text data.
The class provides a predict function which takes a DataSet[Vector] as input and gives me a DataSet[Prediction] as a result. So far so good.
My problem is that I don't have the context of which prediction belongs to which text, and I can't pass the text through the predict() function to have it available afterwards.
Code:
val tweets: DataSet[SparseVector] = source
  .flatMap(new SelectEnglishTweetWithCreatedAtFlatMapper)
  .map(tweet => featureVectorService.transform(tweet._2))

model.predict(tweets).print()
result example:
(SparseVector((462,8.73165920153676), (10844,8.508515650222549), (15656,2.931052542245018)),-1.0)
Is there a way to keep other data next to the prediction so that I have everything together? Without the context, the prediction doesn't help me.
Or maybe there is a way to predict just one vector instead of a DataSet, so that I could call the function inside the map above.
The SVM predictor expects a subtype of Vector as input. Hence there are two options to solve this problem:
Create a subtype of Vector which carries the tweet text as a tag. It will then be looped through the predictor unchanged. This approach has the advantage that no additional operation is needed. However, one needs to define new classes and utilities to represent different vector types with tags:
val env = ExecutionEnvironment.getExecutionEnvironment

val input = env.fromElements("foobar", "barfo", "test")

val vectorizedInput = input.map { word =>
  // Toy featurization: sum of the character codes as a single Double feature
  val value = word.map(_.toDouble).sum
  new DenseVectorWithTag(Array(value), word)
}

val svm = SVM().setBlocks(env.getParallelism)
val weights = env.fromElements(DenseVector(1.0))
svm.weightsOption = Option(weights) // skipping the training here

val predictionResult: DataSet[(DenseVectorWithTag, Double)] = svm.predict(vectorizedInput)

class DenseVectorWithTag(override val data: Array[Double], val tag: String)
  extends DenseVector(data) {
  override def toString: String = "(" + super.toString + ", " + tag + ")"
}
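Since the tag is exposed as a val here, the text can be recovered from the prediction pairs afterwards; a small follow-up sketch:

val textWithPrediction: DataSet[(String, Double)] =
  predictionResult.map(pair => (pair._1.tag, pair._2))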
Join the prediction DataSet with the input DataSet on the vectorized representation of the tweets. This approach has the advantage that we don't need to introduce new classes. The price we pay for this is an additional join operation which might be expensive:
val input = env.fromElements("foobar", "barfo", "test")

val vectorizedInput = input.map { word =>
  val value = word.map(_.toDouble).sum
  (DenseVector(value), word)
}

val svm = SVM().setBlocks(env.getParallelism)
val weights = env.fromElements(DenseVector(1.0))
svm.weightsOption = Option(weights) // skipping the training here

val predictionResult = svm.predict(vectorizedInput.map(a => a._1))

val inputWithPrediction: DataSet[(String, Double)] = vectorizedInput
  .join(predictionResult)
  .where(0)
  .equalTo(0)
  .apply((t, p) => (t._2, p._2))
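Applied to the original tweet pipeline, the second approach could look roughly like this (a sketch; featureVectorService, SelectEnglishTweetWithCreatedAtFlatMapper, and the trained model are the questioner's own objects):

// Keep (vector, text) pairs, predict on the vectors only,
// and join the predictions back on the vector key
val tweets: DataSet[(SparseVector, String)] = source
  .flatMap(new SelectEnglishTweetWithCreatedAtFlatMapper)
  .map(tweet => (featureVectorService.transform(tweet._2), tweet._2))

val predictions = model.predict(tweets.map(_._1))

val textWithPrediction: DataSet[(String, Double)] = tweets
  .join(predictions)
  .where(0)
  .equalTo(0)
  .apply((t, p) => (t._2, p._2))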
I read a piece of code about a binary decision tree in a book. It has only one categorical feature in the raw data, field(3), which is converted to 1-of-k (one-hot) encoding.
def PrepareData(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint], RDD[LabeledPoint], Map[String, Int]) = {
  val rawDataWithHeader = sc.textFile("data/train.tsv")
  val rawData = rawDataWithHeader.mapPartitionsWithIndex { (idx, iter) =>
    if (idx == 0) iter.drop(1) else iter
  }
  val lines = rawData.map(_.split("\t"))
  val categoriesMap = lines.map(fields => fields(3)).distinct.collect.zipWithIndex.toMap
  val labelpointRDD = lines.map { fields =>
    val trFields = fields.map(_.replaceAll("\"", ""))
    val categoryFeaturesArray = Array.ofDim[Double](categoriesMap.size)
    val categoryIdx = categoriesMap(fields(3))
    categoryFeaturesArray(categoryIdx) = 1
    val numericalFeatures = trFields.slice(4, fields.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
    val label = trFields(fields.size - 1).toInt
    LabeledPoint(label, Vectors.dense(categoryFeaturesArray ++ numericalFeatures))
  }
  val Array(trainData, validationData, testData) = labelpointRDD.randomSplit(Array(0.8, 0.1, 0.1))
  return (trainData, validationData, testData, categoriesMap)
}
I wonder how to revise the code if there are several categorical features in the raw data; say field(3), field(5), and field(7) are all categorical features.
I revised the first line:
def PrepareData(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint], RDD[LabeledPoint], Map[String, Int], Map[String, Int], Map[String, Int]) = ...
Then, I converted the other two fields into 1-of-k encoding the same way it was done before:
val categoriesMap5 = lines.map(fields => fields(5)).distinct.collect.zipWithIndex.toMap
val categoriesMap7 = lines.map(fields => fields(7)).distinct.collect.zipWithIndex.toMap
val categoryFeaturesArray5 = Array.ofDim[Double](categoriesMap5.size)
val categoryFeaturesArray7 = Array.ofDim[Double](categoriesMap7.size)
val categoryIdx5 = categoriesMap5(fields(5))
val categoryIdx7 = categoriesMap7(fields(7))
categoryFeaturesArray5(categoryIdx5) = 1
categoryFeaturesArray7(categoryIdx7) = 1
Finally, I revised the LabeledPoint and the return statement like:
LabeledPoint(label, Vectors.dense(categoryFeaturesArray ++ categoryFeaturesArray5 ++ categoryFeaturesArray7 ++ numericalFeatures))
return (trainData, validationData, testData, categoriesMap, categoriesMap5, categoriesMap7)
Is it correct?
==================================================
The second problem I encountered is with the following code from that book; in trainModel, it uses
DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
Here is the code:
def trainModel(trainData: RDD[LabeledPoint], impurity: String, maxDepth: Int, maxBins: Int): (DecisionTreeModel, Double) = {
  val startTime = new DateTime()
  val model = DecisionTree.trainClassifier(trainData, 2, Map[Int, Int](), impurity, maxDepth, maxBins)
  val endTime = new DateTime()
  val duration = new Duration(startTime, endTime)
  (model, duration.getMillis())
}
The question is: how do I pass categoricalFeaturesInfo into this method if it has the three categorical features mentioned previously?
I just want to follow the steps in the book to build a prediction system of my own using a decision tree. To be more specific, the data set I chose has several categorical features, like:
Gender: male, female
Education: HS-grad, Bachelors, Master, PH.D, ......
Country: US, Canada, England, Australia, ......
But I don't know how to merge them into a single categoryFeatures ++ numericalFeatures to put into Vectors.dense(), and a single categoricalFeaturesInfo to pass into DecisionTree.trainRegressor().
It is not clear to me what exactly you're doing here, but it looks like it is wrong from the beginning.
Ignoring the fact that you're reinventing the wheel by implementing one-hot encoding from scratch, the whole point of encoding is to convert categorical variables to numerical ones. It is required for linear models, but it arguably doesn't make sense when working with decision trees.
Keeping that in mind, you have two choices:
Index categorical fields without encoding and pass indexed features to categoricalFeaturesInfo.
One-hot-encode categorical features and treat these as numerical variables.
I believe the former approach is the right one. The latter should work in practice, but it just artificially increases dimensionality without providing any benefit. It may also conflict with some heuristics used by the Spark implementation. A sketch of the former approach follows.
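A minimal sketch of the indexing approach, assuming lines is the RDD of split fields from the question, fields 3, 5, and 7 are categorical, and the numeric slice and label extraction are illustrative (it ignores the quote stripping done in the original):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

// One index map per categorical field
val categoricalFields = Seq(3, 5, 7)
val indexMaps: Map[Int, Map[String, Int]] = categoricalFields.map { i =>
  i -> lines.map(fields => fields(i)).distinct.collect.zipWithIndex.toMap
}.toMap

val labeledPoints = lines.map { fields =>
  // Each categorical feature becomes a single indexed value, not a one-hot array
  val categorical = categoricalFields.map(i => indexMaps(i)(fields(i)).toDouble).toArray
  val numerical = fields.slice(8, fields.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
  LabeledPoint(fields(fields.size - 1).toInt, Vectors.dense(categorical ++ numerical))
}

// Keys are positions inside the assembled feature vector (the categorical
// features come first, so 0, 1, 2); values are the number of distinct
// categories. Note that maxBins must be at least the largest category count.
val categoricalFeaturesInfo: Map[Int, Int] =
  categoricalFields.zipWithIndex.map { case (field, pos) =>
    pos -> indexMaps(field).size
  }.toMap

val model = DecisionTree.trainClassifier(
  labeledPoints, 2, categoricalFeaturesInfo, "entropy", 5, 32)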
One way or another, you should consider using ML Pipelines, which provide all the required indexing, encoding, and merging tools.
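For illustration, a pipeline over a hypothetical DataFrame df with the gender/education/country columns from the question could look roughly like this (column names "age" and "label" are assumptions):

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Index each categorical column; the indexer attaches metadata that lets
// the tree treat the column as categorical without one-hot encoding
val indexers = Seq("gender", "education", "country").map { col =>
  new StringIndexer().setInputCol(col).setOutputCol(col + "_idx")
}

// Merge indexed and numeric columns into a single feature vector
val assembler = new VectorAssembler()
  .setInputCols(Array("gender_idx", "education_idx", "country_idx", "age"))
  .setOutputCol("features")

val tree = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

val stages: Array[PipelineStage] = (indexers :+ assembler :+ tree).toArray
val model = new Pipeline().setStages(stages).fit(df)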
I have a file where each line is of this form:
info1,info2
info3,info4
...
After reading it in, I want to run the k-means algorithm:
val rawData = sc.textFile(myFile)
val converted = convertToVector(rawData)

val kmeans = new KMeans()
kmeans.setK(10)
kmeans.setRuns(10)
kmeans.setEpsilon(1.0e-6)

val model = kmeans.run(rawData) // problem: k-means accepts only RDD[Vector]
Because k-means only accepts RDD[Vector], I created a function that converts my RDD[String] rawData to an RDD[Vector]. But I'm getting stuck on how to do this; the function below is a work in progress:
def convertToVector(rawData: RDD[String]): RDD[Vector] = {
  // TODO...
  val toConvert = rawData.collect().toVector
  val map = rawData.map { line =>
    line.split(",").toVector
  }
  map
}
Any suggestions on how to achieve this?
Thanks in advance.
This is a very basic operation, considering that each line of your input file is a hypothetical vector represented by a comma-separated string.
You just need to map over each string entry, split it on the separator, and then create a dense Vector from it:
val parsedData = rawData.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
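Plugged back into the snippet from the question, the model is then trained on the parsed vectors rather than on rawData (a sketch; setRuns exists on the older MLlib versions the question uses):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val rawData = sc.textFile(myFile)
val parsedData = rawData
  .map(s => Vectors.dense(s.split(',').map(_.toDouble)))
  .cache() // k-means is iterative, so caching avoids re-parsing

val kmeans = new KMeans()
kmeans.setK(10)
kmeans.setRuns(10)
kmeans.setEpsilon(1.0e-6)

val model = kmeans.run(parsedData) // RDD[Vector], as k-means expects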
I am trying to perform a Scala operation on Shark. I am creating an RDD as follows:
val tmp: shark.api.TableRDD = sc.sql2rdd("select duration from test")
I need to convert it to RDD[Array[Double]]. I tried toArray, but it doesn't seem to work.
I also tried converting it to Array[String] and then converting it using map as follows:
val tmp_2 = tmp.map(row => row.getString(0))
val tmp_3 = tmp_2.map { row =>
  val features = Array[Double](row(0))
}
But this gives me Spark's RDD[Unit], which cannot be used in the function. Is there any other way to proceed with this type conversion?
Edit: I also tried using toDouble, but this gives me an RDD[Double], not RDD[Array[Double]]:
val tmp_5 = tmp_2.map(_.toDouble)
Edit 2:
I managed to do this as follows:
A sample of the data:
296.98567000000003
230.84362999999999
212.89751000000001
914.02404000000001
305.55383
A Spark Table RDD was created first.
val tmp = sc.sql2rdd("select duration from test")
I made use of getString to translate it to an RDD[String] and then converted it to an RDD[Array[Double]]:
val duration = tmp.map(row => Array[Double](row.getString(0).toDouble))
I'm new to Scala, so I was wondering if anyone can help me with a function that would take an arbitrary list of labels and a delimited text string and return something like a Map or dictionary.
val labels = Seq("color", "cost", "name")
val data = ("blue|$9.99|smurf")
private def getData(data:String, labels:Seq[String]) {
val values = labels.split("|")
//now how to map this split values with the the labels to create a nice map or dictionary
}
val labels = Seq("color", "cost", "name")
val values = "blue|$9.99|smurf".split("\\|")
// Array(blue, $9.99, smurf)
val map = labels.zip(values).toMap
// Map(color -> blue, cost -> $9.99, name -> smurf)
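Wrapped into a function matching the signature from the question (a small sketch):

private def getData(data: String, labels: Seq[String]): Map[String, String] =
  labels.zip(data.split("\\|")).toMap

getData("blue|$9.99|smurf", Seq("color", "cost", "name"))
// Map(color -> blue, cost -> $9.99, name -> smurf)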