Parsing a CSV file for a decision tree classifier in Spark/Scala

I have a CSV file like this:
0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.
My goal is to use decision trees to predict the last column (either normal or something else).
As you can see, not all the fields in my CSV file have the same type: there are strings, ints, and doubles.
At first I wanted to create an RDD and use it like this:
def load_part1(file: String): RDD[(Int, String, String, String, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Double, Double, Double, Double, Double, Double, Double, Int, Int, Double, Double, Double, Double, Double, Double, Double, Double, String)] = {
  val data = context.textFile(file)
  val res = data.map(x => {
    val s = x.split(",")
    (s(0).toInt, s(1), s(2), s(3), s(4).toInt, s(5).toInt, s(6).toInt, s(7).toInt, s(8).toInt, s(9).toInt, s(10).toInt, s(11).toInt, s(12).toInt, s(13).toInt, s(14).toInt, s(15).toInt, s(16).toInt, s(17).toInt, s(18).toInt, s(19).toInt, s(20).toInt, s(21).toInt, s(22).toInt, s(23).toInt, s(24).toDouble, s(25).toDouble, s(26).toDouble, s(27).toDouble, s(28).toDouble, s(29).toDouble, s(30).toDouble, s(31).toInt, s(32).toInt, s(33).toDouble, s(34).toDouble, s(35).toDouble, s(36).toDouble, s(37).toDouble, s(38).toDouble, s(39).toDouble, s(40).toDouble, s(41))
  }).persist(StorageLevel.MEMORY_AND_DISK)
  return res
}
But it won't compile, because a tuple cannot have more than 22 fields in Scala.
And now I am stuck, because I don't know how to load and parse my CSV file to use it as training and test data for the decision tree.
When I look at the decision tree examples in the Spark docs, they use the LIBSVM format. Is this the only format I can use? The thing is:
Not all my features have the same type: do I need to convert all the features to the same type?
My labels are not integers but strings, so do I need to convert my labels to integers in order to use the decision tree classifier?
I tried to look at some topics like this one or this one, but they are quite different: in the first link all of the features have the same type (double), and following the second one I tried to load and parse my data like this:
val csv = context.textFile("/home/hvfd8529/Datasets/KDDCup99/kddcup.data_10_percent_corrected") // original file
val data = csv.map(line => line.split(",").map(elem => elem.trim))
But it took my computer almost 2 minutes, and it even crashed?!
I am thinking about writing a little Python script to convert all the strings into integers, so that I could apply a CSV-to-LIBSVM conversion and then use the decision tree classifier like the example in the Spark documentation. But is that really necessary? Can't I use my CSV file directly?
I am a newbie at Scala and Spark :)
Thank you

Here is how you can do it in Spark 2.1.
First, define the schema for your CSV:
StructType schema = new StructType(new StructField[]{
    new StructField("col1", DataTypes.StringType, true, Metadata.empty()),
    new StructField("col2", DataTypes.DoubleType, true, Metadata.empty())});
Dataset<Row> data = spark.read().format("csv").schema(schema).load("data.csv");
StringIndexerModel indexer = new StringIndexer()
    .setInputCol("col1")
    .setOutputCol("col1Indexed")
    .setHandleInvalid("skip")
    .fit(data);
VectorAssembler assembler = new VectorAssembler()
    .setInputCols(new String[]{"col1Indexed", "col2"})
    .setOutputCol("features");
// Prepare data
Dataset<Row>[] splits = data.randomSplit(new double[]{0.7, 0.3});
Dataset<Row> trainingData = splits[0];
Dataset<Row> testData = splits[1];
DecisionTreeRegressor dt = new DecisionTreeRegressor()
    .setFeaturesCol("features")
    .setLabelCol("commission")
    .setPredictionCol("prediction");
Pipeline pipeline = new Pipeline()
    .setStages(new PipelineStage[]{indexer, assembler, dt});
// Train model. This also runs the indexer.
PipelineModel model = pipeline.fit(trainingData);
// Make predictions.
Dataset<Row> predictions = model.transform(testData);
Basically, you have to index your string features using StringIndexer and use VectorAssembler to merge the new columns.
(The code is in Java, but I think it's pretty straightforward.)
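The labels in the original question are strings too ("normal." and so on), and the same indexing trick applies to the label column. A minimal Scala sketch; the column name label is an assumption about how the CSV was loaded:
import org.apache.spark.ml.feature.StringIndexer

// Map each distinct label string ("normal.", etc.) to a numeric index,
// which the tree classifier requires as its label column.
val labelIndexer = new StringIndexer()
  .setInputCol("label")         // assumed name of the raw string label column
  .setOutputCol("labelIndexed")
The indexer can simply be added as another pipeline stage, and IndexToString can map predictions back to the original label strings afterwards.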

You could use a List[Any]:
def load_part1(file: String): RDD[List[Any]] = {
  val data = context.textFile(file)
  val res = data.map(x => {
    val s = x.split(",")
    List(s(0).toInt, s(1), s(2), s(3), s(4).toInt, s(5).toInt, s(6).toInt, s(7).toInt, s(8).toInt, s(9).toInt, s(10).toInt, s(11).toInt, s(12).toInt, s(13).toInt, s(14).toInt, s(15).toInt, s(16).toInt, s(17).toInt, s(18).toInt, s(19).toInt, s(20).toInt, s(21).toInt, s(22).toInt, s(23).toInt, s(24).toDouble, s(25).toDouble, s(26).toDouble, s(27).toDouble, s(28).toDouble, s(29).toDouble, s(30).toDouble, s(31).toInt, s(32).toInt, s(33).toDouble, s(34).toDouble, s(35).toDouble, s(36).toDouble, s(37).toDouble, s(38).toDouble, s(39).toDouble, s(40).toDouble, s(41))
  }).persist(StorageLevel.MEMORY_AND_DISK)
  return res
}
If you know up front that the text fields have low cardinality (only a small number of distinct values), you could encode them numerically using something like one-hot encoding, and cast your ints to doubles, so you would return an RDD[List[Double]] (a sketch follows below).
Here is some information on one-hot encoding and similar methods of representing categorical data for machine learning models: http://www.kdnuggets.com/2015/12/beyond-one-hot-exploration-categorical-variables.html
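As a rough sketch of that idea for the KDD data above, here is the simpler index-encoding variant (full one-hot would further expand each index into a 0/1 vector). The column positions are taken from the sample row in the question: columns 1, 2, 3 and the label in column 41 are the text fields.
import org.apache.spark.rdd.RDD

val raw = context.textFile(file).map(line => line.split(","))

// Build one lookup table per text column (cheap when cardinality is low).
val protocolIndex = raw.map(s => s(1)).distinct.collect.zipWithIndex.toMap
val serviceIndex = raw.map(s => s(2)).distinct.collect.zipWithIndex.toMap
val flagIndex = raw.map(s => s(3)).distinct.collect.zipWithIndex.toMap
val labelIndex = raw.map(s => s(41)).distinct.collect.zipWithIndex.toMap

// Everything becomes a Double, including the encoded label at the end.
val numeric: RDD[List[Double]] = raw.map { s =>
  List(s(0).toDouble,
    protocolIndex(s(1)).toDouble,
    serviceIndex(s(2)).toDouble,
    flagIndex(s(3)).toDouble) ++
    s.slice(4, 41).map(_.toDouble) ++
    List(labelIndex(s(41)).toDouble)
}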

Related

Apache Flink - Prediction Handling

I am currently working with Apache Flink's SVM class to predict some text data.
The class provides a predict function which takes a DataSet[Vector] as input and gives me a DataSet[Prediction] as a result. So far so good.
My problem is that I don't know which prediction belongs to which text, and I can't pass the text through the predict() function to have it available afterwards.
Code:
val tweets: DataSet[SparseVector] = source
  .flatMap(new SelectEnglishTweetWithCreatedAtFlatMapper)
  .map(tweet => featureVectorService.transform(tweet._2))

model.predict(tweets).print
Example result:
(SparseVector((462,8.73165920153676), (10844,8.508515650222549), (15656,2.931052542245018)),-1.0)
Is there a way to keep other data next to the prediction so that I have everything together? Without the context, the prediction doesn't help me.
Or maybe there is a way to predict just one vector instead of a DataSet, so that I could call the function inside the map above.
The SVM predictor expects a subtype of Vector as input. Hence there are two options to solve this problem:
Create a subtype of Vector which carries the tweet text as a tag; it is then passed through the predictor. This approach has the advantage that no additional operation is needed. However, you need to define new classes and utilities to represent the different vector types with tags:
val env = ExecutionEnvironment.getExecutionEnvironment
val input = env.fromElements("foobar", "barfo", "test")
val vectorizedInput = input.map(word => {
  val value = word.chars().sum()
  new DenseVectorWithTag(Array(value), word)
})

val svm = SVM().setBlocks(env.getParallelism)
val weights = env.fromElements(DenseVector(1.0))
svm.weightsOption = Option(weights) // skipping the training here

val predictionResult: DataSet[(DenseVectorWithTag, Double)] = svm.predict(vectorizedInput)

class DenseVectorWithTag(override val data: Array[Double], tag: String)
  extends DenseVector(data) {
  override def toString: String = "(" + super.toString + ", " + tag + ")"
}
Join the prediction DataSet with the input DataSet on the vectorized representation of the tweets. This approach has the advantage that we don't need to introduce new classes; the price we pay is an additional join operation, which might be expensive:
val input = env.fromElements("foobar", "barfo", "test")
val vectorizedInput = input.map(word => {
  val value = word.chars().sum()
  (DenseVector(value), word)
})

val svm = SVM().setBlocks(env.getParallelism)
val weights = env.fromElements(DenseVector(1.0))
svm.weightsOption = Option(weights) // skipping the training here

val predictionResult = svm.predict(vectorizedInput.map(a => a._1))

val inputWithPrediction: DataSet[(String, Double)] = vectorizedInput
  .join(predictionResult)
  .where(0)
  .equalTo(0)
  .apply((t, p) => (t._2, p._2))

How to add one to every element of a SparseVector in Breeze?

Given a Breeze SparseVector object:
scala> val sv = new SparseVector[Double](Array(0, 4, 5), Array(1.5, 3.6, 0.4), 8)
sv: breeze.linalg.SparseVector[Double] = SparseVector(8)((0,1.5), (4,3.6), (5,0.4))
What is the best way to take the log of the values + 1?
Here is one way that works:
scala> new SparseVector(sv.index, log(sv.data.map(_ + 1)), sv.length)
res11: breeze.linalg.SparseVector[Double] = SparseVector(8)((0,0.9162907318741551), (4,1.5260563034950492), (5,0.3364722366212129))
I don't like this because it doesn't really make use of Breeze to do the addition. We are using a Breeze UFunc to take the log of an Array[Double], but that isn't much. I am concerned that in a distributed application with large SparseVectors, this will be slow.
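A more Breeze-native sketch of the same computation (my assumption, not from the original post, is that mapActiveValues is the idiomatic route here): since log1p(0) == 0, mapping only the stored entries gives the same result as densifying, adding 1, and taking the log:
import breeze.linalg.SparseVector

val sv = new SparseVector[Double](Array(0, 4, 5), Array(1.5, 3.6, 0.4), 8)
// log1p(x) == log(x + 1); it maps 0 to 0, so touching only the active
// entries preserves both the values and the sparsity structure.
val res = sv.mapActiveValues(v => math.log1p(v))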
This is for Spark 1.6.3. You can define some UDFs to do arbitrary vectorized addition in Spark. First, you need to set up the ability to convert Spark vectors to Breeze vectors; an example of doing that is here. Once you have the implicit conversions in place, you have a few options.
To add any two columns you can use:
def addVectors(v1Col: String, v2Col: String, outputCol: String): DataFrame => DataFrame = {
  // Error checking of column names omitted here
  df: DataFrame => {
    def add(v1: SparkVector, v2: SparkVector): SparkVector =
      (v1.asBreeze + v2.asBreeze).fromBreeze
    val func = udf((v1: SparkVector, v2: SparkVector) => add(v1, v2))
    df.withColumn(outputCol, func(col(v1Col), col(v2Col)))
  }
}
Note that the use of asBreeze and fromBreeze (as well as the SparkVector alias) is established in the question linked above. A possible solution to the "add one" problem is to make a literal column by
df.withColumn(colName, lit(1))
and then add the columns.
The alternative for more complex mathematical functions is:
def applyMath(func: BreezeVector[Double] => BreezeVector[Double],
              inColName: String, outColName: String): DataFrame => DataFrame = {
  df: DataFrame => df.withColumn(outColName,
    udf((v1: SparkVector) => func(v1.asBreeze).fromBreeze).apply(col(inColName)))
}
You could also make this generic in the Breeze vector parameter.
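For instance, a hypothetical use of applyMath for the original log-of-values-plus-one question; the column names are made up, and it is assumed that breeze.numerics.log1p accepts the generic Breeze vector returned by asBreeze:
import breeze.numerics.log1p

// Apply log(x + 1) element-wise to the vector column "features",
// writing the result to a new column "logFeatures".
val withLog = applyMath(v => log1p(v), "features", "logFeatures")(df)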

Spark wholeTextFiles - many small files

I want to ingest many small text files via Spark into Parquet. Currently, I use wholeTextFiles and perform some additional parsing.
To be more precise, these small text files are ESRI ASCII Grid files, each with a maximum size of around 400 KB. GeoTools is used to parse them, as outlined below.
Do you see any optimization possibilities? Maybe something to avoid the creation of unnecessary objects, or something to handle the small files better? I wonder if it would be better to only get the paths of the files and read them manually, instead of going through String -> ByteArrayInputStream.
case class RawRecords(path: String, content: String)
case class GeometryId(idPath: String, value: Double, geo: String)

@transient lazy val extractor = new PolygonExtractionProcess()
@transient lazy val writer = new WKTWriter()

def readRawFiles(path: String, parallelism: Int, spark: SparkSession) = {
  import spark.implicits._
  spark.sparkContext
    .wholeTextFiles(path, parallelism)
    .toDF("path", "content")
    .as[RawRecords]
    .mapPartitions(mapToSimpleTypes)
}
def mapToSimpleTypes(iterator: Iterator[RawRecords]): Iterator[GeometryId] = iterator.flatMap(r => {
  val extractor = new PolygonExtractionProcess()
  // http://docs.geotools.org/latest/userguide/library/coverage/arcgrid.html
  val readRaster = new ArcGridReader(new ByteArrayInputStream(r.content.getBytes(StandardCharsets.UTF_8))).read(null)
  // TODO maybe consider optimization of known size instead of using growable data structure
  val vectorizedFeatures = extractor.execute(readRaster, 0, true, null, null, null, null).features
  val result: collection.Seq[GeometryId] with Growable[GeometryId] = mutable.Buffer[GeometryId]()
  while (vectorizedFeatures.hasNext) {
    val vectorizedFeature = vectorizedFeatures.next()
    val geomWKTLineString = vectorizedFeature.getDefaultGeometry match {
      case g: Geometry => writer.write(g)
    }
    val geomUserdata = vectorizedFeature.getAttribute(1).asInstanceOf[Double]
    result += GeometryId(r.path, geomUserdata, geomWKTLineString)
  }
  result
})
I have a few suggestions:
Use wholeTextFiles -> mapPartitions -> convert to Dataset, in that order. Why? If you call mapPartitions on a Dataset, all rows are converted from the internal format to objects, which causes additional serialization.
Run Java Mission Control and sample your application. It will show you all the compilations and the execution times of your methods.
Maybe you can use binaryFiles: it gives you a stream per file, so you can parse it in mapPartitions without first materializing the whole content as a String (see the sketch below).
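A rough sketch of that last suggestion, under the assumption that ArcGridReader accepts a plain InputStream (the original code already feeds it a ByteArrayInputStream, so this seems plausible):
import org.apache.spark.sql.SparkSession
import org.geotools.gce.arcgrid.ArcGridReader

def readRawFilesBinary(path: String, parallelism: Int, spark: SparkSession) =
  spark.sparkContext
    .binaryFiles(path, parallelism)
    .mapPartitions(_.map { case (filePath, stream) =>
      // stream.open() returns a DataInputStream over the file content,
      // so no intermediate String or byte-array copy is needed.
      val raster = new ArcGridReader(stream.open()).read(null)
      (filePath, raster)
    })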

How to deal with more than one categorical feature in a decision tree?

I read a piece of code about binary decision trees in a book. The raw data has only one categorical feature, field(3), and it is converted to one-of-k (one-hot) encoding.
def PrepareData(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint], RDD[LabeledPoint], Map[String, Int]) = {
  val rawDataWithHeader = sc.textFile("data/train.tsv")
  val rawData = rawDataWithHeader.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
  val lines = rawData.map(_.split("\t"))
  val categoriesMap = lines.map(fields => fields(3)).distinct.collect.zipWithIndex.toMap
  val labelpointRDD = lines.map { fields =>
    val trFields = fields.map(_.replaceAll("\"", ""))
    val categoryFeaturesArray = Array.ofDim[Double](categoriesMap.size)
    val categoryIdx = categoriesMap(fields(3))
    categoryFeaturesArray(categoryIdx) = 1
    val numericalFeatures = trFields.slice(4, fields.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
    val label = trFields(fields.size - 1).toInt
    LabeledPoint(label, Vectors.dense(categoryFeaturesArray ++ numericalFeatures))
  }
  val Array(trainData, validationData, testData) = labelpointRDD.randomSplit(Array(8, 1, 1))
  return (trainData, validationData, testData, categoriesMap)
}
I wonder how to revise the code if there are several categorical features in the raw data, let's say field(3), field(5), and field(7) are all categorical features.
I revised the first line:
def PrepareData(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint], RDD[LabeledPoint], Map[String, Int], Map[String, Int], Map[String, Int]) = ......
Then I converted the other two fields into 1-of-k encoding in the same way:
val categoriesMap5 = lines.map(fields => fields(5)).distinct.collect.zipWithIndex.toMap
val categoriesMap7 = lines.map(fields => fields(7)).distinct.collect.zipWithIndex.toMap
val categoryFeaturesArray5 = Array.ofDim[Double](categoriesMap5.size)
val categoryFeaturesArray7 = Array.ofDim[Double](categoriesMap7.size)
val categoryIdx5 = categoriesMap5(fields(5))
val categoryIdx7 = categoriesMap7(fields(7))
categoryFeaturesArray5(categoryIdx5) = 1
categoryFeaturesArray7(categoryIdx7) = 1
Finally, I revised the LabeledPoint and the return value like this:
LabeledPoint(label, Vectors.dense(categoryFeaturesArray ++ categoryFeaturesArray5 ++ categoryFeaturesArray7 ++ numericalFeatures))
return (trainData, validationData, testData, categoriesMap, categoriesMap5, categoriesMap7)
Is it correct?
==================================================
The second problem I encountered is with the following code from the book: in trainModel, it uses
DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
Here is the code:
def trainModel(trainData: RDD[LabeledPoint], impurity: String, maxDepth: Int, maxBins: Int): (DecisionTreeModel, Double) = {
  val startTime = new DateTime()
  val model = DecisionTree.trainClassifier(trainData, 2, Map[Int, Int](), impurity, maxDepth, maxBins)
  val endTime = new DateTime()
  val duration = new Duration(startTime, endTime)
  (model, duration.getMillis())
}
The question is: how do I pass categoricalFeaturesInfo into this method if there are three categorical features, as mentioned above?
I just want to follow the steps in the book to build a prediction system on my own using a decision tree. To be more specific, the data set I chose has several categorical features, like:
Gender: male, female
Education: HS-grad, Bachelors, Master, PH.D, ......
Country: US, Canada, England, Australia, ......
But I don't know how to merge them into one single categoryFeatures ++ numericalFeatures to put into Vectors.dense(), and one single categoricalFeaturesInfo to put into DecisionTree.trainRegressor().
It is not clear to me what exactly you're doing here, but it looks like it is wrong from the beginning.
Ignoring the fact that you're reinventing the wheel by implementing one-hot encoding from scratch, the whole point of encoding is to convert categorical variables to numerical ones. This is required for linear models, but it arguably doesn't make sense when working with decision trees.
Keeping that in mind, you have two choices:
Index categorical fields without encoding and pass the indexed features to categoricalFeaturesInfo.
One-hot encode categorical features and treat them as numerical variables.
I believe that the former approach is the right one. The latter should work in practice, but it artificially increases dimensionality without providing any benefits. It may also conflict with some of the heuristics used by the Spark implementation.
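A minimal sketch of the indexing approach, reusing the maps from the question. The feature positions are an assumption: the indexed values of field(3), field(5), and field(7) are stored directly at positions 0, 1, and 2 of the feature vector instead of the one-hot arrays.
// Each entry maps a feature position to its number of categories (arity).
val categoricalFeaturesInfo = Map[Int, Int](
  0 -> categoriesMap.size,  // field(3)
  1 -> categoriesMap5.size, // field(5)
  2 -> categoriesMap7.size  // field(7)
)
val model = DecisionTree.trainClassifier(
  trainData, 2, categoricalFeaturesInfo, impurity, maxDepth, maxBins)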
One way or another, you should consider using ML Pipelines, which provide all the required indexing, encoding, and merging tools; a sketch follows.
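Here is what such a Pipeline could look like; the column names (gender, education, country, age, label) and the trainingDF DataFrame are hypothetical stand-ins for the real schema:
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// One StringIndexer per categorical column; the indexer also attaches
// nominal metadata, so the tree treats these columns as categorical.
val indexers = Seq("gender", "education", "country").map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(s"${c}Indexed")
}
// VectorAssembler merges the indexed and numerical columns.
val assembler = new VectorAssembler()
  .setInputCols(Array("genderIndexed", "educationIndexed", "countryIndexed", "age"))
  .setOutputCol("features")
val dt = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

val stages: Array[PipelineStage] = (indexers :+ assembler :+ dt).toArray
val model = new Pipeline().setStages(stages).fit(trainingDF)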

How to convert Spark's TableRDD to RDD[Array[Double]] in Scala?

I am trying to perform a Scala operation on Shark. I am creating an RDD as follows:
val tmp: shark.api.TableRDD = sc.sql2rdd("select duration from test")
I need to convert it to RDD[Array[Double]]. I tried toArray, but it doesn't seem to work.
I also tried converting it to an RDD[String] first and then converting that using map, as follows:
val tmp_2 = tmp.map(row => row.getString(0))
val tmp_3 = tmp_2.map { row =>
  val features = Array[Double](row(0))
}
But this gives me an RDD[Unit], which cannot be used in the function (the map block ends with a val definition, so it evaluates to Unit). Is there any other way to proceed with this type conversion?
Edit: I also tried using toDouble, but this gives me an RDD[Double], not an RDD[Array[Double]]:
val tmp_5 = tmp_2.map(_.toDouble)
Edit 2:
I managed to do it as follows:
A sample of the data:
296.98567000000003
230.84362999999999
212.89751000000001
914.02404000000001
305.55383
A Shark TableRDD was created first:
val tmp = sc.sql2rdd("select duration from test")
I made use of getString to translate it to an RDD[String] and then converted that to an RDD[Array[Double]]:
val duration = tmp.map(row => Array[Double](row.getString(0).toDouble))