XGBoost4J - Scala DataFrame to sparse DMatrix

What is the most efficient and scalable way to convert a Scala DataFrame to a sparse DMatrix for XGBoost4J?
Say I have a DataFrame train with columns row_index, column_index, and value. I would like to do something like
new DMatrix(train.select("row_index"), train.select("column_index"), train.select("Value"), DMatrix.SparseType.CSR, n_col)
However, the above results in a type mismatch because DMatrix expects Array[Long].
train.select(F.collect_list("row_index")).first().getList[Long](0) seems like a possible option, but it does not look memory-friendly or scalable.
I am doing this on Databricks, so solutions in the other supported languages (Python, SQL, Scala) are welcome.

The answer was to build sparse vectors by row rather than trying to create a sparse matrix or DMatrix directly:
train.rdd.map(r => (r.getInt(0), (r.getInt(1), r.getInt(2).toDouble))).groupByKey().map(r => (r._1, Vectors.sparse(n_col, r._2.toSeq))).toDF
I tested scoring a sample of the data in R using Matrix::sparseMatrix and xgboost::dmatrix and the results matched up.
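Spelled out with the imports it needs, the same approach looks roughly like this (a sketch: the column names and n_col follow the question, the value column is read as an Int as in the one-liner above, and ml.linalg vectors are used; switch to mllib.linalg if the rest of your pipeline expects the old vector type):

import org.apache.spark.ml.linalg.Vectors
import spark.implicits._  // spark is the SparkSession (available by default on Databricks)

// n_col = total number of feature columns, known up front
val sparseRows = train
  .select("row_index", "column_index", "value")
  .rdd
  .map(r => (r.getInt(0), (r.getInt(1), r.getInt(2).toDouble)))  // (row, (col, value))
  .groupByKey()
  .map { case (row, cells) => (row, Vectors.sparse(n_col, cells.toSeq)) }
  .toDF("row_index", "features")

Vectors.sparse accepts the unordered (index, value) pairs that groupByKey produces, so no extra sorting step is needed.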

Related

Pyspark improving performance for multiple column operations

I have written a class which performs standard scaling over grouped data.
from functools import reduce
import pyspark.sql.functions as F

class Scaler:
    ...

    def __transformOne__(self, df_with_stats, newName, colName):
        return df_with_stats\
            .withColumn(newName,
                        (F.col(colName) - F.col(f'avg({colName})')) / (F.col(f'stddev_samp({colName})') + self.tol))\
            .drop(colName)\
            .withColumnRenamed(newName, colName)

    def transform(self, df):
        df_with_stats = df.join(....)  # calculate stats here by doing a groupby and then a join
        return reduce(lambda df_with_stats, kv: self.__transformOne__(df_with_stats, *kv),
                      self.__tempNames__(), df_with_stats)[df.columns]
The idea is to save the means and standard deviations in columns and simply do a column subtraction/division on the column I want to scale. This part is done in __transformOne__, so basically it's an arithmetic operation on one column.
If I want to scale multiple columns, I just call __transformOne__ multiple times, a bit more efficiently using functools.reduce (see the function transform). The class works fast enough for a single column, but when I have multiple columns it takes too much time.
I have no idea about the internals of Spark, so I'm a complete newbie. Is there a way I can improve this computation over multiple columns?
My solution makes a lot of calls to the withColumn function, so I changed it to use select instead of withColumn. There is a substantial difference in the physical plans of the two approaches: for my application, the runtime improved from 15 minutes to 2 minutes after switching to select. More information about this is in this SO post.
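For illustration, here is a minimal sketch of the select-based idea (written in Scala; the question itself uses PySpark, where the pattern is the same). The avg(col)/stddev_samp(col) naming and the tol term follow the class above, and the helper name is made up. Building every scaled column inside a single select produces one projection in the physical plan instead of one projection per withColumn call:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: scale all requested columns in one projection.
def scaleAll(dfWithStats: DataFrame, toScale: Seq[String], passThrough: Seq[String], tol: Double): DataFrame = {
  val scaled = toScale.map { c =>
    ((col(c) - col(s"avg($c)")) / (col(s"stddev_samp($c)") + tol)).as(c)
  }
  dfWithStats.select(passThrough.map(col) ++ scaled: _*)
}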

Should I choose RDD over DataSet/DataFrame if I intend to perform a lot of aggregations by key?

I have a use case where I intend to group by key(s) while aggregating over column(s). I am using Dataset and tried to achieve these operations by using groupBy and agg. For example, take the following scenario:
case class Result(deptId:String,locations:Seq[String])
case class Department(deptId:String,location:String)
// using spark 2.0.2
// I have a Dataset `ds` of type Department
+-------+--------------------+
| deptId|            location|
+-------+--------------------+
|     d1|               delhi|
|     d1|              mumbai|
|    dp2|            calcutta|
|    dp2|           hyderabad|
+-------+--------------------+
I intended to convert it to
// Dataset `result` of type Result
+-------+--------------------+
| deptId|           locations|
+-------+--------------------+
|     d1|      [delhi,mumbai]|
|    dp2|[calcutta,hyderabad]|
+-------+--------------------+
For this I searched on Stack Overflow and found the following:
val flatten = udf((xs: Seq[Seq[String]]) => xs.flatten)
val result = ds.groupBy("deptId")
  .agg(flatten(collect_list("location")).as("locations"))
The above seemed pretty neat to me.
But before finding it, I first checked whether Dataset has a built-in reduceByKey like an RDD does. I couldn't find one, so I opted for the above. However, I read the article groupByKey vs reduceByKey and came to know that reduceByKey causes fewer shuffles and is more efficient. This is my first reason to ask the question: should I opt for RDD in my scenario?
The reason I initially went for Dataset was solely the enforcement of types, i.e. each row being of type Department. But as my result has an entirely different schema, should I bother with type safety? I tried result.as[Result], but that does not seem to do any compile-time type checking. Another reason I chose Dataset is that I'll pass the result Dataset to some other function, and having a structure makes the code easier to maintain. Also, the case class can be highly nested, and I cannot imagine maintaining that nesting in a pair RDD while writing reduce/map operations.
Another thing I'm unsure about is using a udf. I came across posts where people said they would prefer converting the Dataset to an RDD rather than using a udf for complex aggregations/groupBy.
I also googled around a bit and saw posts/articles saying that Dataset has the overhead of type checking, but that in newer Spark versions it performs better than RDD. Again, I am not sure whether I should switch back to RDD.
PS: please forgive me if I used some terms incorrectly.
To answer some of your questions:
groupBy + agg is not groupByKey (see DataFrame / Dataset groupBy behaviour/optimization) in the general case. There are specific cases where it might behave like one; this includes collect_list.
reduceByKey is not better than RDD-style groupByKey when groupByKey-like logic is required (see Be Smart About groupByKey), and in fact it is almost always worse; a short sketch of that route is at the end of this answer.
There is an important trade-off between static type checking and performance in Spark's Dataset (see Spark 2.0 Dataset vs DataFrame).
The linked post specifically advises against using UserDefinedAggregateFunction (not UserDefinedFunction) because of excessive copying of data (see Spark UDAF with ArrayType as bufferSchema performance issues).
You don't even need a UserDefinedFunction here, as flattening is not required in your case:
val df = Seq[Department]().toDF
df.groupBy("deptId").agg(collect_list("location").as("locations"))
And this is what you should go for.
A statically typed equivalent would be
val ds = Seq[Department]().toDS
ds
  .groupByKey(_.deptId)
  .mapGroups { case (deptId, xs) => Result(deptId, xs.map(_.location).toSeq) }
but it is considerably more expensive than the DataFrame option.
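For reference, the RDD reduceByKey route raised in the question would look roughly like this (a sketch). Every merge concatenates sequences, so all the values still travel across the shuffle and intermediate collections are allocated at each step, which is why it ends up worse rather than better here:

import org.apache.spark.rdd.RDD

// reduceByKey with collection concatenation: same shuffle volume as groupByKey,
// plus the cost of building an intermediate Seq on every merge.
val byDept: RDD[(String, Seq[String])] = ds.rdd
  .map(d => (d.deptId, Seq(d.location)))
  .reduceByKey(_ ++ _)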

Question Classification Using Support Vector Machines

I am trying to classify Questions using SVM. I am following this link for reference -
https://shirishkadam.com/2017/07/03/nlp-question-classification-using-support-vector-machines-spacyscikit-learnpandas/
But they have used spaCy, scikit-learn, and pandas. I want to do the same thing using Spark MLlib.
I am using this code to create a DataFrame -
sc = SparkContext(conf=sconf) # SparkContext
sqlContext = SQLContext(sc)
data = sc.textFile("<path_to_csv_file>")
header = data.first()
trainingDF = sqlContext.createDataFrame(
    data.filter(lambda line: line != header)
        .map(lambda line: line.split("|"))
        .map(lambda line: ([line[0]], [line[2]], [line[6]]))
).toDF("Question", "WH-Bigram", "Class")
And I am getting the following result by printing the DataFrame with trainingDF.show(3):
+--------------------+-------------------+------+
| Question| WH-Bigram| Class|
+--------------------+-------------------+------+
|[How did serfdom ...| [How did]|[DESC]|
|[What films featu...| [What films]|[ENTY]|
|[How can I find a...| [How can]|[DESC]|
My sample CSV file is -
#Question|WH|WH-Bigram|Class
How did serfdom develop in and then leave Russia ?|How|How did|DESC
I am using Word2Vec to create training data for the SVM and then trying to train with SVM.
word2Vec1 = Word2Vec(vectorSize=2, minCount=0, inputCol="Question", outputCol="result1")
training = word2Vec1.fit(trainingDF).transform(trainingDF)
model = SVMWithSGD.train(training, iterations=100)
After using Word2Vec, my data is converted into this format -
[Row(Question=[u'How did serfdom develop in and then leave Russia ?'], WH-Bigram=[u'How did'], Class=[u'DESC'], result1=DenseVector([0.0237, -0.186])), Row(Question=[u'What films featured the character Popeye Doyle ?'], WH-Bigram=[u'What films'], Class=[u'ENTY'], result1=DenseVector([-0.2429, 0.0935]))]
But when I try to train on the DataFrame using SVM, I get the error TypeError: data should be an RDD of LabeledPoint, but got <class 'pyspark.sql.types.Row'>
I am stuck here... I think the DataFrame that I have created is not correct.
Does anybody know how to create a suitable DataFrame for training it with SVM? And please let me know if I am doing something wrong.
Great that you are trying out one of the machine learning methods in Spark, but there are multiple problems with your approach:
1) Your data has multiple classes; this is not a binary classification problem, hence SVM in Spark won't work on this dataset (you can have a look at the source code here). You can try the one-class-vs-all-others approach and train as many models as there are classes in your data. However, you would be better off using something like the MultilayerPerceptronClassifier or the multiclass logistic model in Spark.
2) MLlib is very unforgiving in terms of the class labels that you use: you can only specify 0, 1, 2 or 0.0, 1.0, 2.0, etc., i.e. it does not automatically infer the number of classes based on your output column. Even if you specify two classes as 1.0 and 2.0 it will not work; they have to be 0.0 and 1.0.
3) You need to use an RDD of LabeledPoint instead of a Spark DataFrame; remember that spark.mllib is for use with RDDs whereas spark.ml is for use with DataFrames. For help on how to create a LabeledPoint RDD you may refer to the Spark documentation here, where there are multiple examples.
4) On a feature engineering note, I don't think you would want a vectorSize of 2 for your Word2Vec model (something like 10 would be more appropriate as a starting point); two dimensions are simply too few to give a reasonable prediction.
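To make point 1) concrete, here is a rough sketch of the spark.ml route (written in Scala; the question uses PySpark, where the same stages exist under pyspark.ml). It assumes the Question column holds the raw question string and Class holds the label string; StringIndexer maps the labels to 0.0, 1.0, 2.0, ... as required:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, Tokenizer, Word2Vec}

// Sketch only: tokenize the question, embed it with Word2Vec, index the class
// labels, and fit a logistic regression that handles more than two classes.
val tokenizer = new Tokenizer().setInputCol("Question").setOutputCol("words")
val word2Vec = new Word2Vec().setVectorSize(10).setMinCount(0)
  .setInputCol("words").setOutputCol("features")
val indexer = new StringIndexer().setInputCol("Class").setOutputCol("label")
val lr = new LogisticRegression()

val model = new Pipeline()
  .setStages(Array(tokenizer, word2Vec, indexer, lr))
  .fit(trainingDF)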

Using MLUtils.convertVectorColumnsToML() inside a UDF?

I have a Dataset/DataFrame with an mllib.linalg.Vector (of Doubles) as one of the columns. I would like to add another column of type ml.linalg.Vector to this data set (so I will have both types of vectors). The reason is that I am evaluating a few algorithms, some of which expect an mllib vector and some an ml vector. Also, I have to feed the output of one algorithm to another, and each uses a different type.
Can someone please help me convert mllib.linalg.Vector to ml.linalg.Vector and append a new column to the data set at hand? I tried using MLUtils.convertVectorColumnsToML() inside a UDF and in regular functions but was not able to get it working. I am trying to avoid creating a new dataset and then doing an inner join and dropping columns, as the data set will eventually be huge and joins are expensive.
You can use the method asML to convert from an mllib vector to an ml vector. A UDF and usage example can look like this:
val convertToML = udf((mllibVec: org.apache.spark.mllib.linalg.Vector) => {
  mllibVec.asML
})
val df2 = df.withColumn("mlVector", convertToML($"mllibVector"))
This assumes df is the original DataFrame and the column holding the mllib vector is named mllibVector.
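Alternatively, since the question asks about MLUtils.convertVectorColumnsToML(): that method converts columns in place rather than adding new ones, so one way to keep both representations (a sketch, reusing the column names from the example above) is to copy the column first and convert only the copy:

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.functions.col

// Duplicate the mllib vector column, then convert only the duplicate to an ml vector.
val withCopy = df.withColumn("mlVector", col("mllibVector"))
val df2 = MLUtils.convertVectorColumnsToML(withCopy, "mlVector")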

Scala: wrapper for Breeze DenseMatrix for column and row referencing

I am new to Scala and am looking at it as an alternative to MATLAB for some applications.
I would like to write a wrapper class in Scala that assigns column names ("QuantityQ", "QuantityP" -> Range) and row names (dates -> Range) to Breeze DenseMatrices (http://www.scalanlp.org/), so that columns and rows can be referenced by name.
The usage should resemble Python pandas or Scala Saddle (http://saddle.github.io).
Saddle is very interesting, but its usage is limited to 2D matrices, which is a huge limitation.
My Ideas:
Columns:
I thought a Map would do the job for columns, but that may not be the best implementation.
Rows:
For rows, I could maintain a separate Breeze vector with timestamps and provide methods that convert dates into timestamps, doing the number crunching through Breeze. This comes with a loss of generality, as a user may want to give arbitrary string names to rows.
Concerning dates, would I use nscala-time (a Scala wrapper for Joda-Time)?
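To make this concrete, the rough shape I have in mind is something like the following (just a sketch; the names and the Map-based lookups are placeholders):

import breeze.linalg.{DenseMatrix, DenseVector}

// Sketch: wrap a DenseMatrix together with a row-label index and a column-name index.
class LabeledMatrix[R](val data: DenseMatrix[Double],
                       rowIndex: Map[R, Int],
                       colIndex: Map[String, Int]) {
  def col(name: String): DenseVector[Double] = data(::, colIndex(name))
  def row(label: R): DenseVector[Double] = data(rowIndex(label), ::).t
  def apply(label: R, name: String): Double = data(rowIndex(label), colIndex(name))
}

Columns would then be accessed as m.col("QuantityQ"), rows by their date label, and all the number crunching would stay in Breeze.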
What are the drawbacks of my implementation?
Would you design the data structure differently?
Thank you for your help.