How to specify multiple columns in xgboost.trainWithDataframe when using Spark (Scala)?

According to the API doc on xgboost.com, it seems that I can set only one column as the "featureCol".

As with any ML Estimator on Spark, this one expects inputCol to be a Vector of assembled features. Before you apply the Estimator, you should use tools from org.apache.spark.ml.feature to extract, transform and assemble the feature vector.
You can check How to vectorize DataFrame columns for ML algorithms? for example Pipeline.
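For illustration, here is a minimal sketch using VectorAssembler, assuming a hypothetical DataFrame df with numeric columns f1, f2 and f3 (shown in PySpark; the Scala org.apache.spark.ml.feature.VectorAssembler API is analogous):

from pyspark.ml.feature import VectorAssembler

# Pack the individual numeric columns into the single vector column
# that the estimator's featuresCol parameter expects.
assembler = VectorAssembler(inputCols=['f1', 'f2', 'f3'], outputCol='features')
assembled = assembler.transform(df)
# Now point the XGBoost estimator's featuresCol at 'features'.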

Related

join datasets with tfx tensorflow transform

I am trying to replicate some data preprocessing that I have done in pandas in tensorflow transform.
I have a few CSV files, which I joined and aggregated with pandas to produce a training dataset. Now, as part of productionising the model, I would like this preprocessing to be done at scale with apache beam and tensorflow transform. However, it is not quite clear to me how I can reproduce the same data manipulation there. Let's look at two main operations: joining dataset a and dataset b to produce c, and grouping by col1 on dataset c. These would be quite straightforward operations in pandas, but how would I do them in tensorflow transform running on apache beam? Am I using the wrong tool for the job? If so, what would be the right tool?
You can use the Beam Dataframes API to do the join and other preprocessing exactly as you would have in Pandas. You can then use to_pcollection to get a PCollection that you can pass directly to your Tensorflow Transform operations, or save it as a file to read in later.
For top-level functions (such as merge) one needs to do
from apache_beam.dataframe.pandas_top_level_functions import pd_wrapper as beam_pd
and use operations beam_pd.func(...) in place of pd.func(...).
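As a concrete illustration, here is a minimal sketch, assuming two hypothetical CSV files a.csv and b.csv that share a key column col1; it joins and aggregates with the Beam DataFrames API and converts the result to a PCollection for downstream Tensorflow Transform preprocessing:

import apache_beam as beam
from apache_beam.dataframe.io import read_csv
from apache_beam.dataframe.convert import to_pcollection
from apache_beam.dataframe.pandas_top_level_functions import pd_wrapper as beam_pd

with beam.Pipeline() as p:
    # Read the CSVs as deferred Beam dataframes.
    df_a = p | 'ReadA' >> read_csv('a.csv')
    df_b = p | 'ReadB' >> read_csv('b.csv')
    # JOIN a and b to produce c, then GROUP BY col1, just as in pandas.
    df_c = beam_pd.merge(df_a, df_b, on='col1')
    aggregated = df_c.groupby('col1').sum()
    # Convert to a PCollection of rows to feed into Tensorflow Transform.
    rows = to_pcollection(aggregated)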

Invalid labels for classification logistic regression model in pyspark databricks

I am using Spark ML library for classification problem using a logistic regression.
I have vectorized input features and created training dataset and test dataset.
While fitting the model I get an invalid-labels error.
In the training dataset, my input features are in Independent_features and my target feature is Category_con.
Use the names label and features instead of Independent_features and Category_con when creating your vectors.
For the labels, you would need to change them into just 3 categories. It looks like you might have 6 from the error message. You would need to use conditional replacement to group or bin the categories, as below:

from pyspark.sql.functions import col, when

# Map the original categories down to three numeric labels (0, 1, 2).
# firstCondition and secondCondition are placeholders for your own category values.
train_df = train_df.withColumn(
    'label',
    when(col('Category_con') == firstCondition, 0)
    .when(col('Category_con') == secondCondition, 1)
    .otherwise(2))

Is it necessary to convert categorical attributes to numerical attributes to use LabeledPoint function in Pyspark?

I am new to Pyspark. I have a dataset that contains categorical features and I want to use regression models from pyspark to predict continuous values. I am stuck in pre-processing of data that is required for using MLlib models.
Yes, it is necessary. You have to not only convert them to numerical values but also encode them to make them useful for linear models. Both steps are implemented in pyspark.ml (not mllib) with:
pyspark.ml.feature.StringIndexer - indexing.
pyspark.ml.feature.OneHotEncoder - encoding.
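Here is a minimal sketch putting both steps together, assuming a hypothetical DataFrame df with a categorical column 'city':

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

# 1. Map category strings to numeric indices.
indexer = StringIndexer(inputCol='city', outputCol='city_index')
# 2. One-hot encode the indices so linear models don't treat them as ordered.
encoder = OneHotEncoder(inputCol='city_index', outputCol='city_vec')
# 3. Assemble everything into the single 'features' vector column.
assembler = VectorAssembler(inputCols=['city_vec'], outputCol='features')

pipeline = Pipeline(stages=[indexer, encoder, assembler])
prepared = pipeline.fit(df).transform(df)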

spark ml : how to find feature importance

I am new to ML and I am building a prediction system using Spark ml. I read that a major part of feature engineering is to find the importance of each feature in making the required prediction. In my problem, I have three categorical features and two string features. I use the OneHotEncoding technique for transforming the categorical features and a simple HashingTF mechanism to transform the string features. These then go in as stages of a Pipeline, including ml NaiveBayes and a VectorAssembler (to assemble all the features into a single column), which is fit and applied to the training and test data sets respectively.
Everything is good, except: how do I decide the importance of each feature? I know I have only a handful of features now, but I will be adding more soon. The closest thing I came across was the ChiSqSelector available in the spark ml module, but it seems to only work for categorical features.
Thanks, any leads appreciated!
You can see these examples:
The method mentioned in the question's comments
Information Gain based feature selection in Spark’s MLlib
This package contains several feature selection methods (including InfoGain):
Information Theoretic Feature Selection Framework
Using ChiSqSelector is okay; you can simply discretize your continuous features (the HashingTF values). One example is provided in http://spark.apache.org/docs/latest/mllib-feature-extraction.html; I copy the relevant part here:
// Discretize data in 16 equal bins since ChiSqSelector requires categorical features
// Even though features are doubles, the ChiSqSelector treats each unique value as a category
val discretizedData = data.map { lp =>
  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => (x / 16).floor }))
}
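For completeness, a minimal sketch of then applying the selector, shown with PySpark's DataFrame API (assuming a hypothetical DataFrame train_df with a discretized 'features' column and a 'label' column):

from pyspark.ml.feature import ChiSqSelector

# Keep the 50 features with the strongest chi-squared association to the label.
selector = ChiSqSelector(numTopFeatures=50, featuresCol='features',
                         labelCol='label', outputCol='selectedFeatures')
selected = selector.fit(train_df).transform(train_df)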
L1 regularization is also an option.
You may use L1 to get the features importance from the coefficients, and decide which features to use for the Bayes training accordingly.
Example of getting coefficients
Update:
Some conditions under which coefficients do not work very well
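As an illustration, a minimal PySpark sketch of fitting an L1-penalized model and reading off the coefficients (train_df with 'features' and 'label' columns is a hypothetical stand-in for your training data):

from pyspark.ml.classification import LogisticRegression

# elasticNetParam=1.0 selects a pure L1 penalty; regParam sets its strength.
lr = LogisticRegression(featuresCol='features', labelCol='label',
                        regParam=0.1, elasticNetParam=1.0)
model = lr.fit(train_df)
# Coefficients that L1 has driven to (or near) zero mark features you can
# likely drop before the Bayes training. (Binary case shown; multiclass
# models expose coefficientMatrix instead.)
print(model.coefficients)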

Interpreting the result of StreamingKMeans in mahout 0.8

What I want to achieve is simply to find out which input points are included in a given cluster.
I have a personal dataset which contains some documents that have been grouped into 12 clusters manually.
I know how to interpret the kmeans result in mahout 0.7 using the NamedVector class and one of the dumpers (like clusterDumper). After clustering with the kmeans driver, a directory named clusteredPoints is created which contains the clustering result, and using clusterDumper you can see the created clusters and the points that are in each one. The link below gives a good solution for this:
How to read Mahout clustering output
But, as I mentioned in the title, I want the same capability for interpreting the StreamingKMeans result, which is a new feature in mahout 0.8.
This feature uses a Centroid class for holding the data points and each cluster's seeds. The output of the StreamingKMeans algorithm is only a sequence file containing the centroid vectors plus the keys and weights of each cluster. This output carries no information about the input data points, so I cannot tell how they are distributed between the clusters, and consequently I cannot get a sense of the clustering accuracy.
So, how do I get this information from the clustering output? Is it not implemented, or have I simply failed to find and use an existing solution? How can I analyze the result of StreamingKMeans?
thanks.