Is it necessary to convert categorical attributes to numerical attributes to use the LabeledPoint function in Pyspark?

I am new to Pyspark. I have a dataset that contains categorical features and I want to use regression models from pyspark to predict continuous values. I am stuck at the pre-processing of the data that is required for using MLlib models.

Yes, it is necessary. Not only do you have to convert the categorical features to numerical values, you also have to encode them to make them useful for linear models. Both steps are implemented in pyspark.ml (not mllib) with:
pyspark.ml.feature.StringIndexer - indexing.
pyspark.ml.feature.OneHotEncoder - encoding.
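A minimal sketch of both steps (assuming a DataFrame df with a hypothetical string column named "category"; adjust the column names to your data):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# 'category' is a placeholder column name; replace it with your own.
indexer = StringIndexer(inputCol="category", outputCol="category_index")
encoder = OneHotEncoder(inputCol="category_index", outputCol="category_vec")

# Fit and apply both steps; 'category_vec' can then be fed to a VectorAssembler.
pipeline = Pipeline(stages=[indexer, encoder])
encoded = pipeline.fit(df).transform(df)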

Related

Invalid labels for classification logistic regression model in pyspark databricks

I am using the Spark ML library for a classification problem with logistic regression.
I have vectorized the input features and created a training dataset and a test dataset.
While fitting the model I get an invalid labels issue.
In my training dataset, the input features are in Independent_features and the target feature is Category_con.
Use the names label and features instead of Independent_features and Category_con when creating your vectors.
For the labels, you would need to change them into just 3 categories; from the error message it looks like you might have 6. You would need to use conditional replacement to group or bin the categories, like below:

from pyspark.sql.functions import col, when

# firstCondition and secondCondition are placeholders for your category values;
# the label values 0, 1, 2 are example bins.
train_df = train_df.withColumn(
    'label',
    when(col('Category_con') == firstCondition, 0)
    .when(col('Category_con') == secondCondition, 1)
    .otherwise(2))

How to specify multiple columns in xgboost.trainWithDataframe when using spark?

According to the API doc on xgboost.com, it seems that I can set only one column as the "featureCol".
As with any ML Estimator on Spark, this one expects inputCol to be a Vector of assembled features. Before you apply the Estimator, you should use tools from org.apache.spark.ml.feature to extract, transform and assemble the feature vector.
You can check How to vectorize DataFrame columns for ML algorithms? for an example Pipeline.
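For reference, the assembly step looks roughly like this in PySpark (a sketch; the column names are hypothetical and the columns must already be numeric):

from pyspark.ml.feature import VectorAssembler

# 'col1' .. 'col3' are placeholder names for already-numeric feature columns.
assembler = VectorAssembler(inputCols=["col1", "col2", "col3"],
                            outputCol="features")
assembled = assembler.transform(df)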

spark ml : how to find feature importance

I am new to ML and I am building a prediction system using Spark ml. I read that a major part of feature engineering is to find the importance of each feature in making the required prediction. In my problem, I have three categorical features and two string features. I use the OneHotEncoding technique for transforming the categorical features and the simple HashingTF mechanism to transform the string features. These are then fed, along with ml NaiveBayes and a VectorAssembler (to assemble all the features into a single column), as stages of a Pipeline, which is fit and transformed using the training and test data sets respectively.
Everything is good, except, how do I decide the importance of each feature? I know I have only a handful of features now, but I will be adding more soon. The closest thing I came across was the ChiSqSelector available with spark ml module, but it seems to only work for categorical features.
Thanks, any leads appreciated!
You can see these examples:
The method mentioned in the question's comment
Information Gain based feature selection in Spark’s MLlib
This package contains several feature selection methods (also InfoGain):
Information Theoretic Feature Selection Framework
Using ChiSqSelector is okay; you can simply discretize your continuous features (the HashingTF values). An example is provided in http://spark.apache.org/docs/latest/mllib-feature-extraction.html; I copy the part of interest here:
// Discretize data in 16 equal bins since ChiSqSelector requires categorical features
// Even though features are doubles, the ChiSqSelector treats each unique value as a category
val discretizedData = data.map { lp =>
  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => (x / 16).floor }))
}
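The same idea in PySpark (a sketch, assuming data is an RDD of LabeledPoint as in the Scala snippet; numTopFeatures=50 is an arbitrary placeholder):

import math
from pyspark.mllib.feature import ChiSqSelector
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# Discretize exactly as above, then select the top features by chi-squared score.
discretized = data.map(lambda lp: LabeledPoint(
    lp.label, Vectors.dense([math.floor(x / 16) for x in lp.features.toArray()])))

selector = ChiSqSelector(numTopFeatures=50)  # 50 is an arbitrary example value
model = selector.fit(discretized)
selected = model.transform(discretized.map(lambda lp: lp.features))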
L1 regularization is also an option.
You may use L1 to get the feature importances from the coefficients, and decide which features to use for the Bayes training accordingly.
Example of getting coefficients
Update:
Some conditions under which coefficients do not work very well
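A sketch of the L1 approach (assuming a DataFrame train_df with label and features columns and a binary label; for multiclass, read model.coefficientMatrix instead):

from pyspark.ml.classification import LogisticRegression

# elasticNetParam=1.0 selects a pure L1 penalty; regParam=0.1 is an arbitrary strength.
lr = LogisticRegression(regParam=0.1, elasticNetParam=1.0)
model = lr.fit(train_df)

# L1 drives unimportant coefficients to exactly zero, so non-zeros mark the kept features.
for i, w in enumerate(model.coefficients.toArray()):
    if w != 0.0:
        print(i, w)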

Training SVM classifier in MATLAB with numeric+text data

I want to train an SVM classifier in MATLAB for threat detection. The training data is in an Excel file and contains both numeric and text fields/columns. When I export this data to MATLAB, it is either in table or cell format. How do I convert it to matrix format?
P.S.: Using the xlsread function does not import the text data.
There are 4 types of attributes in data: numerical, discrete, nominal and ordinal (here you can read more about them). First run a statistical analysis for each feature in your dataset to get the basic statistics, such as mean, median, max, min and variable type, and, if it is nominal or ordinal, the distinct values. You will then have a pretty good idea of what you are dealing with. Then, according to the variable type, you can decide which vectorization to use: if it is a numerical variable, you can divide it into different classes and apply feature scaling; if it is an ordinal variable, you can give it a logical order; if it is a nominal variable, you can give each value an identifying number. Here, you are just checking how much impact each feature brings to the final prediction.
My advice: use the Weka GUI, too, to visualize the data. Then you can pre-process the data column by column.
You need to transform your text fields into numeric ones using dummy variables or another technique, or drop them entirely if they are actually IDs (e.g. patient name in medical data, record number, respondent UUID in a survey, etc.).
This would actually be easier in R or Python+pandas, but in MATLAB you will need to perform the encoding yourself, working from the cell array towards a matrix. Or you can try this toolbox.
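For comparison, the pandas route is only a few lines (the file name and text column names below are made up for illustration):

import pandas as pd

df = pd.read_excel("threats.xlsx")  # placeholder file name
# Dummy-encode the text columns; 'protocol' and 'service' are hypothetical names.
encoded = pd.get_dummies(df, columns=["protocol", "service"])
X = encoded.to_numpy()  # numeric matrix, ready for an SVM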

Clustering and classification

I need to perform clustering and classification on data, which is present in a csv file. The data is in the form of simple text containing vendor names.
Is there some free library available for this task?
Thanks,
Ashish
Clustering and classification are two different tasks, but you can do both with these libraries:
Python: scikit-learn
Java: Weka
First convert your dataset from csv to arff using the following link:
http://www.cs.ccsu.edu/~markov/MDLclustering/MDLmanual.pdf
After doing this, please let me know what you expect from the data, as every algorithm in Weka gives somewhat different results.
You can simply apply k-means or any other algorithm once you convert the data.
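For example, with scikit-learn you could cluster the vendor names directly from text (a sketch; the sample names and n_clusters=2 are made up):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical vendor names; in practice, read them from your csv.
vendors = ["Acme Corp", "Acme Corporation", "Globex", "Globex Inc"]

# Character n-grams handle near-duplicate spellings of the same vendor well.
X = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(vendors)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)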