Feature selection and One-Hot-Encoding in Apache Spark - scala

I'm working on a classification model and I'm having trouble getting the data into the correct form for the model.
My dataset has 3 columns with sums. I discretized these columns with the built-in Bucketizer. The rest of the columns are categorical, with Strings as values. I used StringIndexer to transform these features. Afterwards I select the best columns via ChiSqSelector. So far so good.
But now I want to transform the categorical features into dummy variables. I don't know how to do that because I already have the data in the form of LabeledPoints. Is there an easy way or an existing solution to transform the values from a set of vectors into dummy variables? Or does anyone have a suggestion for solving this problem in another way?

#zero323 The input for ChiSqSelector has to be an RDD[LabeledPoint]. My data has 25 features. I select the 15 best features, but for simplicity let's say I have the following LabeledPoints:
LabeledPoint(1, [1, 2, 3])
LabeledPoint(0, [2, 1, 3])
LabeledPoint(1, [1, 3, 1])
For example ChiSqSelector selects only the best (first) feature so my LabeledPoints are:
LabeledPoint(1, [1])
LabeledPoint(0, [2])
LabeledPoint(1, [1])
How can I encode the features in the feature vector as dummy variables, so that my LabeledPoints become:
LabeledPoint(1, [1, 0])
LabeledPoint(0, [0, 1])
LabeledPoint(1, [1, 0])
Hope that helps. Or do you need some code?
Edit:
My idea right now is something like this:
Convert the label and features from each LabeledPoint to a Row and convert this RDD to DataFrame to use the OneHotEncoder:
val data = chiData.map { r =>
  val label = r.label
  val feature1 = r.features.toArray(0)
  val feature2 = r.features.toArray(1)
  val feature3 = r.features.toArray(2)
  ....
  Row(label, feature1, feature2, feature3, ...)
}
//Convert RDD to DataFrame
//Use OneHotEncoder
//Create LabeledPoints again for use in Algorithms
But I think this is not the smartest way.
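To make the target encoding concrete, here is a minimal sketch in plain NumPy (not the Spark API). It assumes the selected features are already numeric category indices, as produced by an indexer, and the helper name one_hot_columns is hypothetical. Each column is expanded into one dummy column per distinct value:

```python
import numpy as np

def one_hot_columns(features):
    """Expand every column of category indices into dummy (0/1) columns."""
    features = np.asarray(features)
    blocks = []
    for col in features.T:
        # codes[i] is the position of col[i] among the sorted distinct values
        categories, codes = np.unique(col, return_inverse=True)
        blocks.append(np.eye(len(categories))[codes])
    return np.hstack(blocks)

# The single selected feature from the example: values 1, 2, 1
selected = np.array([[1.0], [2.0], [1.0]])
dummies = one_hot_columns(selected)
print(dummies)
```

For the three example points this gives the rows [1, 0], [0, 1], [1, 0]. The labels are untouched, so they can be paired with the new rows again afterwards.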

Related

De-nest elements of cell-matrix into a matrix

EDIT: It turns out this problem is not solved, as it fails to handle empty cells in the source data. i.e. k = {1 2 [] 4 5}; cat( 2, k{:} ) gives 1 2 4 5 not 1 2 NaN 4 5. So the subsequent reshape is now misaligned. Can anyone outline a strategy for handling this?
I have data of the form:
X = { ...
{ '014-03-01' [1.1] [1.2] [1.3] }; ...
{ '014-03-02' [2.1] [2.2] [2.3] }; ... %etc
}
I wish to replace [1.1] with 1.1 etc., and also replace the date with an integer using datenum.
So I may as well use a standard 2D matrix to hold the result (as every element can be expressed as a Double).
But how do I go about this repacking?
I was hoping to switch the dates in-place using X{:,1} = datenum( X{:,1} ) but this command fails.
I can do:
A = cat( 1, X{:} )
dateNums = datenum( cat( 1, A{:,1} ) )
values = reshape( cat( 1, A{:,2:end} ), size(X,1), [] )
final = [dateNums values]
Well, that works, but I don't feel at all comfortable with it.
>> u = A{:,1}
u =
014-03-01
>> cat(1,u)
ans =
014-03-01
This suggests only one value is output. But:
>> cat(1,A{:,1})
ans =
014-03-01
014-03-02
So A{:,1} must be emitting a sequential stream of values, and cat must be accepting varargs.
So now if I do A{:,2:end}, it is now spitting out that 2D subgrid again as a sequential stream of values...? And the only way to get at that grid is to cat -> reshape it. Is this a correct understanding?
I'm finding MATLAB's console output infuriatingly inconsistent.
The "sequential stream of values" is known as a comma-separated list. Doing A{:,1} in MATLAB in the console is equivalent to the following syntax:
>> A{1,1}, A{2,1}, A{3,1}, ..., A{end,1}
This is why you see a stream of values because it is literally typing out each row of the cell for the first column, separated by commas and showing that in the command prompt. This is probably the source of your infuriation as you're getting a verbose dump of all of the contents in the cell when you are unpacking them into a comma-separated list. In any case, this is why you use cat because doing cat(1, A{:,1}) is equivalent to doing:
cat(1, A{1,1}, A{2,1}, A{3,1}, ... A{end,1})
The end result is that it takes all elements in the first column of the 2D cell array and creates a new result by concatenating them together. Similarly, doing A{:, 2:end} is equivalent to (note the column-major order):
>> A{1, 2}, A{2, 2}, A{3, 2}, ..., A{end, 2}, A{1, 3}, A{2, 3}, A{3, 3}..., A{end, 3}, ..., A{end, end}
This is why you need to perform a reshape because if you did cat with this by itself, it will only give you a single vector as a result. You probably want a 2D matrix, so the reshape is necessary to convert the vector into matrix form.
Comma-separated lists are very similar to Python's splat operator if you're familiar with Python. The splat operator is used for unpacking input arguments that are placed in a single list or iterator type... so for example, if you had the following in Python:
l = [1, 2, 3, 4]
func(*l)
This is equivalent to doing:
func(1, 2, 3, 4)
The above isn't really necessary to understand comma-separated lists, but I just wanted to show you that they're used in many programming languages, including Python.
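As a rough Python analogue of the cat-then-reshape step (a sketch only; the nested list cells stands in for the cell array, and the variable names are made up for illustration), the values are emitted in column-major order, concatenated into one vector, and reshaped back into a grid:

```python
import numpy as np

# Stand-in for the numeric part of the cell array: 2 rows, 3 columns
cells = [
    [np.array([1.1]), np.array([1.2]), np.array([1.3])],
    [np.array([2.1]), np.array([2.2]), np.array([2.3])],
]
n_rows, n_cols = 2, 3

# Column-major stream, analogous to the list produced by A{:, 2:end}
stream = [cells[r][c] for c in range(n_cols) for r in range(n_rows)]

# Concatenation alone only yields a flat vector ...
flat = np.concatenate(stream)

# ... so a column-major (Fortran-order) reshape restores the 2D layout
values = flat.reshape((n_rows, n_cols), order="F")
print(values)
```

The order="F" argument mirrors MATLAB's column-major layout; with NumPy's default row-major order the grid would come out scrambled.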
There is a problem with empty cells: cat will skip them. Which means that a subsequent reshape will throw a 'dimension mismatch' error.
The following code simply removes rows containing empty cells (which is what I require) as a preprocessing step.
(It would only take a minor alteration to replace empty cells with NaNs).
A = cat( 1, X{:} );
% Remove any row containing empty cells
emptiesInRow = sum( cellfun('isempty',A), 2 );
A( emptiesInRow > 0, : ) = [];
% Date is first col
dateNums = datenum( cat( 1, A{:,1} ) );
% Get other cols
values = reshape( cat( 1, A{:,2:end} ), size(A,1), [] );
% Recombine into (double) array
grid = [dateNums values]; %#ok<NASGU>
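The NaN variant mentioned above can be sketched in Python as follows. This is an illustration of the idea rather than a translation of the MATLAB code: empty cells are modelled as empty lists, and the helper name cells_to_matrix is hypothetical.

```python
import numpy as np

def cells_to_matrix(rows):
    """Turn rows of one-element 'cells' into a float matrix, NaN for empty cells."""
    return np.array(
        [[cell[0] if len(cell) else np.nan for cell in row] for row in rows],
        dtype=float,
    )

rows = [
    [[1.1], [], [1.3]],     # the middle cell is empty
    [[2.1], [2.2], [2.3]],
]
matrix = cells_to_matrix(rows)
print(matrix)
```

Because every cell is mapped to exactly one number before the array is built, the shape stays aligned and no reshape is needed.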

How to get element with python?

I have no idea how to implement this with a library or function. Can anybody give me some ideas? Just a function name, an idea, or a helpful website URL would be fine. Thanks!
I think it's different.
How does linear regression map to your problem?
Given a matrix X, the rows may represent the samples, and the columns the variables.
The column containing the numpy.nan value represents the target variable ("y"). The remaining columns represent the input variables (the x1, x2,...).
The rows with observed values represent the training set, the rest represent the test set.
The code
Below is a code snippet that implements these points using your example matrix X.
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1, 2, 3], [2, 4, np.nan], [3, 6, 9]])
# Unknown rows (test examples), i.e. rows with a nan
impute_rows = np.any(np.isnan(X), axis=1)
# Known rows (training examples), i.e. rows without a nan
full_rows = np.logical_not(impute_rows)
# Column acting as variable to predict
output_var = np.any(np.isnan(X), axis=0)
input_var = np.logical_not(output_var)
# Check only one variable to predict
assert(np.sum(output_var)==1)
# Construct training/test input/output
train_input = X[np.ix_(full_rows, input_var)]
train_output = X[np.ix_(full_rows, output_var)]
test_input = X[np.ix_(impute_rows, input_var)]
# Perform regression
lr = LinearRegression()
lr.fit(train_input, train_output)
lr.predict(test_input)
Note that using the specific X you provided represents an oversimplified case, where only two points are fitted, but these ideas should be applicable to larger matrices.
Also note that there are other, more specialized methods for imputing missing values in matrices (I understand from your question that this was an exercise). This specific method may be valid in cases where there is a linear relationship between the elements of the matrix (as is the case in your simplified example).

Concatenate Sparse Vectors in Spark?

Say you have two Sparse Vectors. As an example:
val vec1 = Vectors.sparse(2, Array(0), Array(1.0)) // [1, 0]
val vec2 = Vectors.sparse(2, Array(1), Array(1.0)) // [0, 1]
I want to concatenate these two vectors so that the result is equivalent to:
val vec3 = Vectors.sparse(4, Array(0, 3), Array(1.0, 1.0)) // [1, 0, 0, 1]
Does Spark have any such convenience method to do this?
If you have the data in a DataFrame, then VectorAssembler would be the right thing to use. For example:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors
dataset = spark.createDataFrame(
[(0, Vectors.sparse(10, {0: 0.6931, 5: 0.0, 7: 0.5754, 9: 0.2877}), Vectors.sparse(10, {3: 0.2877, 4: 0.6931, 5: 0.0, 6: 0.6931, 8: 0.6931}))],
["label", "userFeatures1", "userFeatures2"])
assembler = VectorAssembler(
inputCols=["userFeatures1", "userFeatures2"],
outputCol="features")
output = assembler.transform(dataset)
output.select("features", "label").show(truncate=False)
You would get the following output for this:
+---------------------------------------------------------------------------+-----+
|features |label|
+---------------------------------------------------------------------------+-----+
|(20,[0,7,9,13,14,16,18], [0.6931,0.5754,0.2877,0.2877,0.6931,0.6931,0.6931])|0|
+---------------------------------------------------------------------------+-----+
I think you have a slight misunderstanding of SparseVectors, so here is a short explanation. The first argument is the number of features (columns, or dimensions) of the data. Each entry of the list in the second argument is the position of a feature, and the corresponding entry of the list in the third argument is the value at that position. SparseVectors are therefore position-sensitive, and from my point of view your approach is incorrect.
If you look closely, you are summing or combining two vectors that have the same dimensions, so the real result would be different. The first argument tells us that each vector has only 2 dimensions, so [1,0] + [0,1] => [1,1], and the correct representation would be Vectors.sparse(2, [0,1], [1,1]), not a vector with four dimensions.
On the other hand, if each vector covers two different dimensions and you are trying to combine them in a higher-dimensional space, let's say four, then your operation might be valid. However, this functionality isn't provided by the SparseVector class, so you would have to write a function to do that, something like (a bit imperative, but I accept suggestions):
def combine(v1:SparseVector, v2:SparseVector):SparseVector = {
val size = v1.size + v2.size
val maxIndex = v1.size
val indices = v1.indices ++ v2.indices.map(e => e + maxIndex)
val values = v1.values ++ v2.values
new SparseVector(size, indices, values)
}
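The index shifting done by combine can be checked without Spark. The sketch below uses plain Python tuples of (size, indices, values) as a stand-in for a sparse vector; the helper names concat_sparse and to_dense are hypothetical and are not part of any Spark API.

```python
def concat_sparse(v1, v2):
    """Concatenate two sparse vectors given as (size, indices, values) tuples."""
    size1, idx1, val1 = v1
    size2, idx2, val2 = v2
    # Indices of the second vector are shifted past the end of the first one
    shifted = [i + size1 for i in idx2]
    return (size1 + size2, idx1 + shifted, val1 + val2)

def to_dense(v):
    """Expand a (size, indices, values) tuple into a plain list."""
    size, indices, values = v
    dense = [0.0] * size
    for i, x in zip(indices, values):
        dense[i] = x
    return dense

vec1 = (2, [0], [1.0])  # [1, 0]
vec2 = (2, [1], [1.0])  # [0, 1]
vec3 = concat_sparse(vec1, vec2)
print(vec3, to_dense(vec3))
```

For the two vectors from the question, the non-zero entries end up at positions 0 and 3 of a size-4 vector, which is the dense form [1, 0, 0, 1].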
If your vectors represent different columns of a DataFrame, you can use VectorAssembler. You just need to call setInputCols with your 2 vector columns and Spark will make your wish come true ;)

spark conversion from RowMatrix to CoordinateMatrix in scala

I am trying to perform some operations in Distributed matrix space in Scala Spark, and I am wondering if there is any straightforward way to convert a distributed RowMatrix to a distributed CoordinateMatrix?
For example, if
rdd_input = Array[org.apache.spark.mllib.linalg.Vector] = Array([10], [0], [0], [40])
The conversion of RowMatrix to 2 X 2 dense matrix was accomplished by,
rdd_input.zipWithIndex.groupBy{case (x, i) => i / 2}.map(_._2.map(_._1))
If the Matrix is sparse, how can I convert this matrix to a CoordinateMatrix?
Any help is appreciated.
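One way to think about the conversion: a coordinate matrix is a collection of (row, column, value) entries, so each indexed row can be expanded into entries for its non-zero values only. The sketch below shows that idea in plain Python on local lists rather than RDDs; the helper name rows_to_entries is hypothetical. In Spark itself, MLlib's IndexedRowMatrix has a toCoordinateMatrix method that may be worth checking first.

```python
def rows_to_entries(rows):
    """Expand dense rows into (row, col, value) entries, skipping zeros."""
    return [
        (i, j, value)
        for i, row in enumerate(rows)
        for j, value in enumerate(row)
        if value != 0
    ]

# The 2 x 2 matrix from the example: only two entries are non-zero
rows = [[10.0, 0.0], [0.0, 40.0]]
entries = rows_to_entries(rows)
print(entries)
```

Only the non-zero cells survive, which is what makes the coordinate form a good fit for sparse matrices.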

How would i vectorize that 'for' loop in Matlab?

I have this toy example:
l = [1, 2, 3, 4, 5];
a = zeros(3, 1);
for p = 1:3
a(p) = sum(l(p:p+2));
end;
This example calculates the sum of each 3 adjacent elements in 'l' and writes it to 'a'. Now, is it possible to rewrite the same code using vectorization and matrix operations, without a 'for' loop? I have tried something similar to:
p = 1:3;
a(p) = sum(l(p:p+2));
The result was [6, 6, 6]: it calculated the sum of the first three elements and wrote it to every position in 'a'. This is only a toy example, but I need to do something similar on a 3D 256x256x128 array, summing 20x20x20 cube elements; using 'for' loops was VERY slow there. So, is there a fast way to solve this problem?
Edit :
Divakar suggested that I post my actual case here. Well, here is the most important part of it, which I am trying to rewrite:
Edit2 :
Thank you for the answer. I understood the idea, removed the code above, and am rewriting it now.
You're describing a basic convolution:
a = conv(l,ones(1,3),'valid'); % a should have 3 elements
For 3d, use convn in a similar fashion with ones(20,20,20)
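The same idea can be sketched in NumPy, for readers who want to check the numbers outside MATLAB. np.convolve with mode 'valid' plays the role of conv. For the 3D case the sketch uses sliding_window_view (available in NumPy 1.20 and later) on a small 3x3x3 volume with 2x2x2 windows; those sizes are chosen only for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

l = np.array([1, 2, 3, 4, 5])

# Sliding sum of 3 neighbours: convolution with a kernel of ones
a = np.convolve(l, np.ones(3), mode="valid")
print(a)

# Reference result from the explicit loop in the question
loop = np.array([l[p:p + 3].sum() for p in range(3)])

# 3D analogue: sum over every 2x2x2 cube of a small volume
vol = np.arange(27).reshape(3, 3, 3)
box = sliding_window_view(vol, (2, 2, 2)).sum(axis=(-3, -2, -1))
print(box.shape)
```

With a kernel of ones, flipping the kernel has no effect, so convolution and a plain sliding sum coincide. For large windows such as 20x20x20, a convolution routine is likely to be the faster choice because the window view sums every cube from scratch.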