I'm using Spark to run LinearRegression.
Since my data can't be fitted well by a linear model, I added some higher-degree polynomial features to get a better result.
This works fine!
Instead of modifying the data myself, I wanted to use the PolynomialExpansion function from the Spark library.
To find the best solution I used a loop over different degrees.
After 10 iterations (degree 10) I ran into the following error:
Caused by: java.lang.IllegalArgumentException: requirement failed: You provided 77 indices and values, which exceeds the specified vector size -30.
My trainingData has 2 features. This sounds like I end up with too many features after the polynomial expansion at degree 10, but the vector size of -30 confuses me.
To track this down I started experimenting with different example data and degrees. For testing I used the following lines of code with different testData files (each containing only one line) in libsvm format:
val data = spark.read.format("libsvm").load("data/testData2.txt")
val polynomialExpansion = new PolynomialExpansion()
.setInputCol("features")
.setOutputCol("polyFeatures")
.setDegree(10)
val polyDF2 = polynomialExpansion.transform(data)
polyDF2.select("polyFeatures").take(3).foreach(println)
ExampleData: 0 1:1 2:2 3:3
polynomialExpansion.setDegree(11)
Caused by: java.lang.IllegalArgumentException: requirement failed: You provided 333 indices and values, which exceeds the specified vector size 40.
ExampleData: 0 1:1 2:2 3:3 4:4
polynomialExpansion.setDegree(10)
Caused by: java.lang.IllegalArgumentException: requirement failed: You provided 1000 indices and values, which exceeds the specified vector size -183.
ExampleData: 0 1:1 2:2 3:3 4:4 5:5
polynomialExpansion.setDegree(10)
Caused by: java.lang.IllegalArgumentException: requirement failed: You provided 2819 indices and values, which exceeds the specified vector size -548.
It looks like the number of features in the data has an effect on the highest possible degree, but the number of features after the polynomial expansion does not seem to be the cause of the error, since it varies a lot.
The code also doesn't crash at the expansion call itself, but when I try to print the new features in the last line of code.
I thought my memory might have been full at that point, but I checked the system monitor and there was still free memory available.
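One possible explanation, and this is an assumption on my part rather than something confirmed from the Spark source, is 32-bit integer overflow: if the expanded vector size is computed as the binomial coefficient C(numFeatures + degree, degree) using plain Int products, the intermediate numerator wraps around and the resulting size comes out small or negative. The following hypothetical Python sketch simulates JVM Int arithmetic and reproduces exactly the sizes from the errors above:

```python
def int32(v):
    """Wrap v to a signed 32-bit integer, like a JVM Int."""
    v &= 0xFFFFFFFF
    return v - 0x100000000 if v >= 0x80000000 else v

def choose32(n, k):
    """C(n, k) = prod(n-k+1..n) / prod(1..k), computed with 32-bit wraparound."""
    num = 1
    for i in range(n - k + 1, n + 1):
        num = int32(num * i)          # numerator overflows for larger n
    den = 1
    for i in range(1, k + 1):
        den = int32(den * i)
    # JVM integer division truncates toward zero (Python's // floors instead)
    q = abs(num) // abs(den)
    return -q if (num < 0) != (den < 0) else q

def expanded_size(num_features, degree):
    """Size of the expanded vector, excluding the constant term."""
    return choose32(num_features + degree, degree) - 1

print(expanded_size(3, 11))   # 40   (matches "vector size 40")
print(expanded_size(4, 10))   # -183 (matches "vector size -183")
print(expanded_size(5, 10))   # -548 (matches "vector size -548")
```

For 3 features at degree 10 the numerator (1037836800) still fits in a signed 32-bit Int, so there is no overflow, which would explain why the crash only appears past a certain degree. And the failure surfacing only at the println line fits Spark's lazy evaluation: the transformation actually runs when an action such as take is executed.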
I'm using:
Eclipse IDE
Maven project
Scala 2.11.7
Spark 2.0.0
Spark-mllib 2.0.0
Ubuntu 16.04
I'd be glad for any ideas regarding this problem.
Related
I have a Sugeno 2-input, 1-output fuzzy system with 5 MFs per rule and 5 MFs for the output. However, whenever I try to train it, I receive the following error:
As you can see, the number of rules and the number of output membership functions are the same. I am also posting the console output below.
ANFIS info:
Number of nodes: 23
Number of linear parameters: 9
Number of nonlinear parameters: 12
Total number of parameters: 21
Number of training data pairs: 2084
Number of checking data pairs: 0
Number of fuzzy rules: 3
Start training ANFIS ...
1 0.0163803
2 0.0163785
Designated epoch number reached --> ANFIS training completed at epoch 2.
Too many outputs requested. Most likely cause is missing [] around left hand side that has a comma
separated list expansion.
Error in fisgui (line 91)
name=nameList{currGui};
Error in mfedit (line 669)
fisgui #findgui
Error in mfedit (line 602)
mfedit #selectvar
Error in mfdlg (line 296)
mfedit('#update',varType,varIndex)
Error using waitfor
Error while evaluating DestroyedObject Callback
I am relatively new to Matlab, so I am terribly sorry if I asked something trivial.
Finally I found out that, unfortunately, you can't have the same consequent for different antecedents when training a fuzzy system, and I had such rules. However, this is very inconvenient when you want to train a fuzzy system with many input membership functions.
I have to set up a phoneme table with a specific probability distribution for encoding things.
There are 22 base elements (each with an assigned probability, summing to 100%), which shall be mapped onto a 12-element table with desired element probabilities (also summing to 100%).
So part of the minimisation is to merge several base elements to form the 12 table elements. Each base element must occur exactly once.
In addition, the table has 3 rows, so the same 12-element composition of the 22 base elements must minimise the error for 3 target vectors. Let's say the given target vectors are b1, b2, b3 (dimension 12x1), the given base vector is x (dimension 22x1), and they are connected by the unknown matrix A (12x22) via:
b1+err1=Ax
b2+err2=Ax
b3+err3=Ax
To sum it up: A is to be found so that dot_prod(err1+err2+err3, err1+err2+err3) is minimised (least squares). And, according to the above explanation, A must contain only 1's and 0's, with exactly one 1 per column.
Unfortunately I have no idea how to approach this problem. Can it be expressed in a way different from the matrix-vector form?
Which tools in matlab could do it?
I think I found the answer while parsing some sections of the Matlab documentation.
First of all, the problem can be rewritten as:
errSum = err1 + err2 + err3 = 3Ax - b1 - b2 - b3
so the task is to minimise dot_prod(errSum, errSum) over A.
Writing out the dot product (least squares) yields a quadratic scalar expression in the entries of A.
Syntax-wise, the fmincon tool from the Optimization Toolbox could do the job. It takes constraint parameters, which make it possible to force each Aij to be binary and each column to sum to 1.
But algorithm-wise, fmincon is apparently not ideal for binary problems; the ga tool, which can be called in a similar way, should be used instead.
Since the equation would be very long in my case and would need to be written out, I haven't tried it yet. Please correct me if I'm wrong, or add further solution methods if available.
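To make the setup concrete, here is a small sketch in Python/NumPy. This is an illustration only, with made-up sizes of 5 base elements and 3 table slots instead of 22 and 12, since the full search space of 12^22 assignments would indeed need ga or a similar combinatorial solver:

```python
from itertools import product
import numpy as np

rng = np.random.default_rng(0)
n_base, n_table = 5, 3                 # toy sizes; the real problem is 22 -> 12
x = rng.random(n_base)                 # base probabilities (the 22x1 vector x)
b = rng.random((3, n_table))           # the three target vectors b1, b2, b3

def objective(assign):
    """assign[j] = table row that base element j is merged into,
    i.e. the row of A holding the single 1 in column j."""
    A = np.zeros((n_table, n_base))
    A[list(assign), np.arange(n_base)] = 1.0   # exactly one 1 per column
    err_sum = 3 * A @ x - b.sum(axis=0)        # errSum = 3Ax - b1 - b2 - b3
    return float(err_sum @ err_sum)

# Exhaustive search is only feasible at toy size (3^5 = 243 assignments)
best = min(product(range(n_table), repeat=n_base), key=objective)
```

The one-1-per-column constraint turns A into a plain assignment vector, which is also a natural encoding for ga-style solvers.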
I want to fit a dataset with a Gaussian mixture model. The dataset contains about 120k samples, and each sample has about 130 dimensions. To do this in MATLAB, I run the following script (with cluster number 1000):
gm = fitgmdist(data, 1000, 'Options', statset('Display', 'iter'), 'RegularizationValue', 0.01);
I get the following outputs:
iter log-likelihood
1 -6.66298e+07
2 -1.87763e+07
3 -5.00384e+06
4 -1.11863e+06
5 299767
6 985834
7 1.39525e+06
8 1.70956e+06
9 1.94637e+06
The log-likelihood is greater than 0! I think that's unreasonable, but I don't know why.
Could somebody help me?
First of all, it is not a problem of how large your dataset is.
Here is some code that produces similar results with a quite small dataset:
options = statset('Display', 'iter');
x = ones(5,2) + (rand(5,2)-0.5)/1000;
fitgmdist(x,1,'Options',options);
this produces
iter log-likelihood
1 64.4731
2 73.4987
3 73.4987
Of course you know that the log function (the natural logarithm) has a range from -inf to +inf. I guess your problem is that you think its input (i.e. the a posteriori function) should be bounded by [0,1]. Well, the a posteriori function is a pdf, which means its value can be very large for a very dense dataset.
PDFs must be positive (which is why we can use the log on them) and must integrate to 1. But they are not bounded by [0,1].
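The same point can be shown in a few lines of Python (plain math, no toolboxes): a narrow Gaussian has a peak value far above 1, so its log is positive:

```python
import math

def gaussian_pdf(t, mu, sigma):
    """Univariate normal density; its peak value is 1 / (sigma * sqrt(2*pi))."""
    z = (t - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

peak = gaussian_pdf(0.0, 0.0, 0.001)   # very concentrated: sigma = 0.001
print(peak)            # ~398.94 -- a pdf value well above 1
print(math.log(peak))  # ~5.99  -- hence a positive log-likelihood
```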
You can verify this by reducing the density in the above code:
x = ones(5,2) + (rand(5,2)-0.5)/1;
fitgmdist(x,1,'Options',options);
this produces
iter log-likelihood
1 -8.99083
2 -3.06465
3 -3.06465
So, I would rather assume that your dataset contains several duplicate (or very close) values.
I'm trying to train a cascade classifier with the built-in MATLAB function trainCascadeObjectDetector, but it always shows the following error message when I call it:
trainCascadeObjectDetector('MCsDetector.xml',positiveInstances(1:5000,:),'./negativeSubFolder/',...
'FalseAlarmRate',0.01,'NumCascadeStages',5, 'FeatureType', 'LBP');
Automatically setting ObjectTrainingSize to [ 32, 32 ]
Using at most 980 of 1000 positive samples per stage
Using at most 1960 negative samples per stage
265 ocvTrainCascade(filenameParams, trainerParams, cascadeParams, boostParams, ...
Training stage 1 of 5
[....................................................Time to train stage 1: 12 seconds
Error using ocvTrainCascade
Error in generating samples for training. No samples could be generated for training the first cascade stage.
Error in trainCascadeObjectDetector (line 265)
ocvTrainCascade(filenameParams, trainerParams, cascadeParams, boostParams, ...
I have 5000 positive images and 11000 negative images. The MATLAB version is 2014a, running on Ubuntu 12.04.
I am not sure whether I need more training data, because the error message is:
Error in generating samples for training. No samples could be generated for training the first cascade stage.
Could you please have a look at this? Thanks!
First of all, what is the data type of positiveInstances? It should be a 1D array of structs with two fields: imageFileName and objectBoundingBoxes. positiveInstances(1:5000,:) looks a bit suspicious, because you are treating it as a 2D matrix.
The second thing to check is the negativeSubFolder. It should contain a lot of images without the objects of interest to be able to generate 1960 negative samples per stage.
For future reference, there is a tutorial in the MATLAB documentation.
I'm comparing ILMath.vec() with its MATLAB equivalent and I'm seeing significant rounding errors.
For example, if I create a vector (using Start:0, Step:1.2635048525911006, End:20700) for each system:
MatLab: [Start:Step:End]
ILNumerics: Vec<double>(Start, Step, End)
And then take the average absolute difference, I get an average error of 1.56019608343883E-09. However, if I create the vector by hand (using multiplication), I get an average error of only 3.10973469197506E-13, four orders of magnitude smaller.
After looking at ILNumerics' vec function (using Reflector), I think I know why the average error is so large: ILMath.vec() uses repeated addition instead of multiplication. Summing the step value 16,384 times is not the same as multiplying the step value by N (where N is the current loop index) for each of the 16,384 elements. The addition's rounding errors accumulate very quickly.
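The effect is easy to reproduce in any language; here is a small Python sketch (step and count chosen to mirror the numbers above):

```python
step, n = 1.2635048525911006, 16384

# Repeated addition, as the decompiled vec() appears to do
acc, added = 0.0, []
for i in range(n):
    added.append(acc)
    acc += step                    # one rounding error per iteration, accumulating

# Multiplication: a single rounding step per element
multiplied = [i * step for i in range(n)]

max_diff = max(abs(a - m) for a, m in zip(added, multiplied))
# max_diff is tiny in absolute terms, but orders of magnitude above
# the single-rounding error of the multiplication form
```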
Please consider fixing this issue.