pyspark.ml random forest model feature importances result empty? [closed] - pyspark

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 1 year ago.
I am training a RandomForestClassifier in pyspark.ml and when trying to get the feature importances of the trained model via featureImportances attribute of the Estimator, I am seeing nothing in the returned tuple for the feature indices or importance weights:
(37,[],[])
I'd expect something like...
(37,[<feature indices>],[<feature importance weights>])
...(certainly not having it just be totally blank). It is odd b/c it appears to recognize that there are 37 features, but does not have any info in the other lists. Nothing in the docs seems to address this.
What could be going on here?

TLDR: A sparse vector prints as its size followed by the indices and values of its non-zero entries only. If your sparse vector prints empty, every value in it is zero.
Checking/printing the type of the RandomForestClassificationModel's featureImportances attribute, we can see that it is a SparseVector. In most cases when a sparse vector is printed, you see something like...
(<size>, <list of non-zero indices>, <list of non-zero values associated with the indices>)
...(if anyone has any links to documents confirming that this is how to interpret a sparse vector, do let me know b/c I can't remember how I know this or where this can be confirmed).
An example of how SparseVectors are printed is shown below:
from pyspark.ml.linalg import SparseVector  # pyspark.ml models use pyspark.ml.linalg vectors
import pprint
a = SparseVector(5,{})
print(a)
# (5,[],[])
pprint.pprint(a)
# SparseVector(5, {})
pprint.pprint(a.toArray())
# array([0., 0., 0., 0., 0.])
b = SparseVector(5,{0:1, 2:3, 4:5})
print(b)
# (5,[0,2,4],[1.0,3.0,5.0])
pprint.pprint(b)
# SparseVector(5, {0: 1.0, 2: 3.0, 4: 5.0})
pprint.pprint(b.toArray())
# array([1., 0., 3., 0., 5.])
So if you are getting a sparse vector like (<size>, [], []) for your featureImportances, it means every importance is exactly zero. As far as I can tell, the importances are computed from the impurity gain at each split and then normalized, so an all-zero vector most likely means the trees made no splits at all (i.e. sadly, the Estimator could not find a useful split on your/my chosen features, and more data analysis is in order).
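To make the printed form concrete, here is a minimal plain-Python sketch (no Spark needed) that expands a (size, indices, values) triple into a dense list. The helper name sparse_to_dense is made up for illustration and is not part of pyspark; on a real vector, toArray() does this job.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list.

    Hypothetical helper for illustration only; pyspark vectors offer toArray().
    """
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense

# The "empty" result from the question: 37 features, no non-zero entries.
all_zero = sparse_to_dense(37, [], [])

# The second example from the answer: three non-zero entries out of five.
mixed = sparse_to_dense(5, [0, 2, 4], [1, 3, 5])
```

An empty index list therefore does not mean missing data; it means all 37 entries are 0.0.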

Related

Scikit-Learn's DPGMM fitting: number of components?

I'm trying to fit a mixed normal model to some data using scikit-learn's DPGMM algorithm. One of the advantages advertised on [0] is that I don't need to specify the number of components; which is good, because I do not know the number of components in my data. The documentation states that I only need to specify an upper bound. However, it looks very much like that is not true:
>>> data = numpy.random.normal(loc = 0.0, scale = 1.0, size = 1000)
>>> from sklearn.mixture import DPGMM
>>> d = DPGMM(n_components=5)
>>> d.fit(data.reshape(-1,1))
DPGMM(alpha=1.0, covariance_type='diag', init_params='wmc', min_covar=None,
n_components=5, n_iter=10, params='wmc', random_state=None, thresh=None,
tol=0.001, verbose=0)
>>> d.n_components
5
>>> d.means_
array([[-0.02283383],
[ 0.06259168],
[ 0.00390097],
[ 0.02934676],
[-0.05533165]])
As you can see, the fitting reports five components (the upper bound) even for data clearly sampled from just one normal distribution.
Am I doing something wrong? Did I misunderstand something?
Thanks a lot in advance,
Lukas
[0] http://scikit-learn.org/stable/modules/mixture.html#dpgmm
I recently had similar doubts about the results of this DPGMM implementation. If you check the provided example, you will notice that DPGMM always returns a model with n_components components; the trick is to remove the redundant ones. This can be done with the predict function: only the components that predict actually assigns points to are in use.
Unfortunately this important piece is hidden in a comment in the code example.
# as the DP will not use every component it has access to
# unless it needs it, we shouldn't plot the redundant components
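The pruning step can be sketched in plain Python. This assumes labels holds what d.predict(X) returned; the label values below are invented for illustration, and used_components is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

def used_components(labels, n_components):
    """Map each component that received at least one point to its point count."""
    counts = Counter(labels)
    return {k: counts[k] for k in range(n_components) if counts[k] > 0}

# Hypothetical predict() output for data drawn from a single Gaussian:
# the model has 5 components, but every point lands in component 2.
labels = [2, 2, 2, 2, 2, 2, 2, 2]
in_use = used_components(labels, n_components=5)
n_effective = len(in_use)
```

Components missing from in_use are the redundant ones that should be ignored when plotting or reporting.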
Perhaps look at using an improved sklearn solution for this kind of problem, namely a Bayesian Gaussian Mixture (sklearn.mixture.BayesianGaussianMixture, which replaced DPGMM in later releases). With this model, a suggested upper bound on the number of components must still be given, but once trained, the model assigns a weight to each component (the weights_ attribute), which essentially indicates its relevance. Here is a pretty cool visual demo of BGMM in action.
Once you have experimented with training a few BGMMs on your data, you can get a feel for a sensible estimate to the number of components for your given problem.
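One rough way to turn those weights into a component count is to keep only the components above a small cut-off. The sketch below is plain Python; the weights and the 0.01 threshold are made-up values for illustration, and in practice the list would come from the fitted model's weights_ attribute.

```python
def effective_components(weights, threshold=0.01):
    """Indices of components whose mixture weight exceeds the threshold.

    The threshold is an arbitrary choice for illustration; pick one that
    suits your data.
    """
    return [k for k, w in enumerate(weights) if w > threshold]

# Hypothetical weights from a 5-component fit on single-Gaussian data.
weights = [0.97, 0.02, 0.005, 0.004, 0.001]
kept = effective_components(weights)
```

Repeating this over a few fits with different upper bounds gives a feel for how stable the estimate is.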

MATLAB - singularity warning [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question appears to be off-topic because it lacks sufficient information to diagnose the problem. Describe your problem in more detail or include a minimal example in the question itself.
Closed 8 years ago.
When my MATLAB code gets to the line:
vE(:,:,i)=(mY(:,:,i))\(-mA*(vIs-mG(:,:,i)*vVs));
The following warning comes up:
Warning: Matrix is close to singular or badly scaled. Results may be inaccurate.
RCOND = 1.682710e-16.
What's wrong?
Full code:
function [ vE, vV_node, vI_node ] = ...
node_analysis( vIs, vVs, mA, mG, mY )
[A,B,N]=size(mY);
vE=zeros(4,1,N);
for i=1:N
vE(:,:,i)=(mY(:,:,i))\(-mA*(vIs-mG(:,:,i)*vVs));
vV_node(:,:,i)=mA'*vE(:,:,i);
vI_node(:,:,i)=mG(:,:,i)*vV_node(:,:,i)+(vIs-mG(:,:,i)*vVs);
end
end
vE=mY^-1 * (-mA*(vIs-mG*vVs))
vE is (4x1xN) size
mY(4x4xN)
mA(4x9)
vIs(9x1)
mG(9x9xN)
vVs(9x1)
When you use the \ operator with a matrix, MATLAB solves the equation y = A*x for x given y: directly as a linear system when A is square, and in the least-squares sense when it is not. Depending on the size and shape of A, solving this equation might be easy, hard, or impossible without additional information. It just depends on your particular problem.
As Oli mentioned in the comments, the warning appears because your matrix is close to singular, i.e. some of its singular values are close to zero relative to the largest one. MATLAB is properly informing you that the matrix either lacks information needed to pin down the answer, or contains pieces that are so small compared to the others that solving for x becomes almost impossible and error prone. RCOND is MATLAB's estimate of the reciprocal condition number; a value near machine epsilon (about 2.2e-16), like the 1.68e-16 here, means the result may have no accurate digits.
Depending on your math background, you might consider the following code, where I create a matrix with one very small component. This will reproduce your warning:
%% Make some data:
rng(1982); % seed the generator (replaces the legacy randn('seed', ...) syntax)
n = 3;
A = zeros(n);
for ind = 1:n-1
v = randn(n,1);
A = A + v*v';
end
% Last bit is very tiny compared to the others:
A = A + 1e-14*randn(n,1)*randn(1,n);
%% Try and solve Ax=y for x= 1,2,3...
x = (1:n)';
y = A*x
x_est = A \ y
There are various ways to start trying to fix this, usually by reformulating the problem and/or adding some kind of regularization term. A good first try, though, is to add a simple Tikhonov regularization, which bumps up all the small values to something reasonable that MATLAB can work with. This may distort your data, but you can play with it.
Roughly, try this:
tikk = 1e-12;
x_est2 = (A + tikk * eye(n)) \ y
Try larger or smaller values of tikk and you will see that the warning goes away, but the solution is to some degree wrong. You might find this acceptable or not.
Note that in my example the answer is quite wrong because I used n=3. As you increase the problem size n you will get better results.
Finally, to begin exploring what is wrong with your matrix A (mY(:,:,i) in your code, the matrix on the left of the \), you might consider seeing how fast the values s in s=svd(A) decay. Some of them should be quite close to zero. Also, you might look at Tikhonov regularization and what you can do by actually decomposing the matrix into the SVD and scaling things better.
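To see the effect of the regularization on conditioning without MATLAB, here is a small plain-Python sketch restricted to 2x2 matrices, where the 1-norm condition number has a closed form. The function names and the example matrix are invented for illustration; for real problems use cond or svd on the full matrix.

```python
def cond1_2x2(m):
    """1-norm condition number of a 2x2 matrix [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        return float("inf")
    # Largest absolute column sum of the matrix ...
    norm = max(abs(a) + abs(c), abs(b) + abs(d))
    # ... and of its inverse, (1/det) * [[d, -b], [-c, a]].
    inv_norm = max(abs(d) + abs(c), abs(b) + abs(a)) / abs(det)
    return norm * inv_norm

def add_ridge(m, lam):
    """Return m + lam * I, the simple Tikhonov-style diagonal loading."""
    (a, b), (c, d) = m
    return [[a + lam, b], [c, d + lam]]

# Second row is almost exactly twice the first, so the matrix is nearly singular.
near_singular = [[1.0, 2.0], [2.0, 4.0 + 1e-12]]

raw = cond1_2x2(near_singular)
regularized = cond1_2x2(add_ridge(near_singular, 1e-6))
```

The loaded matrix is conditioned far better than the original, which is why the warning disappears, but it is also a different matrix, which is why the solution shifts.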

Find statement in matlab and multiple conditions

The following code is embedded in a function
rsp = find(response_times >= current1 & response_times < current2 & response_times ~= current2);
Here, I am looking for the indices of responses that occur between current1 and current2, where current1 and current2 are times such as 16.22 and 16.32, respectively, and the response times can be equal to current1 but not current2.
For the most part this works as intended, however, every so often it pulls an index of a value equal to current2.
Does anyone know why this might be the case or how I can improve this one line of code to fix it?
Here is an example array this code operates on:
response_times = [ 8.73; 11.43; 13.48; 14.79; 16.32; 18.04; 20.38; 20.99; 21.34; 24.28; 24.68 ];
It's not actually (exactly) equal to current2. You generally shouldn't compare floating point numbers for equality. Read this article for more information, but the essence of the problem is that most values cannot be represented exactly with a floating point representation (i.e. IEEE 754), hence it is not advisable to test for equality. Inequalities are fine. There is actually a neat mini-site dedicated to providing a basic explanation of the issue.
For the Stack Overflow matlab version of the explanation see this Q&A, entitled "Why is 24.0000 not equal to 24.0000 in MATLAB?". It's quite an interesting read!
To verify that your inequality is actually working, have it compute abs(response_times-current2) and you should find that the value is not zero in those cases, but rather something small like 1.35e-15. If you want to reject these "too close" values, include a test such as (current2-response_times)>tol and set tol to something large enough to reject these points.
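The same effect and the tolerance fix can be reproduced in a few lines of plain Python (shown here instead of MATLAB; note that the indices are 0-based). The numbers are chosen for illustration because 0.1 + 0.2 is a well-known case that lands just above 0.3; in_half_open_range is a hypothetical helper name.

```python
def in_half_open_range(times, lo, hi, tol=0.0):
    """Indices i with lo <= times[i] < hi.

    Values within tol of a bound are treated as equal to it, so a value
    that is "really" hi is rejected even if rounding put it just below.
    """
    return [i for i, t in enumerate(times)
            if t >= lo - tol and (hi - t) > tol]

times = [0.1, 0.2, 0.3, 0.4]
current1 = 0.2
current2 = 0.1 + 0.2   # displays like 0.3 but is slightly larger

naive = in_half_open_range(times, current1, current2)
tolerant = in_half_open_range(times, current1, current2, tol=1e-9)
gap = current2 - times[2]   # tiny but non-zero
```

Here naive wrongly includes the 0.3 entry, while tolerant drops it. The tolerance should be well above the rounding noise and well below the spacing of your real data.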

Matlab exercise: i just don't get it [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
Given the signals:
f1[n] = sinc[n] {1[n+5]-1[n-5]}
f2[n] = 1-rect[n]
f3[n] = 1[n]-1[n-5]
Write a program in MATLAB in which you will check the following properties:
1)sinc[n]:=sin(phi*n)/phi*n;
2)(f1*f2)[n] = (f2*f1)[n];
3)f1[n]*{ f2[n] + f3[n] } = f1[n]*f2[n] + f1[n]*f3[n];
4)(f1*delta)[n] = (delta*f1)[n] = f1[n];
I'd be really grateful for any tips/ideas on how to solve this problem. :)
sinc[n]:=sin(phi*n)/phi*n;
That certainly isn't Matlab syntax, and the ; at the end makes it not look much like a question either. Anyway, you have two options. Either plot the functions to visually assess equivalence or else check the vectors. I'll demonstrate with this one, then you can try for all the others.
Firstly you need to make a sample n vector which will be your domain over which to test equivalence (i.e. the x values of your plot). I'm going to arbitrarily choose:
n = -10:0.01:10;
Also I'm going to assume that by phi you actually meant pi, based on the Matlab definition of sinc: http://www.mathworks.com/help/signal/ref/sinc.html
So now we have two functions:
a = sinc(n);
b = sin(pi*n)./(pi*n); % NaN wherever n is exactly 0 (0/0), while sinc(0) is 1
a and b are now vectors with a corresponding "y" value for each element of n. You'll also notice I used a . before the /; this means element-wise division, i.e. divide each element by the corresponding element, rather than matrix division, which is conceptually inversion followed by matrix multiplication.
Now let's plot them:
plot(n, a, n, b, 'r')
and finally to check numerical equivalence we could do this:
all(a == b)
But (and this is probably a bit out of scope for your question but important to know) you should actually never check for exact equality of floating point numbers like that, as you get precision errors due to different rounding in the inner calculations (because of how your computer stores floating point numbers). So instead it is good practice to check that the absolute difference between the two numbers is less than some tiny threshold. (Any NaN, such as the 0/0 at n = 0, still fails this test and has to be handled separately.)
all(abs(a - b) < 0.000001)
I'll leave the rest up to you
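For the convolution properties, the same tolerance-based comparison applies. Below is a plain-Python sketch (not MATLAB) under some assumptions of mine: the signals are sampled on n = -10..10 and treated as zero elsewhere, 1[n] is the unit step, and f2 is left out because the exercise does not say how wide rect[n] is. The helpers conv and close are written out by hand for illustration.

```python
import math

def sinc(x):
    """Normalized sinc, sin(pi*x)/(pi*x), with the x = 0 case handled."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def conv(a, b):
    """Full discrete convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def close(a, b, tol=1e-9):
    """True if both sequences have equal length and agree within tol."""
    return len(a) == len(b) and all(abs(x - y) < tol for x, y in zip(a, b))

n = list(range(-10, 11))
f1 = [sinc(k) if -5 <= k <= 4 else 0.0 for k in n]   # sinc[n]{1[n+5]-1[n-5]}
f3 = [1.0 if 0 <= k <= 4 else 0.0 for k in n]        # 1[n]-1[n-5]
delta = [1.0 if k == 0 else 0.0 for k in n]

# Property 2 (commutativity), here with f3 standing in for f2.
commutative = close(conv(f1, f3), conv(f3, f1))

# Property 3 (distributivity), with delta standing in for the second term.
f3_plus_delta = [x + y for x, y in zip(f3, delta)]
distributive = close(
    conv(f1, f3_plus_delta),
    [x + y for x, y in zip(conv(f1, f3), conv(f1, delta))],
)

# Property 4 (identity): the centre of the full convolution lines up with n.
mid = len(n) // 2
identity = close(conv(f1, delta)[mid:mid + len(n)], f1)
```

In MATLAB the built-in conv plays the role of the hand-written function, and the comparison becomes all(abs(x - y) < tol).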

Find Audio Peaks in MATLAB [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
I have an audio signal of about the size 7000000 x 1. I have used the peakfinder m file in MATLAB to find the location of all of the peaks in the audio file above a specific threshold. I am now trying to find a frame sized 1000000 x 1 that contains the greatest amount of peaks. I am completely lost on how to do this and any help would be greatly appreciated. Thank you!
Well, roughly all a peak finder function does is look at the derivative of the signal: a local maximum is where the first difference changes sign from positive to negative. (A negative second difference on its own only tells you the signal is curving downwards, which also happens away from peaks.) So you can do something very similar to find any local maximum.
Once you have these indices, you can window the array containing a logical representation of the locations, and count how many peaks are there.
The code below will do what I am saying. It will window across and count the number of peaks found, and return a vector of the counts, which you can then just find the max of, and then you have the location of the densest window.
clc; close all; clear all;
A = randi(10,[1,100])
plot(A)
hold on
C = diff(sign(diff(A))); % negative where the slope goes from rising to falling
indices = find(C < 0)+1;
scatter(indices,A(indices),'r')
temp = zeros(size(A));
temp(indices) = 1;
window = ones(1,5);
results = conv(temp,window,'same');
[maxCount, centerIdx] = max(results) % with 'same', centerIdx is (roughly) the centre of the best window
This is of course a toy example. A would be your signal, and window would be a vector the length of the range you want to examine, in your case 1000000.
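If you want the starting index of the best frame directly, a running count avoids the convolution altogether. The following plain-Python sketch (0-based indices, toy data, hypothetical function names) shows the idea; it assumes you already have the peak indices, for example from peakfinder.

```python
def local_maxima(signal):
    """Indices whose value is strictly greater than both neighbours."""
    return [i for i in range(1, len(signal) - 1)
            if signal[i - 1] < signal[i] > signal[i + 1]]

def densest_window(peaks, n, width):
    """Return (start, count) of the length-`width` window with the most peaks.

    Slides the window one sample at a time and updates the count
    incrementally, so the cost is linear in n.
    """
    is_peak = [0] * n
    for p in peaks:
        is_peak[p] = 1
    count = sum(is_peak[:width])
    best_start, best_count = 0, count
    for start in range(1, n - width + 1):
        # One sample enters on the right, one leaves on the left.
        count += is_peak[start + width - 1] - is_peak[start - 1]
        if count > best_count:
            best_start, best_count = start, count
    return best_start, best_count

signal = [0, 1, 0, 0, 0, 2, 0, 3, 0, 1, 0, 0]
peaks = local_maxima(signal)
start, count = densest_window(peaks, len(signal), width=5)
```

For the real signal, n would be about 7000000 and width 1000000; ties are resolved in favour of the earliest window.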
Edit
As Try Hard has made note of in the comments below, this method will be fairly susceptible to noise, so what you can do first is run a smoothing filter over the signal before doing any derivatives, something like as follows.
filtLength = 5; % example value only; tune it to your signal
filt = (1/filtLength) * ones(1,filtLength);
A = conv(A,filt,'same');
This is a simple averaging filter, which will help smooth out some of the noise.