Why is Gini impurity greater than the weighted sum of Gini impurities of subnodes - classification

Say the decision tree has k classes (c1, c2, ..., ck) to classify and the dataset of the parent node is D. Pi denotes the proportion of elements labelled with class ci. The Gini impurity is:
Gini(D) = 1 - sum_{i=1}^{k} Pi^2
Suppose one partitions the node into subnodes with subsets D1 and D2, which are disjoint and together make up D. How can one prove:
Gini(D) >= (|D1|/|D|) * Gini(D1) + (|D2|/|D|) * Gini(D2)
I understand that the information gain should not be negative, so this inequality should hold. Could anyone help?
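One standard way to see this (a sketch I am adding here, not part of the original question) uses the convexity of x^2. Write w1 = |D1|/|D| and w2 = |D2|/|D| (so w1 + w2 = 1), and let Pi(1), Pi(2) be the proportions of class ci inside D1 and D2, so that Pi = w1*Pi(1) + w2*Pi(2). Then

Gini(D) = 1 - sum_i (w1*Pi(1) + w2*Pi(2))^2
        >= 1 - sum_i (w1*Pi(1)^2 + w2*Pi(2)^2)            (convexity of x^2 and w1 + w2 = 1)
        = w1*(1 - sum_i Pi(1)^2) + w2*(1 - sum_i Pi(2)^2)
        = w1*Gini(D1) + w2*Gini(D2),

with equality exactly when the class proportions in D1 and D2 are identical.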

Updating value of K in K-Means Clustering

What is the best way to cluster a dataset with no labels and no idea of the number of clusters required?
For example, using the Iris dataset with no labels or knowledge of the number of label classes.
My idea:
* For each sample, compute the mean squared distance from each of the existing clusters.
* If that distance exceeds some threshold by a factor that depends on (penalizes) k, add that sample as a "new" cluster candidate.
* If a new cluster was added, find the new "best" k+1 cluster centers.
* If no new cluster was added, go on to the next row.
What you can do is plot the elbow curve at different K-values as described here
Specifically,
1) The idea of the elbow method is to run k-means clustering on the dataset for a range of values of k (say, k from 1 to 10), and for each value of k calculate the sum of squared errors (SSE).
2) Then, plot a line chart of the SSE for each value of k. If the line chart looks like an arm, the "elbow" of the arm is the best value of k.
3) So our goal is to choose a small value of k that still has a low SSE; the elbow usually marks the point where increasing k starts to give diminishing returns. A minimal sketch of this procedure is given below.
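A short scikit-learn sketch of the elbow curve (the Iris data from the question is used only as a stand-in for any unlabeled feature matrix; the k range is arbitrary):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data                  # any unlabeled feature matrix works here

sse = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)           # inertia_ is the SSE for this k

plt.plot(list(k_values), sse, marker='o')
plt.xlabel('k')
plt.ylabel('SSE')
plt.show()                            # look for the "elbow" of this curve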
Dozens of methods have been proposed for choosing k.
Some variants such as X-means can adjust k dynamically; you only need to give the maximum and choose a quality criterion such as AIC or BIC.

Computing the SVD of a rectangular matrix

I have a matrix M of size K x N, where K is 49152 (the dimension of the problem) and N is 52 (the number of observations).
I have tried to use [U,S,V] = svd(M), but doing this I run out of memory.
I found another code which uses [U,S,V]=SVD(COV(M)) and it works well. My questions are what is the meaning of using the COV(M) command inside the SVD and what is the meaning of the resultant [U,S,V]?
Finding the SVD of the covariance matrix is one way to perform Principal Component Analysis, or PCA for short. I won't get into the mathematical details here, but PCA performs what is known as dimensionality reduction. If you'd like a more formal treatment of the subject, you can read my post about it here: What does selecting the largest eigenvalues and eigenvectors in the covariance matrix mean in data analysis?. Simply put, dimensionality reduction projects the data stored in the matrix M onto a lower-dimensional surface with the least amount of projection error. In this matrix, we assume that each column is a feature or dimension and each row is a data point.

I suspect the reason why more memory is occupied when you apply the SVD to the data matrix M itself, rather than to the covariance matrix, is that you have a large number of data points and a small number of features. The covariance matrix measures the covariance between pairs of features. If M is an m x n matrix, where m is the total number of data points and n is the total number of features, cov(M) gives you an n x n matrix, so you are applying the SVD to something that takes up a small amount of memory compared to M.
As for the meaning of U, S and V, for dimensionality reduction specifically, the columns of V are what are known as the principal components. V is ordered so that the first column is the axis of your data that describes the greatest amount of variability. As you move from the second column up to the nth column, you introduce additional axes and the variability described by each one decreases. By the time you reach the nth column, you are essentially describing your data in its entirety without reducing any dimensions. The diagonal values of S denote what is called the variance explained, and they follow the same ordering as V. As you progress through the singular values, they tell you how much of the variability in your data is described by each corresponding principal component.
To perform the dimensionality reduction, you can either take U and multiply by S, or take your mean-subtracted data and multiply by V. In other words, supposing X is the matrix M where each column has had its mean computed and then subtracted from that column, the following relationship holds:
US = XV
To actually perform the final dimensionality reduction, you take either US or XV and retain the first k columns, where k is the total number of dimensions you want to keep. The value of k depends on your application, but many people choose k to be the number of principal components that together explain a certain percentage of the variability in your data.
For more information about the link between SVD and PCA, please see this post on Cross Validated: https://stats.stackexchange.com/q/134282/86678
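To make the US = XV relationship concrete, here is a small NumPy sketch I am adding for illustration (the matrix size and k are arbitrary placeholders, not values from the question):

import numpy as np

M = np.random.randn(100, 6)          # placeholder: 100 data points, 6 features
X = M - M.mean(axis=0)               # subtract each column's (feature's) mean

U, s, Vt = np.linalg.svd(X, full_matrices=False)
S, V = np.diag(s), Vt.T

print(np.allclose(U @ S, X @ V))     # True: X V = U S because X = U S V'

k = 2                                # number of dimensions to retain
X_reduced = X @ V[:, :k]             # equivalently U[:, :k] @ S[:k, :k]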
Instead of [U, S, V] = svd(M), which tries to build a matrix U that is 49152 by 49152 (= 18 GB 😱!), do svd(M, 'econ'). That returns the "economy-class" SVD, where U will be 49152 by 52, S will be 52 by 52, and V will also be 52 by 52.
cov(M) will remove each dimension’s mean and evaluate the inner product, giving you a 52 by 52 covariance matrix. You can implement your own version of cov, called mycov, as
function [C] = mycov(M)
  M = bsxfun(@minus, M, mean(M, 1)); % subtract each dimension's (column's) mean over all observations
  C = (M' * M) / (size(M, 1) - 1);   % normalize by N-1 to match cov's default
end
(You can verify this works by looking at mycov(randn(49152, 52)), which should be close to eye(52), since each element of that array is IID-Gaussian.)
There are a lot of magical linear-algebraic properties and relationships between the SVD and EVD (i.e., the singular value and eigenvalue decompositions): because the covariance matrix cov(M) is a Hermitian (and positive semi-definite) matrix, its left- and right-singular vectors are the same, and they are in fact also cov(M)'s eigenvectors. Furthermore, cov(M)'s singular values are also its eigenvalues: so svd(cov(M)) is just an expensive way to get eig(cov(M)) 😂, up to ±1 and reordering.
As @rayryeng explains at length, usually people use svd(M, 'econ') because they want eig(cov(M)) without ever forming cov(M): computing cov(M) explicitly can be numerically unstable (forming M' * M squares the condition number). I recently wrote an answer that showed, in Python, how to compute eig(cov(M)) using svd(M2, 'econ'), where M2 is the zero-mean version of M, used in the practical application of color-to-grayscale mapping, which might give you more context.
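A small NumPy sketch of that SVD/eigendecomposition relationship (the matrix size is a placeholder; the covariance matrix is formed here only to check the result):

import numpy as np

M = np.random.randn(1000, 52)              # placeholder: 1000 observations, 52 dimensions
X = M - M.mean(axis=0)                     # zero-mean version of M

# economy-size SVD of the centered data: no covariance matrix is ever formed
U, s, Vt = np.linalg.svd(X, full_matrices=False)
eig_from_svd = s**2 / (X.shape[0] - 1)     # squared singular values of X, scaled by N-1

# eigenvalues of the covariance matrix, sorted in descending order
eig_direct = np.linalg.eigvalsh(np.cov(M, rowvar=False))[::-1]
print(np.allclose(eig_from_svd, eig_direct))   # True, up to floating-point error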

Matlab PCA order of principal components

So I read the documentation on pca, and it states that the columns are organized in descending order of their variance. However, whenever I take the PCA of an example and compute the variance of the returned matrix, I get no specific order. A simple example:
pc = pca(x)
Which returns
pc =
0.0036 -0.0004
0.0474 -0.0155
0.3149 0.3803
0.3969 -0.1930
0.3794 0.3280
0.5816 -0.2482
0.3188 0.1690
-0.1343 0.7835
0.3719 0.0785
0.0310 -0.0110
Meaning column one should be PC1 and column two should be PC2, so var(PC1) > var(PC2); but when I compute the variance, this is clearly not the case.
var(pc)
ans =
0.0518 0.0932
Can anyone shed light into why the variance of PC1 is not the largest?
The docs state that calling
COEFF = pca(x)
will return a p-by-p matrix, so your result is rather surprising (EDIT: this is because your x data set has so few rows compared to columns, i.e. similar to having 10 unknowns and only 3 equations). Either way, when they talk about variance, they don't mean the variance of the coefficients of each component but rather the variance of the x data columns after being projected onto each principal component. The docs state that the output score holds these projections, so to see the descending variance you should be doing:
[COEFF, score, latent] = pca(x)
var(score)
and you will see that var(score) equals the third output latent and is indeed in descending order.
Your misunderstanding is that you are trying to calculate the variance of the coefficients of the principal component vectors. These are just unit vectors describing the directions onto which to project your data such that the resulting projected data has maximum variance. These vectors ARE arranged so that the original data projected onto each of them is in descending order of variance, but that is the variance of the projected data (score), NOT of the coefficients of the principal component vectors (COEFF, or pc in your code).
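The same distinction in a short NumPy/scikit-learn sketch, since the behaviour is not MATLAB-specific (the data here is random and purely illustrative):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(50, 10) * np.arange(1, 11)   # 10 features with different scales

pca = PCA().fit(X)
score = pca.transform(X)                         # the projected data

print(np.var(score, axis=0, ddof=1))             # descending; matches pca.explained_variance_
print(np.var(pca.components_.T, axis=0))         # variance of the coefficient columns: no particular order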

JS divergence between two discrete probability distributions of unequal length

I am implementing an online topic modeling as outlined in the paper - On-line Trend Analysis with Topic Models: #twitter trends detection topic model online
I need to find the Jensen-Shannon divergence measure between the word distribution of each topic t before and after an update, and classify a topic as being novel if the measure exceeds a threshold.
At each update the vocabulary is updated, so the word distribution over the vocabulary has different length after each update. How can I calculate the JS divergence between two distributions of unequal length?
Jensen-Shannon divergence is a symmetrized, smoothed version of the Kullback-Leibler (KL) divergence: instead of comparing the two probability distributions to each other directly, it compares each of them to their average.
You will need a good understanding of KL divergence before you can proceed. Here is a good starting point:
Given two probability distributions P and Q over the same support,
P = (p1, ..., pn), Q = (q1, ..., qn),
the KL divergence is
KL(P||Q) = sum_i pi * log(pi / qi)
KL is not symmetric, hence it is not a metric. To obtain a symmetric measure, the Jensen-Shannon divergence averages the KL divergence of each distribution from their mixture Z:
JSD(P||Q) = (KL(P||Z) + KL(Q||Z)) / 2
where Z = (P + Q) / 2
In simple terms, the Jensen-Shannon divergence is the average KL divergence of P and Q from their average distribution.
I hope this helps.
One workaround (used below) is np.random.choice: subsample the longer distribution so that p and q end up with the same length.
import numpy as np
import scipy.stats

def jsd(p, q, base=np.e):
    '''
    Implementation of pairwise `jsd` based on
    https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence
    '''
    # if the lengths differ, randomly subsample the longer one to match
    if len(p) > len(q):
        p = np.random.choice(p, len(q))
    elif len(q) > len(p):
        q = np.random.choice(q, len(p))
    # convert to np.array
    p, q = np.asarray(p), np.asarray(q)
    # normalize p, q to probabilities
    p, q = p / p.sum(), q / q.sum()
    m = 1. / 2 * (p + q)
    return scipy.stats.entropy(p, m, base=base) / 2. + scipy.stats.entropy(q, m, base=base) / 2.
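For example, with two distributions of different lengths (the numbers are arbitrary; note that the random subsampling makes the result non-deterministic):

p = [0.10, 0.40, 0.25, 0.25]
q = [0.30, 0.50, 0.20]
print(jsd(p, q))   # a value between 0 and ln(2) ≈ 0.693 when base=np.e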

how to determine k value for the k nearest neighbours algorithm for a matrix in matlab

If we have a matrix with 6 rows and 10 columns, how do we determine the k value? If we assume the default k value is 5, and we have fewer than 5 columns (with the same 6 rows), can we take k = number of columns - 1? I.e., rows = 6, cols = 4, then k = cols - 1 => k = 3. Is that correct?
k = n^(1/2)
where n is the number of instances, not the number of features (reference 1, reference 2).
Check this question, value of k in k nearest neighbour algorithm
Same as the previous one. Usually, the rule of thumb is the square root of the number of features:
k = n^(1/2)
where n is the number of features. In your case, the square root of 10 is approximately 3, so the answer should be 3.
k = sqrt(n) does not give optimal results on every dataset; on some datasets its result is quite poor. For example, one paper from the 90s (paper link) reports that the best k lies between 5 and 10, while sqrt(n) gives 17. Other papers make interesting suggestions such as a local k value or a weighted k. Clearly, choosing k is not easy: there is no simple formula, and the best value depends on the dataset. The best way to choose an optimal k is to measure which k gives the best accuracy on our dataset. Generally, as the dataset gets bigger, the optimal k value also tends to increase.
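A minimal sketch of that accuracy-based selection using scikit-learn cross-validation (the dataset and the candidate k range are placeholders):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for k in range(1, 21):                                    # candidate k values
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()   # mean accuracy over 5 folds

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])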