Which features should be selected for my clustering analysis? - cluster-analysis

I want to cluster firms based on balance sheet data.
I have access to very detailed balance sheet data for firms. The dataset contains more than 1000 features for more than 1000 firms. My goal is to cluster those firms with respect to their business model based on a subset of those features. Since I am interested in the business model of the firms, I will scale the features by total assets. This should lower the predominant effect of firm size on the clustering result.
In addition to some sort of analytical dimensionality reduction that I will perform, I also want to run the cluster analysis after intuitively reducing the number of features to use. Here, the nested nature of the features makes it hard for me to understand how the feature selection affects the clustering result. Let me explain.
Generally, I have three types of features (X, Y and Z) in the aggregated balance sheet. Features of type X have sub-variables x1, x2 and x3 that sum up to exactly X. Features of type Y have sub-variables y1 and y2 that sum up to less than Y, which means that there is some amount in Y that is not explicitly stated in the balance sheet, or at least not stated in one of the immediately subordinate positions. Lastly, features of type Z do not have any sub-variables.
Here is an example balance sheet for depiction:
Assets                         Liabilities
X  (100)                       A  (200)
  x1 (30)                        a1 (150)
  x2 (30)                        a2 (25)
  x3 (40)                        a3 (25)
  x1 + x2 + x3 = X               a1 + a2 + a3 = A
Y  (150)                       B  (200)
  y1 (10)                        b1 (80)
  y2 (40)                        b2 (100)
Z  (350)                       C  (100)
Tot. Ass. (500)                Tot. Liab. (500)
As long as I only include X, Y and Z (and A, B and C) in the cluster analysis I do not expect any problems.
Now, here is my series of questions:
Let's assume I want to include x1, x2 and x3 in the analysis. Should I exclude X? Moreover, do I run into trouble because of the magnitude of the numbers, which are now much smaller? I believe that using a correlation-based distance makes sense in this scenario. Do you agree?
Let's assume I want to include y1 and y2 in the analysis. In this case, I should not remove Y from the analysis, because depending on the size of y1 and y2 relative to Y, Y might still have a lot of explanatory power. Do you agree?
I would be thankful for any pointers and also just general advice on the clustering analysis/links to look at, etc.
P.s. I am doing the analysis in R.
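To make the intended pipeline concrete, here is a rough sketch (MATLAB shown purely for illustration, since the code elsewhere on this page is MATLAB; the same steps map to R via division by total assets, as.dist(1 - cor(t(.))) and hclust). All data, the number of clusters and the linkage choice are made-up placeholders:
% Minimal sketch of the pipeline described above; all numbers are made up.
B = abs(randn(1000, 20)) * 100;               % stand-in for the raw balance sheet items (firms in rows)
totalAssets = sum(B, 2);                      % stand-in for each firm's total assets
Bscaled = bsxfun(@rdivide, B, totalAssets);   % scale every item by the firm's total assets
D = pdist(Bscaled, 'correlation');            % correlation-based distance between firms
tree = linkage(D, 'average');                 % hierarchical clustering on that distance
groups = cluster(tree, 'maxclust', 5);        % cut into 5 business-model clusters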

Related

PCA (Principal Component Analysis) on multiple datasets

I have a set of climate data (temperature, pressure and moisture, for example), X, Y, Z, which are matrices with dimensions (n x p), where n is the number of observations and p is the number of spatial points.
Previously, to investigate modes of variability in dataset X, I simply performed an empirical orthogonal function (EOF) analysis, or Principal Component Analysis (PCA), on X. This involved decomposing the matrix X via SVD.
To investigate the coupling of the modes of variability of X and Y, I used maximum covariance analysis (MCA), which involved decomposing a covariance matrix proportional to XY^T (where T denotes the transpose).
However, if I wish to look at all three datasets, how do I go about doing this? One idea I had was to form a fourth matrix, L, which is the 'feature' concatenation of the three datasets:
L = [X, Y, Z]
so that my matrix L has dimensions (n x 3p).
I would then use standard PCA/EOF analysis, using the SVD to decompose this matrix L. I would obtain modes of variability of size (3p x 1), so that the mode associated with X is the first p values, the mode associated with Y is the second set of p values, and the mode associated with Z is the last p values.
Is this correct? Or can anyone suggest a better way of looking at the coupling of all three (or more) datasets?
Thank you so much!
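For concreteness, a minimal MATLAB sketch of this concatenation approach (with small made-up dimensions) might look like:
n = 100; p = 40;                                    % made-up numbers of observations and spatial points
X = randn(n, p); Y = randn(n, p); Z = randn(n, p);  % stand-ins for the anomaly fields
L = [X, Y, Z];                                      % feature concatenation, n x 3p
L = bsxfun(@minus, L, mean(L, 1));                  % remove the mean of each column
[U, S, V] = svd(L, 'econ');                         % columns of V are the (3p x 1) modes
modeX = V(1:p, 1);                                  % part of the first mode associated with X
modeY = V(p+1:2*p, 1);                              % ... associated with Y
modeZ = V(2*p+1:3*p, 1);                            % ... associated with Z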
I'd recommend treating the spatial points as an extra dimension, i.e. arranging the data as an f x n x p array, where f is the number of features. At that point you should use a multilinear extension of PCA that can work on tensor data.
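For instance, a rough sketch of how one might get mode-wise factors from such an f x n x p array using only built-in MATLAB functions (an HOSVD-style computation; a full multilinear PCA would usually rely on a dedicated tensor toolbox) could be:
n = 100; p = 40; f = 3;                                             % made-up sizes
T = permute(cat(3, randn(n,p), randn(n,p), randn(n,p)), [3 1 2]);   % f x n x p array of toy data
T1 = reshape(T, f, []);                          % mode-1 unfolding: f x (n*p)
T2 = reshape(permute(T, [2 1 3]), n, []);        % mode-2 unfolding: n x (f*p)
T3 = reshape(permute(T, [3 1 2]), p, []);        % mode-3 unfolding: p x (f*n)
[U1, ~, ~] = svd(T1, 'econ');                    % factors for the feature mode
[U2, ~, ~] = svd(T2, 'econ');                    % factors for the observation mode
[U3, ~, ~] = svd(T3, 'econ');                    % factors for the spatial mode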

Computing the SVD of a rectangular matrix

I have a matrix M of size K x N, where K = 49152 is the dimension of the problem and N = 52 is the number of observations.
I have tried to use [U,S,V] = svd(M), but doing this I run out of memory.
I found another piece of code which uses [U,S,V] = svd(cov(M)), and it works well. My questions are: what is the meaning of using the cov(M) command inside the svd, and what is the meaning of the resulting [U,S,V]?
Finding the SVD of the covariance matrix is a way to perform Principal Component Analysis, or PCA for short. I won't get into the mathematical details here, but PCA performs what is known as dimensionality reduction. If you would like a more formal treatment of the subject, you can read my post about it here: What does selecting the largest eigenvalues and eigenvectors in the covariance matrix mean in data analysis?. Simply put, dimensionality reduction projects the data stored in the matrix M onto a lower-dimensional surface with the least amount of projection error. In this matrix, we are assuming that each column is a feature or dimension and each row is a data point.

I suspect the reason the SVD of the actual data matrix M occupies so much more memory than the SVD of the covariance matrix is that you have a large number of data points and a small number of features. The covariance matrix captures the covariance between pairs of features. If M is an m x n matrix, where m is the total number of data points and n is the total number of features, cov(M) gives you an n x n matrix, so you are applying the SVD to something much smaller than M.
As for the meaning of U, S and V: for dimensionality reduction specifically, the columns of V are what are known as the principal components. The columns of V are ordered so that the first column is the axis of your data that describes the greatest amount of variability possible. As you move to the second column and on up to the nth column, you introduce more axes and the variability described by each one decreases. Once you reach the nth column, you are describing your data in its entirety without reducing any dimensions. The diagonal values of S denote what is called the variance explained, and they respect the same ordering as V. As you progress through the singular values, they tell you how much of the variability in your data is described by each corresponding principal component.
To perform the dimensionality reduction, you can either take U and multiply by S, or take your mean-subtracted data and multiply by V. In other words, supposing X is the matrix M where each column's mean has been computed and then subtracted from that column, the following relationship holds:
US = XV
To actually perform the final dimensionality reduction, you take either US or XV and retain the first k columns, where k is the number of dimensions you want to keep. The value of k depends on your application, but many people choose k to be the number of principal components that explain a certain percentage of the variability in your data.
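As a rough MATLAB sketch of that recipe (here taking the SVD of the mean-subtracted data directly, which gives the same V up to signs; the toy data matrix and the choice of k are illustrative):
M = randn(1000, 52);                                % toy stand-in for the data matrix
X = bsxfun(@minus, M, mean(M, 1));                  % mean-subtract each column (feature) of M
[U, S, V] = svd(X, 'econ');
k = 10;                                             % number of dimensions to keep (application-dependent)
Xreduced = X * V(:, 1:k);                           % equivalently: U(:, 1:k) * S(1:k, 1:k)
explained = cumsum(diag(S).^2) / sum(diag(S).^2);   % cumulative fraction of variance explained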
For more information about the link between SVD and PCA, please see this post on Cross Validated: https://stats.stackexchange.com/q/134282/86678
Instead of [U, S, V] = svd(M), which tries to build a matrix U that is 49152 by 49152 (= 18 GB 😱!), do svd(M, 'econ'). That returns the "economy-class" SVD, where U will be 49152 by 52, S is 52 by 52, and V is also 52 by 52.
cov(M) will remove each dimension's mean and evaluate the inner product, giving you a 52 by 52 covariance matrix. You can implement your own version of cov, called mycov, as
function C = mycov(M)
M = bsxfun(@minus, M, mean(M, 1)); % subtract each dimension's (column's) mean over all observations
C = (M' * M) / (size(M, 1) - 1);   % normalize by N - 1 to match cov's default
end
(You can verify this works by looking at mycov(randn(49152, 52)), which should be close to eye(52), since each element of that array is IID-Gaussian.)
There are a lot of magical linear-algebraic properties and relationships between the SVD and the EVD (i.e., the singular value and eigenvalue decompositions): because the covariance matrix cov(M) is a Hermitian matrix, its left- and right-singular vectors are the same, and are in fact also cov(M)'s eigenvectors. Furthermore, cov(M)'s singular values are also its eigenvalues: so svd(cov(M)) is just an expensive way to get eig(cov(M)) 😂, up to ±1 and reordering.
As @rayryeng explains at length, usually people use svd(M, 'econ') because they want eig(cov(M)) without having to evaluate cov(M) explicitly: forming the covariance matrix squares the condition number, so it is numerically less stable. I recently wrote an answer that showed, in Python, how to compute eig(cov(M)) using svd(M2, 'econ'), where M2 is the 0-mean version of M, in the practical application of color-to-grayscale mapping, which might give you more context.
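A quick numerical check of that relationship on toy data:
M  = randn(1000, 52);                          % toy data matrix
M2 = bsxfun(@minus, M, mean(M, 1));            % 0-mean version of M
[~, S, ~] = svd(M2, 'econ');
lamFromSvd = diag(S).^2 / (size(M, 1) - 1);    % eigenvalues recovered from the singular values
lamFromEig = sort(eig(cov(M)), 'descend');     % eigenvalues of the covariance matrix
max(abs(lamFromSvd - lamFromEig))              % should be tiny (round-off level)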

How to accurately calibrate a measurement using a higher order correlation?

I have about 1000 measurements from a device; let's call these measurements y. For each of these measurements, I know what the actual value should be; let's call these z. How can I calibrate, adjust, or scale y for a better estimate? I was thinking of solving either of the following systems of equations (linear/nonlinear) for alpha, beta, and gamma:
or
Could someone give me some advice and let me know if I am doing this correctly?
First you need to know that a measurement device makes two kinds of errors: accidental and systematic.
Accidental errors are due to a number of perturbing factors with complex interactions, and they result in non-repeatability (measuring the same value twice gives different readings). To reduce accidental errors, you can repeat the measurement and average.
Systematic errors are permanent and stable. They are due to the relation z = y being wrong or approximate, and they repeat identically for the same measurement. The true relation can be of the form y = z + c with c != 0 (offset error), y = c.z with c != 1 (gain error), y = c1.z + c2 (both), or nonlinear, like y = c1.z² + c2.z + c3, y = (c1.z + c2) / (c3.z + c4), y = ln(exp(z) + 1)... or any other.
In some cases you have reasons to know the functional form of the relation (for instance, a metallic ruler gets a wrong "gain" when the temperature changes); in other cases you don't, and you can use an empirical model such as a polynomial (quite often the relation is smooth and remains close to y = z).
Usually, looking at a plot of the (z, y) points will hint at the importance of the accidental errors and the likely shape of the functional relation.
A simple approach is to try a least-squares fit of a polynomial model (say second or third degree). Once you have found the coefficients, look at the relative magnitudes of the polynomial terms (powers) over the working range. This tells you whether all terms are relevant. I advise you to discard the terms that do not significantly decrease the fitting error and keep a simple model.
Consider the case of the plot below, chosen randomly from the web.
At first sight the relation looks linear, with no offset error (as the relation passes through the point (0, 0)) and a few irregularities that we can attribute to accidental errors. For this device, the straight-line model y = c.z should be appropriate, and adding nonlinear terms would be useless or misleading.
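As a minimal MATLAB sketch of that least-squares approach, on made-up data with a small gain error:
z = linspace(0, 10, 1000)';                    % reference (true) values, made up
y = 1.02 * z + 0.05 * randn(size(z));          % toy device readings: gain error plus noise
p1 = polyfit(z, y, 1);                         % first-degree model
p2 = polyfit(z, y, 2);                         % second-degree model
rms1 = sqrt(mean((y - polyval(p1, z)).^2));    % compare fitting errors ...
rms2 = sqrt(mean((y - polyval(p2, z)).^2));    % ... and keep the simpler model if they are close
zhat = (y - p1(2)) / p1(1);                    % calibrated estimate by inverting the linear fit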

Canonical Correlation Analysis

I have just started working with CCA in MATLAB. I have two matrices X and Y of dimensions 60x1920 and 60x1536, with the number of samples being 60 and the numbers of variables in the two sets being 1920 and 1536, respectively. I want to use CCA to reduce them to a common subspace and then do feature matching.
I am using this command:
%% DO CCA
[A,B,r,U,V] = canoncorr(X,Y);
The output I get is this :
Name    Size        Bytes     Class     Attributes
A       1920x58     890880    double
B       1536x58     712704    double
U       60x58        27840    double
V       60x58        27840    double
r       1x58           464    double
Can anyone please tell me what these variables mean? I have gone over the documentation several times and am still unclear about them. As I understand it, CCA finds two linear projection matrices Wx and Wy such that the projections of X and Y onto Wx and Wy are maximally correlated.
1) Could anyone please tell me which of the following matrices are these?
2) Also how can I find the projected vectors in the learned subspace of CCA?
Any help will be appreciated. Thanks in advance.
As I understand it, with X and Y being your original data matrices, A and B are the sets of coefficients that perform a change of basis to maximally correlate your original data. Your data is represented in the new bases as the matrices U and V.
So to answer your questions:
The projection matrices you are looking for would be A and B since they transform X and Y into the new space.
The resulting projections of X and Y into the new space are U and V, respectively. (The r vector contains the canonical correlations, i.e. the correlation between each corresponding pair of columns of U and V; the cross-correlation matrix between U and V is diagonal, with these values on the diagonal.)
The MATLAB documentation says this transformation can be done with the following formulae, where N is the number of observations:
U = (X-repmat(mean(X),N,1))*A
V = (Y-repmat(mean(Y),N,1))*B
This page lays out the process nicely so you can see what each coefficient means in the transformation process.
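As a quick sanity check of these formulae on toy data (the dimensions are made up and kept small):
X = randn(60, 5); Y = randn(60, 4);            % toy data with 60 samples in each set
[A, B, r, U, V] = canoncorr(X, Y);
N = size(X, 1);
Ucheck = (X - repmat(mean(X), N, 1)) * A;      % should reproduce U
Vcheck = (Y - repmat(mean(Y), N, 1)) * B;      % should reproduce V
max(abs(U(:) - Ucheck(:)))                     % round-off level
corr(U(:,1), V(:,1))                           % equals r(1), the first canonical correlation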

Normalization in neural network with (x, y) output

I built a backpropagation neural network to learn from a dataset that consists of 7 continuous inputs and 2 outputs (x, y coordinates). My implementation choice was to use one hidden layer with 7 neurons, but I did it in such a way that I can try different combinations of hidden layers with a variable number of hidden nodes.
The error measurement is the usual mean squared error, calculated as follows:
MSE(x,y) = 1/N * sum((X - x)^2 + (Y - y)^2)
where X and Y are the target values and x and y the predictions. I also have to compute an accuracy measure, which is the mean Euclidean distance of each predicted point from its target point; that's basically the same as the MSE, except the values inside the sum are square-rooted.
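In code, with toy stand-ins for the targets and predictions, the two measures are simply:
Xt = randn(100, 1); Yt = randn(100, 1);                   % toy target coordinates
xp = Xt + 0.1*randn(100, 1); yp = Yt + 0.1*randn(100, 1); % toy predictions
mseVal   = mean((Xt - xp).^2 + (Yt - yp).^2);             % mean squared error over the points
meanDist = mean(sqrt((Xt - xp).^2 + (Yt - yp).^2));       % mean Euclidean distance (the accuracy measure)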
The inputs all lie within the interval [-2, +2], plus some outliers.
The output coordinates have completely unrelated distributions (x is normally distributed while y is uniformly distributed). The x range is small (say -1, +1 from the mean) while the y range varies more (say -10, +10 from the mean).
The behavior I get is that the net predicts the y output quite well, while the x output "flattens" towards y. I.e., the x values get closer to the y values; the network doesn't adapt to predict x correctly.
My initial choice was to scale both the inputs and the outputs as a whole to the usual (0, 1) interval, but that didn't lead to good results. So I then chose to standardize each feature separately with its z-score, and to scale the outputs to the (0, 1) interval (I am using the sigmoid activation function, so (0, 1) seemed about right). But then this strange behavior appeared.
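Concretely, the normalization I described amounts to roughly the following (toy data; here each output column is scaled to (0, 1) on its own):
Xin = 4 * rand(1000, 7) - 2;                              % toy inputs in roughly [-2, +2]
T   = [randn(1000, 1), 20 * rand(1000, 1) - 10];          % toy (x, y) targets with very different spreads
Xz  = zscore(Xin);                                        % standardize each input feature separately
Tmin = min(T); Tmax = max(T);
Tscaled = bsxfun(@rdivide, bsxfun(@minus, T, Tmin), Tmax - Tmin);  % each output column scaled to [0, 1]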
So my questions are, how would you normalize such inputs/outputs? Is there a way to deal with such uncorrelated outputs? I had even thought about using two separate networks to predict one single output discarding the other, is that a good choice?
Could you also point me to some reading where output normalization is discussed? The literature talks a lot about normalizing the inputs, but no one seems to care about the outputs.