PCA on huge (10 million+ features) datasets - matlab

I am looking to extract "principal components" from huge datasets (each data point has 10 million+ features). I have about 1000 such data points. PCA only requires the 1000x1000 matrix of inner products between the centered data points (the Gram matrix, rather than the full feature-by-feature covariance matrix), so it's quite feasible to do PCA on the data. However, the principal components still have the same dimension as the data points (10 million+). I'd like to cut this down, because my code will require reading in the principal components, and will be prohibitively slow if each principal component is several tens of megabytes.
Ideally I'd like to reduce the dimensionality of the dataset before applying PCA, keeping as much relevant information in the original data as possible. Any suggestions? Obviously simply downsampling the original data would work, but I'd lose high-frequency parts of my original data.
Thanks ^.^
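To make the 1000x1000 observation concrete, here is a minimal sketch of recovering the leading principal component from the small sample-by-sample Gram matrix instead of the feature-by-feature covariance matrix. It is plain Python on toy data (4 points, 6 features); the function names and the use of power iteration are assumptions for illustration, not something from the question. It does not shrink the component itself, which still has one entry per feature.

```python
import math

def center(X):
    """Subtract the per-feature mean from every data point."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]

def gram(Xc):
    """n x n matrix of inner products between centered data points."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in Xc] for r1 in Xc]

def top_eigenpair(G, iters=500):
    """Power iteration: leading eigenvalue and unit eigenvector of G."""
    n = len(G)
    u = [1.0 + 0.1 * i for i in range(n)]  # arbitrary non-constant start
    for _ in range(iters):
        w = [sum(G[i][j] * u[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        u = [x / norm for x in w]
    lam = sum(u[i] * sum(G[i][j] * u[j] for j in range(n)) for i in range(n))
    return lam, u

def first_component(X):
    """Map the Gram eigenvector back to feature space: v = Xc^T u / sqrt(lam)."""
    Xc = center(X)
    lam, u = top_eigenpair(gram(Xc))
    d = len(Xc[0])
    v = [sum(u[i] * Xc[i][j] for i in range(len(Xc))) / math.sqrt(lam)
         for j in range(d)]
    return lam, u, v

# Toy data: 4 points with 6 features, nearly collinear.
X = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.0, 4.1, 6.0, 8.0, 10.0, 12.0],
    [3.0, 6.0, 9.2, 12.0, 15.0, 18.0],
    [4.0, 8.0, 12.0, 15.9, 20.0, 24.0],
]
lam, u, v = first_component(X)
print(len(u), len(v))  # eigenvector has one entry per point, component one per feature
```

Only the n x n matrix is ever decomposed, so the cost of the eigenproblem depends on the number of data points, not on the number of features.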

Related

When to use PCA for dimensionality reduction?

I am using the Matlab Classification Learner app to test different classifiers over a training set (size = 700). My response variable is a categorical label with 5 possible values. I have 7 numerical features and 2 categorical ones. I found a Cubic SVM to have the highest accuracy of 83%. But the performance goes down considerably when I enable PCA with 95% explained variance (accuracy = 40.5%). I am a student and this is the first time I am using PCA.
Why do I see such a result?
Could it be because of a small / unbalanced data set?
When is it useful to apply PCA? When we say "reduce dimensionality", is there a minimum number of features (dimensionality) in the original set?
Any help is appreciated. Thanks in advance!
I want to share my opinion.
A training set of 700 means you have fewer than 1,000 samples.
I'm even surprised that the SVM reaches 83%.
Even the MNIST dataset is considered small (60,000 training and 10,000 test samples). Your data is much, much smaller.
You are trying to make your small dataset even smaller using PCA. So what will the SVM learn? There may be no discriminating information left.
If I were you, I would test a random forest classifier. A random forest might even perform better.
Even if you balance your data, it is still a small dataset.
I believe using SMOTE will not improve the result. If your data consists of images, you could use ImageDataGenerator to augment it, though I'm not sure MATLAB has an equivalent of ImageDataGenerator.
You use PCA when you have lots of features, not all of which directly affect the accuracy even though they are components of the data.
For instance: Let's consider handwritten digit classification data.
From the above, can we say each pixel directly affects the accuracy?
The answer is no. The black pixels above are not important for the accuracy, so we use PCA to remove them.
If you want a detailed explanation with a Python example, check out my other answer.

Efficient hard data sampling for triplet loss

I am trying to implement a deep network for triplet loss in Caffe.
When I randomly select three samples as the anchor, positive, and negative images, the loss is almost always zero. So I tried the following strategy:
If I have 15,000 training images,
1. extract features of 15,000 images with the current weights.
2. calculate the triplet losses with all possible triplet combinations.
3. use the hard samples with n largest losses, and update the network n times.
4. iterate the above steps every k iterations to get new hard samples.
Step 1 is fast, but I think step 2 is very time-consuming and really inefficient. So I wonder whether there are other efficient strategies for hard data sampling.
Thanks.
In practice, if your dataset is large, it is infeasible to sample hard triplets from the whole dataset. Instead, you can choose hard triplets from only a small proportion of your training dataset, which saves a lot of time. After training the network on the generated hard triplets for K iterations, you feed the network the next batch of images from the dataset and generate new hard triplets.
In this way, the computation cost is acceptable and the network gradually improves as training proceeds.
See the article here for more reference (section 5.1).
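As a rough illustration of mining hard triplets inside a small subset rather than over all 15,000 images, the sketch below picks, for each anchor in a batch, the farthest same-label sample and the closest different-label sample. The function name, the margin value, and the toy 2-D embeddings are assumptions; this is one common way to do it, not necessarily the exact procedure in the linked article, and it is written in plain Python rather than Caffe.

```python
import math

def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hardest_triplets(embeddings, labels, margin=0.2):
    """For each anchor, pair the farthest positive with the closest negative.

    Returns (anchor, positive, negative, loss) tuples, keeping only
    triplets whose hinge loss is still positive.
    """
    triplets = []
    n = len(embeddings)
    for a in range(n):
        pos = [i for i in range(n) if i != a and labels[i] == labels[a]]
        neg = [i for i in range(n) if labels[i] != labels[a]]
        if not pos or not neg:
            continue  # this anchor cannot form a triplet within the subset
        p = max(pos, key=lambda i: dist(embeddings[a], embeddings[i]))
        q = min(neg, key=lambda i: dist(embeddings[a], embeddings[i]))
        loss = max(0.0, dist(embeddings[a], embeddings[p])
                   - dist(embeddings[a], embeddings[q]) + margin)
        if loss > 0.0:
            triplets.append((a, p, q, loss))
    return triplets

# Toy batch: features extracted with the current weights (hypothetical values).
batch_embeddings = [(0.0, 0.0), (1.0, 0.0), (0.9, 0.1), (3.0, 3.0)]
batch_labels = [0, 0, 1, 1]
for anchor, positive, negative, loss in hardest_triplets(batch_embeddings, batch_labels):
    print(anchor, positive, negative, round(loss, 3))
```

The pairwise search is quadratic in the subset size, so the cost is controlled by how many images are drawn per round rather than by the size of the whole training set.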

What is the importance of clustering?

During unsupervised learning we do cluster analysis (like K-Means) to bin the data to a number of clusters.
But what is the use of this clustered data in a practical scenario?
I think during clustering we are losing information about the data.
Are there some practical examples where clustering could be beneficial?
The information loss can be intentional. Here are three examples:
PCM signal quantization (Lloyd's k-means publication). You know that a certain number (say 10) of different signals are transmitted, but with distortion. Quantizing removes the distortions and re-extracts the original 10 different signals. Here, you lose the error and keep the signal.
Color quantization (see Wikipedia). To reduce the number of colors in an image, a quite nice method uses k-means (usually in HSV or Lab space). k is the number of desired output colors. Information loss here is intentional, to better compress the image. k-means attempts to find the least-squared-error approximation of the image with just k colors.
When searching for motifs in time series, you can also use quantization such as k-means to transform your data into a symbolic representation. The bag-of-visual-words approach that was the state of the art for image recognition prior to deep learning also used this.
Explorative data mining (clustering; one may argue that the above use cases are not data mining / clustering, but quantization). If you have a data set of a million points, which points are you going to investigate? Clustering methods try to split the data into groups that are supposed to be more homogeneous within and more different from one another. Then you don't have to look at every object, but only at some of each cluster to hopefully learn something about the whole cluster (and your whole data set). Centroid methods such as k-means can even provide a "prototype" for each cluster, although it is a good idea to also look at other points within the cluster. You may also want to do outlier detection and look at some of the unusual objects. This scenario is somewhere in between sampling representative objects and reducing the data set size to make it more manageable. The key difference from the points above is that the result is usually not "operationalized" automatically; because explorative clustering results are too unreliable (and thus require many iterations), they need to be analyzed manually.
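The color quantization example can be sketched in a few lines. The code below runs a basic Lloyd-style k-means on a handful of pixels and replaces each pixel with its cluster's mean color. It is only a sketch under simplifying assumptions: it works directly in RGB rather than HSV or Lab, uses the first k pixels as initial centroids, and the function names and pixel values are made up for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two colors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Basic Lloyd iteration; the first k points seed the centroids."""
    centroids = [tuple(float(c) for c in p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        new = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                new.append(tuple(sum(ch) / len(members) for ch in zip(*members)))
            else:
                new.append(centroids[c])  # keep an empty cluster where it was
        if new == centroids:
            break
        centroids = new
    return centroids, assign

def quantize(pixels, k):
    """Replace every pixel by the rounded mean color of its cluster."""
    centroids, assign = kmeans(pixels, k)
    palette = [tuple(round(c) for c in cen) for cen in centroids]
    return [palette[a] for a in assign]

# Six reddish/bluish pixels reduced to a 2-color palette.
pixels = [(250, 10, 10), (10, 10, 250), (240, 20, 20),
          (20, 20, 240), (255, 0, 0), (0, 0, 255)]
print(quantize(pixels, 2))
```

The small per-pixel differences are discarded on purpose; what remains is a k-entry palette plus one palette index per pixel.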

Do you have to normalize the data for a neural net if it is already scaled?

I'm currently trying to preprocess my training data ready for a multi-layered perceptron. The data I downloaded consists of 20,000 instances and 16 attributes, all of which are coordinate values of pixels as part of letter recognition. The data itself has already been scaled from its original form into values between 0 - 15 before being published.
However, since it's already been scaled, is it still necessary to perform normalization on it? I've tried to read around and look at previous examples but have come up with conflicting points. Some papers state that scaling is a form of normalization, whereas others say that normalization means bringing the values to a range of 0-1.
Since I'm using WEKA I've attempted their normalize filter during a pre-processing stage and it caused the accuracy to decrease by around 2% which makes me think it could be unnecessary. But again, I've read that it may only have a positive effect later in training.
So my question is:
What is the difference between scaling to a range such as 0 - 15 and normalizing it? Should I still normalize it on top of the scaling that's already done?
In your case you do not need to. Data is normalized so that an attribute with a different scale does not dominate the outcome of distance operations and, ultimately, the clustering or classification results.
As an example, say you have two attributes, weight and income. Weight will be between 10 and 200 kg at most, while income can be anywhere between $10,000 and $20,000,000. Most people's income will be between $10,000 and $120,000, and values above this will be outliers. If you do not normalize your data before using a multi-layer perceptron, the outcome of your neural network will be decided by these outliers.
In your case this situation is already mitigated by your scaling, so you do not need to normalize.
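A small sketch may help show how the two terms relate. Min-max normalization to 0-1 is itself a linear rescaling, so attributes that all share the range 0-15 are simply divided by 15. The weight and income figures below are the hypothetical ones from the example above, and the function name is an assumption.

```python
def min_max(values, lo=None, hi=None):
    """Linearly map values so that lo -> 0.0 and hi -> 1.0."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    if hi == lo:
        raise ValueError("cannot rescale a constant attribute")
    return [(v - lo) / (hi - lo) for v in values]

# Attributes already scaled to 0-15: normalizing just divides by 15.
pixel_attr = [0, 3, 15]
print(min_max(pixel_attr, lo=0, hi=15))

# Attributes on very different scales (hypothetical weight / income values).
weights = [10, 60, 200]
incomes = [10_000, 120_000, 20_000_000]
print(min_max(weights))
print(min_max(incomes))  # one extreme income squeezes the others toward 0
```

After rescaling, both attributes span the same range, so neither dominates by magnitude alone, although a single extreme value still compresses the rest of its own attribute.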

How do I compress data points in time-series using MATLAB?

I'm looking for some suggestions on how to compress time-series data in MATLAB.
I have some data sets of pupil size, each gathered over 1 sec with 25,000 points per trial (I'm still not sure whether it is proper to call the data a 'time series'). What I'd like to do now is compare them with another data set, so I need to reduce the number of points to about 10,000 or fewer while minimizing the loss of information. Are there any ways to do it?
I've tried to search how to do this, but all that I could find out was the way to smooth the data or to compress digital images, which were already done or not useful to me.
• The data sets simply consist of pupil diameter changing over time. For each trial, 25,000 points of data were gathered during 1 sec, which means 1 point denotes the pupil diameter measured over 0.04 msec. What I want to do is just to convert this data to 0.1 msec/point; however, I'm not sure whether I can apply techniques like FFT in this case because it is the first time that I have handled this kind of data. I appreciate your advice again.
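The arithmetic here is: 1 sec / 25,000 points = 0.04 msec per point, and 0.1 msec per point gives 10,000 points, a ratio of 2.5 old samples per new sample. Because the ratio is not an integer, the new sample times fall between the old ones. The sketch below uses plain linear interpolation in Python to show the index arithmetic; the function name is an assumption, and it applies no low-pass filtering, so it is only reasonable if the signal has little content above the new Nyquist frequency. In MATLAB, `resample(x, 2, 5)` from the Signal Processing Toolbox performs the same rate change with an anti-aliasing filter.

```python
def resample_linear(samples, old_rate, new_rate):
    """Resample by linear interpolation (no anti-aliasing filter).

    old_rate and new_rate are in samples per second; output sample i
    sits at input position i * old_rate / new_rate.
    """
    n_out = len(samples) * new_rate // old_rate
    out = []
    for i in range(n_out):
        pos = i * old_rate / new_rate          # fractional input index
        left = min(int(pos), len(samples) - 2)  # clamp so left + 1 is valid
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[left + 1] * frac)
    return out

# Stand-in for one trial: 25,000 points over 1 sec (here just a ramp).
trial = [float(i) for i in range(25_000)]
reduced = resample_linear(trial, old_rate=25_000, new_rate=10_000)
print(len(reduced))  # 10000
```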
A standard data compression technique for time-series data is to take the fast Fourier transform and represent your data by a small number of frequency amplitudes, typically those of the lowest frequencies (calculate the power spectrum). You can then compare data using these frequency amplitudes. To lose the least amount of information you would instead want to keep the frequencies with the largest amplitudes, but then it becomes tricky to compare the data, because different trials may keep different frequencies. Here is the standard Matlab tutorial on FFT. Some other possibilities include:
-ARMA models
-Wavelets
Check out this paper on the "SAX" method, a modern approach for time-series compression -- it also discusses classic time-series compression techniques.
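The Fourier idea in this answer can be sketched as follows: transform the signal, keep only the coefficients with the largest amplitudes, and reconstruct from those. This is a toy illustration in plain Python using a direct O(n^2) DFT on a 64-point synthetic signal; the function names and the signal are assumptions, and for real 25,000-point trials you would use an FFT routine (such as MATLAB's `fft`) instead.

```python
import cmath
import math

def dft(x):
    """Direct discrete Fourier transform (O(n^2), fine for tiny examples)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse DFT, returning the real part of each sample."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def compress(x, keep):
    """Keep the `keep` coefficients with the largest magnitude, indexed by bin."""
    X = dft(x)
    top = sorted(range(len(X)), key=lambda k: abs(X[k]), reverse=True)[:keep]
    return {k: X[k] for k in top}

def decompress(coeffs, n):
    """Rebuild an n-point signal, treating all dropped bins as zero."""
    return idft([coeffs.get(k, 0j) for k in range(n)])

# Synthetic signal: two sinusoids, so 4 bins (2 conjugate pairs) carry everything.
n = 64
signal = [3 * math.sin(2 * math.pi * 2 * t / n) + math.cos(2 * math.pi * 5 * t / n)
          for t in range(n)]
coeffs = compress(signal, keep=4)
restored = decompress(coeffs, n)
print(sorted(coeffs))  # [2, 5, 59, 62]
```

Note that the stored representation needs both the bin indices and the complex values, which is what makes comparing two signals that kept different bins awkward.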