K-means clustering on a large dataset in MATLAB

I have the Epinions dataset, which has 290000 columns and 22166 rows.
It is a large dataset (340 MB): opening the .mat file takes about 30 minutes in MATLAB, and when I run my clustering code it never produces any output. The whole problem is the size of the dataset; I tested it in R as well and it was the same. Is there any way to compress the dataset? What can I do now? Thanks
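One direction worth trying, sketched below under the assumption that the matrix is mostly zeros (trust/rating data such as Epinions usually is) and is stored in a variable X in a file epinions.mat (both hypothetical names): keep the data sparse, and cluster a low-rank representation rather than the raw matrix.

load('epinions.mat')                 % slow when the matrix is stored dense
Xs = sparse(X);                      % keep only the nonzero entries
save('epinions_sparse.mat', 'Xs', '-v7.3')   % much smaller file if X is sparse

% kmeans needs dense input, so first reduce the 290000 columns,
% e.g. with a truncated SVD, which works directly on sparse matrices:
k = 50;                              % number of components kept (assumption)
[U, S, ~] = svds(Xs, k);
Xred = U * S;                        % 22166 x k dense representation of the rows
idx = kmeans(Xred, 10);              % cluster the rows into 10 groups (assumption)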

Related

Measure spike correlation over time for different neurons

I'm trying to find a global measure of similarity for spike trains over time. The signals look like those in the picture (in this example I have 17 neurons).
Can I use windowed cross-correlations? If yes, what should I do with the output matrices? I use MATLAB, by the way.
Yes, you can use binned spikes or binned bursts to compute the cross-correlations. You can then threshold the output matrix to see which electrodes have the most similarity.
If you have the spikes in text format or have exported your spike train to an HDF5 file from MC_Rack, you can use MEAnalyzer to compute and export all the comparisons: https://github.com/RDastgh1/MEAnalyzer (but please cite it if you use it).
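A minimal MATLAB sketch of the binning idea, assuming spikeTimes is a 1x17 cell array of spike-time vectors and T is the recording length in seconds (both hypothetical names):

binWidth = 0.05;                               % 50 ms bins (assumption)
edges = 0:binWidth:T;
counts = zeros(numel(spikeTimes), numel(edges) - 1);
for n = 1:numel(spikeTimes)
    counts(n, :) = histcounts(spikeTimes{n}, edges);   % binned spike counts
end
R = corrcoef(counts');                         % 17x17 pairwise correlation matrix
similar = R > 0.3;                             % threshold (assumption) flags similar pairs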

The size of the confusion matrix generated by the confusionmat function is not right. Why?

I am working on a traffic sign recognition code in MATLAB using Belgian Traffic Sign Dataset. This dataset can be found here.
The dataset consists of training data and test data (or evaluation data).
I resized the given images and extracted HOG features using the vl_hog function from the VLFeat library.
Then, I trained a multi-class SVM using all of the signs in the training dataset. There are 62 categories (i.e. different types of traffic signs) and 4577 frames in the training set.
I used the fitcecoc function to obtain the classifier.
After training the multi-class SVM, I wanted to test the classifier's performance on the test data, so I used the predict and confusionmat functions, respectively.
For some reason, the size of the returned confusion matrix is 53 by 53 instead of 62 by 62.
Why is the size of the confusion matrix not the same as the number of categories?
Some of the folders inside the testing dataset are empty, causing MATLAB to skip those rows and columns in the confusion matrix.
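If you want the full 62x62 matrix regardless, confusionmat accepts an explicit class list through its 'Order' argument. A sketch, assuming the labels are the integers 1:62 (Ytest and Ypred are hypothetical names):

order = 1:62;                                  % every class, including the empty ones
C = confusionmat(Ytest, Ypred, 'Order', order);
% classes absent from the test set now show up as all-zero rows/columns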

Training time of an LSTM in Keras depending on the size of the dataset

I am currently trying to do time-series prediction with an LSTM implemented in Keras.
I tried to train an LSTM model with 10,000 samples in the training set and 2,500 samples in the test set, using a batch size of 30.
Now I am trying to train the exact same model with more data: a training set of 100,000 samples and a test set of 25,000 samples.
The time for one epoch is multiplied by 100 when using the big dataset.
Even though I have more data, the batch size is the same, so training should not take more time. Is it possible that it is the computation of the loss on the training and test data (where all the data is used) that takes so long?
Concerning the batch size: should I increase it because I have more data?
EDIT 1
I tried increasing the batch size. When I do that, the training time decreases a lot.
Shouldn't the computation of the gradient take longer with a big batch size than with a small one?
I have no clue here; I really do not understand why this is happening.
Does someone know why this is happening? Is it linked to the data I use? How, theoretically, can this happen?
EDIT 2
My processor is an Intel Xeon W3520 (4 cores / 8 threads) with 32 GB of RAM.
The data is composed of sequences of length 6 with 4 features. I use one LSTM layer with 50 units and a dense output layer. Whether I train with 10,000 samples or 100,000, it is really the batch size that changes the computation time: I can go from 2 seconds per epoch with a batch size of 1000 to 200 seconds with a batch size of 30.
I do not use a generator; I use the basic call model.fit(Xtrain, Ytrain, nb_epoch=nb_epoch, batch_size=batch_size, verbose=2, callbacks=callbacks, validation_data=(Xtest, Ytest)) with callbacks = [EarlyStopping(monitor='val_loss', patience=10, verbose=2), history]
You seemingly have misunderstood parts of how SGD (Stochastic Gradient Descent) works.
I explained parts of this in another answer here on Stack Overflow, which might help you understand it better, but I'll take the time to explain it again here.
The basic idea of gradient descent is to compute the forward pass (and store the activations) for all training samples, and only then update your weights once. Now, since you might not have enough memory to store all the activations (which you need to compute the backpropagation gradient), and for other reasons (mainly convergence), you often cannot do classical gradient descent.
Stochastic gradient descent makes the assumption that, by sampling in a random order, you can reach convergence by looking at only one training sample at a time and updating directly afterwards. One such update is called an iteration, whereas the pass through all training samples is called an epoch.
Mini-batches change SGD only in that, instead of a single sample, a "handful" of samples, determined by the batch size, is used for each update.
Now, updating the weights is a fairly costly process, and it should be clear at this point that updating the weights a great number of times (small batch size) is more costly than computing the gradient and updating only a few times (large batch size).
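To make this concrete, here is a minimal mini-batch SGD loop sketched in MATLAB (the idea is language-independent; gradFcn, w, Xtrain and Ytrain are placeholder names). With 100,000 samples, a batch size of 30 means roughly 3,333 weight updates per epoch, while a batch size of 1000 means only 100.

N = size(Xtrain, 1);
batchSize = 30;                                % try 1000 and count the updates
lr = 0.01;                                     % learning rate (assumption)
idx = randperm(N);                             % shuffle once per epoch
for s = 1:batchSize:N
    b = idx(s:min(s + batchSize - 1, N));      % indices of one mini-batch
    g = gradFcn(w, Xtrain(b, :), Ytrain(b, :));% forward + backward pass
    w = w - lr * g;                            % N/batchSize of these per epoch
end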

PCA on huge (10 million+ features) datasets

I am looking to extract "principal components" from huge datasets (each data point has 10 million+ features). I have about 1000 such data points. PCA only requires the 1000x1000 Gram matrix of the data points (rather than the full feature covariance matrix), so it's quite feasible to do PCA on the data. However, the principal components still have the same dimension as the data points (10 million+). I'd like to cut this down, because my code will have to read in the principal components, and that will be prohibitively slow if each principal component is several tens of megabytes.
Ideally I'd like to reduce the dimensionality of the dataset before applying PCA, keeping as much relevant information from the original data as possible. Any suggestions? Obviously simply downsampling the original data would work, but I'd lose the high-frequency parts of my original data.
Thanks ^.^
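For reference, the 1000x1000 ("snapshot") computation described above looks roughly like this in MATLAB; this is a sketch assuming X holds the 1000 data points as rows and that 20 components are kept (both assumptions):

Xc = bsxfun(@minus, X, mean(X, 1));            % center each feature
G = (Xc * Xc') / (size(X, 1) - 1);             % 1000x1000 Gram matrix
[V, D] = eig(G, 'vector');
[~, order] = sort(D, 'descend');               % strongest components first
V = V(:, order);
k = 20;                                        % number of components kept
PC = Xc' * V(:, 1:k);                          % principal directions, 10 million x k
PC = bsxfun(@rdivide, PC, sqrt(sum(PC.^2, 1)));% unit-normalize each column
scores = Xc * PC;                              % 1000 x k coordinates of the data

Note that the 1000 x k score matrix is already a compact representation of the data, so whether you need the full-length components at all depends on what the downstream code actually consumes.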

Lip-reading classification on the LiLiR dataset in MATLAB

I work on lip reading, but I am a newbie.
After googling, I found that one of the lip-reading datasets is the LiLiR dataset. I downloaded it and want to classify it using a support vector machine (SVM), but each letter has a data matrix with 4800 rows and 21 to 28 columns. I do not know what the columns mean. They are features, but which features?
A1_Faye_lips = load('\data set\avletters\avletters\Lips\A1_Faye-lips.mat')
A1_Faye_lips =
vid: [4800x21 double]
siz: [60 80 21]
>>
How can I train an SVM using this 2D matrix?
21 is the number of features. I didn't look into the data source, so I cannot tell what those features are exactly, but they are possibly the independent variables that influence the output (lip reading). Each variable is a 4800x1 vector, or equivalently a 60x80 array.
For training, LIBSVM is a good SVM toolbox for you.
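A minimal sketch of one way to feed this to MATLAB's built-in multi-class SVM (fitcecoc), under the assumption that each 4800x1 column of vid is one flattened 60x80 lip frame and that all frames in this file belong to the letter 'A':

data = load('A1_Faye-lips.mat');
X = data.vid';                       % 21 samples x 4800 pixel features
y = repmat({'A'}, size(X, 1), 1);    % one label per frame (assumption)
% ...append the frames and labels of the other letters the same way...
mdl = fitcecoc(X, y);                % multi-class SVM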