Video clustering in MATLAB

I'm trying to implement the method from this article: "Improving Cluster Selection and Event Modeling in Unsupervised Mining for Automatic Audiovisual Video Structuring".
Part of it is about video clustering: "The video stream is segmented into shots based on color histograms to detect abrupt changes and progressive transitions. Each of the resulting shot is summarized by a key frame, taken in the middle of the shot, in turn represented as a RGB histogram with 8 bins per color. Bottom-up clustering relies on the Euclidean distance between the 512-dimension color histograms using Ward’s linkage."
I've done this and ended up with an array of numbers like this:
1.0e+03 *
3.8334
3.9707
3.8887
2.1713
2.5616
2.3764
2.4533
After performing the dendrogram step, the result became:
174.0103
175.0093
176.0093
177.0093
178.0093
178.0093
179.0093
But according to a toy example given by the authors of the article, the result should be intervals like:
{47000, 50000}, {143400, 146400}, {185320, 187880}, {228240, 231240}, {249440, 252000}, {346000, 349000}
What is wrong here?

You should have 512-dimensional vectors after the first step, one such vector per frame, or equivalently a 512 x n matrix.
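As a rough sketch of that first step, a 512-bin joint RGB histogram (8 bins per channel) can be built with base MATLAB only. The random image stands in for a real key frame, and the normalization to unit sum is an assumption; the quoted passage does not say whether the histograms are normalized.

```matlab
% Stand-in for one key frame (H x W x 3, uint8); replace with a real frame.
img = uint8(randi([0 255], 48, 64, 3));

% Quantize each channel into 8 bins (values 0..7).
q = floor(double(img) / 32);

% Combine the three channel bins into one joint bin index in 1..512.
idx = q(:,:,1) * 64 + q(:,:,2) * 8 + q(:,:,3) + 1;

% Count pixels per joint bin, giving a 512 x 1 vector.
h = accumarray(idx(:), 1, [512 1]);

% Normalize to unit sum (assumption, not stated in the article excerpt).
h = h / sum(h);
```

Stacking one such column per key frame gives the 512 x n matrix.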
For the second step, I don't think they use the plain built-in hierarchical clustering. It is not time-aware, will not produce intervals, and scales as O(n^3), which is really bad. Instead, they probably use a customized clustering algorithm, inspired by hierarchical clustering and using Ward's linkage, that operates on time intervals: it starts with single frames and only joins neighboring intervals, not arbitrary intervals as regular hierarchical clustering would.
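A minimal sketch of such a time-constrained agglomeration is shown below. It is a guess at the idea, not the article's actual algorithm: the toy data, the stopping rule (a fixed number of intervals k) and all variable names are assumptions. The merge cost is the usual Ward criterion, i.e. the increase in within-cluster sum of squares, n1*n2/(n1+n2) * ||mu1 - mu2||^2, evaluated only for temporally adjacent intervals.

```matlab
% Toy data: d x n matrix, one column per key frame, in temporal order.
% Three plateaus of 5 frames each, with a small drift inside each plateau.
base = [0 0 0 0 0 5 5 5 5 5 10 10 10 10 10];
X = repmat(base, 4, 1) + 0.01 * repmat(1:15, 4, 1);
k = 3;                       % target number of intervals (assumption)

n = size(X, 2);
starts = 1:n;                % first frame of each interval
stops  = 1:n;                % last frame of each interval
cnt    = ones(1, n);         % number of frames per interval
mu     = X;                  % mean vector per interval

while numel(cnt) > k
    % Ward cost of merging each interval with its right-hand neighbor.
    d2   = sum((mu(:, 1:end-1) - mu(:, 2:end)).^2, 1);
    cost = (cnt(1:end-1) .* cnt(2:end)) ./ (cnt(1:end-1) + cnt(2:end)) .* d2;
    [~, j] = min(cost);

    % Merge interval j+1 into interval j.
    mu(:, j) = (cnt(j) * mu(:, j) + cnt(j+1) * mu(:, j+1)) / (cnt(j) + cnt(j+1));
    cnt(j)   = cnt(j) + cnt(j+1);
    stops(j) = stops(j+1);
    mu(:, j+1) = []; cnt(j+1) = []; starts(j+1) = []; stops(j+1) = [];
end

intervals = [starts; stops]';   % one row per interval: [firstFrame lastFrame]
```

Because only neighbors are compared, each pass costs O(n) and the output is a list of contiguous frame ranges, which can then be converted to time stamps.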

Related

Comparison of set of signals

I have movement data acquired from a motion capture system, and I want to automatically determine which 5 signals are most alike.
The picture shows an example of the data, all normalized to 100 samples because of differences in speed.
Data set for knee flexion/extension
What I am looking for is an idea for how to actually compare the shapes of the curves.
The easiest solution is just to subtract the "raw" curves and check which one has the smallest RMSE.
But looking at your data (which are smooth curves), another option is to use PLS or GMM to describe them. Then you can use RMSE to compute the error between your input curve and your database of curves and take the one with lowest error.
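The plain RMSE comparison can be sketched as follows. The curves here are synthetic placeholders (the real data would be the 100-sample normalized recordings), and the names ref, db and best are made up for illustration.

```matlab
% Synthetic stand-ins: one reference curve and a small "database" of curves,
% each resampled to 100 samples and stored as a column.
t   = linspace(0, 1, 100)';
ref = sin(2 * pi * t);
db  = [sin(2 * pi * t) + 0.05, cos(2 * pi * t), sin(2 * pi * t + 0.3)];

% RMSE between the reference and every column of the database.
err  = bsxfun(@minus, db, ref);
rmse = sqrt(mean(err.^2, 1));

% Index of the most similar curve, and the full ranking.
[~, best]  = min(rmse);
[~, order] = sort(rmse);
```

To pick the 5 most similar signals, take the first five entries of order.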

What should be the output after extracting features from an audio signal using DWT (Discrete Wavelet Transform) in MATLAB?

I am working on a speech recognition system (I'm following a research paper). After denoising the signals, I want to extract features from the audio signals, which are stored as arrays in MATLAB.
Please correct me if I'm wrong, but I think the feature array (after performing the decomposition) should be smaller than the original audio signal.
I used wavedec to decompose the signal up to 10 levels with db8 as the wavelet, but the output was the same size as the input or slightly larger.
The coefficient array should be about the same size as the original; with a long filter such as db8, boundary extension can make it slightly larger.
If you look at what wavedec does, it breaks down your signal into a high and a low component using 2 filters and then decimates by a factor of 2. It then repeats this on the approximation component (low) for each level you decompose. So if you decompose at one level, you simply pass your signal through both filters and decimate the result by 2 at the end. This preserves the number of total samples. The same logic applies to the next and subsequent levels.
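To make the sample count concrete without the Wavelet Toolbox, here is a small sketch that uses the Haar wavelet instead of db8 (a deliberate simplification: Haar has a 2-tap filter, so there is no boundary extension and the count is preserved exactly). The coefficient layout mimics the [approximation, coarsest detail, ..., finest detail] ordering; the signal and the number of levels are arbitrary.

```matlab
x = (1:1024)';          % toy signal, length divisible by 2^levels
levels = 3;

a = x;                  % current approximation
c = [];                 % detail coefficients collected so far
for k = 1:levels
    d = (a(1:2:end) - a(2:2:end)) / sqrt(2);   % high-pass + decimate by 2
    a = (a(1:2:end) + a(2:2:end)) / sqrt(2);   % low-pass + decimate by 2
    c = [d; c];                                % coarser details go in front
end
c = [a; c];             % final layout: [a3; d3; d2; d1]
```

Each level splits the current approximation into two halves, so the total stays at 1024. To get a feature vector that is actually smaller than the signal, a common approach is to summarize each sub-band (for example by its energy or variance) rather than use the raw coefficients.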

3D SIFT for human activity classification in videos. NOT GETTING GOOD ACCURACY.

I am trying to classify human activities in videos (six classes and almost 100 videos per class, 6*100 = 600 videos). I am using 3D SIFT (both xy and t scale = 1) from UCF.
for f = 1:20
    f   % echo the current video index
    offset = 0;
    c = strcat('running', num2str(f), '.mat');
    load(c)
    pix = video3Dm;
    % Generate descriptors at locations given by subs matrix
    for i = 1:100
        reRun = 1;
        while reRun == 1
            loc = subs(i+offset,:);
            fprintf(1,'Calculating keypoint at location (%d, %d, %d)\n',loc);
            % Create a 3DSIFT descriptor at the given location
            [keys{i}, reRun] = Create_Descriptor(pix,1,1,loc(1),loc(2),loc(3));
            if reRun == 1
                % Descriptor rejected: skip ahead to the next candidate location
                offset = offset + 1;
            end
        end
    end
    fprintf(1,'\nFinished...\n%d points thrown out do to poor descriptive ability.\n',offset);
    % Store the first 20 descriptors of this video; the row offset advances by 100 per video
    for t1 = 1:20
        des(t1+((f-1)*100),:) = keys{1,t1}.ivec;
    end
    f
end
My approach is to first get 50 descriptors (of 640 dimensions) for each video, and then build a bag of words from all descriptors (50*600 = 30000 descriptors). I run k-means (with k = 1000):
idx1000=kmeans(double(total_des),1000);
This gives an index vector of length 30k. I then create a histogram signature for each video from the cluster indices of its descriptors, and run svmtrain (SVM in MATLAB) on the signatures (dimension 600*1000).
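For reference, the signature step can be sketched with accumarray. The sizes are scaled down, the idx vector is a synthetic stand-in for the kmeans output, and the sketch assumes the descriptors are stored video by video with the same number per video; dividing by the descriptor count is an optional normalization, not something stated above.

```matlab
nVideos  = 4;     % 600 in the real setup
perVideo = 50;    % descriptors per video
K        = 10;    % vocabulary size (1000 in the real setup)

% Stand-in for the kmeans index vector: one cluster label per descriptor.
idx = mod((0:nVideos * perVideo - 1)', K) + 1;

% Video label for every descriptor (descriptors stored video by video).
videoId = kron((1:nVideos)', ones(perVideo, 1));

% Count how often each visual word occurs in each video.
signatures = accumarray([videoId idx], 1, [nVideos K]);

% Optional: turn counts into relative frequencies.
signatures = signatures / perVideo;
```

Each row of signatures is then one training example for the classifier.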
Some potential problems:
1- I generate 300 random points in 3D and compute 50 descriptors at 50 of those 300 points.
2- The xy and time scale values are left at their default of 1.
3- Number of clusters: I am not sure that k = 1000 is enough for 30000x640 data.
4- svmtrain: I am using this MATLAB library.
NOTE: Everything is in MATLAB.
Your basic setup seems correct, especially given that you are getting 85-95% accuracy. Now it's just a matter of tuning your procedure. Unfortunately, there is no way to do this other than testing a variety of parameters, examining the results, and repeating. I am going to break this answer into two parts: advice about bag of words features, and advice about SVM classifiers.
Tuning Bag of Words Features
You are using 50 3D SIFT Features per video from randomly selected points with a vocabulary of 1000 visual words. As you've already mentioned, the size of the vocabulary is one parameter you can adjust. So is the number of descriptors per video.
Let's say that each video is 60 frames long, (at 30 fps only 2 sec, but let's assume you are sampling at 1fps for a 1 minute video). That means you are capturing less than one descriptor per frame. That seems very low to me even with 3D descriptors especially if the locations are randomly chosen.
I would manually examine the points for which you are generating features. Do they appear to be well distributed in both space and time? Are you capturing too much background? Ask yourself: would I be able to distinguish between actions given these features?
If you find that many of the selected points are uninformative, increasing the number of points may help. The kmeans clustering can make a few groups for uninformative outliers, and more points means you hopefully capture a few more informative points. You can also try other methods for selecting points. For example, you could use corner points.
You can also manually examine the points that are clustered together. What sorts of structures do the groups have in common? Are the clusters too mixed? That's usually a sign that you need a larger vocabulary.
Tuning SVMs
Using the Matlab SVM implementation or the Libsvm implementation should not make a difference. They are both the same method and have similar tuning options.
First off, you should really be using cross-validation to tune the SVM to avoid overfitting on your test set.
The most powerful parameter for the SVM is the kernel choice. In Matlab, there are five built in kernel options, and you can also define your own. The kernels also have parameters of their own. For example, the gaussian kernel has a scaling factor, sigma. Typically, you start off with a simple kernel and compare to more complex kernels. For example, start with linear, then test quadratic, cubic and gaussian. To compare, you can simply look at your mean cross-validation accuracy.
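A hedged sketch of that comparison is given below. It assumes a recent Statistics and Machine Learning Toolbox and uses fitcecoc with templateSVM for the multiclass case, which is not the svmtrain interface discussed above. The data is synthetic, and the kernel list and the 5-fold setting are arbitrary choices.

```matlab
% Synthetic stand-in for the signature matrix: 3 classes, 5 features.
rng(1);
X = [randn(40, 5); randn(40, 5) + 2; randn(40, 5) + 4];
y = [ones(40, 1); 2 * ones(40, 1); 3 * ones(40, 1)];

kernels = {'linear', 'polynomial', 'gaussian'};
acc = zeros(1, numel(kernels));
for k = 1:numel(kernels)
    % SVM learner with the chosen kernel; kernel scale picked heuristically.
    t = templateSVM('KernelFunction', kernels{k}, ...
                    'KernelScale', 'auto', 'Standardize', true);
    mdl = fitcecoc(X, y, 'Learners', t);   % multiclass model built from binary SVMs
    cv  = crossval(mdl, 'KFold', 5);       % 5-fold cross-validation
    acc(k) = 1 - kfoldLoss(cv);            % mean cross-validation accuracy
end
```

The entries of acc can then be compared directly; the simplest kernel within noise of the best score is usually the safer choice.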
At this point, the last option is to look at individual instances that are misclassified and try to identify reasons that they may be more difficult than others. Are there commonalities such as occlusion? Also look directly at the visual words that were selected for these instances. You may find something you overlooked when you were tuning your features.
Good luck!

On how to apply k means clustering and outlining the clusters

I am reading about applications of clustering in human motion analysis. I started out with random numbers and applied the k-means clustering algorithm, but I would like graphs that circle the clusters as shown in the picture. Basically, the lines represent the motion trajectory. I would appreciate ideas on how to obtain the motion trajectory of a person. The application is patient monitoring, where the trajectory will be used in abnormal activity detection.
I will be using a Kinect and recording the motion trajectory based on skeleton tracking. I will record the 4 quaternion values of the Head, Shoulder and Torso joints, plus the RGBD (red, green, blue, depth) data combined into 1 value for each of these joints. That gives a total of 4*3 + 3 = 15 time series, i.e. 15 variables. How do I convert them into trajectories like the ones shown below, and then cluster the trajectories? The clusters will help with classification.
Can somebody please show how to obtain a diagram similar to the one attached, and how to fuse and convert the 15 time series from each person into a single trajectory?
The picture illustrates the number of clusters that are generated for the time series. Thank you in advance.
K-means is a bad fit for trajectories.
It needs to be able to compute the mean (which is why it is called "k-means"). Having a stable, sensible mean is important. But how meaningful is the mean of some time series, even if you could define it (and the series weren't e.g. of different length, and different movement speed)?
Try hierarchical clustering, and multivariate dynamic time warping.
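As an illustration, a pairwise multivariate DTW distance matrix can be computed in base MATLAB as sketched below. The three toy trajectories are made up (2 variables instead of 15, different lengths), and this is the simplest "dependent" variant, where the local cost is the Euclidean distance between whole frames and no warping window is used.

```matlab
% Toy trajectories: rows = variables, columns = time steps.
t1 = linspace(0, 1, 30);
t2 = linspace(0, 1, 45);
t3 = linspace(0, 1, 40);
series = {[t1; t1.^2], [t2; t2.^2], [t3; 1 - t3]};

m = numel(series);
D = zeros(m);                        % pairwise DTW distances
for p = 1:m
    for q = p+1:m
        A = series{p};  B = series{q};
        na = size(A, 2);  nb = size(B, 2);
        C = inf(na + 1, nb + 1);     % cumulative cost table
        C(1, 1) = 0;
        for i = 1:na
            for j = 1:nb
                step = norm(A(:, i) - B(:, j));   % frame-to-frame distance
                C(i+1, j+1) = step + min([C(i, j+1), C(i+1, j), C(i, j)]);
            end
        end
        D(p, q) = C(end, end);
        D(q, p) = D(p, q);
    end
end
```

Here the first two trajectories follow the same path at different speeds, so their DTW distance is much smaller than either one's distance to the third. With the Statistics and Machine Learning Toolbox, the matrix could then be fed to hierarchical clustering, for example linkage(squareform(D), 'average'); average linkage is suggested here because Ward's criterion assumes Euclidean distances.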

How to extract useful features from a graph?

Here is the situation:
I have some graphs like the pictures above, and I am trying to classify them into different kinds so that the shape of a character can be recognized. Here is what I've done:
I apply a 2-D FFT to the graphs to get their spectra. Here are some results:
S after 2-D FFT
T after 2-D FFT
I have found that instances of the same letter share the same pattern in the magnitude plot after the FFT, and I want to use this feature to cluster the letters. But there is a problem: I want the features of interest to be representable in a 2-D plane, i.e. in the form (x, y), whereas the feature here is actually a whole graph with about 600*400 elements. The only thing I am interested in is the shape of the graph (S is a dot in the middle, and T is like a cross). So how can I reduce the dimension of the magnitude graph?
I am not sure my question is clear, but thanks in advance.
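You can use methods such as
k-means clustering
SVM
PCA
MDS
PCA and MDS are dimensionality reduction methods in the strict sense; k-means and SVM are clustering and classification methods, but their outputs (distances to centroids, decision values) can also serve as a low-dimensional representation. Each of these methods can take your 2-dimensional arrays, flattened into vectors, and work out a good coordinate frame to distinguish or represent your letters.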
One way to start would be reducing your 240000 dimensional space to a 26-dimensional space using any of these methods.
This would give you an 'amplitude' for each of the possible letters.
But as @jucestain says, neural network classifiers are great for letter recognition.
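As one possible starting point, here is a sketch of PCA (via SVD) applied to flattened FFT magnitude images, projecting each image to an (x, y) point. The tiny synthetic "strokes" stand in for real letter images, and the log scaling of the magnitude is an extra assumption, not something from the question.

```matlab
n = 6;  h = 16;  w = 16;
F = zeros(n, h * w);                 % one flattened spectrum per row
for k = 1:n
    img = zeros(h, w);
    if k <= 3
        img(8, 3+k:10+k) = 1;        % horizontal stroke, shifted per sample
    else
        img(k:7+k, 8) = 1;           % vertical stroke, shifted per sample
    end
    mag = abs(fftshift(fft2(img)));  % magnitude spectrum, DC in the center
    F(k, :) = log1p(mag(:))';        % compress dynamic range, then flatten
end

% PCA: center the data and project onto the first two principal directions.
Fc = bsxfun(@minus, F, mean(F, 1));
[U, S, ~] = svd(Fc, 'econ');
xy = U(:, 1:2) * S(1:2, 1:2);        % n x 2 coordinates, one (x, y) per image
```

Because the FFT magnitude does not change under (circular) shifts, the three horizontal strokes map to the same point and the three vertical strokes to another, so the two shapes separate along the first coordinate. Real letter images would of course produce more spread.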