I'm currently working on a project where I should classify hand gestures, many papers proposed that HMMs is the way to do so, many tutorials speak of either a weather tutorial or a dice and coin tutorial, I can't seem to understand how to map these to my problem and what should my different matrices be, I currently have a feature vector (containing the detected features of the hands as a n*2 matrix where n is the total number of features detected in all the frames, i.e. if the algorithm detected 10 features in each frame and the video is 10 frames, n would be = 100, and 2 is the x and y coordinates) and the motion vector (the motion of the hand itself in the video m*2 size where m is the number of frames in the video) also any other data u would recommend to extract from the video.
I know the papers you are talking about and the exemples about the weather are simplistic and cannot be mapped to most of the problems now processed with HMMs. In your case, you have features corresponding to hand gestures that you know. HMM can work because the data you have is dynamic, i.e. ordered in time.
My advice is that you should first have a look at the widely used HMM toolbox by Kevin Murphy. It provides all the tools you need to start working with HMMs.
The main idea is to model each gesture type with one dedicated HMM. For a given gesture type, the corresponding HMM will be trained with the available features that you have.
Once trained, you get a state transition probability matrix, an emission probability matrix and a prior for selecting the initial state.
When your have an unknown gesture, you will then compute the likelihood this gesture (its features actually) could have been generated by each of the trained HMMs. Usually, the query sequence is assigned to the category of the one raising the highest score.
This is for the big picture. In your case, you will have to find a way to represent your features as a time series. The "time" being the different frames. With a complex application such as hand gesture it might be difficult to see what each state of the model represents. Some kinds of HMM, by their topology (left-to-right models for instance) make this analogy easier.
Related
I want to use pre-trained model for the face identification. I try to use Siamese architecture which requires a few number of images. Could you give me any trained model which I can change for the Siamese architecture? How can I change the network model which I can put two images to find their similarities (I do not want to create image based on the tutorial here)? I only want to use the system for real time application. Do you have any recommendations?
I suppose you can use this model, described in Xiang Wu, Ran He, Zhenan Sun, Tieniu Tan A Light CNN for Deep Face Representation with Noisy Labels (arXiv 2015) as a a strating point for your experiments.
As for the Siamese network, what you are trying to earn is a mapping from a face image into some high dimensional vector space, in which distances between points reflects (dis)similarity between faces.
To do so, you only need one network that gets a face as an input and produce a high-dim vector as an output.
However, to train this single network using the Siamese approach, you are going to duplicate it: creating two instances of the same net (you need to explicitly link the weights of the two copies). During training you are going to provide pairs of faces to the nets: one to each copy, then the single loss layer on top of the two copies can compare the high-dimensional vectors representing the two faces and compute a loss according to a "same/not same" label associated with this pair.
Hence, you only need the duplication for the training. In test time ('deploy') you are going to have a single net providing you with a semantically meaningful high dimensional representation of faces.
For a more advance Siamese architecture and loss see this thread.
On the other hand, you might want to consider the approach described in Oren Tadmor, Yonatan Wexler, Tal Rosenwein, Shai Shalev-Shwartz, Amnon Shashua Learning a Metric Embedding for Face Recognition using the Multibatch Method (arXiv 2016). This approach is more efficient and easy to implement than pair-wise losses over image pairs.
I have a project for face recognition of five people that I want my CNN to detect, and I was wondering if people could have a look at my model to see if this is a step in the right direction
def model():
model= Sequential()
# sort out the input layer later
model.add(convolutional.Convolution2D(64,3,3, activation='relu'), input_shape=(3,800,800))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(convolutional.Convolution2D(64,3,3, activation='relu'))
model.add(convolutional.MaxPooling2D((2,2), strides=(2,2)))
flatten()
model.add(Dense(128, activation='relu'))
model.add(Dropout(p=0.2))
model.add(Dense(number_of_faces, activation='softmax'))
so the model will be taking in pictures (headshots found on google of 5 people) in 3 channels of size 800 by 800 with 64 feature maps, pooled and then another set of feature maps
and then connected to a mlp for classification into a binary vector for 5 output neurons. My question is, is this a decent approach to try and classify headshots of certain people?
for example if I were to download one hundred pictures of a certain person and put them through this model, would the feature space created in the convolution be big enough to capture
the features of that face and four others?
thanks for the help guys
Well, it is not an engineering issue but a scientific one. It is hard to judge whether 100 picture is enough for your purpose without showing current progress (like, what is the accuracy now? Are your facing overfitting or underfitting.
But, YES, extra data of faces can help with your model, especially when those faces are of same context (background, light, angle, skin color, etc.) with your eventual testing data.
If you are interesting in face recognition, you can start with Deep Learning Face Representation from Predicting 10,000 Classes (unofficial code here), they use 10 thousand faces as extra dataset to train. You can search "DeepID" for more information.
If you are an engineering guy, you can check Facial Expression Recognition with Convolutional Neural Networks, this report focus more on implementation, which is also implemented by Keras.
By then way, 800*800 is extra large in face recognition community. You might like to resize them to a smaller size. Otherwise your program might be too gargantuan to train and consumes butch of memory.
Face recognition is not a regular classification study. If you train your model for 5 people, even if it would be a successful model, you need to re-train it if a new person join to the team. It means that your new model might not be successful anymore.
We firstly train a regular classification model but then drop its final softmax layer and use its early layer to represent images. Representations are multi-dimensional vector. Herein, we expect that image pair of same person should have high similarity whereas image pair of different persons should have low similarity. We can find the vector similarities with cosine similarity or euclidean distance methods.
To sum up, you should not train a model anymore for face recognition application. You just need to use a neural networks to predict. Predictions will be representations.
I recommend you to use deepface. It wraps state-of-the-art face recognition models such as VGG-Face, Google FaceNet, OpenFace, Facebook DeepFace, DeepID and Dlib. It also handles face detection and alignment in the background. You just need to call a line of code to apply face recognition.
#!pip install deepface
from deepface import DeepFace
models = ['VGG-Face', 'Facenet', 'OpenFace', 'DeepFace', 'DeepID', 'Dlib']
obj = DeepFace.verify("img1.jpg", "img2.jpg", model_name = models[0])
print(obj["verified"], ", ", obj["distance"])
Returned object stores max threshold value and found distance. In this way, it returns True in verified param if the image pair is same person, returns False if the image pair is different persons.
I need to do dimensionality reduction from a series of images. More specifically, each image is a snapshot of a ball moving and the optimal features would be its position and velocity. As far as I know, CNN are the state-of-the-art for reducing the features for image classification, but in that case only a single frame is provided. Is it possible to extract also time-dependent features given many images at different time steps? Otherwise which is the state-of-the-art techniques for doing so?
It's the first time I use CNN and I would also appreciate any reference or any other suggestion.
If you want to be able to have the network somehow recognize a progression which is time dependent, you should probably look into recurrent neural nets (RNN). Since you would be operating on video, you should look into recurrent convolutional neural nets (RCNN) such as in: http://jmlr.org/proceedings/papers/v32/pinheiro14.pdf
Recurrence adds some memory of a previous state of the input data. See this good explanation by Karpathy: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
In your case you need to have the recurrence across multiple images instead of just within one image. It would seem like the first problem you need to solve is the image segmentation problem (being able to pick the ball out of the rest of the image) and the first paper linked above deals with segmentation. (then again, maybe you're trying to take advantage of the movement in order to identify the moving object?)
Here's another thought: perhaps you could only look at differences between sequential frames and use that as your input data to your convnet? The input "image" would then show where the moving object was in the previous frame and where it is in the current one. Larger differences would indicate larger amounts of movement. That would probably have a similar effect to using a recurrent network.
I am reading about applications of clustering in human motion analysis. I started out with random numbers and applied k-means clustering algorithm but I wanted to have some graphs that circle the clusters as shown in the picture. Basically, the lines represent the motion trajectory. I will appreciate ideas on how to obtain motion trajectory of a person. Application is patient monitoring where the trajectory will be used in abnormal behavior activity.
I will be using a kinect and recording the motion trajectory based on skeleton tracking. So, I will be recording the 4 quaternion values of Head, Shoulder and Torso joints and the RGBD (Red green blue Depth) that is combined as 1 value for these joints. So, a total of 4*3 + 3 = 15 time series. So, there are 15 variables. How do I convert them to represent the trajectories shown below and then apply clustering to cluster trajectories. The clusters will allow in classification.
Can somebody please show how to obtain the diagram similar to the one attached? and how do I fuse and convert the 15 time series from each person into a single trajectory.
The picture illustrates the number of clusters that are generated for the time series. Thank you in advance.
K-means is a bad fit for trajectories.
It needs to be able to compute the mean (which is why it is called "k-means"). Having a stable, sensible mean is important. But how meaningful is the mean of some time series, even if you could define it (and the series weren't e.g. of different length, and different movement speed)?
Try hierarchical clustering, and multivariate dynamic time warping.
I'm fairly new to MATLAB, but have acquainted myself with Simulink and Computer Vision over the past few days. My problem statement involves taking a traffic/highway video input and detecting if an accident has occurred.
I plan to do this by extracting the values of centroid to plot trajectory, velocity difference (between frames) and distance between two vehicles. I can successfully track the centroids, and aim to derive the rest of the features.
What I don't know is how to map these to ANN. I mean, every image has more than one vehicle blobs, which means, there are multiple centroids in a single frame/image. So, how does NN act on multiple inputs (the extracted features per vehicle) simultaneously? I am obviously missing the link. Help me figure it out please.
Also, am I looking at time series data?
I am not exactly sure about your question. The problem can be both time series data and not. You might be able to transform the time series version of the problem, such that it can be solved using ANN, but it is sort of a Maslow's hammer :). Also, Could you rephrase the problem.
As you said, you could give it features from two or three frames and then use the classifier to detect accident or not, but it might be difficult to train such a classifier. The problem is really difficult and the so you might need tons of training samples to get it right, esp really good negative samples (for examples cars travelling close to each other) etc.
There are multiple ways you can try to solve this problem of accident detection. For example : Build a classifier (ANN/SVM etc) to detect accidents without time series data. In which case your input would be accident images and non accident images or some sort of positive and negative samples for training and later images for test. In this specific case, you are not looking at the time series data. But here you might need lots of features to detect the same (this in some sense a single frame version of the problem).
The second method would be to use time series data, in which case you will have to detect the features, track the features (say using Lucas Kanade/Horn and Schunck) and then use the information about velocity and centroid to detect the accident. You might even be able to formulate it for HMMs.