Could someone please explain the difference between i-vector and d-vector? All I know about them is that they are widely used in speaker/speech recognition systems and that they are a kind of template for representing speaker information, but I don't know the main differences.
An i-vector is a feature that represents the idiosyncratic characteristics of the distribution of an utterance's frame-level features. I-vector extraction is essentially a dimensionality reduction of the GMM supervector (although the GMM supervector is not explicitly extracted when computing the i-vector). It is extracted in a manner similar to the eigenvoice adaptation scheme or the JFA technique, but per sentence (or per input speech sample).
On the other hand, a d-vector is extracted using a DNN. To extract a d-vector, a DNN is trained that takes stacked filterbank features as input (similar to the DNN acoustic model used in ASR) and produces the one-hot speaker label (or the speaker probability) at the output. The d-vector is the averaged activation of the last hidden layer of this DNN. So, unlike the i-vector framework, it makes no assumptions about the features' distribution (the i-vector framework assumes that the i-vector, i.e. the latent variable, has a Gaussian distribution).
So in conclusion, these are two distinct features extracted using totally different methods and assumptions. I recommend reading these papers:
N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.
E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. G-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in Proc. ICASSP, 2014, pp. 4080-4084.
I don't know how to properly characterize the d-vector in plain language, but I can help a little.
The identity vector, or i-vector, is a spectral signature for a particular slice of speech, usually a sliver of a phoneme, rarely (as far as I can see) as large as the entire phoneme. Basically, it's a discrete spectrogram expressed in a form isomorphic to the Gaussian mixture of the time slice.
EDIT
Thanks to those who provided comments and a superior answer. I updated this only to replace the incorrect information from my original attempt.
A d-vector is extracted from a deep neural network (DNN): it is the mean of the feature vectors in the DNN's final hidden layer. This becomes the model for the speaker, which is compared against other speech samples for identification.
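As a rough illustration of the averaging step described above, here is a minimal sketch in Python. The activation arrays are random stand-ins for the last-hidden-layer outputs of a trained DNN (no network is trained here), and both the L2 normalisation and the cosine scoring are assumptions added for the example, not something the answers above specify.

```python
import numpy as np

def d_vector(hidden_activations):
    """Average last-hidden-layer activations over frames, then L2-normalise.

    hidden_activations: array of shape (num_frames, hidden_dim).
    The normalisation is an assumption made for this sketch.
    """
    mean = np.asarray(hidden_activations, dtype=float).mean(axis=0)
    norm = np.linalg.norm(mean)
    return mean / norm if norm > 0 else mean

def cosine_score(a, b):
    """Cosine similarity between two unit-length d-vectors."""
    return float(np.dot(a, b))

# Hypothetical activations standing in for the output of a trained DNN.
rng = np.random.default_rng(0)
speaker_a_enrol = rng.normal(loc=1.0, size=(200, 8))
speaker_a_test = rng.normal(loc=1.0, size=(150, 8))
speaker_b_test = rng.normal(loc=-1.0, size=(150, 8))

enrolled = d_vector(speaker_a_enrol)  # the speaker model
same = cosine_score(enrolled, d_vector(speaker_a_test))
different = cosine_score(enrolled, d_vector(speaker_b_test))
print(same > different)
```

In a real system the arrays would come from a forward pass of the speaker-classification DNN over the frames of each utterance.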
Related
I've read a few papers about speech recognition based on neural networks, the Gaussian mixture model and the hidden Markov model. During my research, I came across the paper "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition" by George E. Dahl, Dong Yu, et al. I think I understand most of the presented idea, but I still have trouble with some details. I would really appreciate it if someone could enlighten me.
As I understand it, the procedure consists of three elements:
Input
The audio stream is split into frames of 10 ms and processed by an MFCC front end, which outputs a feature vector.
DNN
The neural network takes the feature vector as input and processes the features so that each frame (phone) is distinguishable, or rather gives a representation of the phone in context.
HMM
The HMM is a state model in which each state represents a tri-phone. Each state has a set of probabilities for changing to each of the other states.
Now the output layer of the DNN produces a feature vector, that tells the current state to which state it has to change next.
What I don't get: how are the features of the output layer (DNN) mapped to the probabilities of the states? And how is the HMM created in the first place? Where do I get all the information about the probabilities?
I don't need to understand every detail; the basic concept is sufficient for my purpose. I just need to make sure that my basic thinking about the process is right.
During my research, I came across the paper "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition" by George E. Dahl, Dong Yu, et al. I think I understand most of the presented idea, but I still have trouble with some details.
It is better to read a textbook, not a research paper.
so that each frame (phone) is distinguishable, or rather gives a representation of the phone in context.
This sentence does not have a clear meaning, which suggests you are not quite sure yourself. The DNN takes the features of a frame and produces the probabilities for the states.
The HMM is a state model in which each state represents a tri-phone.
Not necessarily a triphone. Usually there are tied triphones, which means several triphones correspond to the same state.
Now the output layer of the DNN produces a feature vector
No, the DNN produces state probabilities for the current frame; it does not produce a feature vector.
that tells the current state to which state it has to change next.
No, the next state is selected by the HMM Viterbi algorithm based on the current state and the DNN probabilities. The DNN alone does not decide the next state.
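To make the division of labour concrete, here is a minimal Viterbi sketch in Python. Everything in it is a toy assumption: two states, hand-written transition probabilities, and made-up per-frame DNN posteriors. Dividing the posteriors by the state priors to obtain scaled likelihoods is common practice in hybrid DNN/HMM systems, but it is not stated in this answer, so treat that step as an assumption too.

```python
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """Return the most likely state path.

    log_init:  (S,)   log initial state probabilities
    log_trans: (S, S) log_trans[i, j] = log P(next = j | current = i)
    log_obs:   (T, S) per-frame log scores for each state
    """
    num_frames, _ = log_obs.shape
    score = log_init + log_obs[0]
    backptr = np.zeros(log_obs.shape, dtype=int)
    for t in range(1, num_frames):
        cand = score[:, None] + log_trans  # cand[i, j]: reach j coming from i
        backptr[t] = cand.argmax(axis=0)   # best predecessor of each state
        score = cand.max(axis=0) + log_obs[t]
    # Trace back from the best final state.
    path = [int(score.argmax())]
    for t in range(num_frames - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Toy left-to-right HMM with two states (hypothetical numbers).
with np.errstate(divide="ignore"):  # log(0) = -inf is intended here
    log_init = np.log(np.array([1.0, 0.0]))
    log_trans = np.log(np.array([[0.7, 0.3],
                                 [0.0, 1.0]]))

# Made-up DNN outputs: P(state | frame) for four frames.
posteriors = np.array([[0.9, 0.1],
                       [0.8, 0.2],
                       [0.2, 0.8],
                       [0.1, 0.9]])
priors = np.array([0.5, 0.5])  # assumed state priors

# Scaled likelihoods: p(frame | state) is proportional to posterior / prior.
log_obs = np.log(posteriors) - np.log(priors)

print(viterbi(log_init, log_trans, log_obs))  # [0, 0, 1, 1]
```

The DNN only supplies the per-frame scores in `log_obs`; the path itself comes from combining those scores with the HMM transition probabilities.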
What I don't get: How are the features of the output layer(DNN) mapped to the probabilities of the state.
The output layer produces probabilities. It says that phone A at this frame has probability 0.9 and phone B at this frame has probability 0.1.
And how is the HMM created in the first place?
Unlike end-to-end systems, which do not use an HMM, the HMM is usually trained as part of an HMM/GMM system with the Baum-Welch algorithm before the DNN is initialized. So you first train the GMM/HMM with Baum-Welch, then you train the DNN to improve on the GMM.
Where do I get all the information about the probabilities?
It is hard to understand your last question.
I am trying to extract common patterns that always appear whenever a certain event occurs.
For example, patients A, B, and C all had a heart attack. Using the readings from their pulse, I want to find the common patterns before the heart attack struck.
In the next stage I want to do this using multiple dimensions. For example, using the readings of the patients' pulse, temperature, and blood pressure, what are the common patterns that occurred in the three dimensions, taking into consideration the time and order between the dimensions?
What is the best way to solve this problem using Neural Networks and which type of network is best?
(Just need some pointing in the right direction)
Thank you all for reading.
The described problem looks like a time series prediction problem, that is, a basic prediction problem for a continuous or discrete phenomenon generated by some existing process. As raw data for this problem we have a sequence of samples x(t), x(t+1), x(t+2), ..., where x() is the output of the process under consideration and t is some arbitrary timepoint.
For an artificial neural network (ANN) solution, we reorganize the raw data into new sequences. As you probably know, X is the matrix of input vectors used in ANN training. For time series prediction we construct a new collection according to the following scheme.
In the most basic form, your input vector x is a sequence of samples (x(t-k), x(t-k+1), ..., x(t-1), x(t)): the sample taken at some arbitrary timepoint t together with its predecessor samples from timepoints t-k, t-k+1, ..., t-1. You should generate such an example for every possible timepoint t.
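The windowing scheme above can be sketched in a few lines of Python. The function name `make_windows` and the `horizon` parameter are hypothetical names chosen for this illustration; the target here is the sample `horizon` steps after t, which is one possible choice of output.

```python
import numpy as np

def make_windows(series, k, horizon=1):
    """Build training pairs from a time series.

    Each row of X is (x(t-k), ..., x(t-1), x(t)) and the matching
    entry of y is x(t + horizon). Works for 1-D series of shape (N,)
    and for multi-channel series of shape (N, channels).
    """
    series = np.asarray(series, dtype=float)
    n = len(series) - k - horizon  # number of complete examples
    if n <= 0:
        raise ValueError("series too short for this k and horizon")
    X = np.stack([series[i:i + k + 1] for i in range(n)])
    y = series[k + horizon:k + horizon + n]
    return X, y

# Toy series 0, 1, ..., 9 with k = 2 predecessors per example.
X, y = make_windows(np.arange(10), k=2)
print(X[0], y[0])    # [0. 1. 2.] 3.0
print(X[-1], y[-1])  # [6. 7. 8.] 9.0
```

For several measurements per timepoint (for example pulse, temperature and blood pressure), passing an array of shape (N, 3) yields X of shape (n, k+1, 3), which keeps the time order across the dimensions.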
But the key is to preprocess data so that we get the best prediction results.
Assuming your data (phenomenon) is continuous, you should consider applying some sampling technique. You could start by experimenting with some naive sampling period Δt, but there are stronger methods. See for example the Nyquist–Shannon sampling theorem, where the key idea is to allow recovery of the continuous x(t) from discrete samples taken every Δt. This is reasonable given that we probably expect our ANNs to do this.
Assuming your data is discrete, you should still try sampling, as this will speed up your computations and might provide better generalization. But the key advice is: do experiments! The best architecture depends on the data, and the data also need to be preprocessed correctly.
The next thing is the network's output layer. From your question, it appears that this will be a binary class prediction. But maybe a wider prediction vector is worth considering? How about predicting the future of the considered samples, that is x(t+1), x(t+2), and experimenting with different horizons (lengths of the future)?
Further reading:
Somebody mentioned Python here. Here is a good tutorial on time series prediction with Keras: Victor Schmidt, Keras recurrent tutorial, Deep Learning Tutorials
This paper is good if you need some real example: Fessant, Francoise, Samy Bengio, and Daniel Collobert. "On the prediction of solar activity using different neural network models." Annales Geophysicae. Vol. 14. No. 1. 1996.
I want to implement a filter algorithm for dimension reduction using symmetrical uncertainty. I do not know how to write mathematical equations containing probabilities.
e.g. H(X) = -sum over x of p(x)*log2(p(x)). There are many equations like that.
Please tell me how to write this type of equation.
Check out the ITMO_FS library, which contains different filters, wrappers, hybrids, and embedded feature selection techniques including symmetrical uncertainty.
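If you would rather write the equations yourself, here is a minimal sketch in plain Python. It is independent of ITMO_FS and does not reproduce that library's API. It assumes discrete (or already discretized) features, estimates each probability p(x) as a relative frequency, and uses the usual definition SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)) with I(X; Y) = H(X) + H(Y) - H(X, Y).

```python
from collections import Counter
from math import log2

def entropy(values):
    """H(X) = -sum_x p(x) * log2(p(x)), with p(x) estimated from counts."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    hxy = entropy(list(zip(x, y)))  # joint entropy H(X, Y)
    mutual_info = hx + hy - hxy     # I(X; Y)
    return 2.0 * mutual_info / (hx + hy)

feature = [0, 0, 1, 1]
print(symmetrical_uncertainty(feature, [0, 0, 1, 1]))  # 1.0 (identical)
print(symmetrical_uncertainty(feature, [0, 1, 0, 1]))  # 0.0 (independent)
```

A filter method would compute this score between each feature and the class label and keep the highest-scoring features.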
I'm currently developing an application which uses the camera of an iOS device to recognise equations from a photo and then match them to the correct equation in a library or database: basically an equation scanner. For example, you could scan an image of the Uncertainty Principle or the Schrödinger Equation and the iOS device would be able to tell the user its name and give certain feedback.
I was wondering how to implement this using Xcode. I was thinking of using an open-source framework such as Tesseract OCR or OpenCV, but I'm not sure how to apply these to equations.
Any help would be greatly appreciated.
Thanks.
Here's the reason why this is super ambitious. What OCR is doing is basically taking a confined set of dots and trying to match it to one of a number of members of a very small set. What you are talking about doing is more at the idiom than the character level. For instance, if I do a representation of Bayes' Rule as an equation, I have something like:
P(A|B) = P(B|A)P(A)/P(B)
Even if it recognizes each of those characters successfully, you then have to match up features in the equation to families of equations. Not to mention, this is only one representation of Bayes' Rule. There are others that use sigma notation (Laplace's variant), and some use logs so they don't have to special-case 0s.
This, btw, could be done with Bayes. Here are a few thoughts on that:
First you would have to treat the equations as Classifications, and you would have to describe them in terms of a set of features, for instance, the presence of Sigma Notation, or the application of a log.
The System would then be trained by being shown all the equations you want it to recognize, presumably several variations of each (per above). Then these classifications would have feature distributions.
Finally, when shown a new equation, the system would have to find each of these features, and then loop through the classifications and compute the overall probability that the equation matches the given classification.
This is how 90% of spam engines are done, but there, they only have two classifications: spam and not spam, and the feature representations are ludicrously simple: merely ratios of word occurrences in different document types.
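The three steps above can be sketched as a tiny Bernoulli naive Bayes classifier in plain Python. The feature names, the two equation classes and the training examples are all hypothetical, and the sketch assumes the features have already been detected in the image; Laplace smoothing is an added assumption to avoid zero probabilities.

```python
from math import log

# Hypothetical binary features that a front end would have to detect.
FEATURES = ["has_sigma", "has_log", "has_fraction", "has_conditional_bar"]

def train(examples):
    """examples: list of (label, set_of_present_features)."""
    counts, totals = {}, {}
    for label, present in examples:
        totals[label] = totals.get(label, 0) + 1
        per_label = counts.setdefault(label, {f: 0 for f in FEATURES})
        for f in present:
            if f in per_label:
                per_label[f] += 1
    model = {}
    for label, total in totals.items():
        # Laplace smoothing keeps every probability strictly in (0, 1).
        probs = {f: (counts[label][f] + 1) / (total + 2) for f in FEATURES}
        model[label] = (total / len(examples), probs)
    return model

def classify(model, present):
    """Return the label with the highest posterior log-probability."""
    best_label, best_score = None, float("-inf")
    for label, (prior, probs) in model.items():
        score = log(prior)
        for f in FEATURES:
            score += log(probs[f] if f in present else 1.0 - probs[f])
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Several made-up variations of each equation, as described above.
training = [
    ("bayes_rule", {"has_fraction", "has_conditional_bar"}),
    ("bayes_rule", {"has_sigma", "has_fraction", "has_conditional_bar"}),
    ("bayes_rule", {"has_log", "has_conditional_bar"}),
    ("entropy", {"has_sigma", "has_log"}),
    ("entropy", {"has_log"}),
]
model = train(training)
print(classify(model, {"has_fraction", "has_conditional_bar"}))  # bayes_rule
print(classify(model, {"has_sigma", "has_log"}))                 # entropy
```

The hard part remains the feature detection itself, which this sketch takes as given.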
Interesting problem, surely no simple answer.
I am studying Support Vector Machines (SVM) by reading a lot of material. However, it seems that most of it focuses on how to classify the input 2D data by mapping it using several kernels such as linear, polynomial, RBF / Gaussian, etc.
My first question is, can SVM handle high-dimensional (n-D) input data?
According to what I found, the answer is YES!
If my understanding is correct, n-D input data will be
(1) constructed in Hilbert hyperspace, then those data will be
(2) simplified by using some approach (such as PCA?) to combine them together / project them back to a 2D plane, so that
(3) the kernel methods can map them into an appropriate shape such that a line or curve can separate them into distinct groups.
This means most of the guides / tutorials focus on step (3). But some toolboxes I've checked cannot plot if the input data has more than 2 dimensions. How can the data be projected to 2D afterwards?
If there is no projection of data, how can they classify it?
My second question is: is my understanding correct?
My first question is, can SVM handle high-dimensional (n-D) input data?
Yes. I have dealt with data where n > 2500 when using LIBSVM software: http://www.csie.ntu.edu.tw/~cjlin/libsvm/. I used linear and RBF kernels.
My second question is: is my understanding correct?
I'm not entirely sure what you mean here, so I'll try to comment on what you said most recently. I believe your intuition is generally correct. Data is "constructed" in some n-dimensional space, and a hyperplane of dimension n-1 is used to classify the data into two groups. However, by using kernel methods, it's possible to generate this information using linear methods and not consume all the memory of your computer.
I'm not sure if you've seen this already, but if you haven't, you may be interested in some of the information in this paper: http://pyml.sourceforge.net/doc/howto.pdf. I've copied and pasted a part of the text that may appeal to your thoughts:
A kernel method is an algorithm that depends on the data only through dot-products. When this is the case, the dot product can be replaced by a kernel function which computes a dot product in some possibly high dimensional feature space. This has two advantages: First, the ability to generate non-linear decision boundaries using methods designed for linear classifiers. Second, the use of kernel functions allows the user to apply a classifier to data that have no obvious fixed-dimensional vector space representation. The prime example of such data in bioinformatics are sequence, either DNA or protein, and protein structure.
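To illustrate the point about dot products, here is a small NumPy check. It uses the homogeneous degree-2 polynomial kernel as an example (my choice for illustration; the quoted text does not single out a kernel) and compares it with an explicit feature map on random 50-dimensional inputs.

```python
import numpy as np

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: k(x, z) = (x . z)^2."""
    return float(np.dot(x, z)) ** 2

def poly2_features(x):
    """Explicit feature map phi with phi(x) . phi(z) = (x . z)^2.

    It lists all pairwise products x_i * x_j, i.e. n^2 features
    for an n-dimensional input.
    """
    return np.outer(x, x).ravel()

rng = np.random.default_rng(0)
x, z = rng.normal(size=50), rng.normal(size=50)  # two 50-D inputs

implicit = poly2_kernel(x, z)  # one dot product in 50 dimensions
explicit = float(np.dot(poly2_features(x), poly2_features(z)))

print(poly2_features(x).shape)         # (2500,)
print(np.isclose(implicit, explicit))  # True
```

The kernel gives the same value as the dot product in the 2500-dimensional feature space without ever building that space, and nothing is projected down to 2D at any point.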
It would also help if you could explain which "guides" you are referring to. I don't think I've ever had to project data onto a 2-D plane before, and it doesn't make sense to do so anyway for data with a very large number of dimensions (or "features", as they are called in LIBSVM). Using suitably selected kernel methods should be enough to classify such data.