simple speech recognition methods - neural-network

Yes, I'm aware that speech recognition is fairly complicated (as an understatement). What I'm looking for is a method for distinguishing between maybe 20-30 phrases. An ability to split words (discrete speech is fine) would be nice, but isn't required. The software will be user-dependent (i.e., for use by me). I'm not looking for existing software, but for a good way of going about doing this myself. I've looked into various existing methods and it seems like splitting the sound into phonemes, while common, is somewhat excessive for my needs.
For some context, I'm just looking for a way to control some aspects of my computer with a few simple voice commands. I'm aware that Windows already has speech recognition software, but I'd like to go about this one myself as a learning exercise. Commands would be simple, like "Open Google" or "Mute". What I had in mind (not sure if this is a good idea) is that some commands would be compound. So "Mute" would just be "Mute", whereas the "Open" command could be recognized individually, and then have its suffixes (Google, Photoshop, etc.) recognized with another network/model/whatever. But I'm not sure if looking for prefixes/word breaks in this way would produce better results than having to deal with an increased number of individual commands.
I've been looking into perceptrons, Hopfield networks (though they're somewhat obsolete from what I understand) and HMMs, and while I understand the ideas behind these (I've implemented the ANNs before), I don't really know which is best suited to this task. I'm assuming that learning vector quantization (LVQ) models would also be appropriate, but I can't really find much literature to this end. Any guidance/resources would be greatly appreciated.

There are some open source projects in speech recognition:
HTK (Hidden Markov Models Toolkit)
Both have decoder, training, and language model toolkits: everything needed to build a complete and robust speech recognizer.
Voxforge has acoustic and language models for both open source speech recognition toolkits.

Some time ago, I read a whitepaper about a limited vocabulary system, which used a simple recognition process. The system divided each utterance into a small number of bins (6 in time, and 4 in magnitude, if I remember correctly, for 24 total), and all it did was count the number of sample audio measurements in each bin. There was a fuzzy logic rule base which then interpreted each utterance's 24 bin counts, and generated an interpretation.
I imagine that (for some applications) a simple matching process might work just as well, in which the 24 bin counts of the current utterance are simply matched against those of each of your stored prototypes, and the one with the least overall difference is the winner.
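A minimal sketch of that matching idea, in Python. The whitepaper's details are not available here, so several things are assumptions: samples are floats, magnitude is normalized by the utterance's own peak, "least overall difference" is taken as the sum of absolute differences, and the names `bin_counts` and `classify` are made up for illustration.

```python
def bin_counts(samples, time_bins=6, mag_bins=4):
    """Count samples falling in each (time, magnitude) cell of one utterance."""
    counts = [0] * (time_bins * mag_bins)
    n = len(samples)
    # Assumption: scale magnitudes by the utterance's own peak value.
    peak = max((abs(s) for s in samples), default=0.0) or 1.0
    for i, s in enumerate(samples):
        t = min(i * time_bins // n, time_bins - 1)
        m = min(int(abs(s) / peak * mag_bins), mag_bins - 1)
        counts[t * mag_bins + m] += 1
    return counts


def classify(samples, prototypes):
    """Return the label whose stored counts differ least from this utterance."""
    counts = bin_counts(samples)
    return min(
        prototypes,
        key=lambda label: sum(abs(a - b) for a, b in zip(counts, prototypes[label])),
    )


# Toy "utterances": synthetic envelopes standing in for recorded commands.
prototypes = {
    "loud_then_quiet": bin_counts([0.9] * 60 + [0.1] * 60),
    "quiet_then_loud": bin_counts([0.1] * 60 + [0.9] * 60),
}
print(classify([0.8] * 60 + [0.15] * 60, prototypes))  # loud_then_quiet
```

With real recordings you would probably trim leading and trailing silence first, since the time bins are relative to the start and end of the sample list.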


Why such a bad performance for Moses using Europarl?

I have started playing around with Moses and tried to make what I believe would be a fairly standard baseline system. I have basically followed the steps described on the website, but instead of using news-commentary I have used Europarl v7 for training, with the WMT 2006 development set and the original Europarl common test. My idea was to do something similar to Le Nagard & Koehn (2010), who obtained a BLEU score of .68 in their baseline English-to-French system.
To summarise, my workflow was more or less this:
tokenizer.perl on everything
lowercase.perl (instead of truecase)
Train IRSTLM model using only French data from Europarl v7
train-model.perl exactly as described using WMT 2006 dev
Testing and measuring performances as described
And the resulting BLEU score is .26... This leads me to two questions:
Is this a typical BLEU score for this kind of baseline system? I realise Europarl is a pretty small corpus to train a monolingual language model on, even though this is how they do things on the Moses website.
Are there any typical pitfalls for someone just starting with SMT and/or Moses I may have fallen in? Or do researchers like Le Nagard & Koehn build their baseline systems in a way different from what is described on the Moses website, for instance using some larger, undisclosed corpus to train the language model?
Just to put things straight first: the .68 you are referring to has nothing to do with BLEU.
My idea was to do something similar to Le Nagard & Koehn (2010), who obtained a BLEU score of .68 in their baseline English-to-French system.
The article you refer to only states that 68% of the pronouns (using co-reference resolution) were translated correctly. It nowhere mentions that a .68 BLEU score was obtained. As a matter of fact, no scores were given, probably because the qualitative improvement the paper proposes cannot be measured with statistical significance (which happens a lot if you only improve on a small number of words). For this reason, the paper uses a manual evaluation of the pronouns only:
A better evaluation metric is the number of correctly
translated pronouns. This requires manual
inspection of the translation results.
This is where the .68 comes into play.
Now to answer your questions with respect to the .26 you got:
Is this a typical BLEU score for this kind of baseline system? I realise Europarl is a pretty small corpus to train a monolingual language model on, even though this is how they do things on the Moses website.
Yes it is. You can find the performance of WMT language pairs here
Are there any typical pitfalls for someone just starting with SMT and/or Moses I may have fallen in? Or do researchers like Le Nagard & Koehn build their baseline systems in a way different from what is described on the Moses website, for instance using some larger, undisclosed corpus to train the language model?
I assume that you trained your system correctly. With respect to the "undisclosed corpus" question: members of the academic community normally state for each experiment which data sets were used for training, testing, and tuning, at least in peer-reviewed publications. The only exception is the WMT task, where privately owned corpora may be used if the system participates in the unconstrained track. But even then, people will mention that they used additional data.
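To make it concrete why a BLEU score and a "68% of pronouns correct" figure are not comparable, here is a simplified sketch of corpus-level BLEU in Python. It assumes one reference per sentence and pre-tokenized input, and it is not the Moses scoring script; the function names are made up for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU in [0, 1], one reference per hypothesis."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            # Clipping: an n-gram is credited at most as often as the reference has it.
            matched[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    # Geometric mean of the 1..max_n gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty punishes output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


reference = "the cat sat on the mat".split()
print(corpus_bleu([reference], [reference]))
print(corpus_bleu(["the cat sat on a mat".split()], [reference]))
```

A single wrong word out of six already pulls the score well below 1, because it breaks several 2-, 3- and 4-grams at once. That is why BLEU scores for real systems sit far below what a per-word accuracy figure would suggest.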

What is the relation between OCR and Artificial Neural Network?

I saw different articles speaking about OCR form recognition (data extraction) and they said that they used Neural Network in order to do form recognition, so what's the relation between Artificial Neural network (ANN) and form recognition? If I want to extract fields from a BusinessCard, is it required to use ANN or is it optional? In other words when do I need to use ANN and when I don't?
It's a little different. An ANN is just one "expert" within OCR, and OCR engines contain many experts. When you study ANNs you will build a simple OCR engine using just an ANN, but this does not compare to modern engines that use it in conjunction with tri-grams, morphology, data types (very important for BCR and forms), dictionaries, connected-components algorithms, etc. So look at it as just one of the tools in the bag of tricks for extracting quality results. A good engine will incorporate an ANN and all the others. In BCR there are additional considerations: it should rely heavily on connected components and dictionaries first, then use an ANN and pattern matching for the actual recognition.
An ANN is one way to perform OCR. There are others. Hence, if you want to extract fields from a business card, using an ANN is optional.
Good question. I recently spent some time playing with OCRopus, a Google project that does OCR - you can get it for free and play with it yourself. I'm pretty sure that it has an ANN as one of the modules behind it. However, the whole process of Optical Character Recognition can have many steps (lots of different little modules that each do something and pass the results to the next module).
So, here are some of the things I remember as being done by modules in that project:
There was a module that turned the image into black and white - this makes it easier for later modules to deal with.
Getting rid of speckles / spackles.
Straightening out the lines of text.
Breaking lines of text into individual words (it's been a few weeks, not sure about this one)
Basically, you can do the above using little bits of code that don't involve a neural net. So it's simpler doing it with these little bits of code.
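As an illustration of how little code the first two steps need, here is a sketch in Python on a plain list-of-lists image. This is not OCRopus code; the fixed threshold of 128 and the "no ink among the 8 neighbours" rule for a speckle are assumptions chosen to keep the example short.

```python
def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to 1 = ink, 0 = background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def despeckle(binary):
    """Clear ink pixels that have no ink among their 8 neighbours."""
    h = len(binary)
    w = len(binary[0]) if h else 0
    out = [row[:] for row in binary]
    for y in range(h):
        for x in range(w):
            if not binary[y][x]:
                continue
            neighbours = sum(
                binary[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                out[y][x] = 0
    return out


# A vertical stroke in column 1 plus one isolated dark pixel at row 2, column 4.
gray = [
    [255, 0, 255, 255, 255],
    [255, 0, 255, 255, 255],
    [255, 0, 255, 255, 40],
    [255, 0, 255, 255, 255],
    [255, 255, 255, 255, 255],
]
clean = despeckle(binarize(gray))
for row in clean:
    print("".join("#" if px else "." for px in row))
```

The stroke survives and the lone pixel is dropped. Real engines usually pick the threshold adaptively and remove small connected components rather than single pixels.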
The neural net I think is used just to recognize the individual characters - which character of a group of possible characters is it.
There's a training command in OCRopus that I had running for over a week on end, and it kept sending line samples to the map, slowly changing the map as it went. I think it was training the ANN part.

Audio File Matching Program

I'm trying to write a program for the iPhone that can take two audio files (e.g. WAV) as inputs, compare them, and spit out a number that tells you how similar the audio files are.
If someone has done something like this, knows how to go about doing it, or just has some ideas, please let me know. Anything will be greatly appreciated.
Specific questions: What language is suitable? How hard is it to do (how many hours, roughly)? Where can I find a good source of audio library/tools?
I'd say it's pretty hard, not so much the implementation, but coming up with a reasonable definition of 'similar'.
That said, you're probably looking at techniques like autocorrelation and FFT, both of which are CPU-intensive tasks, so I'd say a fully-compiled language (C, C++, don't know about Objective-C) would be most suitable at least for the actual calculations. Also, you're facing a somewhat underpowered platform for such tasks (if only because uncompressed audio files are pretty large), so you're in for quite some optimization.
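One possible definition of "similar" along these lines is the best normalized correlation between the two signals over a small range of time shifts. The sketch below is in Python for readability (on the phone you would port it to a compiled language); it assumes both files are already decoded to lists of samples at the same rate, and the function names are made up for illustration.

```python
import math


def normalized_correlation(a, b):
    """Correlation coefficient of two sample lists, in [-1, 1]."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    a, b = a[:n], b[:n]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [x - mean_b for x in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(x * x for x in db))
    return sum(x * y for x, y in zip(da, db)) / denom if denom else 0.0


def best_lag_similarity(a, b, max_lag):
    """Shift one signal against the other by up to max_lag samples; keep the best score."""
    best = -1.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            score = normalized_correlation(a[lag:], b)
        else:
            score = normalized_correlation(a, b[-lag:])
        best = max(best, score)
    return best


tone = [math.sin(2 * math.pi * 5 * i / 200) for i in range(200)]
delayed = [0.0] * 7 + tone                      # same sound, starting 7 samples later
other = [math.sin(2 * math.pi * 13 * i / 200) for i in range(200)]
print(best_lag_similarity(delayed, tone, 10))   # close to 1
print(best_lag_similarity(tone, other, 10))     # much lower
```

This brute-force loop costs roughly (number of lags) x (number of samples) operations; computing the correlation through an FFT is the usual way to make it affordable for long files. It also only works for near-identical recordings, which is one reason fingerprinting approaches compare spectral features instead of raw samples.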
This book: is quite concise reading for all things DSP-related.
Sounds similar to what 'Shazam' does - awesome iPhone app by the way, check it out if you haven't already (it's free too).
A while ago there was an article on how Shazam works, read it here. It takes an acoustic fingerprint and compares it to other songs' fingerprints, returning the closest match.
I would say there is a lot of math, probably some matrices and maybe Fourier transforms involved in fingerprinting and then trying to compare the audio.
Probably would take a good while to program. If your math skills are up to it though, sounds like a good challenge :-)
EDIT: turns out there was some source code on the site I linked. It's in Java but would be well worth a look through before you start writing your own. Source code here
I am working on something similar in Java on a speech recognition app.
I would recommend using MFCC (requires calculating FFT) for feature extraction and neural networks or some other sort of machine learning technique for training and recognition. You train the NN with the features extracted from the reference WAV file, more precisely from consecutive equal-length slices/windows of that audio file. Then you use the NN to detect if another file, also split into slices, has the same features.
This is the basic idea upon which you can elaborate to further your own specifications, or exactly what you want your app to do.
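A stripped-down sketch of the slicing-and-comparing part, in Python. Two simplifications are swapped in and should be read as assumptions: plain DFT magnitude spectra stand in for MFCCs (the mel filterbank and cepstrum steps are omitted), and a mean cosine similarity between corresponding windows stands in for the trained neural network. All function names are hypothetical.

```python
import cmath
import math


def frames(samples, size, hop):
    """Split samples into consecutive equal-length windows (incomplete tail dropped)."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, hop)]


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 after a Hann window (an FFT would be faster)."""
    n = len(frame)
    windowed = [
        x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(windowed)))
        for k in range(n // 2 + 1)
    ]


def cosine(u, v):
    """Cosine similarity of two feature vectors."""
    norm = math.sqrt(sum(x * x for x in u) * sum(y * y for y in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


def file_similarity(a, b, size=64, hop=32):
    """Mean cosine similarity between the spectra of corresponding windows."""
    pairs = list(zip(frames(a, size, hop), frames(b, size, hop)))
    if not pairs:
        return 0.0
    return sum(cosine(magnitude_spectrum(x), magnitude_spectrum(y)) for x, y in pairs) / len(pairs)


low = [math.sin(2 * math.pi * 8 * i / 64) for i in range(256)]
high = [math.sin(2 * math.pi * 20 * i / 64) for i in range(256)]
print(file_similarity(low, low))    # close to 1
print(file_similarity(low, high))   # close to 0
```

Comparing window k of one file with window k of the other assumes the files are time-aligned; a trained classifier or a sequence-alignment step is what removes that restriction in a real system.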
In terms of libraries in Objective-C, I think you can find a few for the signal processing part (FFT and such). As for the machine learning part, I have no idea what you could find.
As for programming time, it's hard to estimate because it depends on a lot of details. I would say somewhere around a week, but that's just a rough estimate.
PS: MFCC stands for Mel-Frequency Cepstral Coefficients.

Matlab vs Aforge vs OpenCV

I am about to start a project in visual image processing and have had no experience with Matlab, AForge, or OpenCV, and was wondering if anyone had any experience with these different software packages.
I was also wondering which of the three packages is the most efficient. I assume OpenCV, but has anyone had any experience?
The question you need to ask yourself is which is more important - your time or the computer's time. If your task is really simple, you may be able to code it up in MATLAB and have it work right off the bat. MATLAB is by far the easiest for development - a scripted language with built-in memory management, a huge array of provided functions, and a great interface for displaying and manipulating data while debugging.
On the other hand, MATLAB is at least an order of magnitude slower than compiled OpenCV code for many tasks. This is especially true if you use the Intel Performance Primitives libraries.
If you know how to code in MATLAB, I would suggest writing and debugging your algorithms in that language, then porting them to C/C++ with OpenCV for speed. If there are only a couple of simple functions that you need to speed up, you can call C code from MATLAB, but it's hard to get this working right the first few times you try it, so you're probably better off just rewriting your finished code entirely in C/C++.
First, please elaborate about your project's needs. It has the biggest impact on the choice, in addition to other factors - your general programming knowledge (If you haven't dealt with dot net but just with C++, AForge is not a good choice, for example).
Both AForge and OpenCV have a built-in interface to .NET, and OpenCV also has interfaces for C++, Python, and more. Matlab might be more efficient, but if you don't have any experience with it, you will also have to learn its syntax. Take that into consideration.
Matlab probably has the largest variety of functions, but it is more complicated than the other projects. OpenCV and AForge themselves have some differences - see them described in this StackOverflow question/ answers.
I worked last year on two similar projects involving cars on the highway. AFAIK, Matlab lets you process only one picture frame at a time (though you could certainly write an algorithm to handle a stream), but with Simulink you can process the stream directly.
On the other hand, I found AForge a lot friendlier and easier to use, since you can easily adjust the processing parameters from a GUI (not so fast or easy to do in Matlab/Simulink).
I'd go for Aforge.Net. It's also fast enough if you're worrying about processing speed. (using 640x480)
If you are asking about using one of these in .NET, here is a quick summary, with my own score for each:
1 - Matlab: mostly used for simulating projects, not for the end-prototype project; my number: 30
2 - AForge (which I've used in many projects): if you do not need a circular process, like capturing images or recognizing something in images, you'll find it very good, because it is easy to use yet useful for single processes; my number: 50
3 - OpenCV: very good at speed and useful for circular processes; for example, you can capture images from a webcam and instantly cartoonize them without any delay. It is not as easy to use as AForge, but I like it anyway because of its speed and the many functions it provides, covering mostly anything we need in programming; my number: 80
Dr.Taha -

How does music fingerprinting work (for sites such as Shazam and

My large (120gb) music collection contains many duplicate songs, and I've been trying to fingerprint tracks in the hopes of detecting duplicates. And since I'm a CS Major I'm very curious as to what is done out there? Nothing I do has nearly the accuracy of something like Shazam or How do they "hash" tracks? I have run a standard MD5 hash on all my files (26,000 files) and I found hundreds of equal hashes on different tracks, so that doesn't work.
I'm more interested in since they work with full files, unlike Shazam, but I'm assuming both use a similar technique. Can anyone explain how to generate unique identifiers for music?
The seminal paper on audio fingerprinting is the work by Haitsma and Kalker in 2002-03. For each frame of audio, it preprocesses (differences across time frames and frequency bands) and then stores a binarized version of the frame's spectrum.
This procedure adds robustness. If the entire signal is shifted in time, it still works (at least, one can derive a lower bound on performance degradation). It is pretty robust to environmental noise. Since its inception, there have been many papers on low-level music similarity, so there is no single answer.
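The binarization step can be sketched as follows, in Python. Each bit is the sign of a band-energy difference taken across neighbouring frequency bands and then across consecutive frames. The sketch assumes the per-frame band energies have already been computed (the paper uses 33 bands, giving 32 bits per frame); the function names and the tiny 4-band example are made up for illustration.

```python
def fingerprint_bits(band_energies):
    """band_energies[n][m] is the energy of band m in frame n.

    Returns one row of bits per frame, starting from the second frame.
    """
    rows = []
    for prev, cur in zip(band_energies, band_energies[1:]):
        rows.append([
            # Difference across adjacent bands, then across adjacent frames.
            1 if (cur[m] - cur[m + 1]) - (prev[m] - prev[m + 1]) > 0 else 0
            for m in range(len(cur) - 1)
        ])
    return rows


def bit_error_rate(fp_a, fp_b):
    """Fraction of differing bits between two equally shaped fingerprints."""
    total = sum(len(row) for row in fp_a)
    diff = sum(a != b for ra, rb in zip(fp_a, fp_b) for a, b in zip(ra, rb))
    return diff / total if total else 0.0


energies = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 3, 2, 5]]
louder = [[2 * e for e in frame] for frame in energies]   # same audio at a higher volume

print(fingerprint_bits(energies))                          # [[1, 1, 1], [0, 0, 0]]
print(bit_error_rate(fingerprint_bits(energies), fingerprint_bits(louder)))  # 0.0
```

Because only the sign of the difference is kept, a uniform volume change leaves every bit unchanged. Two tracks are then declared a match when the bit error rate between their fingerprints falls below some threshold, rather than requiring exact equality as a file hash does.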
Do you have absolutely identical files, i.e., the signals are time aligned, bit depth is the same, sampling rate is the same? Then I would think a hash like MD5 should work. But if any of those parameters are changed, so will the hashes. In such an event, a procedure like the one mentioned earlier would work better.
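A small Python illustration of why a file hash is so brittle: the same (made-up) samples stored at two different bit depths produce unrelated MD5 digests, even though they would sound nearly identical. The sample values and the `md5_hex` helper are assumptions for the example.

```python
import hashlib
import struct

samples = [0, 1000, -2000, 3000, -4000, 5000]   # made-up 16-bit sample values

# The same audio stored as 16-bit little-endian and as 8-bit signed samples.
as_16_bit = struct.pack("<%dh" % len(samples), *samples)
as_8_bit = struct.pack("%db" % len(samples), *(s >> 8 for s in samples))


def md5_hex(data):
    """Hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


print(md5_hex(as_16_bit) == md5_hex(bytes(as_16_bit)))  # identical bytes, identical hash
print(md5_hex(as_16_bit) == md5_hex(as_8_bit))          # same audio, different hash
```

The same thing happens with a different sampling rate, a different encoder, or even a shifted start time, which is why duplicate detection has to compare features of the decoded audio instead.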
Take a look at the ISMIR proceedings available free online. Fun stuff.
There are a lot of algorithms for acoustic fingerprinting. Some of the more popular ones are:
In fact, libfooId is open source, so you can check out its code on Google Code.
Take a look at the Acoustic Fingerprint page on Wikipedia. It has references for some papers as well as links to implementations (including the open source fdmf).
After some more research (although this is not conclusive at all!), I happened across the wiki at which details some of the approaches they use: