Speaker adaptation of acoustic model for Indian-accent Kaldi ASR - neural-network

I am working on speech recognition for Indian-accent speakers. Presently, I am using the online nnet2 decoding tool of Kaldi ASR.
The tool works well when the speaker has good English pronunciation, but it fails when the speaker's accent differs from US English.
So, can anyone please suggest a procedure for speaker adaptation of the acoustic or neural network model using Kaldi ASR?

There are several ways to approach this.
1 - If it is only the accent that differs (no new words, standard grammar), then you should mainly work on the acoustic part of the model. Get as much audio and transcription data as you can (hundreds of hours) so you can update the acoustic (H) part of the model; a data-preparation sketch follows below.
2 - If you are dealing with something more complex, you should also think about updating the lexicon (adding words) and the grammar (FSTs), in addition to the first point.
You can start with the AMI model and its papers, which are included among the Kaldi examples. See "Examples included with Kaldi".
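To make the first point concrete: whatever data you collect has to go into Kaldi's standard data-directory layout before any training step can run. A minimal sketch (the file names are Kaldi's conventions; the paths, utterance IDs, and transcripts are placeholders):

```python
import os

# Hypothetical adaptation data: (utt_id, wav_path, transcript, speaker_id).
utts = [
    ("spk1_utt1", "/data/accent/spk1_utt1.wav", "HELLO WORLD", "spk1"),
    ("spk1_utt2", "/data/accent/spk1_utt2.wav", "GOOD MORNING", "spk1"),
]

data_dir = "data/adapt_train"
os.makedirs(data_dir, exist_ok=True)

# Kaldi's standard per-utterance index files; entries must be sorted by key.
with open(os.path.join(data_dir, "wav.scp"), "w") as wav_scp, \
     open(os.path.join(data_dir, "text"), "w") as text, \
     open(os.path.join(data_dir, "utt2spk"), "w") as utt2spk:
    for utt_id, wav, transcript, spk in sorted(utts):
        wav_scp.write(f"{utt_id} {wav}\n")      # utterance-id -> audio path
        text.write(f"{utt_id} {transcript}\n")  # utterance-id -> transcript
        utt2spk.write(f"{utt_id} {spk}\n")      # utterance-id -> speaker

# From here the usual Kaldi shell steps apply, e.g.:
#   utils/utt2spk_to_spk2utt.pl data/adapt_train/utt2spk > data/adapt_train/spk2utt
#   utils/validate_data_dir.sh data/adapt_train
# followed by feature extraction and acoustic-model retraining.
```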

Related

Data Entry Automation by Field Identification and Optical Character Recognition (OCR) for Handwriting on Predefined Forms

I'm looking to automate data entry from predefined forms that have been filled out by hand. The characters are not separated, but the fields are identifiable by lines underneath or as part of a table. I know that handwriting OCR is still an area of active research, and I can include an operator review function, so I do not require accuracy above 90%.
The first solution that I thought of is a combination of OpenCV for field identification (http://answers.opencv.org/question/63847/how-to-extract-tables-from-an-image/) and Tesseract to recognize the handwriting (https://github.com/openpaperwork/pyocr).
Another potentially simpler and more efficacious method for field identification with a predefined form would be to somehow subtract the blank form from the filled form. Since the forms would be scanned, this would likely require some location tolerance, noise reduction, and feature recognition.
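Roughly, the subtraction idea I have in mind would look something like this (a sketch only; I am assuming ORB feature matching for registration, and the threshold values are guesses):

```python
import cv2
import numpy as np

blank = cv2.imread("blank_form.png", cv2.IMREAD_GRAYSCALE)
filled = cv2.imread("filled_form.png", cv2.IMREAD_GRAYSCALE)

# Match ORB features to estimate how the scan is shifted/rotated vs. the template.
orb = cv2.ORB_create(2000)
kp1, des1 = orb.detectAndCompute(filled, None)
kp2, des2 = orb.detectAndCompute(blank, None)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:200]

src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

# Warp the scan onto the template, then difference: what remains is the handwriting.
aligned = cv2.warpPerspective(filled, H, (blank.shape[1], blank.shape[0]))
diff = cv2.absdiff(blank, aligned)
_, handwriting = cv2.threshold(diff, 40, 255, cv2.THRESH_BINARY)
handwriting = cv2.medianBlur(handwriting, 3)  # crude noise reduction

cv2.imwrite("handwriting_only.png", handwriting)
```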
Any suggestions or comments would be greatly appreciated.
As the Tesseract FAQ says, it is not recommended if you're looking for successful handwriting recognition. I would recommend looking at commercial offerings like the Microsoft OCR API (scroll down to "Read handwritten text from images"); you can try it online and use the API in your application.
Another option is ABBYY OCR, which has a lot of useful functions for recognizing tables, complicated documents, etc. You can read more here.
As for free alternatives, the only thing that comes to mind is the Lipi Toolkit.
As for detection of letters, it really depends on the input. In general, if your form is more or less the same every time, it would be best to simply measure your form and use predefined positions in which to search for text (see the sketch below). Otherwise, OpenCV is the right technology for finding text; there are plenty of tutorials online and good answers here on Stack Overflow - for example, you can take a look at the detection-using-MSER answer by Silencer.
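To illustrate the predefined-positions approach, a rough sketch (the field coordinates are placeholders you would measure off your own form, and pytesseract here is just an example OCR front end, not a recommendation for handwriting):

```python
import cv2
import pytesseract  # thin wrapper around the tesseract CLI

# Field positions measured off the blank form: name -> (x, y, width, height).
# These coordinates are placeholders; measure your own form.
FIELDS = {
    "name":    (120, 200, 400, 50),
    "date":    (560, 200, 200, 50),
    "address": (120, 280, 640, 50),
}

img = cv2.imread("filled_form.png", cv2.IMREAD_GRAYSCALE)

results = {}
for field, (x, y, w, h) in FIELDS.items():
    crop = img[y:y + h, x:x + w]
    # Binarize the crop; handwriting OCR is fragile, so keep preprocessing simple.
    _, crop = cv2.threshold(crop, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    # --psm 7: treat the crop as a single line of text.
    results[field] = pytesseract.image_to_string(crop, config="--psm 7").strip()

print(results)  # send low-confidence fields to the operator review queue
```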

How can I improve Watson Speech to Text accuracy?

I understand that Watson Speech To Text is somewhat calibrated for colloquial conversation and for 1 or 2 speakers. I also know that it can deal with FLAC better than WAV and OGG.
I would like to know how I could improve recognition accuracy, acoustically speaking.
I mean, does increasing the volume help? Maybe using a compression filter? Noise reduction?
What kind of preprocessing could help for this service?
The best way to improve the accuracy of the base models (which are very accurate but also very general) is to use the Watson STT customization service: https://www.ibm.com/watson/developercloud/doc/speech-to-text/custom.html. That will let you create a custom model tailored to the specifics of your domain; a sketch of the flow is below. If your domain is not very well matched to those captured by the base model, then you can expect a great boost in recognition accuracy.
Regarding your comment "I also know that it can deal with FLAC better than WAV and OGG", that is not really the case. The Watson STT service offers full support for FLAC, WAV, Ogg, and other formats (please see this section of the documentation: https://www.ibm.com/watson/developercloud/doc/speech-to-text/input.html#formats).
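For orientation, the customization flow from that documentation looks roughly like this over REST (a sketch only; the endpoints and credentials reflect the documentation linked above and may have changed since, and the file names are placeholders):

```python
import requests

# Service credentials (placeholders) from your Bluemix service instance.
AUTH = ("your-username", "your-password")
BASE = "https://stream.watsonplatform.net/speech-to-text/api/v1"

# 1. Create a custom language model on top of a base model.
r = requests.post(
    f"{BASE}/customizations",
    auth=AUTH,
    json={"name": "my-domain-model",
          "base_model_name": "en-US_BroadbandModel",
          "description": "Domain-specific customization"},
)
custom_id = r.json()["customization_id"]

# 2. Add a corpus of domain text so the service learns your vocabulary.
with open("domain_corpus.txt", "rb") as corpus:
    requests.post(
        f"{BASE}/customizations/{custom_id}/corpora/my-corpus",
        auth=AUTH,
        data=corpus,
    )

# 3. Train the custom model (training is asynchronous; poll its status).
requests.post(f"{BASE}/customizations/{custom_id}/train", auth=AUTH)

# 4. Recognize against the custom model by passing customization_id.
with open("audio.flac", "rb") as audio:
    r = requests.post(
        f"{BASE}/recognize",
        auth=AUTH,
        params={"customization_id": custom_id},
        headers={"Content-Type": "audio/flac"},
        data=audio,
    )
print(r.json())
```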

Looking for neural network samples

I am making a program to recognize musical notes sung by a human voice. I'm using a neural network, and I wonder where I can find good samples of musical notes sung by a human voice for my network. I've found thousands of samples for other instruments, but none for the human voice.
You can try places like the UCI Machine Learning Repository, but it's unlikely they'll have exactly what you're looking for.
http://archive.ics.uci.edu/ml/
If you can find even a small number of samples of different voices at known notes, though, you can construct a much fuller library by using pitch-shifting software (similar to auto-tune software). I believe there are free or shareware versions available.
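If you prefer to script the shifting rather than use shareware, something like this would work (a sketch; librosa is just my choice here, and any pitch-shifting tool would do):

```python
import librosa
import soundfile as sf

# One recorded sung note; we synthesize neighbors by shifting in semitone steps.
y, sr = librosa.load("voice_c4.wav", sr=None)

# Shift up/down one octave; large shifts sound unnatural, so keep the range modest.
for steps in range(-12, 13):
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=steps)
    sf.write(f"voice_c4_shift_{steps:+d}.wav", shifted, sr)
```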
(UPDATE: If you do create your own, you might consider donating it to the repository. I've seen work on neural networks to classify both western and non-western music by key or other categories, so there is some interest in music recognition.)

Chinese handwriting recognition program for iPhone

I would like to start on a Chinese handwriting recognition program for iPhone, but I couldn't find any library or API that can help me do so. It's hard for me to write the algorithm myself because of my time constraints.
Some suggestions recommended that I make use of a back-end server to do the recognition work, but I don't know how to set up that kind of server.
So, are there any suggestions or basic steps that can help me achieve this personal project?
You might want to check out Zinnia. Tegaki relies on other APIs (Zinnia is one of them) to do the actual character recognition.
I haven't looked at the code, but I gather it's written in C or C++, so it should suit your needs better than Tegaki.
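Zinnia also ships SWIG-generated Python bindings, which are handy for prototyping the recognition on a back-end server before touching the iPhone side. A rough sketch based on Zinnia's documented API (the model path assumes the zinnia-tomoe data package is installed; the stroke points are made up):

```python
import zinnia

recognizer = zinnia.Recognizer()
# Simplified-Chinese model from the zinnia-tomoe package (path is an assumption).
recognizer.open("/usr/local/lib/zinnia/model/tomoe/handwriting-zh_CN.model")

character = zinnia.Character()
character.clear()
character.set_width(300)   # coordinate space of the drawing canvas
character.set_height(300)
# Strokes are (stroke_index, x, y) points captured from the touch screen.
character.add(0, 51, 29)
character.add(0, 117, 41)
character.add(1, 99, 65)
character.add(1, 219, 77)

result = recognizer.classify(character, 10)  # top-10 candidates
for i in range(result.size()):
    print(result.value(i), result.score(i))
```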

How would you compare a spoken word to an audio file?

How would you go about comparing a spoken word to an audio file and determining if they match? For example, if I say "apple" to my iPhone application, I would like for it to record the audio and compare it with a prerecorded audio file of someone saying "apple". It should be able to determine that the two spoken words match.
What kind of algorithm or library could I use to perform this kind of voice-based audio file matching?
You should look up acoustic fingerprinting; see the Wikipedia link below. Shazam basically does this for music.
http://en.wikipedia.org/wiki/Acoustic_fingerprint
I know this question is old, but I discovered this library today:
http://www.ispikit.com/
Sphinx does speech recognition, and PocketSphinx has been ported to the iPhone by Brian King.
check https://github.com/KingOfBrian/VocalKit
He has provided excellent details and made it easy to implement yourself. I've run his example and made my own rendition of it.
You can use a neural network library and teach it to recognize different speech patterns. This will require some know-how about the general theory of neural networks and how they can be used to create systems that behave in a particular way. If you know nothing about the subject, you can get started with just the basics and then use a library rather than implementing something yourself. Hope that helps.
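One more classical option for the specific "does this recording match that word" task: dynamic time warping over MFCC features, which tolerates differences in speaking rate without needing full speech recognition. A minimal sketch (librosa is my assumption here, and the threshold would need tuning on your own recordings):

```python
import librosa

def word_distance(path_a, path_b):
    """DTW distance between the MFCC sequences of two recordings."""
    ya, sra = librosa.load(path_a, sr=16000)
    yb, srb = librosa.load(path_b, sr=16000)
    # MFCCs summarize the spectral envelope frame by frame.
    ma = librosa.feature.mfcc(y=ya, sr=sra, n_mfcc=13)
    mb = librosa.feature.mfcc(y=yb, sr=srb, n_mfcc=13)
    # DTW aligns the two sequences despite differing speaking rates.
    D, wp = librosa.sequence.dtw(ma, mb)
    return D[-1, -1] / len(wp)  # total cost normalized by path length

dist = word_distance("reference_apple.wav", "spoken_word.wav")
print("match" if dist < 50.0 else "no match")  # threshold must be tuned
```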