How can I improve Watson Speech to Text accuracy? - ibm-cloud

I understand that Watson Speech To Text is somewhat calibrated for colloquial conversation and for 1 or 2 speakers. I also know that it can deal with FLAC better than WAV and OGG.
I would like to know how I could improve recognition accuracy, acoustically speaking.
I mean, does increasing the volume help? Maybe using some compression filter? Noise reduction?
What kind of pre-processing could help for this service?

The best way to improve the accuracy of the base models (which are very accurate but also very general) is to use the Watson STT customization service: https://www.ibm.com/watson/developercloud/doc/speech-to-text/custom.html. That will let you create a custom model tailored to the specifics of your domain. If your domain is not well matched by the base model, you can expect a significant boost in recognition accuracy.
Regarding your comment "I also know that it can deal with FLAC better than WAV and OGG": that is not really the case. The Watson STT service offers full support for FLAC, WAV, OGG, and other formats (please see this section of the documentation: https://www.ibm.com/watson/developercloud/doc/speech-to-text/input.html#formats).
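As a concrete starting point, here is a minimal sketch of both points (building a custom language model and recognizing a FLAC file) using the ibm-watson Python SDK; the API key, service URL and file names are placeholders, and method names can differ between SDK versions.

    # Minimal sketch with the ibm-watson Python SDK; credentials, URL and
    # file names are placeholders.
    from ibm_watson import SpeechToTextV1
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

    stt = SpeechToTextV1(authenticator=IAMAuthenticator("YOUR_API_KEY"))
    stt.set_service_url("https://api.us-south.speech-to-text.watson.cloud.ibm.com")

    # 1) Create a custom language model on top of a base model and feed it a domain corpus
    model = stt.create_language_model(
        name="my-domain-model",
        base_model_name="en-US_BroadbandModel",
    ).get_result()
    custom_id = model["customization_id"]

    with open("domain_corpus.txt", "rb") as corpus:
        stt.add_corpus(custom_id, corpus_name="domain-corpus", corpus_file=corpus)
    # In practice, poll get_corpus()/get_language_model() until the corpus has been
    # analyzed and training has finished before using the model.
    stt.train_language_model(custom_id)

    # 2) Recognition is the same for FLAC, WAV or OGG; only content_type changes
    with open("audio.flac", "rb") as audio:
        result = stt.recognize(
            audio=audio,
            content_type="audio/flac",
            language_customization_id=custom_id,
        ).get_result()
    print(result["results"][0]["alternatives"][0]["transcript"])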

Related

Data Entry Automation by Field Identification and Optical Character Recognition (OCR) for Handwriting on Predefined Forms

I'm looking to automate data entry from predefined forms that have been filled out by hand. The characters are not separated, but the fields are identifiable by lines underneath or as part of a table. I know that handwriting OCR is still an area of active research, and I can include an operator review function, so I do not need accuracy above 90%.
The first solution that I thought of is a combination of OpenCV for field identification (http://answers.opencv.org/question/63847/how-to-extract-tables-from-an-image/) and Tesseract to recognize the handwriting (https://github.com/openpaperwork/pyocr).
Another potentially simpler and more efficacious method for field identification with a predefined form would be to somehow subtract the blank form from the filled form. Since the forms would be scanned, this would likely require some location tolerance, noise reduction, and feature recognition.
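Roughly what I have in mind for the subtraction step, as an untested sketch (it assumes both pages are scanned at a similar resolution and leans on ORB feature matching for the location tolerance):

    import cv2
    import numpy as np

    blank = cv2.imread("blank_form.png", cv2.IMREAD_GRAYSCALE)
    filled = cv2.imread("filled_form.png", cv2.IMREAD_GRAYSCALE)

    # Align the filled scan onto the blank template to absorb scanner offset/rotation
    orb = cv2.ORB_create(2000)
    kp_f, des_f = orb.detectAndCompute(filled, None)
    kp_b, des_b = orb.detectAndCompute(blank, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_f, des_b)
    matches = sorted(matches, key=lambda m: m.distance)[:200]
    src = np.float32([kp_f[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    aligned = cv2.warpPerspective(filled, H, (blank.shape[1], blank.shape[0]))

    # Whatever differs from the template should mostly be the handwriting
    diff = cv2.absdiff(aligned, blank)
    _, handwriting = cv2.threshold(diff, 50, 255, cv2.THRESH_BINARY)
    handwriting = cv2.medianBlur(handwriting, 3)   # crude noise reduction
    cv2.imwrite("handwriting_only.png", handwriting)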
Any suggestions or comments would be greatly appreciated.
As stated in the Tesseract FAQ, it is not recommended if you are looking for successful handwriting recognition. I would recommend looking into commercial products such as the Microsoft OCR API (scroll down to "Read handwritten text from images"); you can try it online and use their API in your application.
Another option is ABBYY OCR, which has a lot of useful functions for recognizing tables, complicated documents, etc. You can read more here.
As for free alternatives, the only thing that comes to mind is the Lipi Toolkit.
As for detection of letters, it really depends on the input. In general, if your form is more or less the same every time, it would be best to simply measure your form and use predefined positions in which to search for text. Otherwise, OpenCV is the right technology for finding text; there are plenty of tutorials online and good answers here on Stack Overflow, for example the detection-using-MSER answer by Silencer.
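For the predefined-positions route, here is a minimal sketch; the field coordinates are made up, pytesseract is assumed to be installed, and, as noted above, stock Tesseract will still struggle on handwriting.

    import cv2
    import pytesseract

    # Field boxes measured once on the blank form: (x, y, width, height) - made-up values
    FIELDS = {
        "name":    (120, 210, 600, 60),
        "date":    (120, 300, 300, 60),
        "address": (120, 390, 900, 60),
    }

    page = cv2.imread("filled_form.png", cv2.IMREAD_GRAYSCALE)

    for field, (x, y, w, h) in FIELDS.items():
        roi = page[y:y + h, x:x + w]
        # Otsu binarisation cleans up the crop before OCR
        _, roi = cv2.threshold(roi, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        # --psm 7 tells Tesseract to treat the crop as a single line of text
        text = pytesseract.image_to_string(roi, config="--psm 7").strip()
        print(field, "->", text)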

Speaker adaptation of acoustic model for Indian accent kaldi ASR

I am working on getting speech recognition for Indian accent speakers. Presently, I am using the online nnet2 decoding tool of Kaldi ASR.
The tool works well when the speaker has good English pronunciation, but it fails when the speaker speaks with an accent different from the US English accent.
So, can anyone please suggest any procedure for speaker adaptation of acoustic or neural network model using Kaldi ASR?
There are many ways to do this, or to think about it.
1 - If you are talking only about accent (that is, no new words, standard grammar), then you should mainly work with the acoustic part of the model. Get as much audio and transcription data as you can (hundreds of hours) so you can update the H-part of the model.
2 - If you are talking about something more complex, you should think about updating the lexicon (adding words) and the grammar (FSTs) too, in addition to my first point.
You can try starting with the AMI model and its papers, which are included in the examples in Kaldi. See Examples included with Kaldi.
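If you collect your own accented audio, most of the preparation is just laying it out the way Kaldi expects. A minimal sketch of generating the data directory files (the file layout and utterance-id convention here are assumptions):

    import os

    # Assumed layout: recordings/<speaker_id>/<utt_id>.wav plus a transcripts.tsv file
    # with "utt_id<TAB>transcript" lines; utterance ids are assumed to start with the
    # speaker id, e.g. spk01_0001. Kaldi wants wav.scp, text and utt2spk.
    data_dir = "data/indian_accent"
    os.makedirs(data_dir, exist_ok=True)

    with open("transcripts.tsv") as tsv, \
         open(os.path.join(data_dir, "wav.scp"), "w") as wav_scp, \
         open(os.path.join(data_dir, "text"), "w") as text, \
         open(os.path.join(data_dir, "utt2spk"), "w") as utt2spk:
        for line in tsv:
            utt_id, transcript = line.rstrip("\n").split("\t", 1)
            spk_id = utt_id.split("_")[0]
            wav_scp.write(f"{utt_id} recordings/{spk_id}/{utt_id}.wav\n")
            text.write(f"{utt_id} {transcript}\n")
            utt2spk.write(f"{utt_id} {spk_id}\n")
    # Then run utils/utt2spk_to_spk2utt.pl and utils/fix_data_dir.sh as usual before
    # extracting features and adapting/retraining the acoustic model.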

Looking for neural network samples

I am making a program to recognize musical notes recorded from a human voice. I'm using a neural network, and I wonder whether I can find good samples of musical notes sung by a human voice for my network. I've found thousands of samples for other instruments, but none for the human voice.
You can try places like the UCI Machine Learning Repository, but it's unlikely they'll have exactly what you're looking for.
http://archive.ics.uci.edu/ml/
If you can find even a small number of samples of different voices at known notes, though, you can construct a much fuller library by using pitch-shifting software (similar to auto-tune software.) I believe there are free- or shareware versions available.
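A minimal sketch of that expansion, assuming the librosa library and a hypothetical recording of a single sung A4:

    import librosa
    import soundfile as sf

    # One sung note, e.g. a recording of A4 (440 Hz); the filename is hypothetical
    y, sr = librosa.load("voice_A4.wav", sr=None)

    # Generate the neighbouring semitones from that single recording
    for n_steps in range(-5, 7):                 # roughly E4 up to D#5
        shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
        sf.write(f"voice_A4_shift_{n_steps:+d}.wav", shifted, sr)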
(UPDATE: If you do create your own, you might consider donating it to the repository. I've seen work on neural networks to classify both western and non-western music by key or other categories, so there is some interest in music recognition.)

How would you compare a spoken word to an audio file?

How would you go about comparing a spoken word to an audio file and determining if they match? For example, if I say "apple" to my iPhone application, I would like for it to record the audio and compare it with a prerecorded audio file of someone saying "apple". It should be able to determine that the two spoken words match.
What kind of algorithm or library could I use to perform this kind of voice-based audio file matching?
You should look up acoustic fingerprinting; see the Wikipedia link below. Shazam basically does this for music.
http://en.wikipedia.org/wiki/Acoustic_fingerprint
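If you only need to match isolated words rather than identify an exact recording, a lighter-weight alternative is to compare MFCC features with dynamic time warping. A minimal sketch, assuming the librosa library, hypothetical file names, and an arbitrary threshold:

    import librosa

    def mfcc(path):
        y, sr = librosa.load(path, sr=16000)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    ref = mfcc("apple_reference.wav")    # prerecorded template (hypothetical file)
    test = mfcc("apple_spoken.wav")      # the newly recorded word

    # Dynamic time warping aligns the two sequences despite different speaking rates
    D, wp = librosa.sequence.dtw(X=ref, Y=test, metric="euclidean")
    cost = D[-1, -1] / len(wp)           # path-length-normalised alignment cost

    THRESHOLD = 150.0                    # arbitrary; tune on your own recordings
    print("match" if cost < THRESHOLD else "no match", cost)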
I know this question is old, but I discovered this library today:
http://www.ispikit.com/
Sphinx does voice recognition, and PocketSphinx has been ported to the iPhone by Brian King.
check https://github.com/KingOfBrian/VocalKit
He has provided excellent details and made it easy to implement yourself. I've run his example and made my own rendition of it.
You can use a neural network library and teach it to recognize different speech patterns. This will require some know-how about the general theory of neural networks and how they can be used to create systems that behave in a particular way. If you know nothing about the subject, you can start with just the basics and then use a library rather than implementing something yourself. Hope that helps.

Building better positional audio [AudioQueue manipulation]

I'm building an app that has a requirement for really accurate positional audio, down to the level of modelling inter-aural time difference (ITD), the slight delay difference between stereo channels that varies with a sound's position relative to a listener. Unfortunately, the iPhone's implementation of OpenAL doesn't have this feature, nor is a delay Audio Unit supplied in the SDK.
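For reference, the offsets involved are tiny; a quick sketch of the per-channel delay I would need, using the standard Woodworth approximation (the head radius and sample rate are assumptions), with the actual implementation being an offset of one channel's read position by that many samples:

    import math

    def itd_seconds(azimuth_deg, head_radius_m=0.0875, speed_of_sound=343.0):
        """Woodworth approximation of the interaural time difference for a
        far-field source at the given azimuth (0 = straight ahead)."""
        theta = math.radians(azimuth_deg)
        return (head_radius_m / speed_of_sound) * (math.sin(theta) + theta)

    sample_rate = 44100
    for az in (0, 30, 60, 90):
        delay = itd_seconds(az)
        print(f"{az:3d} deg -> {delay * 1e6:6.1f} us = {delay * sample_rate:5.1f} samples")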
After a bit of reading around, I've decided that the best way to approach this problem is to implement my own delay by manipulating an AudioQueue (I can also see some projects in my future which may require learning this stuff, so this is as good an excuse to learn as any). However, I don't have any experience in low-level audio programming at all, and certainly none with AudioQueue. Trying to learn both:
a) the general theory of audio processing
and
b) the specifics of how AudioQueue implements that theory
is proving far too much to take in all at once :(
So, my questions are:
1) where's a good place to start learning about DSP and how audio generation and processing works in general (down to the level of how audio data is structured in memory, how mixing works, that kinda thing)?
2) what's a good way to get a feel for how AudioQueue does this? Are there any good examples of how to get it reading from a generated ring buffer, rather than just fetching bits of a file on demand with AudioFileReadPackets, like Apple's SpeakHere example does?
and, most importantly
3) is there a simpler way of doing this that I've overlooked?
I think Richard Lyons' "Understanding Digital Signal Processing" is widely revered as a good starter DSP book, though it's all math and no code.
If timing is so important, you'll likely want to use the Remote I/O audio unit, rather than the higher-latency audio queue. Some of the audio unit examples may be helpful to you here, like the "aurioTouch" example that uses the Remote I/O unit for capture and performs an FFT on it to get the frequencies.
If the built-in AL isn't going to do it for you, I think you've opted into the "crazy hard" level of difficulty.
Sounds like you should probably be on the coreaudio-api list (lists.apple.com), where Apple's Core Audio engineers hang out.
Another great resource for learning the fundamental basics of DSP and their applications is The Scientist and Engineer's Guide to Digital Signal Processing by Steven W. Smith. It is available online for free at http://www.dspguide.com/ but you can also order a printed copy.
I really like how the author builds up the fundamental theory in a way that is very palatable.
Furthermore, you should check out the Core Audio Public Utility which you'll find at /Developer/Extras/CoreAudio/PublicUtility. It covers a lot of the basic structures you'll need to get in place in order to work with CoreAudio.