Watson 'Speech to text' not recognizing microphone input properly - unity3d

I am using the Unity SDK provided for IBM Watson services. I am trying out the 'ExampleStreaming.cs' sample provided for speech-to-text recognition, and I test the app in the Unity editor.
This sample uses the microphone as the audio input and returns results for the user's voice input. However, when I use the microphone as input, the transcribed results are far from correct. When I say "Create a black box", the returned words are completely irrelevant to the input.
When I use pre-recorded voice clips, the output is perfect.
Does the service perform poorly with an Indian accent?
What is the reason for poor microphone input recognition?
The docs say:
"In general, the service is sensitive to background noise. For instance, engine noise, working devices, street noise, and talking can significantly reduce accuracy. In addition, the microphones that are typically installed on mobile devices and tablets are often inadequate. The service performs best when professional microphones are used to capture audio with better quality."
I use a Logitech headset mic as the input source.

Satish,
Try to "clean up" the audio as best you can - by limiting background noise. Also be aware that you can use one of two different processing models - one for broadband and one for narrowband. Try them both, and see which is most appropriate for your input device.
In addition, you may find that the underlying speech model does not handle all of the domain-specific terms you are looking for. In these cases you can customize and expand the speech model, as explained in the documentation on Using Custom Language Models (https://console.bluemix.net/docs/services/speech-to-text/custom.html#custom). While this is a bit more involved, it can often make a huge difference in accuracy and overall usability.
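If you do go the customization route, the first step in the linked documentation is creating a custom language model on top of a base model, then adding your domain-specific terms to it and training it. The sketch below only shows the creation call over plain HTTP; the model name, service URL, and key are illustrative placeholders, and the rest of the flow is covered in the docs.

// Sketch: create a custom language model via the customization REST API.
// The returned customization id is then supplied on later recognize requests.
// All names, URLs, and credentials below are illustrative placeholders.
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

class CustomLanguageModelSketch
{
    static async Task Main()
    {
        using (var client = new HttpClient())
        {
            string apiKey = "YOUR_API_KEY"; // placeholder
            client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue(
                "Basic", Convert.ToBase64String(Encoding.ASCII.GetBytes("apikey:" + apiKey)));

            // Create the custom model on top of the broadband base model.
            var body = new StringContent(
                "{\"name\": \"unity-commands\", \"base_model_name\": \"en-US_BroadbandModel\"}",
                Encoding.UTF8, "application/json");
            var response = await client.PostAsync(
                "https://YOUR_SERVICE_URL/v1/customizations", body); // placeholder host
            Console.WriteLine(await response.Content.ReadAsStringAsync());
        }
    }
}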

Related

Record from specific microphone channel

I am looking for a way to record microphone input from a specific channel.
For example, I want to separately record the left/right channels of an M-Track audio interface, or individual SingStar wireless microphones.
Microphone.Start seems limited in this regard.
Further, I found this thread, which says
Yes, you can assume the microphone at runtime will always be 1 channel.
My questions:
Is there any workaround to get this working?
Is there perhaps an open-source library or a low-level microphone API for Unity?
Is it really not possible to record different microphone channels into different AudioClips in Unity? (See the sketch below.)
I am looking for a solution at least on desktop platforms, i.e., Windows, macOS, Linux.
Bonus question: Does recording specific microphone channels work in Unreal Engine?
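For what it's worth, if Unity ever hands you a clip that does report more than one channel, splitting it apart is straightforward, because AudioClip.GetData returns interleaved samples. The sketch below shows only that deinterleaving step; it does not work around the underlying limitation that Microphone.Start usually reports a single channel.

using UnityEngine;

// Sketch: split an interleaved multi-channel AudioClip into one clip per
// channel. This only helps if Unity actually delivers a multi-channel clip;
// as noted above, microphone clips typically report a single channel.
public static class ClipChannelSplitter
{
    public static AudioClip[] Split(AudioClip source)
    {
        int channels = source.channels;
        int samplesPerChannel = source.samples;

        // Interleaved data: [L0, R0, L1, R1, ...] for a stereo clip.
        float[] interleaved = new float[samplesPerChannel * channels];
        source.GetData(interleaved, 0);

        AudioClip[] result = new AudioClip[channels];
        for (int c = 0; c < channels; c++)
        {
            float[] mono = new float[samplesPerChannel];
            for (int i = 0; i < samplesPerChannel; i++)
                mono[i] = interleaved[i * channels + c];

            result[c] = AudioClip.Create(source.name + "_ch" + c,
                samplesPerChannel, 1, source.frequency, false);
            result[c].SetData(mono, 0);
        }
        return result;
    }
}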

Is it possible to use the user's voice instead of the built-in voice in TTS?

We have a text-to-speech feature that offers a set of voices with different pitches, both male and female.
Similarly, voice recognition is available on many devices and PCs.
Is there any way for a system to speak with the user's own voice instead of the built-in default voices?
Although it is theoretically possible, it is most likely impractical. There are basically two types of artificial voices: fully synthetic and sample-based.
If your TTS voice is fully synthetic, then it can only be influenced by certain parameters such as pitch and speed. Your best approach is to try to estimate those parameters from your input speech.
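To make the "estimate parameters from the input speech" idea concrete, here is a rough sketch of estimating one such parameter, the fundamental pitch, from a mono buffer using a naive autocorrelation search. It is only an illustration; real voice analysis needs far more than a single pitch value.

// Sketch: estimate the fundamental pitch of a speech buffer with a naive
// autocorrelation search, as one example of deriving a TTS parameter from
// the user's own speech.
public static class PitchEstimator
{
    // samples: mono audio in [-1, 1]; returns an estimate in Hz, or 0 if none found.
    public static float EstimatePitch(float[] samples, int sampleRate,
                                      float minHz = 60f, float maxHz = 400f)
    {
        int minLag = (int)(sampleRate / maxHz);
        int maxLag = (int)(sampleRate / minHz);

        float bestCorrelation = 0f;
        int bestLag = 0;

        for (int lag = minLag; lag <= maxLag && lag < samples.Length; lag++)
        {
            float correlation = 0f;
            for (int i = 0; i + lag < samples.Length; i++)
                correlation += samples[i] * samples[i + lag];

            if (correlation > bestCorrelation)
            {
                bestCorrelation = correlation;
                bestLag = lag;
            }
        }
        return bestLag > 0 ? (float)sampleRate / bestLag : 0f;
    }
}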
If your TTS voice is sample-based, then you could try to collect enough speech from the user to construct a whole new dataset. Usually you need every possible diphone, which can take a long time to collect unless you have the user utter text written specifically to cover them. Your engine then needs to be able to accept these speech segments and construct a new voice from them.
In both cases the result still won't be very convincing unless you can also mimic the user's prosody and specific pronunciation. If your TTS and recognition modules aren't developed by you or aren't extensible, you are likely out of luck, since most software doesn't allow new voices to be built at runtime.

Getting the voltage applied to an iPhone's microphone port

I am looking at a project where we send information from a peripheral device to an iPhone through the microphone input.
In a nutshell, the iPhone would act as a voltmeter. (In reality, the controller we developed will send data encoded as voltages to the iPhone for further processing).
As there are several voltmeter apps on the App Store that receive their input through the microphone port, this seems to be technically possible.
However, scanning the AudioQueue and AudioFile APIs, there doesn't seem to be a method for directly accessing the voltage.
Does anyone know of APIs, sample code or literature that would allow me to access the voltage information?
The A/D converter on the line-in is, for all practical purposes, a voltmeter. The sample values map to the voltage applied at the time each sample was taken. You'll need to do your own testing to figure out which voltages correspond to which sample values.
As far as I know, it won't be possible to get the voltages directly; you'll have to figure out how to convert them into equivalent 'sounds' so that the iOS APIs will pick them up as audio, which your app can then interpret as voltages.
If I were attempting this sort of app, I would hook up some test voltages to the input (very small ones!), capture the sound, and then see what it looks like. A bit of reverse engineering should get you to the point where you can interpret the inputs correctly.
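Once you have measured a few known voltages, the calibration can be as simple as a scale factor. A tiny sketch of that mapping, written in C# for brevity (the principle is the same inside the iOS audio callbacks), with a made-up calibration constant:

// Sketch of the calibration idea: captured samples are normalized to [-1, 1],
// and a scale factor measured with known test voltages maps them back to volts.
// The 0.9f value below is a made-up example; yours comes from your own tests.
public static class SampleToVoltage
{
    const float VoltsPerFullScale = 0.9f; // hypothetical calibration constant

    public static float ToVolts(float normalizedSample)
    {
        return normalizedSample * VoltsPerFullScale;
    }

    // Root-mean-square voltage over a buffer, useful for slowly varying signals.
    public static float RmsVolts(float[] samples)
    {
        double sumOfSquares = 0.0;
        foreach (float s in samples)
            sumOfSquares += (double)s * s;
        float rmsSample = (float)System.Math.Sqrt(sumOfSquares / samples.Length);
        return rmsSample * VoltsPerFullScale;
    }
}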

iPhone: CPU power to do DSP/Fourier transform/frequency domain?

I want to analyze mic audio on an ongoing basis (not just a snippet or prerecorded sample), display a frequency graph, and filter out certain aspects of the audio. Is the iPhone powerful enough for that? I suspect the answer is yes, given the Google and iPhone voice recognition apps, Shazam and other music recognition apps, and the guitar tuner apps out there. However, I don't know what limitations I'll have to deal with.
Anyone play around with this area?
Apple's aurioTouch sample code includes an FFT implementation.
The apps I've seen that do some sort of music/voice recognition need an internet connection, so it's highly likely that they just do some sort of feature calculation on the audio and send those features via HTTP so the recognition happens on the server.
In any case, frequency graphs and filtering were being done on lesser CPUs a dozen years ago. The iPhone should be no problem.
"Fast enough" may be a function of your (or your customer's) expectations on how much frequency resolution you are looking for and your base sample rate.
An N-point FFT is on the order of N*log2(N) computations, so if you don't have enough MIPS, reducing N is a potential area of concession for you.
In many applications, sample rate is a non-negotiable, but if it was, this would be another possibility.
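To make that tradeoff concrete, here is a short sketch that prints the frequency resolution per bin (sample rate divided by N) and the rough relative cost (N*log2(N)) for a few candidate FFT sizes; the numbers are illustrative only.

// Sketch: how FFT size trades frequency resolution against per-frame cost.
// Bin resolution is sampleRate / N; work per frame grows roughly as N * log2(N).
using System;

class FftTradeoff
{
    static void Main()
    {
        int sampleRate = 44100;                  // typical audio rate
        int[] sizes = { 512, 1024, 2048, 4096 }; // candidate FFT lengths

        foreach (int n in sizes)
        {
            double resolutionHz = (double)sampleRate / n;
            double relativeCost = n * Math.Log(n, 2);
            Console.WriteLine($"N={n}: ~{resolutionHz:F1} Hz per bin, cost ~{relativeCost:F0} ops");
        }
    }
}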
I made an app that calculates the FFT live: http://www.itunes.com/apps/oscope
You can find my code for the FFT on GitHub (although it's a little rough): http://github.com/alexbw/iPhoneFFT
Apple's new iPhone OS 4.0 SDK allows for built-in computation of the FFT with the Accelerate framework, so I'd definitely start working with the new OS if the FFT is a central part of your app's functionality.
You can't just port FFT code written in C straight into your app: the Thumb compiler option complicates floating-point arithmetic, so you need to build that code in ARM mode.

Streaming Audio Clips from iPhone to server

I'm wondering if there are any atomic examples out there for streaming audio FROM the iPhone to a server. I'm not interested in telephony or SIP-style solutions, just a simple socket stream to send an audio clip, in .wav format, as it is being recorded. I haven't had much luck with Google or other obvious avenues, although there seem to be many examples of doing this the other way around.
I can't figure out how to register the unregistered account I initially posted with.
Anyway, I'm not really interested in the audio format at present, just the streaming aspect. I want to take the microphone input and stream it from the iPhone to a server. I don't presently care about the transfer rate, as I'll initially just test over a Wi-Fi connection, not 3G. The reason I can't cache it is that I'm interested in trying out some open-source speech recognition software for my undergraduate thesis. Caching and then sending the recording is possible, but then it takes considerably longer to get the voice data to the server. If I can start sending the data as soon as I start recording, the response time is considerably improved, because most of the data will have already reached the server by the time I let go of the record button. Furthermore, if I can get this streaming functionality to work from the iPhone, then on the server side I can also start the speech recognizer as soon as the first bit of audio comes through. Again, this should considerably speed up the total time the transaction takes from the user's perspective.
Colin Barrett mentions phones and phone networks, but these are actually a pretty suboptimal solution for ASR, mainly because they provide no good way to recover from errors; doing so over a VoIP dialogue is a horrible experience. However, the iPhone, and in particular its touch screen, provides a great way to do that, through the use of an IME or n-best lists for the other recognition candidates.
If I can figure out the basic architecture for streaming the audio, then I can start thinking about FLAC encoding or something similar to reduce the required transfer rate, and maybe even feature extraction, although that limits the later ability to retrain the system with the recordings.
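The question is about the iPhone's own audio APIs, but the chunked-streaming idea itself (send whatever new samples the microphone has produced since the last send, instead of waiting for the complete recording) is easy to sketch. Below is a rough Unity/C# illustration over a plain TCP socket; the host, port, and raw 16-bit PCM framing are assumptions, and a real client would add a header or a compressed format such as the FLAC mentioned above.

using System.Net.Sockets;
using UnityEngine;

// Rough sketch of chunked microphone streaming: on every Update, read the
// samples recorded since the last send and push them to a TCP socket as raw
// 16-bit PCM. Host/port and the wire format are placeholders.
public class MicStreamer : MonoBehaviour
{
    const int SampleRate = 16000;

    AudioClip _clip;
    TcpClient _client;
    NetworkStream _stream;
    int _lastSamplePosition;

    void Start()
    {
        _client = new TcpClient("192.168.0.10", 5000); // placeholder server
        _stream = _client.GetStream();
        _clip = Microphone.Start(null, true, 10, SampleRate); // 10 s looping buffer
    }

    void Update()
    {
        int position = Microphone.GetPosition(null);
        if (position == _lastSamplePosition) return;

        // Number of new samples, handling wrap-around of the looping buffer.
        int newSamples = position - _lastSamplePosition;
        if (newSamples < 0) newSamples += _clip.samples;

        float[] buffer = new float[newSamples];
        _clip.GetData(buffer, _lastSamplePosition);
        _lastSamplePosition = position;

        // Convert to little-endian 16-bit PCM and send immediately.
        byte[] bytes = new byte[newSamples * 2];
        for (int i = 0; i < newSamples; i++)
        {
            short s = (short)(Mathf.Clamp(buffer[i], -1f, 1f) * short.MaxValue);
            bytes[i * 2] = (byte)(s & 0xff);
            bytes[i * 2 + 1] = (byte)((s >> 8) & 0xff);
        }
        _stream.Write(bytes, 0, bytes.Length);
    }

    void OnDestroy()
    {
        Microphone.End(null);
        if (_stream != null) _stream.Close();
        if (_client != null) _client.Close();
    }
}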