Any simple VAD implementation? - iphone

I'm looking for some C/C++ code for VAD (Voice Activity Detection).
Basically, my application is reading PCM frames from the device. I would like to know when the user is talking. I'm not looking for any speech recognition algorithm but only for voice detection.
I would like to know when the user is talking and when he finishes:
bool isVAD(short* pcm,size_t count);

Google's open-source WebRTC code has a VAD module written in C. It uses a Gaussian Mixture Model (GMM), which is typically much more effective than a simple energy-threshold detector, especially in a situation with dynamic levels and types of background noise. In my experience it's also much more effective than the Moattar-Homayounpour VAD that Gilad mentions in their comment.
The VAD code is part of the much, much larger WebRTC repository, but it's very easy to pull it out and compile it on its own. E.g. the webrtcvad Python wrapper includes just the VAD C source.
The WebRTC VAD API is very easy to use. First, the audio must be mono 16 bit PCM, with either a 8 KHz, 16 KHz or 32 KHz sample rate. Each frame of audio that you send to the VAD must be 10, 20 or 30 milliseconds long.
Here's an outline of an example that assumes audio_frame is 10 ms (320 bytes) of audio at 16000 Hz:
#include "webrtc/common_audio/vad/include/webrtc_vad.h"
// ...
VadInst *vad;
WebRtcVad_Create(&vad);
WebRtcVad_Init(vad);
int is_voiced = WebRtcVad_Process(vad, 16000, audio_frame, 160);

There are open source implementations in the Sphinx and Freeswitch projects. I think they are all energy based detectors do won't need any kind model.
Sphinx 4 (Java but it should be easy to port to C/C++)
PocketSphinx
Freeswitch

How about LibVAD? www.libvad.com
Seems like that does exactly what you're describing.
Disclosure: I'm the developer behind LibVAD

Related

USB Audio, distortion in low bits

I've implemented a PIC32 as a USB sound card, using USB Audio Class 1. I'm sending a sawtooth signal from the microcontroller to the PC(windows 7, 64 bit), as 16-bit samples:
in decimal:
000
800
1600
2400
.. so on
then i try recording the received audio using Audacity, with MME -driver, as .wav or .raw.
I use MATLAB to open and inspect the data, and there i see data like:
000
799
1599
2400
..
The distortion varies from -1 to +1 bit pr sample..
Anyone have any idea where the problem might be.?
Windows-audio drivers.?
Since you receive the audio signal on PC, playback it, and record it using SW, the audio signal is converted from digital to analog, and to digital again. These introduce quantization error and noise, and you see the little difference between two signals.
I solved my problem..
The problem was caused by the application i used to record the data, and the method i used.. I used Audacity, which supports the old windows MME audio API, and the DirectSound API. These are relatively high-level API's apparently, and are the cause of the distortion.
About the Windows Core Audio APIs
Instead i used another program, called Reaper, it has an option to record using ASIO og WASAPI. This solves my problem. I've checked every sample in an 2 hour .wav file, using MATLAB, and it is completely bit-perfect.
I was probably some quantization error, but it was caused by the API.
ASIO and WASAPI gave me bit-perfect sound, MME and DirectSound gave me a distorted signal.

Output 4 channels of audio in MATLAB

I'm looking to output four channels of audio simultaneously from MATLAB using an external soundcard (Creative Soundblaster X-Fi Surround 5.1 Pro USB) and haven't yet found a working solution.
As far as I understand it, MATLAB's audioplayer object can only output a stereo signal, so I've tried two alternatives: playrec and pa_wavplay. Both appear to do precisely what I need, but seem to recognize the soundcard as a two-channel device only.
Any advice would be terrific. Thanks for reading.
(The MATLAB version is R2007b and the only available toolbox is the Signal Processing Toolbox.)
I've got a bit of experience of pa_wavplay and found it dealt with large numbers of inputs/output without any problems. I'd suspect the problem is with your audio interface.
While it can output 5.1, it's quite possibly producing those "additional" channels itself by decoding a Dolby Digital stream once in the device. This suggests the interface won't allow you to output 6 six channels of PCM audio as such.
If you're determined to use this device and prepared to get your hands dirty you could always try encoding your audio as ac3 yourself, but I guess you'd have to do this outside Matlab.

Transferring data using ultrasound

Yamaha InfoSound and ShopKick application use technologies that allow to transfer data using ultrasound. That is playing an inaudible signal (>18kHz) that can be picked up by modern mobile phones (iOS, Android).
What is the approach used in such technologies? What kind of modulation they use?
I see several problems with this approach. First, 18kHz is not inaudible. Many people cannot hear it, especially as they age, but I know I certainly can (I do regular hearing tests, work-related). Also, most phones have different low-pass filters on their A/D converters, and many devices, especially older Android ones (I've personally seen that happen), filter everything below 16 kHz or so. Your app therefore is not guaranteed to work on any hardware. The iPhone should probably be able to do it.
In terms of modulation, it could be anything really, but I would definitely rule out AM. Sound has next to zero robustness when it comes to volume. If I were to implement something like that, I would go with FSK. I would think that PSK would fail due to acoustic reflections and such. The difficulty is that you're working with non-robust energy transfer within a very narrow bandwidth. I certainly do not doubt that it can be achieved, but I don't see something like this proving reliable. Just IMHO, that is.
Update: Now that i think about it, a plain on-off would work with a single tone if you're not transferring any data, just some short signals.
Can't say for Yamaha InfoSound and ShopKick, but what we used in our project was a variation of frequency modulation: the frequency of the carrier is modulated by a digital binary signal, where 0 and 1 correspond to 17 kHz and 18 kHz respectively. As for demodulator, we tried heterodyne. More details you could find here: http://rnd.azoft.com/mobile-app-transering-data-using-ultrasound/
There's nothing special in being ultrasound, the principle is the same as data transmission through a modem, so any digital modulation is -in principle- feasible. You only have a specific frequency band (above 18khz) and some practical requisites (the medium is very unreliable, I guess) that suggest to use a simple-robust scheme with low-bit rate.
I don't know how they do it but this is how I do it:
If it is a string then make sure it's not a long one (the longer the higher is the error probability ). Lets assume we're working with the vital part of the ASCII code, namely up to character number 127, then all you need is 7 bits per character. Transform this character into bits and modulate those bits using QFSK (there are several modulations to choose from, frequency shift based ones have turned out to be the most robust I've tried from the conventional ones... I've created my own modulation scheme for this use case). Select the carrier frequencies as 18.5,19,19.5, and 20 kHz (if you want to be mathematically strict in your design, select frequency values that assure you both orthogonality and phase continuity at symbol transitions, if you can't, a good workaround to avoid abrupt symbols transitions is to multiply your symbols by a window of the same size, eg. a Gaussian or Bartlet ). In my experience you can move this values in the range from 17.5 to 20.5 kHz (if you go lower it will start to bother people using your app, if you go higher the average type microphone frequency response will attenuate your transmission and induce unwanted errors).
On the receiver side implement a correlation or matched filter receiver (an FFT receiver works as well, specially a zero padded one but it might be a little bit slower, I wouldn't recommend Goertzel because frequency shift due to Doppler effect or speaker-microphone non-linearities could affect your reception). Once you have received the bit stream make characters with them and you will recover your message
If you face too many broadcasting errors, try selecting a higher amount of samples per symbol or band-pass filtering each frequency value before giving them to the demodulator, using an error correction code such as BCH or Reed Solomon is sometimes the only way to assure an error free communication.
One topic everybody always forgets to talk about is synchronization (to know on the receiver side when the transmission has begun), you have to be creative here and make a lot a tests with a lot of phones before you can derive an actual detection threshold that works on all, notice that this might also be distance dependent
If you are unfamiliar with these subjects I would recommend a couple of great books:
Digital Modulation Techniques from Fuqin Xiong
DIGITAL COMMUNICATIONS Fundamentals and Applications from BERNARD SKLAR
Digital Communications from John G. Proakis
You might have luck with a library I created for sound-based modems, libquiet. It gives you a handful of profiles to work from, including a slow "Ultrasonic whisper" profile with spectral content above 19kHz. The library is written in C but would require some work to interface with iOS.

Watermarking sound, reading through iPhone

I want to add a few bytes of data to a sound file (for example a song). The sound file will be transmitted via radio to a received who uses for example the iPhone microphone to pick up the sound, and an application will show the original bytes of data. Preferably it should not be hearable for humans.
What is such technology called? Are there any applications that can do this?
Libraries/apps that can be used on iPhone?
It's audio steganography. There are algorithms to do it. Refer to here.
I've done some research, and it seems the way to go is:
Use low audio frequencies.
Spread the "bits" around randomly - do not use a pattern as it will be picked up by the listener. "White noise" is a good clue. The random pattern is known by the sender and receiver.
Use Fourier transform to pick up frequency and amplitude
Clean up input data.
Use checksum/redundancy-algorithms to compensate for loss.
I'm writing a prototype and am having a bit difficulty in picking up the right frequency as if has a ~4 Hz offset (100 Hz becomes 96.x Hz when played and picked up by the microphone).
This is not the answer, but I hope it helps.

How can I do metering/average peak power level in OpenAL?

I'm in the process of switching from AVAudioPlayer to OpenAL using the Finch sound engine. I need to do metering, i.e. get the average peak levels. Finch sound engine does not provide this, and I'm completely new to OpenAL. How can I do this? Any examples would be really appreciated.
I'm assuming you're looking for a drop-in replacement of AVAudioPlayer's peakPowerForChannel: method. Unfortunately, there is none. You'll have to roll your own.
OpenAL "sounds" are a combination of a "buffer" (your sample data, loaded in memory) and a "source," which represents something like properties you want applied to your sample data.
The easy approach to OpenAL playback is to load the entire file into memory and just play the whole thing in one call. However, you can use an NSInputStream to read in a chunk of PCM sample data from a file into an OpenAL buffer, use alBufferData() to compute your peak power using your own function, play the chunk using your source, and then repeat until EOF.
I know you are intending to use Finch, but you should give AudioQueues a real close lookover (if metering is a critical feature for you). It is much better designed for this type of application. In particular, the kAudioQueueProperty_CurrentLevelMeterDB property will provide you with either peak RMS (mPeakPower) or average RMS levels (mAveragePower), which you can read as often as you like.
Good luck and happy coding!
Some resources that might be helpful:
http://kcat.strangesoft.net/openal-tutorial.html
http://connect.creativelabs.com/openal/Documentation/OpenAL_Programmers_Guide.pdf
http://www.hydrogenaudio.org/forums/index.php?showtopic=78578
http://developer.apple.com/mac/library/documentation/MusicAudio/Reference/AudioQueueReference/Reference/reference.html