Is it possible to compare two sounds?
For example, the app already has a sound file (mp3 or any other format). Is it possible to compare that static sound file with a sound recorded inside the app?
Any comments are welcome.
Regards
This forum thread has a good answer (about three down) - http://www.dsprelated.com/showmessage/103820/1.php.
The trick is to get the decoded audio from the mp3 - if they're just short 'hello' sounds, I'd store them inside the app as a wav instead of decoding them (though I've never used CoreAudio or any of the other frameworks before so mp3 decoding into memory might be easy).
When you've got your reference wav and your recorded wav, follow the steps in the post above:
1 Do whatever is necessary to convert .wav files to their discrete-time signals:
http://www.sonicspot.com/guide/wavefiles.html
2 Time-warping might or might not be necessary depending on the difference between the two sample rates:
http://en.wikipedia.org/wiki/Dynamic_time_warping
3 After time warping, truncate both signals so that their durations are equivalent.
4 Compute the normalized energy spectral density (ESD) from the DFTs of the two signals:
http://en.wikipedia.org/wiki/Power_spectrum
6 Compute the mean-square-error (MSE) between the normalized ESDs of the two signals:
http://en.wikipedia.org/wiki/Mean_squared_error
The MSE between the normalized ESDs of two signals is a good metric of closeness. If you have, say, 10 .wav files, and 2 of them are nearly the same but the others are not, the two that are close should have a relatively low MSE. Two perfectly identical signals will obviously have an MSE of zero. Ideally, two "equivalent" signals with different time scales (20-second human talking versus 5-second chipmunk), different energies (soft-spoken human versus yelling chipmunk), and different phases (sampling began at a slightly different instant against continuous time input) should still have an MSE of zero, but quantization errors inherent in DSP will yield an MSE slightly greater than zero.
http://en.wikipedia.org/wiki/Minimum_mean-square_error
You should get two different MSE values, one between your male->recorded track and one between your female->recorded track. The comparison with the lowest difference is probably the correct gender.
I confess that I've never tried to do this and it looks very hard - good luck!
How is it possible to encode a black/white picture into a ".wav" file? I know that it is possible with the help of "steganography", but I don't know its algorithms. What algorithms exist? And what books/sources are the best for understanding their principles?
Edited:
Actually, I have a stereo wav file. My task is to decode pictures from it. The task says that the frequencies of the left channel give the X-coordinate and the frequencies of the right channel give the Y-coordinate of a Cartesian coordinate system. These points compose a picture with a text message. So I must write a program for this. I have no idea what I should do.
Probably the simplest version of steganography using a wav file would be to use 16-bit samples in the wave file, but only dedicate the 15 most significant bits to sound. In the least significant bit of each sample, you'd encode one pixel of your black and white picture.
Regenerating the picture would require software to open the wave file, take the least significant bit from each sample, and put those bits back together with each other into (for example) a JPEG file.
To put things into perspective, a CD has two channels containing 16-bit samples at a rate of 44.1 kHz, so you'd only need the LSBs from around 10 seconds of sound to encode a fairly typical full-color JPEG (e.g., 100 KB or so). A wave file of a typical ~3 minute pop song could hide around 15-20 full-color pictures pretty easily.
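The arithmetic behind those figures can be checked with a couple of helper functions. These are hypothetical names for illustration, assuming one hidden bit per sample in every channel.

```c
#include <assert.h>
#include <stddef.h>

/* Seconds of audio needed to hide payload_bytes at one bit per sample. */
double lsb_seconds_needed(size_t payload_bytes, double sample_rate, int channels)
{
    assert(sample_rate > 0.0 && channels > 0);
    return (double)payload_bytes * 8.0 / (sample_rate * (double)channels);
}

/* Payload bytes that fit into the given duration at one bit per sample. */
size_t lsb_capacity_bytes(double seconds, double sample_rate, int channels)
{
    assert(seconds >= 0.0 && sample_rate > 0.0 && channels > 0);
    return (size_t)(seconds * sample_rate * (double)channels / 8.0);
}
```

With CD audio (44100 Hz, 2 channels) that is 88200 hidden bits per second: a 100 KB file needs a little over 9 seconds, and 180 seconds holds 1,984,500 bytes, i.e. about 19 such files.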
Edit: (to reply to the edited question). This is a little tougher to deal with. An individual sample can't represent any frequency; it just represents the amplitude at a given point in time. To get frequency, you need a number of samples over a period of time -- and you need to know the exact period to convert.
Once you know that, you basically do an FFT on the samples. That will tell you the relative strengths of signal at all possible frequencies. Presumably, you'd pick the strongest one and scale appropriately. Do the same for the other channel and draw a pixel at that point.
Your ears are not sensitive to small changes in a sound file.
Wav files normally hold UNCOMPRESSED data, so the file is just a sequence of 16- to 24-bit samples. Your ears cannot notice slight differences in the low bits. All you need to do is periodically inject bit values that represent an image into the data.
So if you insert one pixel for every 1000 data points, you can hide an image (without even encrypting it) in a wave file. If a user plays the file, they CANNOT hear it.
When you save the file on your computer or a remote computer, you can use a decoding tool that is aware of the hiding technique.
The Yamaha InfoSound and ShopKick applications use technologies that transfer data using ultrasound, that is, by playing an inaudible signal (>18 kHz) that can be picked up by modern mobile phones (iOS, Android).
What is the approach used in such technologies? What kind of modulation do they use?
I see several problems with this approach. First, 18 kHz is not inaudible. Many people cannot hear it, especially as they age, but I know I certainly can (I do regular hearing tests, work-related). Also, phones have different low-pass filters on their A/D converters, and many devices, especially older Android ones (I've personally seen that happen), filter out everything above 16 kHz or so. Your app is therefore not guaranteed to work on all hardware. The iPhone should probably be able to do it.
In terms of modulation, it could be anything really, but I would definitely rule out AM. Sound has next to zero robustness when it comes to volume. If I were to implement something like that, I would go with FSK. I would think that PSK would fail due to acoustic reflections and such. The difficulty is that you're working with non-robust energy transfer within a very narrow bandwidth. I certainly do not doubt that it can be achieved, but I don't see something like this proving reliable. Just IMHO, that is.
Update: Now that I think about it, plain on-off keying would work with a single tone if you're not transferring any data, just some short signals.
I can't speak for Yamaha InfoSound and ShopKick, but what we used in our project was a variation of frequency modulation: the frequency of the carrier is modulated by a digital binary signal, where 0 and 1 correspond to 17 kHz and 18 kHz respectively. For the demodulator, we tried a heterodyne. You can find more details here: http://rnd.azoft.com/mobile-app-transering-data-using-ultrasound/
There's nothing special about it being ultrasound; the principle is the same as data transmission through a modem, so any digital modulation is, in principle, feasible. You only have a specific frequency band (above 18 kHz) and some practical requirements (the medium is very unreliable, I guess) that suggest using a simple, robust scheme with a low bit rate.
I don't know how they do it but this is how I do it:
If it is a string, then make sure it's not a long one (the longer it is, the higher the error probability). Let's assume we're working with the vital part of the ASCII code, namely up to character number 127; then all you need is 7 bits per character. Transform each character into bits and modulate those bits using QFSK (there are several modulations to choose from; frequency-shift-based ones have turned out to be the most robust of the conventional ones I've tried... I've created my own modulation scheme for this use case). Select the carrier frequencies as 18.5, 19, 19.5, and 20 kHz (if you want to be mathematically strict in your design, select frequency values that assure you both orthogonality and phase continuity at symbol transitions; if you can't, a good workaround to avoid abrupt symbol transitions is to multiply your symbols by a window of the same size, e.g. a Gaussian or Bartlett). In my experience you can move these values in the range from 17.5 to 20.5 kHz (if you go lower it will start to bother people using your app; if you go higher, the frequency response of an average microphone will attenuate your transmission and induce unwanted errors).
On the receiver side, implement a correlation or matched filter receiver (an FFT receiver works as well, especially a zero-padded one, but it might be a little slower; I wouldn't recommend Goertzel because frequency shift due to the Doppler effect or speaker-microphone non-linearities could affect your reception). Once you have received the bit stream, assemble the bits into characters and you will recover your message.
If you face too many transmission errors, try using more samples per symbol, or band-pass filter each frequency before passing the signal to the demodulator. Using an error correction code such as BCH or Reed-Solomon is sometimes the only way to ensure error-free communication.
One topic everybody always forgets to talk about is synchronization (knowing on the receiver side when the transmission has begun). You have to be creative here and run a lot of tests with a lot of phones before you can derive an actual detection threshold that works on all of them. Note that this might also be distance dependent.
If you are unfamiliar with these subjects I would recommend a couple of great books:
Digital Modulation Techniques by Fuqin Xiong
Digital Communications: Fundamentals and Applications by Bernard Sklar
Digital Communications by John G. Proakis
You might have luck with a library I created for sound-based modems, libquiet. It gives you a handful of profiles to work from, including a slow "Ultrasonic whisper" profile with spectral content above 19kHz. The library is written in C but would require some work to interface with iOS.
I'm not sure if it's possible to achieve what I want, but basically I have a NSDictionary which represents a recording. It's a timeline of what sound id was played at what point in time.
I have it so that you can play back this timeline/recording, and it works perfectly.
I'm wondering if there is any way to take this timeline and export it as a single sound that could be saved to a computer if the device was synced with iTunes.
So basically I'm asking if I can take a timeline of sounds, play it back and have these sounds stitched together as a single sound, that can then be exported.
I'm using OpenAL as my sound framework and the sound files are all CAFs.
Any help or guidance is appreciated.
Thanks!
You will need:
A good understanding of linear PCM audio format (See Wikipedia's Linear PCM page).
A good understanding of audio sample-rates and some basic maths to convert your timings into sample-offsets.
An awareness of how two's-complement binary numbers (signed/unsigned, 16-bit, 32-bit, etc.) are stored in computers, and how the endian-ness of a processor affects this.
Patience, interest in learning, and a strong desire to get this working.
Here's what to do:
1. Enable file sharing in your app (UIFileSharingEnabled=YES in Info.plist and write files to the /Documents directory).
2. Render the used sounds into memory buffers containing linear PCM audio data (if they are not already, i.e. if they are compressed). You can do this using the offline rendering functionality of Audio Queues (see Apple audio queue docs). It will make things a lot easier if you render them all to the same PCM format and sample rate (for example, 16-bit signed samples @ 44,100 Hz; I'll use this format for all examples), and use the same format for your output. I recommend starting off with a mono format, then adding stereo once you get it working.
3. Choose an uncompressed output format and mix your sounds into a single stream:
3.1. Allocate a buffer large enough, or open a file stream to write to.
3.2. Write out any headers (for example if using WAV format output instead of raw PCM) and write zeros (or the mid-point of your sample range if not using a signed sample format) for any initial silence before your first sound starts. For example if you want 0.1 seconds silence before your first sound, write 4410 (0.1 * 44100) zero-samples i.e. write 4410 shorts (16-bit) all with zero.
3.3. Now keep track of all 'currently playing' sounds and mix them together. Start with an empty list of 'currently playing' sounds and keep track of the 'current time' of the sample you are mixing; for each sample you write out, increment the 'current time' by 1.0/sample_rate. When it is time for another sound to start, add it to the 'currently playing' list with a sample offset of 0. Now to do the mixing, you iterate through all of the 'currently playing' sounds and add together their current samples, then increment the sample offset for each of them. Write the summed value into the output buffer. For example, if soundA starts at 0.1 seconds (after the silence) and soundB starts at 0.2 seconds, you will be doing the equivalent of output[8820] = soundA[4410] + soundB[0]; for sample 8820 and then output[8821] = soundA[4411] + soundB[1]; for sample 8821, etc. As a sound ends (you get to the end of its samples), simply remove it from the 'currently playing' list and keep going until the end of your audio data.
3.4. The simple mixing (sum of samples) described above does have some problems. For example, if two samples add up to a value larger than 32767 (or smaller than -32768), the result cannot be stored in a signed 16-bit number; this is called clipping. For now, just clamp the value to that range and get it working... later on, come back and implement a simple limiter (see description at end).
4. Now that you have a mixed version of your track in an uncompressed linear PCM format, that might be enough, so write it to /Documents. If you want to write it in a compressed format, you will need to get the source for an audio encoder and run your linear PCM output through that.
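The mixing in step 3 can be sketched as follows, assuming every sound is already mono, 16-bit signed linear PCM at the same sample rate. The Sound struct and function names are invented for this example. Instead of maintaining a 'currently playing' list, this simpler version checks every sound for every output sample, which gives the same result but is slower when there are many sounds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const int16_t *samples; /* mono, 16-bit signed linear PCM */
    size_t length;          /* number of samples */
    size_t start;           /* offset in the output, in samples */
} Sound;

/* Convert a timeline position in seconds to a sample offset. */
size_t seconds_to_samples(double seconds, double sample_rate)
{
    return (size_t)(seconds * sample_rate + 0.5);
}

/* Sum all sounds into out[0..out_len-1], clamping to the 16-bit range. */
void mix_sounds(const Sound *sounds, size_t count,
                int16_t *out, size_t out_len)
{
    assert(out != NULL || out_len == 0);
    for (size_t i = 0; i < out_len; i++) {
        int32_t sum = 0; /* wider than 16 bits so the sum itself fits */
        for (size_t s = 0; s < count; s++) {
            if (i >= sounds[s].start &&
                i - sounds[s].start < sounds[s].length)
                sum += sounds[s].samples[i - sounds[s].start];
        }
        if (sum > INT16_MAX) sum = INT16_MAX; /* crude clipping, see 3.4 */
        if (sum < INT16_MIN) sum = INT16_MIN;
        out[i] = (int16_t)sum;
    }
}
```

Positions with no active sound come out as zero, which covers the initial silence from step 3.2. The start field of each Sound would be filled from the timeline with seconds_to_samples, e.g. 0.1 seconds at 44100 Hz becomes offset 4410.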
Simple limiter:
Let's choose to limit the top 10% of the sample range, so if the absolute value is greater than 29490 (int limitBegin = (int)(32767 * 0.9f);) we will scale down the value. The maximum possible peak would be int maxSampleValue = 32767 * numPlayingSounds; and we want to scale values above limitBegin to peak at 32767. So do the summation into sampleValue as per the very simple mixer described above, then:
if (sampleValue > limitBegin)
{
    float overLimit = (sampleValue - limitBegin) / (float)(maxSampleValue - limitBegin);
    sampleValue = limitBegin + (int)(overLimit * (32767 - limitBegin));
}
If you're paying attention, you will have noticed that when numPlayingSounds changes (for example when a new sound starts), the limiter becomes more (or less) harsh and this may result in abrupt volume changes (within the limited range) to accommodate the extra sound. You can use the maximum number of playing sounds instead, or devise some clever way to ramp up the limiter over a few milliseconds.
Remember that this is operating on the absolute value of sampleValue (which may be negative in signed formats), so the code here is just to demonstrate the idea. You'll need to write it properly to handle limiting at both ends (peak and trough) of your sample range. Also, there are some tricks you can do to optimize all of the above during the mixing - you will probably spot these while you're writing the mixer, be careful and get it working first, then go back and refactor/optimize if needed.
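One possible way to write the limiter so that it handles both ends of the range is sketched below. It follows the same scaling as the snippet above but works on the magnitude and restores the sign afterwards. The function name is hypothetical, and the 90% knee is the same arbitrary choice as in the description.

```c
#include <assert.h>

/* Limit a mixed sample (the sum of num_sounds 16-bit samples) to
 * [-32767, 32767]. Values within 90% of full scale pass through
 * unchanged; larger magnitudes are compressed into the top 10%. */
int limit_sample(int sum, int num_sounds)
{
    assert(num_sounds >= 1);
    const int full_scale = 32767;
    const int limit_begin = (int)(full_scale * 0.9f);
    const int max_value = full_scale * num_sounds; /* largest possible peak */
    int magnitude = sum < 0 ? -sum : sum;

    if (magnitude <= limit_begin)
        return sum;
    if (magnitude > max_value)
        magnitude = max_value; /* e.g. -32768 with a single sound */

    float over = (float)(magnitude - limit_begin) /
                 (float)(max_value - limit_begin);
    magnitude = limit_begin + (int)(over * (float)(full_scale - limit_begin));
    return sum < 0 ? -magnitude : magnitude;
}
```

Passing the maximum number of simultaneously playing sounds as num_sounds, rather than the current count, avoids the abrupt volume changes mentioned above at the cost of a slightly harsher limit.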
Also remember to consider the endian-ness of the platform you are using and the file-format you are writing to, as you may need to do some byte-swapping.
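One way to sidestep the byte-swapping question is to serialize each sample byte by byte instead of writing the in-memory representation directly. The sketch below writes a 16-bit sample in little-endian order, which is what WAV files use; the function name is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 16-bit sample at dst in little-endian byte order,
 * regardless of the byte order of the CPU running the code. */
void put_le16(uint8_t *dst, int16_t value)
{
    assert(dst != NULL);
    uint16_t u = (uint16_t)value;   /* reinterpret as raw 16 bits */
    dst[0] = (uint8_t)(u & 0xFFu);  /* low byte first */
    dst[1] = (uint8_t)(u >> 8);     /* high byte second */
}
```

For a big-endian file format you would swap the two assignments.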
One approach which isn't too hard if your files are stored in a simple format is just to combine them together manually. That is, create a new file with the caf format and manually put together the pieces you want.
This will be really easy if the sounds are uncompressed (linear PCM). But, read the documents on the caf file format here:
http://developer.apple.com/library/mac/#documentation/MusicAudio/Reference/CAFSpec/CAF_spec/CAF_spec.html#//apple_ref/doc/uid/TP40001862-CH210-SW1
I want to add a few bytes of data to a sound file (for example a song). The sound file will be transmitted via radio to a receiver who uses, for example, the iPhone microphone to pick up the sound, and an application will show the original bytes of data. Preferably it should not be audible to humans.
What is such technology called? Are there any applications that can do this?
Libraries/apps that can be used on iPhone?
It's audio steganography. There are algorithms to do it. Refer to here.
I've done some research, and it seems the way to go is:
Use low audio frequencies.
Spread the "bits" around randomly - do not use a pattern as it will be picked up by the listener. "White noise" is a good clue. The random pattern is known by the sender and receiver.
Use Fourier transform to pick up frequency and amplitude
Clean up input data.
Use checksum/redundancy-algorithms to compensate for loss.
I'm writing a prototype and am having a bit of difficulty picking up the right frequency, as it has a ~4 Hz offset (100 Hz becomes 96.x Hz when played and picked up by the microphone).
This is not the answer, but I hope it helps.
I'm using Aran Mulholland's RemoteIOPlayer, using audio queues in the iPhone SDK.
I can without problems:
- add two signals to mix sounds
- increase the sound volume by multiplying the UInt32 I get from the wav files
BUT every other operation gives me warped and distorted sound, and in particular I can't divide the signal. I can't seem to figure out what I'm doing wrong, the actual result of the division seems fine; some aspect of sound / signal processing must obviously be eluding me :)
Any help appreciated !
Have you tried something like this?
- (void)setQueue:(AudioQueueRef)ref toVolume:(float)newValue {
    // kAudioQueueParam_Volume is a linear gain from 0.0 (silence) to 1.0 (full volume).
    OSStatus rc = AudioQueueSetParameter(ref, kAudioQueueParam_Volume, newValue);
    if (rc != noErr) {
        NSLog(@"AudioQueueSetParameter returned %d when setting the volume.", (int)rc);
    }
}
First of all, the code you mention does not use AudioQueues, it uses AudioUnits. The best way to mix audio on the iPhone is to use the built-in mixer units; there is some code on the site you downloaded your original example from here. Other than that, what I would check in your code is that you have the correct data type. Are you trying your operations on unsigned ints when you should be using signed ones? Often that produces warped results (understandably).
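A small sketch of why the data type matters, assuming the samples are 16-bit signed linear PCM (if each UInt32 in the buffer actually packs a left and a right 16-bit sample, which is an assumption about the buffer layout, the two halves would have to be split and processed separately). The function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Scale one sample by a gain, treating it as signed and clamping the
 * result. This is the interpretation that 16-bit linear PCM expects. */
int16_t scale_signed(int16_t sample, float gain)
{
    assert(gain >= 0.0f);
    float v = (float)sample * gain;
    if (v > 32767.0f) v = 32767.0f;
    if (v < -32768.0f) v = -32768.0f;
    return (int16_t)v;
}

/* The same bits misread as unsigned: a small negative sample such as -2
 * looks like 65534, and halving it gives 32767 instead of -1. */
uint16_t halve_unsigned(uint16_t raw)
{
    return (uint16_t)(raw / 2u);
}
```

Multiplying can appear to work with the wrong type as long as the result wraps around in a compatible way, while division does not, which would match the symptom described in the question.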
The iPhone handles audio as 16-bit integer. Most audio files are already normalized so that the peak sample values are the maximum that fit in a 16-bit signed integer. That means if you add two such samples together, you get overflow, or in this case, audio clipping. If you want to mix two audio sources together and ensure there's no clipping, you must average the samples: add them together and divide by two. Or you set the volume to half. If you're using decibels, that would be about a -6 dB change.