Separating the instrumental and vocals from an MP3 in Objective-C - filtering

I'm trying to remove the instrumentals from any MP3 in Objective-C for a karaoke song maker. The solution doesn't have to be perfect. From my research, the general idea seems to be some sort of filter that leaves the vocals intact but attenuates the frequency range occupied by the instruments. I don't have much of a background in signal processing, but I'd love some help on this topic.

In general, separating instrumentals and vocals can't be done by filtering, because the frequency ranges of vocals and instruments overlap considerably.
For stereo music where the main vocal is panned dead center and the instruments are panned off to the sides, you can remove much of the vocal from the mix by subtracting one channel from the other (say, the left from the right). To do this subtraction, you would have to decode the MP3 into uncompressed raw PCM samples and work with C data types.
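Here's a minimal sketch of that subtraction in Python/NumPy, assuming the MP3 has already been decoded to a stereo WAV (the file names are placeholders); the same arithmetic applies to raw PCM buffers in C:

    import numpy as np
    from scipy.io import wavfile

    # Assumes the MP3 was already decoded to stereo PCM, e.g. with
    # ffmpeg -i song.mp3 song.wav (file names are placeholders).
    rate, pcm = wavfile.read("song.wav")
    left = pcm[:, 0].astype(np.float32)
    right = pcm[:, 1].astype(np.float32)

    # Subtracting one channel from the other cancels anything panned
    # dead center (often the lead vocal); halve to avoid clipping.
    karaoke = (left - right) / 2.0

    wavfile.write("song_karaoke.wav", rate, karaoke.astype(np.int16))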
A search term for academic research on how this might be accomplished is "blind source separation".


Which input format is the best for sound recognition in recurrent neural networks?

I want to create sound or pitch recognition with a recurrent deep neural network, and I'm wondering what input will give the best results.
Should I feed the network raw amplitudes, or FFT (Fast Fourier Transform) results?
Is there any other input format that is known to produce good results and fast learning?
While MFCCs have indeed been used in music information retrieval research (for genre classification, etc.), in this case (pitch detection) you may want to use a semitone filterbank or a constant-Q transform as a first information-reduction step. These transformations match musical pitch much better.
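As a rough example, here's what that information-reduction step might look like with librosa's constant-Q transform (the input file name is hypothetical):

    import numpy as np
    import librosa

    # Hypothetical input file; librosa resamples to 22,050 Hz by default.
    y, sr = librosa.load("notes.wav")

    # Constant-Q transform: 84 bins spanning 7 octaves at 12 bins per
    # octave, so each bin lines up with one semitone of the scale.
    C = np.abs(librosa.cqt(y, sr=sr, n_bins=84, bins_per_octave=12))

    # C has shape (n_bins, n_frames); each column is one input frame
    # for the network.
    print(C.shape)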
But I think it's also worth trying to feed the raw audio samples directly into RNNs, provided you have a huge number of training examples. In theory, RNNs should be able to learn the wave patterns corresponding to particular pitches.
From your description, it's not entirely clear what type of "pitch recognition" you're aiming for: monophonic instruments (constant timbre, only one pitch sounding at a time)? Polyphonic (constant timbre, but multiple pitches sounding simultaneously)? Multiple instruments playing together (multiple timbres, multiple pitches)? Or even a full mix with both tonal and percussive sounds? The difficulty of these use cases roughly increases in the order listed, so you may want to start with monophonic pitch recognition first.
To obtain the necessary number of training examples, you could use a physical model or a multi-sampled virtual instrument to generate the audio for particular pitches in a controlled way. This way, you can create your training material quickly instead of recording and labeling it manually. But I would advise you to at least add some background noise (random noise, or very low-level sounds from other recordings) to the generated audio, or your data may be too artificial and lead to a model that doesn't work well once you use it in practice.
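A minimal sketch of that kind of controlled data generation, using a plain sine as a stand-in for a real instrument model (the sample rate and noise level are arbitrary choices):

    import numpy as np

    SR = 22050  # sample rate in Hz, an arbitrary choice

    def make_example(midi_pitch, dur=1.0, noise_level=0.01):
        """One labeled training clip: a tone at the given MIDI pitch
        plus low-level random background noise."""
        f0 = 440.0 * 2 ** ((midi_pitch - 69) / 12.0)  # MIDI -> Hz
        t = np.arange(int(SR * dur)) / SR
        tone = 0.5 * np.sin(2 * np.pi * f0 * t)  # stand-in for a real instrument
        noise = noise_level * np.random.randn(len(t))
        return tone + noise, midi_pitch

    clip, label = make_example(60)  # middle C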
Here is a paper that might give you some ideas on the subject:
An End-to-End Neural Network for Polyphonic Piano Music Transcription
(Siddharth Sigtia, Emmanouil Benetos, and Simon Dixon)
https://arxiv.org/pdf/1508.01774.pdf
The Mel-frequency cepstrum is generally used for speech recognition.
Mozilla's DeepSpeech uses MFCCs as the input to its DNN.
For a Python implementation, you can use the python_speech_features library.
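A minimal MFCC-extraction sketch with that library (the input file is a hypothetical 16 kHz mono recording; the frame parameters shown are the library's defaults):

    import scipy.io.wavfile as wav
    from python_speech_features import mfcc

    # Hypothetical 16 kHz mono recording.
    rate, signal = wav.read("speech.wav")

    # 13 MFCCs per 25 ms frame with a 10 ms hop (the library defaults).
    feats = mfcc(signal, samplerate=rate, winlen=0.025, winstep=0.01, numcep=13)
    print(feats.shape)  # (n_frames, 13)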

How to play a row of numbers on an iPhone as audio?

I'm looking at the output of an electroencephalogram sensor. This data is displayed on screen in raw form at about 200 Hz. I read that in the old days it was possible to hook such an output up to a speaker and hear the waveform instead of seeing it. So I'm interested in whether it is possible to replicate this experiment with a modern iPhone. How can I take a waveform that is displayed in graph form and package it in such a way that it can be played through an iPhone's speaker live? In other words, I'm looking to stream EEG data through some sort of audio player, and I need to know how to create audio packets from this data on the fly.
Here's the raw waveform; it is displayed at 200 data points per second (200 Hz).
After I clean up and process the waveform, I'm interested in how far it deviates from its average. In that case, I think it could be played as the increasing/decreasing amplitude of a sine wave, which may be easier.
Thank you for your input.
Here's a good tutorial on generating a sine tone for output through CoreAudio:
http://www.cocoawithlove.com/2010/10/ios-tone-generator-introduction-to.html
The RenderProc is the bit of code you'll be working with; in the example they use an NSSlider to change the frequency, but you just need to feed it your signal data instead.
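Whichever audio API you use, the signal-processing step is the same: bring the 200 Hz stream up to the audio sample rate, either by resampling it directly or by using it to modulate an audible carrier (the sine-amplitude idea from the question). A sketch in Python with placeholder data, just to show the arithmetic the render callback would perform:

    import numpy as np
    from scipy.signal import resample_poly

    SR = 48000      # audio output rate
    EEG_RATE = 200  # EEG samples per second

    # Hypothetical one-second block of EEG samples, normalized to [-1, 1].
    eeg = np.random.randn(EEG_RATE)
    eeg /= np.max(np.abs(eeg))

    # Option 1: upsample the raw waveform itself to the audio rate.
    direct = resample_poly(eeg, SR, EEG_RATE)

    # Option 2: use the deviation from the mean to modulate the amplitude
    # of an audible sine carrier (easier to hear at such low signal rates).
    envelope = np.repeat(np.abs(eeg - eeg.mean()), SR // EEG_RATE)
    t = np.arange(len(envelope)) / SR
    modulated = envelope * np.sin(2 * np.pi * 440.0 * t)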
One of the ideas I had for playing sound in response to signal amplitude changes is to divide the amplitude into a set of discrete bands (for example 0-10, 10-20, 20-30, etc.) and assign a sound to each band. Then, using Audio Services or System Sound, it might be possible to loop a unique sound fragment for each band.

Changing the pitch of an audio WAV file in MATLAB?

How do you go about changing the pitch of an audio signal in MATLAB? Essentially, I just want to change the original qualities of the audio signal without making a dramatic change. I'm trying to simulate a chorus by altering the original input audio slightly, so that I can layer multiple variations of it.
The simplest approach might be a phase vocoder. You can find one MATLAB implementation here:
http://labrosa.ee.columbia.edu/matlab/pvoc/
This is a rabbit hole, though. There are many more techniques that can be employed to improve the quality and reduce the artifacts introduced by pitch shifting. See, for example, Jean Laroche and Mark Dolson, "New Phase-Vocoder Techniques for Pitch Shifting, Harmonizing and Other Exotic Effects", Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, p. 91.
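If MATLAB isn't a hard requirement, the same phase-vocoder recipe is packaged in Python's librosa, which makes a quick chorus experiment easy to sketch (the file names and detune amounts here are arbitrary choices):

    import librosa
    import soundfile as sf

    y, sr = librosa.load("vocal.wav", sr=None)  # hypothetical input file

    # Layer the original with copies detuned by fractions of a semitone;
    # under the hood this is a phase-vocoder stretch plus resampling.
    voices = [y]
    for steps in (-0.3, 0.25):
        voices.append(librosa.effects.pitch_shift(y, sr=sr, n_steps=steps))

    n = min(len(v) for v in voices)
    chorus = sum(v[:n] for v in voices) / len(voices)
    sf.write("chorus.wav", chorus, sr)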

Ultrasound on iPhone (Shopkick signal technology)

I think Shopkick is detecting a very high-frequency signal that is not audible to the human ear. But the real question is how they can detect a signal above 22 kHz on an iPhone. I have checked the frequency response of the iPhone mic; it seems to run from 20 Hz to 22 kHz, i.e. within the human audible range.
http://blog.faberacoustical.com/2009/iphone/iphone-microphone-frequency-response-comparison/
http://www.businessinsider.com/shopkick-crate-barrel-2010-12?op=1
Can you guide me on this? If it is possible with the iPhone mic, then we should be able to do some signal processing, specifically an FFT, to extract the frequency.
Well, I am currently working on a similar system for transmitting data over these high frequencies, and this is what I found out. Keep in mind, though, that I am doing this with Android phones, mostly the Galaxy S line.
First of all, the 20 kHz to 22 kHz band seems quite promising, because it can be detected by all the phones we tested and even reproduced by some of them. These frequencies are inaudible to humans of any age, and even dogs and cats seem not to notice them. If you are targeting (actually, avoiding) detection by humans, you could go as low as 18 kHz, since most people wouldn't hear that. This gives you a bandwidth of 4,000 Hz into which you can frequency-modulate data. Of course, don't expect to transmit 8 MP images, but small amounts of data can be transmitted. You are right that you could then use an FFT to move into the frequency domain and analyze those frequencies; this can be done in Java even on older phones (I'd expect Objective-C to be faster still).
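As a sketch of that detection step, here's an FFT-based check for energy at a near-ultrasonic carrier (the carrier frequency, FFT size, and threshold are all assumptions, and the buffer is placeholder data standing in for microphone input):

    import numpy as np

    SR = 44100         # sample rate
    TARGET_HZ = 20000  # hypothetical carrier frequency
    N = 4096           # FFT size

    # Placeholder buffer; on a phone this would come from the microphone.
    buf = np.random.randn(N)

    window = np.hanning(N)
    spectrum = np.abs(np.fft.rfft(buf * window))
    freqs = np.fft.rfftfreq(N, d=1.0 / SR)

    # Compare the carrier bin against the average band energy to decide
    # whether the tone is present (the factor of 10 is a tunable guess).
    bin_idx = np.argmin(np.abs(freqs - TARGET_HZ))
    present = spectrum[bin_idx] > 10 * spectrum.mean()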
Also, if you have a few iPhones at your disposal, you could install any frequency analyzer and play the frequencies you want on another iPhone or a speaker to test what they can detect. Just keep in mind that standard desktop speakers can probably play the given frequencies but will introduce lower-frequency noise. Piezo tweeters are probably best for this type of sound, although I must say I am using an iPhone 4 to play these frequencies for testing quite effectively.
I read somewhere that Shopkick now even plays its sound codes over stores' PA systems, and since those speakers are not really optimized for response above 20 kHz, I too am starting to suspect they are using frequencies below that. Take a look at this website for the store codes some people are using to cheat the system: http://www.ceploitips.com/2011/03/shopkick-walk-in-files.html
Keep in mind that using these might get your account banned, since they have improved their misuse-detection algorithms.
Also, I too would like to read more about the Shopkick implementation, so if anyone viewing this has a link, please share.
First, human hearing pretty much tops out at 20 kHz, and even that requires a very young listener; sensitivity at those upper frequencies is very low and erratic. For example, I can produce an 18 kHz tone at full iPad volume at a 48 kHz sample rate that even my dog doesn't notice. Read up on psychoacoustics and you will see that humans filter out echoes even at very low frequencies; they are there, but we don't notice them.
But in the case of Shopkick, I don't think they are going above even 21 kHz. I have created several digital audio modulation schemes on the iPhone, and 21 kHz seems to be the upper limit for any distance at all.
It would help if you gave more detail on what you are doing. I assume from the question that you want to modulate a digital signal between two devices.
My best guess is that they are using maximal length sequences. These sound almost like a weak background hiss covering a large range of the audio spectrum. The key to detection is that the pattern repeats exactly, so the phone can detect the sound by correlating a stored copy of the sequence against the incoming audio.
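To illustrate that correlation idea (this is not a claim about Shopkick's actual scheme), here's a small sketch using SciPy's maximal-length-sequence generator; the sequence length and noise levels are arbitrary:

    import numpy as np
    from scipy.signal import max_len_seq

    # 10-bit maximal length sequence: 2**10 - 1 chips, mapped to +/-1.
    key = max_len_seq(10)[0].astype(np.float64) * 2 - 1

    # Simulated received audio: the key buried in much stronger noise.
    rx = np.concatenate([np.zeros(500), 0.05 * key, np.zeros(500)])
    rx += 0.5 * np.random.randn(len(rx))

    # Correlating the known key against the input produces a sharp spike
    # at the sequence's location despite the low signal level.
    corr = np.correlate(rx, key, mode="valid")
    print("detected at sample", np.argmax(np.abs(corr)))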

How to determine the frequency of recorded voice input on iPhone?

I am new to iPhone development. I am doing research on voice recording on the iPhone. I have downloaded the "SpeakHere" sample program from Apple. I want to determine the frequency of my voice as recorded on the iPhone. Please guide me. Thanks.
In the context of processing human speech, there's really no such thing as "the" frequency.
The signal will be a mix of many different frequencies, so it might be more fruitful to think in terms of a spectrum rather than a single frequency. Even if you're talking about a sustained musical note with a fixed pitch, there will be plenty of overtones and harmonics present in addition to the fundamental frequency of the note. And for actual speech, the frequency spectrum will change drastically even within a short clip, due to the different tonal characteristics of vowels and consonants.
With that said, it does make some sense to consider the peak frequency of a voice recording. You could calculate the Fast Fourier Transform of your voice clip, then find the frequency bin with the largest response. You may also be interested in the concept of a spectrogram, which represents how the audio spectrum of a signal varies over time.
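A sketch of that peak-bin approach in Python (the input file is a hypothetical mono clip; on iOS, the same steps map onto the Accelerate framework's FFT, as another answer below notes):

    import numpy as np
    from scipy.io import wavfile

    rate, x = wavfile.read("voice.wav")  # hypothetical mono clip
    x = x.astype(np.float64)

    # Window the clip, take the FFT, and pick the strongest bin.
    spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / rate)

    peak_hz = freqs[np.argmax(spectrum)]
    print(f"peak frequency: {peak_hz:.1f} Hz")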
Use Audacity. Take a small recording of typical speech and cut it down to one cycle of the wave, from one peak to the next. Subtract the two times, divide 1 by that difference, and you'll get the frequency of your wave in Hz.
Example:
In my audio clip, my waveform runs from 0.0760 to 0.0803 seconds.
0.0803-0.0760 = 0.0043
1/0.0043 = 232.558 Hz, my typical speech frequency
This might give you a good basis for creating an analyzer. You'd need to detect the peaks, measure the time between successive peaks, and average the results.
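Here's a rough sketch of that peak-timing approach, using SciPy's peak finder (the file name and the peak-spacing floor are assumptions):

    import numpy as np
    from scipy.signal import find_peaks
    from scipy.io import wavfile

    rate, x = wavfile.read("speech_snippet.wav")  # hypothetical short clip
    x = x.astype(np.float64)

    # Find prominent positive peaks; the spacing floor (~rate/500) assumes
    # the voice stays below 500 Hz, a tunable guess.
    peaks, _ = find_peaks(x, distance=rate // 500, prominence=np.std(x))

    periods = np.diff(peaks) / rate  # seconds between adjacent peaks
    print("estimated frequency:", 1.0 / periods.mean(), "Hz")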
You'll need to use Apple's Accelerate framework to take an FFT of the relevant audio. The FFT converts the audio from the time domain to the frequency domain. The Accelerate framework supports the FFT and will allow you to do frequency analysis in real time.