I want to build sound or pitch recognition with a recurrent deep neural network, and I'm wondering which input will give the best results.
Should I feed the DNN raw amplitudes, or the result of an FFT (Fast Fourier Transform)?
Is there any other format that is known to produce good results and fast learning?
While MFCCs have indeed been used in music information retrieval research (for genre classification etc.), in this case (pitch detection) you may want to use a semitone filterbank or constant-Q transform as a first information-reduction step. These transformations match musical pitch better.
But I think it's also worth trying to use the audio samples directly with RNNs, in case you have a huge number of samples. In theory, the RNNs should be able to learn the wave patterns corresponding to particular pitches.
From your description, it's not entirely clear what type of "pitch recognition" you're aiming for: monophonic instruments (constant timbre, and only 1 pitch sounding at a time)? Polyphonic (constant timbre, but multiple pitches may be sounding simultaneously)? Multiple instruments playing together (multiple timbres, multiple pitches)? Or even a full mix with both tonal and percussive sounds? The difficulty of these use cases increases roughly in the order listed, so you may want to start with monophonic pitch recognition first.
To obtain the necessary amount of training examples, you could use a physical model or a multi-sampled virtual instrument to generate the audio samples for particular pitches in a controlled way. This way, you can quickly create your training material instead of recording it and labeling it manually. But I would advise you to at least add some background noise (random noise, or very low-level sounds from different recordings) to the created audio samples, or your data may be too artificial and lead to a model that doesn't work well once you want to use it in practice.
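As a rough illustration of this idea, here is a minimal sketch (assuming NumPy, and crude additive synthesis rather than a real physical model or sampled instrument) that generates labelled tones with a little background noise mixed in:

```python
import numpy as np

def make_note(f0, sr=16000, dur=0.5, n_harmonics=5, noise_level=0.01):
    """Synthesize a crude harmonic tone at f0 Hz plus low-level background noise."""
    t = np.arange(int(sr * dur)) / sr
    tone = sum((1.0 / k) * np.sin(2 * np.pi * k * f0 * t)
               for k in range(1, n_harmonics + 1))
    tone /= np.max(np.abs(tone))                 # normalize to [-1, 1]
    noise = noise_level * np.random.randn(len(t))
    return (tone + noise).astype(np.float32)

# One labelled example per MIDI pitch 48..72 (C3..C5)
dataset = [(midi, make_note(440.0 * 2 ** ((midi - 69) / 12)))
           for midi in range(48, 73)]
```

A multi-sampled virtual instrument would of course produce far more realistic timbres, but the structure of the generated training set would be the same.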
Here is a paper that might give you some ideas on the subject:
An End-to-End Neural Network for Polyphonic Piano Music Transcription
(Siddharth Sigtia, Emmanouil Benetos, and Simon Dixon)
https://arxiv.org/pdf/1508.01774.pdf
The Mel-frequency cepstrum is generally used for speech recognition.
Mozilla's DeepSpeech, for example, uses MFCCs as input to its DNN.
For a Python implementation, you can use the python-speech-features library.
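For intuition, the mel-scale warping that MFCC front ends (including python-speech-features) are built on can be sketched as follows; the HTK-style formula and filterbank spacing here are illustrative choices, not necessarily the library's exact defaults:

```python
import numpy as np

def hz_to_mel(f):
    """HTK-style mel scale used by typical MFCC front ends."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filter_centers(n_filters=26, fmin=0.0, fmax=8000.0):
    """Centre frequencies of a mel filterbank: equally spaced in mel, warped back to Hz."""
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filters + 2)
    return 700.0 * (10.0 ** (mels / 2595.0) - 1.0)
```

The library then applies such a filterbank to the power spectrum, takes logs, and applies a DCT to produce the cepstral coefficients.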
I am new to speech processing, so please forgive my ignorance. I was given a short speech signal (10 s) and was asked to manually annotate its pitch using MATLAB or the Wavesurfer software. Now, how do I find the pitch of a speech signal? Is there any theoretical resource to help with the problem? I tried to plot the pitch contour of the signal using Wavesurfer. Is that right?
Edit 1: My task is to apply various pitch detection algorithms to our data and compare their accuracies, so the manually annotated pitch acts as the reference.
UPDATE 1: I obtained the GCIs (Glottal Closure Instants) by differentiating the EGG signal (dEGG); the peaks in the dEGG are the GCIs. The time interval between two successive GCIs is the pitch period (s), and the inverse of the pitch period is the pitch (Hz).
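A minimal sketch of that computation (hypothetical GCI times, assuming NumPy):

```python
import numpy as np

# Hypothetical GCI times in seconds (e.g. peaks picked from the dEGG signal)
gci_times = np.array([0.010, 0.018, 0.026, 0.034, 0.042])

pitch_periods = np.diff(gci_times)   # interval between successive GCIs (s)
f0 = 1.0 / pitch_periods             # pitch in Hz (8 ms period -> 125 Hz)
```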
UPDATE 2: SIGMA is a well-known algorithm for automatic GCI detection.
Thanks everyone.
Usually, ground truth is obtained from a signal recorded together with an EGG signal. EGG stands for electroglottograph, a device that measures vocal fold contact directly, from which the true pitch can be derived.
Since I doubt you have access to such a device, I recommend using an existing database carefully prepared for evaluating pitch extraction. You can download it here. This data was collected at the University of Edinburgh by Paul Bagshaw.
I suggest you read his thesis as well.
If you want to compare with a state-of-the-art algorithm for pitch extraction, check https://github.com/google/REAPER. Also note that the "true" pitch might not be the best feature for subsequent algorithms: sometimes you might extract pitch with mistakes yet get better accuracy, for example in speech recognition. For more information, check this publication.
Is this because it's a complex problem? I mean, too broad, so that no simple/generic solution exists?
Because almost every piece of signal-processing software (Avisoft, GoldWave, Audacity…) has a function that reduces the background noise of a signal, usually based on the FFT. But I can't find an already-implemented function in Matlab that does the same. Is the right way to do it manually, then?
Thanks.
The common audio noise reduction approaches built into things like Audacity are based on spectral subtraction, which estimates the level of steady background noise in the Fourier-transform magnitude domain, then removes that much energy from every frame, leaving energy only where the signal "pokes above" this noise floor.
You can find many implementations of spectral subtraction for Matlab; this one is highly rated on Matlab File Exchange:
http://www.mathworks.com/matlabcentral/fileexchange/7675-boll-spectral-subtraction
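For intuition, the core of the magnitude-domain subtraction can be sketched as follows (a simplified single-frame version, not Boll's full algorithm; the spectral floor value is an illustrative choice):

```python
import numpy as np

def spectral_subtract(frame, noise_mag, floor=0.02):
    """One frame of magnitude-domain spectral subtraction (simplified sketch)."""
    spec = np.fft.rfft(frame)
    mag, phase = np.abs(spec), np.angle(spec)
    # Subtract the noise magnitude estimate, but never go below a small floor
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(frame))
```

In practice you would estimate noise_mag by averaging the magnitude spectra of frames known to contain only noise, and process the signal frame by frame with windowing and overlap-add.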
The question is, what kind of noise reduction are you looking for? There is no one solution that fits all needs. Here are a few approaches:
Low-pass filtering the signal reduces noise but also removes the high-frequency components of the signal. For some applications this is perfectly acceptable. There are lots of low-pass filter functions and Matlab helps you apply plenty of them. Some knowledge of how digital filters work is required. I'm not going into it here; if you want more details consider asking a more focused question.
An approach suitable for many situations is using a noise gate: simply attenuate the signal whenever its RMS level goes below a certain threshold, for instance. In other words, this kills quiet parts of the audio dead. You'll retain the noise in the more active parts of the signal, though, and if you have a lot of dynamics in the actual signal you'll get rid of some signal, too. This tends to work well for, say, slightly noisy speech samples, but not so well for very noisy recordings of classical music. I don't know whether Matlab has a function for this.
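A minimal sketch of such a gate (assuming NumPy; the threshold and frame length are illustrative):

```python
import numpy as np

def noise_gate(x, threshold=0.01, frame_len=1024):
    """Attenuate (here: zero out) every frame whose RMS level is below threshold."""
    y = x.copy()
    for start in range(0, len(x), frame_len):
        frame = x[start:start + frame_len]
        if np.sqrt(np.mean(frame ** 2)) < threshold:
            y[start:start + frame_len] = 0.0
    return y
```

A production gate would also ramp the gain smoothly (attack/release) instead of hard-zeroing frames, to avoid audible clicks at frame boundaries.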
Some approaches involve making a "fingerprint" of the noise and then removing that throughout the signal. It tends to make the result sound strange, though, and in any case this is probably sufficiently complex and domain-specific that it belongs in an audio-specific tool and not in a rather general math/DSP system.
Reducing noise requires making some assumptions about the type of noise and the type of signal, and how they are different. Audio processors typically assume (correctly or incorrectly) something like that the audio is speech or music, and that the noise is typical recording session background hiss, A/C power hum, or vinyl record pops.
Matlab is for general use (microwave radio, data comm, subsonic earthquakes, heartbeats, etc.), and thus can make no such assumptions.
Matlab is not exactly an audio processor; you have to implement your own filter, and you will have to design it correctly according to what you want.
I'm currently working on a program that can output a sine wave of a set frequency through the speaker/headphones on an iPhone.
Now I want to output multiple sine waves, and I don't know which approach is better. Should I just add all the sine waves and play them using one AudioUnit, or create an AudioUnit for each sine wave?
I'm currently leaning towards the first solution, but I don't know why... it's just my instinct. It would be great if someone could explain why the solution they chose is better :)
Thanks !
You will have more precise control of the timing of the mix (where each sine wave starts and ends), and the quality of the mix, if you create one DSP mixer and play the result through a single Audio Unit. There will also be a very tiny bit less thread switching overhead taking up CPU cycles.
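A sketch of the pre-mix approach (in Python for clarity; on iOS you would do the same summation in the render callback and hand the single mixed buffer to the one Audio Unit):

```python
import numpy as np

def mix_sines(freqs, sr=44100, dur=0.1):
    """Sum several sine waves into one buffer and scale, so a single
    audio unit can play the pre-mixed result."""
    t = np.arange(int(sr * dur)) / sr
    mix = np.zeros_like(t)
    for f in freqs:
        mix += np.sin(2 * np.pi * f * t)
    return mix / max(len(freqs), 1)    # keep the sum within [-1, 1]

buffer = mix_sines([440.0, 554.37, 659.25])   # an A major triad
```

Dividing by the number of voices is a crude way to avoid clipping; a real mixer would use per-voice gains.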
I want to detect not the pitch, but the pitch class of a sung note.
So, whether it is C4 or C5 is not important: they must both be detected as C.
Imagine the 12 semitones arranged on a clock face, with the needle pointing to the pitch class. That's what I'm after! Ideally I would like to be able to tell whether the sung note is spot-on or slightly off.
This is not a duplicate of previously asked questions, as it introduces the constraints that:
the sound source is a single human voice, hopefully with negligible background interference (although I may need to deal with this)
the octave is not important, only the pitch class
EDIT -- Links:
Real time pitch detection
Using the Apple FFT and Accelerate Framework
See my answer here for getting smooth FREQUENCY detection: https://stackoverflow.com/a/11042551/1457445
As far as snapping this frequency to the nearest note -- here is a method I created for my tuner app:
- (int) snapFreqToMIDI: (float) frequency {
    // referenceA is the frequency used for A (e.g. 440.0 Hz). With the +57 offset,
    // the pitch class (midiNote % 12) comes out correct for any octave of A.
    // Adding 0.5 before the implicit int conversion rounds to the nearest note.
    int midiNote = (12 * (log10(frequency / referenceA) / log10(2)) + 57) + 0.5;
    return midiNote;
}
This will return the MIDI note value (http://www.phys.unsw.edu.au/jw/notes.html)
In order to get a string from this MIDI note value:
- (NSString*) midiToString: (int) midiNote {
    NSArray *noteStrings = [[NSArray alloc] initWithObjects:@"C", @"C#", @"D", @"D#", @"E", @"F", @"F#", @"G", @"G#", @"A", @"A#", @"B", nil];
    return [noteStrings objectAtIndex:midiNote % 12];
}
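The same snapping logic in Python, using the standard convention that A4 = 440 Hz = MIDI note 69 (the Objective-C version above uses a +57 offset, so its absolute note numbers may differ by an octave, though the pitch class is the same):

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def snap_freq_to_midi(freq, reference_a4=440.0):
    # Standard MIDI convention: A4 = 440 Hz = note 69
    return round(69 + 12 * math.log2(freq / reference_a4))

def midi_to_string(midi_note):
    return NOTE_NAMES[midi_note % 12]
```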
For an example implementation of the pitch detection with output smoothing, look at musicianskit.com/developer.php
Pitch is a human psycho-perceptual phenomenon. Peak frequency content is not the same as either pitch or pitch class. FFT and DFT methods will not directly provide pitch, only frequency. Neither will zero-crossing measurements work well for human voice sources. Try AMDF, ASDF, autocorrelation or cepstral methods. There are also plenty of academic papers on the subject of pitch estimation.
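As one example from that family, a bare-bones autocorrelation estimator can be sketched as follows (assuming NumPy; real implementations add windowing, voicing decisions, and octave-error safeguards):

```python
import numpy as np

def autocorr_pitch(x, sr, fmin=50.0, fmax=500.0):
    """Estimate f0 as the lag of the autocorrelation peak within a plausible range."""
    x = x - np.mean(x)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # non-negative lags only
    lo, hi = int(sr / fmax), int(sr / fmin)             # search plausible pitch lags
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag
```

Restricting the lag search range is what keeps the estimator from latching onto harmonics or subharmonics, which is the classic octave-error failure mode.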
There is another long list of pitch estimation algorithms here.
Edited addition: Apple's SpeakHere and aurioTouch sample apps (available from their iOS dev center) contain example source code for getting PCM sample blocks from the iPhone's mic.
Most of the frequency detection algorithms cited in other answers don't work well for voice. To see why this is so intuitively, consider that all the vowels in a language can be sung at one particular note. Even though all those vowels have very different frequency content, they would all have to be detected as the same note. Any note detection algorithm for voices must take this into account somehow. Furthermore, human speech and song contains many fricatives, many of which have no implicit pitch in them.
In the generic (non-voice) case the feature you are looking for is called the chroma feature, and there is a fairly large body of work on the subject. It is equivalently known as the harmonic pitch class profile. The original reference paper on the concept is Takuya Fujishima's "Real-Time Chord Recognition of Musical Sound: A System Using Common Lisp Music". The Wikipedia entry has an overview of a more modern variant of the algorithm. There are a bunch of free papers and MATLAB implementations of chroma feature detection.
However, since you are focusing on the human voice only, and since the human voice naturally contains tons of overtones, what you are practically looking for in this specific scenario is a fundamental frequency detection algorithm, or f0 detection algorithm. There are several such algorithms explicitly tuned for voice. Also, here is a widely cited algorithm that works on multiple voices at once. You'd then check the detected frequency against the equal-tempered scale and then find the closest match.
Since I suspect that you're trying to build a pitch detector and/or corrector a la Autotune, you may want to use M. Morise's excellent WORLD implementation, which permits fast and good quality detection and modification of f0 on voice streams.
Lastly, be aware that there are only a few vocal pitch detectors that work well within the vocal fry register. Almost all of them, including WORLD, fail on vocal fry as well as very low voices. A number of papers refer to vocal fry as "creaky voice" and have developed specific algorithms to help with that type of voice input specifically.
If you are looking for the pitch class, you should have a look at the chromagram (http://labrosa.ee.columbia.edu/matlab/chroma-ansyn/).
You can also simply detect the f0 (using something like the YIN algorithm) and return the appropriate semitone; note that most fundamental frequency estimation algorithms suffer from octave errors.
Perform a Discrete Fourier Transform on samples from your input waveform, then sum values that correspond to equivalent notes in different octaves. Take the largest value as the dominant frequency.
You can likely find some existing DFT code in Objective C that suits your needs.
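The fold-into-pitch-classes step can be sketched like this (assuming NumPy; A4 = 440 Hz, and pitch class 9 = A in a C-based numbering):

```python
import numpy as np

def pitch_class_profile(x, sr):
    """Fold DFT magnitudes into 12 pitch-class bins (octave-equivalent)."""
    mags = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    freqs = np.fft.rfftfreq(len(x), 1.0 / sr)
    profile = np.zeros(12)
    for f, m in zip(freqs[1:], mags[1:]):            # skip the DC bin
        # Map each bin frequency to its nearest MIDI note, then to a pitch class
        pc = int(round(12 * np.log2(f / 440.0) + 69)) % 12
        profile[pc] += m
    return profile
```

The index of the largest entry in the profile is the dominant pitch class.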
Putting up information as I find it...
Pitch detection algorithm on Wikipedia is a good place to start. It lists a few methods that fail for determining octave, which is okay for my purpose.
A good explanation of autocorrelation can be found here (why can't Wikipedia put things simply like that??).
Finally I have closure on this one, thanks to this article from DSP Dimension
The article contains source code.
Basically he performs an FFT, then explains that frequencies which don't coincide exactly with the centre of the bin they fall in will smear over nearby bins in a sort of bell-shaped curve, and he shows how to extract the exact frequency from this data in a second pass (the FFT being the first pass).
The article then goes further, into pitch shifting; I can simply delete that code.
Note that they supply a commercial library that does the same thing (and far more), only super-optimised. There is a free version of the library that would probably do everything I need, although since I have already worked through the iOS audio subsystem, I might as well implement it myself.
For the record, I found an alternative way to extract the exact frequency, by approximating a quadratic curve over the bin and its two neighbours, here. I have no idea of the relative accuracy of these two approaches.
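That quadratic (parabolic) refinement over the peak bin and its two neighbours can be sketched as follows (assuming NumPy; interpolating log-magnitudes, which tends to work well with a Hann window):

```python
import numpy as np

def refined_peak_freq(x, sr):
    """Refine the FFT peak frequency with a parabola fitted through the
    log-magnitudes of the peak bin and its two neighbours."""
    mags = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    k = int(np.argmax(mags[1:-1])) + 1                  # peak bin, away from the edges
    alpha, beta, gamma = np.log(mags[k - 1:k + 2])
    delta = 0.5 * (alpha - gamma) / (alpha - 2 * beta + gamma)  # fractional-bin offset
    return (k + delta) * sr / len(x)
```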
As others have mentioned you should use a pitch detection algorithm. Since that ground is well-covered I will address a few particulars of your question. You said that you are looking for the pitch class of the note. However, the way to find this is to calculate the frequency of the note and then use a table to convert it to the pitch class, octave, and cents. I don't know of any way to obtain the pitch class without finding the fundamental frequency.
You will need a real-time pitch detection algorithm. In evaluating algorithms pay attention to the latency implied by each algorithm, compared with the accuracy you desire. Although some algorithms are better than others, fundamentally you must trade one for the other and cannot know both with certainty -- sort of like the Heisenberg uncertainty principle. (How can you know the note is C4 when only a fraction of a cycle has been heard?)
Your "smoothing" approach is equivalent to a digital filter, which will alter the frequency characteristics of the voice. In short, it may interfere with your attempts to estimate the pitch. If you have an interest in digital audio, digital filters are fundamental and useful tools in that field, and a fascinating subject besides. It helps to have a strong math background in understanding them, but you don't necessarily need that to get the basic idea.
Also, your zero crossing method is a basic technique to estimate the period of a waveform and thus the pitch. It can be done this way, but only with a lot of heuristics and fine-tuning. (Essentially, develop a number of "candidate" pitches and try to infer the dominant one. A lot of special cases will emerge that will confuse this; a quick one is the letter 's'.) You'll find it much easier to begin with a frequency-domain pitch detection algorithm.
If you're a beginner, this may be very helpful. It is available for both Java and iOS.
dywapitchtrack for ios
dywapitchtrack for java