Which input format is best for sound recognition in recurrent neural networks?

I want to build sound or pitch recognition with a recurrent deep neural network, and I'm wondering which input will give the best results.
Should I feed the network raw amplitudes or the FFT (Fast Fourier Transform) result?
Is there any other format that is known to produce good results and fast learning?

While MFCCs have indeed been used in music information retrieval research (for genre classification etc.), in this case (pitch detection) you may want to use a semitone filterbank or constant-Q transform as a first information-reduction step. These transformations match musical pitch better.
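For illustration, a minimal constant-Q front end using librosa (assuming it's available; the file name and parameter choices are placeholders, not a definitive recipe):

import numpy as np
import librosa

y, sr = librosa.load("note.wav", sr=22050, mono=True)  # hypothetical file
# 84 bins at 12 bins per octave = 7 octaves, aligned to semitones from C1
C = np.abs(librosa.cqt(y, sr=sr, fmin=librosa.note_to_hz("C1"),
                       n_bins=84, bins_per_octave=12))
log_C = librosa.amplitude_to_db(C)  # log magnitude is friendlier for networks
print(log_C.shape)  # (84, n_frames): one row per semitone, one column per frame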
But I think it's also worth trying to use the audio samples directly with RNNs, in case you have a huge number of samples. In theory, the RNNs should be able to learn the wave patterns corresponding to particular pitches.
From your description, it's not entirely clear what type of "pitch recognition" you're aiming for: monophonic instruments (constant timbre, and only one pitch sounding at a time)? Polyphonic (constant timbre, but multiple pitches may sound simultaneously)? Multiple instruments playing together (multiple timbres, multiple pitches)? Or even a full mix with both tonal and percussive sounds? The difficulty of these use cases roughly increases in the order I listed them, so you may want to start with monophonic pitch recognition first.
To obtain the necessary amount of training examples, you could use a physical model or a multi-sampled virtual instrument to generate the audio samples for particular pitches in a controlled way. This way, you can quickly create your training material instead of recording it and labeling it manually. But I would advise you to at least add some background noise (random noise, or very low-level sounds from different recordings) to the created audio samples, or your data may be too artificial and lead to a model that doesn't work well once you want to use it in practice.
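As a sketch of that noise-augmentation step (white noise mixed in at a chosen SNR; the SNR range in the usage comment is illustrative):

import numpy as np

def add_noise(clean, snr_db):
    # scale white noise so that 10*log10(P_signal / P_noise) equals snr_db
    signal_power = np.mean(clean ** 2)
    noise = np.random.randn(len(clean))
    scale = np.sqrt(signal_power / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return clean + scale * noise

# e.g. augment each synthetic clip at a random SNR between 20 and 40 dB:
# noisy = add_noise(clip, snr_db=np.random.uniform(20, 40))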
Here is a paper that might give you some ideas on the subject:
An End-to-End Neural Network for Polyphonic Piano Music Transcription
(Siddharth Sigtia, Emmanouil Benetos, and Simon Dixon)
https://arxiv.org/pdf/1508.01774.pdf

The Mel-frequency cepstrum is generally used for speech recognition.
Mozilla DeepSpeech uses MFCCs as input to its DNN.
For a Python implementation you can use the python_speech_features library.
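For example, a minimal sketch with python_speech_features ("speech.wav" is a placeholder file; window parameters shown are the library's usual defaults):

from scipy.io import wavfile
from python_speech_features import mfcc

rate, sig = wavfile.read("speech.wav")
feat = mfcc(sig, samplerate=rate, winlen=0.025, winstep=0.01, numcep=13)
print(feat.shape)  # (n_frames, 13): one 13-dim MFCC vector per 10 ms step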

Related

Detect if two audio files are generated by the same instrument

What I'm trying to do is to detect in a small set of audio samples if any are generated by the same instrument. If so, those are considered duplicates and filtered out.
Listen to this file of ten concatenated samples. You can hear that the first five are all generated by the same instrument (an electric piano) so four of them are to be deemed duplicates.
What algorithm or method can I use to solve this problem? Note that I don't need full-fledged instrument detection as I'm only interested in whether the instrument is or isn't the same. Note also that I don't mean literally "the same instrument" but rather "the same acoustic flavor just different pitches."
Task Formulation
What you need is a Similarity Metric (a type of Distance Metric) that scores two samples of the same instrument / instrument type as very similar (low score) and two samples of different instruments as quite different (high score), and that holds regardless of which note is being played. So it should be sensitive to timbre and insensitive to musical content.
Learning setup
The task can be referred to as Similarity Learning. A popular and effective approach for neural networks is Triplet Loss. Here is a blog-post introducing the concept in the context of image similarity. It has been applied successfully to audio before.
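For illustration, a minimal triplet-loss sketch in PyTorch (the random tensors stand in for embeddings produced by your network; batch and embedding sizes are illustrative):

import torch
import torch.nn as nn

loss_fn = nn.TripletMarginLoss(margin=1.0, p=2)

anchor   = torch.randn(32, 128, requires_grad=True)  # embedding of a clip
positive = torch.randn(32, 128, requires_grad=True)  # same instrument, different note
negative = torch.randn(32, 128, requires_grad=True)  # different instrument

# pushes d(anchor, positive) below d(anchor, negative) by at least the margin
loss = loss_fn(anchor, positive, negative)
loss.backward()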
Model architecture
The primary model architecture I would consider is a Convolutional Neural Network on log-mel spectrograms. Try first to use a generic pretrained model like OpenL3 as a feature extractor. It produces a fixed-length vector called an Audio Embedding, on top of which you can train a triplet-loss model.
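A sketch of that feature-extraction step using the openl3 Python package (the call signature and the embedding_size option are as I recall them from its docs; double-check the current documentation, and note the embedding dimensionality depends on configuration):

import soundfile as sf
import openl3

audio, sr = sf.read("clip.wav")  # placeholder file
emb, timestamps = openl3.get_audio_embedding(audio, sr, content_type="music",
                                             embedding_size=512)
print(emb.shape)  # (n_frames, 512): one embedding per analysis window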
Datasets
The key to success for your application will be having a suitable dataset. You might be able to use the NSynth dataset. Maybe training on that alone can give OK performance, or you may be able to use it as a training set and then fine-tune on your own data.
You will at a minimum need to create a validation/test set from your own audio clips in order to evaluate the performance of the model, with at least some 10-100 labeled examples of each instrument type of interest.

Convolutional Neural Network for time-dependent features

I need to do dimensionality reduction on a series of images. More specifically, each image is a snapshot of a moving ball, and the optimal features would be its position and velocity. As far as I know, CNNs are the state of the art for reducing features for image classification, but in that case only a single frame is provided. Is it possible to also extract time-dependent features given many images at different time steps? Otherwise, what are the state-of-the-art techniques for doing so?
This is the first time I've used a CNN, and I would also appreciate any reference or other suggestion.
If you want to be able to have the network somehow recognize a progression which is time dependent, you should probably look into recurrent neural nets (RNN). Since you would be operating on video, you should look into recurrent convolutional neural nets (RCNN) such as in: http://jmlr.org/proceedings/papers/v32/pinheiro14.pdf
Recurrence adds some memory of a previous state of the input data. See this good explanation by Karpathy: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
In your case you need to have the recurrence across multiple images instead of just within one image. It would seem like the first problem you need to solve is the image segmentation problem (being able to pick the ball out of the rest of the image) and the first paper linked above deals with segmentation. (then again, maybe you're trying to take advantage of the movement in order to identify the moving object?)
Here's another thought: perhaps you could only look at differences between sequential frames and use that as your input data to your convnet? The input "image" would then show where the moving object was in the previous frame and where it is in the current one. Larger differences would indicate larger amounts of movement. That would probably have a similar effect to using a recurrent network.
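For illustration, a minimal NumPy sketch of that frame-differencing idea ("frames" is a hypothetical (T, H, W) array of grayscale video frames):

import numpy as np

frames = np.random.rand(30, 64, 64)             # placeholder video: 30 frames
diffs = np.diff(frames, axis=0)                 # (29, H, W): frame t+1 minus frame t
# stack the current frame and the difference as two input channels for the convnet
inputs = np.stack([frames[1:], diffs], axis=1)  # (29, 2, H, W): image + motion channels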

Why isn't there a simple function to reduce background noise of an audio signal in Matlab?

Is this because it's a complex problem? I mean, too broad, so that no simple/generic solution exists?
I ask because almost every piece of signal-processing software (Avisoft, GoldWave, Audacity…) has a function that reduces the background noise of a signal, usually based on the FFT. But I can't find an already-implemented function in Matlab that does the same. Is the right way then to implement it manually?
Thanks.
The common audio noise-reduction approaches built into things like Audacity are based on spectral subtraction, which estimates the level of steady background noise in the Fourier-transform magnitude domain, then removes that much energy from every frame, leaving energy only where the signal "pokes above" this noise floor.
You can find many implementations of spectral subtraction for Matlab; this one is highly rated on Matlab File Exchange:
http://www.mathworks.com/matlabcentral/fileexchange/7675-boll-spectral-subtraction
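For illustration, a rough Python/NumPy sketch of the same idea (Matlab's stft/istft are analogous; the parameters and the assumption that the first half-second is noise-only are illustrative, not a definitive implementation):

import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(x, fs, noise_seconds=0.5, nperseg=512):
    f, t, Z = stft(x, fs, nperseg=nperseg)
    mag, phase = np.abs(Z), np.angle(Z)
    hop = nperseg // 2                              # default 50% overlap
    # estimate the noise floor from a stretch assumed to contain only noise
    n_noise = max(1, int(noise_seconds * fs / hop))
    noise_floor = mag[:, :n_noise].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - noise_floor, 0.0)  # subtract, clamp at zero
    _, y = istft(clean_mag * np.exp(1j * phase), fs, nperseg=nperseg)
    return y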
The question is, what kind of noise reduction are you looking for? There is no one solution that fits all needs. Here are a few approaches:
Low-pass filtering the signal reduces noise but also removes the high-frequency components of the signal. For some applications this is perfectly acceptable. There are lots of low-pass filter functions and Matlab helps you apply plenty of them. Some knowledge of how digital filters work is required. I'm not going into it here; if you want more details consider asking a more focused question.
An approach suitable for many situations is using a noise gate: simply attenuate the signal whenever its RMS level goes below a certain threshold, for instance. In other words, this kills quiet parts of the audio dead. You'll retain the noise in the more active parts of the signal, though, and if you have a lot of dynamics in the actual signal you'll get rid of some signal, too. This tends to work well for, say, slightly noisy speech samples, but not so well for very noisy recordings of classical music. I don't know whether Matlab has a function for this.
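A bare-bones sketch of such a gate in Python (easy to port to Matlab; the frame size and threshold are illustrative, and x is assumed to be a float signal in [-1, 1]):

import numpy as np

def noise_gate(x, frame=1024, threshold=0.01, attenuation=0.0):
    y = x.copy()
    for i in range(0, len(x), frame):
        seg = x[i:i + frame]
        if np.sqrt(np.mean(seg ** 2)) < threshold:  # RMS below threshold?
            y[i:i + frame] = seg * attenuation      # kill (or attenuate) the frame
    return y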
Some approaches involve making a "fingerprint" of the noise and then removing that throughout the signal. It tends to make the result sound strange, though, and in any case this is probably sufficiently complex and domain-specific that it belongs in an audio-specific tool and not in a rather general math/DSP system.
Reducing noise requires making some assumptions about the type of noise and the type of signal, and how they are different. Audio processors typically assume (correctly or incorrectly) something like that the audio is speech or music, and that the noise is typical recording session background hiss, A/C power hum, or vinyl record pops.
Matlab is for general use (microwave radio, data comm, subsonic earthquakes, heartbeats, etc.), and thus can make no such assumptions.
Matlab is not exactly an audio processor; you have to implement your own filter, and you will have to design it correctly, according to what you want.

Convolutional Neural Network (CNN) for Audio [closed]

I have been following the tutorials on DeepLearning.net to learn how to implement a convolutional neural network that extracts features from images. The tutorials are well explained, easy to understand and follow.
I want to extend the same CNN to extract multi-modal features from videos (images + audio) at the same time.
I understand that video input is nothing but a sequence of images (pixel intensities) displayed over a period of time (e.g. 30 FPS) associated with audio. However, I don't really understand what audio is, how it works, or how it is broken down to be fed into the network.
I have read a couple of papers on the subject (multi-modal feature extraction/representation), but none have explained how audio is input to the network.
Moreover, I understand from my studies that multi-modal representation is the way our brains really work, as we don't deliberately filter out our senses to achieve understanding; it all happens simultaneously, without us knowing about it, through joint representation. A simple example: if we hear a lion roar we instantly compose a mental image of a lion, feel danger, and vice-versa. Multiple neural patterns are fired in our brains to achieve a comprehensive understanding of what a lion looks like, sounds like, feels like, smells like, etc.
The above mentioned is my ultimate goal, but for the time being I'm breaking down my problem for the sake of simplicity.
I would really appreciate it if anyone could shed light on how audio is dissected and then later represented in a convolutional neural network. I would also appreciate your thoughts regarding multi-modal synchronisation, joint representations, and the proper way to train a CNN with multi-modal data.
EDIT:
I have found out that audio can be represented as spectrograms. A spectrogram is a common format for audio: a graph with two geometric dimensions, where the horizontal axis represents time and the vertical axis frequency.
Is it possible to use the same technique with images on these spectrograms? In other words can I simply use these spectrograms as input images for my convolutional neural network?
We used deep convolutional networks on spectrograms for a spoken language identification task. We had around 95% accuracy on a dataset provided in this TopCoder contest. The details are here.
Plain convolutional networks do not capture temporal characteristics, so, for example, in this work the output of the convolutional network was fed to a time-delay neural network. But our experiments show that even without additional elements, convolutional networks can perform well on at least some tasks when the inputs have similar sizes.
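For illustration, a minimal sketch of turning a clip into a spectrogram "image" for a CNN, using librosa (the file name and parameters are placeholders):

import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=16000)
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=160,
                                   n_mels=64)
img = librosa.power_to_db(S)                    # (64, n_frames) log-mel "image"
img = (img - img.mean()) / (img.std() + 1e-8)   # normalize like any image input
img = img[np.newaxis, ...]                      # add a channel axis: (1, 64, n_frames)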
There are many techniques for extracting feature vectors from audio data in order to train classifiers. The most commonly used is MFCC (Mel-frequency cepstrum), which you can think of as an "improved" spectrogram that retains more of the information relevant to discriminating between classes. Another commonly used technique is PLP (Perceptual Linear Prediction), which also gives good results, and there are many other, less well-known ones.
More recently, deep networks have been used to extract feature vectors by themselves, more similarly to the way we do it in image recognition; this is an active area of research. Not long ago we also used feature extractors to train classifiers for images (SIFT, HOG, etc.), but these were replaced by deep learning techniques, which take raw images as inputs and extract feature vectors by themselves (indeed, that's what deep learning is really all about).
It's also very important to note that audio data is sequential. After training a classifier, you need to train a sequential model such as an HMM or CRF, which chooses the most likely sequence of speech units using the probabilities given by your classifier as input.
A good starting point for learning speech recognition is Jurafsky and Martin: Speech and Language Processing. It explains all these concepts very well.
[EDIT: adding some potentially useful information]
There are many speech recognition toolkits with modules to extract MFCC feature vectors from audio files, but using them for this purpose is not always straightforward. I'm currently using CMU Sphinx4. It has a class named FeatureFileDumper that can be used standalone to generate MFCC vectors from audio files.

C/C++/Obj-C Real-time algorithm to ascertain Note (not Pitch) from Vocal Input

I want to detect not the pitch, but the pitch class of a sung note.
So, whether it is C4 or C5 is not important: they must both be detected as C.
Imagine the 12 semitones arranged on a clock face, with the needle pointing to the pitch class. That's what I'm after! Ideally I would like to be able to tell whether the sung note is spot-on or slightly off.
This is not a duplicate of previously asked questions, as it introduces the constraints that:
the sound source is a single human voice, hopefully with negligible background interference (although I may need to deal with this)
the octave is not important, only the pitch class
EDIT -- Links:
Real time pitch detection
Using the Apple FFT and Accelerate Framework
See my answer here for getting smooth FREQUENCY detection: https://stackoverflow.com/a/11042551/1457445
As far as snapping this frequency to the nearest note -- here is a method I created for my tuner app:
- (int) snapFreqToMIDI: (float) frequency {
    // 12*log2(f/referenceA) gives semitones above the reference pitch;
    // +57 offsets into this app's MIDI numbering, +0.5 rounds to nearest int
    int midiNote = (12 * (log10(frequency / referenceA) / log10(2)) + 57) + 0.5;
    return midiNote;
}
This will return the MIDI note value (http://www.phys.unsw.edu.au/jw/notes.html)
In order to get a string from this MIDI note value:
- (NSString*) midiToString: (int) midiNote {
    // midiNote % 12 folds any octave down to one of the 12 pitch classes
    NSArray *noteStrings = [[NSArray alloc] initWithObjects:@"C", @"C#", @"D", @"D#", @"E", @"F", @"F#", @"G", @"G#", @"A", @"A#", @"B", nil];
    return [noteStrings objectAtIndex:midiNote % 12];
}
For an example implementation of the pitch detection with output smoothing, look at musicianskit.com/developer.php
Pitch is a human psycho-perceptual phenomenon. Peak frequency content is not the same as either pitch or pitch class. FFT and DFT methods will not directly provide pitch, only frequency. Neither will zero-crossing measurements work well for human voice sources. Try AMDF, ASDF, autocorrelation, or cepstral methods. There are also plenty of academic papers on the subject of pitch estimation.
There is another long list of pitch estimation algorithms here.
Edited addition: Apple's SpeakHere and aurioTouch sample apps (available from their iOS dev center) contain example source code for getting PCM sample blocks from the iPhone's mic.
Most of the frequency detection algorithms cited in other answers don't work well for voice. To see why this is so intuitively, consider that all the vowels in a language can be sung at one particular note. Even though all those vowels have very different frequency content, they would all have to be detected as the same note. Any note detection algorithm for voices must take this into account somehow. Furthermore, human speech and song contains many fricatives, many of which have no implicit pitch in them.
In the generic (non-voice) case, the feature you are looking for is called the chroma feature, and there is a fairly large body of work on the subject. It is equivalently known as the harmonic pitch class profile. The original reference paper on the concept is Takuya Fujishima's "Realtime Chord Recognition of Musical Sound: A System Using Common Lisp Music". The Wikipedia entry has an overview of a more modern variant of the algorithm. There are a bunch of free papers and MATLAB implementations of chroma feature detection.
However, since you are focusing on the human voice only, and since the human voice naturally contains tons of overtones, what you are practically looking for in this specific scenario is a fundamental frequency detection algorithm, or f0 detection algorithm. There are several such algorithms explicitly tuned for voice. Also, here is a widely cited algorithm that works on multiple voices at once. You'd then check the detected frequency against the equal-tempered scale and then find the closest match.
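As a sketch of that last matching step, assuming A4 = 440 Hz as the reference:

import math

NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(f0, a4=440.0):
    midi = 69 + 12 * math.log2(f0 / a4)   # fractional MIDI note number
    nearest = round(midi)
    cents = 100 * (midi - nearest)        # how far off the singer is
    return NAMES[nearest % 12], cents

print(pitch_class(262.0))  # ('C', ~+2.4): just sharp of C4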
Since I suspect that you're trying to build a pitch detector and/or corrector a la Autotune, you may want to use M. Morise's excellent WORLD implementation, which permits fast and good quality detection and modification of f0 on voice streams.
Lastly, be aware that there are only a few vocal pitch detectors that work well within the vocal fry register. Almost all of them, including WORLD, fail on vocal fry as well as very low voices. A number of papers refer to vocal fry as "creaky voice" and have developed specific algorithms to help with that type of voice input specifically.
If you are looking for the pitch class you should have a look at the chromagram (http://labrosa.ee.columbia.edu/matlab/chroma-ansyn/).
You can also simply detect the f0 (using something like the YIN algorithm) and return the appropriate semitone. Most fundamental frequency estimation algorithms suffer from octave errors, but since you only care about the pitch class, those are harmless here.
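A compact sketch of both suggestions using librosa (yin requires librosa >= 0.8; the file name and frequency range are placeholders):

import numpy as np
import librosa

y, sr = librosa.load("voice.wav")

# chromagram: energy folded into the 12 pitch classes, per frame
chroma = librosa.feature.chroma_stft(y=y, sr=sr)   # (12, n_frames), row 0 = C
print(np.argmax(chroma.mean(axis=1)))              # dominant pitch class

# f0 with YIN; octave errors vanish once you reduce modulo 12 semitones
f0 = librosa.yin(y, fmin=80, fmax=1000, sr=sr)     # per-frame f0 in Hz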
Perform a Discrete Fourier Transform on samples from your input waveform, then sum values that correspond to equivalent notes in different octaves. Take the largest value as the dominant frequency.
You can likely find some existing DFT code in Objective C that suits your needs.
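For illustration, a minimal NumPy sketch of this octave-folding idea (the Hann window and the A = 440 Hz reference are assumptions):

import numpy as np

def dominant_pitch_class(x, fs, a4=440.0):
    mag = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    scores = np.zeros(12)
    for f, m in zip(freqs[1:], mag[1:]):               # skip the DC bin
        semitone = int(round(12 * np.log2(f / a4))) % 12
        scores[semitone] += m                          # fold octaves together
    return (int(scores.argmax()) + 9) % 12             # shift so 0 = C, 9 = A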
Putting up information as I find it...
Pitch detection algorithm on Wikipedia is a good place to start. It lists a few methods that fail for determining octave, which is okay for my purpose.
A good explanation of autocorrelation can be found here (why can't Wikipedia put things simply like that??).
Finally I have closure on this one, thanks to this article from DSP Dimension
The article contains source code.
Basically he performs an FFT, then explains that frequencies that don't coincide spot-on with the centre of the bin they fall in will smear over nearby bins in a sort of bell-shaped curve, and he explains how to extract the exact frequency from this data in a second pass (the FFT being the first pass).
The article then goes further into pitch shifting; I can simply delete that code.
Note that they supply a commercial library that does the same thing (and far more), only super optimised. There is a free version of the library that would probably do everything I need, although since I have already worked through the iOS audio subsystem, I might as well just implement it myself.
For the record, I found an alternative way to extract the exact frequency, by approximating a quadratic curve over the bin and its two neighbours, here. I have no idea what the relative accuracy of these two approaches is.
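A sketch of that quadratic-curve trick in Python (mag is an FFT magnitude array and k the index of a local peak; applying it to log magnitudes instead is also common):

import numpy as np

def refine_peak(mag, k, fs, n_fft):
    # fit a parabola through the peak bin and its two neighbours
    a, b, c = mag[k - 1], mag[k], mag[k + 1]
    p = 0.5 * (a - c) / (a - 2 * b + c)   # peak offset in bins, in (-0.5, 0.5)
    return (k + p) * fs / n_fft           # refined frequency in Hz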
As others have mentioned you should use a pitch detection algorithm. Since that ground is well-covered I will address a few particulars of your question. You said that you are looking for the pitch class of the note. However, the way to find this is to calculate the frequency of the note and then use a table to convert it to the pitch class, octave, and cents. I don't know of any way to obtain the pitch class without finding the fundamental frequency.
You will need a real-time pitch detection algorithm. In evaluating algorithms pay attention to the latency implied by each algorithm, compared with the accuracy you desire. Although some algorithms are better than others, fundamentally you must trade one for the other and cannot know both with certainty -- sort of like the Heisenberg uncertainty principle. (How can you know the note is C4 when only a fraction of a cycle has been heard?)
Your "smoothing" approach is equivalent to a digital filter, which will alter the frequency characteristics of the voice. In short, it may interfere with your attempts to estimate the pitch. If you have an interest in digital audio, digital filters are fundamental and useful tools in that field, and a fascinating subject besides. It helps to have a strong math background in understanding them, but you don't necessarily need that to get the basic idea.
Also, your zero-crossing method is a basic technique for estimating the period of a waveform and thus the pitch. It can be done this way, but only with a lot of heuristics and fine-tuning. (Essentially, develop a number of "candidate" pitches and try to infer the dominant one. A lot of special cases will emerge that will confuse this; a quick one is the letter 's'.) You'll find it much easier to begin with a frequency-domain pitch detection algorithm.
If you're a beginner, this may be very helpful. It is available for both Java and iOS.
dywapitchtrack for iOS
dywapitchtrack for Java