Detect if two audio files are generated by the same instrument - classification

What I'm trying to do is detect, within a small set of audio samples, whether any were generated by the same instrument. If so, those are considered duplicates and filtered out.
Listen to this file of ten concatenated samples. You can hear that the first five are all generated by the same instrument (an electric piano) so four of them are to be deemed duplicates.
What algorithm or method can I use to solve this problem? Note that I don't need full-fledged instrument detection as I'm only interested in whether the instrument is or isn't the same. Note also that I don't mean literally "the same instrument" but rather "the same acoustic flavor just different pitches."

Task Formulation
What you need is a Similarity Metric (a type of Distance Metric) that scores two samples of the same instrument / instrument type as very similar (low distance) and two samples of different instruments as quite different (high distance), and that does so regardless of which note is being played. In other words, it should be sensitive to timbre and insensitive to musical content.
Learning setup
The task can be referred to as Similarity Learning. A popular and effective approach for neural networks is Triplet Loss. Here is a blog-post introducing the concept in the context of image similarity. It has been applied successfully to audio before.
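A minimal sketch of the triplet-loss setup (the 512-dim input features and the tiny embedding head are placeholders for whatever encoder you actually use, e.g. a CNN on log-mel spectrograms or a head on top of OpenL3 features):

```python
import torch
import torch.nn as nn

# Anchor and positive are clips from the same instrument; negative is a clip
# from a different instrument. Dimensions and layers are illustrative only.
embedder = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 64))
criterion = nn.TripletMarginLoss(margin=0.5)

anchor   = torch.randn(32, 512)  # batch of feature vectors, same instrument as `positive`
positive = torch.randn(32, 512)
negative = torch.randn(32, 512)  # different instrument

loss = criterion(embedder(anchor), embedder(positive), embedder(negative))
loss.backward()
```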
Model architecture
The primary model architecture I would consider is a Convolutional Neural Network on log-mel spectrograms. First try a generic pretrained model such as OpenL3 as a feature extractor. It produces a fixed-size embedding vector (an Audio Embedding), on top of which you can train a triplet-loss model.
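A rough sketch of that pipeline, assuming the openl3 and soundfile packages are installed; the argument names follow the openl3 API but should be checked against the version you install:

```python
import numpy as np
import openl3
import soundfile as sf

def clip_embedding(path):
    """Average the per-frame OpenL3 embeddings into one vector per clip."""
    audio, sr = sf.read(path)
    emb, _ = openl3.get_audio_embedding(audio, sr, content_type="music",
                                        embedding_size=512)
    return emb.mean(axis=0)

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Clips with a small distance are candidates for "same instrument";
# the threshold has to be tuned on your labeled validation set.
d = cosine_distance(clip_embedding("sample1.wav"), clip_embedding("sample2.wav"))
```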
Datasets
The key to success for your application will be a suitable dataset. You might be able to utilize the NSynth dataset. Training on that alone may give OK performance, or you may use it for pre-training and then fine-tune on your own data.
You will at a minimum need to create a validation/test set from your own audio clips, in order to evaluate the performance of the model: at least some 10-100 labeled examples of each instrument type of interest.

Related

How to decide which Convolutional Neural Network architecture will work for my own dataset?

I have a dataset of chocolate images. I need to detect whether a chocolate has scratches or not, and I am planning to do this with a Convolutional Neural Network using Caffe. But how do I decide which network architecture will suit my dataset?
Also, how can I generate heat maps showing where the scratches are in an image?
I have tried standard image-processing algorithms and they did not work.
(Attached: an example abnormal image and an example normal image.)
Based on the little info you provide, the network architecture choice should be the last of your concerns. Also "trying normal image processing algorithms" is quite a vague statement.
A few points to consider
How big is the dataset? Are the chocolate photos taken in a controlled setting where they are always similar to your example photos or are they taken in the wild, i.e. where they could have different lighting conditions, positions, etc.? Is the dataset balanced?
How is the dataset labelled? Is it just a class for the whole image specifying normal vs. abnormal? If so, you'd just be doing classification, and one way to visualise the location of the scratches (if they turn out to be the most prominent feature for the classification) is to use gradient-weighted class activation maps (Grad-CAM; see the sketch after these points). On the other hand, if your dataset has labelled scratch locations on the images, then you can directly train your network to output heatmaps.
Once your dataset is properly set up with a training and validation set, you can start with a simple, small baseline convolutional network architecture, and then try out different and bigger architectures like VGG16, ResNet, etc., and check whether they improve performance on your validation set.
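As a hedged sketch of the Grad-CAM idea mentioned above: ResNet-18 is only an example backbone, and the two-class setup (normal/abnormal) is assumed; the hook-based recipe works for any CNN whose last convolutional block you can grab a handle to.

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(num_classes=2)
model.eval()

store = {}
model.layer4.register_forward_hook(lambda m, i, o: store.update(acts=o.detach()))
model.layer4.register_full_backward_hook(lambda m, gi, go: store.update(grads=go[0].detach()))

def grad_cam(image, class_idx):
    """image: (1, 3, H, W) float tensor; returns an (H, W) heatmap scaled to [0, 1]."""
    logits = model(image)
    model.zero_grad()
    logits[0, class_idx].backward()
    weights = store["grads"].mean(dim=(2, 3), keepdim=True)           # per-channel importance
    cam = F.relu((weights * store["acts"]).sum(dim=1, keepdim=True))  # weighted activation map
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    return ((cam - cam.min()) / (cam.max() - cam.min() + 1e-8))[0, 0]

heatmap = grad_cam(torch.randn(1, 3, 224, 224), class_idx=1)  # assuming class 1 = "abnormal"
```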

Which input format is the best for sound recognition in recurrent neural networks?

I want to create sound or pitch recognition with a recurrent deep neural network, and I'm wondering which input will give the best results.
Should I feed the network raw amplitudes or the FFT (Fast Fourier Transform) result?
Is there any other format that is known to produce good results and fast learning?
While MFCCs have indeed been used in music information retrieval research (for genre classification, etc.), in this case (pitch detection) you may want to use a semitone filterbank or a constant-Q transform as a first information-reduction step. These transformations are better matched to musical pitch.
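For illustration, a minimal front-end comparison assuming librosa is available (the file name and parameter values are placeholders):

```python
import numpy as np
import librosa

# MFCCs vs. a constant-Q transform; the CQT uses a log-frequency axis with
# semitone spacing, which lines up with musical pitch.
y, sr = librosa.load("example.wav", sr=None)

mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)                  # (20, num_frames)
cqt = np.abs(librosa.cqt(y, sr=sr, n_bins=84, bins_per_octave=12))   # 7 octaves, 1 bin per semitone
log_cqt = librosa.amplitude_to_db(cqt, ref=np.max)                   # log-magnitude input representation
```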
But I think it's also worth trying to use the audio samples directly with RNNs, provided you have a huge number of samples. In theory, the RNNs should be able to learn the wave patterns corresponding to particular pitches.
From your description, it's not entirely clear what type of "pitch recognition" you're aiming for: monophonic instruments (constant timbre, and only one pitch sounding at a time)? Polyphonic (constant timbre, but multiple pitches may be sounding simultaneously)? Multiple instruments playing together (multiple timbres, multiple pitches)? Or even a full mix with both tonal and percussive sounds? The difficulty of these use cases roughly increases in the order I mentioned them, so you may want to start with monophonic pitch recognition first.
To obtain the necessary amount of training examples, you could use a physical model or a multi-sampled virtual instrument to generate the audio samples for particular pitches in a controlled way. This way, you can quickly create your training material instead of recording it and labeling it manually. But I would advise you to at least add some background noise (random noise, or very low-level sounds from different recordings) to the created audio samples, or your data may be too artificial and lead to a model that doesn't work well once you want to use it in practice.
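A minimal sketch of the noise-injection step in plain NumPy (the SNR value is arbitrary and would need tuning for your material):

```python
import numpy as np

def add_background_noise(clean, snr_db=30.0, rng=None):
    """Mix white noise into a synthetic sample at a given signal-to-noise ratio (in dB)."""
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal(len(clean))
    signal_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```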
Here is a paper that might give you some ideas on the subject:
An End-to-End Neural Network for Polyphonic Piano Music Transcription
(Siddharth Sigtia, Emmanouil Benetos, and Simon Dixon)
https://arxiv.org/pdf/1508.01774.pdf
The Mel-frequency cepstrum is generally used for speech recognition.
Mozilla DeepSpeech uses MFCCs as input to its DNN.
For a Python implementation you can use the python_speech_features library.
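A short sketch of MFCC extraction with python_speech_features (note the underscores in the import name; the file name is a placeholder and the frame parameters are the library defaults):

```python
import scipy.io.wavfile as wav
from python_speech_features import mfcc

rate, signal = wav.read("speech.wav")
features = mfcc(signal, samplerate=rate, numcep=13)  # shape: (num_frames, 13)
```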

Use a trained neural network to imitate its training data

I'm in the early stages of designing a prose-imitation system. It will read a bunch of prose, then mimic it. It's mostly for fun, so the mimicking prose doesn't need to make too much sense, but I'd like to make it as good as I can with a minimal amount of effort.
My first idea is to use my example prose to train a classifying feed-forward neural network, which classifies its input as either part of the training data or not. Then I'd like to somehow invert the neural network, finding new random inputs that also get classified by the trained network as being part of the training data. The obvious and naive way of doing this is to randomly generate word lists and only output the ones that get classified above a certain threshold, but I think there is a better way, using the network itself to limit the search to certain regions of the input space. For example, maybe you could start with a random vector and use gradient-based optimisation to find a local maximum around the random starting point. Is there a word for this kind of imitation process? What are some of the known methods?
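For concreteness, a rough sketch of that last idea; the classifier and its fixed-size input representation are placeholders I would still have to design:

```python
import torch

def invert(classifier, input_dim, steps=200, lr=0.1):
    """Gradient ascent on the classifier's 'looks like my corpus' score,
    starting from a random input vector. `classifier` is a placeholder
    mapping a fixed-size vector to a scalar score."""
    x = torch.randn(1, input_dim, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        score = classifier(x)
        (-score.mean()).backward()  # minimising -score == maximising the score
        opt.step()
    return x.detach()
```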
How about Generative Adversarial Networks (GAN, Goodfellow 2014) and their more advanced siblings like Deep Convolutional Generative Adversarial Networks? There are plenty of proper research articles out there, and also more gentle introductions like this one on DCGAN and this on GAN. To quote the latter:
GANs are an interesting idea that were first introduced in 2014 by a group of researchers at the University of Montreal lead by Ian Goodfellow (now at OpenAI). The main idea behind a GAN is to have two competing neural network models. One takes noise as input and generates samples (and so is called the generator). The other model (called the discriminator) receives samples from both the generator and the training data, and has to be able to distinguish between the two sources. These two networks play a continuous game, where the generator is learning to produce more and more realistic samples, and the discriminator is learning to get better and better at distinguishing generated data from real data. These two networks are trained simultaneously, and the hope is that the competition will drive the generated samples to be indistinguishable from real data.
(DC)GAN should fit your task quite well.
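To make the generator/discriminator game concrete, here is a minimal, untuned training-loop sketch on toy 1-D data (real samples drawn from N(3, 1)); text generation with GANs needs extra machinery, e.g. working in a continuous embedding space, but the loop has the same shape:

```python
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 8, 1, 64

G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(5000):
    real = torch.randn(batch, data_dim) + 3.0   # stands in for the training data
    fake = G(torch.randn(batch, latent_dim))    # generator maps noise to samples

    # Discriminator: label real samples 1, generated samples 0.
    d_loss = bce(D(real), torch.ones(batch, 1)) + bce(D(fake.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator: try to make the discriminator call its samples "real".
    g_loss = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```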

Face Recognition based on Deep Learning (Siamese Architecture)

I want to use a pre-trained model for face identification. I am trying to use a Siamese architecture, which requires only a small number of images. Could you point me to any trained model that I can adapt into a Siamese architecture? How can I change the network so that I can feed in two images and get their similarity (I do not want to generate an image as in the tutorial here)? I want to use the system for a real-time application. Do you have any recommendations?
I suppose you can use the model described in Xiang Wu, Ran He, Zhenan Sun, Tieniu Tan, "A Light CNN for Deep Face Representation with Noisy Labels" (arXiv 2015) as a starting point for your experiments.
As for the Siamese network, what you are trying to learn is a mapping from a face image into some high-dimensional vector space, in which distances between points reflect (dis)similarity between faces.
To do so, you only need one network that takes a face as input and produces a high-dimensional vector as output.
However, to train this single network using the Siamese approach, you duplicate it: you create two instances of the same net (explicitly tying the weights of the two copies). During training you provide pairs of faces, one to each copy; a single loss layer on top of the two copies then compares the high-dimensional vectors representing the two faces and computes a loss according to a "same / not same" label associated with the pair.
Hence, you only need the duplication for training. At test time ('deploy') you have a single net providing you with a semantically meaningful high-dimensional representation of faces.
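A minimal PyTorch sketch of that setup; the tiny convolutional backbone and the contrastive loss are placeholders, and in practice you would plug in a pretrained face network such as the Light CNN above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One embedding network applied to both images; reusing the same module is
# exactly the weight sharing described above.
class EmbeddingNet(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, dim)

    def forward(self, x):
        return F.normalize(self.fc(self.conv(x).flatten(1)), dim=1)

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """same: 1.0 if the two faces show the same person, 0.0 otherwise."""
    d = F.pairwise_distance(emb_a, emb_b)
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()

net = EmbeddingNet()
img_a, img_b = torch.randn(8, 3, 112, 112), torch.randn(8, 3, 112, 112)
same = torch.randint(0, 2, (8,)).float()
loss = contrastive_loss(net(img_a), net(img_b), same)  # same `net` -> shared weights
loss.backward()
```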
For a more advanced Siamese architecture and loss, see this thread.
On the other hand, you might want to consider the approach described in Oren Tadmor, Yonatan Wexler, Tal Rosenwein, Shai Shalev-Shwartz, Amnon Shashua, "Learning a Metric Embedding for Face Recognition using the Multibatch Method" (arXiv 2016). This approach is more efficient and easier to implement than pair-wise losses over image pairs.

Convolutional Neural Network for time-dependent features

I need to do dimensionality reduction on a series of images. More specifically, each image is a snapshot of a moving ball, and the optimal features would be its position and velocity. As far as I know, CNNs are the state of the art for extracting features for image classification, but in that case only a single frame is provided. Is it possible to also extract time-dependent features given many images at different time steps? Otherwise, what are the state-of-the-art techniques for doing so?
This is the first time I am using a CNN, and I would also appreciate any references or other suggestions.
If you want to be able to have the network somehow recognize a progression which is time dependent, you should probably look into recurrent neural nets (RNN). Since you would be operating on video, you should look into recurrent convolutional neural nets (RCNN) such as in: http://jmlr.org/proceedings/papers/v32/pinheiro14.pdf
Recurrence adds some memory of a previous state of the input data. See this good explanation by Karpathy: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
In your case you need to have the recurrence across multiple images instead of just within one image. It would seem like the first problem you need to solve is the image segmentation problem (being able to pick the ball out of the rest of the image) and the first paper linked above deals with segmentation. (then again, maybe you're trying to take advantage of the movement in order to identify the moving object?)
Here's another thought: perhaps you could look only at the differences between sequential frames and use those as the input to your convnet. The input "image" would then show where the moving object was in the previous frame and where it is in the current one. Larger differences would indicate larger amounts of movement. That would probably have a similar effect to using a recurrent network.
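A minimal NumPy sketch of that frame-differencing idea; the two-channel layout (current frame plus difference) is just one reasonable choice of input:

```python
import numpy as np

def difference_input(frames):
    """frames: (T, H, W) grayscale video.
    Returns a (T-1, 2, H, W) array: current frame plus absolute inter-frame difference."""
    frames = frames.astype(np.float32)
    diffs = np.abs(np.diff(frames, axis=0))  # motion between consecutive frames
    current = frames[1:]
    return np.stack([current, diffs], axis=1)
```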