Convolutional Neural Network (CNN) for Audio [closed] - neural-network

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
I have been following the tutorials on DeepLearning.net to learn how to implement a convolutional neural network that extracts features from images. The tutorial are well explained, easy to understand and follow.
I want to extend the same CNN to extract multi-modal features from videos (images + audio) at the same time.
I understand that video input is nothing but a sequence of images (pixel intensities) displayed in a period of time (ex. 30 FPS) associated with audio. However, I don't really understand what audio is, how it works, or how it is broken down to be feed into the network.
I have read a couple of papers on the subject (multi-modal feature extraction/representation), but none have explained how audio is inputed to the network.
Moreover, I understand from my studies that multi-modality representation is the way our brains really work as we don't deliberately filter out our senses to achieve understanding. It all happens simultaneously without us knowing about it through (joint representation). A simple example would be, if we hear a lion roar we instantly compose a mental image of a lion, feel danger and vice-versa. Multiple neural patterns are fired in our brains to achieve a comprehensive understanding of what a lion looks like, sounds like, feels like, smells like, etc.
The above mentioned is my ultimate goal, but for the time being I'm breaking down my problem for the sake of simplicity.
I would really appreciate if anyone can shed light on how audio is dissected and then later on represented in a convolutional neural network. I would also appreciate your thoughts with regards to multi-modal synchronisation, joint representations, and what is the proper way to train a CNN with multi-modal data.
EDIT:
I have found out the audio can be represented as spectrograms. It as a common format for audio and is represented as a graph with two geometric dimensions where the horizontal line represents time and the vertical represents frequency.
Is it possible to use the same technique with images on these spectrograms? In other words can I simply use these spectrograms as input images for my convolutional neural network?

We used deep convolutional networks on spectrograms for a spoken language identification task. We had around 95% accuracy on a dataset provided in this TopCoder contest. The details are here.
Plain convolutional networks do not capture the temporal characteristics, so for example in this work the output of the convolutional network was fed to a time-delay neural network. But our experiments show that even without additional elements convolutional networks can perform well at least on some tasks when the inputs have similar sizes.

There are many techniques to extract feature vectors from audio data in order to train classifiers. The most commonly used is called MFCC (Mel-frequency cepstrum), which you can think of as a "improved" spectrogram, retaining more relevant information to discriminate between classes. Other commonly used technique is PLP (Perceptual Linear Predictive), which also gives good results. These are still many other less known.
More recently deep networks have been used to extract features vectors by themselves, thus more similarly the way we do in image recognition. This is a active area of research. Not long ago we also used feature extractors to train classifiers for images (SIFT, HOG, etc.), but these were replaced by deep learning techniques, which have raw images as inputs and extract feature vectors by themselves (indeed it's what deep learning is really all about).
It's also very important to notice that audio data is sequential. After training a classifier you need to train a sequential model as a HMM or CRF, which chooses the most likely sequences of speech units, using as input the probabilities given by your classifier.
A good starting point to learn speech recognition is Jursky and Martins: Speech and Language Processing. It explains very well all these concepts.
[EDIT: adding some potentially useful information]
There are many speech recognition toolkits with modules to extract MFCC feature vectors from audio files, but using than for this purpose is not always straightforward. I'm currently using CMU Sphinx4. It has a class named FeatureFileDumper, that can be used standalone to generate MFCC vectors from audio files.

Related

Which input format is the best for sound recognition in recurrent neural networks?

I want to create sound or pitch recognition with recurrent deep neural network. And I'm wondering with what input will I get best results.
Should I feed DNN with amplitudes or with FFT(Fast Fourier transform) result?
Is there any other format that is known to produce good results and fast learning?
While MFCCs have indeed been used in music information retrieval research (for genre classification etc...), in this case (pitch detection) you may want to use a semi-tone filterbank or constant Q transform as a first information reduction step. These transformations match better with musical pitch.
But I think it's also worth trying to use the audio samples directly with RNNs, in case you have a huge number of samples. In theory, the RNNs should be able to learn the wave patterns corresponding to particular pitches.
From your description, it's not entirely clear what type of "pitch recognition" you're aiming for: monophonic instruments (constant timbre, and only 1 pitch sounding at a time)? polyphonic (constant timbre, but multiple pitches may be sounding simultaneously)? multiple instruments playing together (multiple timbres, multiple pitches)? or even a full mix with both tonal and percussive sounds? The hardness of these use cases roughly increases in the order I mentioned them, so you may want to start with monophonic pitch recognition first.
To obtain the necessary amount of training examples, you could use a physical model or a multi-sampled virtual instrument to generate the audio samples for particular pitches in a controlled way. This way, you can quickly create your training material instead of recording it and labeling it manually. But I would advise you to at least add some background noise (random noise, or very low-level sounds from different recordings) to the created audio samples, or your data may be too artificial and lead to a model that doesn't work well once you want to use it in practice.
Here is a paper that might give you some ideas on the subject:
An End-to-End Neural Network for Polyphonic Piano Music Transcription
(Siddharth Sigtia, Emmanouil Benetos, and Simon Dixon)
https://arxiv.org/pdf/1508.01774.pdf
The Mel-frequency cepstrum is general used for speech recognition.
mozilla DeepSpeech is using MFCCs as input to their DNN.
For python implementation you can use python-speech-features lib.

Use a trained neural network to imitate its training data

I'm in the overtures of designing a prose imitation system. It will read a bunch of prose, then mimic it. It's mostly for fun so the mimicking prose doesn't need to make too much sense, but I'd like to make it as good as I can, with a minimal amount of effort.
My first idea is to use my example prose to train a classifying feed-forward neural network, which classifies its input as either part of the training data or not part. Then I'd like to somehow invert the neural network, finding new random inputs that also get classified by the trained network as being part of the training data. The obvious and stupid way of doing this is to randomly generate word lists and only output the ones that get classified above a certain threshold, but I think there is a better way, using the network itself to limit the search to certain regions of the input space. For example, maybe you could start with a random vector and do gradient descent optimisation to find a local maximum around the random starting point. Is there a word for this kind of imitation process? What are some of the known methods?
How about Generative Adversarial Networks (GAN, Goodfellow 2014) and their more advanced siblings like Deep Convolutional Generative Adversarial Networks? There are plenty of proper research articles out there, and also more gentle introductions like this one on DCGAN and this on GAN. To quote the latter:
GANs are an interesting idea that were first introduced in 2014 by a
group of researchers at the University of Montreal lead by Ian
Goodfellow (now at OpenAI). The main idea behind a GAN is to have two
competing neural network models. One takes noise as input and
generates samples (and so is called the generator). The other model
(called the discriminator) receives samples from both the generator
and the training data, and has to be able to distinguish between the
two sources. These two networks play a continuous game, where the
generator is learning to produce more and more realistic samples, and
the discriminator is learning to get better and better at
distinguishing generated data from real data. These two networks are
trained simultaneously, and the hope is that the competition will
drive the generated samples to be indistinguishable from real data.
(DC)GAN should fit your task quite well.

Why we need CNN for the Object Detection?

I want to ask one general question that nowadays Deep learning specially Convolutional Neural Network (CNN) has been used in every field. Sometimes it is not necessary to use CNN for the problem but the researchers are using and following the trend.
So for the Object Detection problem, is it a kind of problem where CNN is really needed to solve the detection problem?
That is unhappy question. In title you ask about CNN, but you ask about deep learning in general.
So we don't necessary need deep learning for object recognition. But trained deep networks gets better results. Companies like Google and others are thankful for every % of better results.
About CNN, they gets better results than "traditional" ANN and also have less parameters because of weights sharing. CNN also allow transfer learning(you take a feature detector- convolution and pooling layers and than you connect on feature detector yours full connected layers).
A key concept of CNN's is the idea of translational invariance. In short, using a convolutional kernel on an image allows the machine to learn a set of weights for a specific feature (an edge, or a much more detailed object, depending on the layering of the network) and apply it across the entire image.
Consider detecting a cat in an image. If we designed some set of weights that allowed the learner to recognize a cat, we would like those weights to be the same no matter where the cat is in the image! So we would "assign" a layer in the convolutional kernel to detecting cats, and then convolve over the entire image.
Whatever the reason for the recent successes of CNN's, it should be noted that regular fully-connected ANN's should perform just as well. The problem is that they quickly become computationally infeasible on larger images, whereas CNN's are much more efficient due to parameter sharing.

How do neural networks handle large images where the area of interest is small?

If I've understood correctly, when training neural networks to recognize objects in images it's common to map single pixel to a single input layer node. However, sometimes we might have a large picture with only a small area of interest. For example, if we're training a neural net to recognize traffic signs, we might have images where the traffic sign covers only a small portion of it, while the rest is taken by the road, trees, sky etc. Creating a neural net which tries to find a traffic sign from every position seems extremely expensive.
My question is, are there any specific strategies to handle these sort of situations with neural networks, apart from preprocessing the image?
Thanks.
Using 1 pixel per input node is usually not done. What enters your network is the feature vector and as such you should input actual features, not raw data. Inputing raw data (with all its noise) will not only lead to bad classification but training will take longer than necessary.
In short: preprocessing is unavoidable. You need a more abstract representation of your data. There are hundreds of ways to deal with the problem you're asking. Let me give you some popular approaches.
1) Image proccessing to find regions of interest. When detecting traffic signs a common strategy is to use edge detection (i.e. convolution with some filter), apply some heuristics, use a threshold filter and isolate regions of interest (blobs, strongly connected components etc) which are taken as input to the network.
2) Applying features without any prior knowledge or image processing. Viola/Jones use a specific image representation, from which they can compute features in a very fast way. Their framework has been shown to work in real-time. (I know their original work doesn't state NNs but I applied their features to Multilayer Perceptrons in my thesis, so you can use it with any classifier, really.)
3) Deep Learning.
Learning better representations of the data can be incorporated into the neural network itself. These approaches are amongst the most popular researched atm. Since this is a very large topic, I can only give you some keywords so that you can research it on your own. Autoencoders are networks that learn efficient representations. It is possible to use them with conventional ANNs. Convolutional Neural Networks seem a bit sophisticated at first sight but they are worth checking out. Before the actual classification of a neural network, they have alternating layers of subwindow convolution (edge detection) and resampling. CNNs are currently able to achieve some of the best results in OCR.
In every scenario you have to ask yourself: Am I 1) giving my ANN a representation that has all the data it needs to do the job (a representation that is not too abstract) and 2) keeping too much noise away (and thus staying abstract enough).
We usually dont use fully connected network to deal with image because the number of units in the input layer will be huge. In neural network, we have specific neural network to deal with image which is Convolutional neural network(CNN).
However, CNN plays a role of feature extractor. The encoded feature will finally feed into a fully connected network which act as a classifier. In your case, I dont know how small your object is compare to the full image. But if the interested object is really small, even use CNN, the performance for image classification wont be very good. Then we probably need to use object detection(which used sliding window) to deal with it.
If you want recognize small objects on large sized image, you should use "scanning window".
For "scanning window" you can to apply dimention reducing methods:
DCT (http://en.wikipedia.org/wiki/Discrete_cosine_transform)
PCA (http://en.wikipedia.org/wiki/Principal_component_analysis)

What are interesting ideas for experimenting with Artificial Neural Networks? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
I'm after a list of possible neural network implementations that can be experimented with. Possibly something that could take an hour to a week to write.
What other possibilities are there?
Here's the list so far:
Games
tic-tac-toe
Connect 4
Chess
Go
Sudoku
paper/scissors/rock
horse racing predictor
Visual recognition
Character recognition (typefaces, letters, numbers, etc)
Facial recognition
Audio recognition
Language detection
Male vs female
Word recognition
Language detection (natural, programming)
Pathfinding
"Artificial neural network driven mobile robots learn how to drive on roads in simulation." http://cig.felk.cvut.cz/projects/robo/, http://www.youtube.com/watch?v=lmPJeKRs8gE
Some links to more:
http://www.cs.colostate.edu/~anderson/res/project-ideas.html
You can combine Genetic Algorithms and Neural Networks to evolve simple neural configurations, such as Neural Networks that perform logic operations (including the phantomatic XOR!).
This is a topic I very much like because - if you think about it - it's a bare bones model of how our brains evolved (I am not saying we have logic gates in our head).
It is simple enough - and should be good fun!
In a wider way, all that cover pattern recognition and signal processing could take great advantage of neural networks.
Also, you could use neural networks to develop "pseudo-AI" for games (strategy, soccer games).
Anyway, as neural network is a tool more than a "solution", it can be used in economics, physics, navigation, signal processing, etc.
Also, many types of neural networks exist (perceptron, hopfield), the thing is to use them wisely according to the problem.
Neural networks are not panacea, just a (very interesting and powerful) tool.
what about face recognition?
Here are some problems that I think feed forward neural nets (with multiple hidden layers) might be able to solve.
Given the number of packets sent/recieved on the network interface, the volume of ambient noise, and the level of ambient light, attempt to predict the time of day.
Given a latitude and longitude, attempt to predict the elevation, or crime rate.
Given some simple metrics about the keywords in the title of an article, predict how many upvotes it has.
Given the digits of a random phone number, predict where the line terminus is located.
This is more challenging: visualize (ie, plot) the decision boundary surface of a 2-layer neural network. (With 1 layer the boundary is linear, so it's easy).