I am new to deep learning and I would like to implement autoencoders for anomaly detection.
I have two questions:
Are the decoder layers always the mirrored version of the encoder layers?
Since the hidden layer Code (h) holds the compressed representation of the data, could it be used for classifying the input features?
To address your questions:
1- Decoder layers are not always a mirrored version of the encoder layers. You can look at Mask R-CNN, YOLO and similar architectures to see that the decoder may consist of only 1-2 layers while the encoder consists of many. However, based on my personal experience, I would definitely suggest implementing mirrored networks and feeding the decoder with features from the corresponding encoder layers (skip connections).
2- You can use the Code (h) part to do multiple things, including description (by feeding it to an RNN), classification (by appending a DNN), localization (by appending a localization network), etc.
Encoder layers are just feature extractors and Code (h) contains the extracted features. It is up to you to decide what to do with those features.
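As a minimal Keras sketch of that idea (the layer sizes, the flattened 784-dimensional input and the small binary classifier on top are illustrative assumptions, not part of the question):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

input_dim, code_dim = 784, 32          # illustrative sizes

# Encoder -> Code(h) -> mirrored decoder
inputs = layers.Input(shape=(input_dim,))
x = layers.Dense(128, activation="relu")(inputs)
code = layers.Dense(code_dim, activation="relu", name="code_h")(x)
x = layers.Dense(128, activation="relu")(code)
outputs = layers.Dense(input_dim, activation="sigmoid")(x)

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x_train, x_train, epochs=20)   # unsupervised reconstruction

# Reuse the trained encoder: Code(h) becomes the feature vector for a classifier.
encoder = Model(inputs, code)
clf_in = layers.Input(shape=(code_dim,))
clf_out = layers.Dense(1, activation="sigmoid")(clf_in)   # e.g. normal vs anomalous
classifier = Model(clf_in, clf_out)
classifier.compile(optimizer="adam", loss="binary_crossentropy")
# classifier.fit(encoder.predict(x_train), y_train, epochs=10)
```

The encoder is trained without labels on the reconstruction task; the small classifier then consumes the Code (h) features for whatever supervised task you have.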
I have a dataset of chocolates. I need to detect whether a chocolate has scratches or not. I am planning to do the detection with a Convolutional Neural Network using Caffe. But how do I decide which neural network architecture will suit my dataset?
Also, how can I generate heat values (a heat map) when there are scratches in an image?
I have tried standard image processing algorithms and they did not work.
Abnormal Image
Normal Image
Based on the little info you provide, the network architecture choice should be the last of your concerns. Also "trying normal image processing algorithms" is quite a vague statement.
A few points to consider:
How big is the dataset? Are the chocolate photos taken in a controlled setting where they are always similar to your example photos or are they taken in the wild, i.e. where they could have different lighting conditions, positions, etc.? Is the dataset balanced?
How is the dataset labelled? Is it just a class for the whole image specifying normal vs abnormal? If so, you'd just be doing classification, and one way to potentially visualise the location of the scratches (if they turn out to be the most prominent feature for the classification) is to use gradient-weighted class activation maps (Grad-CAM; see the sketch after these points). On the other hand, if your dataset has labelled scratch points over the images, then you can directly train your network to output heatmaps.
Once your dataset is properly set up with a training and validation set, you can start with a simple, small baseline convolutional network, and then try out different and bigger architectures like VGG16, ResNet, etc., and check whether they improve performance on your validation set.
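For the baseline, a minimal Keras sketch (the image size, filter counts and the binary normal/abnormal label are illustrative assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Small baseline CNN for binary normal-vs-abnormal classification.
inputs = layers.Input(shape=(128, 128, 3))
x = layers.Conv2D(16, 3, activation="relu")(inputs)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(32, 3, activation="relu")(x)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu", name="last_conv")(x)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)       # P(abnormal)

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=20)
```

And a rough Grad-CAM sketch to visualise which regions drive the "abnormal" prediction (the layer name refers to the toy model above; this is a simplified version of the published method, not the authors' code):

```python
import numpy as np

def grad_cam(model, image, last_conv_layer_name="last_conv"):
    """Return a heatmap in [0, 1] for a single float image of shape (H, W, 3)."""
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        score = preds[:, 0]                              # score of the "abnormal" unit
    grads = tape.gradient(score, conv_out)               # d(score)/d(conv activations)
    weights = tf.reduce_mean(grads, axis=(1, 2))         # average gradient per channel
    cam = tf.reduce_sum(weights[:, None, None, :] * conv_out, axis=-1)
    cam = tf.nn.relu(cam)[0]                             # keep positive evidence only
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```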
I started working recently on object-detection algorithms, and I usually encounter models that have a base network such as LeNet or PVA-Net and then a different architecture or model for detection. But I have never understood how these base and detection networks help, and how to choose a particular model as the base or detection network.
Assume that you are building a model for object detection.
A CNN object detection model (for simplicity, let's choose SSD) may consist of a base network which serves as the feature extractor, while the detection modules take the features extracted by the base network and generate the outputs, which contain the object classes and the coordinates of the detected objects (including the center (x, y), the height (h) and the width (w) of the predicted box).
For the base network, we usually take a pre-trained network such as ResNet, VGG, etc., already trained on a large dataset like ImageNet, with the hope that the base network produces a good set of features for the detection layers (or at least that we don't need to tune the parameters of the base network much during training, which helps the model converge sooner).
For the detection modules, it depends on what kind of method you want to use, for instance one-stage methods (SSD, RetinaNet, YOLO, and so on) or two-stage methods (Faster R-CNN, Mask R-CNN, etc.). There is a trade-off between accuracy and speed among those methods, which is an important factor in deciding which detection module you should pick.
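To make the split concrete, here is a schematic sketch of a base network plus an SSD-style detection head in Keras. It is only an illustration of the idea: a real SSD attaches heads to several feature maps and uses anchor boxes, and the class/anchor counts below are made up.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

num_classes, num_anchors = 20, 4      # illustrative values

# Base network: a pre-trained backbone used purely as a feature extractor.
backbone = tf.keras.applications.ResNet50(include_top=False,
                                           weights="imagenet",
                                           input_shape=(300, 300, 3))
features = backbone.output            # (batch, H', W', 2048) feature map

# Detection module: small convolutional heads on top of the extracted features.
# One head predicts class scores, the other predicts box offsets (x, y, w, h).
cls_head = layers.Conv2D(num_anchors * num_classes, 3, padding="same",
                         name="class_scores")(features)
box_head = layers.Conv2D(num_anchors * 4, 3, padding="same",
                         name="box_offsets")(features)

detector = Model(backbone.input, [cls_head, box_head])
detector.summary()
```

Swapping the backbone (ResNet vs. a lighter network) mostly trades accuracy for speed, while swapping the detection module changes the one-stage vs. two-stage behaviour described above.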
I'm learning about using neural networks and object detection, using Python and Keras. My goal is to detect something very specific in an image, let's say a very specific brand / type of car carburetor (part of a car engine).
The tutorials I found so far use the detection of cats and dogs as example, and many of those use a pre-trained VGG16 network to improve performance.
If I want to detect only my specific carburetor, and don't care about anything else in the image, does it make sense to use VGG16? Is VGG16 only useful when you want to detect many generic items, rather than one specific item?
Edit: I only want to know if there is a specific object (carburetor) in the image. No need to locate or put a box around it. I have about 1000 images of this specific carburetor for the network to train on.
VGG16 or some other pretrained neural network is primarily used for classification. That means you can use it to determine which category an image belongs to.
As I understand it, what you need is to detect where in an image a carburetor is located. For something like that you need a different, more complicated approach.
You could use
NVIDIA DetectNet,
YOLO,
Segmentation networks such as U-net or
SegNet, etc...
VGG16 can be used for that. (Now, is it the best? That is an open question without a clear answer.)
But you must replace its ending to fit your needs.
While a regular VGG model has about a thousand classes at its end, a cats vs. dogs VGG has its end changed to have two classes. In your case, you should change its ending to have only one class.
In Keras, you'd have to load the VGG model with the option include_top = False.
And you should then add your own final Dense layers (two or three dense layers at the end), making sure that the last layer has only one unit: Dense(1, activation='sigmoid').
This will work for "detecting" (yes / no results).
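A minimal sketch of that setup (the input size, the 256-unit hidden layer and freezing the base are reasonable defaults I am assuming, not requirements):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Pre-trained VGG16 without its 1000-class top, used as a (frozen) feature extractor.
base = tf.keras.applications.VGG16(include_top=False,
                                   weights="imagenet",
                                   input_shape=(224, 224, 3))
base.trainable = False

x = layers.Flatten()(base.output)
x = layers.Dense(256, activation="relu")(x)
out = layers.Dense(1, activation="sigmoid")(x)   # 1 unit: carburetor present or not

model = Model(base.input, out)
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, validation_split=0.2, epochs=10)
```

With about 1000 images, keeping the VGG16 weights frozen and training only the new dense layers is usually the safer starting point; you can unfreeze the last convolutional block later if validation accuracy stalls.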
But if your goal is "locating/segmentation", then you should create your own version of a U-net or a SegNet, for instance.
I want to use a pre-trained model for face identification. I am trying to use a Siamese architecture, which requires only a small number of images. Could you point me to a trained model that I can adapt for the Siamese architecture? How can I change the network so that I can feed in two images and find their similarity (I do not want to create an image based on the tutorial here)? I only want to use the system for a real-time application. Do you have any recommendations?
I suppose you can use this model, described in Xiang Wu, Ran He, Zhenan Sun, Tieniu Tan, A Light CNN for Deep Face Representation with Noisy Labels (arXiv 2015), as a starting point for your experiments.
As for the Siamese network, what you are trying to learn is a mapping from a face image into some high-dimensional vector space, in which distances between points reflect the (dis)similarity between faces.
To do so, you only need one network that takes a face as input and produces a high-dimensional vector as output.
However, to train this single network using the Siamese approach, you are going to duplicate it: you create two instances of the same net (you need to explicitly tie the weights of the two copies). During training you provide pairs of faces to the nets, one to each copy; a single loss layer on top of the two copies then compares the high-dimensional vectors representing the two faces and computes a loss according to the "same/not same" label associated with the pair.
Hence, you only need the duplication for training. At test time ('deploy') you have a single net providing you with a semantically meaningful high-dimensional representation of faces.
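A schematic of that training setup in Keras, using a contrastive loss as one common choice (the small convolutional embedder, input size and margin are toy stand-ins for a real face model such as the Light CNN mentioned above):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model, backend as K

def build_embedder(embedding_dim=128):
    """The single network that maps a face image to a high-dimensional vector."""
    inp = layers.Input(shape=(128, 128, 1))
    x = layers.Conv2D(32, 3, activation="relu")(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    out = layers.Dense(embedding_dim)(x)
    return Model(inp, out)

embedder = build_embedder()

# Both inputs go through the SAME embedder instance, so the weights are tied.
img_a = layers.Input(shape=(128, 128, 1))
img_b = layers.Input(shape=(128, 128, 1))
emb_a, emb_b = embedder(img_a), embedder(img_b)

# Euclidean distance between the two embeddings.
distance = layers.Lambda(
    lambda t: K.sqrt(K.maximum(K.sum(K.square(t[0] - t[1]), axis=1, keepdims=True), 1e-9))
)([emb_a, emb_b])

def contrastive_loss(y_true, d, margin=1.0):
    # y_true = 1 for "same person", 0 for "different person"
    y_true = tf.cast(y_true, d.dtype)
    return tf.reduce_mean(y_true * tf.square(d) +
                          (1.0 - y_true) * tf.square(tf.maximum(margin - d, 0.0)))

siamese = Model([img_a, img_b], distance)
siamese.compile(optimizer="adam", loss=contrastive_loss)
# At deploy time you keep only `embedder` and compare embeddings directly.
```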
For a more advanced Siamese architecture and loss, see this thread.
On the other hand, you might want to consider the approach described in Oren Tadmor, Yonatan Wexler, Tal Rosenwein, Shai Shalev-Shwartz, Amnon Shashua, Learning a Metric Embedding for Face Recognition using the Multibatch Method (arXiv 2016). This approach is more efficient and easier to implement than pair-wise losses over image pairs.
I have some questions about how to actually interact with a pre-trained Caffe model. In my case I'm using a model for scene recognition.
In the Caffe git repository, there are some code examples in Python and C++ of image classifier implementations. However, those do not apply to my use case (since they only classify the input image as ONE class).
My goal is an application that takes an input image (jpg) and outputs the highest-scoring class label for each pixel in the input image (i.e., indices for sky, beach, road, car).
Could anyone give me some pointers on how to proceed?
There already seem to be implementations for this. This demo (http://places.csail.mit.edu/demo.html) is kind of what I want.
Thank you!
What you are looking for is not image classification, but rather semantic segmentation.
A recent work by Jonathan Long, Evan Shelhamer and Trevor Darrell is based on Caffe, and can be found here. It uses a fully convolutional network, that is, a network with no "InnerProduct" layers, only convolutional layers, and is thus capable of producing outputs of different sizes for different input sizes.
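To make the fully convolutional idea concrete, here is a tiny Keras sketch (not the FCN model from the paper, and the four class names are just the example labels from the question): because there are only convolutional layers, the network accepts any input size and outputs a class score per pixel.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

num_classes = 4   # e.g. sky, beach, road, car (illustrative)

# No Dense / "InnerProduct" layers: the output is a class-score map, not one vector.
inp = layers.Input(shape=(None, None, 3))                    # any input size
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
scores = layers.Conv2D(num_classes, 1, padding="same")(x)    # per-pixel class scores
fcn = Model(inp, scores)

image = np.random.rand(1, 240, 320, 3).astype("float32")
per_pixel_labels = np.argmax(fcn.predict(image), axis=-1)    # shape (1, 240, 320)
```

The real FCN additionally downsamples and then upsamples with learned deconvolutions and skip connections; this toy version keeps full resolution throughout only to show how the per-pixel labelling falls out of an all-convolutional design.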