Training Deep Convnet with small input size - neural-network

I am very new to this field of Deep Learning. While I understand how it works and I managed to run some tutorials on Caffe Library I still have some questions which I was unable to find some satisfying answers.
My questions are as follows:
Consider AlexNet which takes 227 x 227 image size as input in caffe (I think in original paper its 224), and the FC7 produces as 4096-D feature vector. Now if I want to detect a person say using Sliding window of Size (32 x 64) then each of this window will be upsized to 227 x 227 before going through the AlexNet. This is some big computation. Is there a better way to handle this (32 x 64) window?
My approach to this 32 x 64 window detector is to build my own network with few convolutions, pooling, ReLus and FCs. While I understand how I can build the architecture, I am afraid that the model I will train might have issues such as overfitting etc. One of my friend told me to pretrain my network using AlexNet but I don't know how to do this? I cannot get hold of him to ask for now but anyone there who thinks that what he said is doable? I am confused. I was thinking to use ImageNet and train my network which will take 32 x 64 input. Since this is just feature extractor I feel that using the imageNet might provide me with all variety of images for good learning? Please correct me if I am wrong and if possible guide me into the correct path.
This question is just about Caffe. Say I compute feature using HOG and I want to use the GPU version of Neural Net to train a classifier. Is that possible? I want thinking to use HDF5 layer to read the hog feature vector and pass that fully connected layer for training? Is that possible?
I would appreciate any help or links to papers etc that may help me understand the idea of Convnets.

For a CNN which contains fully connected layers, the input size cannot be changed. If the network is trained on 224x224 images, then the input size has to be 224x224. Look at this question.
Training your own network from scratch will require huge amount of data. AlexNet was trained on a million images. If you have such high amount of training data (you can download the ImageNet training data) then go ahead. Otherwise you might want to look into finetuning.
Yes, you can use HDF5 layer to read the HOG feature vector for training.

Related

Predict a number with a given image (0 to 1)

I am a total beginner to ML and Neural networks. I am currently working on a project where I have a lot of pictures stored in a MongoDB database. Each one of those pictures has a number from 0 to 1. For example "picture 1" 0.71.
I want to train my model given the database. The main goal for the project is that after the model is finished and trained, given an image the model will be able to return(predict) a number from 0 to 1. After doing some research and asking a few people I figured out some libraries that would be useful for the project are: Tenserflow and Keras. Some people told me that it is impossible, but I'm not sure therefore I came to ask here.
So my questions are: Is it possible? If so, how can I implement it? Are there any specific tools you recommend? If you specify a way that I should use for my project do I need to export my MongoDB database in a certain form? Since I am a beginner maybe there are some tutorials that you think that can help?
I'm sorry if this question is a bit too general, if there are any misunderstandings please comment and I will try to answer.
Thanks in advance!
What you want to do is totally feasible, this kind of project is called regression, since you are using images data the best type of models are called convolutional neural network (CNN), you'll need some understanding if you want to build your own model. I've done a project where I had to predict a number of bacterial colonies using an image, much like your problem except that I had no boundaries on the predicted values.
What is a CNN ? Here is a link
Basically a CNN will understand the features in the images and will use those features to predict a value.
You won't need to create your own model, most people just use well-designed one in the scientific litterature.
Go for keras, it's the easiest framework out there and work like a charm. Here is how to implement VGG16 (an architecture that is probably the best for your problem) : link
You should follow this tutorial to get going on developing with keras.
Last hint: don't use the same last layer as the one on the VGG16 implementation, use a Dense Layer with one neuron and with a sigmoid/linear/leaky relu activation.
ie:
#model.add(Dense(1000, activation='softmax'))
model.add(Dense(1, activation='sigmoid'))
This means : predict 1 number (sigmoid will bound it between 0 and 1, but maybe lrelu or linear is better)
Also, I guess you could use MongoDB to read the images as arrays, but I would just put the images on a folder.
Edit : When compiling the model, use a mean squared error as in
adam = keras.optimizers.Adam(lr=1e-4)
model.compile(optimizer=adam, loss='mse')
Here you have the "hello world program" in terms of neural networks and digits classification. You can start studying it because I think you will end up with a similar architecture for your NN. What you should focus on is the output of your model, because in this example they are performing classification on 10 classes (digits from 0 to 9) but you are trying to read a real number. You could try to use a single neurone with sigmoid or linear activation at the end of your model.

niftynet multi-class 3D segmentation with dense vnet

Neural network newbie here. I've been testing Niftynet and achieved decent single-class 3D segmentation predictions on an own MRI data set with dense_vnet. However, I ran out of luck when I tried to add a second label. The network seems to spot the correct organs but can't get rid of additional artifacts as if it cannot get out of a local minimum or it doesn't have enough degrees of freedom or something. This is one of the better looking prediction slices which does show some correct labels but also additional noise.
Why would a single-class segmentation work better than a multi-class segmentation? Is it even reasonable to expect good multi-class 3D segmentation results out of DenseVnet? If yes, is there a specific approach to improve the results?
P.S.
Niftynet's site refers to stackoverflow for general questions.
Apparently, DenseVnet does handle multi-class segmentation okay. They have provided a ready model with a Dice loss extension. It worked with my MRI data without any pre-processing even though it's been designed for CT images and Hounsfield units.

Convolution Neural Network for image detection/classification

So here is there setup, I have a set of images (labeled train and test) and I want to train a conv net that tells me whether or not a specific object is within this image.
To do this, I followed the tensorflow tutorial on MNIST, and I train a simple conv net reduced to the area of interest (the object) which are training on image of size 128x128. The architecture is as follows : successively 3 layers consisting of 2 conv layers and 1 max pool down-sampling layers, and one fully connected softmax layers (with two class 0 and 1 whether the object is present or not)
I impleted it using tensorflow, and this works quite well, but since I have enough computing power I was wondering how I could improve the complexity of the classification:
- adding more layers ?
- adding more channel at each layer ? (currently 32,64,128 and 1024 for the fully connected)
- anything else ?
But the most important part is that now I want to detect this same object on larger images (roughle 600x600 whereas the size of the object should be around 100x100).
I was wondering how I could use the previously training "small" network used for small images, in order to pretrained a larger network on the large images ? One option could be to classify the image using a slicing window of size 128x128 and scan the whole image but I would like to try if possible to train a whole network on it.
Any suggestion on how to proceed ? Or an article / ressource tackling this kind of problem ? (I am really new to deep learning so sorry if this is stupid question...)
Thanks !
I suggest that you continue reading on the field overall. Your search keys include CNN, image classification, neural net, AlexNet, GoogleNet, and ResNet. This will return many articles, on-line classes and lectures, and other materials to help you learn about classification with neural nets.
Don't just add layers or filters: the complexity of the topology (net design) must be fitted to the task; a net that's too complex will over-fit the training data. The one you've been using is probably LeNet; the three I cite above are for the ImageNet image classification contest.
Since you are working on images, I would suggest you to use a pretrained image classification network (like VGG, Alexnet etc.)and fine tune this network with your 128x128 image data. In my experience until we have very large data set fine tuned network will give more accuracy and also save training time. After building a good image classifier on your data set you can use any popular algorithm to generate region of proposal from the image. Now take all regions of proposal and pass them to classification network one by one and check weather this network is classifying given region of proposal as positive or negative. If it classifying as positively then most probably your object is present in that region. Otherwise it's not. If there are a lot of region of proposal in which object is present according to classifier then you can use non maximal suppression algorithms to reduce number of positive proposals.

PCA on Sift desciptors and Fisher Vectors

I was reading this particular paper http://www.robots.ox.ac.uk/~vgg/publications/2011/Chatfield11/chatfield11.pdf and I find the Fisher Vector with GMM vocabulary approach very interesting and I would like to test it myself.
However, it is totally unclear (to me) how do they apply PCA dimensionality reduction on the data. I mean, do they calculate Feature Space and once it is calculated they perform PCA on it? Or do they just perform PCA on every image after SIFT is calculated and then they create feature space?
Is this supposed to be done for both training test sets? To me it's an 'obviously yes' answer, however it is not clear.
I was thinking of creating the feature space from training set and then run PCA on it. Then, I could use that PCA coefficient from training set to reduce each image's sift descriptor that is going to be encoded into Fisher Vector for later classification, whether it is a test or a train image.
EDIT 1;
Simplistic example:
[coef , reduced_feat_space]= pca(Feat_Space','NumComponents', 80);
and then (for both test and train images)
reduced_test_img = test_img * coef; (And then choose the first 80 dimensions of the reduced_test_img)
What do you think? Cheers
It looks to me like they do SIFT first and then do PCA. the article states in section 2.1 "The local descriptors are fixed in all experiments to be SIFT descriptors..."
also in the introduction section "the following three steps:(i) extraction
of local image features (e.g., SIFT descriptors), (ii) encoding of the local features in an image descriptor (e.g., a histogram of the quantized local features), and (iii) classification ... Recently several authors have focused on improving the second component" so it looks to me that the dimensionality reduction occurs after SIFT and the paper is simply talking about a few different methods of doing this, and the performance of each
I would also guess (as you did) that you would have to run it on both sets of images. Otherwise your would be using two different metrics to classify the images it really is like comparing apples to oranges. Comparing a reduced dimensional representation to the full one (even for the same exact image) will show some variation. In fact that is the whole premise of PCA, you are giving up some smaller features (usually) for computational efficiency. The real question with PCA or any dimensionality reduction algorithm is how much information can I give up and still reliably classify/segment different data sets
And as a last point, you would have to treat both images the same way, because your end goal is to use the Fisher Feature Vector for classification as either test or training. Now imagine you decided training images dont get PCA and test images do. Now I give you some image X, what would you do with it? How could you treat one set of images differently from another BEFORE you've classified them? Using the same technique on both sets means you'd process my image X then decide where to put it.
Anyway, I hope that helped and wasn't to rant-like. Good Luck :-)

Optimization of Neural Network input data

I'm trying to build an app to detect images which are advertisements from the webpages. Once I detect those I`ll not be allowing those to be displayed on the client side.
Basically I'm using Back-propagation algorithm to train the neural network using the dataset given here: http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements.
But in that dataset no. of attributes are very high. In fact one of the mentors of the project told me that If you train the Neural Network with that many attributes, it'll take lots of time to get trained. So is there a way to optimize the input dataset? Or I just have to use that many attributes?
1558 is actually a modest number of features/attributes. The # of instances(3279) is also small. The problem is not on the dataset side, but on the training algorithm side.
ANN is slow in training, I'd suggest you to use a logistic regression or svm. Both of them are very fast to train. Especially, svm has a lot of fast algorithms.
In this dataset, you are actually analyzing text, but not image. I think a linear family classifier, i.e. logistic regression or svm, is better for your job.
If you are using for production and you cannot use open source code. Logistic regression is very easy to implement compared to a good ANN and SVM.
If you decide to use logistic regression or SVM, I can future recommend some articles or source code for you to refer.
If you're actually using a backpropagation network with 1558 input nodes and only 3279 samples, then the training time is the least of your problems: Even if you have a very small network with only one hidden layer containing 10 neurons, you have 1558*10 weights between the input layer and the hidden layer. How can you expect to get a good estimate for 15580 degrees of freedom from only 3279 samples? (And that simple calculation doesn't even take the "curse of dimensionality" into account)
You have to analyze your data to find out how to optimize it. Try to understand your input data: Which (tuples of) features are (jointly) statistically significant? (use standard statistical methods for this) Are some features redundant? (Principal component analysis is a good stating point for this.) Don't expect the artificial neural network to do that work for you.
Also: remeber Duda&Hart's famous "no-free-lunch-theorem": No classification algorithm works for every problem. And for any classification algorithm X, there is a problem where flipping a coin leads to better results than X. If you take this into account, deciding what algorithm to use before analyzing your data might not be a smart idea. You might well have picked the algorithm that actually performs worse than blind guessing on your specific problem! (By the way: Duda&Hart&Storks's book about pattern classification is a great starting point to learn about this, if you haven't read it yet.)
aplly a seperate ANN for each category of features
for example
457 inputs 1 output for url terms ( ANN1 )
495 inputs 1 output for origurl ( ANN2 )
...
then train all of them
use another main ANN to join results