Caffe | data augmentation by random cropping - neural-network

I am trying to train my own network in Caffe, similar to the Imagenet model. But I am confused about the crop layer. As far as I understand the Imagenet model, during training it takes random 227x227 crops of the image and trains the network on them. But during testing it takes the center 227x227 crop; don't we lose information from the image when we crop the center 227x227 patch out of the 256x256 image? And a second question: how can we define the number of crops to be taken during training?
Also, I trained the same network (same number of layers, same convolution sizes; the FC neurons will obviously differ) twice: first taking a 227x227 crop from the 256x256 image, and a second time taking a 255x255 crop from the 256x256 image. According to my intuition, the model with the 255x255 crop should give the better result, but I am getting higher accuracy with the 227x227 crop. Can anyone explain the intuition behind this, or am I doing something wrong?

Your observations are not specific to Caffe.
The sizes of the cropped images during training and testing need to be the same (227x227 in your case), because the upstream network layers (convolutions, etc.) need the inputs to be the same size. Random crops are done during training because you want data augmentation. During testing, however, you want to test against a standard dataset; otherwise, the accuracy reported during testing would also depend on a shifting test database.
The crops are made dynamically at each iteration. All images in a training batch are randomly cropped. I hope this answers your second question.
Your intuition is not complete: with the smaller crop (227x227), you get more data augmentation. Data augmentation essentially creates "new" training samples out of nothing, which is vital for preventing overfitting during training. With the larger crop (255x255), you should expect better training accuracy but lower test accuracy, since the data is more likely to be overfitted.
Of course, cropping can be overdone. Crop too much and you lose too much information from the image. For image categorization, the ideal crop size is one that does not alter the category of the image (i.e., only background is cropped away).
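In Caffe you normally don't crop by hand: as far as I remember, setting crop_size in the data layer's transform_param gives exactly this behaviour (random crops in the TRAIN phase, center crops in the TEST phase). A minimal NumPy sketch of the two behaviours, with function names of my own choosing, looks like this:

import numpy as np

def random_crop(img, size=227):
    # Training-time augmentation: pick a random size x size window.
    h, w = img.shape[:2]
    top = np.random.randint(0, h - size + 1)
    left = np.random.randint(0, w - size + 1)
    return img[top:top + size, left:left + size]

def center_crop(img, size=227):
    # Test-time: always take the same central window, so the reported
    # accuracy is computed on a fixed dataset.
    h, w = img.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return img[top:top + size, left:left + size]

# A 256x256 image admits (256 - 227 + 1)^2 = 900 distinct training crops,
# but always the same single crop at test time.
img = np.zeros((256, 256, 3), dtype=np.uint8)
train_patch = random_crop(img)   # different every iteration
test_patch = center_crop(img)    # always identical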

Related

Training U-Net with Negatives

How to train a U-Net with negative examples?
I trained a U-Net with pictures of hands and fingers. The ground truth data are binary masks with white pixels for the foreground object (finger/hand) and black pixels for the background. Now I want to add negatives, i.e. images without a hand/finger; the respective ground truth would then be completely black. However, the Dice coefficient is then not well suited as a metric or loss function. The reason for this is described here:
" If smooth is set too low, when the ground truth has few to 0 white pixels and the predicted image has some non-zero number of white pixels, the model will be penalized more heavily. Setting smooth higher means if the predicted image has some low amount of white pixels when the ground truth has none, the loss value will be lower. Depending on how aggressive the model needs to be, though, maybe a lower value is good..."
Correct Implementation of Dice Loss in Tensorflow / Keras
My question now is, does anyone have any experience on how best to train a U-Net with negatives?
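For reference, a minimal NumPy sketch of the smoothed Dice loss the quote refers to (the function names and the default smooth value are my own, not from the linked question):

import numpy as np

def dice_coefficient(y_true, y_pred, smooth=1.0):
    # Soft Dice: `smooth` keeps the ratio defined (and less harsh) when the
    # ground-truth mask is completely black, i.e. a negative example.
    y_true = y_true.flatten()
    y_pred = y_pred.flatten()
    intersection = np.sum(y_true * y_pred)
    return (2.0 * intersection + smooth) / (np.sum(y_true) + np.sum(y_pred) + smooth)

def dice_loss(y_true, y_pred, smooth=1.0):
    return 1.0 - dice_coefficient(y_true, y_pred, smooth)

# All-black ground truth, a few false-positive pixels predicted:
gt = np.zeros((64, 64))
pred = np.zeros((64, 64))
pred[0, :5] = 1.0
print(dice_loss(gt, pred, smooth=1.0))    # heavy penalty
print(dice_loss(gt, pred, smooth=100.0))  # much milder penalty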

Different approach to detecting shapes using CNNs (optic disc in a retina image)

I'm solving a problem of detecting an optic disc in a retina image. As you can see from the image:
the optic disc is the epicenter of the blood vessels, has an irregular, roughly circular shape and is brighter than the rest of the retina.
Now I want to use a convolutional neural network to detect it. I know that typical approaches to detecting something in an image using CNNs (consisting mostly of conv, pooling, dropout and fully connected layers) divide the image into smaller parts, and each part is sent to a classifier that is asked whether the object is there or not.
But I'm thinking about another approach. It would be a model that takes a normal RGB image of size Height x Width as input and passes it through several convolution layers so that the spatial size stays the same (Height x Width) but the number of channels grows, say to N. There would be no pooling layers (??), so the final output of the convolutions would be of size Height x Width x N.
In this output there would be Height x Width feature vectors of size N, each somehow describing the pixel at that position in the original image and its neighbourhood (??). Now what I'm trying to do is take these individual vectors as inputs to a fully connected network. Its output would be some number describing the relative position of the input pixel with respect to the position of the optic disc in the image (maybe its distance, or the position itself, I don't know yet...). The training data consist of an image and the x, y position of the optic disc.
But I'm not sure about some aspects of this approach. Can I get away with not using pooling layers? I thought the model might then not be translation invariant, or something like that. I'm also not sure whether what I'm doing in the fully connected layers is correct. I don't understand neural networks well enough to say "it is obvious that this should work", or "this is a case where it is hard to tell and worth implementing to see how it behaves", or "this obviously won't work because...". So my question is simply: which of these three cases is this?
And isn't there some "obvious" method for this stuff and I'm just trying to solve something that was already solved? (maybe RNNs or something...)
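To make the idea concrete, here is a minimal Keras sketch of the architecture described above; all layer counts, channel sizes and the per-pixel target are assumptions on my part. The key observation is that a fully connected layer applied independently to every pixel's feature vector is equivalent to a 1x1 convolution:

import tensorflow as tf
from tensorflow.keras import layers, models

H, W, N = 256, 256, 64  # assumed input size and feature depth

inputs = layers.Input(shape=(H, W, 3))
x = layers.Conv2D(32, 3, padding='same', activation='relu')(inputs)
x = layers.Conv2D(32, 3, padding='same', activation='relu')(x)
x = layers.Conv2D(N, 3, padding='same', activation='relu')(x)   # Height x Width x N features

# The per-pixel "fully connected" head, written as 1x1 convolutions:
x = layers.Conv2D(64, 1, activation='relu')(x)
outputs = layers.Conv2D(1, 1)(x)  # one number per pixel, e.g. distance to the optic disc

model = models.Model(inputs, outputs)
model.compile(optimizer='adam', loss='mse')

The training target would then be an H x W map derived from the annotated x, y position (for example, each pixel's distance to the optic disc center).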

Data augmentation factor in training a CNN

While training a CNN, many authors mention randomly cropping images from the center of the original image with a data augmentation factor of 2048. Can anyone please elaborate on what this means?
I believe you are referring to the ImageNet Classification with Deep Convolutional Neural Networks data augmentation scheme. The 2048x aspect of their data augmentation scheme goes as follows:
First all images are rescaled down to 256x256
Then for each image they take random 224x224 sized crops.
For each random 224x224 crop, they additionally augment by taking horizontal reflections of these 224x224 patches.
So my guess as to how they get to the 2048x data augmentation factor:
There are 32*32 = 1024 possible 224x224 sized image crops of a 256x256 image. To see this simply observe that 256-224=32, so we have 32 possible horizontal indices and 32 possible vertical indices for our crops.
Doing horizontal reflections of each crop doubles the size.
1024 * 2 = 2048.
The center crop aspect of your question stems from the fact that the original images are not all the same size. What the authors did was rescale each rectangular image so that the shorter side had length 256 and then take the center crop from this, thereby rescaling the entire dataset to 256x256. Once all the images are 256x256, they can perform the (up to) 2048x data augmentation scheme described above.
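A short NumPy sketch of where the factor comes from and what one augmented sample looks like (the counting follows the answer above, i.e. 32 offsets per axis):

import numpy as np

SRC, CROP = 256, 224
positions = (SRC - CROP) ** 2      # 32 * 32 = 1024 distinct crop offsets
print(positions * 2)               # x2 for horizontal reflections -> 2048

def augment(img):
    # One random 224x224 crop plus a coin-flip horizontal reflection.
    top = np.random.randint(0, SRC - CROP)
    left = np.random.randint(0, SRC - CROP)
    patch = img[top:top + CROP, left:left + CROP]
    if np.random.rand() < 0.5:
        patch = patch[:, ::-1]     # horizontal flip
    return patch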

use scale space representation to filter one image

Currently I hope to use a scale space representation to filter one image. Features in an image can be filtered with a Gaussian smoothing filter using one optimal sigma; in other words, different features in one image are expressed best at different scales of the scale space representation.
For example, I have an image with a tree in it. In the scale space representation, three sigma values are used, denoted sigma0, sigma1 and sigma2. The ground is best expressed in the image smoothed with sigma0 because it mainly contains texture. The branches are best expressed in the image smoothed with sigma1, and the trunk in the image smoothed with sigma2. If I filter the image, I would like the filtered pixels for the ground to come from the image smoothed with sigma0, the filtered pixels for the branches to come from the image smoothed with sigma1, and the filtered pixels for the trunk to come from the image smoothed with sigma2.
This requires determining in which smoothed image each pixel is expressed best. Is this idea plausible?
I am trying to use the difference-of-Gaussian of two successive smoothed images to perform the above task. Is there any other way to combine the three smoothed images?
I use Matlab to implement the idea. The values of the three sigmas are 1.0, 2.0 and 3.0, and the corresponding Gaussian kernel sizes are 3, 5 and 7. I use the function fspecial to generate the kernels. Are these parameters reasonable? Please share your experience with scale space representations; links to useful papers are welcome.
Your idea is very much plausible! You are just one step away from it. I did something very similar once and it looked like this:
After smoothing your images and extracting the edges for each smoothing step (I used a weighted Sobel filter for this [to compensate for maxima suppression after Gaussian filtering], since DoG was not quite stable for my application), you can project (and normalize) your whole stack of edge images into a single image ("cumulative edges") which will contain the characteristic edges. You can then compare the cumulative-edges image (using cross-correlation or whatever you wish) with every single image in your edge stack; the biggest value of this comparison then gives the smoothing scale at which the pixel is expressed best.
Hope that makes sense for you after reading it a couple of times.
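A rough Python/SciPy sketch of that kind of pipeline (I use plain Sobel gradient magnitude as the per-scale edge response and a simple per-pixel argmax over the edge stack instead of the cross-correlation step, so treat it as an approximation of the procedure above; the analogous MATLAB functions such as fspecial/imfilter work the same way):

import numpy as np
from scipy import ndimage

def best_scale_map(img, sigmas=(1.0, 2.0, 3.0)):
    # Smooth at several scales, take an edge response per scale,
    # and pick, per pixel, the scale with the strongest response.
    responses = []
    for s in sigmas:
        smoothed = ndimage.gaussian_filter(img.astype(float), sigma=s)
        gx = ndimage.sobel(smoothed, axis=1)
        gy = ndimage.sobel(smoothed, axis=0)
        responses.append(np.hypot(gx, gy))      # gradient magnitude
    stack = np.stack(responses)                 # shape: (n_scales, H, W)
    cumulative = stack.sum(axis=0)              # "cumulative edges" image
    best = stack.argmax(axis=0)                 # index of the winning sigma per pixel
    return cumulative, best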
Also, don't be afraid of using much bigger kernel sizes. It all depends on your application, but I ended up using kernels of 51 and bigger! (I was working with 40 MP images, though...)
T. Lindeberg has literally dozens of papers related to this problem. I found this one the most useful, but since you are already on the right track, I don't think reading all 50 pages will make you that much smarter. The most important part of it is maybe this one:
Principle for scale selection: In the absence of other evidence, assume that a scale level, at which some (possibly non-linear) combination of normalized derivatives assumes a local maximum over scales, can be treated as reflecting a characteristic length of a corresponding structure in the data.
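One concrete reading of that principle is the scale-normalized Laplacian: compute sigma^2 * |Laplacian(G_sigma * img)| for every sigma and take, per pixel, the scale at which it is maximal. A small SciPy sketch (the list of sigmas is an arbitrary choice of mine):

import numpy as np
from scipy import ndimage

def characteristic_scale(img, sigmas=(1.0, 2.0, 3.0, 4.0, 6.0)):
    # Scale-normalized Laplacian response at every scale; the per-pixel
    # maximum over scales marks the characteristic scale.
    img = img.astype(float)
    stack = np.stack([
        (s ** 2) * np.abs(ndimage.gaussian_laplace(img, sigma=s))
        for s in sigmas
    ])
    return np.take(np.array(sigmas), stack.argmax(axis=0))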

Remove paper texture pattern from a photograph

I've scanned an old photo with paper texture pattern and I would like to remove the texture as much as possible without lowering the image quality. Is there a way, probably using Image Processing toolbox in MATLAB?
I've tried to apply an FFT transform (using a Photoshop plugin), but I couldn't find any clear white spots to paint over. Probably the pattern is not regular enough for this method?
You can see the sample below. If you need the full image I can upload it somewhere.
Unfortunately, you're pretty much stuck in the spatial domain, as the pattern isn't really repetitive enough for Fourier analysis to be of use.
As @Jonas and @michid have pointed out, filtering will help with a problem like this. With filtering, you face a trade-off between the amount of detail you want to keep and the amount of noise (or unwanted image components) you want to remove. For example, the median filter used by @Jonas removes the paper texture completely (even the round scratch near the bottom edge of the image), but it also removes all texture within the eyes, hair, face and background (although we don't really care about the background so much; it's the foreground that matters). You'll also see a slight decrease in image contrast, which is usually undesirable. This gives the image an artificial look.
Here's how I would handle this problem:
Detect the paper texture pattern:
Apply a Gaussian blur to the image (use a large kernel to make sure that all the paper texture information is destroyed)
Calculate the image difference between the blurred and original images
EDIT 2 Apply Gaussian blur to the difference image (use a small 3x3 kernel)
Threshold the above pattern using an empirically-determined threshold. This yields a binary image that can be used as a mask.
Use median filtering (as mentioned by @Jonas) to replace only the parts of the image that correspond to the paper pattern (a rough code sketch of these steps follows below).
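A Python/SciPy sketch of those steps for a grayscale image; the sigmas, the median filter size and the automatic threshold are guesses (the answer above determines the threshold empirically), and the same steps map directly to MATLAB's fspecial/imfilter and medfilt2:

import numpy as np
from scipy import ndimage

def remove_paper_texture(img, blur_sigma=10, pattern_sigma=1, thresh=None):
    # 1. large blur, 2. difference image, 3. small blur of the difference,
    # 4. threshold into a mask, 5. median-filter only the masked pixels.
    img = img.astype(float)
    background = ndimage.gaussian_filter(img, sigma=blur_sigma)
    pattern = np.abs(img - background)
    pattern = ndimage.gaussian_filter(pattern, sigma=pattern_sigma)
    if thresh is None:
        thresh = pattern.mean() + pattern.std()   # crude stand-in for an empirical threshold
    mask = pattern > thresh
    filtered = ndimage.median_filter(img, size=15)
    out = img.copy()
    out[mask] = filtered[mask]                    # replace only the paper-pattern pixels
    return np.clip(out, 0, 255).astype(np.uint8), mask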
Paper texture pattern (before thresholding):
You want as little actual image information as possible to be present in the above image. You'll see that you can very faintly make out the edge of the face (this isn't good, but it's the best I have time for). You also want this paper texture image to be as even as possible (so that thresholding gives equal results across the image). Again, the right-hand side of the image above is slightly darker, meaning that thresholding it well will be difficult.
Final image:
The result isn't perfect, but it has completely removed the highly-visible paper texture pattern while preserving more high-frequency content than the simpler filtering approaches.
EDIT
The filled-in areas are typically plain-colored and thus stand out a bit if you look at the image very closely. You could also try adding some low-strength zero-mean Gaussian noise to the filled-in areas to make them look more realistic. You'd have to pick the noise variance to match the background. Determining it empirically may be good enough.
Here's the processed image with the noise added:
Note that the parts where the paper pattern was removed are more difficult to see because the added Gaussian noise is masking them. I used the same Gaussian distribution for the entire image but if you want to be more sophisticated you can use different distributions for the face, background, etc.
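A one-function sketch of that noise step (the standard deviation is a placeholder; match it to the background as described above):

import numpy as np

def add_masked_noise(img, mask, std=5.0):
    # Add low-strength zero-mean Gaussian noise only where the paper
    # pattern was filled in, so the plain-coloured patches blend in.
    noisy = img.astype(float)
    noise = np.random.normal(loc=0.0, scale=std, size=img.shape)
    noisy[mask] += noise[mask]
    return np.clip(noisy, 0, 255).astype(np.uint8)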
A median filter can help you a bit:
img = imread('http://i.stack.imgur.com/JzJMS.jpg');
%# convert rgb to grayscale
img = rgb2gray(img);
%# apply median filter
fimg = medfilt2(img,[15 15]);
%# show
imshow(fimg,[])
Note that you may want to pad the image first to avoid edge effects.
EDIT: A smaller filter kernel than [15 15] will preserve image texture better, but will leave more visible traces of the filtering.
Well, I have tried out a different approach using anisotropic diffusion with the second diffusion coefficient, which operates on wider areas.
Here is the output I got:
From what I can see in the picture, the noise has a relatively high frequency compared to the image itself, so applying a low-pass filter should work. Have a look at the power spectrum abs(fft(...)) to determine the cutoff frequency.
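In case it helps, a short NumPy/SciPy sketch of that suggestion, with a Gaussian blur standing in for the low-pass filter (the sigma/cutoff has to be read off the spectrum; the value here is a placeholder):

import numpy as np
from scipy import ndimage

def inspect_and_lowpass(img, sigma=2.0):
    # Power spectrum (log-scaled for easier viewing) to judge where the
    # noise lives, then a simple Gaussian low-pass.
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(img.astype(float))))
    log_spectrum = np.log1p(spectrum)
    lowpassed = ndimage.gaussian_filter(img.astype(float), sigma=sigma)
    return log_spectrum, lowpassed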