How to train a U-Net with negative examples?
I trained a U-Net on pictures of hands and fingers. The ground truth data are binary masks with white pixels for the foreground object (finger/hand) and black pixels for the background. Now I want to add negatives, i.e. images without a hand or finger; the corresponding ground truth masks would be completely black. However, the Dice coefficient is then not well suited as a metric or loss function. The reason for this is described here:
" If smooth is set too low, when the ground truth has few to 0 white pixels and the predicted image has some non-zero number of white pixels, the model will be penalized more heavily. Setting smooth higher means if the predicted image has some low amount of white pixels when the ground truth has none, the loss value will be lower. Depending on how aggressive the model needs to be, though, maybe a lower value is good..."
Correct Implementation of Dice Loss in Tensorflow / Keras
My question now is: does anyone have experience with how best to train a U-Net with negatives?
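For reference, here is a minimal sketch of the kind of smoothed Dice loss the quote above refers to, together with the common workaround of adding binary cross-entropy so that all-black masks still produce a useful gradient. This is only an illustration under my own assumptions (the function names and the smooth value are arbitrary), not the implementation from the linked answer.

from tensorflow.keras import backend as K

def dice_loss(y_true, y_pred, smooth=1.0):
    # Flatten everything so masks with zero foreground pixels are handled too.
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    # "smooth" keeps the ratio finite and softens the penalty when the
    # ground truth contains few or no white pixels.
    dice = (2.0 * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)
    return 1.0 - dice

def bce_dice_loss(y_true, y_pred):
    # Binary cross-entropy still gives a per-pixel gradient on all-black masks.
    bce = K.mean(K.binary_crossentropy(y_true, y_pred))
    return bce + dice_loss(y_true, y_pred)

# model.compile(optimizer='adam', loss=bce_dice_loss, metrics=['accuracy'])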
Related
I'm solving a problem of detecting an optic disc in a retina image. As you can see from the image:
the optic disc is the point where the blood vessels converge; it has a roughly circular, slightly irregular shape and a brighter color than the rest of the retina.
Now I want to use a convolutional neural network to detect it. I know that typical approaches to detecting something in an image with CNNs (consisting mostly of convolution, pooling, dropout and fully connected layers) divide the image into smaller parts, and each part is sent to a classifier which decides whether the object is there or not.
But I'm thinking about another approach. It would be a model which takes a normal RGB image of size Height x Width as input and passes it through several convolution layers so that the spatial size stays the same (Height x Width) but with more channels, say N. There would be no pooling layers (??), so the final output of the convolutions would be of size Height x Width x N.
In this output there would be Height x Width feature vectors of size N, each somehow describing the pixel at that position in the original image and its neighbourhood (??). What I'm trying to do is take these individual vectors as inputs to a fully connected network. Its output would be some number describing the relative position of the input pixel with respect to the position of the optic disc in the image (maybe its distance, or the position itself, I don't know yet...). The training data consist of an image and the x, y position of the optic disc.
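For concreteness, here is a rough Keras sketch of the architecture I have in mind; the layer widths are arbitrary, and the 1x1 convolutions at the end are just the per-pixel fully connected layers expressed as convolutions:

from tensorflow.keras import layers, models

def build_pixelwise_regressor(height, width, n_features=64):
    inp = layers.Input(shape=(height, width, 3))
    # 'same' padding and no pooling keep the spatial size at Height x Width.
    x = layers.Conv2D(32, 3, padding='same', activation='relu')(inp)
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
    x = layers.Conv2D(n_features, 3, padding='same', activation='relu')(x)
    # A 1x1 convolution is a fully connected layer applied independently to
    # each pixel's N-dimensional feature vector.
    x = layers.Conv2D(32, 1, activation='relu')(x)
    # One regression output per pixel, e.g. the predicted distance to the optic disc.
    out = layers.Conv2D(1, 1, activation=None)(x)
    return models.Model(inp, out)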
But I'm not sure about some parts of this approach. Can I omit the pooling layers? I thought that maybe the network wouldn't be translation invariant then, or something like that. I'm also not sure whether what I'm doing with the fully connected layers is correct. I don't understand neural networks well enough to say "it is obvious that this should work", or "this is a case where it's hard to tell and it's worth implementing to see how it works", or "this obviously won't work because...". So my question is simply: which of these three cases is it?
And isn't there some "obvious" method for this kind of problem, so that I'm just trying to solve something that has already been solved? (Maybe RNNs or something...)
I am trying to train my own network on Caffe, similar to the Imagenet model, but I am confused by the crop layer. As far as I understand the crop layer in the Imagenet model, during training it takes random 227x227 crops of the image and trains the network, while during testing it takes the center 227x227 crop. Don't we lose information from the image when we crop the center 227x227 patch out of the 256x256 image? And a second question: how can we define the number of crops to be taken during training?
Also, I trained the same network twice (same number of layers, same convolution sizes; the number of FC neurons will obviously differ), first taking a 227x227 crop from the 256x256 image, and then taking a 255x255 crop from the 256x256 image. According to my intuition, the model with the 255x255 crop should give the best result, but I am getting higher accuracy with the 227x227 crop. Can anyone explain the intuition behind this, or am I doing something wrong?
Your observations are not specific to Caffe.
The sizes of the cropped images during training and testing need to be the same (227x227 in your case), because the upstream network layers (convolutions, etc.) need their inputs to be the same size. Random crops are taken during training because you want data augmentation. During testing, however, you want to test against a standard dataset; otherwise the accuracy reported during testing would also depend on a shifting test set.
The crops are made dynamically at each iteration. All images in a training batch are randomly cropped. I hope this answers your second question.
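To illustrate what the crop layer effectively does at each iteration (this is a plain NumPy sketch, not Caffe code):

import numpy as np

def random_crop(img, crop_h=227, crop_w=227):
    # Training: a random top-left corner each iteration gives data augmentation.
    h, w = img.shape[:2]
    top = np.random.randint(0, h - crop_h + 1)
    left = np.random.randint(0, w - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w]

def center_crop(img, crop_h=227, crop_w=227):
    # Testing: always the same central region, so the test set stays fixed.
    h, w = img.shape[:2]
    top = (h - crop_h) // 2
    left = (w - crop_w) // 2
    return img[top:top + crop_h, left:left + crop_w]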
Your intuition is not complete: with the smaller crop (227x227) you get more data augmentation, because there are many more possible crop positions within a 256x256 image. Data augmentation essentially creates "new" training samples out of nothing, which is vital to prevent overfitting during training. With the larger crop (255x255) you should expect better training accuracy but lower test accuracy, since the data is more likely to be overfitted.
Of course, cropping can be overdone. Crop too much and you lose too much information from the image. For image categorization, the ideal crop size is one that does not alter the category of the image (i.e., only background is cropped away).
I have to reconstruct an object which will be placed around 1 to 1.5 meters away from the baseline of my stereo setup. The images captured by both cameras have high resolution (10 MP).
The accuracy with which I have to detect its position is +/- 0.5 mm along all three coordinate axes. (If you require more details, please let me know.)
For these, what should the optimal specifications of my checkerboard (for calibration) be?
I only know that it should be an asymmetric board, that it should be placed in the same distance range as the object, and that it should be shown at all possible orientations (making sure all corners are seen by both cameras).
What about:
Number of squares horizontally and vertically? (Also, which dimension should have more squares, and should the counts be odd or even?)
Dimension of each square on checkerboard?
What effect does the baseline distance have on this?
Do these parameters of the checkerboard affect my accuracy in any way? Are there any other parameters I need to consider for calibration?
I am using the MATLAB Stereo Calibrator App.
I will try to answer as well as I can:
Number of squares. As you can guess, the more squares (actually, the corners between squares are what is used!) the better the result, because you have a more overdetermined system of equations to solve. The overall size of the chequerboard does not matter; only the odd/even asymmetry of the pattern matters, so that the board's orientation is unambiguous.
Dimension of the squares. The size does not matter much in the "mathematical" representation, but it matters practically. If your squares are very small, your printer probably won't reproduce the corners sharply, and that will make your data noisier. In the past, for a really small calibration setup, I had to go to a specialised printing shop so they could print it with the best quality possible. Of course, if you make the squares very big, you won't be able to fit many of them in the image, which is not good either.
The baseline distance only affects how well you can see the corners between squares. The more accurately (in mm, real distance!) you detect these corners, the better. Obviously, if you make small squares and put them very far away, you won't see much; this ties in with questions 1 and 2. Another problem you may have is focus/depth of field: in an application I worked on, some really small and close objects had to be imaged. That was a problem while calibrating, because the range of z distance I could see without blur was around 2 mm. This really crippled my ability to calibrate properly, since I could not use large angles in the Z direction without getting blurred corners.
TL;DR: You want lots of corners between squares on the chequerboard, and you want to see them as precisely as possible.
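If it helps to see where the corners enter the maths, here is a rough OpenCV sketch of the same pipeline the MATLAB Stereo Calibrator App runs; the board dimensions, square size and the image_pairs list are placeholder assumptions, not recommendations:

import cv2
import numpy as np

pattern_size = (9, 6)        # inner corners per row/column (placeholder)
square_size_mm = 25.0        # printed square size (placeholder)

# One 3D point per inner corner; more corners -> more equations for the solver.
objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2) * square_size_mm

obj_points, left_points, right_points = [], [], []
for left_img, right_img in image_pairs:  # assumed list of grayscale image pairs
    ok_l, corners_l = cv2.findChessboardCorners(left_img, pattern_size)
    ok_r, corners_r = cv2.findChessboardCorners(right_img, pattern_size)
    if ok_l and ok_r:                    # in practice also refine with cv2.cornerSubPix
        obj_points.append(objp)
        left_points.append(corners_l)
        right_points.append(corners_r)

img_size = left_img.shape[::-1]
# Calibrate each camera individually, then the stereo pair with fixed intrinsics.
_, K1, d1, _, _ = cv2.calibrateCamera(obj_points, left_points, img_size, None, None)
_, K2, d2, _, _ = cv2.calibrateCamera(obj_points, right_points, img_size, None, None)
ret, K1, d1, K2, d2, R, T, E, F = cv2.stereoCalibrate(
    obj_points, left_points, right_points, K1, d1, K2, d2, img_size,
    flags=cv2.CALIB_FIX_INTRINSIC)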
I'm trying to segment the sky and water part in this image.
Link of the Picture
I've tried many methods such as k-means, thresholding, multi-level thresholding, etc., but unfortunately nothing worked very well.
Here is an example of my code(Matlab):
img = imread('1.jpg');
im_gray = rgb2gray(img);
b = imadjust(im_gray);                    % stretch the contrast
imshow(b);
bw_remove_small = imopen(b, strel('square',5));   % opening to remove small bright specks
imshow(bw_remove_small);                  % after 1st iteration
m3 = medfilt2(bw_remove_small, [18,16]);
imshow(m3);
m3 = medfilt2(bw_remove_small, [20,20]);  % larger median filter
imshow(m3);
I1 = m3;
I2 = I1;                                  % already grayscale, no rgb2gray needed here
I = double(I2);
figure
subplot(1,3,1)
imshow(I1)
subplot(1,3,2)
imshow(I2)
g = kmeans(I(:), 4);                      % cluster pixel intensities into 4 groups
J = reshape(g, size(I));
subplot(1,3,3)
imshow(J, []);
Can anyone help me, please?
The picture's two regions differ in hue, texture, and gray-level brightness.
The horizon is the most prominent line in the image from our point of view and shows up as a distinct change in brightness. A single brightness threshold will not work, because the image brightness is not flat; instead, fit a model of the brightness distribution and use it to flatten out the sky or the water. This implies some knowledge of the scene, but there are two cues that can give you an approximate answer: texture and/or hue.
Hue with a threshold of 120 (derived from the hue histogram) will give you the two regions, but they will not be divided cleanly and will have overlapping sections. Still, from these two rough sections a model of the brightness can be fitted.
The same goes for texture. Take a small FFT of image patches, subtract the DC component, then average or just sum the non-DC parts; the result is a histogram with two peaks that may not be as distinct as the hue's, but it is enough to find a threshold and two areas from which a brightness model can be fitted.
The key fact is that if the sky is modeled properly as a gray surface, you can subtract the model from the image and pull the sky out with a simple threshold.
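A rough NumPy/OpenCV sketch of the hue-then-brightness idea described above; the 120-degree hue threshold, the planar brightness model, and the residual threshold are illustrative assumptions to be tuned:

import cv2
import numpy as np

img = cv2.imread('1.jpg')
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
hue = hsv[:, :, 0].astype(np.float32) * 2.0     # OpenCV stores hue as 0-179, scale to degrees
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float32)

# Rough split from the hue histogram (threshold picked by inspection).
rough_sky = hue > 120

# Fit a simple brightness plane a*x + b*y + c over the rough sky region.
ys, xs = np.nonzero(rough_sky)
A = np.column_stack([xs, ys, np.ones_like(xs)]).astype(np.float64)
coeffs, _, _, _ = np.linalg.lstsq(A, gray[ys, xs], rcond=None)

# Subtract the modelled sky brightness and threshold the residual.
yy, xx = np.mgrid[0:gray.shape[0], 0:gray.shape[1]]
sky_model = coeffs[0] * xx + coeffs[1] * yy + coeffs[2]
sky_mask = np.abs(gray - sky_model) < 15        # pixels well explained by the sky model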
Edge detection is very noisy in this image, so it is hard to pick out the horizon line directly, but if you can extract the image lines without losing their shape and then look for a long, straight contour, it may take less code/work.
Hope this helps! I used this approach to find mountains in the distance when there was not a big difference between the sky and the mountains, and I just tried it on your picture and almost got a good result even without a good model of the sky.
I have two images – a mannequin with and without a garment.
Please refer to the sample images below. Ignore the jewels and footwear on the mannequin; imagine the second mannequin is wearing only the dress.
I want to extract only the garment from the two images for further processing.
The complication is that there is a slight displacement in the camera position between the two pictures. Because of this, a simple subtraction will not produce the garment mask.
Can anyone tell me how to handle it?
I think I need to do registration between the two images so that I can extract only the garment from the image?
Any references to blogs, articles and code are highly appreciated.
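To make the registration idea concrete, here is a rough OpenCV sketch of the kind of alignment I have in mind; the filenames, the ORB detector and the partial-affine motion model are placeholder choices on my part:

import cv2
import numpy as np

bare = cv2.imread('mannequin_bare.jpg', cv2.IMREAD_GRAYSCALE)        # hypothetical filenames
dressed = cv2.imread('mannequin_dressed.jpg', cv2.IMREAD_GRAYSCALE)

# Detect and match features between the two views.
orb = cv2.ORB_create(2000)
kp1, des1 = orb.detectAndCompute(bare, None)
kp2, des2 = orb.detectAndCompute(dressed, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
matches = sorted(matches, key=lambda m: m.distance)[:200]

src = np.float32([kp1[m.queryIdx].pt for m in matches])
dst = np.float32([kp2[m.trainIdx].pt for m in matches])

# Estimate rotation + translation + scale with RANSAC, which should reject
# matches that land on the garment itself.
M, inliers = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC)

# Warp the bare-mannequin image into the dressed image's frame, then subtract.
bare_aligned = cv2.warpAffine(bare, M, (dressed.shape[1], dressed.shape[0]))
diff = cv2.absdiff(dressed, bare_aligned)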
Idea
This is an idea of how you could do it. I haven't tested it, but my gut tells me it might work. I'm assuming that there will be slight differences in the pose of the mannequin as well as in the camera attitude.
Let the original image be A, and the clothed image be B.
Take the difference D = |A - B| and apply a median filter whose kernel is proportional to the largest deviation you expect from the pose and camera-attitude error: Dmedian = Median(D, kernelsize).
Quantize Dmedian into a binary mask Dmask = Q(Dmedian, threshold) using an appropriate threshold to obtain an approximate mask for the garment (this will be smaller than the garment itself because of the median filter). Reject any shapes in Dmask that have too small an area by setting their pixels to 0.
Expand the shape(s) in Dmask proportionally to the size of the median kernel: Emask = expand(Dmask, k*kernelsize). Then construct the difference of the masks, Fmask = |Dmask - Emask|, which contains the band of pixels where the garment edge is expected to be. For every pixel in Fmask that lies in this band, compute the correlation Cxy between A and B over a small neighbourhood and store the inverted correlations in an image C = 1.0 - Corr(A, B, Fmask, n).
Your final garment mask will be M=C+Dmask.
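A compact NumPy/OpenCV sketch of the masking steps above; A and B are assumed to be already aligned grayscale images, and all kernel sizes and thresholds are guesses to be tuned:

import cv2
import numpy as np

D = cv2.absdiff(A, B)
Dmedian = cv2.medianBlur(D, 21)                      # kernel ~ expected pose/camera error
_, Dmask = cv2.threshold(Dmedian, 25, 255, cv2.THRESH_BINARY)

# Reject small blobs (noise) by area.
n_labels, labels, stats, _ = cv2.connectedComponentsWithStats(Dmask)
for i in range(1, n_labels):
    if stats[i, cv2.CC_STAT_AREA] < 500:
        Dmask[labels == i] = 0

# Expand the mask and keep only the band where the true garment edge should lie.
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (41, 41))
Emask = cv2.dilate(Dmask, kernel)
Fmask = cv2.subtract(Emask, Dmask)                   # edge band for the correlation search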
Explanation
Since your images have nice, continuous swatches of colour, the difference between the two similar images will consist of thin lines and small gradients where the pose and camera attitude differ. When you take a median filter of the difference image with a sufficiently large kernel, these lines are removed because they occupy only a minority of the pixels.
The garment, on the other hand, will (hopefully) differ significantly from the colours of the unclothed version and will therefore generate a larger difference. Thresholding the difference after the median filter should give you a rough mask of the garment that is undersized, because some of the pixels near the edge are rejected when their median values are too low. You could stop here if the approximation is good enough for you.
By expanding the mask obtained above we get a probable region for the "true" edge. The process so far has narrowed the search region for the true edge considerably, so we can afford a more costly correlation search between the images along this band to find where the garment is. High correlation means no garment; low correlation means garment.
We use the inverted correlation as an alpha value, together with the initially smaller mask, to obtain an alpha-valued mask of the garment that can be used to extract it.
Clarification
Expand: What I mean by "expanding the mask" is to find the contour of the mask region and outsetting/growing/enlarging it to make it larger.
Corr(A, B, Fmask, n): just an arbitrarily chosen correlation function that gives the correlation between pixels in A and B selected by the mask Fmask, using a region of size n. The function returns 1.0 for a perfect match and 0.0 for an anti-match at each pixel tested. A good choice is this pseudocode:
foreach px_pos in Fmask where Fmask[px_pos] == 1
    Ap = subregion(A, px_pos, n) - mean(mean(A));   % patch around px_pos, global mean removed
    Bp = subregion(B, px_pos, n) - mean(mean(B));
    Cxy = sum(sum(Ap .* Bp))^2 / (sum(sum(Ap .* Ap)) * sum(sum(Bp .* Bp)));
    C[px_pos] = 1.0 - Cxy;
end
where subregion selects a region of size n around the pixel at position px_pos.
You can see that if Ap == Bp then Cxy=1