Below is an image of the YOLO network for object detection. The first layer says Conv. Layer 7x7x64-s-2. To the best of my knowledge, this means a convolution layer using a 7x7 kernel with 64 output channels and stride 2. But the picture shows 192
output channels at this layer. Am I misunderstanding something, or is it a typo in the paper? The second layer has the same problem, while the others don't.
We get 64 output channels for each input channel of the image. Since each image is a color image, it has 3 channels corresponding to red, green, and blue. Hence we get 64 x 3 = 192 output channels.
I'm solving a problem of detecting an optic disc in a retina image. As you can see from the image:
the optic disc is the epicenter of the blood vessels; it has an irregular, roughly circular shape and a brighter color than the rest of the retina.
Now I want to use a convolutional neural network to detect it. I know that typical approaches to detecting something in an image with CNNs (consisting mostly of convolutional, pooling, dropout and fully connected layers) divide the image into smaller parts, each of which is sent to a classifier that decides whether the object is present or not.
But I'm thinking about another approach. It would be a model that takes a normal RGB image of size Height x Width as input and passes it through several convolution layers so that the spatial size stays the same (Height x Width) but with more channels, say N. There would be no pooling layers (??), so the final output of the convolutions would be of size Height x Width x N.
In this output there would be Height x Width feature vectors of size N, each somehow describing the pixel at that position in the original image and its neighbourhood (??). Now what I'm trying to do is take these individual vectors as inputs to a fully connected network. Its output would be some number describing the relative position of the input pixel with respect to the position of the optic disc in the image (maybe its distance, or the position itself, I don't know yet...). The training data consist of an image and the x, y position of the optic disc.
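To make the idea concrete, here is a rough sketch of such a layer stack, assuming MATLAB's Deep Learning Toolbox (the input size and filter counts are placeholders, and the final 1x1 convolution plays the role of the per-pixel fully connected layer):
% Convolutions with 'same' padding keep the Height x Width spatial size;
% a 1x1 convolution at the end acts as a per-pixel fully connected layer.
layers = [
    imageInputLayer([480 640 3])                  % placeholder image size
    convolution2dLayer(3, 32, 'Padding', 'same')
    reluLayer
    convolution2dLayer(3, 64, 'Padding', 'same')
    reluLayer
    convolution2dLayer(1, 1)                      % one output value per pixel
    regressionLayer                               % e.g. distance to the optic disc
];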
But I'm not sure about some things with this approach. Can I omit the pooling layers? I thought the network might not be translation invariant then, or something like that. I'm also not sure whether what I'm doing in the fully connected layers is correct. I don't understand neural networks well enough to say "it is obvious that this should work", or "this is a case where it is not easy to tell, and it's worth implementing to see how it works", or "this obviously won't work because...". So my question is simply: which of these three cases is this?
And isn't there some "obvious" method for this, so that I'm just trying to solve something that has already been solved? (maybe RNNs or something...)
While training a CNN, many authors have mentioned randomly cropping images from the center of the original image to get a data augmentation factor of 2048. Can anyone please elaborate on what this means?
I believe you are referring to the ImageNet Classification with Deep Convolutional Neural Networks data augmentation scheme. The 2048x aspect of their data augmentation scheme goes as follows:
First all images are rescaled down to 256x256
Then for each image they take random 224x224 sized crops.
For each random 224x224 crop, they additionally augment by taking horizontal reflections of these 224x224 patches.
So my guess as to how they get to the 2048x data augmentation factor:
There are 32*32 = 1024 possible 224x224 sized image crops of a 256x256 image. To see this simply observe that 256-224=32, so we have 32 possible horizontal indices and 32 possible vertical indices for our crops.
Doing horizontal reflections of each crop doubles the size.
1024 * 2 = 2048.
The center-crop aspect of your question stems from the fact that the original images are not all the same size. So what the authors did was rescale each rectangular image so that the shortest side was of size 256, and then they took the center crop, thereby rescaling the entire dataset to 256x256. Once all the images are 256x256, they can perform the above (up to) 2048x data augmentation scheme.
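A minimal sketch of that crop-and-flip augmentation in plain MATLAB (the file name is just a placeholder):
img = imread('example.jpg');                     % placeholder file name

% Rescale so the shortest side is 256, then take the 256x256 center crop
[h, w, ~] = size(img);
img = imresize(img, 256 / min(h, w));
[h, w, ~] = size(img);
r0 = floor((h - 256) / 2);  c0 = floor((w - 256) / 2);
img = img(r0 + (1:256), c0 + (1:256), :);

% One random 224x224 crop, plus a random horizontal reflection
r = randi(256 - 224 + 1);  c = randi(256 - 224 + 1);
patch = img(r:r+223, c:c+223, :);
if rand > 0.5
    patch = flip(patch, 2);                      % horizontal reflection
end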
I have a 14-bit image that I would like to convert to the YCbCr color space. As far as I know, the conversions are written for 8-bit images. For instance, when I use MATLAB's rgb2ycbcr function and convert the result back to RGB, the output is all white. It is very important for me not to lose any information. What I want to do is separate the luminance from the chroma, do some processing, and convert back to RGB.
The YCbCr standard for converting quantities from the RGB colour space was specifically designed for 8-bit colour channel images. The scaling factors and constants are tailored so that the input is a 24-bit RGB image, that is, 8 bits per channel. (By the way, your notation is confusing: usually "xx-bit RGB" refers to the total number of bits required to represent a single pixel, not the bits per channel.)
One suggestion I could make is to rescale your channels independently so that they go from [0-1] for all channels. rgb2ycbcr can accept floating point inputs so long as they're in the range of [0-1]. Judging from your context, you have 14 bits representing each colour channel. Therefore, you can simply do this, given that your image is stored in A and the output will be stored in B:
B = rgb2ycbcr(double(A) / (2^14 - 1));
You can then process your chroma and luminance components using the output of rgb2ycbcr. Bear in mind that the components will also be normalized between [0-1]. Do your processing, then convert back using ycbcr2rgb, then rescale your outputs by 2^14 - 1 to bring your image back into 14-bit RGB per channel. Assuming Bout is your output image after your processing in the YCbCr colour space, do:
C = round((2^14 - 1)*ycbcr2rgb(Bout));
We round because the conversion will most likely produce floating-point values, and your image's colour planes need to be unsigned integers.
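Putting the two steps together, a minimal sketch (the processing in the middle is just a placeholder):
% Normalize the 14-bit RGB image to [0, 1] before conversion
B = rgb2ycbcr(double(A) / (2^14 - 1));

% ... process the luminance B(:,:,1) and chroma B(:,:,2:3) here ...
Bout = B;                                        % placeholder for the processed result

% Convert back and rescale to 14 bits per channel as unsigned integers
C = uint16(round((2^14 - 1) * ycbcr2rgb(Bout)));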
I have both RGB and depth images from Kinect as png format. I'm trying to use depth data with the watershed segmentation but I don't know how to combine both data and obtain a more accurate result. I checked some papers but I didn't understand the results or couldn't find a solution written specifically for the watershed algorithm. How can I include the depth data as a reference point to the segmentation process?
I'm using MATLAB's Image Processing Toolbox.
The images are from Nathan Silberman et al.'s database on Silberman's website.
An example RGB image and its corresponding depth file are shown below (note that the depth image, originally a binary image, has been converted to uint8):
Update: I tried to create a weighted grayscale image from the RGB source together with the depth data by taking each channel (red, green, blue and depth), assigning each a weight, and summing the weighted values for every corresponding pixel. But the resulting grayscale image does not improve the result significantly; it is not much better than the purely RGB-based segmentation. What else could I do if I follow this approach? Alternatively, how can I see the effect of the depth data?
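For reference, a minimal sketch of that weighted-combination idea (the file names and weights are only placeholders; imgradient and watershed are from the Image Processing Toolbox):
% Normalize the colour and depth channels to [0, 1]
rgb   = im2double(imread('rgb.png'));            % placeholder file names
depth = im2double(imread('depth.png'));

% Weighted grayscale image from R, G, B and depth (weights are illustrative)
w = [0.25 0.25 0.25 0.25];
combined = w(1)*rgb(:,:,1) + w(2)*rgb(:,:,2) + w(3)*rgb(:,:,3) + w(4)*depth;

% Watershed on the gradient magnitude of the combined image
g = imgradient(combined);
L = watershed(g);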
Unless there was an error uploading, the depth image is black and doesn't contain any depth data. Keep in mind that, as the Dutch saying goes, you're comparing apples and pears here.
Watershed images are not depth images; they are extractions of contours.
There is another place where you go wrong: depth images have a lower resolution than color images. For the Kinect v2 it is only 512x424, and the Kinect v1's true depth resolution is even lower than its returned bitmap size (it is a low-resolution depth map, and not every pixel is the result of a measurement, in contrast to the Kinect v2). The v2 also has better video output.
If you want a better watershed of an RGB image, average multiple camera frames to get rid of camera noise.
PS: I recommend you download the Windows Kinect SDK and take a look at the samples provided with it.
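As a small illustration of the frame-averaging suggestion (assuming the frames are already loaded as same-sized RGB arrays; the variable names are placeholders):
% Stack the frames along a 4th dimension and average to suppress sensor noise
frames   = cat(4, frame1, frame2, frame3);       % placeholder frame variables
avgFrame = uint8(mean(double(frames), 4));       % averaged RGB frame for watershed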
I'm working on image processing, and I have an image that has been DCT'd and quantized in 8 x 8 blocks of the 512 x 512 matrix. Now I have to find how many quantization levels the image has. Do I need to take the top-left pixel, place it into an array, and then plot it on a graph by calling hist?
Use length(unique(x(:))), where x is your image array; it gives the number of distinct values. This is appropriate for grayscale images.
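For example, a minimal sketch (the file name is a placeholder):
x = imread('quantized.png');              % placeholder file name
numLevels = length(unique(x(:)))          % number of distinct quantization levels
hist(double(x(:)), numLevels)             % optional: histogram of the level values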