Classifying an object from an ordered sequence of spatially aligned images - neural-network

This is about a project to count the number of tumour cells in a given 88 by 88 pixel frame. To say there is just one image is not actually accurate; there are 4 independent images of that frame i.e. the same 'situation' on the ground. All 4 of them have to be considered to count how many tumour cells there are (usually 0, sometimes 1, rarely 2).
Here are the 4 images of a sample frame concatenated together
These images are obtained from visualising the situation through different lens (wavelengths).
I have read up on several blog articles of neural networks. However, all assume one image-one label relationship, instead of the many image-one label relationship I am working with.
Hence, I am looking for suggestions for possible tools or just the technical term related to this kind of problem so I can proceed further.

Related

Structuring an experiment with psychtoolbox

I have a design for an experiment that I would like to code in psychtoolbox using MATLAB. I understand the basics of how to use this toolbox however it would be extremely helpful if someone has designed a similar experiment before and could provide me with some code that could help me to carry out the following:
The experimental procedure will be composed of 80 trials divided into 5 blocks with 16 trials in each block. The experiment consists of the participant selecting a number from the screen. There is a desirable number (target number) and an associated less desirable number (lure number). I won't go into further detail about the reasoning behind this experiment as it is not relevant to my question.
I have attached an image that shows 1 block of trials (16 trials). The other 4 blocks are the same as this block.
Target and lure numbers will be presented on the screen to choose from (an example can be seen in image below).
In some of the trials as can be seen from the trials table only one target number and one lure number are presented for the participant to choose from (instead of two targets and two lures).
The lure(s) that appears with each target(s) should not always be the same. I want the lure that is shown with the target to be randomly selected in each trial (as can be seen in the trials image there is more than one possible lure). In the trials image I have attached the trial numbers are presented just for clarity, in each block the presentation of the targets needs to be randomized.
You can use the BalanceTrials function that comes with psychtoolbox. You use all the possible lures and targets as inputs and it returns a random order of all possible combinations. You can also specify a minimum length of the list it returns, but if there are more combinations it will make the list longer to make it balanced. Here is an example:
numberOfTrials = 80;
targetNumbers = {'3','4','5','6','7','4 5','4 6','4 7'};
lureNumbers = {'3','4','5','6','7','4 7'};
[targets, lures] = BalanceTrials(numberOfTrials, 1, targetNumbers, lureNumbers);
You can split this up into 5 blocks or you do it each time for each block.

How to fine tune an FCN-32s for interactive object segmentation

I'm trying to implement the proposed model in a CVPR paper (Deep Interactive Object Selection) in which the data set contains 5 channels for each input sample:
1.Red
2.Blue
3.Green
4.Euclidean distance map associated to positive clicks
5.Euclidean distance map associated to negative clicks (as follows):
To do so, I should fine tune the FCN-32s network using "object binary masks" as labels:
As you see, in the first conv layer I have 2 extra channels, so I did net surgery to use pretrained parameters for the first 3 channels and Xavier initialization for 2 extras.
For the rest of the FCN architecture, I have these questions:
Should I freeze all the layers before "fc6" (except the first conv layer)? If yes, how the extra channels of the first conv will be learned? Are the gradients strong enough to reach the first conv layer during training process?
What should be the kernel size of the "fc6"? should I keep 7? I saw in "Caffe net_surgery" notebook that it depends on the output size of the last layer ("pool5").
The main problem is the number of outputs of the "score_fr" and "upscore" layers, since I'm not doing class segmentation (to use 21 for 20 classes and the background), how should I change it? What about 2? (one for object and the other for the non-object (background) area)?
Should I change "crop" layer "offset" to 32 to have center crops?
In case of changing each of these layers, what is the best initialization strategy for them? "bilinear" for "upscore" and "Xavier" for the rest?
Should I convert my binary label matrix values into zero-centered ( {-0.5,0.5} ) status, or it is OK to use them with the values in {0,1} ?
Any useful idea will be appreciated.
PS:
I'm using Euclidean loss, while I'm using "1" as the number of outputs for "score_fr" and "upscore" layers. If I use 2 for that, I guess it should be softmax.
I can answer some of your questions.
The gradients will reach the first layer so it should be possible to learn the weights even if you freeze the other layers.
Change the num_output to 2 and finetune. You should get a good output.
I think you'll need to experiment with each of the options and see how the accuracy is.
You can use the values 0,1.

Performing Intra-frame Prediction in Matlab

I am trying to implement a hybrid video coding framework which is used in the H.264/MPEG-4 video standard for which I need to perform 'Intra-frame Prediction' and 'Inter Prediction' (which in other words is motion estimation) of a set of 30 frames for video processing in Matlab. I am working with Mother-daughter frames.
Please note that this post is very similar to my previously asked question but this one is solely based on Matlab computation.
Edit:
I am trying to implement the framework shown below:
My question is how to perform horizontal coding method which is one of the nine methods of Intra Coding framework? How are the pixels sampled?
What I find confusing is that Intra Prediction needs two inputs which are the 8x8 blocks of input frame and the 8x8 blocks of reconstructed frame. But what happens when coding the very first block of the input frame since there will be no reconstructed pixels to perform horizontal coding?
In the image above the whole system is a closed loop where do you start?
END:
Question 1: Is intra-predicted image only for the first image (I-frame) of the sequence or does it need to be computed for all 30 frames?
I know that there are five intra coding modes which are horizontal, vertical, DC, Left-up to right-down and right-up to left-down.
Question 2: How do I actually get around comparing the reconstructed frame and the anchor frame (original current frame)?
Question 3: Why do I need a search area? Can the individual 8x8 blocks be used as a search area done one pixel at a time?
I know that pixels from reconstructed block are used for comparing, but is it done one pixel at a time within the search area? If so wouldn't that be too time consuming if 30 frames are to be processed?
Continuing on from our previous post, let's answer one question at a time.
Question #1
Usually, you use one I-frame and denote this as the reference frame. Once you use this, for each 8 x 8 block that's in your reference frame, you take a look at the next frame and figure out where this 8 x 8 block best moved in this next frame. You describe this displacement as a motion vector and you construct a P-frame that consists of this information. This tells you where the 8 x 8 block from the reference frame best moved in this frame.
Now, the next question you may be asking is how many frames is it going to take before we decide to use another reference frame? This is entirely up to you, and you set this up in your decoder settings. For digital broadcast and DVD storage, it is recommended that you generate an I-frame every 0.5 seconds or so. Assuming 24 frames per second, this means that you would need to generate an I-frame every 12 frames. This Wikipedia article was where I got this reference.
As for the intra-coding modes, these tell the encoder in what direction you should look for when trying to find the best matching block. Actually, take a look at this paper that talks about the different prediction modes. Take a look at Figure 1, and it provides a very nice summary of the various prediction modes. In fact, there are nine all together. Also take a look at this Wikipedia article to get better pictorial representations of the different mechanisms of prediction as well. In order to get the best accuracy, they also do subpixel estimation at a 1/4 pixel accuracy by doing bilinear interpolation in between the pixels.
I'm not sure whether or not you need to implement just motion compensation with P-frames, or if you need B frames as well. I'm going to assume you'll be needing both. As such, take a look at this diagram I pulled off of Wikipedia:
Source: Wikipedia
This is a very common sequence for encoding frames in your video. It follows the format of:
IBBPBBPBBI...
There is a time axis at the bottom that tells you the sequence of frames that get sent to the decoder once you encode the frames. I-frames need to be encoded first, followed by P-frames, and then B-frames. A typical sequence of frames that are encoded in between the I-frames follow this format that you see in the figure. The chunk of frames in between I-frames is what is known as a Group of Pictures (GOP). If you remember from our previous post, B-frames use information from ahead and from behind its current position. As such, to summarize the timeline, this is what is usually done on the encoder side:
The I-frame is encoded, and then is used to predict the first P-frame
The first I-frame and the first P-frame are then used to predict the first and second B-frame that are in between these frames
The second P-frame is predicted using the first P-frame, and the third and fourth B-frames are created using information between the first P-frame and the second P-frame
Finally, the last frame in the GOP is an I-frame. This is encoded, then information between the second P-frame and the second I-frame (last frame) are used to generate the fifth and sixth B-frames
Therefore, what needs to happen is that you send I-frames first, then the P-frames, and then the B-frames after. The decoder has to wait for the P-frames before the B-frames can be reconstructed. However, this method of decoding is more robust because:
It minimizes the problem of possible uncovered areas.
P-frames and B-frames need less data than I-frames, so less data is transmitted.
However, B-frames will require more motion vectors, and so there will be some higher bit rates here.
Question #2
Honestly, what I have seen people do is do a simple Sum-of-Squared Differences between one frame and another to compare similarity. You take your colour components (whether it be RGB, YUV, etc.) for each pixel from one frame in one position, subtract these with the colour components in the same spatial location in the other frame, square each component and add them all together. You accumulate all of these differences for every location in your frame. The higher the value, the more dissimilar this is between the one frame and the next.
Another measure that is well known is called Structural Similarity where some statistical measures such as mean and variance are used to assess how similar two frames are.
There are a whole bunch of other video quality metrics that are used, and there are advantages and disadvantages when using any of them. Rather than telling you which one to use, I defer you to this Wikipedia article so you can decide which one to use for yourself depending on your application. This Wikipedia article describes a whole bunch of similarity and video quality metrics, and the buck doesn't stop there. There is still on-going research on what numerical measures best capture the similarity and quality between two frames.
Question #3
When searching for the best block from an I-frame that has moved in a P-frame, you need to restrict the searching to a finite sized windowed area from the location of this I-frame block because you don't want the encoder to search all of the locations in the frame. This would simply be too computationally intensive and would thus make your decoder slow. I actually mentioned this in our previous post.
Using one pixel to search for another pixel in the next frame is a very bad idea because of the minuscule amount of information that this single pixel contains. The reason why you compare blocks at a time when doing motion estimation is because usually, blocks of pixels have a lot of variation inside the blocks which are unique to the block itself. If we can find this same variation in another area in your next frame, then this is a very good candidate that this group of pixels moved together to this new block. Remember, we're assuming that the frame rate for video is adequately high enough so that most of the pixels in your frame either don't move at all, or move very slowly. Using blocks allows the matching to be somewhat more accurate.
Blocks are compared at a time, and the way blocks are compared is using one of those video similarity measures that I talked about in the Wikipedia article I referenced. You are certainly correct in that doing this for 30 frames would indeed be slow, but there are implementations that exist that are highly optimized to do the encoding very fast. One good example is FFMPEG. In fact, I use FFMPEG at work all the time. FFMPEG is highly customizable, and you can create an encoder / decoder that takes advantage of the architecture of your system. I have it set up so that encoding / decoding uses all of the cores on my machine (8 in total).
This doesn't really answer the actual block comparison itself. Actually, the H.264 standard has a bunch of prediction mechanisms in place so that you're not looking at all of the blocks in an I-frame to predict the next P-frame (or one P-frame to the next P-frame, etc.). This alludes to the different prediction modes in the Wikipedia article and in the paper that I referred you to. The encoder is intelligent enough to detect a pattern, and then generalize an area of your image where it believes that this will exhibit the same amount of motion. It skips this area and moves onto the next.
This assignment (in my opinion) is way too broad. There are so many intricacies in doing motion prediction / compensation that there is a reason why most video engineers already use available tools to do the work for us. Why invent the wheel when it has already been perfected, right?
I hope this has adequately answered your questions. I believe that I have given you more questions than answers really, but I hope that this is enough for you to delve into this topic further to achieve your overall goal.
Good luck!
Question 1: Is intra-predicted image only for the first image (I-frame) of the sequence or does it need to be computed for all 30 frames?
I know that there are five intra coding modes which are horizontal, vertical, DC, Left-up to right-down and right-up to left-down.
Answer: intra prediction need not be used for all the frames.
Question 2: How do I actually get around comparing the reconstructed frame and the anchor frame (original current frame)?
Question 3: Why do I need a search area? Can the individual 8x8 blocks be used as a search area done one pixel at a time?
Answer: we need to use the block matching algo for finding the motion vector. so search area is reqd. Normally size of the search area should be larger than the block size. larger the search area, more the computation and higher the accuracy.

Enhancing 8 bit images to 16 bit

My objective is to enhance 8 bit images to 16 bit ones. In other words, I want to increase the dynamic range of an 8 bit image. And to do that, I can sequentially take multiple images of 8 bit with fixed scene and fixed camera. To simplify the issue, let's assume they are grayscale images
Intuitively, I think I can achieve the goal by
Multiplying two 8 bit images
resimage = double(img1) .* double(img2)
Averaging specified number of 8 bit images
resImage = mean(images,3)
assuming images(:,:,i) contains ith 8 bit image.
After that, I can make the resulting image to 16 bit one.
resImage = uint16(resImage)
But before testing these methods, I wonder there is another way to do this - except for buying 16 bit camera, or literature for this subject might be better.
UPDATE: As comments below display, I got great information on drawbacks of simple averaging above and on image stacks for the enhancement. So it may be a good topic to study after all. Thank all for your great comments.
This question appears to relate to increasing the Dynamic Range of an image by integrating information from multiple 8 bit exposures into a 16 bit image. This is related to the practice of capturing and combining "image stacks" in astronomical imaging among other fields. An explanation of this practice and how it can both reduce image noise, and enhance dynamic range is available here:
http://keithwiley.com/astroPhotography/imageStacking.shtml
The idea is that successive captures of the same scene are subject to image noise, and this noise leads to stochastic variation of the pixel values captured. In the simplest case these variations can be leveraged by summing and dividing i.e. mean averaging the stack to improve its dynamic range but the practicality would depend very much on the noise characteristics of the camera.
You want to sum many images together, assuming there is no jitter and the camera is steady. Accumulate a large sum and then divide by some amount.
Note that to get a reasonable 16-bit image from an 8 bit source, you'd need to take hundreds of images to get any kind of reasonable result. Note that jitter will distort edge information and there is some inherent noise level of the camera that might mean you are essentially 'grinding metal'. In a practical sense, you might get 2 or 3 more bits of data from image summing, but not 8 more. To get 3 bits more would require at least 64 images (6 bits) to sum. Then divide by 8 (3 bits), as the lower bits are garbage.
Rule of thumb is to get a new bit of data, you need the squared(bits) of images, so 3 bits (8) means 64 images, 4 bits would be 256 images, etc.
Here's a link that talks about sampling:
http://electronicdesign.com/analog/understand-tradeoffs-increasing-resolution-averaging
"In fact, it can be shown that the improvement is proportional to the square root of the number of samples in the average."
Note that SNR is a log scale so equating it to bits is reasonable.

Depth of Artificial Neural Networks

According to this answer, one should never use more than two hidden layers of Neurons.
According to this answer, a middle layer should contain at most twice the amount of input or output neurons (so if you have 5 input neurons and 10 output neurons, one should use (at most) 20 middle neurons per layer).
Does that mean that all data will be modeled within that amount of Neurons?
So if, for example, one wants to do anything from modeling weather (a million input nodes from data from different weather stations) to simple OCR (of scanned text with a resolution of 1000x1000DPI) one would need the same amount of nodes?
PS.
My last question was closed. Is there another SE site where these kinds of questions are on topic?
You will likely have overfitting of your data (aka, High Variance). Think of it like this: The more neurons and layers you have gives you more parameters to fit your data better.
Remember that for the first layer node the equation becomes Z = sigmoid(sum(W*x))
The second layer node becomes Z2 = Sigmoid(sum(W*Z))
Look into machine learning class taught at Stanford...its a great online course and good tool as a reference.
More than two hidden layers can be useful in certain architectures
such as cascade correlation (Fahlman and Lebiere 1990) and in special
applications, such as the two-spirals problem (Lang and Witbrock 1988)
and ZIP code recognition (Le Cun et al. 1989).
Fahlman, S.E. and Lebiere, C. (1990), "The Cascade Correlation
Learning Architecture," NIPS2, 524-532.
Le Cun, Y., Boser, B., Denker, J.s., Henderson, D., Howard, R.E.,
Hubbard, W., and Jackel, L.D. (1989), "Backpropagation applied to
handwritten ZIP code recognition", Neural Computation, 1, 541-551.
Check out the sections "How many hidden layers should I use?" and "How many hidden units should I use?" on comp.ai.neural-nets's FAQ for more information.