Max-pooling vs. zero padding: Losing spatial information - neural-network

When it comes to convolutional neural networks there are normally many papers recommending different strategies. I have heard people say that it is an absolute must to add padding to the images before a convolution, otherwise too much spatial information is lost. On the other hand, they are happy to use pooling, normally max-pooling, to reduce the size of the images. I guess the thought here is that max-pooling reduces the spatial information but also reduces the sensitivity to relative positions, so it is a trade-off?
I have heard other people say that zero-padding does not keep more information, just more empty data: by adding zeros you will not get a reaction from your kernel anyway where the information is missing.
I can imagine that zero-padding works if you have big kernels with "scrap values" in the edges and the source of activation centered in a smaller region of the kernel?
I would be happy to read some papers about the effect of down-sampling using pooling contra not using padding, but I can't find much about it. Any good recommendations or thoughts?
Figure: Spatial down-sampling using convolution contra pooling (Researchgate)

Adding padding is NOT an "absolute must". Sometimes it can be useful to control the size of the output so that it is not reduced by the convolution (it can also enlarge the output, depending on its size and the kernel size). The only information that zero padding adds is the border (or near-border) condition for pixels at the limits of the input, also depending on kernel size. (You can think of it as a "passe-partout" in a picture frame.)
Pooling is of MUCH MORE IMPORTANCE in convnets. Pooling is not exactly "down-sampling" or "losing spatial information". Consider first that the kernel calculations have been made prior to pooling, with full spatial information. Pooling reduces dimension but keeps -hopefully- the information learnt by the kernels beforehand. And, by doing so, it achieves one of the most interesting things about convnets: robustness to displacement, rotation or distortion of the input. A feature, once learnt, is located even if it appears in another position or with distortions. Pooling also implies learning across increasing scale, discovering -again, hopefully- hierarchical patterns at different scales. And of course, also necessary in convnets, pooling keeps computation feasible as the number of layers grows.
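To make the two effects concrete, here is a small numpy/scipy sketch (not from the answer above, just an illustration): "same" zero padding keeps the spatial size of a 3x3 convolution, while 2x2 max pooling halves it.

```python
# Toy illustration: "same" zero padding preserves the 6x6 size of the
# convolution output, while 2x2 max pooling (stride 2) halves it.
import numpy as np
from scipy.signal import convolve2d

img = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
kernel = np.ones((3, 3)) / 9.0                   # toy 3x3 kernel

valid = convolve2d(img, kernel, mode='valid')    # no padding   -> 4x4
same = convolve2d(img, kernel, mode='same')      # zero padding -> 6x6

# 2x2 max pooling with stride 2: split into blocks and take each block's max.
pooled = same.reshape(3, 2, 3, 2).max(axis=(1, 3))   # 6x6 -> 3x3

print(valid.shape, same.shape, pooled.shape)     # (4, 4) (6, 6) (3, 3)
```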

This question has bothered me for a while too, and I have also seen some papers mention the same issue. Here is a recent paper I found: Recombinator Networks: Learning Coarse-to-Fine Feature Aggregation. I have not fully read the paper, but it seems to address your question. I can update this answer as soon as I fully grasp the paper.

Related

Determining Lost Image Quality Through Lossy Compression

I recently came upon a question that I haven't seen anywhere else while searching about lossy compression. Can you determine the quality lost through a certain algorithm? I have been asking around and it seems that there isn't a sure way to determine the quality lost compared to an original image, and that it can only be judged by the naked eye. Is there an algorithm that shows the percentage lost or blended?
I would really appreciate it if someone could give me some insight into this matter.
You can use lots of metrics to measure quality loss. But, of course, each metric will interpret quality loss differently.
One direction, following the suggestion already commented, would be to use something like the Euclidean distance or the mean squared error between the original and the compressed image (considered as vectors). There are many more metrics of this "absolute" kind.
The above will indicate a certain quality loss but the result may not correlate with human perception of quality. To give more weight to perception you can inspect the structural similarity of the images and use the structural similarity index measure (SSIM) or one of its variants. Another algorithm in this area is butteraugli.
In Python, for instance, there is an implementation of SSIM in the scikit-image package, see this example.
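For example, here is a minimal sketch with scikit-image (assuming a version >= 0.16, where these metrics live in skimage.metrics; the noisy "compressed" image is synthetic, just to show the calls):

```python
import numpy as np
from skimage import data, img_as_float
from skimage.metrics import mean_squared_error, structural_similarity

original = img_as_float(data.camera())           # stand-in for the original image
# Simulate a lossy version by adding noise; replace with your decoded image.
rng = np.random.default_rng(0)
compressed = np.clip(original + rng.normal(0, 0.05, original.shape), 0, 1)

mse = mean_squared_error(original, compressed)                       # "absolute" metric
ssim = structural_similarity(original, compressed, data_range=1.0)   # perceptual metric

print(f"MSE:  {mse:.5f}")
print(f"SSIM: {ssim:.4f}  (1.0 means identical)")
```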
The mentioned metrics have in common that they do not return a percentage. If this is crucial to you, another conversion step will be necessary.

How does upsampling in a Fully Convolutional Network (FCN) work?

I read several posts/articles and have some doubts about the mechanism of upsampling after the CNN downsampling.
I took the 1st answer from this question:
https://www.quora.com/How-do-fully-convolutional-networks-upsample-their-coarse-output
I understood that similar to normal convolution operation, the "upsampling" also uses kernels which need to be trained.
Question 1: if the "spatial information" is already lost during the first stages of the CNN, how can it be reconstructed in any way?
Question 2: Why is it that "Upsampling from a small (coarse) feature map deep in the network has good semantic information but bad resolution. Upsampling from a larger feature map closer to the input will produce better detail but worse semantic information"?
Question #1
Upsampling doesn't (and cannot) reconstruct any lost information. Its role is to bring the resolution back up to that of an earlier layer.
Theoretically, we could eliminate the down/up sampling layers altogether. However, to reduce the number of computations, we can downsample the input before a layer and then upsample its output.
Therefore, the sole purpose of down/up sampling layers is to reduce computations in each layer, while keeping the dimension of input/output as before.
You might argue that down-sampling causes information loss. That is always a possibility, but remember that the role of a CNN is essentially to extract "useful" information from the input and reduce it into a smaller dimension.
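As a rough illustration of why nothing is reconstructed, here is a minimal numpy/scipy sketch of how a stride-2 transposed convolution (the trained-kernel upsampling mentioned in the question) works: zeros are interleaved between the coarse values, then an ordinary convolution with a (normally learned) kernel interpolates; the exact output size depends on the padding convention.

```python
import numpy as np
from scipy.signal import convolve2d

coarse = np.array([[1.0, 2.0],
                   [3.0, 4.0]])          # 2x2 "deep" feature map

# Step 1: interleave zeros (stride 2) -> 4x4 grid with gaps.
dilated = np.zeros((4, 4))
dilated[::2, ::2] = coarse

# Step 2: convolve with a kernel. A bilinear kernel is a common initialization;
# in a real network these weights would be learned by backprop.
kernel = np.array([[0.25, 0.5, 0.25],
                   [0.5,  1.0, 0.5],
                   [0.25, 0.5, 0.25]])

upsampled = convolve2d(dilated, kernel, mode='same')
print(upsampled)   # 4x4 map: detail is interpolated, not recovered
```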
Question #2
As we go from the input layer in CNN to the output layer, the dimension of data generally decreases while the semantic and extracted information hopefully increases.
Suppose we have a CNN for image classification. In such a CNN, the early layers usually extract the basic shapes and edges in the image. The next layers detect more complex concepts like corners and circles. You can imagine that the very last layers might have nodes that detect very complex features (like the presence of a person in the image).
So up-sampling from a large feature map close to the input produces better detail but carries lower semantic information compared to the last layers. Conversely, the last layers generally have lower dimension, hence their resolution is worse compared to the early layers.

Looking for a suggested Clustering technique

I have a series (let's say 1000) of images of a biological sample...living cells. Over this series, the data for each pixel will describe a time variant "wave", if you will, giving the measure of light intensity vs time. After performing an FFT for this wave, I'll have the frequency content and phase for each pixel.
My goal is to be able to find all the pixels that are measuring a single cell, and was wondering if some sort of clustering technique would give me what I'm looking for. After some research (I know almost nothing of cluster analysis) looking at KMeans, DBSCAN, and a few others, I'm unsure how to proceed.
Here are my criteria:
a cluster should consist of connected pixels, with a maximum size of around 9-12 pixels (this is defined by the actual size of the cell in the field of view). Putting more pixels in a cluster likely means that the cluster contains more than one cell, and I'd prefer each cluster to represent a single cell.
the cells are signalling (glowing) with some frequency/phase. These are not necessarily in sync, so I think that this might be useful in segregating the cells/clusters.
there is an unknown number of cells in each image, so an unknown number of clusters.
the images are segmented into smaller, sub-images for analysis (the reason for this is not relevant here). These sub-images are to be analyzed separately for clusters. The sub-images are about 100 x 100 pixels.
Any suggestions would be greatly appreciated. I'm just looking for help getting pointed in the right direction.
Probably the most flexible is the classic old hierarchical agglomerative clustering (HAC). For some reason, people always overlook this powerful method and prefer the much more limited k-means.
HAC is very nice to parameterize. It needs a distance or similarity (few requirements here - it probably should be symmetric, but no triangle inequality is necessary). And with the linkage you can control the cluster shape or diameters nicely. For example, with complete linkage you can control the maximum diameter of a cluster. This is probably useful here, and is my suggestion.
The main drawbacks of HAC are (1) scalability: at 50,000 instances it will be slow and use too much memory, and of course (2) that you need to know what you want to do: you need to choose the distance, the linkage, and where to cut the dendrogram. With k-means, you only need to choose k to get a (bad) result.
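As a starting point, here is a minimal HAC sketch with scipy (the feature layout - pixel position plus FFT-derived frequency and phase - is a hypothetical choice, and the data is synthetic, just to show the calls):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Assume each pixel of a 100x100 sub-image is described by its (row, col)
# position plus FFT-derived features such as dominant frequency and phase.
rng = np.random.default_rng(0)
n_pixels = 500                                   # toy subset for illustration
features = np.column_stack([
    rng.integers(0, 100, n_pixels),              # row
    rng.integers(0, 100, n_pixels),              # col
    rng.random(n_pixels),                        # dominant frequency (normalized)
    rng.random(n_pixels),                        # phase (normalized)
])

dists = pdist(features, metric='euclidean')      # any symmetric distance works
Z = linkage(dists, method='complete')            # complete linkage bounds the diameter

max_diameter = 5.0                               # tune to roughly the cell size in feature space
labels = fcluster(Z, t=max_diameter, criterion='distance')
print(len(np.unique(labels)), "clusters found")
```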
DBSCAN is a great algorithm, but in your case it is likely to form clusters containing multiple cells. So I'd rather try OPTICS instead, which may be able to discover substructures where DBSCAN only sees a large blob.
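For comparison, here is a minimal OPTICS sketch with scikit-learn (again on synthetic features, just to show the call); a label of -1 marks noise points:

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(1)
features = rng.random((500, 4))          # (row, col, frequency, phase) per pixel

labels = OPTICS(min_samples=5).fit_predict(features)   # tune min_samples to expected cell size
print("clusters:", labels.max() + 1, "noise pixels:", int((labels == -1).sum()))
```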

How do I determine processor speed required for optical flow?

I'd like to use an optical flow system to get velocities from the surrounding environment. I've read papers about how optical flow works, but they don't cover details about optical sensors.
My question is: How do I determine how much computational power is required to perform optical flow analysis?
I'd like to use a low-power system (like microcontrollers), but I don't know what kind of camera I could use with such a system. I mean, could it be color or does it need to be B/W? Rolling shutter or global shutter? Which frame rate or number of pixels?
I'd like to specify the system myself but, without knowing how those camera attributes impact the processing load, I'm not sure where to start.
As Chuck already said in the comments, you first need to start with something. Optical-flow calculation really depends on what you are using it for and what you are trying to achieve. For real-time applications you might want to consider using faster processors (this is always true, though).
Continuing with my answer.
Optical-flow calculation performance depends on a few main things:
The optical-flow method you choose (dense or sparse); you can read more about it here and here. Of course, you should take into account not only that sparse is faster than dense, but also that sparse might be less accurate in some cases. Again, this depends on what you're trying to achieve.
In addition, you will see that there are different optical-flow algorithms. Some might be faster than others. There are many algorithms such as Lucas-Kanade, Horn-Schunck, TVL1, Farneback, etc.
Most optical-flow methods from libraries such as OpenCV give you the ability to change some parameters in order to play with the trade-off between accuracy and performance. See this, and also check OpenCV methods such as this and this for example - note the different arguments.
The resolution of your image. A smaller image usually means faster calculation.
A few things you might also want to consider:
If you are using a processor that has multiple cores, make sure that you are using all the cores in the optical-flow calculation. Some libraries may already do this for you, but in some cases you will need to do it yourself. Take a look at my question and answer in this post; it might give you some ideas and help you get started with such a case.
If you want more accurate optical-flow results you should use a global shutter camera. Rolling shutter cameras, such as most webcams, will give you an extra error you don't want.
You don't need a color image; if you have a grayscale camera it will be even better. If not, you will need to convert to grayscale (not B/W) for faster performance as well.
Some libraries such as OpenCV have an option (in some cases) to run these algorithms on a GPU. If using a GPU is an option, you might want to consider it as well.
From my own experience, the main thing that gave me a performance boost was changing my resolution from 640x480 to 320x240 and even 160x120. In my case it didn't really hurt the accuracy.
I used an Odroid U3 mini-PC with OpenCV's PyrLK algorithm and input frames of 320x240 resolution. After applying what's described here (splitting the image into 4 for parallel calculation) it worked pretty well (real-time).
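For reference, here is a minimal sketch of that kind of sparse pyramidal LK setup with OpenCV (the file names are placeholders; winSize and maxLevel are the knobs that trade accuracy against speed):

```python
import cv2
import numpy as np

prev = cv2.imread('frame0.png', cv2.IMREAD_GRAYSCALE)   # grayscale inputs
curr = cv2.imread('frame1.png', cv2.IMREAD_GRAYSCALE)

# Track corners; fewer points and a smaller window mean less computation.
p0 = cv2.goodFeaturesToTrack(prev, maxCorners=100, qualityLevel=0.3, minDistance=7)

p1, status, err = cv2.calcOpticalFlowPyrLK(
    prev, curr, p0, None,
    winSize=(15, 15),      # averaging window: accuracy vs. speed trade-off
    maxLevel=2             # pyramid levels: handle larger motions
)

mask = status.flatten() == 1
flow = (p1[mask] - p0[mask]).reshape(-1, 2)   # per-point displacement vectors
print(flow.mean(axis=0))                      # rough average motion between frames
```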
The answer given by Sarid has some strong points, and many of them are shared by researchers around the world. My opinions are shared by anyone who has actually worked with these topics in a real-world setting... and by real world, I mean implementing optical flow in drones, on mobile phones and in IP cameras that are not sitting in a protected office, and where other systems (such as humans) need to interact and be co-dependent.
First of all, depending on your problem, you may want to invest time in looking for ready-made solutions. Optical flow sensors are readily available, cheap and robust (but usually not strong in accuracy). These are the kind of sensors you find in optical mice. They are low power, and easily interfaced with micro-controllers. Some have staggering sample rates of thousands of fps. They commonly have low spatial resolution however, and (to emphasize) high robustness but low accuracy.
If instead you are looking for the kind of optical flow that can be used for shape from motion, pedestrian detection or video encoding, for example, then you are probably better off looking for something more advanced, and that's where Sarid's answer becomes relevant.
Since your question has been migrated from the Robotics Stack Exchange, I am going to assume you are interested in applications close to machine control and human-machine interaction. In that case, the most important aspects are the ones usually most ignored by people working in the field of optical flow estimation, namely:
Latency. If you have a human interfacing at the front end, then the common term is "glass-to-glass latency". This is completely different from the fps of your system, which relates to throughput. If you find that you are in a discussion with someone and they do not understand the difference between latency and fps, then they are not the expert you are interested in. For example, almost all researchers in computer vision who do GPU implementations of optical flow add massive latency by allowing for frame delays and inefficient memory handling (inefficient from the perspective of latency, but efficient in terms of throughput and hardware utilization). Consider the problem of controlling a drone, say making it self-stabilizing: it is better to receive a bad optical flow estimate 10 ms earlier than a good one with 10 ms of extra delay... especially if the optical system does not give you any upper bound on the delay at any given time.
Algorithm stability. This is completely different from accuracy. Accuracy is what 99% of all research in optical flow has been obsessing about for the last 30 years. Stability is not at all something evaluated in the Middlebury benchmark, for example. Stability deals with whether small changes in your data guarantee small changes in the estimated optical flow. While some good work has been done in the community (on robust statistics, most interestingly), in the end the final evaluation of any algorithm disregards stability. Consider the optical mouse as a good example. The first generations of optical mice had higher accuracy (the average error from the true motion was smaller) but lower stability (especially when you ran the mice over "bad" textures, or with rotational motions). Later generations of optical mice have worse accuracy but focus on stability, as that is the most important thing. You don't experience the mouse cursor jumping around as much as you did in the early days of the devices... but if you move the mouse on your mat, left and right repeatedly, you will see the cursor slowly drifting (i.e. low accuracy).
Heat. Any device that estimates high-accuracy optical flow will require lots of computation. When it comes to computations per watt, GPUs are not that good. In drones you may be able to get away with this, because it is a setting where you have active cooling as a by-product of the propulsion system. In the real world, you most often cannot assume active cooling or an unlimited power supply.
To conclude, it's a fascinating area, and I hope you have a great experience coding solutions.

Lucas Kanade Optical Flow, Direction Vector

I am working on optical flow, and based on the lecture notes here and some samples on the Internet, I wrote this Python code.
All code and sample images are there as well. For small displacements of around 4-5 pixels, the direction of the calculated vector seems fine, but the magnitude of the vector is too small (that's why I had to multiply u, v by 3 before plotting them).
Is this because of a limitation of the algorithm, or an error in the code? The lecture notes shared above also say that the motion needs to be small ("u, v are less than 1 pixel"); maybe that's why. What is the reason for this limitation?
#belisarius says: "LK uses a first order approximation, and so (u,v) should be ideally << 1, if not, higher order terms dominate the behavior and you are toast."
A standard conclusion from the optical flow constraint equation (OFCE, slide 5 of your reference) is that "your motion should be less than a pixel, lest higher order terms kill you". While technically true, you can overcome this in practice by using larger averaging windows. This requires that you do sane statistics, i.e. not the pure least-squares means suggested in the slides. Equally fast computations, and far superior results, can be achieved with Tikhonov regularization. This necessitates setting a tuning value (the Tikhonov constant). This can be done as a global constant, or by letting it adjust to local information in the image (such as the Shi-Tomasi confidence, a.k.a. the structure tensor determinant).
Note that this does not replace the need for multi-scale approaches in order to deal with larger motions. It may extend the range a bit for what any single scale can deal with.
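To make the regularized version concrete, here is a toy numpy sketch of the single-window LK step: instead of solving (A^T A) v = A^T b by pure least squares, we solve (A^T A + lambda*I) v = A^T b, which stabilizes the estimate when the window has little texture (the derivative arrays here are random placeholders, just to show the call):

```python
import numpy as np

def lk_flow_tikhonov(Ix, Iy, It, lam=1e-2):
    """Estimate (u, v) for one window from image derivatives."""
    A = np.column_stack([Ix.ravel(), Iy.ravel()])     # N x 2 gradient matrix
    b = -It.ravel()                                   # N temporal derivatives
    AtA = A.T @ A + lam * np.eye(2)                   # Tikhonov-damped normal equations
    return np.linalg.solve(AtA, A.T @ b)              # (u, v)

# Hypothetical 5x5 window derivatives, just to demonstrate the call:
rng = np.random.default_rng(0)
Ix, Iy, It = rng.normal(size=(3, 5, 5))
u, v = lk_flow_tikhonov(Ix, Iy, It)
print(u, v)
```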
Implementations, visualizations and code are available in tutorial format here, albeit in Matlab rather than Python.