In order to integrate a place recognition algorithm into a Bayesian framework, we have to estimate its uncertainty, expressed as a covariance matrix.
The place recognition algorithm contains a database of geo-tagged images and, for a given query, returns the best-match image j and its global coordinates. What is the best way to estimate the uncertainty in the returned image j?
Currently, I am working on a short project about stereo vision.
I'm trying to create depth maps of a scene. For this, I use my phone from two viewpoints and follow the code/workflow provided by MATLAB: https://nl.mathworks.com/help/vision/ug/uncalibrated-stereo-image-rectification.html
Following this code I am able to create nice disparity maps, but I want to know the depths (in meters). For this, I need the baseline, focal length and disparity, as shown here: https://www.researchgate.net/figure/Relationship-between-the-baseline-b-disparity-d-focal-length-f-and-depth-z_fig1_2313285
The focal length and base-line are known, but not the baseline. I have estimated the Fundamental Matrix. Is there a way to get from the Fundamental Matrix to the baseline, or, by making some assumptions, to get to the Essential Matrix and from there to the baseline?
I would be thankful for any hint in the right direction!
"The focal length and base-line are known, but not the baseline."
I guess you mean the disparity map is known.
Without a known or estimated calibration matrix, you cannot determine the essential matrix.
(See Multiple View Geometry in Computer Vision by Hartley and Zisserman for details.)
With respect to your available data, you cannot compute a metric reconstruction. From the fundamental matrix, you can only extract camera matrices in a canonical form that allow for a projective reconstruction and will not reproduce the true baseline of the setup. A projective reconstruction is one that differs from the metric result by an unknown projective transformation.
Non-trivial self-calibration techniques can upgrade such a reconstruction to a Euclidean one. However, their success strongly depends on the quality of the data. Thus, using images from a calibrated camera is really the best way to go.
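If you can at least approximate the calibration matrix K (e.g. the focal length from EXIF data and the image centre as principal point), a minimal MATLAB sketch of the upgrade from the fundamental matrix F to an essential matrix and a baseline direction could look like this; f, cx, cy and F are assumed to be given, and only the direction of the baseline is recovered, never its metric length:
K = [f 0 cx; 0 f cy; 0 0 1];     % assumed (approximate) intrinsics
E = K' * F * K;                  % essential matrix from the fundamental matrix
[U, ~, V] = svd(E);
if det(U) < 0, U = -U; end       % enforce proper orthogonal factors
if det(V) < 0, V = -V; end
W  = [0 -1 0; 1 0 0; 0 0 1];
R1 = U*W*V';  R2 = U*W'*V';      % two candidate rotations
t  = U(:, 3);                    % baseline direction (unit vector); its sign and metric scale are not determined by F or E
% The correct (R, t) combination among {R1, R2} x {t, -t} is the one that places triangulated points in front of both cameras.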
Typically, a Gaussian or uniform prior with zero mean and unit variance is used to generate the initial random vectors for the generator model of a Generative Adversarial Network (GAN); the size of each vector can be chosen arbitrarily, e.g. 100.
Let's say we have 1000 training images and a batch size of 64. Then, in each epoch, we need to generate random vectors from the prior distribution for each image in every mini-batch. But the problem I see is that, since there is no mapping between a random vector and a corresponding image, the same image can be generated from multiple initial random vectors. This paper suggests overcoming the problem, to some extent, by using spherical interpolation.
So what will happen if I initially generate one random vector per training image and, when training the model, always reuse the same random vectors that were generated initially?
In GANs, the random seed used as input does not actually correspond to any real input image. What GANs actually do is learn a transformation function from a known noise distribution (e.g. Gaussian) to a complex unknown distribution, which is represented by i.i.d. samples (e.g. your training set). What the discriminator in a GAN does is calculate a divergence (e.g. Wasserstein divergence, KL-divergence, etc.) between the generated data (e.g. the transformed Gaussian) and the real data (your training data). This is done in a stochastic fashion, and therefore no link is necessary between the real and the fake data. If you want to learn more about this on a hands-on example, I can recommend training a Wasserstein GAN to transform one 1D Gaussian distribution into another one. There you can visualize the discriminator and its gradient and really see the dynamics of such a system.
Anyway, what your paper is describing applies after you have trained your GAN, when you want to see how it maps the known noise space to the unknown image space. For this reason, interpolation schemes such as the spherical one you are quoting have been invented. They also show that the GAN has learned to map some parts of the latent space to key characteristics of images, like smiles. But this has nothing to do with the training of GANs.
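For reference, a minimal MATLAB sketch of spherical linear interpolation (slerp) between two latent vectors, as used in such post-hoc latent-space walks; the 100-dimensional prior and the trained generator G are assumptions here:
z1 = randn(100, 1);  z2 = randn(100, 1);              % two latent samples from the prior
omega = acos( dot(z1, z2) / (norm(z1)*norm(z2)) );    % angle between the two vectors
t = linspace(0, 1, 9);                                % interpolation steps
for k = 1:numel(t)
    zk = ( sin((1-t(k))*omega)*z1 + sin(t(k)*omega)*z2 ) / sin(omega);
    % imshow(G(zk));   % feed each interpolated vector to the trained generator (hypothetical G)
end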
I found a MATLAB implementation of the LKT algorithm here, and it is based on the brightness constancy equation.
The algorithm calculates the image gradients in the x and y directions by convolving the image with appropriate 2x2 horizontal and vertical edge gradient operators.
The brightness constancy equation in the classic literature has, on its right-hand side, the difference between two successive frames.
However, in the implementation referred to by the aforementioned link, the right-hand side is a difference of convolutions:
It_m = conv2(im1,[1,1;1,1]) + conv2(im2,[-1,-1;-1,-1]);
Why couldn't It_m be simply calculated as:
it_m = im1 - im2;
As you mentioned, in theory only a pixel-by-pixel difference is required for the optical flow computation.
In practice, however, all natural (non-synthetic) images contain some degree of noise, and differentiation acts as a kind of high-pass filter, so it amplifies the noise relative to the signal.
Therefore, to avoid artifacts caused by noise, image smoothing (low-pass filtering) is usually carried out prior to any differentiation (the same is done in edge detection). The code does exactly this, i.e. it applies a moving-average filter to each image to reduce the effect of noise:
It_m = conv2(im1,[1,1;1,1]) + conv2(im2,[-1,-1;-1,-1]);
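To see the effect, here is a small sketch with synthetic noise (phantom from the Image Processing Toolbox is just a convenient test image) comparing the plain difference with a normalized 2x2-averaged difference of two noisy observations of the same frame; the smoothed difference should show roughly a quarter of the noise variance:
im  = phantom(128);                        % any test image will do
im1 = im + 0.05*randn(size(im));           % two noisy observations of the same scene
im2 = im + 0.05*randn(size(im));
d_raw    = im1 - im2;                                                        % plain pixel-wise difference
d_smooth = conv2(im1, ones(2)/4, 'valid') - conv2(im2, ones(2)/4, 'valid');  % 2x2 moving average before differencing
[var(d_raw(:)), var(d_smooth(:))]          % the smoothed difference has roughly 1/4 of the noise variance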
(Comments converted to an answer.)
In theory, there is nothing wrong with taking a pixel-wise difference:
Im_t = im1-im2;
to compute the time derivative. Using a spatial smoother when computing the time derivative mitigates the effect of noise.
Moreover, looking at the way that code computes spatial (x and y) derivatives:
Ix_m = conv2(im1,[-1 1; -1 1], 'valid');
computing the time derivative with a similar kernel and the 'valid' option ensures that the matrices Ix_m, Iy_m and It_m have compatible sizes.
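For concreteness, a sketch of a consistent set of 2x2 kernels with the 'valid' option (im1 and im2 are assumed to be grayscale frames of equal size; this mirrors, rather than quotes, the linked code):
Ix_m = conv2(im1, [-1 1; -1 1], 'valid');                             % spatial derivative along x
Iy_m = conv2(im1, [-1 -1; 1 1], 'valid');                             % spatial derivative along y
It_m = conv2(im1, ones(2), 'valid') + conv2(im2, -ones(2), 'valid');  % temporal derivative with the same 2x2 support
% With 'valid', all three results have size (size(im1) - 1), so they can be stacked directly into the Lucas-Kanade equations.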
The temporal partial derivative (along t) is connected to the spatial partial derivatives (along x and y).
Think of the video sequence you are analyzing as a spatio-temporal volume. At any given point (x, y, t), if you want to estimate the partial derivatives, i.e. the 3D gradient at that point, then you will benefit from having three filters with the same kernel support.
For more theory on why this should be so, look up steerable filters, or better yet revisit the fundamental concept of what a partial derivative is supposed to be and how it connects to directional derivatives.
Often, the 2D gradient is estimated first, and the temporal derivative is then treated as if it were independent of the x and y components. This can, and very often does, lead to numerical errors in the final optical flow calculations. The common way to deal with those errors is to do a forward and a backward flow estimation and combine the results in the end.
One way to think of the gradient you are estimating is that it has a 3D support region. The smallest such region is 2x2x2.
If you compute the 2D gradients in the first and the second image, both using only 2x2 filters, then the corresponding FIR filter for the 3D volume is obtained by averaging the results of the two filters, as sketched below.
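A minimal sketch of such 2x2x2 derivative estimates, assuming im1 and im2 are two consecutive grayscale frames of the same size:
kx = [-1 1; -1 1] / 4;   ky = [-1 -1; 1 1] / 4;   ks = ones(2) / 4;   % 2x2 derivative and smoothing kernels
Ix = conv2(im1, kx, 'valid') + conv2(im2, kx, 'valid');   % x-derivative averaged over both frames
Iy = conv2(im1, ky, 'valid') + conv2(im2, ky, 'valid');   % y-derivative averaged over both frames
It = conv2(im2, ks, 'valid') - conv2(im1, ks, 'valid');   % t-derivative with the same 2x2x2 support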
The fact that you should have the same filter support region in 2D is clear to most: that's why the Sobel and Scharr operators look the way they do.
You can see the sort of results you get from having sanely designed differential operators for optical flow in this MATLAB toolbox that I made, in part to show this particular point.
My aim is to classify the data into two sections, upper and lower, by finding the midline of the peaks.
I would like to apply machine learning methods, e.g. discriminant analysis.
Could you let me know how to do that in MATLAB?
It seems that what you are looking for is a GMM (Gaussian mixture model). With K=2 (the number of mixture components) and dimension equal to 1, this is a simple, fast method which gives you a direct solution. Given the fitted components, it is easy to find the local minimum of the density between them analytically (roughly a weighted average of the two means, with weights proportional to the standard deviations).
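A minimal MATLAB sketch, assuming your samples are in a column vector x and that the Statistics and Machine Learning Toolbox is available for fitgmdist:
gm    = fitgmdist(x, 2);                      % fit a 2-component 1-D Gaussian mixture
mu    = gm.mu;                                % component means
sigma = sqrt(squeeze(gm.Sigma));              % component standard deviations
grid  = linspace(min(mu), max(mu), 1000).';   % search between the two means
[~, idx] = min(pdf(gm, grid));                % local minimum of the mixture density
midline  = grid(idx);
isUpper  = x > midline;                       % split the data into the upper and lower sections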
I am using PCA for face recognition. I have obtained the eigenvectors/eigenfaces for each image, each of which is a column matrix. I want to know whether selecting the first three eigenvectors, since their corresponding eigenvalues amount to 70% of the total variance, will be sufficient for face recognition.
Firstly, let's be clear about a few things. The eigenvectors are computed from the covariance matrix formed from the entire dataset, i.e. you reshape each grayscale face image into a single column and treat it as a point in R^d, compute the covariance matrix from all of these points, and compute the eigenvectors of that covariance matrix. These eigenvectors become a new basis for your space of face images. You do not have eigenvectors for each image. Instead, you represent each face image in terms of the eigenvectors by projecting it onto (possibly a subset of) them.
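A minimal sketch of that pipeline, assuming X is a d-by-n matrix whose columns are the n reshaped grayscale face images:
mu = mean(X, 2);                       % mean face
Xc = X - mu;                           % center the data (implicit expansion, R2016b or later)
[U, S, ~] = svd(Xc, 'econ');           % columns of U are the eigenfaces (eigenvectors of the covariance matrix)
k  = 3;                                % number of eigenfaces to keep
W  = U(:, 1:k)' * Xc;                  % k-dimensional representation of every face
varRetained = sum(diag(S(1:k, 1:k)).^2) / sum(diag(S).^2);   % fraction of the total variance retained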
Limitations of eigenfaces
Whether the representation of your face images under this new basis is good enough for face recognition depends on many factors. But in general, the eigenfaces method does not perform well on real-world, unconstrained faces. It only works for faces which are pixel-wise aligned, frontal, and fairly uniformly illuminated across the images.
More is not necessarily better
While it is commonly believed (when using PCA) that retaining more variance is better than less, things are more complicated than that because of two factors: 1) noise in real-world data and 2) the dimensionality of the data. Sometimes projecting to a lower dimension and losing variance can actually produce better results.
Conclusion
Hence, my answer is that it is difficult to say beforehand whether retaining a certain amount of variance is enough. The number of dimensions (and hence the number of eigenvectors to keep and the associated variance retained) should be determined by cross-validation. But ultimately, as I mentioned above, eigenfaces is not a good method for face recognition unless you have a "nice" dataset. You might be slightly better off using "Fisherfaces", i.e. LDA on the face images, or combining these methods with Local Binary Pattern (LBP) features (instead of raw face pixels). But seriously, face recognition is a difficult problem, and in general the state of the art has not reached a stage where it can be deployed in real-world systems.
It's not impossible, but it seems a little unusual to me that only 3 eigenvalues account for 70% of the variance. How many training samples do you have (and what is the total dimension)? Make sure you reshape each image in the database into a vector, normalize the vectors, and then stack them into a matrix. The eigenvalues/eigenvectors are obtained from the covariance matrix of this data matrix.
In theory, 70% of the variance should be enough to form a human-recognizable face from the corresponding eigenvectors. However, the optimal number of eigenvectors is better determined by cross-validation: increase the number of eigenvectors one at a time and observe the reconstructed faces and the recognition accuracy. You can even plot the cross-validation accuracy curve; there may be a sharp corner on it, and the corresponding number of eigenvectors is the one to use in your test, as sketched below.
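A minimal sketch of such a sweep, assuming U holds the eigenvectors as columns, Xc and Xval_c are centered training and validation matrices with one face per column, and yTrain and yVal are numeric label vectors (fitcknn needs the Statistics and Machine Learning Toolbox):
kMax = 50;
acc  = zeros(1, kMax);
for k = 1:kMax
    trainFeat = (U(:, 1:k)' * Xc)';        % k-dimensional training features, one face per row
    valFeat   = (U(:, 1:k)' * Xval_c)';    % k-dimensional validation features
    mdl  = fitcknn(trainFeat, yTrain);     % simple 1-nearest-neighbour classifier
    pred = predict(mdl, valFeat);
    acc(k) = mean(pred == yVal);
end
plot(1:kMax, acc);  xlabel('number of eigenvectors');  ylabel('validation accuracy');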