Given a calibrated stereo pair, the following things are known:
Camera Intrinsics
Essential Matrix
Relative transformation
A set of keypoint matches (matches satisfy epipolar constraint)
I want to filter out wrong matches by "projecting" the orientation of one keypoint to the other image and comparing it to the orientation of the matched keypoint.
My solution idea is the following:
Given the match (p1,p2) with orientations (o1,o2), I compute the depth z of p1 by triangulation. I now create a second point close to p1, shifted a few pixels along the orientation vector: p1' = p1 + o1. After that, I compute the 3D point of p1' using the same depth z and project it back to image 2, yielding p2'. The projected orientation is then o2' = p2' - p2, which I compare to o2.
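A minimal sketch of that procedure (assuming P1 = K*[I|0] and P2 = K*[R|t] are the projection matrices, X is the 3D point triangulated from the match (p1, p2), and o1, o2 are unit orientation vectors in pixels; the 3-pixel shift is arbitrary):
shift = 3;                          % shift a few pixels along the orientation
p1s   = p1 + shift * o1;            % shifted keypoint p1'
z     = X(3);                       % depth of the triangulated point (camera 1)
K     = P1(:, 1:3);                 % with P1 = K*[I|0] the left 3x3 block is K
ray   = K \ [p1s; 1];               % viewing ray of p1' in camera-1 coordinates
Xs    = ray / ray(3) * z;           % 3D point of p1', assuming the same depth z
q     = P2 * [Xs; 1];               % project into image 2
p2s   = q(1:2) / q(3);              % p2'
o2hat  = (p2s - p2) / norm(p2s - p2);          % predicted orientation in image 2
angErr = acosd(max(-1, min(1, o2hat' * o2)));  % angular difference to the measured o2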
Does that algorithm work? Are there better ways (for example using the essential matrix)?
While your idea sounds very interesting at first, I don't think it can work, because your way of computing the depth of p1' will inevitably lead to wrong keypoint orientations in the second image. Consider this example I came up with:
Assume that p1 is back-projected to Q. Now, you said that since you can't know the depth of p1', you set it to z, thus back-projecting p1' to Q'. However, imagine that the true depth corresponding to p1' is the point shown in green, Q_t. In that case, the correct orientation in the second image is c-b, while with your solution we have computed a-b, which is a wrong orientation.
A better solution, in my opinion, is to fix the pose of one of the two cameras, triangulate all the matches that you have, and do a small bundle adjustment (preferably using a robust kernel) in which you optimize all the points but only the non-fixed camera. This should take care of a lot of outliers. It will change your estimate of the essential matrix, though, but I think it is likely to improve it.
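If a full bundle adjustment is more effort than you want, even a plain triangulate-and-check-reprojection-error filter already removes many outliers. This is not the bundle adjustment itself, just a rough linear-triangulation sketch (assuming P1, P2 are the fixed 3x4 projection matrices from your calibration and pts1, pts2 are the matched keypoints as 2xN arrays; the 2-pixel threshold is arbitrary):
thresh = 2;
N = size(pts1, 2);
keep = false(1, N);
for i = 1:N
    % linear (DLT) triangulation of match i
    A = [pts1(1,i) * P1(3,:) - P1(1,:);
         pts1(2,i) * P1(3,:) - P1(2,:);
         pts2(1,i) * P2(3,:) - P2(1,:);
         pts2(2,i) * P2(3,:) - P2(2,:)];
    [~, ~, V] = svd(A);
    X = V(:, end);                        % homogeneous 3D point
    x1 = P1 * X;  x1 = x1(1:2) / x1(3);   % reproject into both images
    x2 = P2 * X;  x2 = x2(1:2) / x2(3);
    err = norm(x1 - pts1(:,i)) + norm(x2 - pts2(:,i));
    keep(i) = err < thresh;               % accept only small reprojection error
end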
Edit:
The example above used large distances for visibility, and ignored the fact that a, b and c are not necessarily collinear. However, assume that p1' is close enough to p1, so that Q' is close to Q. I think we can agree that most of the matches that passed the test would be in a configuration similar to this:
In that case, c and a both lie on the epipolar line given by the projection of Q' and camera center 1 in camera 2. But, b is not on that line (it is on the epipolar line corresponding to Q). So, the vectors a-b and c-b will be different by some angle.
But there are also two other issues with the method, related to this question: how do you determine the size of the vector o1? I assume it would be a good idea to define it as some_small_coef*(1/z), because o1 will need to be smaller for distant objects. So, the two other problems are:
If you are in an urban setting with, for example, buildings that are a bit far away, z grows and the size of o1 would need to become smaller than the width of one pixel.
Assuming you overcome that problem, the value of some_small_coef will need to be determined separately for different image pairs (what if you go from indoors to outdoors?).
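As an aside, since the question also asks about using the essential matrix directly: the usual consistency test with it is the epipolar point-to-line distance, sketched below (assuming a single intrinsic matrix K for both cameras and pixel coordinates p1, p2 as 2x1 vectors). The question states that the matches already satisfy the epipolar constraint, so this is mainly a reference for how E is normally used for filtering:
F  = inv(K)' * E * inv(K);                 % fundamental matrix from the known E and K
l2 = F * [p1; 1];                          % epipolar line a*x + b*y + c = 0 in image 2
d  = abs(l2' * [p2; 1]) / norm(l2(1:2));   % distance of p2 from that line, in pixels
% a small d means the match is consistent with the epipolar geometry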
I am reviewing some MATLAB code that is publicly available at the following location:
https://github.com/mattools/matGeom/blob/master/matGeom/geom2d/orientedBox.m
This is an implementation of the rotating calipers algorithm on the convex hull of a set of points, used to compute an oriented bounding box. My goal was to understand intuitively how the algorithm works; however, I seek clarification on certain lines in the file that I am confused about.
On line 44: hull = bsxfun(@minus, hull, center);. This appears to translate all the points of the convex hull set so that the computed centroid is at (0,0). Is there any particular reason why this is done? My only guess would be that it allows straightforward rotational transforms later on in the code, as rotating about the real origin would cause significant problems.
On lines 71 and 74: indA2 = mod(indA, nV) + 1; and indB2 = mod(indB, nV) + 1;. Is this a trick to prevent the access index from going out of bounds? My guess is that it rolls the index over upon reaching the end, preventing out-of-bounds access.
On line 125: y2 = - x * sit + y * cot;. This is the correct transformation, as the code behaves properly, but I am not sure why it is used and why it differs from the other rotational transforms done later and also earlier (with the calls to rotateVector). My best guess is that I am simply not visualizing the required rotation correctly in my head.
Side note: The external function calls vectorAngle, rotateVector, createLine, and distancePointLine can all be found under the same repository, in files named after the function name (as per MATLAB standard). They are relatively uninteresting and do what you would expect aside from the fact that there is normalization of vector angles going on.
I'm the author of the above piece of code, so I can give some explanations about it:
First of all, the algorithm is indeed a rotating calipers algorithm. In the current implementation, only the width is tested (I did not check the west and east vertices). Actually, it seems the two results correspond most of the time.
Line 44 -> the goal of translating to the origin was to improve numerical accuracy. When a polygon is located far away from the origin, coordinates may be large and close together. Many computations involve products of coordinates. By translating the polygon around the origin, the coordinates are smaller, and the precision of the resulting products is expected to be improved. Well, to be honest, I have not demonstrated this effect directly; it is more a careful way of coding than a fix…
Lines 71-74 -> Yes. The idea is to find the index of the next vertex along the polygon. If the current vertex is the last vertex of the polygon, then the next vertex index should be 1. The modulo maps the index into the range 0 to nV-1, and the +1 brings it back to MATLAB's 1-based indexing, so the two lines ensure correct cyclic iteration.
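A tiny illustration of the wrap-around (hypothetical values, with nV = 5 vertices):
nV = 5;
indA = 5;                      % last vertex
indA2 = mod(indA, nV) + 1;     % = 1, wraps around to the first vertex
indB = 2;                      % an ordinary vertex
indB2 = mod(indB, nV) + 1;     % = 3, the usual "next vertex" case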
Line 125: there are several transformations involved. Using the rotateVector() function, one simply computes the minimal width for a given edge. On line 125, one rotates the points (of the convex hull) to align them with the "best" direction (the one that minimizes the width). The last change of coordinates (lines 132-140) is due to the fact that the center of the oriented box is different from the centroid of the polygon. Then we add a shift, which is corrected for the rotation.
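A small sketch of the two conventions as I understand them from the formulas (theta being the orientation of the "best" edge, and assuming rotateVector() applies the usual counter-clockwise rotation):
theta = pi / 6;                       % orientation of the "best" edge (example value)
cot = cos(theta);  sit = sin(theta);  % same abbreviations as in the code
x = 1;  y = 2;                        % an example point, already centered on the centroid
% forward rotation by +theta (the rotateVector() convention):
x2f =  x * cot - y * sit;
y2f =  x * sit + y * cot;
% inverse rotation by -theta, as on line 125, which brings the best direction
% onto the x axis before measuring the box extents:
x2  =  x * cot + y * sit;
y2  = -x * sit + y * cot;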
I did not really look at the code; this is an explanation of how the rotating calipers work.
A fundamental property is that the tightest bounding box is such that one of its sides overlaps an edge of the hull. So what you do is essentially
try every edge in turn;
for a given edge, seen as horizontal and facing south, find the farthest vertices to the north, west and east;
evaluate the area or the perimeter of the rectangle that they define;
remember the best area.
It is important to note that when you switch from an edge to the next, the N/W/E vertices can only move forward, and are readily found by finding the next decrease of the relevant coordinate. This is how the total processing time is linear in the number of edges (the search for the initial N/E/W vertices takes 3(N-3) comparisons, then the updates take 3(N-1)+Nn+Nw+Ne comparisons, where Nn, Nw, Ne are the number of moves from a vertex to the next; obviously Nn+Nw+Ne = 3N in total).
The modulos are there to implement the cyclic indexing of the edges and vertices.
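To make the procedure concrete, here is a small brute-force sketch that follows the "try every edge" description literally (quadratic time: it rescans all vertices for every edge instead of doing the linear-time update described above):
% hull: nV-by-2 convex hull vertices, in order
nV = size(hull, 1);
bestArea = inf;
bestTheta = 0;
for i = 1:nV
    j = mod(i, nV) + 1;                                    % next vertex, cyclic
    e = hull(j, :) - hull(i, :);                           % current edge, taken as a box side
    theta = atan2(e(2), e(1));
    R = [cos(theta) sin(theta); -sin(theta) cos(theta)];   % rotate by -theta
    r = (R * hull')';                                      % edge is now horizontal
    w = max(r(:,1)) - min(r(:,1));                         % east-west extent
    h = max(r(:,2)) - min(r(:,2));                         % north-south extent
    if w * h < bestArea
        bestArea = w * h;
        bestTheta = theta;
    end
end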
Sorry, this is all one question, but it relates to many small sub-questions that I can't split into separate questions.
1. For example, the input picture size is 960x640.
2. Through VGG16 layer 13 (Conv5_3), get a feature_map of 60x40x512.
3. Do a 3x3 convolution.
3.1 How does the 3x3 convolution compress the output above to 1x512?
3.2 I read some articles saying the RPN randomly selects 512 samples from 2000 anchors. If the 1x512 matrix means this, what is the 3x3 convolution doing?
4. Loop over the feature_map; with stride 16 and scale 16, find the center in the original image corresponding to the current feature map point, cut out 9 anchors, and label IoU < 0.3 as negative samples and IoU > 0.7 as positive samples.
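As far as I understand it, the 9 anchors at one feature_map point are generated roughly like this (a sketch with the common 3 scales x 3 aspect ratios; the scale values 128/256/512 and the cell (i, j) are assumptions for illustration):
stride = 16;
i = 20;  j = 30;                          % hypothetical feature map cell
cx = (j - 0.5) * stride;                  % anchor center in the original image
cy = (i - 0.5) * stride;
scales = [128 256 512];                   % assumed anchor sizes in pixels
ratios = [0.5 1 2];                       % assumed aspect ratios (h/w)
anchors = zeros(9, 4);                    % one row per anchor: [x1 y1 x2 y2]
k = 0;
for s = scales
    for r = ratios
        w = s / sqrt(r);  h = s * sqrt(r);
        k = k + 1;
        anchors(k, :) = [cx - w/2, cy - h/2, cx + w/2, cy + h/2];
    end
end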
4.1 If there are several points on the feature_map, how is the GT covered? I mean, since IoU > 0.7 is needed to label a positive sample, does IoU here refer to [the intersection of the area mapped from this point back to the original image and the GT], or [the whole area of the GT]? I think it should be the former.
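By IoU I mean the standard intersection-over-union between one anchor box and one GT box (a sketch, boxes given as [x1 y1 x2 y2]):
% a: anchor box, g: ground-truth box, both as [x1 y1 x2 y2]
iw    = max(0, min(a(3), g(3)) - max(a(1), g(1)));   % overlap width
ih    = max(0, min(a(4), g(4)) - max(a(2), g(2)));   % overlap height
inter = iw * ih;
areaA = (a(3) - a(1)) * (a(4) - a(2));
areaG = (g(3) - g(1)) * (g(4) - g(2));
iou   = inter / (areaA + areaG - inter);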
5. After all the loops are over, filter the positive and negative samples by NMS. Is it possible to have multiple anchors at a single point, or is NMS sure to filter this out?
6. Pass to softmax.
6.1 My problem is that, in many cases, the positions of the points on the feature_map labeled positive and negative are different each time. Since the parameters are also fixed at specific positions on the feature_map, how are proposals found from an image in the detection phase?
6.2 Is the random selection of anchors done here?
7. RoI pooling merges the feature_map and the proposals. (1. The RoI (a RoI meaning a group of anchors?) is located in the feature map, giving a patch of the feature map. 2. Something like an SPP layer (7x7 down-sampling) is applied to the feature map patch, transforming it into fixed-size features to fit the fully connected layer.)
8. Another softmax. (The training phase uses backpropagation to tune the parameters.) My problem is the same as 6.1: in many cases the positions of the points on the feature_map labeled positive and negative are different each time, and since the parameters are fixed at specific positions on the feature_map, how are proposals found from the image in the detection phase?
9. Compare the RoI to the GT and do regression.
After finishing the above questions and re-thinking, I found my understanding of anchors and proposals a bit confused. Do many anchors compose a proposal?
If so, then the above step 6 becomes:
Select 512 anchors and pass their parameters into softmax; the output shows whether each one is part of the target object. So this layer is the detection phase, and when detecting, one just loops over all the anchors to get the possible ones.
6.1 But in this case, how does the RPN output the bbox size (x, y, w, h)? I think it needs to merge the selected anchors and then scale to the size of the original image to get the bbox size.
6.2 If the operation is a merge, then randomly selecting 512 from 2000 anchors is likely to miss some areas, isn't it?
My questions are mainly 3 and 6, and I think all of them are highly related and cannot be separated. Some just need a yes or no confirmation. Thanks.
So I have a set of points V, which are the vertices of a convex polytope, and a separate point p. Basically, I want to check whether p is contained in the polytope spanned by V. To do so, I set up a linear program that checks whether there exists a hyperplane such that all points in V lie on one side while p lies on the other, like so (using YALMIP):
z = sdpvar(size(p,1),1);                          % normal vector of the hyperplane
sdpvar z0;                                        % offset of the hyperplane
LMI = [z'*vert - z0 <= 0, z'*probs - z0 <= 1];    % vert: vertices as columns, probs: the point p
solvesdp(LMI, -z'*probs + z0);                    % maximize z'*p - z0
The hyperplane is defined by the set of points x such that z'*x - z0 = 0, so if I get a value larger than zero for the point p and one smaller than zero for all vertices, then I know they are separated by the plane (the second constraint is just there so the problem is bounded). This works fine. However, now I want to check whether there is a hyperplane separating the two point sets that also contains the origin. For this, I simply set z0 = 0, i.e. drop it entirely, getting:
z = sdpvar(size(p,1),1);                % normal vector; the hyperplane now passes through the origin
LMI = [z'*vert <= 0, z'*probs <= 1];    % same constraints with z0 dropped
solvesdp(LMI, -z'*probs);               % maximize z'*p
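To state the separation test from above in code, roughly what I check afterwards is (a sketch; vert holds the vertices as columns, probs is the point p, and double() extracts the solved value in this older YALMIP API):
zval = double(z);                          % numeric value of the solved normal
onNegSide = all(zval' * vert <= 1e-9);     % every vertex on one side (small tolerance)
onPosSide = zval' * probs > 1e-9;          % the lone point strictly on the other side
separated = onNegSide && onPosSide;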
Now, however, even for cases in which I know there is a solution, it doesn't find it, and I'm at a loss for understanding why. As a test, I've used the vertices
v1=[0;0;0];v2=[1;0;0];v3=[0;1;0];v4=[1;1;1];
and the point
p=[0.4;0.6;0.6];
When plotted, that looks like the picture here.
So it's clear that there should be a plane containing the origin (the front, center vertex of the polytope) that separates the lone point from the polytope.
One thing I've tried already is to offset the vertex of the polytope that currently lies at the origin by a little (10^-5), so that the plane would not touch the polytope (although the LP should allow for that), but that didn't work either.
I'm grateful for any ideas!
I am trying to compute 3D coordinates from several pairs of two-view images.
First, I used the MATLAB function estimateFundamentalMatrix() to get the F of the matched points (number of points > 8), which is:
F1 =[-0.000000221102386 0.000000127212463 -0.003908602702784
-0.000000703461004 -0.000000008125894 -0.010618266198273
0.003811584026121 0.012887141181108 0.999845683961494]
And my camera - which took these two pictures - was pre-calibrated with the intrinsic matrix:
K = [12636.6659110566, 0, 2541.60550098958
0, 12643.3249022486, 1952.06628069233
0, 0, 1]
From this information I then computed the essential matrix using:
E = K'*F*K
By SVD decomposition of E, I finally got the camera projection matrices:
P1 = K*[ I | 0 ]
and
P2 = K*[ R | t ]
Where R and t are:
R = [ 0.657061402787646 -0.419110137500056 -0.626591577992727
-0.352566614260743 -0.905543541110692 0.235982367268031
-0.666308558758964 0.0658603659069099 -0.742761951588233]
t = [-0.940150699101422
0.320030970080146
0.117033504470591]
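For completeness, the SVD step I mean is the standard recipe (just a sketch; the correct (R, t) among the four candidates has to be picked by the cheirality check, i.e. the triangulated points must lie in front of both cameras):
[U, ~, V] = svd(E);
if det(U) < 0, U = -U; end          % enforce proper rotations
if det(V) < 0, V = -V; end
W  = [0 -1 0; 1 0 0; 0 0 1];
R1 = U * W  * V';                   % two possible rotations
R2 = U * W' * V';
t  = U(:, 3);                       % translation up to sign (and unknown scale)
P2candidates = {K*[R1 t], K*[R1 -t], K*[R2 t], K*[R2 -t]};   % the 4 possible P2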
I know there should be 4 possible solutions; however, my computed 3D coordinates do not seem to be correct.
I used the camera to take pictures of a FLAT object with marked points. I matched the points by hand (which means there should be no obvious mistakes in the raw input). But the result turned out to be a surface with a little bit of bending.
I guess this might be because the pictures were not corrected for lens distortion (but actually I remember I did that).
I just want to know whether this method of solving the 3D reconstruction problem is right, especially when we already know the camera intrinsic matrix.
Edit by JCraft at Aug. 4: I have redone the process and got some pictures showing the problem; I will write another question with details and then post the link.
Edit by JCraft at Aug. 4: I have posted a new question: Calibrated camera get matched points for 3D reconstruction, ideal test failed. And @Schorsch, I really appreciate your help formatting my question. I will try to learn how to format input on SO and also try to improve my grammar. Thanks!
If you only have the fundamental matrix and the intrinsics, you can only get a reconstruction up to scale. That is, your translation vector t is in some unknown units. You can get the 3D points in real units in several ways:
You need to have some reference points in the world with known distances between them. This way you can compute their coordinates in your unknown units and calculate the scale factor to convert your unknown units into real units (see the sketch after this list).
You need to know the extrinsics of each camera relative to a common coordinate system. For example, you can have a checkerboard calibration pattern somewhere in your scene that you can detect and compute extrinsics from. See this example. By the way, if you know the extrinsics, you can compute the Fundamental matrix and the camera projection matrices directly, without having to match points.
You can do stereo calibration to estimate the R and the t between the cameras, which would also give you the Fundamental and the Essential matrices. See this example.
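For the first option above, the conversion is a single scale factor (a sketch; X1, X2, knownDist, and points are hypothetical names for the reconstructed reference points, their measured real distance, and the full 3xN reconstructed point set):
s = knownDist / norm(X1 - X2);    % real units per unknown unit
pointsMetric = s * points;        % rescale all reconstructed 3D points
tMetric = s * t;                  % the translation/baseline gets the same scale factor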
Flat objects are critical surfaces; it is not possible to achieve your goal from them. Try adding two (or more) points off the plane (see Hartley and Zisserman or another text on the matter if still interested).
When showing the extrinsic parameters of calibration (the 3D model including the camera position and the position of the calibration checkerboards), the toolbox does not include units for the axes. It seemed logical to assume that they are in mm, but the z values displayed can not possibly be correct if they are indeed in mm. I'm assuming that there is some transformation going on, perhaps having to do with optical coordinates and units, but I can't figure it out from the documentation. Has anyone solved this problem?
If you marked the side length of your squares in mm, then the z-distance shown would be in mm.
I know next to nothing about MATLAB's tracking utilities (not entirely true, but I avoid MATLAB wherever I can, which is almost always possible), but here's some general info.
Pixel dimensions on the sensor have nothing to do with the size of a pixel on screen, or in model space. For all practical purposes a camera produces a picture that has no meaningful units. A tracking process is unaware of the scale of the scene (the perspective projection takes care of that). You can re-insert a scale by taking 2 tracked points and measuring the distance between them; this solver-space distance is pretty much arbitrary. Now, if you know the real distance between these points, you can get a conversion factor by doing:
real distance / solver space distance.
There's really no way of knowing this distance from the camera's settings, as the camera is unable to differentiate between different scales of scene. So a perfect 1:100 replica is no different for the solver than the real deal. So you must always relate to something you can measure separately in each measuring session. The camera always produces something that is relative in nature.