I have to reconstruct an object that will be placed about 1 m to 1.5 m away from the baseline of my stereo setup. The images captured by both cameras have high resolution (10 MP).
The accuracy with which I have to detect its position is ±0.5 mm along all three coordinate axes. (If you require more details, please let me know.)
Given these requirements, what should the optimal specifications of my checkerboard (for calibration) be?
I only know that it should be an asymmetric board. It should be placed in the same distance range as the one where the object is expected to be placed. Also, it should be oriented at as many different angles as possible (making sure all corners are seen by both cameras).
What about:
Number of squares horizontally and vertically? (Also, which side should have more squares, and which should have an even number?)
Dimension of each square on checkerboard?
What effect does the baseline distance have on this?
Do these parameters of the checkerboard affect my accuracy in any way? Are there any other parameters I need to consider for calibration?
I am using the MATLAB Stereo Calibrator App.
I will try to answer as well as I can:
Number of squares. As you can guess, the more squares the better the result will be (strictly speaking, it is the corners between squares that are used), because you have a more overdetermined system of equations to solve. Additionally, the overall size of the chequerboard doesn't matter; only the odd/even pattern matters.
Dimensions of squares. The size does not matter much in the "mathematical" representation, but it matters in practice. If your squares are very small, your printer probably won't render the corners of the squares sharply, and that will make your data "noisier". In the past, for a really small calibration system, I had to go to a specialised print shop so they could print it at the highest quality possible. Of course, if you make the squares very big, you won't be able to fit many of them in the image, which is not good.
The baseline distance only affects how well you can see the corners between squares. The more accurately (in mm, real distance!) you detect these corners, the better. Obviously, if you make small squares and put them very far away, you won't see very much. This ties in with questions 1 and 2. Additionally, another problem you may have is focus (limited depth of field). In an application I worked on, some really small and close objects had to be imaged. That was a problem while calibrating, as the range of z distance I could see without blur was around 2 mm. This really crippled my ability to calibrate properly, because I could not use big angles in the Z direction without getting blurred corners.
TL;DR: You want to have lots of corners between squares of the chequerboard, but you want to see them as precisely as possible.
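To get a feel for how square size and working distance interact, a pinhole-model estimate of the projected square size can help. The sketch below is only a rough illustration: the focal length of 4000 px is a made-up value (it is not given in the question), and `square_size_px` and `mm_per_pixel` are hypothetical helper names.

```python
def square_size_px(square_mm, distance_mm, focal_px):
    """Approximate side length (in pixels) of a fronto-parallel square
    under the pinhole model: size_px = focal_px * size_mm / distance_mm."""
    return focal_px * square_mm / distance_mm


def mm_per_pixel(distance_mm, focal_px):
    """Lateral footprint of one pixel (in mm) at the given distance."""
    return distance_mm / focal_px


FOCAL_PX = 4000.0  # ASSUMPTION: illustrative focal length in pixels

for distance_mm in (1000.0, 1250.0, 1500.0):
    for square_mm in (5.0, 10.0, 20.0):
        px = square_size_px(square_mm, distance_mm, FOCAL_PX)
        print(f"{square_mm:4.0f} mm square at {distance_mm:6.0f} mm -> {px:6.1f} px")
    print(f"one pixel covers about {mm_per_pixel(distance_mm, FOCAL_PX):.3f} mm")
```

With these assumed numbers, one pixel covers roughly 0.25 mm to 0.4 mm across the 1 m to 1.5 m range, which is the same order as the ±0.5 mm target, so the quality of the corner localisation matters.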
I'm solving a problem of detecting an optic disc in a retina image. As you can see from the image:
the optic disc is the epicentre of the blood vessels, has an irregular circular shape, and has a brighter color than the rest of the retina.
Now I want to use a convolutional neural network to detect it. I know that typical approaches to detecting something in an image using CNNs (consisting mostly of convolution, pooling, dropout and fully connected layers) divide the image into smaller parts, each of which is sent to a classifier that decides whether the object is present or not.
But I'm thinking about another approach. It would be a model that takes a normal RGB image of size Height x Width as input and passes it through several convolution layers such that the size remains the same (Height x Width) but with more channels, say N. There would be no pooling layers (??), so the final output of the convolutions would have size Height x Width x N.
In this output there would be Height x Width feature vectors of size N, each somehow describing the pixel at that position in the original image and its neighbourhood (??). What I'm trying to do is feed these individual vectors as inputs to a fully connected network. Its output would be some number describing the position of the input pixel relative to the position of the optic disc in the image (maybe its distance, or the position itself, I don't know yet...). The training data consists of images and the x, y position of the optic disc in each.
But I'm not sure about some aspects of this approach. Can I do without pooling layers? I thought it might not be transformation invariant then, or something like that. I'm also not sure whether what I'm doing in the fully connected layers is correct. I don't understand neural networks well enough to say "it is obvious that this should work", "this is a case where it is not easy to say and it's worth implementing to see how it works", or "this obviously won't work because...". So my question is simply: which one of these three cases is this?
And isn't there some "obvious" method for this, so that I'm just trying to solve something that has already been solved? (Maybe RNNs or something...)
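On the point of keeping the Height x Width size without pooling: the spatial output size of a convolution follows from simple arithmetic, sketched below. `conv_output_size` is a hypothetical helper, not part of any framework, and the 512-pixel input is an arbitrary example.

```python
def conv_output_size(size, kernel, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution along one axis."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


size = 512  # ASSUMPTION: example image height (or width)

# Stride 1 with padding (kernel - 1) // 2 keeps the size for odd kernels.
for kernel in (3, 5, 7):
    same = conv_output_size(size, kernel, stride=1, padding=(kernel - 1) // 2)
    print(f"kernel {kernel}: {size} -> {same}")

# A stride-2 layer (or 2x2 pooling) halves the resolution instead.
print(conv_output_size(size, kernel=3, stride=2, padding=1))
```

Stacking size-preserving layers like this yields one N-dimensional feature vector per pixel. Applying the same fully connected layer to every pixel is equivalent to a 1x1 convolution, and without pooling or striding the neighbourhood each vector sees grows by only kernel - 1 pixels per layer.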
I'm trying to use a stereo camera to measure the distance from the cameras to a dynamic object (a moving car, for example). I used a checkerboard pattern with 7 by 8 squares and a square size of 89 millimeters (~3.5 inches). The distance from camera to pattern was 212 centimeters (~83.5 inches). I'm using Python and OpenCV.
My questions are:
Does the distance from pattern to camera affect the calibration parameters much? One of the MATLAB examples states that the distance from camera to pattern during calibration should be the same as the distance to the object that is to be measured.
Should I use a bigger board and increase the camera-to-pattern distance to get more accurate results for my application?
I think that the specific distance you use for the calibration shouldn't really matter. What does matter is that you take as many different images of your checkerboard as possible, at least 15. The checkerboard should be moved so that you cover the whole field of view of the camera. It should also be imaged at different out-of-plane orientations. Having a checkerboard with more squares should also be beneficial, as this means more corner points per image. The size of the squares shouldn't make a difference.
On the other hand, camera calibration should be performed with fixed focus, which also shouldn't change after the calibration. So, in practice, I guess this forces you to perform the calibration at a distance similar to the one that will be used later for the experiment.
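As an illustration of how the board dimensions enter the calibration, the sketch below builds the 3D corner coordinates for the board described in the question (7 by 8 squares, 89 mm square size). A board with 7 by 8 squares has 6 by 7 inner corners, and those are the points detected in each image. `board_object_points` is a hypothetical helper; in OpenCV such a list would typically be converted to a float32 array and passed, once per image, to `cv2.calibrateCamera`.

```python
def board_object_points(squares_x, squares_y, square_size):
    """Return (x, y, z) coordinates of the inner corners of a flat
    checkerboard, in the same unit as square_size, with z = 0."""
    corners_x = squares_x - 1  # inner corners along x
    corners_y = squares_y - 1  # inner corners along y
    return [
        (col * square_size, row * square_size, 0.0)
        for row in range(corners_y)
        for col in range(corners_x)
    ]


# Board from the question: 7 x 8 squares, 89 mm per square.
points = board_object_points(7, 8, 89.0)
print(len(points))  # number of corner points contributed per image
print(points[0], points[-1])
```

Because the square size is given in millimetres here, the translations estimated by a calibration fed with these points come out in millimetres as well.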
Currently I hope to use a scale-space representation to filter an image. Each feature in an image can be filtered using a Gaussian smoothing filter with one optimal sigma. This means that different features in one image are best expressed at different scales in the scale-space representation.
For example, I have an image with a tree in it. In the scale-space representation, three sigma values are used: sigma0, sigma1 and sigma2. The ground is best expressed in the image smoothed with sigma0 because it mainly contains textures. The branches are best expressed in the image smoothed with sigma1, and the trunk in the image smoothed with sigma2. When I filter the image, I want the filtered pixels for the ground to come from the image smoothed with sigma0.
The filtered pixels for the branches should come from the image smoothed with sigma1, and the filtered pixels for the trunk from the image smoothed with sigma2.
This requires determining in which smoothed image each pixel is best expressed. Is this idea plausible?
I am trying to use the difference of Gaussians of two successive smoothed images to perform the above task. Is there any other way to combine the three smoothed images?
I use MATLAB to implement the idea. The values of the three sigmas are 1.0, 2.0 and 3.0. The corresponding Gaussian kernel sizes are 3, 5 and 7. I use the function fspecial to generate the kernels. Are these parameters reasonable? Please share your experience with the scale-space representation to help me. Links to useful papers are also welcome.
Your idea is very much plausible! You are just one step away from it. I did something very similar once and it looked like this:
After smoothing your images and extracting the edges for each smoothing step (I used a weighted Sobel filter for this, weighted to compensate for the maxima suppression after Gaussian filtering, since DoG was not quite stable for my application), you can project (and normalize) your whole stack of edge images into a single image ("cumulative edges"), which will contain the characteristic edges. You can then compare the cumulative edges image (using cross-correlation or whatever you wish) with every single image in your edge stack; the largest value of this comparison then indicates the smoothing scale at which the pixel is best expressed.
Hope that makes sense to you after reading it a couple of times.
Also, don't be afraid of using much bigger kernel sizes. While it all depends on your application, I ended up using sizes of 51 and bigger! (I was working with 40 MP images, though...)
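One way to sanity-check kernel sizes against sigma is to see how much of the Gaussian a kernel of a given size actually covers. The sketch below uses the common rule of thumb of about three sigma on each side; both the rule and the helper names are assumptions made for illustration, not something prescribed by `fspecial`.

```python
import math


def kernel_size_for_sigma(sigma, n_sigmas=3.0):
    """Odd kernel size covering +/- n_sigmas standard deviations."""
    return 2 * math.ceil(n_sigmas * sigma) + 1


def covered_mass(size, sigma):
    """Fraction of a continuous 1D Gaussian lying inside a kernel of
    the given size (half-width of size / 2 pixels)."""
    return math.erf((size / 2.0) / (sigma * math.sqrt(2.0)))


# Sigma / kernel-size pairs taken from the question.
for sigma, size in ((1.0, 3), (2.0, 5), (3.0, 7)):
    suggested = kernel_size_for_sigma(sigma)
    print(
        f"sigma={sigma}: size {size} covers {covered_mass(size, sigma):.2f}, "
        f"size {suggested} covers {covered_mass(suggested, sigma):.4f}"
    )
```

Under this estimate, the sizes from the question truncate a noticeable share of each Gaussian, so the effective smoothing is weaker than the nominal sigma suggests.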
T. Lindeberg has literally dozens of papers related to this problem. I found this one the most useful, but since you are already on the right track, I don't think reading the 50 pages will make you that much smarter. The most important part of it is maybe this one:
Principle for scale selection:
In the absence of other evidence, assume that a scale level, at which some
(possibly non-linear) combination of normalized derivatives assumes a
local maximum over scales, can be treated as reflecting a characteristic
length of a corresponding structure in the data.
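The principle can be illustrated on a toy 1D signal. The sketch below smooths a Gaussian bump with several sigmas, computes the scale-normalised second derivative (sigma squared times the second difference) at the bump centre, and picks the sigma with the strongest response. The signal, the list of candidate sigmas, and all function names are assumptions made for this illustration; it is not the Sobel-based procedure described in the answer.

```python
import math


def gaussian_kernel(sigma, truncate=4.0):
    """Sampled, normalised 1D Gaussian kernel."""
    radius = int(math.ceil(truncate * sigma))
    weights = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smoothed_value(signal, index, kernel):
    """Value of the Gaussian-smoothed signal at one sample."""
    radius = len(kernel) // 2
    return sum(k * signal[index + j - radius] for j, k in enumerate(kernel))


def normalized_response(signal, index, sigma):
    """Magnitude of sigma^2 times the second difference at `index`."""
    kernel = gaussian_kernel(sigma)
    left, mid, right = (smoothed_value(signal, index + d, kernel) for d in (-1, 0, 1))
    return sigma ** 2 * abs(left - 2.0 * mid + right)


# ASSUMPTION: toy signal, a Gaussian bump with standard deviation 4 samples.
BUMP_SIGMA = 4.0
CENTER = 200
signal = [math.exp(-0.5 * ((i - CENTER) / BUMP_SIGMA) ** 2) for i in range(401)]

candidate_sigmas = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 10.0, 12.0]
responses = [normalized_response(signal, CENTER, s) for s in candidate_sigmas]
best_sigma = candidate_sigmas[responses.index(max(responses))]
print(best_sigma)
```

For a 1D Gaussian bump, the normalised response peaks at a smoothing sigma of sqrt(2) times the bump's own width, so the selected scale tracks the size of the structure rather than equalling it. Running the same comparison per pixel over a stack of smoothed images is one way to decide which smoothed image a pixel belongs to.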
This paper nicely describes the geometry of a stereo imaging system. I am trying to figure out how the calculation would change if the cameras were tilted towards each other by a certain angle. I looked around but couldn't find any reference to tilted camera systems.
Unfortunately, the calculation changes significantly. The rectified case (where both cameras are well aligned to each other) has the advantage that you can calculate the disparity, and the depth is inversely proportional to the disparity. This does not hold in the general case.
When you introduce tilts, you have to work with what is called epipolar geometry. Here is a paper about this I just googled. In order to calculate the depth from a pixel pair, you need the fundamental matrix or the essential matrix. Neither is easy to obtain from the image pair alone. If, however, you know the geometric relation between the two cameras (translation and rotation), calculating these matrices is a lot easier.
There are several ways to calculate the depth of a pixel-pair. One way is to use the fundamental matrix to rectify both images (although rectifying is not easy either, or even unique) and run a simple disparity check.
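If the relative pose of the two cameras is known, the essential matrix follows directly as E = [t]x R, and corresponding points in normalised image coordinates satisfy x2^T E x1 = 0. The sketch below builds E for two cameras toed in around the vertical axis and checks the constraint for one 3D point. The 10 degree tilt, the 120 mm baseline, the convention X2 = R X1 + t, and all helper names are assumptions chosen for illustration.

```python
import math


def rotation_y(angle_rad):
    """Rotation matrix about the y (vertical) axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def skew(t):
    """Cross-product matrix [t]x, so that [t]x v equals t x v."""
    return [[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]]


def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def matvec(a, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def essential_matrix(rotation, translation):
    """E = [t]x R for the convention X2 = R X1 + t."""
    return matmul(skew(translation), rotation)


# ASSUMPTIONS: 10 degree toe-in and a 120 mm baseline along x.
R = rotation_y(math.radians(10.0))
t = [-120.0, 0.0, 0.0]
E = essential_matrix(R, t)

# Project one 3D point (camera-1 frame, in mm) into both cameras.
X1 = [150.0, -80.0, 1200.0]
X2 = [r + ti for r, ti in zip(matvec(R, X1), t)]
x1 = [X1[0] / X1[2], X1[1] / X1[2], 1.0]
x2 = [X2[0] / X2[2], X2[1] / X2[2], 1.0]

E_x1 = matvec(E, x1)
residual = sum(x2[i] * E_x1[i] for i in range(3))
print(residual)  # close to zero, up to floating-point error
```

With pixel coordinates instead of normalised ones, the fundamental matrix is F = K2^{-T} E K1^{-1}, where K1 and K2 are the intrinsic matrices of the two cameras.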
When showing the extrinsic parameters of the calibration (the 3D model including the camera position and the positions of the calibration checkerboards), the toolbox does not include units for the axes. It seemed logical to assume that they are in mm, but the z values displayed cannot possibly be correct if they are indeed in mm. I'm assuming that there is some transformation going on, perhaps having to do with optical coordinates and units, but I can't figure it out from the documentation. Has anyone solved this problem?
If you entered the side length of your squares in mm, then the z-distance shown will be in mm.
I know next to nothing about MATLAB's tracking utilities (not entirely true, but I avoid MATLAB wherever I can, which is almost always possible), but here's some general info.
Pixel dimension on the sensor has nothing to do with the size of the pixel on screen or in model space. For all practical purposes, a camera produces a picture that has no meaningful units. A tracking process is unaware of the scale of the scene (the perspective projection takes care of that). You can reinsert a scale by taking two tracked points and measuring the distance between them. This distance in solver space is pretty much arbitrary. Now, if you know the real distance between these points, you can get a conversion factor by computing:
real distance / solver space distance.
There's really no way of knowing this distance from the camera's settings, as the camera is unable to differentiate between different scales of scenes. So a perfect 1:100 replica is no different for the solver than the real deal. You must therefore always relate to something you can measure separately for each measuring session. The camera always produces something that's relative in nature.
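The conversion described above can be sketched in a few lines. The point coordinates and the 500 mm reference length below are made-up values, and `scale_factor` is a hypothetical helper.

```python
import math


def scale_factor(point_a, point_b, real_distance):
    """Real-world units per solver-space unit, from two tracked points
    whose true separation is known."""
    solver_distance = math.dist(point_a, point_b)
    if solver_distance == 0.0:
        raise ValueError("reference points must be distinct")
    return real_distance / solver_distance


# ASSUMPTION: two tracked points in arbitrary solver units, known to be
# 500 mm apart in the real scene.
a = (0.0, 0.0, 2.0)
b = (3.0, 4.0, 2.0)
factor = scale_factor(a, b, real_distance=500.0)

# Any other solver-space length can now be converted to millimetres.
other_length_mm = factor * 1.7
print(factor, other_length_mm)
```

The factor is only valid for the session in which the reference distance was measured; a new solve needs its own reference.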