Stereo vision: Depth estimation - matlab

I am working on a stereo vision task and I would like to get the distance between the stereo cameras and the object. I am using MATLAB with the Computer Vision System Toolbox.
I have calibrated the cameras using the "Camera Calibration Toolbox for Matlab", so I have the intrinsic parameters of the left and right cameras and the extrinsic parameters (position of the right camera w.r.t. the left camera). I also have a pair of rectified pictures and their disparity map, which I estimated with the MATLAB function disparity(). I know the baseline and the focal length of the cameras, but my results are still wrong.
baseline = 70 mm
focal length = 25 mm
disparity = 60 pixels
---------------------
depth = baseline * focal length / disparity = 70 * 25 / 60 = 29 mm
But I know that the distance is roughly 600 mm. Is this formula right? What about the units? mm * mm / pixel != mm. In particular, I would like to use the camera matrix (intrinsic parameters) for the calculation, but I couldn't figure out how. I would be thankful for any hint.

Like you said, you have to convert the units into mm. For that you need this formula:
z = (b*F) / (d*s)
mm = (mm * mm) / (pixel * (mm/pixel))
Where
z = depth in mm
b = baseline in mm
F = focal length in mm
d = disparity in pixels
s = pixel size in mm/pixel (it is usually given in µm, so convert it first).
EDIT
Sometimes your focal length is already given in pixels, in which case you don't need the pixel size. Just use your formula:
z = b*F / d
mm = mm * pixel / pixel
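In other words, the camera matrix from your calibration already contains the focal length in pixels: K = [fx 0 cx; 0 fy cy; 0 0 1], where fx is in pixel units. A minimal MATLAB sketch of the calculation (the K values below are placeholders, not your calibration result):
K = [2000, 0, 640;
     0, 2000, 480;
     0,    0,   1];       % camera matrix from calibration (placeholder values)
fx_px        = K(1,1);    % focal length in pixels, read from the camera matrix
baseline_mm  = 70;        % baseline in mm
disparity_px = 60;        % disparity in pixels
depth_mm = baseline_mm * fx_px / disparity_px   % mm * px / px = mm
If you also have a stereoParameters object from the Computer Vision System Toolbox's stereo calibration, reconstructScene(disparityMap, stereoParams) should give you the 3-D coordinates for the whole disparity map at once.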

Related

Depth to world registration hololens2 unity

I'm working on a program using HoloLens 2 research mode in Unity. The HoloLens gives us a depth image that contains, for every pixel, the distance from the depth sensor to the object in front of it.
What I do is, for every pixel, project it to the image plane and then back-project it according to the depth captured by the depth sensor, which gives me the xyz in the depth sensor's coordinate frame. This coordinate then needs to be transformed to the global coordinate system. To do so I get the camera pose from Unity via cam_pose = Camera.main.transform, and on the other hand I have the saved depth sensor extrinsic matrix.
From these two matrices I create depth_to_world = cam_pose @ inv(extrinsic). Then for every xyz from the depth image I compute global_xyz = depth_to_world @ xyz to get the point in the real world. The problem is that it returns points with a 10-15 cm error. What am I doing wrong? (The code is in Python.)
x = self.us[Depth_i, Depth_j] # projection from pixels to image plane
y = self.vs[Depth_i, Depth_j] # projection from pixels to image plane
D = distance_img[Depth_i, Depth_j] #distance_img is depth image
distance = 1000*float(D) / np.sqrt(x * x + y * y + 1) #distance according to spherical image plane D is in millimeter
depth_to_world = cam_pose @ np.linalg.inv(Constants.camera_extrinsic)
X = (np.array([x * distance, y * distance, 1.0 * distance, 1])).reshape(4, 1)
point = (depth_to_world @ X)[0:3, 0]
I got it! Following (https://github.com/petergu684/HoloLens2-ResearchMode-Unity), I first passed the Unity world origin to a WinRT plugin, and the transform should be depth_to_world = inv(extrinsic) * cam_pose, where cam_pose is given by TryLocateAtTimeStamp. The other point is that Unity's coordinate system is left-handed (surprisingly!), so we have to multiply z by -1 (z <- -z).
My depth_to_world transformation was close, but not correct.
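To make the corrected order of operations explicit, here is a small MATLAB-style sketch of the composition described above (the input values are pure placeholders; in practice extrinsic, cam_pose, x, y and distance come from the sensor and from TryLocateAtTimeStamp, as in the question's Python code):
extrinsic = eye(4);            % depth sensor extrinsic (4x4 homogeneous, placeholder)
cam_pose  = eye(4);            % camera pose from TryLocateAtTimeStamp (4x4, placeholder)
x = 0.1; y = -0.2;             % image-plane coordinates of one pixel (placeholder)
distance  = 600;               % back-projected distance for that pixel (placeholder)
depth_to_world = inv(extrinsic) * cam_pose;       % corrected order of the two matrices
X  = [x * distance; y * distance; distance; 1];   % homogeneous point in the depth frame
pt = depth_to_world * X;
pt_world = pt(1:3);
pt_world(3) = -pt_world(3);    % z <- -z to account for Unity's left-handed frame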

How to perform an orthographic projection on a z-Buffer image in Matlab?

I am facing the same problem as mentioned in this post (Depth as distance to camera plane in GLSL); however, I am not facing it with OpenGL, but simply with MATLAB.
I have a depth image rendered from the Z-buffer in 3ds Max, and I was not able to get an orthographic representation of it. For a better understanding, I will use the same sketch as the previous post:
   *                 |--*
  /                  |
 /                   |
C-----*              C-----*
 \                   |
  \                  |
   *                 |--*
The three asterisks are pixels and C is the camera. The lines from the asterisks are the "depth". In the first case, I get the distance from the pixel to the camera. In the second, I wish to get the distance from each pixel to the plane.
The settings of my camera are the following:
WIDTH = 512;
HEIGHT = 424;
FOV = 89.971;
aspect_ratio = WIDTH/HEIGHT;
%clipping planes
near = 500;
far = 5000;
I calculate the frustum settings like the following:
%calculate frustums settings
top = tan((FOV/2)*5000)
bottom = -top
right = top*aspect_ratio
left = -top*aspect_ratio
And set the projection matrix like this:
%Generate matrix
O_p = [2/(right-left) 0 0 -((right+left)/(right-left)); ...
0 2/(top-bottom) 0 -((top+bottom)/(top-bottom));...
0 0 -2/(far-near) -(far+near)/(far-near);...
0 0 0 1];
After this I read in the depth image, which was saved as a 48-bit RGB image where each channel is the same, so only one channel has to be used.
%Read in image
img = imread('KinectImage.png');
%Throw away, except one channel (all hold the same information)
c1 = img(:,:,1);
The pixel values have to be inverted, since the closer they are to the camera, the brighter they are. If a pixel is 0 (no object to render available), it is set to 2^16 so that after the bit complement the value is still 0.
%Inverse bits that are not zero, so that the z-image has the correct values
c1(c1 == 0) = 2^16
c1_cmp = bitcmp(c1);
To apply the matrix to each Z-buffer value, I flatten the image into a one-dimensional array and build a vector of the form [0 0 z 1] for every element.
c1_cmp1d = squeeze(reshape(c1_cmp,[512*424,1]));
converted = double([zeros(WIDTH*HEIGHT,1) zeros(WIDTH*HEIGHT,1) c1_cmp1d zeros(WIDTH*HEIGHT,1)]) * double(O_p);
After that, I pick out the 4th element of the result vector and reshape it into an image:
img_con = converted(:,4);
img_con = reshape(img_con,[424,512]);
However, the effect that the Z-buffer is not orthographic is still there. Did I get something wrong? Is my calculation flawed, or did I make a mistake here?
[Image: depth image coming from 3ds Max]
[Image: after the computation (the white is still "0", but the color axis has changed)]
It would be great to achieve this directly in 3ds Max, which would resolve the issue, but I was not able to find such a setting for the Z-buffer. Thus, I want to solve this using MATLAB.

Estimating distance to a point using camera calibration

I want to estimate the distance from the camera to a point on the ground (that means Yw = 0), given the pixel coordinates of that point. For that I used camera calibration methods.
But the results are not meaningful.
I have the following details for the calibration:
- focal length x and y, principal point x and y, effective pixel size in meters, yaw and pitch angles, camera height, etc.
- I have entered the focal length, principal point and translation vector in terms of pixels for the calculation.
- I have multiplied the image point by the camera matrix and then by the rotation | translation matrix (R|t) to get the world point.
Is my procedure correct? What can be wrong?
Result:
image_point(x,y) = 400,380
world_point z coordinate (distance) = 12.53
image_point(x,y) = 400,180
world_point z coordinate (distance) = 5.93
Problem:
I am getting very small values for the z coordinate.
That means the z coordinate is << 1 m (because the effective pixel size in meters = 10^-5).
This is my MATLAB code:
%positive downward pitch
xR = 0.033;
yR = 0;
zR = pi;
%effective pixel size in meters = 10 ^-5 ; focal_length x & y = 0.012 m
% principal point x & y = 320 and 240
intrinsic_params =[1200,0,320;0,1200,240;0,0,1];
Rx=[1,0,0 ; 0,cos(xR),sin(xR); 0,-sin(xR),cos(xR)];
Ry=[cos(yR),0,-sin(yR) ; 0,1,0 ; sin(yR),0,cos(yR)];
Rz=[cos(zR),sin(zR),0 ; -sin(zR),cos(zR),0 ; 0,0,1];
R= Rx * Ry * Rz ;
% The camera is 1.17m above the ground
t=[0;117000;0];
extrinsic_params = horzcat(R,t);
% extrinsic_params is 3 *4 matrix
P = intrinsic_params * extrinsic_params; % P 3*4 matrix
% make it square ....
P_sq = [P; 0,0,0,1];
%image size is 640 x 480
%An arbitrary pixel 360,440 is entered as input
image_point = [400,380,0,1];
% world point will be in the form X Y Z 1
world_point = P_sq * image_point'
Your procedure is kind of right; however, it is going in the wrong direction.
See this link. Using your intrinsic and extrinsic calibration matrix you can find the pixel-space position of a real-world vector, NOT the other way around. The exception to this is if your camera is stationary in the global frame and you have the Z position of the feature in the global space.
Stationary camera, known feature Z case: (see also this link)
%% First we simulate a camera feature measurement
K = [0.5 0 320;
0 0.5 240;
0 0 1]; % Example intrinsics
R = rotx(0)*roty(0)*rotz(pi/4); % orientation of camera in global frame
c = [1; 1; 1]; %Pos camera in global frame
rwPt = [ 10; 10; 5]; %position of a feature in global frame
imPtH = K*R*(rwPt - c); %Homogeneous image point
imPt = imPtH(1:2)/imPtH(3) %Actual image point
%% Now we use the simulated image point imPt and the knowledge of the
% features Z coordinate to determine the features X and Y coordinates
%% First determine the scaling term lambda
imPtH2 = [imPt; 1];
z = R.' * inv(K) * imPtH2;
lambda = (rwPt(3)-c(3))/z(3);
%% Now the RW position of the feature is:
rwPt2 = c + lambda*R.' * inv(K) * imPtH2 % Reconstructed RW point
Non-stationary camera case:
To find the real-world position or distance from the camera to a particular feature (given on the image plane) you have to employ some method of reconstructing the 3D data from the 2D image.
The two that come to mind immediately are OpenCV's solvePnP and stereo-vision depth estimation.
solvePnP requires 4 co-planar (in RW space) features to be available in the image, and the positions of the features in RW space known. This may not sound useful as you need to know the RW position of the features, but you can simply define the 4 features with a known offset rather than a position in the global frame - the result will be the relative position of the camera in the frame the features are defined in. solvePnP gives very accurate pose estimation of the camera. See my example.
Stereo vision depth estimation requires the same feature to be found in two spatially separated images, and the transformation between the images in RW space must be known very precisely.
There may be other methods but these are the two I am familiar with.
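If you are working in MATLAB, the Computer Vision Toolbox function estimateWorldCameraPose plays roughly the role of OpenCV's solvePnP. A hedged sketch of the "4 features with a known offset" idea (all values below are simulated placeholders; the correspondences are generated with worldToImage so the example is self-consistent):
K = [1200 0 320; 0 1200 240; 0 0 1];                  % example intrinsics, as above
camParams = cameraParameters('IntrinsicMatrix', K');  % MATLAB stores K transposed
% Four coplanar features with known positions in their own frame (e.g. marker corners)
worldPoints = [0 0 0; 0.2 0 0; 0.2 0.2 0; 0 0.2 0];   % meters (placeholder)
% Simulate a measurement: camera roughly 1.5 m in front of the marker (placeholder pose)
R = eye(3);
t = [-0.1 -0.1 1.5];
imagePoints = worldToImage(camParams, R, t, worldPoints);
% Recover the camera pose from the 2D-3D correspondences (solvePnP analogue)
[worldOrientation, worldLocation] = estimateWorldCameraPose( ...
    imagePoints, worldPoints, camParams);
distCameraToMarker = norm(worldLocation)              % distance to the marker origin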

create opencv camera matrix for iPhone 5 solvepnp

I am developing an application for the iPhone using opencv. I have to use the method solvePnPRansac:
http://opencv.willowgarage.com/documentation/cpp/camera_calibration_and_3d_reconstruction.html
For this method I need to provide a camera matrix:
| fx  0  cx |
|  0  fy cy |
|  0  0   1 |
where cx and cy represent the center pixel positions of the image and fx and fy represent focal lengths, but that is all the documentation says. I am unsure what to provide for these focal lengths. The iPhone 5 has a focal length of 4.1 mm, but I do not think that this value is usable as is.
I checked another website:
http://docs.opencv.org/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html
which shows how opencv creates camera matrices. Here it states that focal lengths are measured in pixel units.
I checked another website:
http://www.velocityreviews.com/forums/t500283-focal-length-in-pixels.html
(about half way down)
it says that focal length can be converted from units of millimeters to pixels using the equation: fx = fy = focalMM * pixelDensity / 25.4;
Another Link I found states that fx = focalMM * width / (sensorSizeMM);
fy = focalMM * length / (sensorSizeMM);
I am unsure about these equations and how to properly create this matrix.
Any help, advice, or links on how to create an accurate camera matrix (especially for the iPhone 5) would be greatly appreciated,
Isaac
p.s. I think that (fx/fy) or (fy/fx) might be equal to the aspect ratio of the camera, but that might be completely wrong.
UPDATE:
Pixel coordinates to 3D line (opencv)
Using this link, I can figure out how they want fx and fy to be formatted, because they use them to scale angles relative to their distance from the center. Therefore fx and fy are likely in pixels/(unit length), but I'm still not sure what this unit length needs to be. Can it be arbitrary as long as x and y are scaled to each other?
You can get an initial (rough) estimate of the focal length in pixels by dividing the focal length in mm by the width of a pixel of the camera's sensor (CCD, CMOS, whatever).
You get the former from the camera manual, or read it from the EXIF header of an image taken at full resolution. Finding out the latter is a little more complicated: you may look up on the interwebs the sensor's spec sheet, if you know its manufacturer and model number, or you may just divide the overall width of its sensitive area by the number of pixels on the side.
Absent other information, it's usually safe to assume that the pixels are square (i.e. fx == fy), and that the sensor is orthogonal to the lens's focal axis (i.e. that the term in the first row and second column of the camera matrix is zero). Also, the pixel coordinates of the principal point (cx, cy) are usually hard to estimate accurately without a carefully designed calibration rig, and an as-carefully executed calibration procedure (that's because they are intrinsically confused with the camera translation parallel to the image plane). So it's best to just set them equal to the geometrical center of the image, unless you know that the image has been cropped asymmetrically.
Therefore, your simplest camera model has only one unknown parameter, the focal length f = fx = fy.
Word of advice: in your application is usually more convenient to carry around the horizontal (or vertical) field-of-view angle, rather than the focal length in pixels. This is because the FOV is invariant to image scaling.
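As a rough worked example of that estimate in MATLAB (using the iPhone numbers that appear further down in this thread, i.e. a 4.1 mm lens, a 4.89 mm wide sensor and a 3264 px wide image; treat them as illustrative only):
focal_mm    = 4.1;                         % focal length from the manual / EXIF
sensor_w_mm = 4.89;                        % sensor width from the spec sheet
image_w_px  = 3264;                        % image width at full resolution
pixel_w_mm = sensor_w_mm / image_w_px;     % width of one pixel on the sensor
fx = focal_mm / pixel_w_mm                 % rough focal length in pixels
hfov_deg = 2 * atand(image_w_px / (2*fx))  % equivalent horizontal FOV, scale-invariant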
The "focal length" you are dealing with here is simply a scaling factor from objects in the world to camera pixels, used in the pinhole camera model (Wikipedia link). That's why its units are pixels/unit length. For a given f, an object of size L at a distance (perpendicular to the camera) z, would be f*L/z pixels.
So, you could estimate the focal length by placing an object of known size at a known distance from your camera and measuring its size in the image. You could also assume the central point is the center of the image. You should definitely not ignore the lens distortion (the dist_coef parameter in solvePnPRansac).
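A quick numeric sketch of that measurement idea, simply inverting the f*L/z relation above (all numbers are invented placeholders):
L_mm        = 200;     % known size of the object (placeholder)
z_mm        = 1000;    % known perpendicular distance from the camera (placeholder)
measured_px = 550;     % measured size of the object in the image (placeholder)
f_px = measured_px * z_mm / L_mm   % estimated focal length in pixels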
In practice, the best way to obtain the camera matrix and distortion coefficients is to use a camera calibration tool. You can download and use the MRPT camera_calib software from this link, there's also a video tutorial here. If you use matlab, go for the Camera Calibration Toolbox.
Here you have a table with the spec of the cameras for iPhone 4 and 5.
The calculation is:
double f = 4.1;
double resX = (double)(sourceImage.cols);
double resY = (double)(sourceImage.rows);
double sensorSizeX = 4.89;
double sensorSizeY = 3.67;
double fx = f * resX / sensorSizeX;
double fy = f * resY / sensorSizeY;
double cx = resX/2.;
double cy = resY/2.;
Try this:
func getCamMatrix()->(Float, Float, Float, Float)
{
    let format:AVCaptureDeviceFormat? = deviceInput?.device.activeFormat
    let fDesc:CMFormatDescriptionRef = format!.formatDescription
    let dim:CGSize = CMVideoFormatDescriptionGetPresentationDimensions(fDesc, true, true)
    // dim = dimensions of the final image
    let cx:Float = Float(dim.width) / 2.0;
    let cy:Float = Float(dim.height) / 2.0;
    let HFOV : Float = format!.videoFieldOfView
    let VFOV : Float = ((HFOV)/cx)*cy
    let fx:Float = abs(Float(dim.width) / (2 * tan(HFOV / 180 * Float(M_PI) / 2)));
    let fy:Float = abs(Float(dim.height) / (2 * tan(VFOV / 180 * Float(M_PI) / 2)));
    return (fx, fy, cx, cy)
}
Old thread, present problem.
As Milo and Isaac mentioned after Milo's answer, there seems to be no "common" params available for, say, the iPhone 5.
For what it is worth, here is the result of a run with the MRPT calibration tool, with a good old iPhone 5:
[CAMERA_PARAMS]
resolution=[3264 2448]
cx=1668.87585
cy=1226.19712
fx=3288.47697
fy=3078.59787
dist=[-7.416752e-02 1.562157e+00 1.236471e-03 1.237955e-03 -5.378571e+00]
Average err. of reprojection: 1.06726 pixels (OpenCV error=1.06726)
Note that dist means distortion here.
I am conducting experiments on a toy project with these parameters, and they are kind of OK. If you do use them in your own project, please keep in mind that they may be barely good enough to get started. The best option is to follow Milo's recommendation with your own data. The MRPT tool is quite easy to use, with the checkerboard they provide. Hope this helps you get started!

What is the depth image received from Kinect

When I ran this MATLAB code to get the depth image, the result I got was a 480x640 matrix. The minimum element value is 0 and the maximum element value is 2711. What does 2711 mean? Is that the distance from the camera to the farthest part of the image? And what is the unit of 2711: meters, feet, or something else?
I don't know exactly what the MATLAB code does to the depth, but it probably does some processing on it, because the depth sent by the Kinect is an 11-bit value, so it shouldn't be higher than 2048. Try to find out what it does, or get access to the raw data sent by the Kinect.
The data sent by the Kinect is not a proper distance (it's a "disparity"), so you have to do some math to convert it to useful units.
From the OpenKinect project wiki (which contains useful information about the Kinect) :
From their data, a basic first-order approximation for converting the raw 11-bit disparity value to a depth value in centimeters is: 100/(-0.00307 * rawDisparity + 3.33). This approximation is approximately 10 cm off at 4 m away, and less than 2 cm off within 2.5 m.
A better approximation is given by Stéphane Magnenat in this post: distance = 0.1236 * tan(rawDisparity / 2842.5 + 1.1863) in meters. Adding a final offset term of -0.037 centers the original ROS data. The tan approximation has a sum squared difference of .33 cm while the 1/x approximation is about 1.7 cm.
Once you have the distance using the measurement above, a good approximation for converting (i, j, z) to (x, y, z) is:
x = (i - w / 2) * (z + minDistance) * scaleFactor * (w/h)
y = (j - h / 2) * (z + minDistance) * scaleFactor
z = z
where
minDistance = -10
scaleFactor = .0021
These values were found by hand.
You can find more details about the Kinect's depth camera and its calibration on the ROS website (and many others !).
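For reference, the quoted formulas translate into MATLAB roughly as follows (this assumes you can get at the raw 11-bit disparity values in a matrix called rawDisparity; the constants are the hand-tuned ones from the quote above):
[h, w] = size(rawDisparity);          % e.g. h = 480, w = 640
% Stéphane Magnenat's approximation, depth z in meters (with the -0.037 offset)
z = 0.1236 * tan(double(rawDisparity) / 2842.5 + 1.1863) - 0.037;
% Convert (i, j, z) to (x, y, z) using the hand-tuned constants from the quote
minDistance = -10;
scaleFactor = 0.0021;
[i, j] = meshgrid(1:w, 1:h);          % i = column index, j = row index
x = (i - w/2) .* (z + minDistance) * scaleFactor * (w/h);
y = (j - h/2) .* (z + minDistance) * scaleFactor;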
If you map the data to a meter scale it compresses the depth image slightly. I found this was an issue when I was trying to look for planes in the mapped data.