3D trajectory reconstruction from video (taken by a single camera) - matlab

I am currently trying to reconstruct a 3D trajectory of a falling object like a ball or a rock out of a sequence of images taken from an iPhone video.
Where should I start looking? I know I have to calibrate the camera (I think I'll use the matlab calibration toolbox by Jean-Yves Bouguet) and then find the vanishing point from the same sequence, but then I'm really stuck.

read this: http://www.cs.auckland.ac.nz/courses/compsci773s1c/lectures/773-GG/lectA-773.htm
it explains 3d reconstruction using two cameras. Now for a simple summary, look at the figure from that site:
You only know pr/pl, the image points. By tracing a line from their respective focal points Or/Ol you get two lines (Pr/Pl) that both contain the point P. Because you know the 2 cameras origin and orientation, you can construct 3d equations for these lines. Their intersection is thus the 3d point, voila, it's that simple.
But when you discard one camera (let's say the left one), you only know for sure the line Pr. What's missing is depth. Luckily you know the radius of your ball, this extra information can give you the missing depth information. see next figure (don't mind my paint skills):
Now you know the depth using the intercept theorem
I see one last issue: the shape of ball changes when projected under an angle (ie not perpendicular on your capture plane). However you do know the angle, so compensation is possible, but I leave that up to you :p
edit: #ripkars' comment (comment box was too small)
1) ok
2) aha, the correspondence problem :D Typically solved by correlation analysis or matching features (mostly matching followed by tracking in a video). (other methods exist too)
I haven't used the image/vision toolbox myself, but there should definitely be some things to help you on the way.
3) = calibration of your cameras. Normally you should only do this once, when installing the cameras (and every other time you change their relative pose)
4) yes, just put the Longuet-Higgins equation to work, ie: solve
P = C1 + mu1*R1*K1^(-1)*p1
P = C2 + mu2*R2*K2^(-1)*p2
with
P = 3D point to find
C = camera center (vector)
R = rotation matrix expressing the orientation of the first camera in the world frame.
K = calibration matrix of the camera (containing internal parameters of the camera, not to be confused with the external parameters contained by R and C)
p1 and p2 = the image points
mu = parameter expressing the position of P on the projection line from camera center C to P (if i'm correct R*K^-1*p expresses a line equation/vector pointing from C to P)
these are 6 equations containing 5 unknowns: mu1, mu2 and P
edit: #ripkars' comment (comment box too small once again)
The only computer vison library that pops up in my mind is OpenCV (http://opencv.willowgarage.com/wiki ). But that's a C library, not matlab... I guess google is your friend ;)
About the calibration: yes, if those two images contain enough information to match some features. If you change the relative pose of the cameras, you'll have to recalibrate of course.
The choice of the world frame is arbitrary; it only becomes important when you want to analyze the retrieved 3d data afterwards: for example you could align one of the world planes with the plane of motion -> simplified motion equation if you want to fit one.
This world frame is just a reference frame, changeable with a 'change of reference frame transformation' (translation and/or rotation transformation)

Unless you have a stereo camera, you will never be able to know the position for sure, even with calibrated camera. Because you don't know whether the ball is small and close or large and far away.
There are other methods with single camera, based on series of images with different focus. But I doubt that you can control the camera of your cell phone in that way.
Edit(1):
as #GuntherStruyf points out correctly, you can know the position if one of your inputs is the size of the ball.

Related

Restoring the image of a face's plane

I am using an API to analyze faces in Matlab, where I get for each picture a 3X3 rotation matrix of the face's orientation, telling which direction the head is pointing.
I am trying to normalize the image according to that matrix, so that it will be distorted to get the image of the face's plane. This is something like 'undoing' the projection of the face to the camera plane. For example, if the head is directed a little to the left, it will stretch the left side to (more or less) preserve the face's original proportions.
Tried using 'affine2d' and 'projective2d' with 'imwarp', but it didn't achieve that goal
Achieving your goal with simple tools like affine transformations seems impossible to me since a face is hardly a flat surface. An extreme example: Imagine the camera recording a profile view of someone's head. How are you going to reconstruct the missing half of the face?
There have been successful attempts to change the orientation of faces in images and real-time video, but the methods used are quite complex:
[We] propose a gaze correction method that needs just a
single webcam. We apply recent shape deformation techniques
to generate a 3D face model that matches the user’s face. We
then render a gaze-corrected version of this face model and
seamlessly insert it into the original image.
(Giger et al., https://graphics.ethz.ch/Downloads/Publications/Papers/2014/Gig14a/Gig14a.pdf)

Open GL - ES 2.0 : Touch detection

Hi Guys I am doing some work on iOS and the work requires use of OpenGL es. So now I have a bunch of squares, cubes and triangles on the screen. Some of these geometries might overlap. Any ideas/ approaches for touch detection?
Regards
To follow up on the answer already given, squares, cubes and triangles are convex shapes so you can perform ray-object intersection quite easily, even directly from the geometry rather than from the mathematical description of the perfect object.
You're going to need to be able to calculate the distance of a point from the plane and the intersection of a ray with the plane. As a simple test you can implement yourself very quickly, for each polygon on the convex shape work out the intersection between the ray and the plane. Then check whether that point is behind all the planes defined by polygons that share an edge with the one you just tested. If so then the hit is on the surface of the object — though you should be careful about coplanar adjoining polygons and rounding errors.
Once you've found a collision you can easily get the length of the ray to the point of collision. The object with the shortest distance is the one that's in front.
If that's fast enough then great, otherwise you'll probably want to look into partitioning the world or breaking objects down to their silhouettes. Convex objects are really simple — consider all the edges that run between one polygon and the next. If only exactly one of those polygons is front facing then the edge is part of the silhouette. All the silhouettes edges together can be projected to a convex 2d shape on the view plane. You can then test touches by performing a 2d point-in-polygon from that.
A further common alternative that eliminates most of the maths is picking. You'd render the scene to an invisible buffer with each object appearing as a solid blob in a suitably unique colour. To test for touch, you'd just do a glReadPixels and inspect the colour.
For the purposes of glu on the iPhone, you can grab SGI's implementation (as used by MESA). I've used its tessellator in a shipping, production project before.
I had that problem in the past. What I have used is an implementation of glu unproject that you can find on google (it uses the inverse of the model view projection matrix and the viewport size). This allows you to map the 2D screen coordinates to a 3D vector into the world. Then, you can use this vector to intersect with your objects and see which one intersects (or comes really close to doing so).
I do hope there are better ways of doing this, so I look forward to other answers as well!
Once you get the inverse-modelview and cast your ray (vector), you still need to know if the ray intersects your geometry. One approach would be to grab the depth (z in view coordinate system) of the object's center and extend (stretch) your vector just that far. Then see if the vector's "head" ends within the volume of your object or not (you need the objects center and e.g. Its radius, if it's a sphere)

Not able to calibrate camera view to 3D Model

I am developing an app which uses LK for tracking and POSIT for estimation. I am successful in getting rotation matrix, projection matrix and able to track perfectly but the problem for me is I am not able to translate 3D object properly. The object is not fitting in to the right place where it has to fit.
Will some one help me regarding this?
Check this links, they may provide you some ideas.
http://computer-vision-talks.com/2011/11/pose-estimation-problem/
http://www.morethantechnical.com/2010/11/10/20-lines-ar-in-opencv-wcode/
Now, you must also check whether the intrinsic camera parameters are correct. Even a small error in estimating the field of view can cause troubles when trying to reconstruct 3D space. And from your details, it seems that the problem are bad fov angles (field of view).
You can try to measure them, or feed the half or double value to your algorithm.
There are two conventions for fov: half-angle (from image center to top or left, or from bottom to top, respectively from left to right) Maybe you just mixed them up, using full-angle instead of half, or vice-versa
Maybe you can show us how you build a transformation matrix from R and T components?
Remember, that cv::solvePnP function returns inverse transformation (e.g camera in world) - it finds object pose in 3D space where camera is in (0;0;0). For almost all cases you need inverse it to get correct result: {Rt; -T}

Calculating corresponding pixels

I have a computer vision set up with two cameras. One of this cameras is a time of flight camera. It gives me the depth of the scene at every pixel. The other camera is standard camera giving me a colour image of the scene.
We would like to use the depth information to remove some areas from the colour image. We plan on object, person and hand tracking in the colour image and want to remove far away background pixel with the help of the time of flight camera. It is not sure yet if the cameras can be aligned in a parallel set up.
We could use OpenCv or Matlab for the calculations.
I read a lot about rectification, Epipolargeometry etc but I still have problems to see the steps I have to take to calculate the correspondence for every pixel.
What approach would you use, which functions can be used. In which steps would you divide the problem? Is there a tutorial or sample code available somewhere?
Update We plan on doing an automatic calibration using known markers placed in the scene
If you want robust correspondences, you should consider SIFT. There are several implementations in MATLAB - I use the Vedaldi-Fulkerson VL Feat library.
If you really need fast performance (and I think you don't), you should think about using OpenCV's SURF detector.
If you have any other questions, do ask. This other answer of mine might be useful.
PS: By correspondences, I'm assuming you want to find the coordinates of a projection of the same 3D point on both your images - i.e. the coordinates (i,j) of a pixel u_A in Image A and u_B in Image B which is a projection of the same point in 3D.

Screen-to-World coordinate conversion in OpenGLES an easy task?

The Screen-to-world problem on the iPhone
I have a 3D model (CUBE) rendered in an EAGLView and I want to be able to detect when I am touching the center of a given face (From any orientation angle) of the cube. Sounds pretty easy but it is not...
The problem:
How do I accurately relate screen-coordinates (touch point) to world-coordinates (a location in OpenGL 3D space)? Sure, converting a given point into a 'percentage' of the screen/world-axis might seem the logical fix, but problems would arise when I need to zoom or rotate the 3D space. Note: rotating & zooming in and out of the 3D space will change the relationship of the 2D screen coords with the 3D world coords...Also, you'd have to allow for 'distance' in between the viewpoint and objects in 3D space. At first, this might seem like an 'easy task', but that changes when you actually examine the requirements. And I've found no examples of people doing this on the iPhone. How is this normally done?
An 'easy' task?:
Sure, one might undertake the task of writing an API to act as a go-between between screen and world, but the task of creating such a framework would require some serious design and would likely take 'time' to do -- NOT something that can be one-manned in 4 hours...And 4 hours happens to be my deadline.
The question:
What are some of the simplest ways to
know if I touched specific locations
in 3D space in the iPhone OpenGL ES
world?
You can now find gluUnProject in http://code.google.com/p/iphone-glu/. I've no association with the iphone-glu project and haven't tried it yet myself, just wanted to share the link.
How would you use such a function? This PDF mentions that:
The Utility Library routine gluUnProject() performs this reversal of the transformations. Given the three-dimensional window coordinates for a location and all the transformations that affected them, gluUnProject() returns the world coordinates from where it originated.
int gluUnProject(GLdouble winx, GLdouble winy, GLdouble winz,
const GLdouble modelMatrix[16], const GLdouble projMatrix[16],
const GLint viewport[4], GLdouble *objx, GLdouble *objy, GLdouble *objz);
Map the specified window coordinates (winx, winy, winz) into object coordinates, using transformations defined by a modelview matrix (modelMatrix), projection matrix (projMatrix), and viewport (viewport). The resulting object coordinates are returned in objx, objy, and objz. The function returns GL_TRUE, indicating success, or GL_FALSE, indicating failure (such as an noninvertible matrix). This operation does not attempt to clip the coordinates to the viewport or eliminate depth values that fall outside of glDepthRange().
There are inherent difficulties in trying to reverse the transformation process. A two-dimensional screen location could have originated from anywhere on an entire line in three-dimensional space. To disambiguate the result, gluUnProject() requires that a window depth coordinate (winz) be provided and that winz be specified in terms of glDepthRange(). For the default values of glDepthRange(), winz at 0.0 will request the world coordinates of the transformed point at the near clipping plane, while winz at 1.0 will request the point at the far clipping plane.
Example 3-8 (again, see the PDF) demonstrates gluUnProject() by reading the mouse position and determining the three-dimensional points at the near and far clipping planes from which it was transformed. The computed world coordinates are printed to standard output, but the rendered window itself is just black.
In terms of performance, I found this quickly via Google as an example of what you might not want to do using gluUnProject, with a link to what might lead to a better alternative. I have absolutely no idea how applicable it is to the iPhone, as I'm still a newb with OpenGL ES. Ask me again in a month. ;-)
You need to have the opengl projection and modelview matrices. Multiply them to gain the modelview projection matrix. Invert this matrix to get a matrix that transforms clip space coordinates into world coordinates. Transform your touch point so it corresponds to clip coordinates: the center of the screen should be zero, while the edges should be +1/-1 for X and Y respectively.
construct two points, one at (0,0,0) and one at (touch_x,touch_y,-1) and transform both by the inverse modelview projection matrix.
Do the inverse of a perspective divide.
You should get two points describing a line from the center of the camera into "the far distance" (the farplane).
Do picking based on simplified bounding boxes of your models. You should be able to find ray/box intersection algorithms aplenty on the web.
Another solution is to paint each of the models in a slightly different color into an offscreen buffer and reading the color at the touch point from there, telling you which brich was touched.
Here's source for a cursor I wrote for a little project using bullet physics:
float x=((float)mpos.x/screensize.x)*2.0f -1.0f;
float y=((float)mpos.y/screensize.y)*-2.0f +1.0f;
p2=renderer->camera.unProject(vec4(x,y,1.0f,1));
p2/=p2.w;
vec4 pos=activecam.GetView().col_t;
p1=pos+(((vec3)p2 - (vec3)pos) / 2048.0f * 0.1f);
p1.w=1.0f;
btCollisionWorld::ClosestRayResultCallback rayCallback(btVector3(p1.x,p1.y,p1.z),btVector3(p2.x,p2.y,p2.z));
game.dynamicsWorld->rayTest(btVector3(p1.x,p1.y,p1.z),btVector3(p2.x,p2.y,p2.z), rayCallback);
if (rayCallback.hasHit())
{
btRigidBody* body = btRigidBody::upcast(rayCallback.m_collisionObject);
if(body==game.worldBody)
{
renderer->setHighlight(0);
}
else if (body)
{
Entity* ent=(Entity*)body->getUserPointer();
if(ent)
{
renderer->setHighlight(dynamic_cast<ModelEntity*>(ent));
//cerr<<"hit ";
//cerr<<ent->getName()<<endl;
}
}
}
Imagine a line that extends from the viewer's eye
through the screen touch point into your 3D model space.
If that line intersects any of the cube's faces, then the user has touched the cube.
Two solutions present themselves. Both of them should achieve the end goal, albeit by a different means: rather than answering "what world coordinate is under the mouse?", they answer the question "what object is rendered under the mouse?".
One is to draw a simplified version of your model to an off-screen buffer, rendering the center of each face using a distinct color (and adjusting the lighting so color is preserved identically). You can then detect those colors in the buffer (e.g. pixmap), and map mouse locations to them.
The other is to use OpenGL picking. There's a decent-looking tutorial here. The basic idea is to put OpenGL in select mode, restrict the viewport to a small (perhaps 3x3 or 5x5) window around the point of interest, and then render the scene (or a simplified version of it) using OpenGL "names" (integer identifiers) to identify the components making up each face. At the end of this process, OpenGL can give you a list of the names that were rendered in the selection viewport. Mapping these identifiers back to original objects will let you determine what object is under the mouse cursor.
Google for opengl screen to world (for example there’s a thread where somebody wants to do exactly what you are looking for on GameDev.net). There is a gluUnProject function that does precisely this, but it’s not available on iPhone, so that you have to port it (see this source from the Mesa project). Or maybe there’s already some publicly available source somewhere?