Swift: What matrix should be used to convert a 3D point to 2D in ARKit/SceneKit?

I am trying to use the ARCamera matrices to convert a 3D point to 2D in ARKit/SceneKit. Previously, I used projectPoint to get the projected x and y coordinates, which worked fine. However, the app slowed down significantly and would crash when appending long recordings.
So I turned to another approach: using the ARCamera parameters to do the conversion myself. The Apple documentation for projectionMatrix does not give much detail, so I looked into the theory behind projection matrices in The Perspective and Orthographic Projection Matrix and Metal Tutorial. From my understanding, if we have a 3D point P = (x, y, z), in theory we should be able to get the 2D point simply as P'(2D) = P(3D) * projectionMatrix.
Assuming that is the case, I did:
func session(_ session: ARSession, didUpdate frame: ARFrame) {
    guard let arCamera = session.currentFrame?.camera else { return }
    // intrinsics: a matrix that converts between the 2D camera plane and 3D world coordinate space.
    // projectionMatrix: a transform matrix appropriate for rendering 3D content to match the image captured by the camera.
    print("ARCamera ProjectionMatrix = \(arCamera.projectionMatrix)")
    print("ARCamera Intrinsics = \(arCamera.intrinsics)")
}
I am able to get the projection matrix and intrinsics (I even tried printing the intrinsics to see whether they change), but they are all the same for each frame.
ARCamera ProjectionMatrix = simd_float4x4([[1.774035, 0.0, 0.0, 0.0], [0.0, 2.36538, 0.0, 0.0], [-0.0011034012, 0.00073593855, -0.99999976, -1.0], [0.0, 0.0, -0.0009999998, 0.0]])
ARCamera Intrinsics = simd_float3x3([[1277.3052, 0.0, 0.0], [0.0, 1277.3052, 0.0], [720.29443, 539.8974, 1.0]])...
I am not too sure I understand what is happening here, as I was expecting the projection matrix to be different for each frame. Can someone explain the theory behind the projection matrix in SceneKit/ARKit and validate my approach? Am I using the right matrix, or am I missing something in the code?
Thank you so much in advance!

You'd likely need to use the camera's transform matrix as well, as this is what changes between frames: as the user moves the real-world camera around, the virtual camera's transform is updated to best match it. Composing that together with the projection matrix should allow you to get into screen space.
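One way to put that composition into code (just a sketch, not the original poster's implementation; the function name, the viewportSize parameter and the zNear/zFar values are my own assumptions):

import ARKit
import UIKit

// Sketch: world point -> screen point via view matrix + projection matrix.
func projectToScreen(_ worldPoint: simd_float3,
                     camera: ARCamera,
                     viewportSize: CGSize,
                     orientation: UIInterfaceOrientation = .portrait) -> CGPoint {
    // World space -> camera (eye) space. This is the per-frame part.
    let viewMatrix = camera.viewMatrix(for: orientation)
    // Camera space -> clip space. Near/far values chosen arbitrarily here.
    let projectionMatrix = camera.projectionMatrix(for: orientation,
                                                   viewportSize: viewportSize,
                                                   zNear: 0.001,
                                                   zFar: 1000)
    let clip = projectionMatrix * viewMatrix * simd_float4(worldPoint, 1)

    // Perspective divide -> normalized device coordinates in [-1, 1].
    let ndc = simd_float3(clip.x, clip.y, clip.z) / clip.w

    // NDC -> screen points (flip y: NDC y points up, UIKit y points down).
    return CGPoint(x: CGFloat(ndc.x + 1) * 0.5 * viewportSize.width,
                   y: CGFloat(1 - ndc.y) * 0.5 * viewportSize.height)
}

This is essentially the same math projectPoint(_:orientation:viewportSize:) performs internally, so on its own it is unlikely to be dramatically faster. It also shows why the printed projectionMatrix never changes: it only encodes the lens and viewport, while the view matrix (derived from camera.transform) is what changes every frame.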

Related

Camera Intrinsics Resolution vs Real Screen Resolution

I am writing an ARKit app where I need to use camera poses and intrinsics for 3D reconstruction.
The camera intrinsics matrix returned by ARKit seems to use a different image resolution than the mobile screen resolution. Below is one example of this issue.
Intrinsics matrix returned by ARKit is :
[[1569.249512, 0, 931.3638306],[0, 1569.249512, 723.3305664],[0, 0, 1]]
whereas the input image resolution is 750 (width) x 1182 (height). In this case, the principal point seems to lie outside the image, which cannot be right; it should ideally be close to the image center. So the intrinsic matrix above might be using an image resolution of 1920 (width) x 1440 (height), which is completely different from the original image resolution.
The questions are:
Do the returned camera intrinsics correspond to a 1920x1440 image resolution?
If yes, how can I get an intrinsics matrix representing the original image resolution, i.e. 750x1182?
Intrinsics 3x3 matrix
The intrinsic camera matrix converts between the 2D camera plane and 3D world coordinate space. Here's a decomposition of an intrinsic matrix (written out after this list), where:
fx and fy are the focal length in pixels
xO and yO are the principal point offset in pixels
s is an axis skew
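In the usual row-by-row notation that decomposition looks like this (ARKit stores the same values column by column, which is why the principal point shows up in the third column of simd_float3x3):

[fx,  s, xO]
[ 0, fy, yO]
[ 0,  0,  1]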
According to Apple Documentation:
The values fx and fy are the pixel focal length, and are identical for square pixels. The values ox and oy are the offsets of the principal point from the top-left corner of the image frame. All values are expressed in pixels.
So let's examine what your data is:
[1569, 0, 931]
[ 0, 1569, 723]
[ 0, 0, 1]
fx=1569, fy=1569
xO=931, yO=723
s=0
To convert a known focal length in pixels to mm use the following expression:
F(mm) = F(pixels) * SensorWidth(mm) / ImageWidth(pixels)
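For example, plugging in the fx and image width discussed above (the 4.0 mm sensor width below is only a placeholder, not an actual device spec):

// Hypothetical numbers: fx and the image width come from the intrinsics above;
// the sensor width is a placeholder to be replaced with the real spec.
let fxPixels: Float = 1569.0
let imageWidthPixels: Float = 1920.0
let sensorWidthMM: Float = 4.0
let focalLengthMM = fxPixels * sensorWidthMM / imageWidthPixels   // ≈ 3.27 mm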
Points Resolution vs Pixels Resolution
Look at this post to find out what a point resolution and a pixel resolution are.
Let's explore what is what using iPhone X data.
@IBOutlet var arView: ARSCNView!

DispatchQueue.main.asyncAfter(deadline: .now() + 1.0) {
    let imageRez = (self.arView.session.currentFrame?.camera.imageResolution)!
    let intrinsics = (self.arView.session.currentFrame?.camera.intrinsics)!
    let viewportSize = self.arView.frame.size
    let screenSize = self.arView.snapshot().size
    print(imageRez as Any)
    print(intrinsics as Any)
    print(viewportSize as Any)
    print(screenSize as Any)
}
Apple Documentation:
imageResolution instance property describes the image in the capturedImage buffer, which contains image data in the camera device's native sensor orientation. To convert image coordinates to match a specific display orientation of that image, use the viewMatrix(for:) or projectPoint(_:orientation:viewportSize:) method.
iPhone X imageRez (4:3 aspect ratio); these values correspond to the camera sensor resolution:
(1920.0, 1440.0)
iPhone X intrinsics:
simd_float3x3([[1665.0, 0.0, 0.0], // first column
[0.0, 1665.0, 0.0], // second column
[963.8, 718.3, 1.0]]) // third column
iPhone X viewportSize (one third of screenSize per dimension, i.e. one ninth by area):
(375.0, 812.0)
iPhone X screenSize (resolution declared in tech spec):
(1125.0, 2436.0)
Note that there's no snapshot() method on RealityKit's ARView.
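As for the second question in the post above (getting an intrinsics matrix for a different resolution): if the image is only uniformly resized, you can scale fx/ox by the width ratio and fy/oy by the height ratio. A sketch under that assumption (a real 750 x 1182 portrait viewport additionally involves a rotation and an aspect-fill crop, which shifts the principal point; ARFrame's displayTransform(for:viewportSize:) describes that mapping for display coordinates):

import simd
import CoreGraphics

// Rescale pinhole intrinsics from the resolution they were reported at
// to a uniformly resized image. Assumes a pure resize - no crop, no rotation.
func scaledIntrinsics(_ K: simd_float3x3,
                      from original: CGSize,
                      to target: CGSize) -> simd_float3x3 {
    let sx = Float(target.width / original.width)
    let sy = Float(target.height / original.height)
    var scaled = K
    scaled[0][0] *= sx   // fx (first column)
    scaled[1][1] *= sy   // fy (second column)
    scaled[2][0] *= sx   // ox (third column holds the principal point)
    scaled[2][1] *= sy   // oy
    return scaled
}

Because 1920 x 1440 and 750 x 1182 do not even share an aspect ratio, a straight per-axis scale like this only makes sense if the image was genuinely stretched; for the usual aspect-fill case you also have to subtract the crop offset from ox and oy.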

Understanding ARKit World Transform Matrices

In ARKit, when I perform a hit-test, I get back an instance of ARHitTestResult. One of the properties of this is worldTransform, which I understand contains a 4x4 transformation matrix of the position of the object – simd_float4x4.
As someone who is very unfamiliar with linear algebra and 3D graphics, how would I edit this matrix to, say, increase its y coordinate by 0.05?
If there is a blog post or something I could look at that would help me wrap my head around this, please let me know. I am not sure what terms I should be googling.
Sorry if my question is full of misunderstandings! As you can probably tell, I am not too familiar with these concepts.
Thank you to anyone who helps.
EDIT: The original question is best addressed by just adding 0.05 to the y component of the node's position. However, the original answer below does address a bit about composing transformation matrices, if that is something you are interested in.
======================================================================
If you want to apply an operation to a matrix, the most immediately simple way is to make a matrix that does that operation, and then multiply your original matrix by that new matrix.
For a translation, assuming you want to translate by x, y, z, you can do this:
let translation = simd_float4x4(
    float4(1, 0, 0, 0),
    float4(0, 1, 0, 0),
    float4(0, 0, 1, 0),
    float4(x, y, z, 1)
)
Note that this is just an identity matrix (1 down the diagonal) with the last column set to contain the x/y/z values (important: the float4s above are COLUMNS, not ROWS, even though they visually read like rows). You can research homogeneous coordinates further, but think of this as simply how a translation is represented.
Then, in simd, just do this: let newWorldTransform = translation * oldWorldTransform and you will have the old world transform translated by your x/y/z translation values (in your example, [x, y, z] = [0, 0.05, 0]).
However, it may be worth exploring why you want to edit your hit test results. I cannot think of a practical use case for that, so maybe if you explain a bit more about what you are trying to do I could suggest a more intuitive way to do it.
Matrices in 3D graphics are the usual way to translate, rotate, scale and shear 3D objects. In ARKit, RealityKit and SceneKit, for consistent linear transformations you need to use simd_float4x4-like matrices:
var myMatrix: simd_float4x4
A translation 4x4 matrix has 16 elements: 4 elements (a float4) in each of its 4 columns. The column indices are 0, 1, 2 and 3, and a translation matrix uses the fourth column, with index 3.
SceneKit example
Use the following code to place your model 25 cm above its default position SCNVector3(0,0,0):
let sphereNode = SCNNode()
sphereNode.geometry = SCNSphere(radius: 1.0)
sphereNode.geometry?.firstMaterial?.diffuse.contents = UIColor.red
scene.rootNode.addChildNode(sphereNode)
var translation = matrix_identity_float4x4
translation.columns.3.y = 0.25
sphereNode.simdWorldTransform = translation
RealityKit example
let model = ModelEntity(mesh: .generateBox(size: 0.3))
let anchor = AnchorEntity()
anchor.addChild(model)
let currentMatrix = anchor.transform.matrix
var positionMatrix = matrix_identity_float4x4   // identity, not simd_float4x4(), which would be all zeros
positionMatrix.columns.3.y = 0.25
let transform = simd_mul(currentMatrix, positionMatrix)
anchor.move(to: transform, relativeTo: model, duration: 1.0)

Why do vertices of a quad and the localScale of the quad not match in Unity?

I have a Quad whose vertices I'm printing like this:
public MeshFilter quadMeshFilter;
foreach (var vertex in quadMeshFilter.mesh.vertices)
{
print(vertex);
}
And, the localScale like this:
public GameObject quad;
print(quad.transform.localScale);
Vertices are like this:
(-0.5, -0.5), (0.5, 0.5), (0.5, -0.5), (-0.5, 0.5)
while the localScale is:
(6.4, 4.8, 0)
How is this possible, given that the vertices make a square but the localScale does not?
How do I use vertices and draw another square in front of the quad?
I am not well versed in the matters of meshes, but I believe I know the answer to this question.
Answer
How is this possible
Scale is a value your mesh's size is multiplied by in the given directions (x, y, z). A scale of 1 is the default size, a scale of 2 is double size, and so on. Your localSpace coordinates will then be multiplied by this scale.
Say a localSpace coordinate is (1, 0, 2) and the scale is (3, 1, 3); the result is then (1*3, 0*1, 2*3) = (3, 0, 6).
How do I use vertices and draw another square in front of the quad?
I'd personally just create the object and then move it via Unity's Transform system, since it allows you to change the worldSpace coordinates using transform.position = new Vector3(1f, 5.4f, 3f);
You might be able to move each individual vertex in WorldSpace too, but I haven't tried that before.
I imagine it is related to this bit of code though: vertices[i] = transform.TransformPoint(vertices[i]); since TransformPoint converts from localSpace to worldSpace based on the Transform using it.
Elaboration
Why do I get lots of 0's and 5's in my space coordinates despite them having other positions in the world?
If I print the vertices of a quad using the script below, I get results which have 3 coordinates and can therefore be multiplied by localScale.
Script:
Mesh mesh = GetComponent<MeshFilter>().mesh;
var vertices = mesh.vertices;

Debug.Log("Local Space.");
foreach (var v in vertices)
{
    Debug.Log(v);
}
This first result is what we call local space.
There also exists something called WorldSpace. You can convert between local- and worldSpace.
localSpace is the objects mesh vertices in relation to the object itself while worldSpace is the objects location in the Unity scene.
Running the extended script below prints first the localSpace coordinates (as above), then the WorldSpace coordinates converted from those local coordinates. Here is the script:
Mesh mesh = GetComponent<MeshFilter>().mesh;
var vertices = mesh.vertices;

Debug.Log("Local Space.");
foreach (var v in vertices)
{
    Debug.Log(v);
}

Debug.Log("World Space");
for (int i = 0; i < vertices.Length; ++i)
{
    vertices[i] = transform.TransformPoint(vertices[i]);
    Debug.Log(vertices[i]);
}
Good luck with your future learning process.
This becomes clear once you understand how Transform hierarchies work. It's a tree in which each parent Transform (position, rotation and scale; rotation is actually a quaternion, but let's assume Euler angles for simplicity so the math works) is applied to its children. By extension of this philosophy, the mesh itself can be seen as a child of the GameObject that holds it.
If you imagine a 1x1 quad (which is what your vertices describe) parented to a GameObject whose Transform has a non-one localScale, all the vertices in the mesh get multiplied by that value, and all the positions are added.
Now if you parent that object to another GameObject and give it another localScale, this will again multiply all the vertex positions by that scale, translate them by its position, and so on.
To answer your question: the global positions of your vertices are different from those contained in the source mesh because they are fed through a chain of Transforms all the way up to the scene root, as the composition below shows.
This is both the reason that we only have localScale and not scale, and also the reason why non-uniform scaling of objects which contain rotated children can sometimes give very strange results. Transforms stack.
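In matrix terms (a standard result, nothing Unity-specific), the world-space position of a vertex is its local position pushed through every Transform on the path up to the scene root:

v_world = T_root * ... * T_parent * T_object * v_local

where each T is that node's translate-rotate-scale (TRS) matrix. Every extra parent adds one more matrix to the product, which is exactly why the printed world-space vertices differ from the numbers stored in the mesh.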

ARKit: How to select a group of 3D points from a 2D frame?

So the question is this: I've got an ARPointCloud with a bunch of 3D points and I'd like to select them based on a 2D frame, from the perspective of the camera / screen.
I was thinking about converting the 2D frame to a 3D frustum and checking whether the points are inside that 3D frustum box. I am not sure if this is the ideal method, and not even sure how to do it.
Would anyone know how to do this, or have a better method of achieving this?
Given the size of the ARKit frame W x H and the camera intrinsics we can create planes for the view frustum sides.
For example, using C++ / Eigen we can construct our four planes (which pass through the origin) as:
std::vector<Eigen::Vector3d> frustumPlanes;
frustumPlanes.emplace_back(Eigen::Vector3d( fx,   0, cx - W));
frustumPlanes.emplace_back(Eigen::Vector3d(-fx,   0, -cx));
frustumPlanes.emplace_back(Eigen::Vector3d(  0,  fy, cy - H));
frustumPlanes.emplace_back(Eigen::Vector3d(  0, -fy, -cy));
We can then clip a 3D point by checking its position against the z < 0
half-space and the four sides of the frustum:
auto pointIsVisible = [&](const Eigen::Vector3d& P) -> bool {
    if (P.z() >= 0) return false;   // behind camera
    for (auto&& N : frustumPlanes) {
        if (P.dot(N) < 0)
            return false;           // outside frustum plane
    }
    return true;
};
Note that it is best to perform this clipping in 3D (before the projection) since points behind or near the camera or points far outside
the frustum can have unstable projection values (u,v).
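For the ARKit side of the question, a rough Swift translation of the same idea might look like the sketch below (not part of the answer above and not battle-tested; in particular the orientation passed to viewMatrix(for:) and the plane sign conventions may need adjusting to match ARKit's camera-space axes):

import ARKit
import simd

// Sketch: keep only the raw feature points that fall inside the camera frustum.
// Assumes world-space points and the landscape-oriented intrinsics ARKit reports.
func visiblePoints(in frame: ARFrame) -> [simd_float3] {
    guard let cloud = frame.rawFeaturePoints else { return [] }

    let K = frame.camera.intrinsics
    let fx = K[0][0], fy = K[1][1]
    let cx = K[2][0], cy = K[2][1]
    let W = Float(frame.camera.imageResolution.width)
    let H = Float(frame.camera.imageResolution.height)

    // The four side planes through the camera origin, as in the Eigen version.
    let planes: [simd_float3] = [
        simd_float3( fx,  0, cx - W),
        simd_float3(-fx,  0, -cx),
        simd_float3(  0, fy, cy - H),
        simd_float3(  0, -fy, -cy)
    ]

    // World -> camera space; the camera looks down -Z in camera space.
    let view = frame.camera.viewMatrix(for: .landscapeRight)

    return cloud.points.filter { p in
        let pc4 = view * simd_float4(p, 1)
        let pc = simd_float3(pc4.x, pc4.y, pc4.z)
        if pc.z >= 0 { return false }                       // behind the camera
        return planes.allSatisfy { simd_dot(pc, $0) >= 0 }  // inside every side plane
    }
}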

Finding the angle between the 3D view vector and the normal - back-face culling

Finding the angle between the view vector and the surface normal is useful for determining the visible surfaces, since we use it for back-face culling and for obtaining contours and crease edges of the object.
To obtain the visible surfaces I use the back face culling code below:
N = normals(vertex,faces);
BC = barycenter(vertex,faces);
back_facing = sum(N.*bsxfun(@minus,BC,campos),2)<=0
t.FaceVertexCData = 1*(sum(N.*bsxfun(@minus,BC,campos),2)<=0)
t.FaceVertexCData(sum(N.*bsxfun(@minus,BC,campos),2)>0) = nan;
faces1=faces(t.FaceVertexCData(:)==1,:);
facesv=sort(unique(faces1(:)));
How does one obtain the angle?
r=(sum(N.*bsxfun(@minus,BC,campos),2))
rr=bsxfun(@minus,BC,campos);
V_mag= sqrt(rr(:,1).^2+rr(:,2).^2+rr(:,3).^2);
N_mag= sqrt(N(:,1).^2+N(:,2).^2+N(:,3).^2);
for i = 1:(size(r,1))
    A(i)=acosd(r(i)/(N_mag(i).*V_mag(i)));
end
This is what I have done thus far. I am not sure if it is correct, and the code is slow.
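For reference, with V = BC - campos (the vector from the camera position to each face barycenter), the loop above evaluates the standard angle-between-vectors formula element-wise:

theta = acosd( dot(N, V) / (|N| * |V|) )

so the formula itself is the right one. Note also that the sign of dot(N, V) alone already separates front-facing from back-facing faces (it has the same sign as cos(theta)), so the acosd call is only needed when you want the actual angle, e.g. for contour or crease detection.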