How do video encoding standards (like H.264) then serialize motion prediction?

Brute-force motion prediction algorithms, in a nutshell, work like this (if I'm not mistaken):
Search every possible macroblock in the search window
Compare each of them with the reference macroblock
Take the one that is the most similar and encode the DIFFERENCE between the frames instead of the actual frame.
Now this in theory makes sense to me. But when it gets to the actual serialization I'm lost. We've found the most similar block. We know where it is, and from that we can calculate its distance vector. Let's say it's about 64 pixels to the right.
Basically, when serializing this block, we do:
Ignore everything but luminosity (encode only Y, I think I saw this somewhere?), and take note of the difference between it and the reference block
Encode the motion, a distance vector
Encode the MSE, so we can reconstruct it
Is the output of this a simple 2D array of luminosity values, with an appended/prepended MSE value and distance vector? Where is the compression in this? Is it just that we get to drop the UV components? There seem to be many resources that cover video encoders at a surface level, but it's very hard to find actual in-depth explanations of modern video encoders. Feel free to correct me on my above statements.

Grossly oversimplified:
Encoders include built-in decoder functionality. That generates a reference frame for the encoder to use. It's the same frame, inaccuracies and all, that comes out of the decoder at the far end for display to the viewer.
Motion estimation, which can be absent, simple, or complex, generates a motion vector for each block (a 16x16 macroblock, or a sub-partition as small as 4x4 in H.264), by comparing the reference frame to the input frame.
The decoders (both the built-in one and the one at the far end) apply them to their current decoded image.
Then the encoder generates the pixel-by-pixel differences between the input image and the decoded image, compresses them, and sends them to the decoder. H.264 first uses lossy integer transforms (a form of discrete cosine transform) on the luma and chroma channels. Then it applies lossless entropy coding to the output of the integer transforms. ("zip" and "gzip" are examples of lossless entropy coding, but not the codings used in H.264.)
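As a concrete illustration of the transform step, here is a minimal sketch in MATLAB/Octave of the 4x4 forward core integer transform used in H.264. The per-coefficient scaling and quantization that normally follow this stage are omitted, and the residual block is a made-up example:
% H.264-style 4x4 forward core integer transform (scaling/quantization omitted).
Cf = [1  1  1  1;
      2  1 -1 -2;
      1 -1 -1  1;
      1 -2  2 -1];
residual = randi([-16 16], 4, 4);   % hypothetical 4x4 residual block
W = Cf * residual * Cf';            % transform coefficients, integer arithmetic only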
The point of motion estimation is to reduce the differences between the input image and the reference image before encoding those differences.
(This is for P frames. It's more complex for B frames.)
Dog-simple motion estimation could compute a single overall vector and apply it to all macroblocks in the image. That would be useful for applications where the primary source of motion is slowly panning and tilting the camera.
More complex motion estimation can be optimized for one or more talking heads. Another way to handle it would be to detect multiple arbitrary objects and track each object's movement from frame to frame.
And, if an encoder cannot generate motion vectors at all, everything works the same on the decoder.
The complexity of motion estimation is a feature of the encoder. The more compute cycles it can use to search for motion, the fewer image differences there will be from frame to frame, and so the fewer image-difference bits need to be sent to the far end for the same image-difference quantization level. So, the viewer gets better picture quality for the same number of bits per second, or alternatively the same picture quality for fewer bits per second.
Motion estimation can analyze the luma only, or the luma and chroma. The motion vectors are applied to the luma and both chroma channels in all cases.
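To connect this back to the question's brute-force description, here is a minimal sketch of exhaustive block matching for a single 16x16 block, using the sum of absolute differences (SAD) as the similarity metric. The frames ref and cur are assumed to be luma-only double matrices, and bx, by are hypothetical block coordinates:
% Brute-force search for one 16x16 block of the current frame in the reference frame.
B = 16; R = 16;                          % block size and search range (+/- pixels)
blk = cur(by:by+B-1, bx:bx+B-1);
bestSAD = inf; mv = [0 0];
for dy = -R:R
  for dx = -R:R
    y = by + dy; x = bx + dx;
    if y < 1 || x < 1 || y+B-1 > size(ref,1) || x+B-1 > size(ref,2)
      continue;                          % candidate falls outside the reference frame
    end
    cand = ref(y:y+B-1, x:x+B-1);
    sad = sum(abs(blk(:) - cand(:)));    % similarity metric
    if sad < bestSAD
      bestSAD = sad; mv = [dx dy];       % best motion vector found so far
    end
  end
end
residual = blk - ref(by+mv(2):by+mv(2)+B-1, bx+mv(1):bx+mv(1)+B-1);
% The encoder signals 'mv' plus the transformed, quantized, entropy-coded 'residual';
% it does not send the raw pixel block or an MSE value.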

Can the baseline between two cameras be determined from an uncalibrated rectified image pair?

Currently, I am working on a short project about stereo vision.
I'm trying to create depth maps of a scene. For this, I use my phone from two viewpoints and follow the code/workflow provided by MATLAB: https://nl.mathworks.com/help/vision/ug/uncalibrated-stereo-image-rectification.html
Following this code I am able to create nice disparity maps, but I want to know the depths (as in meters). For this, I need the baseline, focal length and disparity, as shown here: https://www.researchgate.net/figure/Relationship-between-the-baseline-b-disparity-d-focal-length-f-and-depth-z_fig1_2313285
The focal length and base-line are known, but not the baseline. I have estimated the Fundamental Matrix. Is there a way to get from the Fundamental Matrix to the baseline, or, by making some assumptions, to get to the Essential Matrix and from there to the baseline?
I would be thankful for any hint in the right direction!
"The focal length and base-line are known, but not the baseline."
I guess you mean the disparity map is known.
Without a known or estimated calibration matrix, you cannot determine the essential matrix.
(See Multiple View Geometry in Computer Vision by Hartley and Zisserman for details.)
With respect to your available data, you cannot compute a metric reconstruction. From the fundamental matrix, you can only extract camera matrices in a canonical form that allow for a projective reconstruction and will not reproduce the true baseline of the setup. A projective reconstruction is a reconstruction that differs from the metric result by an unknown projective transformation.
Non-trivial techniques can be used to upgrade such a reconstruction to a Euclidean (metric) one. However, the success of these self-calibration techniques depends strongly on the quality of the data. Thus, using images from a calibrated camera is actually the best way to go.
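For completeness, here is a minimal MATLAB/Octave sketch, assuming you do know (or can approximate) the calibration matrix K, e.g. from the phone's EXIF focal length, of how the essential matrix and the baseline direction could be extracted. K and F are assumed to be given; note that t comes out as a unit vector, so the metric baseline length still requires an external scale:
% Essential matrix from the fundamental matrix and the (assumed known) intrinsics K.
E = K' * F * K;
[U, ~, V] = svd(E);
if det(U) < 0, U = -U; end       % enforce proper rotations
if det(V) < 0, V = -V; end
W = [0 -1 0; 1 0 0; 0 0 1];
R1 = U * W  * V';                % two rotation candidates
R2 = U * W' * V';
t  = U(:, 3);                    % baseline DIRECTION only (unit length, sign ambiguous)
% The four (R, t) combinations must be resolved with a cheirality (points-in-front) check,
% and the true baseline length needs extra information, e.g. an object of known size.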

How can I change the exposure of an sRGB image?

I wish to "normalize" the exposure of a set of images before doing further processing. I tried the following:
1) convert sRGB to CIE_XYZ per Wikipedia page on sRGB;
2) multiply or divide "Y" by 2 to achieve a 1 stop EV change;
3) convert CIE_XYZ back to sRGB.
The problem is that step 3 frequently yields negative values (they arise after the matrix multiplication that converts back to linear RGB).
In particular, my test set of sRGB values has the form (n,n,n) where 0 <= n <= 255.
I would expect these to be near the center of the gamut, and that a 1 stop change would not push me out of the gamut.
What is wrong with this approach?
I believe that user:1146345's comment is the most accurate, in that it refers to the non-linearity introduced by the raw->rgb conversion. So, for example, conversion from sRGB -> linear RGB -> multiply by 2^(delta stops) -> sRGB will not work well near the ends of the curve. But we don't know how to characterize this non-linearity, since it most likely varies by camera.
There Are Good Linear Image Apps
Using an application such as Adobe After Effects, converting to linear is trivial, and most of the tools you need remain available. Unfortunately, Photoshop's implementation of 32-bit float linear is less functional.
Nevertheless, once you are in 32-bit floating-point linear space (gamma 1.0), all the linear math you do behaves like light in the real world. In the film/TV industry, we work in linear most of the time, if not in After Effects, then in Nuke or Fusion, etc.
Human perception, however, is NOT linear, so while linear math on linearized image data will behave the way light does, it won't be linear relative to perception. If you want to use linear math to affect perception in a linear way, then you need to be in a perceptually uniform colorspace such as CIELAB.
Let's assume, though, that you want to change photometric "exposure": then you want to affect light values as they would be affected in the real world, and so you need to linearize your image data. AE has tools to help here: you'd first set your project to 32-bit floating point, and then select an appropriate profile and "linearize". Make sure you turn ON display color management.
When you import an image, use the appropriate profile to "unwind" it into linear space.
If you are not using AE but MATLAB or Octave, then invert the sRGB transfer curve (aka gamma) to unwind the image into linear space.
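A minimal sketch of that workflow in MATLAB/Octave (the input file name and the one-stop change are made-up examples; values are assumed to be in [0,1] after im2double):
% Linearize sRGB, scale exposure by 2^stops, and re-encode to sRGB.
img = im2double(imread('photo.jpg'));          % hypothetical input image
stops = 1;                                     % +1 stop; use -1 to halve the exposure

srgb2lin = @(c) (c <= 0.04045) .* (c ./ 12.92) + ...
                (c >  0.04045) .* (((c + 0.055) ./ 1.055) .^ 2.4);
lin2srgb = @(c) (c <= 0.0031308) .* (12.92 .* c) + ...
                (c >  0.0031308) .* (1.055 .* c .^ (1/2.4) - 0.055);

lin = srgb2lin(img) .* 2^stops;                % exposure change in linear light
lin = min(max(lin, 0), 1);                     % clip to the displayable range
out = lin2srgb(lin);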
S Curves etc.
I see some of the comments regarding cameras/debayering algorithms adding S curves aka "soft clip" at the high or low ends. Going to CIEXYZ is not going to help this, and only adds unneeded matrix math.
Stay in sRGB
You will typically be fine just linearizing the sRGB and staying in linear RGB for your various manipulations. If you are scaling luminance by 2, then you are probably going to want to adjust the high clip anyway; any soft clip is just going to be scaled along with the rest of the image, and that is really not an issue. As long as you are in 32-bit floating point, you won't have any significant quantization errors, and you can adjust the S curves after the exposure change.
If you want, you can use "Curves" to adjust/expand the high end. AE also has a built-in RAW importer, so you can import directly from RAW and set it to not compress highlights.
If you don't have access to the RAW and only have the JPG, then again, it should be fine as long as you stay in linear 32-bit. After all your manipulations, just re-apply the gamma curve, and the original S curves will remain intact relative to the image highlights, which is usually what you want.
Plug Ins and Gamma
Note that AE and PS and others do have "exposure" plug-ins that can effect this change.
BUT ALSO:
Keep in mind that if you want to emulate real film, each of the color records has a different gamma, and in film they interact more than the digital values in sRGB, which essentially remain separate.
If you are trying to emulate a film look, try using the LEVELS plug-in and playing with the gamma/hi/lo of each color channel separately, or do the same using CURVES.

DWT: What is it, and when and where do we use it?

I was reading up on the DWT for the first time, and the document stated that it is used to represent the time-frequency content of a signal, which other transforms do not provide.
But when I look for a usage example of the DWT in MATLAB I see the following code:
X=imread('cameraman.tif');
X=im2double(X);
[F1,F2]= wfilters('db1', 'd');
[LL,LH,HL,HH] = dwt2(X,'db1','d');
I am unable to understand the implementation of dwt2 or rather what is it and when and where we use it. What actually does dwt2 return and what does the above code do?
The first two statements simply read in the image, and convert it so that the dynamic range of each channel is between [0,1] through im2double.
Now, the third statement, wfilters constructs the wavelet filter banks for you. These filter banks are what are used in the DWT. The method of the DWT is the same, but you can use different kinds of filters to achieve specific results.
Basically, with wfilters, you get to choose what kind of filter you want (in your case, you chose db1: Daubechies), and you can optionally specify the type of filter that you want. Different filters provide different results and have different characteristics. There are a lot of different wavelet filter banks you could use and I'm not quite the expert as to the advantages and disadvantages for each filter bank that exists. Traditionally, Daubechies-type filters are used so stick with those if you don't know which ones to use.
Not specifying the type will output both the decomposition and the reconstruction filters. Decomposition is the forward transformation where you are given the original image / 2D data and want to transform it using the DWT. Reconstruction is the reverse transformation where you are given the transform data and want to recreate the original data.
The fourth statement, dwt2, computes the 2D DWT for you, but we will get into that later.
You specified the flag d, so you want only the decomposition filters. You can use wfilters as input into the 2D DWT if you wish, as this will specify the low-pass and high-pass filters that you want to use when decomposing your image. You don't have to do it like this. You can simply specify what filter you want to use, which is how you're calling the function in your code. In other words, you can do this:
[F1,F2]= wfilters('db1', 'd');
[LL,LH,HL,HH] = dwt2(X,F1,F2);
... or you can just do this:
[LL,LH,HL,HH] = dwt2(X,'db1','d');
The above statements are the same thing. Note that there is a 'd' flag on the dwt2 function because you want the forward transform as well.
Now, dwt2 is the 2D DWT (Discrete Wavelet Transform). I won't go into the DWT in detail here because this isn't the place to talk about it, but I would definitely check out this link for better details. They also have fully working MATLAB code and their own implementation of the 2D DWT so you can fully understand what exactly the DWT is and how it's computed.
However, the basics behind the 2D DWT is that it is known as a multi-resolution transform. It analyzes your signal and decomposes your signal into multiple scales / sizes and features. Each scale / size has a bunch of features that describe something about the signal that was not seen in the other scales.
One thing about the DWT is that it naturally subsamples your image by a factor of 2 (i.e. halves each dimension) after the analysis is done, hence the multi-resolution bit I was talking about. In MATLAB, dwt2 outputs four different variables, which correspond to the output names used in your code:
LL - Low-Low. This means that the vertical direction of your 2D image / signal is low-pass filtered as well as the horizontal direction.
LH - Low-High. This means that the vertical direction of your 2D image / signal is low-pass filtered while the horizontal direction is high-pass filtered.
HL - High-Low. This means that the vertical direction of your 2D image / signal is high-pass filtered while the horizontal direction is low-pass filtered.
HH - High-High. This means that the vertical direction of your 2D image / signal is high-pass filtered as well as the horizontal direction.
Roughly speaking, LL corresponds to just the structural / predominant information of your image while HH corresponds to the edges of your image. The LH and HL components I'm not too familiar with, but they're used in feature analysis sometimes. If you want to do a further decomposition, you would apply the DWT again on the LL only. However, depending on your analysis, the other components are used.... it just depends on what you want to use it for! dwt2 only performs a single-level DWT decomposition, so if you want to use this again for the next level, you would call dwt2 on the LL component.
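For example, a second decomposition level and the corresponding single-level reconstructions might look like this (idwt2 inverts dwt2; variable names follow the code above):
% Second decomposition level: apply dwt2 again to the LL (approximation) band only.
[LL2, LH2, HL2, HH2] = dwt2(LL, 'db1');

% Single-level reconstructions: idwt2 recovers the previous level from its four subbands.
LL_rec = idwt2(LL2, LH2, HL2, HH2, 'db1');
X_rec  = idwt2(LL, LH, HL, HH, 'db1');   % back to (approximately) the original image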
Applications
Now, for your specific question about applications: the DWT for images is mostly used in image compression and image analysis. One application of the 2D DWT is in JPEG 2000. The core of the algorithm is that the image is broken down into DWT subbands, and the coefficients are then coded so that the least significant ones can be omitted before the image is saved. This way, you eliminate extraneous information, and there is also the great benefit that the wavelet transform can be made lossless: JPEG 2000's lossless mode uses a reversible integer wavelet (CDF 5/3), which means you can reconstruct the original data without any artifacts or quantization errors. JPEG 2000 also has a lossy option (using the irreversible CDF 9/7 wavelet), where you can reduce the file size even more by discarding more of the DWT coefficients in a way that is imperceptible to the average user.
Another application is in watermarking images. You can embed information in the wavelet coefficients so that it prevents people from trying to steal your images without acknowledgement. The DWT is also heavily used in medical image analysis and compression as the images generated in this domain are quite high resolution and quite large. It would be extremely useful if you could represent the images in the same way but occupying less physical space in comparison to the standard image compression algorithms (that are also lossy if you want high compression ratios) that exist.
One more application I can think of would be the dynamic delivery of video content over networks. Depending on what your connection speed is or the resolution of your screen, you get a lower or higher quality video. If you specifically use the LL component of each frame, you would stream / use a particular version of the LL component depending on what device / connection you have. So if you had a bad connection or if your screen has a low resolution, you would most likely show the video with the smallest size. You would then keep increasing the resolution depending on the connection speed and/or the size of your screen.
This is just a taste of what the DWT is used for (personally, I don't use it because it's applied in domains I don't have any experience in), but there are a lot more quite useful applications where the DWT shows up.

Hardware accelerated image comparison/search?

I need to find the position of a smaller image inside a bigger image. The smaller image is a subset of the bigger image. The requirement is also that pixel values can slightly differ for example if images were produced by different JPEG compressions.
I've implemented the solution by comparing bytes using the CPU but I'm now looking into any possibility to speed up the process.
Could I somehow utilize OpenGLES and thus iPhone GPU for it?
Note: images are grayscale.
@Ivan, this is a pretty standard problem in video compression (finding the position of the current macroblock in the previous frame). You can use a metric for the difference in pixels such as the sum of absolute differences (SAD), the sum of squared differences (SSD), or the sum of Hadamard-transformed differences (SATD). I assume you are not trying to compress video but rather looking for something like a watermark.
In many cases, you can use a gradient-descent-type search to find a local minimum (best match), based on the empirical observation that comparing an image (your small image) to a slightly offset version of itself (a match whose position hasn't been found exactly) produces a closer metric than comparing it to a random part of another image. So you can start by sampling the space of all possible offsets/positions (motion vectors in video encoding) rather coarsely, and then do local optimization around the best result. The local optimization works by comparing a match to some number of neighboring matches, moving to the best of those if any is better than your current match, and repeating. This is very much faster than brute force (checking every possible position), but it may not work in all cases (it depends on the nature of what is being matched).
Unfortunately, this type of algorithm does not translate very well to the GPU, because each step depends on the previous steps. It may still be worth it: if you check, e.g., 16 neighbors of the current position for a 256x256 image, that is enough parallel computation to send to the GPU, and yes, it absolutely can be done in OpenGL ES. The answer really depends on whether you're doing a brute-force or a local-minimization-type search, and whether local minimization would work for you.
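A minimal sketch of the coarse-plus-local-refinement idea described above, in MATLAB/Octave (big and small are assumed to be grayscale double matrices, and step controls the coarse sampling; names are illustrative only):
% Coarse sampling of the SAD surface, followed by greedy local refinement.
[H, W] = size(big); [h, w] = size(small);
sad = @(y, x) sum(sum(abs(big(y:y+h-1, x:x+w-1) - small)));

step = 8; best = inf; pos = [1 1];
for y = 1:step:H-h+1                       % coarse pass over the offset space
  for x = 1:step:W-w+1
    s = sad(y, x);
    if s < best, best = s; pos = [y x]; end
  end
end

moved = true;                              % greedy descent on the SAD surface
while moved
  moved = false;
  for d = [0 1; 0 -1; 1 0; -1 0]'          % 4-neighbourhood of the current position
    y = pos(1) + d(1); x = pos(2) + d(2);
    if y >= 1 && x >= 1 && y <= H-h+1 && x <= W-w+1
      s = sad(y, x);
      if s < best, best = s; pos = [y x]; moved = true; end
    end
  end
end
% pos now holds the (row, col) of the best local match found.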

Generate quantization matrix

How can I generate quantization matrices with different sizes and quality levels? Is there a function in MATLAB for this?
Please explain your context: a quantization matrix... for what? If you are dealing with JPEG image compression (image blocks + DCT + quantization + Huffman coding), the compressor has the freedom to use its own quantization matrix, or rather a family of matrices, one for each "quality factor".
Conceptually, one usually wants to assign many bits to the low-frequency components and few to the high frequencies, but that's about all that can be said in general.
Also, be aware that JPEG compresses luminance and chroma separately (with chroma usually subsampled), so one can use different matrices for each.
I believe the standard suggests some typical matrices, e.g. together with a scaling factor for different qualities, but this is not required at all. Also, by googling you can find many of the matrices used by various cameras and image apps.
Update: From here:
Tuning the quantization tables for best results is something of a black art, and is an active research area. Most existing encoders use simple linear scaling of the example tables given in the JPEG standard, using a single user-specified "quality" setting to determine the scaling multiplier. This works fairly well for midrange qualities (not too far from the sample tables themselves) but is quite nonoptimal at very high or low quality settings.
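As an illustration of that linear-scaling approach, here is a sketch in MATLAB of the libjpeg-style quality scaling applied to the example 8x8 luminance table from Annex K of the JPEG standard (the 1-100 quality convention shown here is common, but not mandated by the standard):
% Example luminance quantization table from Annex K of the JPEG standard.
base = [16 11 10 16  24  40  51  61;
        12 12 14 19  26  58  60  55;
        14 13 16 24  40  57  69  56;
        14 17 22 29  51  87  80  62;
        18 22 37 56  68 109 103  77;
        24 35 55 64  81 104 113  92;
        49 64 78 87 103 121 120 101;
        72 92 95 98 112 100 103  99];

quality = 75;                           % 1 (worst) .. 100 (best)
if quality < 50
    scale = 5000 / quality;             % libjpeg-style scaling factor
else
    scale = 200 - 2 * quality;
end
Q = floor((base * scale + 50) / 100);
Q = min(max(Q, 1), 255);                % keep entries in the valid 8-bit range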