WebP technology is used for both lossy and lossless image compression and compresses images more than JPEG. I'm studying image compression techniques, so if anyone can provide me with a clear algorithm for WebP image compression, it would be very helpful to me.
Oh well, providing the entire algorithm for both lossy and lossless compression, including all variants in full detail, is well beyond a single answer on this forum. Google provides the specifications for the lossy and lossless bit streams for free. However, the latter is quite incomplete, inaccurate, and partially even wrong. You will have a hard time implementing a codec based on that spec alone, and it is indispensable to study a fair amount of source code as well.
I can give you here at least some details about the lossless format:
Like PNG, which uses the ZIP DEFLATE compression algorithm, WebP uses Huffman encoding for all image information. Actually, much of WebP's compression gain stems from cleverly employed Huffman coding. Many details are obviously borrowed from DEFLATE, e.g. limiting the code length to 15 bits and encoding the Huffman alphabets with yet another Huffman code, which is almost the same as the one DEFLATE uses.
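For illustration, here is a minimal Python sketch (my own, not taken from the spec) of how canonical Huffman codes are reconstructed from a list of code lengths, which is the scheme DEFLATE (RFC 1951) describes and WebP lossless reuses:

    def canonical_huffman_codes(code_lengths, max_bits=15):
        """Assign canonical Huffman codes from code lengths, the way DEFLATE
        (RFC 1951) does it; code_lengths[symbol] == 0 means 'symbol unused'."""
        # Count how many codes exist for each code length.
        bl_count = [0] * (max_bits + 1)
        for length in code_lengths:
            if length:
                bl_count[length] += 1
        # Compute the smallest code value for each code length.
        next_code = [0] * (max_bits + 1)
        code = 0
        for bits in range(1, max_bits + 1):
            code = (code + bl_count[bits - 1]) << 1
            next_code[bits] = code
        # Hand out consecutive codes to the symbols, in symbol order.
        codes = {}
        for symbol, length in enumerate(code_lengths):
            if length:
                codes[symbol] = format(next_code[length], f"0{length}b")
                next_code[length] += 1
        return codes

    # Symbols 0..3 with lengths 2, 1, 3, 3 produce a prefix-free code set:
    print(canonical_huffman_codes([2, 1, 3, 3]))   # {0: '10', 1: '0', 2: '110', 3: '111'}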
Additional compression is achieved by LZ77 sliding-window compression and an optional color cache of 2..2048 entries, holding the most recently used colors. The encoder is free to decide which of the two to use, and both can be mixed. However, I've found in my tests that mixing them can be detrimental to the compression ratio. Photographic images usually compress fine with a big color cache, and when the cache is used, it's usually better to encode pixels not present in the cache as literal ARGB values all the time, rather than using LZ77 backward references.
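To make the color cache idea concrete, here is a small Python sketch. It follows my reading of the lossless spec: a direct-mapped table of the most recently seen ARGB values, indexed by a multiplicative hash (the multiplier 0x1e35a7bd is the one I recall from the spec; treat it as an assumption):

    class ColorCache:
        """Direct-mapped cache of recently seen ARGB pixels (sketch)."""

        def __init__(self, cache_bits):            # size = 2 ** cache_bits, i.e. 2..2048
            self.bits = cache_bits
            self.table = [None] * (1 << cache_bits)

        def _key(self, argb):
            # Multiplicative hash; the constant is my recollection of the spec's value.
            return ((0x1e35a7bd * argb) & 0xFFFFFFFF) >> (32 - self.bits)

        def lookup(self, argb):
            """Return the cache index if the pixel is cached, else None."""
            key = self._key(argb)
            return key if self.table[key] == argb else None

        def insert(self, argb):
            self.table[self._key(argb)] = argb

    # Encoder side: emit a short cache index on a hit, a literal otherwise.
    cache = ColorCache(cache_bits=11)              # 2048 slots
    for argb in (0xFF112233, 0xFF112233, 0xFF445566):
        idx = cache.lookup(argb)
        if idx is not None:
            print("cache hit, index", idx)
        else:
            print("literal", hex(argb))
            cache.insert(argb)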
While the use of LZ77 compression is pretty much the same as in DEFLATE, Google's specification is not based on bytes but on pixels. That is, ARGB quadruples are compressed, rather than individual A-R-G-B bytes. Moreover, WebP allows backward-reference lengths up to 4096 pixels and reference distances up to 1048576 - 120 pixels, which is quite beyond the DEFLATE limits. Another benefit is obtained by using separate Huffman alphabets for the ARGB channels.
Like PNG, the WebP LZ77 compression has an RLE (Run Length Encoding) feature, which results from clever handling of the special case where the reference length exceeds the reference distance. In this case, the available pixels are copied over and over again until the specified length is reached. I've found that using this feature yields great compression for "artificial" images with long runs of the same color. However, it conflicts with the color cache, which also generates quite efficient Huffman codes for such runs. Based on my preliminary tests, the color cache outperforms the RLE feature.
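The "RLE via overlapping copy" trick looks like this in a Python sketch (pixels represented as plain integers):

    def lz77_copy(output, distance, length):
        """Append `length` pixels copied from `distance` pixels back in `output`.
        When length > distance the copy overlaps its own output, so a short
        back-reference behaves like run-length encoding."""
        start = len(output) - distance
        for i in range(length):
            output.append(output[start + i])   # may read a pixel written in this loop
        return output

    # A run of one color: emit the pixel once, then reference it with distance 1.
    print(lz77_copy([0xFF00FF00], distance=1, length=7))   # eight identical pixels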
Like PNG, WebP doesn't perform well on raw ARGB data. Therefore, both formats allow various transforms to be applied to the pixel data before compression begins. Those transforms do a great job of reducing the variance of the pixel stream and account for a great deal of the resulting compression ratio. WebP defines 13 standard predictor transforms, while PNG has only 4. However, I've found that most of the predictors don't yield much gain, and I usually employ the "Select" filter, which picks either the pixel to the left or the one above, whichever appears to be the more appropriate predictor. As with PNG, the simple filters are frequently better than the complex ones.
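As I read the spec, the "Select" predictor works roughly as in this Python sketch (pixels as (A, R, G, B) tuples; double-check the exact tie-breaking rule against the spec):

    def select_predictor(left, top, topleft):
        """WebP 'Select' predictor (sketch): form the gradient estimate
        L + T - TL per channel, then return whichever of left/top is closer
        to that estimate in Manhattan distance."""
        estimate = [l + t - tl for l, t, tl in zip(left, top, topleft)]
        dist_left = sum(abs(e - l) for e, l in zip(estimate, left))
        dist_top = sum(abs(e - t) for e, t in zip(estimate, top))
        return left if dist_left < dist_top else top

    # The residual actually stored is pixel - prediction (per channel, mod 256).
    print(select_predictor((255, 100, 90, 80), (255, 98, 91, 79), (255, 99, 90, 80)))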
Besides predictor transforms, WebP offers some other transforms, of which I've tried only "Subtract Green", which attempts to decorrelate the RGB channels by subtracting G from both R and B for each pixel. Indeed, I've observed some benefit when it is applied after the predictor. If applied before, it may have a negative impact on photographic images.
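The transform itself is tiny; here is a Python sketch of the forward and inverse operations:

    def subtract_green(pixels):
        """'Subtract Green' (sketch): decorrelate channels by subtracting the
        green value from red and blue, modulo 256."""
        return [(a, (r - g) & 0xFF, g, (b - g) & 0xFF) for a, r, g, b in pixels]

    def add_green(pixels):
        """Inverse transform, as applied by the decoder."""
        return [(a, (r + g) & 0xFF, g, (b + g) & 0xFF) for a, r, g, b in pixels]

    px = [(255, 120, 118, 116), (255, 60, 58, 61)]
    assert add_green(subtract_green(px)) == px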
WebP uses 5 separate Huffman codes for the bitstream: one for the green channel, the LZ77 length codes, and the color cache indexes; one for the red channel; one for the blue channel; one for the alpha channel; and one for the LZ77 distance codes. This is a clever design, since the information in the ARGB channels might be quite uncorrelated, so merging them into a single alphabet can be suboptimal.
WebP lossless offers a wide range of options that can be combined and tweaked to the max. However, in my opinion, most of the combinations are not worthwhile to test. Based on my observations, compression is usually good with the following defaults:
Use the "Select" predictor.
Apply the "Subtract Green" transform.
Use a color cache with 2048 slots.
If the current pixel is not in the cache, encode it as a literal (a rough sketch of the resulting per-pixel loop follows this list).
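Putting those defaults together, the per-pixel encoding decision looks roughly like the Python sketch below. The emit_cache_index and emit_literal callbacks are hypothetical stand-ins for the real bitstream writer, and the cache object is the ColorCache sketch from above:

    def encode_pixels(pixels, cache, emit_cache_index, emit_literal):
        """Per-pixel loop under the defaults above (sketch): try the color
        cache first, fall back to a literal ARGB code otherwise."""
        for argb in pixels:
            idx = cache.lookup(argb)
            if idx is not None:
                emit_cache_index(idx)    # short Huffman-coded cache symbol
            else:
                emit_literal(argb)       # one Huffman code per ARGB channel
                cache.insert(argb)       # remember the color for later pixels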
Related
Below is a snippet taken from MATLAB's documentation describing imwrite:
imwrite(___,Name,Value) specifies additional parameters for output GIF, HDF, JPEG, PBM, PGM, PNG, PPM, and TIFF files, using one or more name-value pair arguments. You can specify Name,Value after the input arguments in any of the previous syntaxes.
Here are some Name,Value pair arguments that you can use with imwrite for compression:
'Compression' — Compression scheme
'Quality' — Quality of JPEG-compressed file
So an example of imwrite doing the compression is: imwrite(A,'C:\project\lena.jpg','jpeg','Quality',Q) where Q is the quality factor.
I'm new to MATLAB. I read that MATLAB's imwrite can do JPEG compression to a specific quality factor (as described above). Hence, I am curious what the difference is (if any) between doing JPEG compression using imwrite and using dct2, when all other factors are fixed as much as possible (e.g., same quality factor, same input image). Ignore any difference in computational cost; I'm just focusing on the resulting JPEG image.
Update:
I found a paper detailing the differences between these two techniques on page 4, section 3.3.1, "Technical issues with zero embedding costs". Here is a snippet taken from the paper:
When using the slow DCT (implemented using ‘dct2’ in Matlab), the number of 1/2-coefficients is small and does not affect security at least for low-quality factors. However, in the fast-integer implementation of DCT (e.g., Matlab’s imwrite), all Dij are multiples of 1/8. Thus, with decreasing quantization step (increasing JPEG quality factor), the number of 1/2-coefficients increases.
The "1/2 coefficients" above refers to the unquantized DCT coefficients having a "0.5" decimal point. It is also known as the "rounding error". Based on this, it seems like there is indeed a difference in the resulting unquantized DCT coefficients for different quality factors.
Motion prediction brute-force algorithms, in a nutshell, work like this (if I'm not mistaken):
Search every possible macroblock in the search window
Compare each of them with the reference macroblock
Take the one that is the most similar and encode the DIFFERENCE between the frames instead of the actual frame.
Now this in theory makes sense to me. But when it gets to the actual serializing I'm lost. We've found the most similar block. We know where it is, and from that we can calculate the distance vector of it. Let's say it's about 64 pixels to the right.
Basically, when serializing this block, we do:
Ignore everything but luminance (encode only Y; I think I saw this somewhere?), and take note of the difference between it and the reference block
Encode the motion, a distance vector
Encode the MSE, so we can reconstruct it
Is the output of this a simple 2D array of luminance values, with an appended/prepended MSE value and distance vector? Where is the compression in this? Do we have to take out the UV components? There seem to be many resources that cover video encoders at a surface level, but it's very hard to find actual in-depth explanations of modern video encoders. Feel free to correct me on my statements above.
Grossly oversimplified:
Encoders include built-in decoder functionality. That generates a reference frame for the encoder to use. It's the same frame, inaccuracies and all, that comes out of the decoder at the far end for display to the viewer.
Motion estimation, which can be absent, simple, or complex, generates a motion vector for each 4x4 or 16x16 macroblock, by comparing the reference frame to the input frame.
The decoders (both the built-in one and the one at the far end) apply those motion vectors to their current decoded image.
Then the encoder generates the pixel-by-pixel differences between the input image and the decoded image, compresses them, and sends them to the decoder. H.264 first uses lossy integer transforms (a form of discrete cosine transform) on the luma and chroma channels. Then it applies lossless entropy coding to the output of the integer transforms. ("zip" and "gzip" are examples of lossless entropy coding, but not the codings used in H.264.)
The point of motion estimation is to reduce the differences between the input image and the reference image before encoding those differences.
(This is for P frames. It's more complex for B frames.)
Dog-simple motion estimation could compute a single overall vector and apply it to all macroblocks in the image. That would be useful for applications where the primary source of motion is slowly panning and tilting the camera.
More complex motion estimation can be optimized for one or more talking heads. Another way to handle it would be to detect multiple arbitrary objects and track each object's movement from frame to frame.
And if an encoder cannot generate motion vectors at all, everything still works the same at the decoder.
The complexity of motion estimation is a feature of the encoder. The more compute cycles it can use to search for motion, the fewer image differences there will be from frame to frame, and so the fewer image-difference bits need to be sent to the far end for the same image-difference quantization level. So, the viewer gets better picture quality for the same number of bits per second, or alternatively the same picture quality for fewer bits per second.
Motion estimation can analyze the luma only, or the luma and chroma. The motion vectors are applied to the luma and both chroma channels in all cases.
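To tie the pieces together, here is a rough Python/NumPy sketch (luma only, P-frame case, whole-pixel motion, no block partitioning, and with the transform/entropy-coding stages left out) of what happens to one macroblock once a motion vector has been found:

    import numpy as np

    def motion_compensated_residual(ref, cur, block_xy, mv, block=16):
        """Shift the block's position into the reference frame by the motion
        vector, then return the pixel-by-pixel residual that would go on to
        the transform and entropy-coding stages."""
        x, y = block_xy                  # top-left corner of the block in `cur`
        dx, dy = mv                      # motion vector found by the search
        predicted = ref[y + dy : y + dy + block, x + dx : x + dx + block]
        actual = cur[y : y + block, x : x + block]
        return actual.astype(int) - predicted.astype(int)

    # Toy frames: the current frame is the reference shifted 2 pixels right, so
    # the right motion vector makes the residual all zeros, almost free to code.
    ref = np.tile(np.arange(64, dtype=np.uint8), (64, 1))
    cur = np.roll(ref, 2, axis=1)
    res = motion_compensated_residual(ref, cur, block_xy=(16, 16), mv=(-2, 0))
    print(np.abs(res).sum())   # 0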
I have read a lot about image encoding techniques, e.g. Bag of Visual Words, VLAD or Fisher Vectors.
However, I have a very basic question: we know that we can perform descriptor matching (brute force or by exploiting ANN techniques). My question is: why don't we just use them?
From my knowledge, Bag of Visual Words vectors have hundreds of thousands of dimensions per image in order to be an accurate representation. If we consider an image with a thousand SIFT descriptors (which is already a considerable number), we have 128 thousand floating-point numbers, which is usually fewer than the number of dimensions of BoVW, so it's not for memory reasons (at least if we are not considering large-scale problems; there, VLAD/FV codes are preferred).
Then why do we use such encoding techniques? Is it for performance reasons?
I had a hard time understanding your question.
Concerning descriptor matching: brute-force and ANN matching techniques are used in retrieval systems. Recent matching techniques include KD-trees, hashing, etc.
BoVW is a traditional representation scheme. At one time, BoVW combined with an inverted index was the state of the art in information retrieval systems. But the dimensionality (memory usage per image) of the BoVW representation (up to millions) limits the number of images that can be indexed in practice.
FV and VLAD are both compact visual representations with high discriminative ability, something BoVW lacked. VLAD is known to be extremely compact (around 32 KB per image for typical settings), very discriminative, and efficient in retrieval and classification tasks.
So yes, such encoding techniques are used for performance reasons.
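For concreteness, here is a rough NumPy sketch of VLAD aggregation. It assumes a k-means codebook has already been trained offline, and it omits the refinements (power/intra-normalization, PCA compression) used in practice:

    import numpy as np

    def vlad(descriptors, codebook):
        """Aggregate local descriptors (e.g. 128-D SIFT) into one VLAD vector:
        assign each descriptor to its nearest codebook centroid, sum the
        residuals per centroid, and L2-normalize the concatenation."""
        k, d = codebook.shape
        dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        v = np.zeros((k, d))
        for i in range(k):
            members = descriptors[assignment == i]
            if len(members):
                v[i] = (members - codebook[i]).sum(axis=0)
        v = v.reshape(-1)
        norm = np.linalg.norm(v)
        return v / norm if norm > 0 else v

    # 1000 "SIFT-like" descriptors and a 64-word codebook give one 8192-D vector
    # per image, regardless of how many descriptors the image had.
    rng = np.random.default_rng(0)
    descriptors = rng.normal(size=(1000, 128))
    codebook = rng.normal(size=(64, 128))
    print(vlad(descriptors, codebook).shape)   # (8192,)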
You may check this paper for a deeper understanding: "Aggregating local descriptors into a compact image representation".
I need to find the position of a smaller image inside a bigger image. The smaller image is a subset of the bigger image. Another requirement is that pixel values may differ slightly, for example if the images were produced by different JPEG compressions.
I've implemented the solution by comparing bytes using the CPU but I'm now looking into any possibility to speed up the process.
Could I somehow utilize OpenGLES and thus iPhone GPU for it?
Note: images are grayscale.
@Ivan, this is a pretty standard problem in video compression (finding the position of the current macroblock in the previous frame). You can use a metric for the difference in pixels such as the sum of absolute differences (SAD), sum of squared differences (SSD), or sum of Hadamard-transformed differences (SATD). I assume you are not trying to compress video but rather looking for something like a watermark.

In many cases, you can use a gradient-descent type search to find a local minimum (best match), based on the empirical observation that comparing an image (your small image) to a slightly offset version of the same (a match whose position hasn't been found exactly) produces a closer metric than comparing to a random part of another image. So you can start by sampling the space of all possible offsets/positions (motion vectors in video encoding) rather coarsely, and then do local optimization around the best result. The local optimization works by comparing a match to some number of neighboring matches and moving to the best of those if any is better than your current match, then repeating. This is very much faster than brute force (checking every possible position), but it may not work in all cases (it depends on the nature of what is being matched).

Unfortunately, this type of algorithm does not translate very well to the GPU, because each step depends on the previous steps. It may still be worth it: if you check, e.g., 16 neighbors of the current position for a 256x256 image, that is enough parallel computation to send to the GPU, and yes, it absolutely can be done in OpenGL ES. However, the answer really depends on whether you're doing a brute-force or a local-minimization type search, and whether local minimization would work for you.
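Here is a rough NumPy sketch of the coarse-sampling-plus-local-refinement search described above, on single-channel (grayscale) images; the function names and parameters are mine:

    import numpy as np

    def sad(big, small, x, y):
        """Sum of absolute differences between `small` and the window of `big`
        whose top-left corner is at (x, y)."""
        h, w = small.shape
        return int(np.abs(big[y:y+h, x:x+w].astype(int) - small.astype(int)).sum())

    def find_subimage(big, small, step=8):
        """Coarse grid sampling of candidate positions, then hill-climbing
        among the 8 neighbours of the best candidate until no neighbour improves."""
        h, w = small.shape
        H, W = big.shape
        candidates = [(x, y) for y in range(0, H - h + 1, step)
                             for x in range(0, W - w + 1, step)]
        best = min(candidates, key=lambda p: sad(big, small, *p))
        best_cost = sad(big, small, *best)
        while True:
            x, y = best
            neighbours = [(x + dx, y + dy)
                          for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                          if (dx or dy) and 0 <= x + dx <= W - w and 0 <= y + dy <= H - h]
            nbest = min(neighbours, key=lambda p: sad(big, small, *p))
            ncost = sad(big, small, *nbest)
            if ncost >= best_cost:
                return best, best_cost
            best, best_cost = nbest, ncost

    # Toy usage: cut a patch out of an image and find it again. The patch here
    # lies on the coarse grid; off-grid positions rely on the hill-climb, which
    # needs reasonably smooth image content (pure noise gives it no gradient).
    rng = np.random.default_rng(1)
    big = rng.integers(0, 256, size=(256, 256)).astype(np.uint8)
    small = big[96:128, 48:80]
    print(find_subimage(big, small))   # ((48, 96), 0)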
How can I generate quantization matrices with different sizes and qualities? Is there a function in MATLAB for this?
Please explain your context: a quantization matrix for what? If you are dealing with JPEG image compression (image blocks + DCT + quantization + Huffman coding), the compressor is free to use its own quantization matrix, or rather a family of matrices, one for each "quality factor".
Conceptually, one usually wants to assign many bits to the low-frequency components and few to the high frequencies, but that's about all that can be said in general.
Also, be aware that JPEG compresses luminance and chroma separately (and the chroma is usually subsampled), so one can use different matrices for each.
I believe the standard suggests some typical matrices, e.g. including a scaling factor for different qualities, but this is not required at all. Also, you can find (by googling!) many matrices used by various cameras and image apps.
Update: From here:
Tuning the quantization tables for best results is something of a black art, and is an active research area. Most existing encoders use simple linear scaling of the example tables given in the JPEG standard, using a single user-specified "quality" setting to determine the scaling multiplier. This works fairly well for midrange qualities (not too far from the sample tables themselves) but is quite nonoptimal at very high or low quality settings.
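To make that concrete, here is a small Python/NumPy sketch using the example luminance table from Annex K of the JPEG standard and, to the best of my knowledge, the linear scaling rule libjpeg applies (its jpeg_quality_scaling function); for other block sizes you would have to supply your own base matrix:

    import numpy as np

    # Example luminance quantization table from Annex K of the JPEG standard.
    BASE_LUMA = np.array([
        [16, 11, 10, 16,  24,  40,  51,  61],
        [12, 12, 14, 19,  26,  58,  60,  55],
        [14, 13, 16, 24,  40,  57,  69,  56],
        [14, 17, 22, 29,  51,  87,  80,  62],
        [18, 22, 37, 56,  68, 109, 103,  77],
        [24, 35, 55, 64,  81, 104, 113,  92],
        [49, 64, 78, 87, 103, 121, 120, 101],
        [72, 92, 95, 98, 112, 100, 103,  99],
    ])

    def quant_table(quality, base=BASE_LUMA):
        """Scale a base quantization matrix by a quality factor in 1..100,
        following the libjpeg-style linear scaling."""
        quality = min(max(int(quality), 1), 100)
        scale = 5000 // quality if quality < 50 else 200 - 2 * quality
        table = (base * scale + 50) // 100
        return np.clip(table, 1, 255).astype(int)

    print(quant_table(50))   # reproduces the base table
    print(quant_table(90))   # much smaller steps: higher quality, bigger files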