YUV to RGB for libvpx/WebM

Does anybody know the correct matrices for YUV -> RGB and RGB -> YUV that are used by libvpx/WebM? When I use a standard one from Wikipedia, my video output looks a bit different from VLC's; the colors are stronger.

This seems to be the same issue as "need to create a webm video from RGB frames".
There is one set of matrices for SD video and another for HD video, so keep that in mind. Also, the RGB-to-YUV matrix is independent of the codec; the UV sample location, however, depends on the codec.
You can also take a look at http://www.fourcc.org/fccyvrgb.php to understand the conversion issue better
EDIT: Explanation:
The problem is not the equations per se but the understanding. Let me explain:
Analog data on a component cable, when converted to digital, is supposed to be in the range 16-235 for Y and 16-240 for Cb and Cr. Correctly captured data will always be in that range. For such data:
Y601 = 0.299 R' + 0.587 G' + 0.114 B'
However, much computer software uses 0-255 as the range. For that, the more appropriate equation is:
Y601 = 0.257 R' + 0.504 G' + 0.098 B' + 16
For HD data the color conversion scheme is BT.709, which changes the equation to:
Y709 = 0.213 R' + 0.715 G' + 0.072 B'
If your range is 0-255, the conversion should be:
Y709 = 0.183 R' + 0.614 G' + 0.062 B' + 16
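As a rough illustration, here is the 0-255-range BT.601 equation above in MATLAB; the Cb and Cr rows are the companion studio-swing equations from the same BT.601 formulation (they are not quoted above, so treat them as my addition):
rgb = double(imread('peppers.png'));       % any 0-255 R'G'B' image
R = rgb(:,:,1); G = rgb(:,:,2); B = rgb(:,:,3);
Y  =  0.257*R + 0.504*G + 0.098*B + 16;    % luma, nominally 16-235
Cb = -0.148*R - 0.291*G + 0.439*B + 128;   % chroma, nominally 16-240
Cr =  0.439*R - 0.368*G - 0.071*B + 128;   % chroma, nominally 16-240
For HD content you would swap in the BT.709 coefficients in the same way.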
I suggest you read www.compression.ru/download/articles/color_space/ch03.pdf once.

VP8 has no colorspace information.
VP9 uses BT.601 (the default) or RGB color spaces. There are other modes, but they are mostly unused or unsupported.

Related

Encoding a value in gray code with floating point with negatives

My objective here is to be able to convert any number between -4.0 and 4.0 into a 5-bit binary string using Gray code. I also need to be able to convert back to decimal.
Thanks for any help you can provide.
If it helps, the bigger picture here is that I'm taking the weights from a neural network and mutating them as a binary string.
If you have only 5 bits available, you can only encode 2^5 = 32 different input values.
Gray code is useful because, as the input value changes slowly (one step at a time), only a single bit changes in the coded value.
So maybe the most straightforward implementation is to map your input range -4.0 to 4.0 to the integer range 0 ... 31, and then to represent these integers by a standard Gray code, which can easily be converted back to 0 ... 31 and then to -4.0 to 4.0, as in the sketch below.
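A minimal MATLAB sketch of that mapping (grayEncode/grayDecode are just illustrative names, each saved as its own function file):
function g = grayEncode(x)
    % Map -4.0..4.0 onto the integers 0..31, then binary -> Gray.
    n = uint8(round((x + 4) / 8 * 31));
    g = bitxor(n, bitshift(n, -1));
end

function x = grayDecode(g)
    % Gray -> binary via cumulative XOR, then back to -4.0..4.0.
    n = g;
    shift = bitshift(n, -1);
    while shift > 0
        n = bitxor(n, shift);
        shift = bitshift(shift, -1);
    end
    x = double(n) / 31 * 8 - 4;
end
dec2bin(grayEncode(1.5), 5) then gives the 5-bit string; note that the round trip can only recover one of the 32 quantized levels, not the exact original value.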

Confusion in different HOG codes

I have downloaded three different HOG codes.
Using an image of 64x128:
1) Using the MATLAB function extractHOGFeatures:
[hog, vis] = extractHOGFeatures(img,'CellSize',[8 8]);
The size of hog is 3780.
How to calculate:
HOG feature length, N, is based on the image size and the function parameter values.
N = prod([BlocksPerImage, BlockSize, NumBins])
BlocksPerImage = floor((size(I)./CellSize - BlockSize)./(BlockSize - BlockOverlap) + 1)
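Plugging in the numbers (assuming the usual defaults of BlockSize = [2 2], BlockOverlap = [1 1] and NumBins = 9) reproduces the 3780:
CellSize = [8 8]; BlockSize = [2 2]; BlockOverlap = [1 1]; NumBins = 9;
imgSize  = [128 64];                 % size(I) for a 64x128 image
BlocksPerImage = floor((imgSize./CellSize - BlockSize)./(BlockSize - BlockOverlap) + 1);
N = prod([BlocksPerImage, BlockSize, NumBins])   % 15*7 * 2*2 * 9 = 3780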
2) The second HOG function is downloaded from here.
The same image is used:
H = hog( double(rgb2gray(img)), 8, 9 );
% I - [mxn] color or grayscale input image (must have type double)
% sBin - [8] spatial bin size
% oBin - [9] number of orientation bins
The size of H is 3024
How to calculate:
H - [m/sBin-2 n/sBin-2 oBin*4] computed hog features
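For the same 64x128 image those dimensions give the 3024:
m = 128; n = 64; sBin = 8; oBin = 9;
numelH = (m/sBin - 2) * (n/sBin - 2) * (oBin*4)   % 14 * 6 * 36 = 3024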
3) HoG code from vl_feat.
cellSize = 8;
hog = vl_hog(im2single(rgb2gray(img)), cellSize, 'verbose','variant', 'dalaltriggs') ;
vl_hog: image: [64 x 128 x 1]
vl_hog: descriptor: [8 x 16 x 36]
vl_hog: number of orientations: 9
vl_hog: bilinear orientation assignments: no
vl_hog: variant: DalalTriggs
vl_hog: input type: Image
The output length is 4608.
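(That length again just follows from the reported descriptor dimensions: prod([8 16 36]) = 4608.)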
Which one is correct?
All are correct. The thing is, the default parameters of HOG feature extraction functions vary between packages (e.g. OpenCV, MATLAB, scikit-image, etc.). By parameters I mean window size, stride, block size, scale, etc.
Usually HOG descriptor length is :
Length = Number of Blocks x Cells in each Block x Number of Bins in each Cell
Since all are correct, which one you should use can be answered in many ways.
You can experiment with different parameter values and choose the one that suits you. Since there is no fixed way to find the right values, it helps to know how a change in each parameter affects the result.
Cell size: if you increase this, you may not capture small details.
Block size: again, a large block with a large cell size may not help you capture the small details. Also, since a large block means the illumination variation within it can be greater, a lot of detail will be lost in the gradient normalization step. So choose accordingly.
Overlap/stride: choosing overlapping blocks helps you capture more information about the image patch. Usually the stride is set to half the block size.
You can capture a lot of information by choosing the values of the above parameters accordingly, but the descriptor length will become unnecessarily long.
Hope this helps :)

JPEG SOF0 subsampling components

With JPEGsnoop, for an image that is 4:2:2 horizontal (YYCbCr),
I see this in the SOF0:
Component[1]: ID=0x01, Samp Fac=0x21 (Subsamp 1 x 1), Quant Tbl Sel=0x00 (Lum: Y)
Component[2]: ID=0x02, Samp Fac=0x11 (Subsamp 2 x 1), Quant Tbl Sel=0x01 (Chrom: Cb)
Component[3]: ID=0x03, Samp Fac=0x11 (Subsamp 2 x 1), Quant Tbl Sel=0x01 (Chrom: Cr)
Now, where are the values 0x21 and 0x11 coming from?
I know that sampling factors are stored like this: one byte, with bits 0-3 vertical and bits 4-7 horizontal.
But I don't see how 0x11 relates to 2x1 and 0x21 to 1x1.
I expected to see 0x11 for the Y component and not 0x21
(not sure how you get 0x21 as a result).
Can somebody explain these values and how you calculate them, for example for 4:2:2 horizontal (16x8)?
JPEG does it bassackwards. The values indicate RELATIVE SAMPLING RATES.
The highest sampling rate is for Y (2). The sampling rate for Cb and Cr is 1.
Use the highest sampling rate to normalize to pixels:
2Y = Cb = Cr.
Y = 1/2 Cb = 1/2 Cr.
For every Y pixel value in that direction you use 1/2 a Cb and Cr pixel value.
According to the JPEG standard, you could even have something like:
4Y = 3Cb = 1Cr
Y = 3/4Cb = 1/4 Cr
or
3Y=2Cb=1Cr
Y=2/3Cb=1/3Cr
But most decoders could not handle that.
The labels like "4:4:4", "4:2:2", and "4:4:0" are just that: labels that are not in the JPEG standard. Quite frankly, I don't even know where those terms come from, and they are not intuitive at all (there is never a zero sampling rate).
Let me add another way of looking at this problem. But first, you have to keep in mind that the JPEG standard itself is not implementable. Things necessary to encode images are undefined and the standard is sprawling with unnecessary stuff.
If a scan is interleaved (all three components), it is encoded in minimum coded units (MCUs). An MCU consists of 8x8 encoded blocks.
The sampling rate specifies the number of 8x8 blocks in an MCU.
You have 2x1 for Y + 1x1 for Cb and 1x1 for Cr. That means a total of 4 8x8 blocks are in an MCU. While I mentioned other theoretical values above, the maximum number of blocks in an MCU is 10. Thus 4x4 + 3x3 + 2x2 is not possible.
The JPEG standard does not say how those blocks are mapped to pixels in an image. We usually use the largest value and say that we have a 2x1 zone, or 16x8 pixels.
But all kinds of weirdness is possible under the standard, such as:
Y = 2x1, Cb = 1x2 and Cr = 1x1
That would probably mean an MCU maps to a 16x16 block of pixels, but your decoder would probably not support this. Alternatively, it might mean an MCU maps to a 16x8 block of pixels and the Cb component has more values in the vertical (8-pixel) direction.
A final way of viewing this (the practical way) is to use the Y component as a reference point. Assume that Y is always going to have 1 or 2 (and maybe 4) as the sampling rate in the X and Y directions, and that the rates for Cb and Cr are going to be 1 (and maybe 2). The Y component always defines the pixels in the image.
These would then be realistic possibilities:
Y Cb Cr
1x1, 1x1, 1x1
2x2, 1x1, 1x1
4x4, 1x1, 1x1
2x1, 1x1, 1x1
1x2, 1x1, 1x1
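A small MATLAB sketch of how the SOF0 bytes from the question decode (high nibble = horizontal factor, low nibble = vertical factor) and how the MCU geometry falls out for this 4:2:2-horizontal file:
sampFac = [hex2dec('21') hex2dec('11') hex2dec('11')];   % Y, Cb, Cr
h = bitshift(sampFac, -4);          % horizontal factors: [2 1 1]
v = bitand(sampFac, 15);            % vertical factors:   [1 1 1]
blocksPerMCU = sum(h .* v)          % 2 + 1 + 1 = 4 blocks of 8x8
mcuPixels = [8*max(h) 8*max(v)]     % 16 x 8 pixels per MCU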

How to calculate the number of parameters for GoogLeNet?

I have a pretty good understanding of AlexNet and VGG. I could verify the number of parameters used in each layer against what is reported in their respective papers.
However, when I try to do the same for the GoogLeNet paper "Going Deeper with Convolutions", even after many iterations I am NOT able to verify the numbers they have in Table 1 of the paper.
For example, the first layer is the good old plain convolution layer with kernel size 7x7, 3 input maps, and 64 output maps. So based on this, the number of parameters needed would be (3 * 49 * 64) + 64 (bias), which is around 9.5k, but they say they use 2.7k. I did the math for the other layers as well, and I am always off by a few percent from what they report. Any idea?
Thanks
I think the first line (2.7k) is wrong, but the rest of the lines of the table are correct.
Here is my computation:
http://i.stack.imgur.com/4bDo9.jpg
Be careful to check which input is connected to which layer,
e.g. for the layer "inception_3a/5x5_reduce":
input = "pool2/3x3_s2" with 192 channels
dims_kernel = C*S*S =192x1x1
num_kernel = 16
Hence parameter size for that layer = 16*192*1*1 = 3072
Looks like they divided the numbers by 1024^n to convert to the K/M labels on the number of parameters in Table 1 of the paper. That feels wrong: we're not talking about actual storage here (as in bytes), but the plain number of parameters. They should have just divided by 1000^n instead.
Maybe the 7x7 conv layer is actually the combination of a 7x1 conv layer and a 1x7 conv layer; then the number of params would be ((7+7)*64*3 + 64*2) / 1024 = 2.75k, which approaches 2.7k (or you can omit the 128 biases).
As we know, Google introduced asymmetric convolutions when doing spatial factorization in "Spatial Factorization into Asymmetric Convolutions".
(1x7 + 7x1) x 3 x 64 = 2688 ≈ 2.7k. This is my opinion; I am a fresh student.
The number of parameters in a CONV layer would be ((m * n * d) + 1) * k, with the +1 added because of the bias term for each filter. The same expression can be written as: (width of the filter * height of the filter * number of filters in the previous layer + 1) * number of filters.
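A quick MATLAB check of that formula against the examples discussed above:
convParams = @(m, n, d, k) ((m * n * d) + 1) * k;   % weights plus one bias per filter
convParams(7, 7, 3, 64)    % = 9472, the ~9.5k from the question (not 2.7k)
16 * 192 * 1 * 1           % = 3072 weights for inception_3a/5x5_reduce (biases ignored, as above)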

Which are the ranges in HSV image representation in MATLAB?

I have to compare several images of the same scene taken from different devices/positions. To do so, I want to quantize the colors in order to remove some color-representation differences due to the device and illumination.
If I work in RGB, I know that MATLAB represents each channel in the range [0 255]; if I work in YCbCr, I know the ranges are [16 235] and [16 240]. But if I want to work in HSV color space, I only know that converting with rgb2hsv gives an image where each channel is a double, and I don't know whether the full range between 0 and 1 is used for all three channels, so I cannot quantize without this information.
Parag basically answered your question, but if you want physical proof, you can do what chappjc suggested and just... try it yourself! Read in an image, convert it to HSV using rgb2hsv, and take a look at the distribution of values. For example, using onion.png that is part of MATLAB's system path, try something like:
im = imread('onion.png');
out = rgb2hsv(im);
str = 'HSV';
for idx = 1 : 3
    disp(['Range of ', str(idx)]);
    disp([min(min(out(:,:,idx))) max(max(out(:,:,idx)))]);
end
The above code will read in each channel and display the minimum and maximum in each (Hue, Saturation and Value). This is what I get:
Range of H
0 0.9991
Range of S
0.0791 1.0000
Range of V
0.0824 1.0000
As you can see, the values range between [0,1]. Have fun!
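Once you know all three channels lie in [0,1], a uniform quantization (just an illustrative sketch; nBins is chosen arbitrarily here) could look like:
nBins = 8;
hsvImg = rgb2hsv(imread('onion.png'));
q = min(floor(hsvImg * nBins), nBins - 1);   % bin indices 0..nBins-1 per channel
hsvQuant = (q + 0.5) / nBins;                % map back to bin centers in [0,1]
rgbQuant = hsv2rgb(hsvQuant);                % inspect the quantized result in RGB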