Determining H264 GOP structure - mp4

I have H264-encoded video packed in a MOV container, and I need to find out the GOP structure's M and N values as defined here: http://en.wikipedia.org/wiki/Group_of_pictures, i.e. M is the distance between two anchor frames (I or P) and N is the distance between I-frames.
Is it possible to extract M and N from the SPS/PPS? Or do I have to read each frame of the GOP and take the frame type from the slice header?

Related

Split an uncompressed video into segments with ffmpeg on MATLAB?

I have a video sequence (format Y4M) and I want to split it into several segments with the same GoP size.
GoP = 8;
How can I do that in MATLAB using FFmpeg?
One standard way of representing video in Matlab is a 4D matrix. The dimensions are height x width x color channels x frames. Once you have the matrix, it is easy to take time slices by specifying the range of frames that you would like.
For example, you can grab 8 frames at a time in a for-loop
% Load the video as a 4-D matrix
v = VideoReader('xylophone.mp4');
video = [];
while hasFrame(v)
    video = cat(4, video, readFrame(v));
end
% Iterate over the length of the movie with a step size of 8
for i = 1:8:size(video, 4) - 7
    video_slice = video(:, :, :, i:i+7); % get the next 8 frames
    % do something with the 8 frames here;
    % each frame is a slice across the 4th dimension
    frame1 = video_slice(:, :, :, 1);
end
% Play the movie
implay(video)
The other most common way of representing video is as a structure array. You can index a structure array with a range of values to slice 8 frames. In my example, the actual frame values are stored in the structure element cdata. Depending on your structure, the element might have a different name; look for an element whose value is a 3-D matrix.
% Load a video as a structure array
load mri
video = immovie(D, map);
% Iterate over the length of the movie with a step size of 8
for i = 1:8:numel(video) - 7
    video_slice = video(i:i+7); % get the next 8 frames
    % do something with the 8 frames here;
    % to access the frame values, use cdata
    frame1 = video_slice(1).cdata;
end
% Play the movie
implay(video)
The tricky part is your video format. Y4M is not supported by MATLAB's VideoReader, which is the most common way to load video. It is also not supported by the FFmpeg Toolbox, which only handles a few media formats (MP3, AAC, MPEG-4, x264, animated GIF).
There are a couple of other questions that look for solutions to this problem, including:
how to read y4m video(get the frames) file in matlab
How to read yuv videos in matlab?
I would also check the MATLAB File Exchange, but I don't have personal experience with any of these methods.
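One workaround I would try (an assumption on my part, not part of the answers above) is to losslessly transcode the Y4M file into a container VideoReader does support by shelling out to FFmpeg from MATLAB, then load the result as usual:

% Hypothetical workaround: losslessly transcode Y4M to MP4, then load it.
% Requires the ffmpeg executable on the system path; file names are illustrative.
system('ffmpeg -i input.y4m -c:v libx264 -crf 0 input.mp4');
v = VideoReader('input.mp4');

With -crf 0, x264 encodes losslessly, so the frame data fed to the GoP-slicing code above is unchanged.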

Image Compression using SVD

I've been researching on image compression with SVD for school. However, I do not see how there will be a reduction in memory by using SVD and truncating the number of singular values used. The original image would be m x n in size, thereby using m x n x pixel-size bytes.
After SVD the resultant matrix is still m x n. Would it not then use the same amount of space?
That's because the rank-k approximation of the image requires you to store (think of saving the image to a file) only the first k singular vectors and singular values: an m x k matrix, an n x k matrix, and k scalars, i.e. roughly k(m + n + 1) numbers instead of m x n. When you want to render the image on screen you obviously decompress it back to the m x n size (as you do with any other kind of compression), but that is not the stored size of the image; it is only for rendering.
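As a rough sketch of the idea (assuming a grayscale image; the file name and choice of k are illustrative, not from the question):

% Rank-k approximation of a grayscale image via SVD (illustrative sketch)
A = double(imread('cameraman.tif'));   % m x n image
[U, S, V] = svd(A);
k = 20;                                % number of singular values kept
Uk = U(:, 1:k);  Sk = S(1:k, 1:k);  Vk = V(:, 1:k);
% Stored: m*k + k + n*k numbers instead of m*n
Ak = Uk * Sk * Vk';                    % decompressed m x n approximation
imshow(uint8(Ak))

For a 256 x 256 image and k = 20, that is about 20*(256 + 256 + 1) ≈ 10k numbers instead of 65k, at the cost of some visible approximation error.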

Combine a (m x n x p) matrix (image) of 8 bit numbers to a (m x n) matrix of 24 bit numbers and vice versa

Say there is an (m x n x p) matrix, e.g. a color image with R, G, and B channels. Each channel's values are 8-bit integers.
But, for an analysis, the three 8-bit values have to be combined into a 24-bit value, and the analysis is done on the (m x n) matrix of 24-bit values.
After the analysis, the matrix has to be decomposed back into three 8-bit channels for displaying the results.
What I am doing right now:
Iterating through all the values in the matrix
Convert each decimal value to binary (using dec2bin)
Combine the three binary values together to get a 24-bit number (using strcat and bin2dec)
Code:
for i = 1:m
    for j = 1:n
        new_img(i,j) = bin2dec(strcat( ...
            sprintf('%.8d', str2double(dec2bin(img(i,j,1)))), ...
            sprintf('%.8d', str2double(dec2bin(img(i,j,2)))), ...
            sprintf('%.8d', str2double(dec2bin(img(i,j,3))))));
    end
end
For the decomposition back to three 8-bits after analysis, the exact reverse process is done, still iterating through (m x n) values.
The problem is huge computation time.
I know that this is the not the correct way of doing this. Is there any matrix operation that I can do to achieve this so that the computation is done quickly?
Although I don't understand why you'd "combine" the RGB planes this way, this will get you what you're looking for in one command. (Cast to uint32 first, since bit-shifting uint8 values left by 8 or 16 bits would overflow.)
a = bitshift(uint32(img(:,:,1)), 16) + ...
    bitshift(uint32(img(:,:,2)), 8) + ...
    uint32(img(:,:,3));
And inverting the process requires binary masking in addition to shifting back to the right:
A = zeros(size(img), 'uint8');
A(:,:,1) = bitshift(a, -16);                    % top 8 bits
A(:,:,2) = bitshift(bitand(a, 2^16 - 2^8), -8); % middle 8 bits (mask 0xFF00)
A(:,:,3) = bitand(a, 2^8 - 1);                  % bottom 8 bits (mask 0x00FF)
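A quick round-trip sanity check (my own illustrative snippet, using random data rather than anything from the question):

% Pack three 8-bit channels into one 24-bit value and unpack again
img = uint8(randi([0 255], 4, 4, 3));
a = bitshift(uint32(img(:,:,1)), 16) + ...
    bitshift(uint32(img(:,:,2)), 8) + ...
    uint32(img(:,:,3));
A = zeros(size(img), 'uint8');
A(:,:,1) = bitshift(a, -16);
A(:,:,2) = bitshift(bitand(a, 2^16 - 2^8), -8);
A(:,:,3) = bitand(a, 2^8 - 1);
isequal(A, img)  % should print ans = 1

Because these are vectorized bit operations over the whole (m x n) plane at once, this avoids the per-pixel string conversions entirely, which is where the huge computation time was going.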

MATLAB wavread N1 N2 Values

I am a newbie to MATLAB and have a question.
In the MATLAB wavread function:
wavread(filename, [N1 N2]);
Can anyone please help me understand why the N1 and N2 indices are usually chosen as 24120 and 25439, respectively, for slicing the wav file?
Thanks in advance!
If you consult the wavread documentation, it's actually quite clear on what N1 and N2 are. N1 represents the beginning sample and N2 represents the ending sample, and what is returned are your audio samples between N1 and N2 for each channel.
As such, suppose your audio sampling rate is 44100 Hz. Following your post, if you did
wavread(filename, [24120 25439])
you would be returning the audio samples for each channel between the 0.5469-second (24120/44100) and 0.5768-second (25439/44100) marks in your audio file. This returns an overall matrix of size 1320 x N, where N is the number of audio channels in your file. The overall length of this audio slice is roughly 0.03 seconds.
By the way, these indices are not usually chosen this way. They depend highly on the length of your audio signal, as well as on what you want to isolate from it. These indices are mainly used to disregard irrelevant audio data and to give you only those samples where you know there is some meaningful output.
My guess is that the audio files you are processing have meaningful output between these time frames, which is why those indices are used so often. As I said, it all depends on which audio files you are processing.
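Concretely (the file name and sampling rate here are illustrative, not from the question):

% Read only samples 24120 through 25439 from a WAV file
[y, Fs] = wavread('speech.wav', [24120 25439]);  % y is 1320 x N
t_start  = 24120 / Fs;       % time of the first returned sample, in seconds
duration = size(y, 1) / Fs;  % roughly 0.03 s when Fs = 44100

Note that newer MATLAB versions replace wavread with audioread, which accepts the same [N1 N2] sample-range argument.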

What Are Linear PCM Values

I am working with audio in the iPhone OS and am a bit confused.
I am currently getting input from my audio buffer in the form of PCM values ranging from -32768 to 32767. I am hoping to perform a dB SPL conversion using the formula 20*log10(p/pRef).
I am aware that pRef is 0.00002 pascals, and would like to convert the PCM values to pascals.
My questions are:
a) What are these PCM values representing, exactly?
b) How do I convert these values to pascals?
Thanks so much.
You can't do this conversion without additional information. The mapping of PCM values to physical units of pressure (pascals) depends on the volume setting, characteristics of the output device (earbuds? a PA system?), and the position of the observer with respect to the output device (right next to the speaker? 100 meters away?).
To answer the first part of your question: if you were to graph the sound pressure versus time for, say, a 1 kHz sine-wave tone, the linearly quantized PCM values at the sample times would be roughly proportional to the sound pressure variations from ambient at that instant. ("Roughly", because input and output devices seldom have absolutely flat response curves over the entire audio frequency range.)
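What you can compute without any calibration is a level relative to digital full scale (dBFS) rather than dB SPL. A minimal MATLAB sketch (the variable pcm is assumed to hold a buffer of 16-bit samples; it is illustrative, not from the question):

% Relative RMS level in dBFS (not SPL) of a buffer of 16-bit PCM samples
x = double(pcm) / 32768;             % normalize to roughly [-1, 1)
rms_level  = sqrt(mean(x .^ 2));
level_dbfs = 20 * log10(rms_level);  % 0 dBFS = digital full scale

A full-scale sine wave measures about -3 dBFS RMS on this scale; mapping any dBFS figure to dB SPL still requires the calibration information described above.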
Lets get some intuition for the question
what are these pcm values representing exactly ( ranging from -32768 to 32767 )
Audio is simply a curve which fluctuates above and below a zero line ... if the curve sits at or too near the zero line for a long enough period of time this maps to silence ... neither your speaker surface nor your eardrum does any wobbling ... alternatively if the audio curve violently wobbles from maximum value to minimum value more often than not for a stretch of time you have maximum volume hence a greater pascal value
Audio in the wild which your ear hears is analog ... to digitize audio this analog curve must get converted into binary data by sampling the raw audio curve to record the curve height at X samples per second ... the fundamental digital format for audio is PCM which simply maps that continuous unbroken analog curve into distinct points on a graph ... PCM audio still appears as a curve yet when you zoom in it's just distinct points on the curve ... each curve point has its X and Y value where X represents time (going left to right) and Y represents amplitude (going up and down) ... however only the Y values are stored and the X values are implied, meaning each successive Y sample is by definition separated in time as determined by the sampling rate, so at a sample rate of 44100 Hertz you will have 44100 values of Y per second of recording ( per channel )
The number of X measurements per second we call Sample Rate (often 44,100 per second) ... the number of bits used to record the fidelity of Y we call Bit Depth ... if we devote 3 bits the universe of possible Y values must fit in one of these rows
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
so for 3 bits the number of possible values of Y is 2^3 or 8 distinct values which would sound very distorted since the quantized curve is far from continuous which is why CD-quality audio uses two bytes ( 16 bits ) of information to record the value of curve height Y which gives it 2^16 distinct values of Y which equates to the scale you gave us ( -32768 to 32767 ) ... 2^16 == 65536 distinct Y values ... the original continuous unbroken analog audio curve is now digitized into 2^16 choices of height values ranging from the bottom to the top of the audio curve which to the human ear becomes indistinguishable from the source audio curve ... when doing audio calculations using floating point the Y value often gets normalized ... into say a range of -1.0 to +1.0 ... instead of ( -32768 to 32767 )
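To hear the difference bit depth makes, one can quantize the same curve at two depths (my own illustrative sketch, not part of the answer above):

% Quantize a 1 kHz sine to coarse and fine PCM grids and compare by ear
Fs = 44100;  t = 0:1/Fs:1;
x   = sin(2*pi*1000*t);             % "analog" curve in [-1, 1]
q3  = round(x * 3) / 3;             % 3-bit-style grid: only a handful of Y levels
q16 = round(x * 32767) / 32767;     % 16-bit grid: 65536 possible Y levels
% sound(q3, Fs)   % audibly distorted
% sound(q16, Fs)  % indistinguishable from the original to the ear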
So by now it should be clear the heart of your question concerning pascals ( unit of pressure ) is orthogonal to Y value range (bit depth) and is instead a function of the shape of the audio curve together with the sheer area of the speaker surface ... for a given choice of frequency(s) the degree to which the audio curve adheres to the canonical sine curve of that frequency while consuming the full range of possible Y values will maximize the amplitude ( volume ) thus driving the pascal value
Your question is about neither “iphone”, “objective-c” nor “objective-c++”. But it can be answered very simply: http://en.wikipedia.org/wiki/Pulse-code_modulation
Greetings