VP9 encoding limited to 4 threads? - encoding

I am considering using VP9 to encode my Blu-rays in the future, since it's an open-source codec. But I cannot get HandBrake or ffmpeg to use more than 50% (4) of my (8) cores. The encoding time is therefore much worse than with x264/x265, which use all cores.
In HandBrake I just set the encoder to VP9 and CQ 19. There is no difference if I add threads 8, threads 16 or threads 64 in the parameters field.
Testing ffmpeg on the command line (-c:v libvpx-vp9 -crf 19 -threads 16 -tile-columns 6 -frame-parallel 1 -speed 0) also does not use any more CPU threads.
Is the current encoder not capable of encoding on more than 4 threads or am I doing something wrong?
Linux Mint 18
handbrake 0.10.2+ds1-2build1
ffmpeg 2.8.10-0ubuntu0.16.04.1
libvpx3 1.5.0-2ubuntu1

Libvpx uses tile threading, which means you can have at most as many threads as there are tiles. The -tile-columns option is in log2 units (so -tile-columns 6 means 64 tiles), but the tile count is also limited by the frame size. The exact details are here; it basically means that max_tiles = max(1, exp2(floor(log2(sb_cols)) - 2)), where sb_cols = ceil(width / 64.0). You can write a small script to calculate the number of tiles for a given horizontal resolution:
Width: 320 (sb_cols: 5), min tiles: 1, max tiles: 1
Width: 640 (sb_cols: 10), min tiles: 1, max tiles: 2
Width: 1280 (sb_cols: 20), min tiles: 1, max tiles: 4
Width: 1920 (sb_cols: 30), min tiles: 1, max tiles: 4
Width: 3840 (sb_cols: 60), min tiles: 1, max tiles: 8
So even for 1080p (1920 horizontal pixels), you only get 4 tiles max, and therefore 4 threads max; this is a bitstream limitation. To get 8 tiles, you need a width of at least 1985 pixels (2048-64+1, which gives sb_cols=32). To get more threads than the maximum number of tiles at a given resolution, you need frame-level multithreading, which libvpx doesn't implement. Other encoders, such as x264/x265, do implement this.
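For reference, a minimal Python sketch of such a script; it only reproduces the max-tiles column, using the formula quoted above rather than the libvpx source itself:
import math

def max_tile_columns(width):
    # Number of 64x64 superblock columns for this frame width.
    sb_cols = math.ceil(width / 64.0)
    # Formula quoted above: max(1, exp2(floor(log2(sb_cols)) - 2)).
    return max(1, 2 ** (math.floor(math.log2(sb_cols)) - 2))

for width in (320, 640, 1280, 1920, 3840):
    print(f"Width: {width}, max tile columns: {max_tile_columns(width)}")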
EDIT
As some people have already pointed out in the comments and in the answer below, more recent versions of libvpx support -row-mt 1 to enable row-based multi-threading. In addition, tile rows can increase the number of tiles by up to 4x in VP9 (the maximum number of tile rows is 4, regardless of video height). To set them, use -tile-rows N, where N is the number of tile rows in log2 units (so -tile-rows 1 means 2 tile rows and -tile-rows 2 means 4 tile rows). The total number of active threads will then be equal to $tile_rows * $tile_columns.
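As a toy illustration of that formula (the settings below are example values, and whether the encoder actually saturates that many threads depends on your libvpx build):
# Toy calculation of the thread count formula from the EDIT above
# (tile settings are in log2 units; VP9 caps tile rows at 4).
tile_columns_log2 = 2   # -tile-columns 2 -> 4 tile columns (the 1080p maximum)
tile_rows_log2 = 2      # -tile-rows 2    -> 4 tile rows

active_threads = (1 << tile_rows_log2) * (1 << tile_columns_log2)
print(active_threads)   # 16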

According to webmproject.org, the libvpx VP9 encoder has supported multi-threading within a single tile column since the 1.7.0 tag.
All you have to do is set -row-mt 1,
e.g. ffmpeg -i input.mp4 -c:v libvpx-vp9 -b:v 1000K -threads 8 -speed 4 -row-mt 1 -f webm /tmp/test

What is the typical DRAM row buffer size? How to find it?

How can I find the DRAM row buffer size programmatically, or by using existing tools, on say a *nix system?
As an example, with a Kingston DDR4 module, I executed the following commands (you might need to install some packages):
sudo modprobe eeprom
decode-dimms
These commands, among many information, give me the characteristics of my DDR stick:
---=== Memory Characteristics ===---
Maximum module speed 2132 MHz (PC4-17000)
Size 16384 MB
Banks x Rows x Columns x Bits 16 x 16 x 10 x 64
SDRAM Device Width 8 bits
Ranks 2
Rank Mix Symmetrical
AA-RCD-RP-RAS (cycles) 14-14-14-35
Supported CAS Latencies 16T, 15T, 14T, 13T, 12T, 11T, 9T
Rows and columns are actually numbers of address bits (you can check the decode-dimms source at https://fossies.org/linux/i2c-tools/eeprom/decode-dimms and see what the code is doing when you look at the DDR4 SPD information: https://en.wikipedia.org/wiki/Serial_presence_detect#DDR4_SDRAM).
Thus, if we have 10 column bits, we have 1024 columns (2^10), where each column is as wide as the module width (64, as per the data above, shown as "Bits"). Since we can also see that the SDRAM device width is 8 bits, we can deduce that the DIMM locksteps 8 SDRAM chips (those black packages you see on your DRAM stick) to get that total width of 64 bits.
The row buffer size in my DDR4 module is, therefore, 64 bits * 1024 columns = 65536 bits wide (8192 bytes). Row buffer sizes in DDR3 and DDR4 are mostly this size, but newer architectures such as HMC and HBM use different sizes.
So, as one short command line that returns the size in bits (just divide by 8 to get bytes):
decode-dimms | grep "Columns x Bits" | awk -F 'x' '{print (2^$(NF-1))*$NF}'
PS: mind you, this handles a single DIMM. If you have multiple DIMMs, decode-dimms might return information for multiple modules.
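For completeness, here is the same arithmetic as a small Python sketch; the column-bit count and module width are hard-coded from the decode-dimms output above, so adjust them for your own module:
# Values taken from the "Banks x Rows x Columns x Bits" line above.
column_bits = 10          # column address bits
module_width_bits = 64    # total module data width ("Bits")

columns = 2 ** column_bits                     # 1024 columns
row_buffer_bits = columns * module_width_bits  # 65536 bits
print(row_buffer_bits, "bits =", row_buffer_bits // 8, "bytes")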

LibPNG: smallest way to store 9-bit grayscale images

Which one produces the smallest 9-bit depth grayscale images using LibPNG?
16-bit grayscale
8-bit grayscale with alpha, having the 9th bit stored as alpha
Any other suggestion?
Also, from the documentation it looks like in 8-bit GRAY_ALPHA, the alpha is 8 bits as well. Is it possible to have 8 bits of gray with only one bit of alpha?
If all 256 possible gray levels are present (or potentially present), you'll have to use 16-bit G8A8 pixels. But if one or more gray levels are not present, you can use a spare level for transparency, using 8-bit indexed or grayscale pixels plus a tRNS chunk to identify the transparent value.
Libpng doesn't provide a way of checking whether a spare level is available, so you have to do it in your application (a sketch of such a check is shown after the pngcheck output below). ImageMagick, for example, does that for you:
$ pngcheck -v rgba32.png im_opt.png
File: rgba32.png (178 bytes)
chunk IHDR at offset 0x0000c, length 13
64 x 64 image, 32-bit RGB+alpha, non-interlaced
chunk IDAT at offset 0x00025, length 121
zlib: deflated, 32K window, maximum compression
chunk IEND at offset 0x000aa, length 0
$ magick rgba32.png im_optimized.png
$ pngcheck -v im_optimized.png
File: im_optimized.png (260 bytes)
chunk IHDR at offset 0x0000c, length 13
64 x 64 image, 8-bit grayscale, non-interlaced
chunk tRNS at offset 0x00025, length 2
gray = 0x00ff
chunk IDAT at offset 0x00033, length 189
zlib: deflated, 8K window, maximum compression
chunk IEND at offset 0x000fc, length 0
There is no G8A1 format defined in the PNG specification. But the alpha channel, being all 0's or 255's, compresses very well, so it's nothing to worry about. Note that in this test case (a simple white-to-black gradient), the 32-bit RGBA file is actually smaller than the "optimized" 8-bit grayscale+tRNS file.
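Since libpng leaves that check to the application, here is a minimal sketch of how it could look in Python with Pillow; the function name, the use of Pillow, and the input.png path are illustrative choices, not part of libpng:
# Hypothetical sketch: find a gray level that never occurs in an 8-bit
# grayscale image, so it can be reserved as the transparent value (tRNS).
from PIL import Image

def find_spare_gray_level(path):
    img = Image.open(path).convert("L")   # force 8-bit grayscale
    histogram = img.histogram()           # 256 counts, one per gray level
    for level, count in enumerate(histogram):
        if count == 0:
            return level                  # unused value -> usable for tRNS
    return None                           # all 256 levels are present

print("spare gray level:", find_spare_gray_level("input.png"))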
Which one produces the smallest 9-bit depth grayscale images using LibPNG?
16-bit grayscale
8-bit grayscale with alpha, having the 9th bit stored as alpha
The raw byte layout of the two formats is very similar: G2-G1 G2-G1 ... in one case (most significant byte of each 16-bit value first), G-A G-A ... in the other. Because the filtering/prediction is done at the byte level, little or no difference is to be expected between the two alternatives. Since 16-bit grayscale is the more natural fit for your scenario, I'd opt for it.
If you go the other route, I'd suggest experimenting with putting either the most significant bit or the least significant bit in the alpha channel.
Also, from the documentation it looks like in 8-bit GRAY_ALPHA, the alpha is 8 bits as well. Is it possible to have 8 bits of gray with only one bit of alpha?
No. But 1 bit of alpha would mean totally opaque/totally transparent, so you could instead add a tRNS chunk to declare one particular gray value as totally transparent (as pointed out in the other answer, this disallows using that value for opaque pixels).
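As a hedged illustration of the two raw layouts compared above, here is a small Python sketch that packs one 9-bit sample either as a big-endian 16-bit gray value or as an 8-bit gray byte plus the ninth bit expanded into the alpha byte; the helper names are my own:
def pack_gray16(sample9):
    # 16-bit grayscale (G2-G1): most significant byte first, 9-bit sample left-justified.
    value16 = sample9 << 7
    return bytes([value16 >> 8, value16 & 0xFF])

def pack_gray8_alpha(sample9):
    # 8-bit gray + alpha (G-A): high 8 bits as gray, least significant bit as alpha.
    gray = sample9 >> 1
    alpha = 0xFF if (sample9 & 1) else 0x00
    return bytes([gray, alpha])

sample = 363   # an arbitrary 9-bit value
print(pack_gray16(sample).hex(), pack_gray8_alpha(sample).hex())   # b580 b5ff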

Is it better to host our own live streaming server on AWS c1.xlarge or use a third-party service?

I have a c1.xlarge EC2 instance which, according to this article, has 100 MB/s upload and download speed.
I would be streaming video at 720p or 1080p from this server. I am running MongoDB and NGINX on the instance.
According to this article, the bandwidth consumption is as follows:
720p
Bits Per Second (down): 20+ Mbps
Bits Per Second (up): 320Kbps
Data used per 5 minute video: 37.5MB
1080p
Bits Per Second (down): 20+ Mbps
Bits Per Second (up): 320 Kbps
Data used per 5 minute video: 62MB
According to Wikipedia:
Bitrate for 720p: ~18.3 Mbit/s
Bitrate for 1080p: ~25 Mbit/s
According to a Stack Overflow bitrate calculation:
25 Mbit/s * 3,600 s/hr = 3.125 MB/s * 3,600 s/hr = 11,250 MB/hr ≈ 11 GB/hr
So for a minute it would be
(25 (Mbit / s)) * 1 minute = 187.5 megabytes
My assumption is that the above calculation is on a per-viewer basis.
Q1. So is the following calculation correct, meaning that only 1 user can be hosted per minute?
(187.5 (mb / s)) / (100 (mb / s)) = 1.87500
Q2. Should I stream from my own server or use a third party? If a third party, which one do you recommend?
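For reference, a quick Python sketch that just reproduces the bitrate arithmetic quoted above (25 Mbit/s for 1080p):
# Reproduces the arithmetic quoted in the question for a 25 Mbit/s 1080p stream.
bitrate_mbit_per_s = 25

mb_per_s = bitrate_mbit_per_s / 8       # 3.125 MB/s
mb_per_hour = mb_per_s * 3600           # 11250 MB/hr, roughly 11 GB/hr
mb_per_minute = mb_per_s * 60           # 187.5 MB per minute per viewer

print(mb_per_s, mb_per_hour, mb_per_minute)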

Perl CPU profiling

I want to profile my Perl script for CPU time. I found Devel::NYTProf and Devel::SmallProf,
but the first one cannot show the CPU time and the second one works badly. At least I couldn't find what I need.
Can you advise any tool for my purposes?
UPD: I need per-line profiling, since my script takes a lot of CPU time and I want to improve the relevant parts of it.
You could try your system's time utility (not your shell's built-in!); the leading \ is not a typo:
$ \time -v perl collatz.pl
13 40 20 10 5 16 8 4 2 1
23 70 35 106 53 160 80 40
837799 525
Command being timed: "perl collatz.pl"
User time (seconds): 3.79
System time (seconds): 0.06
Percent of CPU this job got: 97%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.94
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 171808
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 9
Minor (reclaiming a frame) page faults: 14851
Voluntary context switches: 16
Involuntary context switches: 935
Swaps: 0
File system inputs: 1120
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

How to apply RLE to a binary image?

I have a binary image, and I need to compress it using run-length encoding (RLE). I used the regular RLE algorithm with a maximum run count of 16.
Instead of reducing the file size, it increases it. For example, in a 5x5 matrix, ten of the values have a repeat count of one, and that makes the file bigger.
How can I avoid this? Is there any better way to apply RLE partially to the matrix?
If it's for your own use only, you can create a custom image file format and mark in the header whether RLE is used or not, the range of X and Y coordinates, and possibly the bit planes for which it is used. But if you want to produce an image file that follows some defined image file format that uses RLE (.pcx comes to mind), you must follow the file format specification. If I remember correctly, .pcx has no option to disable RLE partially.
If you are not required to use RLE and you are only looking for an easy-to-implement compression method, then before applying any compression, first check how many bytes your 5x5 binary matrix file takes. If the file size is 25 bytes or more, you are spending at least one byte (8 bits) per element (or, alternatively, the file holds a lot of data that is not matrix content). If you don't need to store the size, a 5x5 binary matrix takes 25 bits, which is 4 bytes and 1 bit, so practically 5 bytes. I'm quite sure there's no compression method that is generally useful for 5-byte files. If you have matrices of different sizes, you can use e.g. unsigned 16-bit integer size fields (2 bytes each) for a maximum matrix width/height of 65535, or unsigned 32-bit integer size fields (4 bytes each) for a maximum of 4294967295.
For example, a 100x100 binary matrix takes 10000 bits, which is 1250 bytes. Add 2 x 2 = 4 bytes for 16-bit size fields, or 2 x 4 = 8 bytes for 32-bit size fields. After this, you can plan what the best compression method would be.
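To make the size trade-off concrete, here is a minimal Python sketch of run-length encoding a flattened binary matrix with the 16-run cap mentioned in the question; the toy cost model (1 value bit + 4 count bits per run) is an illustrative assumption, not a standard format:
# Minimal illustrative RLE for a flattened binary matrix; run lengths are capped
# at 16 as in the question. Each isolated value still costs a whole run, which is
# exactly why a small matrix with little repetition grows instead of shrinking.
def rle_encode(bits, max_run=16):
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit and runs[-1][1] < max_run:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return [(b, n) for b, n in runs]

def rle_decode(runs):
    return [b for b, n in runs for _ in range(n)]

matrix = [
    [0, 1, 0, 1, 1],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1],
]
flat = [bit for row in matrix for bit in row]
encoded = rle_encode(flat)
assert rle_decode(encoded) == flat

# 25 raw bits vs. 5 bits per run (1 value bit + 4 count bits) in this toy format.
print(len(flat), "raw bits vs", 5 * len(encoded), "encoded bits")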