Storing lots of images on a server: compression

We have a project which will generate lots (hundreds of thousands) of .PNG images that are around 1 MB each. Rapid serving is not a priority, as we use the images internally rather than on a front end.
We know to use filesystem not DB to store.
We'd like to know how best to compress these images on the server to minimise long term storage costs.
Linux server.

They already are compressed, so you would need to recode the images into another lossless format while preserving all of the information present in the PNG files. I don't know of a ready-made format that does that, but you can roll your own: recode the image data with a better lossless compressor (you can see benchmarks here) and keep a separate metadata file that retains the other information from the original .png files, so that you can reconstruct the original exactly.
The best you could get losslessly, based on those benchmarks, would be about 2/3 of the current size. You would need to test the compressors on your actual data; your mileage may vary.
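As a rough illustration of the idea, a Python sketch (the byte pattern is synthetic, and a real pipeline would first decode the PNG, e.g. with Pillow, keeping its metadata separately): compress the decoded pixel data with a stronger general-purpose compressor than the DEFLATE that PNG uses internally.

```python
import lzma
import zlib

# Synthetic stand-in for decoded pixel rows; a real pipeline would first
# decode the PNG (e.g. with Pillow) and keep its metadata separately.
raw = bytes(range(256)) * 4096          # ~1 MB of repetitive "pixel" data

deflated = zlib.compress(raw, 9)        # DEFLATE, roughly what PNG uses
recoded = lzma.compress(raw, preset=9)  # a stronger general-purpose coder

print(len(raw), len(deflated), len(recoded))
```

On real photographic data the gap is much smaller than on this toy pattern, which is why testing on your actual images matters.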

Related

Is it safe to compute a hash on an image compressed in a lossless format such as PNG, GIF, etc.?

I was wondering if any lossless image compression format such as PNG comes with some kind of uniqueness guarantee, i.e. that two different compressed binaries always decode to different images.
I want to compute the hash of images that are stored in a lossless compression format and am wondering if computing the hash of the compressed version would be sufficient.
(There are some good reasons to compute the hash on the uncompressed image, but those are out of the scope of my question here.)
No, that's not true for PNG. The compression procedure has many parameters (the filtering type used for each row, the zlib compression level and settings), so a single raw image can result in many different PNG files. Even worse, PNG allows including ancillary data (chunks) with miscellaneous info (for example, textual comments).
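A small Python sketch of why hashing the compressed bytes is unsafe (plain zlib stands in for a PNG encoder here): the same pixel data compressed with different settings yields different bytes, hence different hashes, even though it decodes to the identical image.

```python
import hashlib
import zlib

pixels = b"\x00\x01\x02" * 100_000   # stand-in for decoded pixel data

# The same pixels compressed at two settings, as two PNG encoders might do.
fast = zlib.compress(pixels, 1)
small = zlib.compress(pixels, 9)

print(zlib.decompress(fast) == zlib.decompress(small))  # same image?
print(hashlib.sha256(fast).hexdigest())
print(hashlib.sha256(small).hexdigest())                # a different hash
```

So to get a stable identifier you would hash the decoded pixels, not the file bytes.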

MATLAB: Are there any problems with many (millions) small files compared to few (thousands) large files?

I'm working on a real-time test software in MATLAB. On user input I want to extract the value of one (or a few neighbouring) pixels from 50-200 high resolution images (~25 MB).
My problem is that the total image set is too big (~2000 images) to store in RAM, so I need to read each of the 50-200 images from disk after each user input, which of course is way too slow!
So I was thinking about splitting the images into sub-images (~100x100 pixels) and saving these separately. This would make the image-read process quick enough.
Are there any problems I should be aware of with this approach? For instance, I've read about people having trouble copying many small files; will this affect me too, e.g. by making the image reads slower?
rahnema1 is right - imread(...,'PixelRegion') will speed up the read operation. If that is not enough for you, then even if your files are not fragmented, maybe it is time to think about some database?
Disk operations are always the bottleneck. First we switch to disk caches, then distributed storage, then RAID, and after some more time we end up with in-memory databases. You should decide what access speed is acceptable for your use case.
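If the frames could be kept in a fixed-size uncompressed layout, another option is to seek straight to the pixels you need instead of splitting each image into tile files. A Python sketch (the raw little-endian 16-bit pixel format here is an assumption, not something from the question):

```python
import struct
import tempfile

def read_pixel(path, width, x, y, bpp=2):
    """Seek straight to one pixel of a raw little-endian 16-bit frame."""
    with open(path, "rb") as f:
        f.seek((y * width + x) * bpp)
        return struct.unpack("<H", f.read(bpp))[0]

# Tiny synthetic 4x3 frame where pixel value = y*10 + x.
width, height = 4, 3
frame = b"".join(struct.pack("<H", y * 10 + x)
                 for y in range(height) for x in range(width))
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(frame)
    path = f.name

val = read_pixel(path, width, x=2, y=1)
print(val)  # → 12
```

Each lookup then reads only one disk block per pixel (or per small neighbourhood), regardless of frame size.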

How best to archive many JPEG files containing significant redundancy across their scenes?

Archives of JPEG files don't compress well, ostensibly because each JPEG is already highly compressed. However, when there is much redundancy between images (e.g., archived stills from a stationary camera), and the number of files is large (think thousands or more), there comes a point when failure to exploit the redundancies makes JPEG seem dramatically inefficient for files to be stored in archives.
What approach and archive format would give the best compression of JPEG files?
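One way to see how much a "solid" archive (e.g. 7-Zip's solid mode, which compresses all files as one stream) can gain over per-file compression when files share content - a Python sketch with synthetic stand-in frames, not real JPEG data:

```python
import lzma
import random

random.seed(0)
frame_a = random.randbytes(200_000)  # stand-in for one "frame"
# A nearly identical second frame: only 64 bytes in the middle differ.
frame_b = frame_a[:100_000] + b"\x00" * 64 + frame_a[100_064:]

separate = len(lzma.compress(frame_a)) + len(lzma.compress(frame_b))
solid = len(lzma.compress(frame_a + frame_b))  # one "solid" stream

print(separate, solid)  # solid exploits the cross-file redundancy
```

Note that real JPEGs hide their redundancy behind entropy coding, so in practice you need a JPEG-aware recompressor in front of the solid archive to expose it.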

Why do game developers put many images into one big image?

Over the years I've often asked myself why game developers place many small images into a big one. But not only game developers do that. I also remember the good old Winamp MP3 player had a user interface design file which was just one huge image containing lots of small ones.
I have also seen some big javascript GUI libraries like ext.js using this technique. In ext.js there is a big image containing many small ones.
One thing I noticed is this: no matter how small my PNG image is, the Finder on the Mac always tells me it consumes at least 4kb. Which is a heck of a lot if you have just 10 pixels.
So is this done because storing 20 or more small images in a big one is much more memory efficient than having 20 separate files, each of them probably with its own header and metadata?
Is it because locating files on the file system is expensive and slow, and it is therefore much faster to locate only one big image and then split it up into smaller ones once it is loaded into memory?
Or is it laziness, because it is tedious to think of so many file names?
Is there a name for this technique? And how are those small images separated from the big one at runtime?
This is called spriting - and there are various reasons to do it in different situations.
For web development, it means that only one web request is required to fetch the image, which can be a lot more efficient than several separate requests. That's more efficient in terms of having less overhead due to the individual requests, and the final image file may well be smaller in total than it would have been otherwise.
The same sort of effect may be visible in other scenarios - for example, it may be more efficient to store and load a single large image file than multiple small ones, depending on the file system. That's entirely aside from any efficiencies gained in terms of the raw "total file size", and is due to the per file overhead (a directory entry, block size etc). It's a bit like the "per request" overhead in the web scenario, but due to slightly different factors.
None of these answers are right. The reason we pack multiple images into one big "sprite sheet" or "texture atlas" is to avoid swapping textures during rendering.
OpenGL and Direct-X take a performance hit when you draw from one image (texture) and then switch to another, so we pack multiple images into one big image and can then draw several (or hundreds) of images without ever switching textures. It has nothing to do with the 4K file size (or hasn't had in 15 years).
Also, up until very recently, texture dimensions had to be powers of 2 (64, 128, 256), and if your game had lots of odd-sized images, that meant a lot of wasted memory. Packing them into a single texture could save a lot of space.
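To answer the "how are they separated at runtime" part: the small images are usually addressed by rectangle coordinates into the atlas (UV coordinates on the GPU). Conceptually it is just a crop; a minimal Python sketch over a toy pixel grid:

```python
def crop(atlas, x, y, w, h):
    """Cut one sprite out of an atlas (a list of pixel rows)."""
    return [row[x:x + w] for row in atlas[y:y + h]]

# A 4x4 "atlas" holding four 2x2 sprites, one value per sprite.
atlas = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 4, 4],
    [3, 3, 4, 4],
]
print(crop(atlas, 2, 0, 2, 2))  # → [[2, 2], [2, 2]]
```

In practice the (x, y, w, h) rectangles live in a small metadata file shipped alongside the sheet, so nothing is actually copied at draw time.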
The 4kb usage is a side effect of how files are stored on disk. The smallest addressable unit of storage in a filesystem is a block, which is usually a fixed size of 512, 1024, 2048, etc. bytes. In your Mac's case, it's using 4k blocks. That means that even a 1-byte file will require at least 4 kbytes' worth of physical space to store, as it's not possible for the file system to address any storage unit SMALLER than 4k.
The reasons for these "large" blocks vary, but the big one is that the more "granular" your addressing gets (the smaller the blocks), the more space you waste on indexes to list which blocks are assigned to which files. If you had 1-byte blocks, then for every byte of data you store in a file, you'd also need to store 1+ bytes' worth of usage information in the file system's metadata, and you'd end up wasting at least HALF of your storage on nothing but indexes.
The converse is true - the bigger the blocks, the more space is wasted for every smaller-than-one-block sized file you store, so in the end it comes down to what tradeoff you're willing to live with.
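The trade-off is easy to quantify; a small Python sketch (the 4096-byte block size mirrors the Mac example above):

```python
import math

def on_disk(size, block=4096):
    """Space a file actually occupies with fixed-size allocation blocks."""
    return max(1, math.ceil(size / block)) * block

for size in (1, 100, 4096, 4097):
    print(size, "bytes ->", on_disk(size), "bytes on disk")
```

A 1-byte file occupies a full 4096-byte block, while a 4097-byte file spills into a second block - exactly the slack the 4kb Finder figure reflects.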
The reasons are a bit different in different environments.
On the web the main reason is to reduce the number of requests to the web server. Each request creates overhead, most notably a separate round trip over the network.
When fetching from good ol' mechanical hard drives good read performance requires contiguous data. If you save data in lots of files you get extra seek-time for each file. There is also the block size to consider. Files are made out of blocks, in your case 4kB. When reading a file of one byte you need to read a whole block anyway. If you have many small images you can stuff a whole bunch of them in a single disk block and get them all in the same time as if you had only one small image in the block.
Another reason, from days of yore, was palettes.
If you made one image you could theme it with one palette: colour 14 = light grey with a hint of green.
If you made lots of little images you had to make sure you used the same palette for every one while designing them, or you got all sorts of artifacts.
Given one palette, you could then manipulate it, so everything currently green could be made red by flipping one value in the palette instead of trawling through every image.
Lots of simple animations like fire, smoke, running water are still done with this method.
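The palette trick in miniature, as a Python sketch (the colour names and index values here are made up): pixels store palette indices, so editing one palette entry re-themes every pixel that uses it.

```python
# An indexed-colour image: pixels store palette indices, not colours.
palette = {13: "dark-grey", 14: "light-grey-green", 15: "red"}
image = [[14, 14, 13],
         [13, 14, 14]]

# Re-theme every "grey-green" pixel at once by editing one palette entry.
palette[14] = "bright-red"

themed = [[palette[p] for p in row] for row in image]
print(themed)  # every former index-14 pixel is now bright-red
```

Cycling a few palette entries each frame is how classic fire and water animations were done without touching the pixel data at all.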

Image and video compression

What are similar compressors to the RAR algorithm?
I'm interested in compressing videos (for example, avi) and images (for example, jpg)
WinRAR reduced an avi video (1 frame/sec) to 0.88% of its original size (i.e. it was 49.8 MB, and it went down to 442 KB).
It finished the compression in less than 4 seconds.
So, I'm looking for a similar (open) algorithm. I don't care about decompression time.
Compressing "already compressed" formats is mostly pointless, because you can't gain much further; some archivers even refuse to compress such files and store them as-is. If you really need to compress image and video files, you need to "recompress" them. That doesn't mean simply converting the file format: it means decoding the image or video file to some extent (full decoding is not required) and applying your own specific model, instead of the format's model, with a stronger entropy coder. There are several good attempts at such usage. Here is a short list:
PackJPG: Open source and fast performer JPEG recompressor.
Dell's Experimental MPEG1 and MPEG2 Compressor: Closed source and proprietary. But you can at least test that experimental compressor's strength.
Precomp: Closed-source free software (but it'll be open in the near future). It recompresses GIF, BZIP2, JPEG (with PackJPG) and Deflate (only when generated with the zlib library) streams.
Note that recompression is usually a very time-consuming process, because you have to ensure bit-identical restoration. Some programs even check every possible parameter to ensure stability (like Precomp). Also, their models have to grow more and more complex to gain something that is often negligible.
Compressed formats like JPEG can't really be compressed any further, since they are already close to their entropy limit; however, uncompressed formats like BMP, WAV and (lightly compressed) AVI can.
Take a look at LZMA
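A quick way to convince yourself of both claims with LZMA from the Python standard library (synthetic data; seeded random bytes stand in for already-compressed files, which look statistically random to a compressor):

```python
import lzma
import random

random.seed(1)
raw = (b"frame header " + bytes(range(200))) * 2_000  # uncompressed-style data
already = random.randbytes(len(raw))  # stand-in for already-compressed bytes

shrunk = len(lzma.compress(raw))      # redundant input: big gain
no_gain = len(lzma.compress(already)) # random-looking input: no gain

print(shrunk, "vs", len(raw))
print(no_gain, "vs", len(already))
```

LZMA collapses the redundant stream to a fraction of its size but cannot shrink the random-looking one at all - which is exactly why the avi in the question must have contained barely compressed data to begin with.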