I'd like to provide hashes for compressed data, but the compression may be performed by different versions of bzip2. Do bzip2, or some bzip2 implementations, guarantee bitwise-identical output for the same input when called with the same compression level?
From reading some stuff about TOAST I've learned Postgres uses an LZ-family compression algorithm, which it calls PGLZ. And it kicks in automatically for values larger than 2KB.
How does PGLZ compare to GZIP in terms of speed and compression ratio?
I'm curious to know if PGLZ and GZIP have similar speeds and compression rates such that doing an extra GZIP step before inserting large JSON strings as data into Postgres would be unnecessary or harmful.
PGLZ is significantly faster than gzip, but has a lower compression ratio; it's optimised for low CPU cost.
There's definitely a place for gzip'ing large data before storing it in a bytea field, assuming you don't need to manipulate it directly in the DB, or don't mind having to use a function to un-gzip it first. You can do it with things like plpython or plperl if you must do it in the DB, but it's usually more convenient to just do it in the app.
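For illustration, a minimal sketch of that app-side approach in Python with psycopg2; the table, column, and connection string are invented for the example, and a stronger codec such as LZMA could be swapped in the same way:

    # Minimal sketch: gzip a large JSON document in the app and store it in a
    # bytea column. Table, column, and connection string are invented here.
    import gzip
    import json
    import psycopg2

    conn = psycopg2.connect("dbname=mydb")                # hypothetical DSN
    payload = json.dumps({"lots": "of data"}).encode()
    compressed = gzip.compress(payload)                   # compression happens in the app

    with conn, conn.cursor() as cur:
        cur.execute("INSERT INTO docs (body) VALUES (%s)",   # hypothetical table docs(body bytea)
                    (psycopg2.Binary(compressed),))

    # Reading it back: SELECT the bytea, then gunzip in the app.
    with conn, conn.cursor() as cur:
        cur.execute("SELECT body FROM docs LIMIT 1")
        (blob,) = cur.fetchone()
        original = json.loads(gzip.decompress(bytes(blob)))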
If you're going to go to the effort of doing extra compression though, consider using a stronger compression method like LZMA.
There have been efforts to add support for gzip and/or LZMA compression to TOAST in PostgreSQL. The main problem has been that we need to maintain compatibility with the on-disk format of older versions, make sure it stays compatible into the future, and so on. So far nobody has come up with an implementation that satisfied the relevant core team members. See e.g. the pluggable compression support thread. It tends to get stuck in a catch-22: pluggable support gets rejected (see that thread for why), but nobody can agree on a suitable, software-patent-safe algorithm to adopt as a new default, on how to change the format to handle multiple compression methods, etc.
Custom compression methods are becoming a reality. As reported here:
https://www.postgresql.org/message-id/20180618173045.7f734aca%40wp.localdomain
synthetic tests showed that zlib gives more compression but is usually slower than pglz
As noted here, when documents (say, text or xml datatypes with EXTENDED storage) larger than about 2 kB are stored, they are compressed.
For a table column that has been compressed, how can I retrieve the compressed (binary) form of the column?
NOTE - Typical applications:
Operations such as a "long-term checksum of the document", like SHA256(compressed). PS: since it is a matter of convention, no extra compression is needed; the condition is inherited, i.e. SHA256(less than 2 kB ? text : compressed).
Copying or downloading compressed documents directly (without CPU cost). PS: complementing the operation (for rows under 2 kB) with on-the-fly compression when uniformity is required.
If this is possible at all, it would require writing a function in C.
Instead of going that way, I would recommend that you use EXTERNAL rather than EXTENDED storage and compress the data yourself before you store them in the database. That way you don't waste any space, and you can decide when to retrieve the compressed data and when to uncompress them.
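A rough sketch of what that can look like, assuming a hypothetical docs(body bytea, sha256 text) table; it also covers the "long-term checksum" use case above by hashing the compressed bytes that are actually stored:

    # Sketch: EXTERNAL storage plus application-side compression.
    # Table and column names are hypothetical: docs(body bytea, sha256 text).
    import hashlib
    import zlib
    import psycopg2

    conn = psycopg2.connect("dbname=mydb")
    with conn, conn.cursor() as cur:
        # Out-of-line storage without PGLZ compression for this column.
        cur.execute("ALTER TABLE docs ALTER COLUMN body SET STORAGE EXTERNAL")

    document = open("report.xml", "rb").read()            # hypothetical input file
    compressed = zlib.compress(document, 9)
    digest = hashlib.sha256(compressed).hexdigest()       # checksum of the stored bytes

    with conn, conn.cursor() as cur:
        cur.execute("INSERT INTO docs (body, sha256) VALUES (%s, %s)",
                    (psycopg2.Binary(compressed), digest))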
I have a large file that I need to compress, but I need to ensure that the original file has the same hash value as the compressed one. I tried it on a smaller file; the hash values are different, but I suspect this might be because of metadata changes. How do I ensure the files don't change after compression?
It depends on which hash you are using. If you are using CRC32, it's pretty trivial to make your hashes the same. MD5 might be possible already (I don't know the state of the art there), SHA-1 will probably be doable in a few years. If you are using SHA-256, better give up.
Snark about broken crypto aside: unless your hash algorithm knows specifically about your compression setup, or your input file was very carefully crafted to provoke a hash collision, the hash will differ before and after compression. That means any standard cryptographic hash will change upon compression.
All the hash algorithm sees is a stream of bits without any meaning. It does not know about compression schemes, and it should not.
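As a quick illustration, a Python sketch; the exact digests depend on your data and zlib version, but the pre- and post-compression hashes will not match:

    # Sketch: the digest of a blob and of its compressed form are unrelated.
    import hashlib
    import zlib

    data = b"some large file contents ..." * 1000        # stand-in for the real file
    compressed = zlib.compress(data)

    print(hashlib.sha256(data).hexdigest())              # hash of the original bytes
    print(hashlib.sha256(compressed).hexdigest())        # hash of the compressed bytes: different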
If your hash is a CRC-32, then you can insert or append four bytes to the compressed data, and set those to get the original CRC. For example, in a gzip stream you can insert a four-byte extra block in the header.
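To make that concrete, here is a sketch that computes four bytes which, appended to a blob, force its CRC-32 to any target value; it relies on CRC-32 being affine over GF(2). In a real gzip file you would carry those four bytes in the header's extra field as described above, rather than leaving them dangling after the compressed stream:

    # Sketch: compute 4 bytes which, appended to `data`, force its CRC-32 to
    # `target_crc`. Relies on CRC-32 being affine over GF(2).
    import zlib

    def forge_crc32_suffix(data, target_crc):
        state = zlib.crc32(data)
        base = zlib.crc32(b"\x00" * 4, state)             # CRC of data + four zero bytes
        deltas = []                                       # effect of flipping each suffix bit
        for bit in range(32):
            patch = bytearray(4)
            patch[bit // 8] = 1 << (bit % 8)
            deltas.append(zlib.crc32(bytes(patch), state) ^ base)
        # Solve: XOR of chosen deltas == target_crc ^ base (Gaussian elimination over GF(2)).
        rows = [(deltas[i], 1 << i) for i in range(32)]
        want, chosen = target_crc ^ base, 0
        for col in range(31, -1, -1):
            pivot = next(((v, m) for v, m in rows if (v >> col) & 1), None)
            if pivot is None:
                continue
            rows.remove(pivot)
            rows = [(v ^ pivot[0], m ^ pivot[1]) if (v >> col) & 1 else (v, m) for v, m in rows]
            if (want >> col) & 1:
                want ^= pivot[0]
                chosen ^= pivot[1]
        assert want == 0                                  # a 4-byte suffix can always hit a CRC-32
        suffix = bytearray(4)
        for bit in range(32):
            if (chosen >> bit) & 1:
                suffix[bit // 8] |= 1 << (bit % 8)
        return bytes(suffix)

    original = b"example document contents" * 50
    compressed = zlib.compress(original)
    patched = compressed + forge_crc32_suffix(compressed, zlib.crc32(original))
    assert zlib.crc32(patched) == zlib.crc32(original)    # same CRC-32 as the uncompressed file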
The whole point of cryptographic hashes, like the MD5 mentioned in the question's tags, is to make that exceedingly difficult, or practically impossible.
Regarding integrity checking of files, I am not sure whether CRC32 or MD5 checksums generate "unpredictable" hash values:
When checking if files are identical, usually a CRC32 or MD5 checksum is used. It means that each file that is possibly a duplicate of another is read from the beginning to the end, and a unique number will be calculated based on its unique binary content. As a fingerprint, this number is stored and used to compare the file's contents to other files to determine if they are truly identical. That means a tiny change in a file results in a fairly large and "unpredictable" change in the generated hash.
This is not a proper use of the term "unpredictable". The algorithms are deterministic, which means that they will always produce the same output given the same input. Therefore they are entirely predictable.
Yes, for both algorithms a small change in the input will result in a "fairly large change" in the output, on the order of half of the output bits.
These checks cannot be used to determine if two files "are truly identical". They can only indicate that there is a very high probability that the two files are identical. You'd need to directly compare the two files to determine if they are truly identical.
On the other hand, if the checks differ, then you know for certain that the files differ.
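A small sketch of both points: a one-character change scrambles the CRC32 and MD5 outputs, and a byte-by-byte comparison is the final arbiter of "truly identical":

    # Sketch: avalanche effect of a tiny change, plus a direct comparison
    # as the final arbiter of "truly identical".
    import filecmp
    import hashlib
    import zlib

    a = b"The quick brown fox jumps over the lazy dog"
    b = b"The quick brown fox jumps over the lazy cog"   # one character differs

    print(hex(zlib.crc32(a)), hex(zlib.crc32(b)))        # very different CRC32 values
    print(hashlib.md5(a).hexdigest())
    print(hashlib.md5(b).hexdigest())                    # roughly half of the bits differ

    # Matching digests only mean "probably identical"; compare the bytes to be sure.
    def truly_identical(path1, path2):
        return filecmp.cmp(path1, path2, shallow=False)  # byte-by-byte comparison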
I'd like to encode some video files to MP4 and/or x264 format on Debian Linux.
It is very important that I can encode multiple files in parallel.
E.g. I want to encode two videos in parallel on a dual-core machine and put the other videos in a queue. When a video is finished, I want the freed core to encode the next video in the queue. Also, even if this works with x264, I don't know about MP4.
What is the best approach here?
x264 supports parallel encoding, but I don't know whether that means encoding multiple files in parallel or producing different versions of a single video in parallel.
Is there a way I can assign one encoding process to core 1 and another to core 2?
Sincerely,
wolfen
Do you really need to encode multiple videos in parallel (are they racing?), or do you just not want to leave extra processor cores idle?
In either case, FFmpeg should work for your needs.
By default FFmpeg will use all available CPUs for any processing, allowing faster processing of single videos. However, you can also explicitly specify the number of cores to use via the -threads parameter, e.g., ffmpeg -i input.mov -threads 1 output.mov will only use one core.
It doesn't have any built-in queueing, though; you'll still have to code that aspect on your own.
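As a rough sketch of that queueing (file names and encoder settings are only examples), you could run at most two ffmpeg processes at a time, each limited to one thread, and start the next queued file as soon as a slot frees up:

    # Sketch: encode at most two videos at a time, one ffmpeg thread each,
    # pulling the next file from the queue as soon as a slot frees up.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    QUEUE = ["a.mov", "b.mov", "c.mov", "d.mov"]          # hypothetical inputs

    def encode(src):
        out = Path(src).with_suffix(".mp4")
        cmd = ["ffmpeg", "-y", "-i", src,
               "-c:v", "libx264", "-preset", "medium", "-crf", "23",
               "-threads", "1",                           # keep each job on a single core
               str(out)]
        # To pin a job to a specific core on Linux you could prefix the command
        # with e.g. ["taskset", "-c", "0"], but the scheduler usually spreads
        # two busy processes across two cores anyway.
        return subprocess.run(cmd).returncode

    # Two workers ~= two cores; each finished job immediately frees a slot.
    with ThreadPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(encode, QUEUE))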