Does Postgres store bytea data as hex on the server? - postgresql

To work with bytea values in PostgreSQL, I usually serialize to and deserialize from hex. This seems to be the preferred way. However, what is actually stored on the PostgreSQL server: the hex text, or the unhexed binary? The reason I care is that hex takes up double the space of the unhexed binary. By unhexed binary I mean that the hex string "00", which is 2 bytes, is stored as the single byte 0x00.
The context is that I have a Postgres database and a Scylla database storing the exact same data in almost the exact same format. However, the total space used by Postgres is almost exactly double the space used by Scylla. For Scylla, I don't encode binary as hex; I just send raw binary over the wire. I don't expect the two databases to use exactly the same amount of space, but for PostgreSQL to use double the space is quite a lot of overhead, and the nearly exact doubling makes me suspect that the data is being stored as hex rather than as actual binary on the server (since hex takes exactly double the space of the binary it encodes).

A bytea is stored in binary form, not hex encoded, which would be enormously wasteful. The hex representation is just the default text representation generated by the type output function.
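A small Python sketch of that distinction, assuming the default hex output format (a `\x` prefix followed by two hex digits per byte). The helper names `bytea_to_text` and `text_to_bytea` are made up for illustration; they are not PostgreSQL or driver functions.

```python
def bytea_to_text(raw: bytes) -> str:
    """Render raw bytes the way the hex output format displays a bytea."""
    return "\\x" + raw.hex()


def text_to_bytea(text: str) -> bytes:
    """Parse the hex text form back into the raw bytes it stands for."""
    if not text.startswith("\\x"):
        raise ValueError("expected hex format starting with \\x")
    return bytes.fromhex(text[2:])


raw = bytes([0x00, 0xFF, 0x10])   # what the server keeps: 3 bytes
text = bytea_to_text(raw)         # what a client sees by default
print(len(raw), len(text), text)
```

The text form is always twice the raw length plus the two-character prefix, but it only exists when the value is converted for output.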
I don't know Scylla, so I cannot explain the difference, but PostgreSQL has substantial overhead per row (23 bytes), and there is also some overhead per 8kB block.
You say in your comments that you measured the database size, which includes all the metadata and indexes. I suggest that you use pg_table_size to measure the table.
Note that PostgreSQL automatically compresses values if a table row would otherwise exceed 2000 bytes.
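To get a feel for how much the fixed per-row cost matters, here is a rough back-of-the-envelope estimate in Python. The 23-byte row header is the figure given above; the 24-byte page header, the 4-byte line pointer per row and the 8-byte alignment are assumptions about a typical 64-bit build, and the function names are made up. It ignores indexes, TOAST, null bitmaps and free space, so treat the numbers as an illustration only.

```python
PAGE_SIZE = 8192      # default block size
PAGE_HEADER = 24      # assumption: fixed header at the start of each block
TUPLE_HEADER = 23     # per-row header, as stated in the answer
LINE_POINTER = 4      # assumption: one slot per row in the block
ALIGN = 8             # assumption: 8-byte alignment


def align(n: int, boundary: int = ALIGN) -> int:
    """Round n up to the next multiple of boundary."""
    return (n + boundary - 1) // boundary * boundary


def rows_per_page(payload: int) -> int:
    """Estimate how many rows with `payload` bytes of column data fit in a block."""
    tuple_size = align(align(TUPLE_HEADER) + payload)
    return (PAGE_SIZE - PAGE_HEADER) // (tuple_size + LINE_POINTER)


def bytes_per_row(payload: int) -> float:
    """Estimated on-disk bytes consumed per row, including block overhead."""
    return PAGE_SIZE / rows_per_page(payload)


for payload in (32, 128, 1024):
    ratio = bytes_per_row(payload) / payload
    print(f"{payload:5d} payload bytes -> about {ratio:.2f}x on disk")
```

Under these assumptions, rows carrying only a few dozen bytes of data can take close to twice their payload size on disk without any hex encoding being involved, while larger rows are affected much less.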

Related

PostgreSQL bytea network traffic double expected value

I'm investigating a bandwidth problem and stumbled on an issue with retrieving a bytea value. I tested this with PostgreSQL 10 and 14, the respective psql clients and the psycopg2 client library.
The issue is that if the size of a bytea value is e.g. 10 MB (which I can confirm with select length(value) from table where id=1) and I run select value from table where id=1, then the amount of data transferred over the socket is about 20 MB. Note that the value in the database is pre-compressed (so high entropy), and the table is set to not compress the bytea value to avoid double work.
I can't find any obvious encoding issue since it's all just bytes. I can understand that the psql CLI command may negotiate some encoding so it can print the result, but psycopg2 definitely doesn't do that, and I experience the same behaviour.
I tested the same scenario with a text field, and that nearly worked as expected. I started with copy paste of lorem ipsum and it transferred the correct amount of data, but when I changed the text to be random extended ASCII values (higher entropy again), it transferred more data than it should have. I have compression disabled for all my columns so I don't understand why that would happen.
Any ideas as to why this would happen?
That is normal. By default, values are transferred as strings, so a bytea would be rendered in hexadecimal numbers, which doubles its size.
As a workaround, you could transfer such data in binary mode. The frontend-backend protocol and the C library offer support for that, but it will depend on your client API whether you can make use of that or not.
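The difference can be sketched by building the row message by hand. This is a minimal illustration, assuming the DataRow layout of the frontend/backend protocol (message type `D`, a 4-byte message length, a 2-byte column count, then a 4-byte length before each column value); the function `data_row` is hypothetical and not part of any client library.

```python
import struct


def data_row(value: bytes, binary: bool) -> bytes:
    """Build a one-column DataRow message in text or binary result format."""
    # Text format: the bytea is rendered as \x followed by hex digits.
    payload = value if binary else b"\\x" + value.hex().encode("ascii")
    body = struct.pack("!H", 1) + struct.pack("!i", len(payload)) + payload
    # The message length counts itself (4 bytes) but not the type byte.
    return b"D" + struct.pack("!i", len(body) + 4) + body


value = bytes(1000)
print(len(data_row(value, binary=True)), len(data_row(value, binary=False)))
```

The fixed framing is the same either way; only the column payload doubles in text format, which matches roughly 20 MB on the wire for a 10 MB value.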

unix db2 BIGINT Vs Decimal as primary key

Need suggestion on which Datatype would give better performance if we set one of these as primary key in DB2 - BIGINT or Decimal(13,0) type?
I suspect Decimal(13,0) will have issues once the key grows to a very big size but I wanted a better answer/understanding for this.
Thanks.
Decimal does not have issues. The only difference is that DB2 has to do more work to interpret the data once it has been read: after reading the value, it still has to locate the decimal part (the scale), even if that is 0.
On the other hand, DB2 reads a BIGINT and needs no further processing. The number is already in the bufferpool in usable form.
If most of your values are 13-digit integers, Decimal will probably be better because you do not spend bytes on digits you never use, although decimals do carry extra bytes for the precision. Used this way, Decimal optimizes storage, which translates into better I/O and better performance. However, it also depends on the other columns of your table. You have to test which of the two gives you better performance.
When using compression, more CPU cycles are needed to recover the information. You have to test whether performance is affected.
Use BIGINT:
Can store ~19 digits (versus 13)
Will take 8 bytes (versus maybe either 7 or 13 - see next)
Depending on platform, DECIMAL will be stored as a form of Binary Coded Decimal - for example, on the iSeries (and I can't remember if it's Packed or Zoned). Can't speak to other deployments, unfortunately.
You aren't doing math on these values (things like "next entry" don't count) - save DECIMAL/NUMERIC for measurements/values.
Note that, really, ids are just a sequence of bits - the fact that it happens to be an integer (usually) is irrelevant. It's best to consider them random data; sequential assignment is an optimization detail, there's often gaps (rollbacks, system crashes, whatever), and they're meaningless for anything other than joining.
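The "7 or 13" figure above can be reproduced with a little arithmetic. This sketch assumes the usual Binary Coded Decimal layouts (packed: two digits per byte plus a sign nibble; zoned: one digit per byte); the function names are made up, and the actual storage on a given DB2 platform should be checked in its documentation.

```python
BIGINT_BYTES = 8
BIGINT_MAX = 2**63 - 1  # largest signed 64-bit value


def packed_decimal_bytes(precision: int) -> int:
    """Packed BCD: two digits per byte, plus half a byte for the sign."""
    return precision // 2 + 1


def zoned_decimal_bytes(precision: int) -> int:
    """Zoned decimal: one byte per digit."""
    return precision


print("BIGINT:", BIGINT_BYTES, "bytes, up to", len(str(BIGINT_MAX)), "digits")
print("DECIMAL(13,0) packed:", packed_decimal_bytes(13), "bytes")
print("DECIMAL(13,0) zoned:", zoned_decimal_bytes(13), "bytes")
```

So the packed form saves at most one byte per key compared with BIGINT, while the zoned form costs five more.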

Encoding scheme for large data sets

Assume there is a secure transport layer that securely transfers files. What if we want to transfer multiple files over this channel in one round? They would have to be encoded into a single file, so that when Bob receives that file he is able to decode it and see the individual files (e.g. a photo album). I think ASN.1 is good for small data sets (e.g. text certificates), but for large data sets such an encoding scheme can increase the ciphertext size.
My question is: which encoding rules do you recommend for large data sets? They must be secure (audited well enough to be exploit-free) and efficient (no significant size increase at large scale).
ASN.1 is actually an efficient encoding. If you look at DER encoding of a long sequence of bytes (an OCTET STRING), you will see something like this:
04 [len] [the data bytes]
where 04 is a single byte value, and len is the length of the data expressed in a reasonably compact format. The data bytes then follow, as is. Overhead for a long stream of length n bytes is 2+ceil(log2(n)/8) bytes; in other words, for a stream of up to, say, 256 terabytes, the overhead will be at most eight bytes.
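As a minimal sketch of that header, here is the DER tag-and-length prefix for an OCTET STRING in Python. The function names are illustrative only and do not come from any ASN.1 library.

```python
def der_length(n: int) -> bytes:
    """Encode a length in DER: short form below 128, minimal long form above."""
    if n < 0:
        raise ValueError("length must be non-negative")
    if n < 0x80:
        return bytes([n])
    size = (n.bit_length() + 7) // 8          # bytes needed for n
    return bytes([0x80 | size]) + n.to_bytes(size, "big")


def octet_string_header(n: int) -> bytes:
    """Tag 0x04 followed by the DER length of an n-byte OCTET STRING."""
    return b"\x04" + der_length(n)


for n in (5, 300, 2**48 - 1):
    print(n, octet_string_header(n).hex())
```

For a length just under 2^48 bytes (256 terabytes) the header is one tag byte, one length-of-length byte and six length bytes, which is the eight bytes mentioned above.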
One quirk of DER is that you have to know the total length of the data before beginning to encode it. However, the BER encoding (of which DER is just a subset) can come to the rescue: a value can be split into successive chunks, without needing a prior notion of the number of chunks. Overhead per chunk will be minimal (say 4 bytes for chunks up to 64 kilobytes), allowing for a total overhead of less than 0.01%. You will have trouble doing better.
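The chunked form can be sketched as a generator. This assumes the BER constructed, indefinite-length encoding of an OCTET STRING (tag 0x24, length byte 0x80, primitive chunks, then a 00 00 end marker) and always uses a two-byte chunk length, which BER permits even for short chunks. The function name is hypothetical.

```python
from typing import Iterable, Iterator


def ber_chunked_octet_string(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield a BER OCTET STRING piece by piece, without knowing the total size."""
    yield b"\x24\x80"                      # constructed OCTET STRING, indefinite length
    for chunk in chunks:
        if not 0 < len(chunk) <= 0xFFFF:
            raise ValueError("chunk must be 1..65535 bytes")
        # Primitive OCTET STRING: tag, 0x82, two length bytes = 4 bytes overhead.
        yield b"\x04\x82" + len(chunk).to_bytes(2, "big") + chunk
    yield b"\x00\x00"                      # end-of-contents marker


encoded = b"".join(ber_chunked_octet_string([b"a" * 65535, b"b" * 10]))
print(len(encoded) - 65545, "bytes of overhead")
```

With full-size chunks the cost is 4 bytes per 65535 bytes of data, plus 4 bytes once for the outer wrapper.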
If you want to split that big stream into several internal "files" then you will need some convention, and, there again, ASN.1 is efficient:
BigStream ::= SEQUENCE OF SEQUENCE {
    fileName UTF8String,
    fileData OCTET STRING
}
One general header for the SEQUENCE OF (less than 8 bytes); then, for each file, the overhead beyond the data itself and the name (an UTF-8 string) will be less than 20 bytes (usually substantially less) if you use DER. For huge files such that the length is not easily computable in advance, BER encoding can be used, with an overhead which can be lowered to 0.01%, as explained above.
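A minimal DER encoder for that structure might look like the sketch below. It builds everything in memory, which is only reasonable for illustration (real multi-gigabyte files would be streamed), and the helper names `der_length`, `tlv` and `encode_big_stream` are made up for this example.

```python
def der_length(n: int) -> bytes:
    """DER length: short form below 128, minimal long form otherwise."""
    if n < 0x80:
        return bytes([n])
    size = (n.bit_length() + 7) // 8
    return bytes([0x80 | size]) + n.to_bytes(size, "big")


def tlv(tag: int, content: bytes) -> bytes:
    """One tag-length-value element."""
    return bytes([tag]) + der_length(len(content)) + content


def encode_big_stream(files: list[tuple[str, bytes]]) -> bytes:
    """SEQUENCE OF SEQUENCE { fileName UTF8String, fileData OCTET STRING }."""
    entries = b"".join(
        tlv(0x30, tlv(0x0C, name.encode("utf-8")) + tlv(0x04, data))
        for name, data in files
    )
    return tlv(0x30, entries)


files = [("a.jpg", bytes(1000)), ("b.txt", b"hi")]
encoded = encode_big_stream(files)
payload = sum(len(name.encode("utf-8")) + len(data) for name, data in files)
print(len(encoded) - payload, "bytes of overhead for", len(files), "files")
```

Here 0x30 is the SEQUENCE tag, 0x0C is UTF8String and 0x04 is OCTET STRING; the per-file overhead in this example is 10 bytes or less.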
If you are short on size, and the data is "normal data" (with some structure), consider applying compression. Most programming frameworks can use zlib or already offer a zlib-compatible implementation (e.g. Java and C#/.NET both have one).
As for audit, this is not a matter of encoding but of implementation. ASN.1/BER encoding can be implemented in a few lines of code, especially if you are interested only in OCTET STRING, UTF8String and SEQUENCE. We are talking about less than 1000 lines of commented Java or C# here; if you cannot audit that then you cannot audit anything.
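For example, with Python's standard zlib module (the sample data here is invented purely to have something repetitive to compress):

```python
import zlib

# Structured, repetitive data compresses well.
data = b"name=alice;role=admin;" * 1000
compressed = zlib.compress(data, 9)     # 9 = best compression level

print(len(data), "->", len(compressed), "bytes")
restored = zlib.decompress(compressed)
```

Data that is already compressed or encrypted will not shrink this way, so compression has to be applied before encryption.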

DB2 VARCHAR unicode data storage

We are currently using VARCHAR for storing text data in DB2. However, we are hitting the problem that the declared VARCHAR length is not the same as the length of the text: in DB2 the length specified for a VARCHAR is the length of the UTF-8 data, which varies with the stored text. For example, some texts contain characters from different languages, and because of that some texts with 500 characters can't be saved in a VARCHAR(500).
Now we are planning to migrate to VARGRAPHIC. I need to know what the limitations are of using VARGRAPHIC for storing Unicode text data in DB2.
Are there any problems with using VARGRAPHIC?
DB2 doesn't check that the data is in fact a double-byte string, but it assumes it must be. Usually the drivers will do the proper conversions for you, but you might one day bump into a bug. It is unlikely, though.
If you use federated databases, VARGRAPHIC support in queries might fail completely. Overall, the number of bug reports for the VARGRAPHIC data types is somewhat high. Support for them probably isn't as well tested and tried as for other data types.
With a Unicode database (i.e. UTF-8 is a requirement), VARGRAPHIC will use big-endian UCS-2, meaning your space requirements for those columns double. VARGRAPHIC is a DB2 proprietary data type. If you migrate off DB2 some day, you will have to do an extra conversion.
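The size trade-off can be checked quickly in Python. This sketch uses the `utf-16-be` codec as a stand-in for big-endian UCS-2, which is an assumption that holds for characters in the Basic Multilingual Plane; the sample strings are arbitrary.

```python
def sizes(text: str) -> tuple[int, int, int]:
    """Return (characters, UTF-8 bytes, UTF-16-BE bytes) for a string."""
    return len(text), len(text.encode("utf-8")), len(text.encode("utf-16-be"))


for sample in ("hello", "Привет", "日本語"):
    chars, utf8, utf16 = sizes(sample)
    print(f"{sample}: {chars} chars, {utf8} bytes UTF-8, {utf16} bytes UTF-16-BE")
```

Plain ASCII text doubles in size in the two-byte encoding, Cyrillic stays the same, and CJK text actually gets smaller, so the "double" figure is the worst case for mostly-ASCII data.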

Compressing ASCII data to fit within a UTF-32 API?

I have an API that receives Unicode data, but I only need to store ASCII in it. I'd like to compress & obfuscate (or encrypt) the string values that will be persisted in Unicode.
My desire is to either compress this schema data, or to encrypt it from prying eyes. I don't think it's possible to do both well.
Considering that I want to restrict my source data to valid, printable ASCII, how can I "compress" that original string value into a value that is either smaller, obfuscated, or both?
Here is how I imagine this working (though you may have a better way):
1. This source code will take a given String as input
2. The byte representation of that string will be taken (UTF8, ASCII, you decide)
3. Some magic happens (this is the part I need your help on)
4. The resulting bytes will be converted into an int or long (no decimal points)
5. The number will be converted into a corresponding character using this utility:
http://baseanythingconvert.codeplex.com/SourceControl/changeset/view/77855#1558651
(note that the utility will be used to enforce the constraint that the "final" Unicode name must not include any of the characters '/', '\', '#', '?' or '%')
Background
The Microsoft Azure Table has an API that accepts Unicode data for the storage or property names. This is a schema-free database (so columns can be created ad-hoc), therefore the schema is stored per row. The downside is that this schema data is stored on disk multiple times, and it is also transmitted over the wire, quite redundantly, in an XML blob.
In addition, I'm working on a utility that dynamically encrypts/decrypts Azure Table Data, but the schema is unencrypted. I'd like to mask or obfuscate this header information somehow.
These are just some ideas.
Isn't step 3 actually straightforward (just compress and/or encrypt the data into different bytes)? For 7-bit ASCII, you can also, before compressing and/or encrypting, store the data by packing the bits so they fit into fewer bytes.
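The bit-packing idea can be sketched as follows. This is one possible layout, not a standard format: the characters are concatenated as 7-bit groups, most significant first, and padded with zero bits at the end. It relies on the input being printable ASCII (no NUL characters), as the question states, so that trailing zero padding can be stripped on the way back. The names `pack7` and `unpack7` are made up.

```python
def pack7(text: str) -> bytes:
    """Pack ASCII text into 7 bits per character (8 characters fit in 7 bytes)."""
    codes = text.encode("ascii")            # raises on non-ASCII input
    value = 0
    for code in codes:
        value = (value << 7) | code
    n_bytes = (7 * len(codes) + 7) // 8     # round up to whole bytes
    value <<= 8 * n_bytes - 7 * len(codes)  # zero padding in the low bits
    return value.to_bytes(n_bytes, "big")


def unpack7(packed: bytes) -> str:
    """Reverse pack7, dropping any all-zero padding group at the end."""
    total_bits = 8 * len(packed)
    count = total_bits // 7
    value = int.from_bytes(packed, "big") >> (total_bits % 7)
    codes = bytes((value >> (7 * (count - 1 - i))) & 0x7F for i in range(count))
    return codes.decode("ascii").rstrip("\x00")


packed = pack7("hello wo")
print(len("hello wo"), "characters ->", len(packed), "bytes")
```

The saving is a fixed 12.5%, so for longer names a general-purpose compressor applied before encryption may do better.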
If you can use UTF-32, UTF-8, and so on in step 5, you have access to all the characters in the Unicode Standard, up to 0x10FFFD, with some exceptions; for example, some code points are noncharacters in the Unicode Standard, such as 0xFFFF, and others are invalid characters, such as 0xD800.
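A filter for the target alphabet could be sketched like this. It only covers the exclusions mentioned here and in the question (surrogates, noncharacters, and the five forbidden characters); it does not check for control characters, unassigned code points, or any further rules the storage API may impose, and the function name is hypothetical.

```python
FORBIDDEN = set("/\\#?%")   # characters the question says must not appear


def is_usable_code_point(cp: int) -> bool:
    """Return True if cp could serve as an output symbol under these rules."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    if 0xD800 <= cp <= 0xDFFF:                      # surrogates, e.g. 0xD800
        return False
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return False                                # noncharacters, e.g. 0xFFFF
    return chr(cp) not in FORBIDDEN


usable_in_bmp = sum(is_usable_code_point(cp) for cp in range(0x10000))
print(usable_in_bmp, "candidate symbols below 0x10000")
```

The larger the usable alphabet, the more bits each output character can carry in the base conversion of step 5.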