Encoding scheme for large data sets

Assume there is a secure transport layer that transfers files securely. What if we want to transfer multiple files over this channel in one round? They would have to be encoded into a single file so that when Bob receives that file he can decode it and see the individual files (e.g. photo albums). I think ASN.1 is fine for small data sets (e.g. text certificates), but for large data sets such an encoding scheme can noticeably increase the ciphertext size.
My question is: what encoding rule would you recommend for large data sets? It must be secure (well audited, free of known exploits) and efficient (it should not significantly increase the size at large scale).

ASN.1 is actually an efficient encoding. If you look at DER encoding of a long sequence of bytes (an OCTET STRING), you will see something like this:
04 [len] [the data bytes]
where 04 is the tag byte for an OCTET STRING, and len is the length of the data expressed in a reasonably compact format. The data bytes then follow, as is. The overhead for a stream of n bytes is 2+ceil(log2(n)/8) bytes; in other words, for a stream of up to, say, 256 terabytes, the overhead will be at most eight bytes.
One quirk of DER is that you have to know the total length of the data before beginning to encode it. However, the BER encoding (of which DER is just a subset) can come to the rescue: a value can be split into successive chunks, without needing a prior notion of the number of chunks. Overhead per chunk will be minimal (say 4 bytes for chunks up to 64 kilobytes), allowing for a total overhead of less than 0.01%. You will have trouble doing better.
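To make that overhead concrete, here is a minimal Java sketch of DER tag-and-length encoding for an OCTET STRING (illustration only; the class and method names are placeholders, not a standard API):
import java.io.ByteArrayOutputStream;
import java.math.BigInteger;

// Minimal sketch of DER tag-and-length encoding for an OCTET STRING.
public final class DerSketch {

    // Encodes the DER header (tag 0x04 + length) followed by the data bytes.
    public static byte[] encodeOctetString(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0x04);                     // universal tag for OCTET STRING
        writeLength(out, data.length);       // definite-length form
        out.write(data, 0, data.length);     // the data bytes, as-is
        return out.toByteArray();
    }

    // DER length: short form for values < 128, long form (0x80 | numBytes) otherwise.
    static void writeLength(ByteArrayOutputStream out, long len) {
        if (len < 0x80) {
            out.write((int) len);
            return;
        }
        byte[] lenBytes = BigInteger.valueOf(len).toByteArray();
        // BigInteger may prepend a 0x00 sign byte; strip it for the length octets.
        int off = (lenBytes[0] == 0) ? 1 : 0;
        out.write(0x80 | (lenBytes.length - off));
        out.write(lenBytes, off, lenBytes.length - off);
    }
}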
If you want to split that big stream into several internal "files" then you will need some convention, and, there again, ASN.1 is efficient:
BigStream ::= SEQUENCE OF SEQUENCE {
    fileName  UTF8String,
    fileData  OCTET STRING
}
One general header for the SEQUENCE OF (less than 8 bytes); then, for each file, the overhead beyond the data itself and the name (a UTF-8 string) will be less than 20 bytes (usually substantially less) if you use DER. For huge files whose length is not easily computable in advance, BER encoding can be used, with an overhead which can be lowered to 0.01%, as explained above.
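Continuing the sketch above (and reusing its writeLength helper), packing several in-memory files into that BigStream structure with DER could look roughly like this; it assumes every file fits in memory, which is exactly the case where DER, rather than chunked BER, is convenient:
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Sketch of DER encoding for:
//   BigStream ::= SEQUENCE OF SEQUENCE { fileName UTF8String, fileData OCTET STRING }
public final class BigStreamSketch {

    public static byte[] encode(Map<String, byte[]> files) throws IOException {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            byte[] name = tlv(0x0C, e.getKey().getBytes(StandardCharsets.UTF_8)); // UTF8String
            byte[] data = tlv(0x04, e.getValue());                                // OCTET STRING
            ByteArrayOutputStream inner = new ByteArrayOutputStream();
            inner.write(name);
            inner.write(data);
            body.write(tlv(0x30, inner.toByteArray()));                           // inner SEQUENCE
        }
        return tlv(0x30, body.toByteArray());                                     // outer SEQUENCE OF
    }

    // Tag-Length-Value helper, reusing the DER length rules sketched earlier.
    static byte[] tlv(int tag, byte[] value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(tag);
        DerSketch.writeLength(out, value.length);
        out.write(value, 0, value.length);
        return out.toByteArray();
    }
}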
If you are short on size, and the data is "normal data" (with some structure), consider applying compression. Most programming frameworks can use zlib or already offer a zlib-compatible implementation (e.g. Java and C#/.NET both have one).
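For instance, in Java the JDK's Deflater and Inflater classes give you a zlib-compatible stream out of the box; a minimal round-trip might look like this (buffer sizes are arbitrary):
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Minimal zlib round-trip with the JDK's built-in Deflater/Inflater.
public final class ZlibExample {

    public static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    public static byte[] decompress(byte[] input) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!inflater.finished()) {
            int n = inflater.inflate(buf);
            out.write(buf, 0, n);
        }
        inflater.end();
        return out.toByteArray();
    }
}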
As for auditing, this is not a matter of encoding but of implementation. ASN.1/BER encoding can be implemented in a few lines of code, especially if you are interested only in OCTET STRING, UTF8String and SEQUENCE. We are talking about less than 1000 lines of commented Java or C# here; if you cannot audit that, then you cannot audit anything.

Related

Does Postgres store bytea data as hex on the server?

To work with bytea values in PostgreSQL, I usually serialize to and deserialize from hex. This seems to be the preferred way. However, what is actually stored on the PostgreSQL server? Is it hex, or the raw binary? The reason I care is that hex obviously takes up double the space of raw binary: the hex string "00", which is 2 bytes of text, represents a single zero byte (0x00) of raw binary.
The context is I have a Postgres database and a Scylla database that are storing the exact same data in almost the exact same format. However, the total space used by Postgres is almost exactly double the space used by Scylla. For Scylla, I don't encode binary as hex. I just send raw binary over the wire. I don't expect these two databases to use the exact same amount of space. But for PostgreSQL to use double the space is quite a lot of overhead, and the nearly exact doubling really makes me suspect data is being stored as hex and not actual binary on the server (since hex uses exactly double the space as actual binary).
A bytea is stored in binary form, not hex encoded, which would be enormously wasteful. The hex representation is just the default text representation generated by the type output function.
I don't know Scylla, so I cannot explain the difference, but PostgreSQL has substantial overhead per row (23 bytes), and there is also some overhead per 8kB block.
You say in your comments that you measured the database size, which includes all the metadata and indexes. I suggest that you use pg_table_size to measure the table.
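For example, from JDBC that check could look like the following sketch (it assumes the PostgreSQL JDBC driver; the connection URL, credentials and table name are placeholders):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Measures the size of a single table (excluding indexes) via pg_table_size.
public final class TableSizeCheck {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost:5432/mydb";   // hypothetical database
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement("SELECT pg_table_size(?::regclass)")) {
            ps.setString(1, "my_table");                         // hypothetical table name
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    System.out.println("table size in bytes: " + rs.getLong(1));
                }
            }
        }
    }
}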
Note that PostgreSQL automatically compresses values if a table row would otherwise exceed 2000 bytes.

Why don't you need to use the hton/ntoh functions when sending a const char*?

I'm sending a data buffer over the network using sockets. The only place I see endianness conversion being used is for the port number of the sender/receiver. I can understand that.
The thing I can't understand is that I'm sending a const char* (using send()/sendto(), depending on the transport protocol). As far as I understand, bytes are transferred over the network in big-endian order, and my machine is little-endian.
So what is the trick that makes the ntoh()/hton() functions unnecessary when sending that const char*?
Background: The concept of big-endian vs little-endian only applies to multi-byte integers (and floating point values); for historical reasons, different CPUs may represent the same numeric binary value in memory via different byte-orderings, and so the designers of the Berkeley sockets API created ntoh*() and hton*() to translate from the CPU's native byte-ordering (whatever that may be) to a "network-standard" format (they chose big-endian) and back, so that binary numbers can be transferred from one machine type to another without being misinterpreted by the recipient's CPU.
Crucially, this splintering of representations happens only above the byte-level; i.e. the ordering of bytes within an N-byte word may be different on different CPUs, but the ordering of bits within each byte is always the same.
A character string (as pointed to by a const char *) refers to a series of individual 8-bit bytes, and the bytes' ordering is interpreted the same way by all CPU types, so there is no need to convert their format to maintain data-compatibility across CPUs. (Note that there are different formats that can be used to represent character strings, e.g. ASCII vs UTF-8 vs UTF-16 vs etc, but that is a decision made by the software rather than a characteristic of the CPU hardware the software is running on; as such, the Berkeley sockets API doesn't try to address that issue)
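The same distinction can be illustrated in Java: a multi-byte integer has a byte order you must choose, while a sequence of single bytes does not. A small demonstration:
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Byte order matters for multi-byte integers, not for byte/character sequences.
public final class EndiannessDemo {
    public static void main(String[] args) {
        // A 4-byte integer: its byte layout depends on the chosen byte order.
        byte[] big    = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN).putInt(0x0A0B0C0D).array();
        byte[] little = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(0x0A0B0C0D).array();
        System.out.println(Arrays.toString(big));    // [10, 11, 12, 13]
        System.out.println(Arrays.toString(little)); // [13, 12, 11, 10]

        // A character string is just a series of individual bytes: every machine
        // sees them in the same order, so no hton/ntoh-style swap is needed.
        byte[] text = "hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.toString(text));   // [104, 101, 108, 108, 111]
    }
}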

Marc21 Binary Decoder with Akka-Stream

I'm trying to decode MARC 21 binary data records, which have the following specification concerning the field that provides the length of the record.
A Computer-generated, five-character number equal to the length of the
entire record, including itself and the record terminator. The number
is right justified and unused positions contain zeros.
I am trying to use Akka Stream's Framing.lengthField, but I just don't know how to specify the size of that field. I imagine that a character is 8 bits, maybe 16 for a number; I am not sure, and I wonder whether that depends on the platform or language. In short, the question is: is it possible to say what the size of that field is, knowing that I am in Scala/Java?
Also, what does this mean:
"The number is right justified and unused positions contain zeros"
Does that have any implication for how one reads the value, once it is collected properly?
If anyone knows anything about this, please share.
EDIT1
Context:
I am trying to build a stream-processing graph whose first stage processes the result of a sys command run against a Symphony (vendor cataloging system) server, which is a stream of unstructured byte chunks that, as a whole, represent all the MARC 21 records requested (full or partial dump).
By processing I mean chunking that unstructured stream of bytes into a stream of frames, where the frames are the records.
In other words, reading the bytes for one record at a time and emitting each record individually to the next stage.
The next stage will consist of emitting that record (bytes) to Apache Kafka.
Obviously the emission stage would be fully parallelized to speed up the process.
The Symphony server does not have the capability to stream a dump when requested, especially over the network. Hence this Akka Streams based processing graph to perform that work, for fast ingestion/production and overall stream processing of our dumps in our fast-data infrastructure.
EDIT2
Based on @badcook's input, I wonder if ComputeFramesize could be used here. I am not sure; I am slightly confused by the function and what it takes as parameters.
A little clarification would be much appreciated.
It looks like you're trying to parse MARC 21 records.
In that case I would recommend you just take a look at MARC4J and use that.
If you want to integrate it with Akka streams, or even if you want to parse MARC records your own way, I would recommend breaking up your byte stream with Framing.delimiter using the MARC 21 record terminator (ASCII control character 1D) into complete MARC records, rather than trying to stream and work with fragments of MARC records. It'll be a lot easier.
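For instance, a framing stage along these lines (Java DSL shown; the maximum frame length is an assumption, and the flow is only a sketch, untested against a real dump) would emit one complete record per element:
import akka.NotUsed;
import akka.stream.javadsl.Flow;
import akka.stream.javadsl.Framing;
import akka.stream.javadsl.FramingTruncation;
import akka.util.ByteString;

// Sketch: split an incoming byte stream into whole MARC 21 records by framing on
// the record terminator (0x1D). Each emitted frame is one record, terminator stripped.
public final class MarcFraming {

    static final byte RECORD_TERMINATOR = 0x1D;

    public static Flow<ByteString, ByteString, NotUsed> recordFrames() {
        return Framing.delimiter(
                ByteString.fromArray(new byte[] { RECORD_TERMINATOR }),
                100_000,                     // assumed upper bound on record size in bytes
                FramingTruncation.ALLOWED);  // tolerate a final record with no terminator
    }
}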
As for your specific questions: The MARC 21 specification uses characters rather than raw bytes when talking about its structure. It specifies two character encodings into raw bytes, UTF-8 and MARC-8, both of which are variable-width encodings. Hence, no, it is not true that every character is a byte; there is no single answer to how many bytes a character takes up.
"[R]ight justified and unused positions contain zeroes" is another way of saying that numbers are padded from the left with 0s. In this case the line comes from a larger quote stating that the numerical string must be 5 characters long. That means if you are trying to represent the number 1, you must represent it as 00001.

How to convert binary data into string format in J2ME

I am using the gzip algorithm in J2ME. After compressing the string, I tried to send the compressed string as a text message, but the size increased drastically. So I used Base64 encoding to convert the compressed binary to text. But the encoding still increases the size; please help me with an encoding technique which, when used, keeps the data size the same.
I tried sending a binary SMS, but as its limit is 134 characters, I want to compress the data before sending the SMS.
You have some competing requirements here.
The fact you're considering using SMS as a transport mechanism makes me suspect that the data you have to send is quite short to start with.
Compression algorithms (in general) work best with large amounts of data and can end up creating a longer output than you started with if you start with something very short.
There are very few useful encoding changes that will leave you with output the same length as when you started. (I'm struggling to think of anything really useful right now.)
You may want to consider alternative transmission methods or alternatives to the compression techniques you have tried.
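You can see the effect directly. J2ME itself has no java.util.zip, but a quick desktop-Java check (purely to illustrate the size behaviour described above) shows why a short message grows after gzip and grows again after Base64:
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPOutputStream;

// Illustrates why compressing (and then Base64-encoding) a short message makes it larger.
public final class ShortMessageSize {
    public static void main(String[] args) throws Exception {
        byte[] original = "Meet me at 7pm at the usual place.".getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(original);
        }
        byte[] compressed = buf.toByteArray();
        String base64 = Base64.getEncoder().encodeToString(compressed);

        System.out.println("original bytes:    " + original.length);   // 34
        System.out.println("gzipped bytes:     " + compressed.length); // larger: the gzip header/trailer alone cost ~18 bytes
        System.out.println("Base64 characters: " + base64.length());   // larger still (roughly 4/3 expansion)
    }
}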

Compressing ASCII data to fit within a UTF-32 API?

I have an API that receives Unicode data, but I only need to store ASCII in it. I'd like to compress & obfuscate (or encrypt) the string values that will be persisted in Unicode.
My desire is to either compress this schema data, or to encrypt it from prying eyes. I don't think it's possible to do both well.
Considering that I want to restrict my source data to valid, printable ASCII; how can I "compress" that original string value into a value that is either smaller, obfuscated, or both?
Here is how I imagine this working (though you may have a better way):
1. This source code will take a given String as input.
2. The byte representation of that string will be taken (UTF-8, ASCII, you decide).
3. Some magic happens (this is the part I need your help on).
4. The resulting bytes will be converted into an int or long (no decimal points).
5. The number will be converted into a corresponding character using this utility:
http://baseanythingconvert.codeplex.com/SourceControl/changeset/view/77855#1558651
(note that the utility will be used to enforce the constraint that the "final" Unicode name must not include the characters '/', '\', '#', '?' or '%')
Background
The Microsoft Azure Table service has an API that accepts Unicode data for the storage of property names. It is a schema-free database (so columns can be created ad hoc), and therefore the schema is stored per row. The downside is that this schema data is stored on disk multiple times, and it is also transmitted over the wire, quite redundantly, in an XML blob.
In addition, I'm working on a utility that dynamically encrypts/decrypts Azure Table Data, but the schema is unencrypted. I'd like to mask or obfuscate this header information somehow.
These are just some ideas.
Isn't step 3 actually straightforward (just compress and/or encrypt the data into different bytes)? For 7-bit ASCII, you can also, before compressing and/or encrypting, store the data by packing the bits so they fit into fewer bytes.
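As a small sketch of that bit-packing idea (illustrative only, with a hypothetical class name and sample input): 7-bit ASCII wastes one bit per byte, so eight characters can be squeezed into seven bytes.
// Packs 7-bit ASCII characters into a contiguous bit stream: 8 chars -> 7 bytes.
public final class AsciiPacker {

    public static byte[] pack(String ascii) {
        int bitLen = ascii.length() * 7;
        byte[] out = new byte[(bitLen + 7) / 8];
        int bitPos = 0;
        for (int i = 0; i < ascii.length(); i++) {
            int c = ascii.charAt(i);
            if (c > 0x7F) throw new IllegalArgumentException("not 7-bit ASCII: " + c);
            for (int b = 6; b >= 0; b--) {          // write the 7 bits, most significant first
                if (((c >> b) & 1) != 0) {
                    out[bitPos / 8] |= (byte) (0x80 >> (bitPos % 8));
                }
                bitPos++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] packed = pack("TableSchema1");       // hypothetical column name: 12 chars -> ceil(84/8) = 11 bytes
        System.out.println(packed.length);          // prints 11
    }
}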
If you can use UTF-32, UTF-8, and so on in step 5, you have access to all the characters in the Unicode Standard, up to 0x10FFFD, with some exceptions; for example, some code points are noncharacters in the Unicode Standard, such as 0xFFFF, and others are invalid characters, such as 0xD800.