Compressing ASCII data to fit within a UTF-32 API? - unicode

I have an API that receives Unicode data, but I only need to store ASCII in it. I'd like to compress & obfuscate (or encrypt) the string values that will be persisted in Unicode.
My desire is to either compress this schema data, or to encrypt it from prying eyes. I don't think it's possible to do both well.
Considering that I want to restrict my source data to valid, printable ASCII, how can I "compress" that original string value into a value that is either smaller, obfuscated, or both?
Here is how I imagine this working (though you may have a better way):
1. This source code will take a given String as input.
2. The byte representation of that string will be taken (UTF-8, ASCII, you decide).
3. Some magic happens (this is the part I need your help on).
4. The resulting bytes will be converted into an int or long (no decimal points).
5. The number will be converted into a corresponding character using this utility:
http://baseanythingconvert.codeplex.com/SourceControl/changeset/view/77855#1558651
(Note that the utility will be used to enforce the constraint that the "final" Unicode name must not include the following characters: '/', '\', '#', '?' or '%'.)
Background
The Microsoft Azure Table has an API that accepts Unicode data for the storage of property names. This is a schema-free database (so columns can be created ad hoc), therefore the schema is stored per row. The downside is that this schema data is stored on disk multiple times, and it is also transmitted over the wire, quite redundantly, in an XML blob.
In addition, I'm working on a utility that dynamically encrypts/decrypts Azure Table Data, but the schema is unencrypted. I'd like to mask or obfuscate this header information somehow.

These are just some ideas.
Isn't step 3 actually straightforward (just compress and/or encrypt the data into different bytes)? For 7-bit ASCII, you can also, before compressing and/or encrypting, store the data by packing the bits so they fit into fewer bytes.
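As a rough sketch of the bit-packing idea (the function names pack7 and unpack7 are made up for this example, and the original length is assumed to be stored separately), eight 7-bit characters fit into seven bytes:

```python
def pack7(text: str) -> bytes:
    """Pack 7-bit ASCII characters back to back, 8 characters per 7 bytes."""
    data = text.encode("ascii")  # raises UnicodeEncodeError on non-ASCII input
    acc = 0
    for b in data:
        acc = (acc << 7) | b  # append the 7 significant bits of each character
    nbits = 7 * len(data)
    pad = (-nbits) % 8  # zero bits added at the end to reach a byte boundary
    acc <<= pad
    return acc.to_bytes((nbits + pad) // 8, "big")


def unpack7(packed: bytes, length: int) -> str:
    """Reverse pack7; `length` is the original character count."""
    pad = (-7 * length) % 8
    acc = int.from_bytes(packed, "big") >> pad
    chars = [(acc >> (7 * (length - 1 - i))) & 0x7F for i in range(length)]
    return bytes(chars).decode("ascii")


name = "PartitionKey"
packed = pack7(name)
restored = unpack7(packed, len(name))
```

This only saves one byte in eight, so it is more of a pre-step before compression or encryption than a solution on its own.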
If you can use UTF-32, UTF-8, and so on in step 5, you have access to all the characters in the Unicode Standard, up to 0x10FFFD, with some exceptions; for example, some code points are noncharacters in the Unicode Standard, such as 0xFFFF, and others are surrogate code points that are not valid characters on their own, such as 0xD800.
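Steps 4 and 5 could look something like this. It is only a sketch: the alphabet (the CJK block U+4E00..U+9FA5) is an arbitrary choice for illustration, the names is_usable, encode and decode are hypothetical, and whether the target API accepts every character in the alphabet is an assumption you would have to verify.

```python
FORBIDDEN = set("/\\#?%")


def is_usable(cp: int) -> bool:
    """Reject surrogates, noncharacters and the forbidden characters."""
    if 0xD800 <= cp <= 0xDFFF:  # surrogate code points
        return False
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:  # noncharacters
        return False
    return chr(cp) not in FORBIDDEN


# Assumed alphabet: one contiguous block of assigned characters.
# The filter rejects nothing here, but matters if you widen the range.
ALPHABET = [chr(cp) for cp in range(0x4E00, 0x9FA6) if is_usable(cp)]
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode(data: bytes) -> str:
    """Treat the bytes as one big integer and write it in base len(ALPHABET)."""
    n = int.from_bytes(b"\x01" + data, "big")  # sentinel keeps leading zero bytes
    base = len(ALPHABET)
    digits = []
    while n:
        n, d = divmod(n, base)
        digits.append(ALPHABET[d])
    return "".join(reversed(digits))


def decode(text: str) -> bytes:
    base = len(ALPHABET)
    n = 0
    for ch in text:
        n = n * base + INDEX[ch]
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return raw[1:]  # drop the sentinel byte


encoded = encode(b"PartitionKey")
decoded = decode(encoded)
```

With roughly 14 bits per character, the result has fewer characters than the input has bytes, although each character takes more than one byte in UTF-8 or UTF-16, so the saving depends on how the API counts and stores the name.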

Related

Does Postgres store bytea data as hex on the server?

To work with bytea values in PostgreSQL, I usually serialize to and deserialize from hex. This seems to be the preferred way. However, what is actually stored on the PostgreSQL server? Is it hex, or the unhexed binary? The reason I care is that hex is obviously going to take up double the space of unhexed binary. When I say unhexed binary, I mean that the hex string "00", which is 2 bytes, is just the single byte 0x00, which is 1 byte, as unhexed binary.
The context is I have a Postgres database and a Scylla database that are storing the exact same data in almost the exact same format. However, the total space used by Postgres is almost exactly double the space used by Scylla. For Scylla, I don't encode binary as hex. I just send raw binary over the wire. I don't expect these two databases to use the exact same amount of space. But for PostgreSQL to use double the space is quite a lot of overhead, and the nearly exact doubling really makes me suspect data is being stored as hex and not actual binary on the server (since hex uses exactly double the space as actual binary).
A bytea is stored in binary form, not hex encoded; storing it as hex would be enormously wasteful. The hex representation is just the default text representation generated by the type output function.
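To make the size difference concrete, here is a small illustration (the helper names to_pg_hex and from_pg_hex are invented for this sketch) of the hex text form, which is `\x` followed by two hex digits per byte, next to the raw bytes it represents:

```python
def to_pg_hex(raw: bytes) -> str:
    """Text form of a bytea value in hex output format: \\x then hex digits."""
    return "\\x" + raw.hex()


def from_pg_hex(text: str) -> bytes:
    """Parse the hex text form back into the raw bytes."""
    if not text.startswith("\\x"):
        raise ValueError("not in bytea hex format")
    return bytes.fromhex(text[2:])


raw = bytes(range(256))
text = to_pg_hex(raw)
```

The text form is what travels between client and server in text mode; it says nothing about the on-disk size.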
I don't know Scylla, so I cannot explain the difference, but PostgreSQL has substantial overhead per row (23 bytes), and there is also some overhead per 8kB block.
You say in your comments that you measured the database size, which includes all the metadata and indexes. I suggest that you use pg_table_size to measure the table.
Note that PostgreSQL automatically compresses values if a table row would otherwise exceed 2000 bytes.

Encoding scheme for large data sets

Assume there is a secure transport layer that securely transfers files. What if we want to transfer multiple files over this channel in one round? They would have to be encoded into a single file, so that when Bob receives that file he is able to decode it and see multiple files (e.g. photo albums). I think ASN.1 is good for small data sets (e.g. text certificates), but for large data sets such an encoding scheme can increase the ciphertext size.
My question is: what encoding rules do you recommend for large data sets? They must be secure (audited well enough to be exploit free) and efficient (no significant size increase at large scale).
ASN.1 is actually an efficient encoding. If you look at DER encoding of a long sequence of bytes (an OCTET STRING), you will see something like this:
04 [len] [the data bytes]
where 04 is a single byte value, and len is the length of the data expressed in a reasonably compact format. The data bytes then follow, as is. Overhead for a long stream of length n bytes is 2+ceil(log2(n)/8) bytes; in other words, for a stream of up to, say, 256 terabytes, the overhead will be at most eight bytes.
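A minimal sketch of that header, to make the overhead visible (der_length and octet_string_header are names chosen for this example, not from any particular library):

```python
def der_length(n: int) -> bytes:
    """DER definite-length encoding of a content length n."""
    if n < 0x80:
        return bytes([n])  # short form: a single byte
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body  # long form: count byte, then length


def octet_string_header(n: int) -> bytes:
    """Tag 0x04 (OCTET STRING) followed by the length of the n data bytes."""
    return b"\x04" + der_length(n)


small = octet_string_header(5)
medium = octet_string_header(200)
huge = octet_string_header(2**48 - 1)  # just under 256 terabytes
```

The header for the huge case is eight bytes: one tag byte, one count byte and six length bytes.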
One quirk of DER is that you have to know the total length of the data before beginning to encode it. However, the BER encoding (of which DER is just a subset) can come to the rescue: a value can be split into successive chunks, without needing a prior notion of the number of chunks. Overhead per chunk will be minimal (say 4 bytes for chunks up to 64 kilobytes), allowing for a total overhead of less than 0.01%. You will have trouble doing better.
If you want to split that big stream into several internal "files" then you will need some convention, and, there again, ASN.1 is efficient:
BigStream ::= SEQUENCE OF SEQUENCE {
    fileName UTF8String,
    fileData OCTET STRING
}
One general header for the SEQUENCE OF (less than 8 bytes); then, for each file, the overhead beyond the data itself and the name (a UTF-8 string) will be less than 20 bytes (usually substantially less) if you use DER. For huge files such that the length is not easily computable in advance, BER encoding can be used, with an overhead which can be lowered to 0.01%, as explained above.
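Here is a small sketch of a DER encoder for that BigStream structure, limited to the three types involved. The function names are made up for this example, and a real implementation would stream file contents rather than hold them in memory.

```python
from typing import Iterable, Tuple

TAG_SEQUENCE = 0x30      # SEQUENCE and SEQUENCE OF share this tag
TAG_UTF8_STRING = 0x0C
TAG_OCTET_STRING = 0x04


def der_length(n: int) -> bytes:
    """DER definite-length encoding of a content length n."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def tlv(tag: int, content: bytes) -> bytes:
    """Tag, length, value."""
    return bytes([tag]) + der_length(len(content)) + content


def encode_big_stream(files: Iterable[Tuple[str, bytes]]) -> bytes:
    """Encode (fileName, fileData) pairs as SEQUENCE OF SEQUENCE {...}."""
    entries = b"".join(
        tlv(TAG_SEQUENCE,
            tlv(TAG_UTF8_STRING, name.encode("utf-8")) + tlv(TAG_OCTET_STRING, data))
        for name, data in files
    )
    return tlv(TAG_SEQUENCE, entries)


encoded = encode_big_stream([("a.txt", b"hello")])
overhead = len(encoded) - len("a.txt") - len(b"hello")
```

For this one tiny file the total overhead is 8 bytes: two for the outer SEQUENCE OF, two for the inner SEQUENCE, and two each for the name and the data.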
If you are short on size, and the data is "normal data" (with some structure), consider applying compression. Most programming frameworks can use zlib or already offer a zlib-compatible implementation (e.g. Java and C#/.NET both have it).
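For instance, with Python's standard zlib module (the sample data below is invented, chosen only because it is repetitive):

```python
import zlib

# Structured, repetitive data compresses well; random or encrypted data does not.
data = b"<row><name>photo</name><size>1024</size></row>" * 200

compressed = zlib.compress(data, 9)  # 9 = best compression, slowest
restored = zlib.decompress(compressed)
ratio = len(compressed) / len(data)
```

Compression has to happen before encryption, since good ciphertext looks random and will not shrink.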
As for audit, this is not a matter of encoding but of implementation. ASN.1/BER encoding can be implemented in a few lines of code, especially if you are interested only in OCTET STRING, UTF8String and SEQUENCE. We are talking about less than 1000 lines of commented Java or C# here; if you cannot audit that then you cannot audit anything.

PostgreSQL protocol data representation format specification?

I am reading the PostgreSQL protocol documentation. It specifies the message flow and message formats, but doesn't mention how the actual data fields are encoded in text/binary.
For the text format, there's no mention at all. What does this mean? Should I just use SQL value expressions? Or is there some extra documentation for this? If it's just an SQL value expression, does this mean the server will parse it again?
And which part of the source code should I investigate to see how binary data is encoded?
Update
I read the manual again, and I found a mention of the text format. So there actually is a mention of the text representation, and it was my fault for missing this paragraph:
The text representation of values is whatever strings are produced and
accepted by the input/output conversion functions for the particular
data type.
There are two possible data formats: text or binary. The default is the text format, which means there is only a server <-> client encoding transformation (or nothing when client and server use the same encoding). The text format is very simple: all result data is transformed to human-readable text and sent to the client. Binary data like bytea is transformed to human-readable text too, using the hex or escape format. Output is simple, so there is not much to describe in the docs.
postgres=# select current_date;
    date
────────────
 2013-10-27
(1 row)
In this case the server sends the string "2013-10-27" to the client. The first four bytes are the length; the remaining bytes are the data.
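As an illustration of that layout for a single column value (a sketch only: encode_field and decode_field are invented names, the client encoding is assumed to be UTF8, and the rest of the row message is left out):

```python
import struct
from typing import Optional, Tuple


def encode_field(value: Optional[str]) -> bytes:
    """One column value: 4-byte big-endian length, then the data bytes."""
    if value is None:
        return struct.pack("!i", -1)  # length -1 marks NULL, no data follows
    data = value.encode("utf-8")  # assumes client_encoding is UTF8
    return struct.pack("!i", len(data)) + data


def decode_field(buf: bytes, offset: int = 0) -> Tuple[Optional[str], int]:
    """Return the decoded value and the offset just past it."""
    (length,) = struct.unpack_from("!i", buf, offset)
    offset += 4
    if length == -1:
        return None, offset
    end = offset + length
    return buf[offset:end].decode("utf-8"), end


wire = encode_field("2013-10-27")
value, next_offset = decode_field(wire)
```

Note there is no trailing null byte after the data; the length prefix is the only delimiter.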
Input is a little more difficult, because you can separate the data from the query, depending on which API you use. If you use the simplest API, Postgres expects the SQL statement and the data together. More complex APIs expect the SQL statement and the data separately.
On the other hand, using the binary format is significantly more difficult, because the format is specific to each data type. Every PostgreSQL data type has two functions, send and recv. These functions are used for sending data to the output message stream and reading data from the input message stream. Similar functions exist for casting to/from plain text (the out/in functions). Some client drivers are able to convert from the PostgreSQL binary format to host binary formats.
Some information:
libpq API http://www.postgresql.org/docs/9.3/static/libpq.html
You can look in the PostgreSQL source for the send/recv and out/in functions; look at the bytea or date implementation, e.g. src/backend/utils/adt/date.c. The implementation of libpq in src/interfaces/libpq is interesting too.
-
The things closest to a spec of a PostgreSQL binary format I could find were the documentation and the source code of the "libpqtypes" library. I know, a terrible state of the documentation for such a huge product.
The text representation of values is whatever strings are produced and
accepted by the input/output conversion functions for the particular
data type. In the transmitted representation, there is no trailing
null character; the frontend must add one to received values if it
wants to process them as C strings. (The text format does not allow
embedded nulls, by the way.)
Binary representations for integers use network byte order (most
significant byte first). For other data types consult the
documentation or source code to learn about the binary representation.
Keep in mind that binary representations for complex data types might
change across server versions; the text format is usually the more
portable choice.
(quoted from the documentation, link)
So the binary representation is not guaranteed to be stable across versions, and you should probably treat it as an implementation detail and avoid using it. The text representation is AFAICT just the format of literals in SQL queries.
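The one binary case the quoted documentation does spell out is integers in network byte order. A minimal sketch, assuming 4-byte and 8-byte signed integers (the helper names are hypothetical):

```python
import struct


def pack_int4(value: int) -> bytes:
    """4-byte signed integer, most significant byte first."""
    return struct.pack("!i", value)


def unpack_int4(data: bytes) -> int:
    return struct.unpack("!i", data)[0]


def pack_int8(value: int) -> bytes:
    """8-byte signed integer, most significant byte first."""
    return struct.pack("!q", value)


one = pack_int4(1)
minus_two = pack_int4(-2)
```

Anything beyond integers (dates, numerics, arrays) needs the per-type send/recv code or a library that already knows those formats.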

AS400 jdbc character conversion

I'm using JDBC (jt400) to insert data into an AS/400 table.
The DB table code page is 424 (host code page 424).
The EBCDIC 424 code page does not support many of the characters that may come from the client,
for example the sign → (ASCII 26, hex 1A).
The result is an incorrect translation.
Is there any built-in way in the toolbox to remove any of the unsupported characters?
You could try to create a logical file over your CCSID 424 physical file with a different code page. On the AS/400 it is possible to create logical files with different code pages for individual columns by adding the keyword CCSID(<num>). You can even set it to a Unicode charset, e.g. CCSID(1200) for UTF-16. Of course your physical file will still only be able to store chars that are in the 424 code page, and any others will be replaced by a substitution character, but the translation might be better that way.
There is no way to directly store chars that are not in code page 424 in a column with that code page (the only way I can think of is encoding them somehow with multiple chars, but that is most likely not what you want to do, since it will bring more problems than it "solves").
If you have control over that system, and it is possible to make some bigger changes, you could do it the other way around: create a new Unicode version of that physical file with a different name (I'd propose CCSID(1200); that's as close as you get to UTF-16 on the AS/400 afaik, and UTF-8 is not supported by all parts of the system in my experience. IBM does recommend 1200 for Unicode). Then transfer all data from your old file to the new one, delete the old one (back it up before that!), and then create a logical file over the new physical file, with the name of the old physical file. In that logical file change all CCSID-bearing columns from 1200 to 424. That way, existing programs can still work on the data. Of course there will be invalid chars in the logical file now, once you insert data that is not in a subset of CCSID 424; so you will most likely have to take a look at all programs that use the new logical file.
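Whichever route you take, it can help to know which incoming characters fall outside the code page before inserting. This is not a toolbox feature, just a client-side sketch; it assumes Python's built-in cp424 codec is close enough to the host's CCSID 424 mapping, which you would need to confirm, and the function names are invented.

```python
from typing import List


def is_encodable(ch: str, codec: str = "cp424") -> bool:
    """True if the character has a mapping in the given code page."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def unsupported_chars(text: str, codec: str = "cp424") -> List[str]:
    """List the characters that the code page cannot represent."""
    return [ch for ch in text if not is_encodable(ch, codec)]


def strip_unsupported(text: str, codec: str = "cp424") -> str:
    """Drop every character that the code page cannot represent."""
    return "".join(ch for ch in text if is_encodable(ch, codec))


sample = "a\u2192b"  # 'a', RIGHTWARDS ARROW, 'b'
bad = unsupported_chars(sample)
cleaned = strip_unsupported(sample)
```

Silently dropping characters changes the data, so logging what was removed is probably wise.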

What is the difference between serializing and encoding?

What is the difference between serializing and encoding?
When should I use each in a web service?
Serializing is about moving structured data over a storage/transmission medium in a way that the structure can be maintained. Encoding is broader: it is about how said data is converted to different forms. Perhaps you could think of serializing as a subset of encoding in this example.
With regard to a web service, you will probably be considering serializing/deserializing certain data for making/receiving requests/responses - effectively transporting "messages". Encoding is at a lower level, like how your messaging/web service type/serialization mechanism works under the hood.
"Serialization" is the process of converting data (which may include arrays, objects and similar structures) into a single string so it can be stored or transmitted easily. For instance, a single database field can't store an array because database engines don't understand that concept. To be stored in a database, an array must either be broken down into individual elements and stored in multiple fields (which only works if none of the elements are themselves arrays or objects) or it must be serialized so it can be stored as a single string in a single field. Later, the string can be retrieved from the database and unserialized, yielding a copy of the original array.
"Encoding" is a very general term, referring to any format in which data can be represented, or the process of converting data into a particular representation.
For example, in most programming languages, a sequence of data elements can be encoded as an array. (Usually we call arrays "data structures" rather than "encodings", but technically all data structures are also encodings.) Likewise, the word "encoding" itself can be encoded as a sequence of English letters. Those letters themselves may be encoded as written symbols, or as bit patterns in a text encoding such as ASCII or UTF8.
The process of encoding converts a given piece of data into another representation (another encoding, if you want to be technical about it.)
As you may have already realized, serialization is an example of an encoding process; it transforms a complex data structure into a single sequence of text characters.
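A small example of that round trip, using JSON as the serialization format (one choice among many; the data is made up):

```python
import json

# A nested structure that a single text field could not hold directly.
original = [1, "two", [3, 4], {"five": 5}]

serialized = json.dumps(original)  # one flat string of text characters
restored = json.loads(serialized)  # an equal copy, not the same object
```

The restored value compares equal to the original but is a separate object, which matches the "copy of the original array" described above.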
That sequence of characters can then be further encoded into some sort of binary representation for more compact storage or quicker transmission. I don't know what form of binary encoding is used in BSON, but presumably it is more efficient than the bit patterns used in text encodings.
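To illustrate the size difference between a text encoding and a fixed-width binary one (this is not BSON, just plain struct packing with invented sample numbers):

```python
import json
import struct

numbers = [1000000, 2000000, 3000000]

# Text route: serialize to JSON, then encode the characters as UTF-8 bytes.
as_text = json.dumps(numbers).encode("utf-8")

# Binary route: three 4-byte big-endian signed integers, no punctuation.
as_binary = struct.pack("!3i", *numbers)

decoded = list(struct.unpack("!3i", as_binary))
```

The binary form is smaller here, but the reader has to know the layout in advance, whereas the JSON text describes its own structure.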