What is the difference between serializing and encoding? - encoding

What is the difference between serializing and encoding?
When should I use each in a web service?

Serializing is about moving structured data over a storage/transmission medium in a way that the structure can be maintained. Encoding is more broad, like about how said data is converted to different forms, etc. Perhaps you could think about serializing being a subset of encoding in this example.
With regard to a web service, you will probably be considering serializing/deserializing certain data for making/receiving requests/responses - effectively transporting "messages". Encoding is at a lower level, like how your messaging/web service type/serialization mechanism works under the hood.

"Serialization" is the process of converting data (which may include arrays, objects and similar structures) into a single string so it can be stored or transmitted easily. For instance, a single database field can't store an array because database engines don't understand that concept. To be stored in a database, an array must either be broken down into individual elements and stored in multiple fields (which only works if none of the elements are themselves arrays or objects) or it must be serialized so it can be stored as a single string in a single field. Later, the string can be retrieved from the database and unserialized, yielding a copy of the original array.
"Encoding" is a very general term, referring to any format in which data can be represented, or the process of converting data into a particular representation.
For example, in most programming languages, a sequence of data elements can be encoded as an array. (Usually we call arrays "data structures" rather than "encodings", but technically all data structures are also encodings.) Likewise, the word "encoding" itself can be encoded as a sequence of English letters. Those letters themselves may be encoded as written symbols, or as bit patterns in a text encoding such as ASCII or UTF8.
The process of encoding converts a given piece of data into another representation (another encoding, if you want to be technical about it.)
As you may have already realized, serialization is an example of an encoding process; it transforms a complex data structure into a single sequence of text characters.
That sequence of characters can then be further encoded into some sort of binary representation for more compact storage or quicker transmission. I don't know what form of binary encoding is used in BSON, but presumably it is more efficient than the bit patterns used in text encodings.

Related

Why does ethereum uses RLP encoding

Why does ethereum uses RLP encoding encoding for serialising data i mean is there are specific reason for not using already existing formats ?? other than RLP being really compact and space efficient.
simplicity of implementation, it build on ASCII Encoding.
guaranteed absolute byte-perfect consistency. Especially for Key/Value Map data structure, we can represent it as [[k1, v1], [k2, v2], ...], but the sort of key is not stable, which mean use the same value(Map) in different implement will output different result. So RLP guaranteed the ethereum tx hash need consistency.
RLP is a deterministic scheme, whereas other schemes
may produce different results for the same input, which is not acceptable on any blockchain. Even a small change will lead to a totally
different hash and will result in data integrity problems that will render
the entire blockchain useless.
From here
Recursive Length Prefix (RLP) serialization is used extensively in
Ethereum's execution clients. RLP standardizes the transfer of data
between nodes in a space-efficient format. The purpose of RLP is to
encode arbitrarily nested arrays of binary data, and RLP is the
primary encoding method used to serialize objects in Ethereum's
execution layer. The only purpose of RLP is to encode structure;
encoding specific data types (e.g. strings, floats) is left up to
higher-order protocols; but positive RLP integers must be represented
in big-endian binary form with no leading zeroes (thus making the
integer value zero equivalent to the empty byte array). Deserialized
positive integers with leading zeroes get treated as invalid. The
integer representation of string length must also be encoded this way,
as well as integers in the payload.
The purpose of RLP (Recursive Length Prefix) is to encode arbitrarily nested arrays of binary data, and RLP is the main encoding method used to serialize objects in Ethereum. See this RLP
RLP is just an option as you mentioned, someone may be familiar with this format while others don't, but once you choose this tech, it's better to be consistent in one project for not confusing others.

Heterogeneous Data Over TCP/IP

I need to send and receive heterogeneous data from a Matlab client to a server. The data includes 32-bit integers and 64-bit IEEE floats. Remember that TCP/IP only understands characters, so I need to pack this data together into a contiguous array to be clocked out. Then after receiving the response, I need to extract the byte data from the incoming character array and form it into Matlab types. Does anyone have any idea how to do this?
The generic term for turning heterogeneous data into a stream of bytes or characters is serializing (and the reverse, deserializing).
Two widely-used formats for serializing data into text characters are XML and JSON.
If you search the Mathworks site for any of those terms, or search this site for any of those terms together with [matlab] you'll find plenty of libraries and code examples.
Or since R2016b, MATLAB actually has built-in functions for serializing to / deserializing from JSON: jsonencode and jsondecode.

PostgreSQL protocol data representation format specification?

I am reading PostgreSQL protocol document. The document specifies message flow and containment format, but doesn't mention about how actual data fields are encoded in text/binary.
For the text format, there's no mention at all. What does this mean? Should I use just SQL value expressions? Or there's some extra documentation for this? If it's just SQL value expression, does this mean the server will parse them again?
And, which part of source code should I investigate to see how binary data is encoded?
Update
I read the manual again, and I found a mention about text format. So actually there is mention about text representation, and it was my fault that missing this paragraph.
The text representation of values is whatever strings are produced and
accepted by the input/output conversion functions for the particular
data type.
There are two possible data formats - text or binary. Default is a text format - that means, so there is only server <-> client encoding transformation (or nothing when client and server use same encoding). Text format is very simple - trivial - all result data is transformed to human readable text and it is send to client. Binary data like bytea are transformed to human readable text too - hex or Base64 encoding are used. Output is simple. There is nothing to describing in doc
postgres=# select current_date;
date
────────────
2013-10-27
(1 row)
In this case - server send string "2013-10-27" to client. First four bytes is length, others bytes are data.
Little bit difficult is input, because you can separate a data from queries - depends on what API you use. So if you use most simple API - then Postgres expect SQL statement with data together. Some complex API expected SQL statement and data separately.
On second hand a using of binary format is significantly difficult due wide different specific formats for any data type. Any PostgreSQL data type has a two functions - send and recv. These functions are used for sending data to output message stream and reading data from input message stream. Similar functions are for casting to/from plain text (out/in functions). Some clients drivers are able to cast from PostgreSQL binary format to host binary formats.
Some information:
libpq API http://www.postgresql.org/docs/9.3/static/libpq.html
you can look to PostgreSQL src to send/recv and out/in function - look on bytea or date implementation src/backend/utils/adt/date.c. Implementation of libpq is interesting too src/interfaces/libpq
-
The things closest to a spec of a PostgreSQL binary format I could find were the documentation and the source code of the "libpqtypes" library. I know, a terrible state of the documentation for such a huge product.
The text representation of values is whatever strings are produced and
accepted by the input/output conversion functions for the particular
data type. In the transmitted representation, there is no trailing
null character; the frontend must add one to received values if it
wants to process them as C strings. (The text format does not allow
embedded nulls, by the way.)
Binary representations for integers use network byte order (most
significant byte first). For other data types consult the
documentation or source code to learn about the binary representation.
Keep in mind that binary representations for complex data types might
change across server versions; the text format is usually the more
portable choice.
(quoted from the documentation, link)
So the binary protocol is not stable across versions, so you probably should treat it as an implementation detail and not use the binary representation. The text representation is AFAICT just the format of literals in SQL queries.

Can I save space in my Mongodb indexes by converting ASCII strings to bytes?

I have a lot of object with language code as a key field. Since both Java and Mongodb use UTF-8 natively and since the language codes are ASCII it seems to be that they should take 1 byte per character plus the \0 terminator. So the language code "en" should take only 3 bytes in the BSON object and in the index.
Is this correct? I am wondering whether I save anything by converting my fields to a byte array like:
byte[] lcBytes = langCode.getBytes("ISO-8859-1");
before saving them to Mongodb with the Java driver?
According to the bson spec, it doesn't make a difference:
string ::= int32 (byte*) "\x00"
binary ::= int32 subtype (byte*)
In other words, the string is zero-terminated (hence wasting one byte), while the binary needs a one-byte subtype field.
Of course, a perfectly matching character set could be more efficient in that the byte array itself could be smaller (e.g. not require three bytes for a character you need very often, but only one). Then again, I hardly think it's worth the hassle, because it makes it impossible to use regex, map/reduce, js functions, etc. Maybe for very arcance charsets, but 8859-1 isn't too special.
As a sidenote, keep in mind that the index size is limited to about 1k, so you can't throw very long strings in the index (and it's not a good idea performance-wise).
If you only need to query by equality, maybe you could choose a hash instead? If you need to store very large strings (non-indexed), a compression algorithm might be a good idea.

Compressing ASCII data to fit within a UTF-32 API?

I have an API that receives Unicode data, but I only need to store ASCII in it. I'd like to compress & obfuscate (or encrypt) the string values that will be persisted in Unicode.
My desire is to either compress this schema data, or to encrypt it from prying eyes. I don't think it's possible to do both well.
Considering that I want to restrict my source data to valid, printable ASCII; how can I "compress" that original string value into a value that is either smaller, obfuscated, or both?
Here is how I imagine this working (though you may have a better way):
This source code will take a given String as input
The bytes representation of that string will be taken (UTF8, ASCII, you decide)
Some magic happens - (this is the part I need your help on)
The resulting bytes will be converted into an int or long (no decimal points)
The number will be converted into a corresponding character using this utility
http://baseanythingconvert.codeplex.com/SourceControl/changeset/view/77855#1558651
(note that utility will be used to enforce the constraint is that the "final" Unicode name must not include the following characters '/', '\', '#', '?' or '%')
Background
The Microsoft Azure Table has an API that accepts Unicode data for the storage or property names. This is a schema-free database (so columns can be created ad-hoc), therefore the schema is stored per row. The downside is that this schema data is stored on disk multiple times, and it is also transmitted over the wire, quite redundantly, in an XML blob.
In addition, I'm working on a utility that dynamically encrypts/decrypts Azure Table Data, but the schema is unencrypted. I'd like to mask or obfuscate this header information somehow.
These are just some ideas.
Isn't step 3 actually straightforward (just compress and/or encrypt the data into different bytes)? For 7-bit ASCII, you can also, before compressing and/or encrypting, store the data by packing the bits so they fit into fewer bytes.
If you can use UTF-32, UTF-8, and so on in step 5, you have access to all the characters in the Unicode Standard, up to 0x10FFFD, with some exceptions; for example, some code points are noncharacters in the Unicode Standard, such as 0xFFFF, and others are invalid characters, such as 0xD800.