Why does Ethereum use RLP encoding?

Why does Ethereum use RLP encoding for serialising data? I mean, is there a specific reason for not using already existing formats, other than RLP being really compact and space efficient?

Simplicity of implementation: it builds on ASCII encoding.
Guaranteed absolute byte-perfect consistency. This matters especially for key/value map data structures: a map can be represented as [[k1, v1], [k2, v2], ...], but if the ordering of the keys is not stable, serialising the same map in different implementations can produce different output. RLP, together with a fixed ordering of the pairs, gives the byte-level consistency that Ethereum transaction hashes require.
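For illustration, here is a minimal Java sketch (not Ethereum source code, just an assumption-labelled example) of that ordering problem: a HashMap has no guaranteed iteration order, so sorting the keys before producing the [[k1, v1], [k2, v2], ...] pair list is what makes the serialisation deterministic.

import java.util.*;

// Minimal illustration (not Ethereum code): a HashMap has no stable iteration
// order, so serialising it naively can differ between implementations.
// Sorting the keys first yields the same [[k1, v1], [k2, v2], ...] pair list
// every time, which is what a hash-consistent encoding such as RLP needs.
public class DeterministicPairs {
    static List<String[]> toSortedPairs(Map<String, String> map) {
        List<String> keys = new ArrayList<>(map.keySet());
        Collections.sort(keys);                      // fixed, implementation-independent order
        List<String[]> pairs = new ArrayList<>();
        for (String k : keys) {
            pairs.add(new String[] { k, map.get(k) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, String> m = new HashMap<>();
        m.put("to", "0xabc");
        m.put("value", "10");
        m.put("nonce", "7");
        // Always prints the pairs in the same order, regardless of HashMap internals.
        for (String[] p : toSortedPairs(m)) {
            System.out.println(p[0] + " -> " + p[1]);
        }
    }
}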

RLP is a deterministic scheme, whereas other schemes may produce different results for the same input, which is not acceptable on any blockchain. Even a small change will lead to a totally different hash and will result in data integrity problems that will render the entire blockchain useless.
From the Ethereum developer documentation on RLP:
Recursive Length Prefix (RLP) serialization is used extensively in Ethereum's execution clients. RLP standardizes the transfer of data between nodes in a space-efficient format. The purpose of RLP is to encode arbitrarily nested arrays of binary data, and RLP is the primary encoding method used to serialize objects in Ethereum's execution layer. The only purpose of RLP is to encode structure; encoding specific data types (e.g. strings, floats) is left up to higher-order protocols; but positive RLP integers must be represented in big-endian binary form with no leading zeroes (thus making the integer value zero equivalent to the empty byte array). Deserialized positive integers with leading zeroes get treated as invalid. The integer representation of string length must also be encoded this way, as well as integers in the payload.
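As a rough illustration of the rules quoted above, here is a simplified RLP encoder sketch in Java. It is not the reference implementation and skips decoding and error handling; it only shows the short/long length prefixes and the big-endian, no-leading-zeroes integer rule.

import java.io.ByteArrayOutputStream;
import java.math.BigInteger;

// A simplified sketch of the RLP rules quoted above (illustrative only):
// byte strings and lists get a length prefix, and non-negative integers are
// encoded big-endian with no leading zeroes (zero becomes the empty byte array).
public class RlpSketch {

    // Encode a byte string.
    static byte[] encodeBytes(byte[] input) {
        if (input.length == 1 && (input[0] & 0xFF) < 0x80) {
            return input;                                   // single byte < 0x80: encoded as itself
        }
        return concat(lengthPrefix(input.length, 0x80), input);
    }

    // Encode a list whose items are already RLP-encoded.
    static byte[] encodeList(byte[]... encodedItems) {
        byte[] payload = concat(encodedItems);
        return concat(lengthPrefix(payload.length, 0xC0), payload);
    }

    // Encode a non-negative integer: big-endian, no leading zero bytes.
    static byte[] encodeInt(BigInteger value) {
        if (value.signum() == 0) {
            return encodeBytes(new byte[0]);                // zero is the empty byte array
        }
        byte[] be = value.toByteArray();
        int offset = (be[0] == 0) ? 1 : 0;                  // strip sign byte added by BigInteger
        byte[] trimmed = new byte[be.length - offset];
        System.arraycopy(be, offset, trimmed, 0, trimmed.length);
        return encodeBytes(trimmed);
    }

    // Length prefix: short form for payloads up to 55 bytes, long form otherwise.
    static byte[] lengthPrefix(int length, int shortOffset) {
        if (length <= 55) {
            return new byte[] { (byte) (shortOffset + length) };
        }
        byte[] lenBytes = BigInteger.valueOf(length).toByteArray();
        int offset = (lenBytes[0] == 0) ? 1 : 0;
        byte[] trimmed = new byte[lenBytes.length - offset];
        System.arraycopy(lenBytes, offset, trimmed, 0, trimmed.length);
        return concat(new byte[] { (byte) (shortOffset + 55 + trimmed.length) }, trimmed);
    }

    static byte[] concat(byte[]... arrays) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] a : arrays) out.write(a, 0, a.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // "dog" -> 0x83 'd' 'o' 'g'; the list ["cat", "dog"] -> 0xc8 followed by both items.
        byte[] dog = encodeBytes("dog".getBytes());
        byte[] cat = encodeBytes("cat".getBytes());
        byte[] list = encodeList(cat, dog);
        System.out.printf("dog prefix: %02x, list prefix: %02x%n", dog[0] & 0xFF, list[0] & 0xFF);
        System.out.printf("integer 0 encodes to 0x%02x (the empty string)%n", encodeInt(BigInteger.ZERO)[0] & 0xFF);
    }
}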

The purpose of RLP (Recursive Length Prefix) is to encode arbitrarily nested arrays of binary data, and RLP is the main encoding method used to serialize objects in Ethereum. See the RLP documentation.
RLP is just one option, as you mentioned; some people may be familiar with this format while others aren't, but once you choose this technology it is better to stay consistent within one project so as not to confuse others.

Related

ClickHouse: Does it make sense to use LowCardinality fields on Uint8 used as Boolean?

LowCardinality fields in ClickHouse are an optimization where the values are dictionary-encoded for faster lookups and smaller storage. As per documentation:
The efficiency of using the LowCardinality data type depends on data diversity. If a dictionary contains less than 10,000 distinct values, then ClickHouse mostly shows higher efficiency of data reading and storing. If a dictionary contains more than 100,000 distinct values, then ClickHouse can perform worse in comparison with using ordinary data types.
What about UInt8 values used as a Boolean? The cardinality is 2, but with such a small footprint (8 bits), would using LowCardinality actually provide a benefit in queries?
LowCardinality mostly makes sense for the String type.
LowCardinality(UInt8) is always worse than plain UInt8.
There are very rare cases where LowCardinality makes sense for numeric types, but I would not even test it because it would be a waste of time.
A pointer into the LowCardinality dictionary takes Int8-Int32 in the .bin file, so it is cheaper in both disk space and CPU to store the numeric value itself in the .bin file.
According to https://clickhouse.com/docs/en/whats-new/changelog/#new-feature ClickHouse now supports a native Bool type. It is essentially a UInt8 restricted to the values 0 and 1, but it will also serialize to and from true/false in formats such as JSON.

Encoding scheme for large data sets

Assume there is a secure transport layer that transfers files securely. What if we want to transfer multiple files over this channel in one round? They should be encoded into a single file, so that when Bob receives that file he is able to decode it and see the multiple files (e.g. photo albums). I think ASN.1 is good for small data sets (e.g. text certificates), but for large data sets such an encoding scheme can increase the ciphertext size.
My question is: which encoding rule do you recommend for large data sets? It must be secure (well audited, so that it is free of exploits) and efficient (it must not significantly increase size at large scale).
ASN.1 is actually an efficient encoding. If you look at DER encoding of a long sequence of bytes (an OCTET STRING), you will see something like this:
04 [len] [the data bytes]
where 04 is a single byte value, and len is the length of the data expressed in a reasonably compact format. The data bytes then follow, as is. Overhead for a long stream of length n bytes is 2+ceil(log2(n)/8) bytes; in other words, for a stream of up to, say, 256 terabytes, the overhead will be at most eight bytes.
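To make that overhead figure concrete, here is a small illustrative Java sketch of the 04 [len] [data] layout, using definite-length DER only (the 0x04 tag and the short/long length forms follow the standard rules; everything else is just arithmetic):

import java.io.ByteArrayOutputStream;

// Illustrative-only sketch of the DER "04 [len] [data]" layout described above.
// The length uses the short form (one byte) for lengths up to 127, and the
// long form (0x80 | number-of-length-bytes, then the big-endian length) beyond
// that, which is where the 2 + ceil(log2(n)/8) overhead figure comes from.
public class DerOctetString {
    static byte[] encode(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0x04);                                  // tag: OCTET STRING
        int len = data.length;
        if (len < 0x80) {
            out.write(len);                               // short form: single length byte
        } else {
            int numLenBytes = 0;                          // long form: count big-endian length bytes
            for (int l = len; l > 0; l >>>= 8) numLenBytes++;
            out.write(0x80 | numLenBytes);
            for (int i = numLenBytes - 1; i >= 0; i--) {
                out.write((len >>> (8 * i)) & 0xFF);
            }
        }
        out.write(data, 0, data.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] small = encode(new byte[10]);      // 04 0A ...          : 2 bytes of overhead
        byte[] large = encode(new byte[70_000]);  // 04 83 01 11 70 ... : 5 bytes of overhead
        System.out.println("overhead (10 bytes):     " + (small.length - 10));
        System.out.println("overhead (70,000 bytes): " + (large.length - 70_000));
    }
}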
One quirk of DER is that you have to know the total length of the data before beginning to encode it. However, the BER encoding (of which DER is just a subset) can come to the rescue: a value can be split into successive chunks, without needing a prior notion of the number of chunks. Overhead per chunk will be minimal (say 4 bytes for chunks up to 64 kilobytes), allowing for a total overhead of less than 0.01%. You will have trouble doing better.
If you want to split that big stream into several internal "files" then you will need some convention, and, there again, ASN.1 is efficient:
BigStream ::= SEQUENCE OF SEQUENCE {
    fileName    UTF8String,
    fileData    OCTET STRING
}
One general header for the SEQUENCE OF (less than 8 bytes); then, for each file, the overhead beyond the data itself and the name (a UTF-8 string) will be less than 20 bytes (usually substantially less) if you use DER. For huge files such that the length is not easily computable in advance, BER encoding can be used, with an overhead which can be lowered to 0.01%, as explained above.
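As a rough sketch of what one entry of that BigStream structure looks like on the wire, assuming definite-length DER (tag 0x30 for SEQUENCE, 0x0C for UTF8String, 0x04 for OCTET STRING), and not a full ASN.1 library:

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Rough sketch of DER-encoding one entry of the BigStream schema above:
// SEQUENCE { fileName UTF8String, fileData OCTET STRING }. Just enough code
// to show the "less than 20 bytes per file" overhead claim.
public class BigStreamEntry {

    // Generic tag-length-value header around an already-encoded value.
    static byte[] tlv(int tag, byte[] value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(tag);
        int len = value.length;
        if (len < 0x80) {
            out.write(len);                               // short length form
        } else {
            int numLenBytes = 0;                          // long length form
            for (int l = len; l > 0; l >>>= 8) numLenBytes++;
            out.write(0x80 | numLenBytes);
            for (int i = numLenBytes - 1; i >= 0; i--) out.write((len >>> (8 * i)) & 0xFF);
        }
        out.write(value, 0, value.length);
        return out.toByteArray();
    }

    static byte[] encodeFile(String fileName, byte[] fileData) {
        byte[] name = tlv(0x0C, fileName.getBytes(StandardCharsets.UTF_8)); // UTF8String
        byte[] data = tlv(0x04, fileData);                                  // OCTET STRING
        ByteArrayOutputStream seq = new ByteArrayOutputStream();
        seq.write(name, 0, name.length);
        seq.write(data, 0, data.length);
        return tlv(0x30, seq.toByteArray());                                // SEQUENCE
    }

    public static void main(String[] args) {
        byte[] entry = encodeFile("photo1.jpg", new byte[1_000_000]);
        // Overhead = total size minus data size minus UTF-8 name size.
        System.out.println("per-file overhead: "
                + (entry.length - 1_000_000 - "photo1.jpg".length()) + " bytes");
    }
}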
If you are short on size and the data is "normal data" (with some structure), consider applying compression. Most programming frameworks can use zlib or already offer a zlib-compatible implementation (e.g. Java and C#/.NET both have one).
As for the audit, this is not a matter of encoding but of implementation. ASN.1/BER encoding can be implemented in a few lines of code, especially if you are interested only in OCTET STRING, UTF8String and SEQUENCE. We are talking about less than 1000 lines of commented Java or C# here; if you cannot audit that, then you cannot audit anything.

Can I save space in my Mongodb indexes by converting ASCII strings to bytes?

I have a lot of objects with a language code as a key field. Since both Java and MongoDB use UTF-8 natively, and since the language codes are ASCII, it seems that they should take 1 byte per character plus the \0 terminator. So the language code "en" should take only 3 bytes in the BSON object and in the index.
Is this correct? I am wondering whether I save anything by converting my fields to a byte array like:
byte[] lcBytes = langCode.getBytes("ISO-8859-1");
before saving them to MongoDB with the Java driver?
According to the BSON spec, it doesn't make a difference:
string ::= int32 (byte*) "\x00"
binary ::= int32 subtype (byte*)
In other words, the string is zero-terminated (hence wasting one byte), while the binary needs a one-byte subtype field.
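A quick back-of-the-envelope check of those two grammar rules (no BSON library involved, just counting bytes):

import java.nio.charset.StandardCharsets;

// Comparison based on the two grammar rules above: a string element stores an
// int32 length, the UTF-8 bytes and a trailing \x00; a binary element stores
// an int32 length, a one-byte subtype and the raw bytes. Same total size.
public class BsonValueSize {
    public static void main(String[] args) {
        byte[] utf8 = "en".getBytes(StandardCharsets.UTF_8);

        int asString = 4 + utf8.length + 1; // int32 + bytes + zero terminator
        int asBinary = 4 + 1 + utf8.length; // int32 + subtype byte + bytes

        System.out.println("\"en\" stored as string value: " + asString + " bytes");
        System.out.println("\"en\" stored as binary value: " + asBinary + " bytes");
    }
}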
Of course, a perfectly matching character set could be more efficient in that the byte array itself could be smaller (e.g. not require three bytes for a character you need very often, but only one). Then again, I hardly think it's worth the hassle, because it makes it impossible to use regex, map/reduce, js functions, etc. Maybe for very arcane charsets, but 8859-1 isn't too special.
As a sidenote, keep in mind that the index size is limited to about 1k, so you can't throw very long strings in the index (and it's not a good idea performance-wise).
If you only need to query by equality, maybe you could choose a hash instead? If you need to store very large strings (non-indexed), a compression algorithm might be a good idea.
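A minimal sketch of the hash idea, with SHA-256 and Base64 as arbitrary example choices (any fixed-size digest would do):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

// Sketch of the "store a hash if you only query by equality" idea: index a
// short, fixed-size digest of the long string instead of the string itself.
public class HashedKey {
    public static void main(String[] args) throws Exception {
        String longValue = "some very long string that you would rather not index directly";

        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        byte[] digest = sha256.digest(longValue.getBytes(StandardCharsets.UTF_8));

        // Store/index this value; query by recomputing it for the lookup key.
        String indexedValue = Base64.getEncoder().encodeToString(digest);
        System.out.println("indexed digest (" + digest.length + " bytes): " + indexedValue);
    }
}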

Compressing ASCII data to fit within a UTF-32 API?

I have an API that receives Unicode data, but I only need to store ASCII in it. I'd like to compress & obfuscate (or encrypt) the string values that will be persisted in Unicode.
My desire is to either compress this schema data, or to encrypt it from prying eyes. I don't think it's possible to do both well.
Considering that I want to restrict my source data to valid, printable ASCII, how can I "compress" that original string value into a value that is either smaller, obfuscated, or both?
Here is how I imagine this working (though you may have a better way):
1. This source code will take a given String as input.
2. The byte representation of that string will be taken (UTF-8, ASCII, you decide).
3. Some magic happens (this is the part I need your help on).
4. The resulting bytes will be converted into an int or long (no decimal points).
5. The number will be converted into a corresponding character using this utility: http://baseanythingconvert.codeplex.com/SourceControl/changeset/view/77855#1558651
(Note that the utility will be used to enforce the constraint that the "final" Unicode name must not include the following characters: '/', '\', '#', '?' or '%'.)
Background
The Microsoft Azure Table service has an API that accepts Unicode data for storage and property names. This is a schema-free database (so columns can be created ad hoc), therefore the schema is stored per row. The downside is that this schema data is stored on disk multiple times, and it is also transmitted over the wire, quite redundantly, in an XML blob.
In addition, I'm working on a utility that dynamically encrypts/decrypts Azure Table Data, but the schema is unencrypted. I'd like to mask or obfuscate this header information somehow.
These are just some ideas.
Isn't step 3 actually straightforward (just compress and/or encrypt the data into different bytes)? For 7-bit ASCII, you can also, before compressing and/or encrypting, store the data by packing the bits so they fit into fewer bytes.
If you can use UTF-32, UTF-8, and so on in step 5, you have access to all the characters in the Unicode Standard, up to 0x10FFFD, with some exceptions; for example, some code points are noncharacters in the Unicode Standard, such as 0xFFFF, and others are invalid characters, such as 0xD800.
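Here is a minimal sketch of the bit-packing idea for 7-bit ASCII mentioned above (illustrative only; it is not the codeplex utility referenced in the question, and the sample schema name is made up):

import java.nio.charset.StandardCharsets;

// Eight 7-bit ASCII characters fit into seven bytes once the always-zero top
// bit is dropped. This only does the packing step; compression/encryption and
// the base-anything conversion would come afterwards.
public class SevenBitPacker {
    static byte[] pack(String ascii) {
        byte[] in = ascii.getBytes(StandardCharsets.US_ASCII);
        int outLen = (in.length * 7 + 7) / 8;           // ceil(7 * n / 8)
        byte[] out = new byte[outLen];
        int bitPos = 0;                                 // next free bit in the output
        for (byte b : in) {
            for (int i = 6; i >= 0; i--) {              // 7 significant bits, high to low
                if (((b >> i) & 1) == 1) {
                    out[bitPos / 8] |= (byte) (1 << (7 - (bitPos % 8)));
                }
                bitPos++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String schemaName = "CustomerLastName";         // hypothetical property name, 16 chars
        byte[] packed = pack(schemaName);
        System.out.println(schemaName.length() + " ASCII chars -> " + packed.length + " packed bytes");
    }
}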

What is the difference between serializing and encoding?

What is the difference between serializing and encoding?
When should I use each in a web service?
Serializing is about moving structured data over a storage/transmission medium in a way that the structure can be maintained. Encoding is broader, covering how said data is converted to different forms, etc. Perhaps you could think of serializing as a subset of encoding in this example.
With regard to a web service, you will probably be considering serializing/deserializing certain data for making/receiving requests/responses - effectively transporting "messages". Encoding is at a lower level, like how your messaging/web service type/serialization mechanism works under the hood.
"Serialization" is the process of converting data (which may include arrays, objects and similar structures) into a single string so it can be stored or transmitted easily. For instance, a single database field can't store an array because database engines don't understand that concept. To be stored in a database, an array must either be broken down into individual elements and stored in multiple fields (which only works if none of the elements are themselves arrays or objects) or it must be serialized so it can be stored as a single string in a single field. Later, the string can be retrieved from the database and unserialized, yielding a copy of the original array.
"Encoding" is a very general term, referring to any format in which data can be represented, or the process of converting data into a particular representation.
For example, in most programming languages, a sequence of data elements can be encoded as an array. (Usually we call arrays "data structures" rather than "encodings", but technically all data structures are also encodings.) Likewise, the word "encoding" itself can be encoded as a sequence of English letters. Those letters themselves may be encoded as written symbols, or as bit patterns in a text encoding such as ASCII or UTF8.
The process of encoding converts a given piece of data into another representation (another encoding, if you want to be technical about it.)
As you may have already realized, serialization is an example of an encoding process; it transforms a complex data structure into a single sequence of text characters.
That sequence of characters can then be further encoded into some sort of binary representation for more compact storage or quicker transmission. I don't know what form of binary encoding is used in BSON, but presumably it is more efficient than the bit patterns used in text encodings.
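As a small illustration of the distinction, here is a Java sketch using a hand-rolled (hypothetical) key=value text format rather than any particular library: the serialization step turns a map into one string, and the encoding steps turn that string into bytes and then into a different textual representation.

import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

// Serialization: structured data (a map) -> a single string.
// Encoding:      that string -> bytes (UTF-8), and bytes -> text (Base64).
public class SerializeVsEncode {

    // "Serialize": flatten the structure into one string that can be stored or sent.
    static String serialize(Map<String, String> record) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : record.entrySet()) {
            if (sb.length() > 0) sb.append(';');
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> record = new LinkedHashMap<>();
        record.put("name", "Ada");
        record.put("lang", "en");

        String serialized = serialize(record);                       // structure -> string
        byte[] utf8 = serialized.getBytes(StandardCharsets.UTF_8);   // string -> byte encoding
        String base64 = Base64.getEncoder().encodeToString(utf8);    // bytes -> text encoding

        System.out.println("serialized: " + serialized);
        System.out.println("UTF-8 bytes: " + utf8.length);
        System.out.println("Base64: " + base64);
    }
}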