I need to send and receive heterogeneous data between a MATLAB client and a server. The data includes 32-bit integers and 64-bit IEEE floats. Since TCP/IP only transfers raw bytes, I need to pack this data together into a contiguous array to be clocked out. Then, after receiving the response, I need to extract the bytes from the incoming array and form them back into MATLAB types. Does anyone have any idea how to do this?
The generic term for turning heterogeneous data into a stream of bytes or characters is serializing (and the reverse, deserializing).
Two widely-used formats for serializing data into text characters are XML and JSON.
If you search the Mathworks site for any of those terms, or search this site for any of those terms together with [matlab] you'll find plenty of libraries and code examples.
Or since R2016b, MATLAB actually has built-in functions for serializing to / deserializing from JSON: jsonencode and jsondecode.
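For older releases, or when a compact binary wire format is required, the byte-level pack/unpack step the question describes can be sketched like this (Python's struct module is used here purely as a language-neutral illustration of the idea; in MATLAB the analogous primitive is typecast):

```python
import struct

# Pack one int32 and one float64 into a contiguous 12-byte buffer,
# in network (big-endian) byte order, ready to be sent over a socket.
buf = struct.pack(">id", 42, 3.14)
assert len(buf) == 4 + 8

# On the receiving side, reinterpret the incoming bytes as native types.
i, f = struct.unpack(">id", buf)
assert (i, f) == (42, 3.14)
```

The format string (">id") is the whole "protocol" here: both sides must agree on the field order, the types, and the byte order.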
Related
I'm sending a data buffer over the network using sockets. The only place I see byte-order conversion being used is for the sender/receiver port number. I can understand that.
What I can't understand is this: I'm sending a const char* (using send()/sendto(), depending on the transfer protocol). As far as I understand, data is transferred over the network in big-endian order, and my machine is little-endian.
So what is the trick that lets you skip the ntoh()/hton() functions when sending that const char*?
Background: The concept of big-endian vs little-endian only applies to multi-byte integers (and floating point values); for historical reasons, different CPUs may represent the same numeric binary value in memory via different byte-orderings, and so the designers of the Berkeley sockets API created ntoh*() and hton*() to translate from the CPU's native byte-ordering (whatever that may be) to a "network-standard" format (they chose big-endian) and back, so that binary numbers can be transferred from one machine type to another without being misinterpreted by the recipient's CPU.
Crucially, this splintering of representations happens only above the byte-level; i.e. the ordering of bytes within an N-byte word may be different on different CPUs, but the ordering of bits within each byte is always the same.
A character string (as pointed to by a const char *) refers to a series of individual 8-bit bytes, and the bytes' ordering is interpreted the same way by all CPU types, so there is no need to convert their format to maintain data-compatibility across CPUs. (Note that there are different formats that can be used to represent character strings, e.g. ASCII vs UTF-8 vs UTF-16 vs etc, but that is a decision made by the software rather than a characteristic of the CPU hardware the software is running on; as such, the Berkeley sockets API doesn't try to address that issue)
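The distinction can be made concrete with a short sketch (Python here, purely as an illustration of the byte-level behavior):

```python
import struct

n = 0x12345678
big = struct.pack(">I", n)     # network byte order (big-endian)
little = struct.pack("<I", n)  # little-endian, as on x86
assert big == b"\x12\x34\x56\x78"
assert little == big[::-1]     # same value, reversed byte ordering

# A char buffer is already a sequence of single bytes: every CPU
# reads b"hello" the same way, so no hton()/ntoh() call is needed.
msg = b"hello"
assert list(msg) == [0x68, 0x65, 0x6C, 0x6C, 0x6F]
```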
Why does Ethereum use RLP encoding for serializing data? Is there a specific reason for not using an already-existing format, other than RLP being really compact and space-efficient?
Simplicity of implementation: it builds on ASCII encoding.
Guaranteed absolute byte-perfect consistency. Take a key/value map data structure: we can represent it as [[k1, v1], [k2, v2], ...], but if the ordering of keys is not stable, serializing the same map in different implementations will produce different output. RLP guarantees the byte-level consistency that Ethereum transaction hashes require.
RLP is a deterministic scheme, whereas other schemes may produce different results for the same input, which is not acceptable on any blockchain. Even a small change will lead to a totally different hash and will result in data integrity problems that will render the entire blockchain useless.
From here
Recursive Length Prefix (RLP) serialization is used extensively in
Ethereum's execution clients. RLP standardizes the transfer of data
between nodes in a space-efficient format. The purpose of RLP is to
encode arbitrarily nested arrays of binary data, and RLP is the
primary encoding method used to serialize objects in Ethereum's
execution layer. The only purpose of RLP is to encode structure;
encoding specific data types (e.g. strings, floats) is left up to
higher-order protocols; but positive RLP integers must be represented
in big-endian binary form with no leading zeroes (thus making the
integer value zero equivalent to the empty byte array). Deserialized
positive integers with leading zeroes get treated as invalid. The
integer representation of string length must also be encoded this way,
as well as integers in the payload.
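The rules quoted above (single-byte prefixes for short payloads, big-endian integers with no leading zeroes, zero as the empty byte array) fit in a few lines. This is an illustrative minimal encoder in Python, not a production implementation:

```python
def encode_length(length: int, offset: int) -> bytes:
    # Payloads shorter than 56 bytes get a single-byte prefix; longer
    # ones get a length-of-the-length prefix plus the big-endian length.
    if length < 56:
        return bytes([offset + length])
    length_bytes = length.to_bytes((length.bit_length() + 7) // 8, "big")
    return bytes([offset + 55 + len(length_bytes)]) + length_bytes

def rlp_encode(item) -> bytes:
    if isinstance(item, int):
        # Positive integers: big-endian, no leading zeroes; zero -> b"".
        item = b"" if item == 0 else item.to_bytes(
            (item.bit_length() + 7) // 8, "big")
    if isinstance(item, bytes):
        if len(item) == 1 and item[0] < 0x80:
            return item                      # a single low byte encodes itself
        return encode_length(len(item), 0x80) + item
    if isinstance(item, list):
        payload = b"".join(rlp_encode(x) for x in item)
        return encode_length(len(payload), 0xC0) + payload
    raise TypeError(f"cannot RLP-encode {type(item).__name__}")

assert rlp_encode(b"dog") == b"\x83dog"
assert rlp_encode([b"cat", b"dog"]) == b"\xc8\x83cat\x83dog"
assert rlp_encode(0) == b"\x80"              # zero is the empty byte array
```

Notice there is no branching on data types beyond bytes and lists: exactly as the quote says, RLP encodes structure only, and what the bytes mean is left to higher-level protocols.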
The purpose of RLP (Recursive Length Prefix) is to encode arbitrarily nested arrays of binary data; RLP is the main encoding method used to serialize objects in Ethereum. See the RLP documentation for details.
RLP is just one option, as you mentioned; some people may be familiar with the format while others aren't. But once you choose it, it's better to stay consistent within one project to avoid confusing others.
I'm trying to decode MARC 21 binary data records, which have the following specification concerning the field that provides the length of the record.
A Computer-generated, five-character number equal to the length of the
entire record, including itself and the record terminator. The number
is right justified and unused positions contain zeros.
I am trying to use
Akka Streams' Framing.lengthField; however, I just don't know how to specify the size of that field. I imagine a character is 8 bits, or maybe 16 for a number; I'm not sure, and I wonder whether that depends on the platform or language. In short: is it possible to say what the size of that field is, given that I'm on Scala/Java?
Also, what does this mean:
"The number is right justified and unused positions contain zeros"
Does that have implications for how one reads the value, assuming it was collected properly?
If anyone knows anything about this, please share.
EDIT1
Context:
I am trying to build a stream-processing graph in which the first stage processes the output of a sys command run against a Symphony (vendor cataloging system) server: an unstructured stream of byte chunks which, taken as a whole, represents all the MARC 21 records requested (a full or partial dump).
By processing, I mean chunking that unstructured byte stream into a stream of frames, where each frame is one record.
In other words, reading the bytes for one record at a time and emitting each record individually to the next stage.
The next stage will consist of emitting that record (as bytes) to Apache Kafka.
Obviously the emission stage would be fully parallelized to speed up the process.
The Symphony server is not able to stream a dump on request, especially over the network. Hence this Akka Streams-based processing graph, which performs that work for fast ingestion/production and overall stream processing of our dumps within our fast-data infrastructure.
EDIT2
Based on @badcook's input, I wonder if ComputeFramesize could be used here. I'm not sure; I'm slightly confused by the function and by what it takes as parameters.
A little clarification would be much appreciated.
It looks like you're trying to parse MARC 21 records.
In that case I would recommend you just take a look at MARC4J and use that.
If you want to integrate it with Akka Streams, or even if you want to parse MARC records your own way, I would recommend breaking up your byte stream with Framing.delimiter, using the MARC 21 record terminator (ASCII control character 0x1D), into complete MARC records, rather than trying to stream and work with fragments of MARC records. It'll be a lot easier.
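Outside Akka, the same delimiter-based framing idea looks like this (Python used purely as an illustration; split_records is a hypothetical helper name):

```python
RECORD_TERMINATOR = b"\x1d"  # MARC 21 record terminator (ASCII 0x1D)

def split_records(dump: bytes):
    # Split a raw byte dump into complete MARC records,
    # keeping the terminator attached to each record.
    for chunk in dump.split(RECORD_TERMINATOR):
        if chunk:
            yield chunk + RECORD_TERMINATOR
```

An Akka Streams equivalent would use Framing.delimiter with ByteString("\u001d") as the delimiter, which also handles records that arrive split across multiple network chunks.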
As for your specific questions: the MARC 21 specification talks about its structure in characters rather than raw bytes. It specifies two character encodings into raw bytes, UTF-8 and MARC-8, both of which are variable-width encodings. Hence, no, it is not true that every character is one byte, and there is no single answer to how many bytes a character takes up.
"[R]ight justified and unused positions contain zeroes" is another way of saying that numbers are padded from the left with 0s. In this case this line comes from a larger quote staying that the numerical string must be 5 characters long. That means if you are trying to represent the number 1, you must represent it as 00001.
I am reading the PostgreSQL protocol document. The document specifies the message flow and the containment format, but doesn't mention how actual data fields are encoded in text/binary.
For the text format, there's no mention at all. What does this mean? Should I use just SQL value expressions? Or there's some extra documentation for this? If it's just SQL value expression, does this mean the server will parse them again?
And, which part of source code should I investigate to see how binary data is encoded?
Update
I read the manual again and found a mention of the text format. So there actually is a description of the text representation; it was my fault for missing this paragraph.
The text representation of values is whatever strings are produced and
accepted by the input/output conversion functions for the particular
data type.
There are two possible data formats: text or binary. The default is the text format, which means the only transformation applied is the server <-> client character-set conversion (or none at all, when client and server use the same encoding). The text format is very simple, even trivial: all result data is converted to human-readable text and sent to the client. Binary data such as bytea is converted to human-readable text too, using hex or escape encoding. The output side is simple; there is nothing more for the docs to describe.
postgres=# select current_date;
date
────────────
2013-10-27
(1 row)
In this case the server sends the string "2013-10-27" to the client. In the wire message, the first four bytes are the field length; the remaining bytes are the data.
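That length-plus-data layout of a DataRow column can be sketched as follows (encode_field is a hypothetical helper name; the length is a signed big-endian int32, with -1 marking a NULL value):

```python
import struct
from typing import Optional

def encode_field(value: Optional[bytes]) -> bytes:
    # DataRow column: int32 length in network byte order, then the raw
    # bytes of the text representation; length -1 denotes SQL NULL.
    if value is None:
        return struct.pack(">i", -1)
    return struct.pack(">i", len(value)) + value

wire = encode_field(b"2013-10-27")
assert wire == b"\x00\x00\x00\x0a" + b"2013-10-27"
```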
Input is a little more difficult, because you may be able to send data separately from queries, depending on which API you use. With the simplest API, Postgres expects the SQL statement with the data embedded in it. More complex APIs send the SQL statement and the data separately.
On the other hand, using the binary format is significantly more difficult, due to the widely differing formats specific to each data type. Every PostgreSQL data type has two functions, send and recv, used for writing data to the output message stream and reading it from the input message stream. Similar functions exist for casting to/from plain text (the out/in functions). Some client drivers are able to convert from the PostgreSQL binary format to host binary formats.
Some information:
libpq API http://www.postgresql.org/docs/9.3/static/libpq.html
you can look at the PostgreSQL source for the send/recv and out/in functions - see the bytea or date implementations in src/backend/utils/adt/date.c. The implementation of libpq is interesting too: src/interfaces/libpq
The things closest to a spec of a PostgreSQL binary format I could find were the documentation and the source code of the "libpqtypes" library. I know, a terrible state of the documentation for such a huge product.
The text representation of values is whatever strings are produced and
accepted by the input/output conversion functions for the particular
data type. In the transmitted representation, there is no trailing
null character; the frontend must add one to received values if it
wants to process them as C strings. (The text format does not allow
embedded nulls, by the way.)
Binary representations for integers use network byte order (most
significant byte first). For other data types consult the
documentation or source code to learn about the binary representation.
Keep in mind that binary representations for complex data types might
change across server versions; the text format is usually the more
portable choice.
(quoted from the documentation, link)
So the binary protocol is not stable across versions, so you probably should treat it as an implementation detail and not use the binary representation. The text representation is AFAICT just the format of literals in SQL queries.
What is the difference between serializing and encoding?
When should I use each in a web service?
Serializing is about moving structured data over a storage/transmission medium in a way that preserves the structure. Encoding is broader: it covers how data is converted into different forms in general. You could think of serializing as a subset of encoding in this context.
With regard to a web service, you will probably be considering serializing/deserializing certain data for making/receiving requests/responses - effectively transporting "messages". Encoding is at a lower level, like how your messaging/web service type/serialization mechanism works under the hood.
"Serialization" is the process of converting data (which may include arrays, objects and similar structures) into a single string so it can be stored or transmitted easily. For instance, a single database field can't store an array because database engines don't understand that concept. To be stored in a database, an array must either be broken down into individual elements and stored in multiple fields (which only works if none of the elements are themselves arrays or objects) or it must be serialized so it can be stored as a single string in a single field. Later, the string can be retrieved from the database and unserialized, yielding a copy of the original array.
"Encoding" is a very general term, referring to any format in which data can be represented, or the process of converting data into a particular representation.
For example, in most programming languages, a sequence of data elements can be encoded as an array. (Usually we call arrays "data structures" rather than "encodings", but technically all data structures are also encodings.) Likewise, the word "encoding" itself can be encoded as a sequence of English letters. Those letters themselves may be encoded as written symbols, or as bit patterns in a text encoding such as ASCII or UTF8.
The process of encoding converts a given piece of data into another representation (another encoding, if you want to be technical about it.)
As you may have already realized, serialization is an example of an encoding process; it transforms a complex data structure into a single sequence of text characters.
That sequence of characters can then be further encoded into some sort of binary representation for more compact storage or quicker transmission. I don't know what form of binary encoding is used in BSON, but presumably it is more efficient than the bit patterns used in text encodings.