How to serialize/deserialize objects sent over the network in Haskell? - sockets

I see that there are many ways to serialize/deserialize Haskell objects:
Data.Serialize -> encode, decode functions
Data.Binary http://code.haskell.org/binary/
MsgPack, JSON, BSON, etc.
In my application, I want to set up a simple TCP client-server, where the client may send serialized Haskell record objects. How does one decide between these serialization alternatives?
Additionally, when objects serialized into strings are sent over the network using Network.Socket, strings are returned. Is there a slightly higher-level library that works at the level of whole TCP messages? In other words, is there a way to avoid writing parsing code on the receive end that:
collects the results of a sequence of recv() calls,
detects that a whole object has been received, and
then parses it into a Haskell type?
In my application, the objects are not expected to be too large (about 1 MB max).

As for the second part of your question, two things are required:
An incremental parser that doesn't need to have the whole document in memory to start parsing, and which can be fed the partial chunks of data arriving from the wire. Also, when the parsing succeeds, it must return any "leftover data" along with the parsed value.
A source of data with "pushback capabilities", that allows you to "unread" any leftovers so that they are available to the next parsing attempt.
The most popular library providing (1) is attoparsec. As for (2), all three of the main streaming libraries (conduit, io-streams, and pipes) offer some kind of pushback functionality (the latter using the auxiliary pipes-parse package). All three libraries can integrate with attoparsec parsers as well (see here, here and here).
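To make that concrete, here is a minimal hand-rolled sketch (no streaming library) that keeps feeding attoparsec until a full value arrives; the message parser passed in and the 4096-byte chunk size are just placeholders:
import qualified Data.Attoparsec.ByteString as A
import qualified Data.ByteString as BS
import Network.Socket (Socket)
import Network.Socket.ByteString (recv)
-- Run an attoparsec parser against data arriving on a socket, returning the
-- parsed value together with any leftover bytes to feed into the next call.
recvMessage :: A.Parser msg -> Socket -> BS.ByteString -> IO (msg, BS.ByteString)
recvMessage parser sock leftover = go (A.parse parser leftover)
  where
    go (A.Done rest msg) = pure (msg, rest)          -- "rest" is the pushback
    go (A.Partial cont)  = do
      chunk <- recv sock 4096                        -- an empty chunk means the peer closed,
      go (cont chunk)                                -- which attoparsec treats as end of input
    go (A.Fail _ _ err)  = ioError (userError ("parse error: " ++ err))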
(Another option, of course, is to prepend each message with its length and read only that exact number of bytes.)
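For the length-prefix approach, the framing itself is only a few lines. Here is a sketch that assumes a 4-byte big-endian length header (Data.Binary's Word32 encoding) ahead of each message:
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import Data.Binary (encode, decode)
import Data.Word (Word32)
import Network.Socket (Socket)
import Network.Socket.ByteString (recv, sendAll)
-- Send a payload prefixed with its length as a 4-byte big-endian Word32.
sendFramed :: Socket -> BS.ByteString -> IO ()
sendFramed sock payload = do
  let header = BL.toStrict (encode (fromIntegral (BS.length payload) :: Word32))
  sendAll sock (header <> payload)
-- Loop over recv until exactly n bytes have been collected.
recvExactly :: Socket -> Int -> IO BS.ByteString
recvExactly sock n = go n []
  where
    go 0 acc = pure (BS.concat (reverse acc))
    go k acc = do
      chunk <- recv sock k
      if BS.null chunk
        then ioError (userError "connection closed mid-message")
        else go (k - BS.length chunk) (chunk : acc)
-- Read the header, then exactly as many bytes as it promises.
recvFramed :: Socket -> IO BS.ByteString
recvFramed sock = do
  header <- recvExactly sock 4
  let len = decode (BL.fromStrict header) :: Word32
  recvExactly sock (fromIntegral len)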

To answer the first part of your question (about data serialization), I would say that everything you listed sounds fine. Since you are dealing with pretty big (1MB) serializations, I think that the most important thing is laziness. Note that Data.Serialize comes from the cereal package and produces strict serializations; you probably don't want that here, because you'd have to build the entire encoded value in memory before sending it out, whereas Data.Binary gives you a lazy ByteString. I'll also give a shout-out to aeson (http://hackage.haskell.org/package/aeson-0.8.0.2/docs/Data-Aeson.html), which you can use with GHC Generics to get something as simple as this:
{-# LANGUAGE DeriveGeneric #-}
import GHC.Generics (Generic)
import Data.Aeson (FromJSON, ToJSON)
data Shape = Rect Int Int | Circle Double | Other String Int
  deriving (Generic)
instance FromJSON Shape -- uses a default
instance ToJSON Shape -- uses a default
And then, bam!, you've got access to the encode and decode functions. I don't know about a higher-level TCP library. Hopefully someone else will have more insight on that.
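For example, assuming the Shape type and instances above are in scope (and OverloadedStrings for the ByteString literal), a round trip looks roughly like this; the exact JSON shape depends on aeson's default generic options, which normally produce a {"tag":...,"contents":...} object:
{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson (encode, decode)
import qualified Data.ByteString.Lazy.Char8 as BL
demo :: IO ()
demo = do
  BL.putStrLn (encode (Circle 1.5))   -- typically {"tag":"Circle","contents":1.5}
  case (decode "{\"tag\":\"Rect\",\"contents\":[3,4]}" :: Maybe Shape) of
    Just _  -> putStrLn "decoded a Shape"
    Nothing -> putStrLn "decode failed"
encode produces a lazy ByteString, which you can write straight to a socket with Network.Socket.ByteString.Lazy.sendAll.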

Related

Making "parse" function RESTful

I have a RESTful service for, let's say, devices. It provides the usual functionality:
GET /devices
GET /devices/:id
POST /devices
PUT /devices/:id
DELETE /devices/:id
The device object might be defined as follows:
{
id: 123,
name: "Smoke detector",
firmware: "21.0.103",
battery: "ok",
last_maintenance: "2017-07-07",
last_alarm: "2014-02-01 12:11:10",
// ...
}
There is an application that might read device state via some device-specific reader. The application itself has no idea how to interpret the data it reads, but it can ask the server to do it. In our case, let's assume that the data contains the following: battery status, firmware version, last alarm.
If I were implementing a regular RPC service, I would create a function with "parse" semantics. It would accept the raw data and return an updated device object (or, alternatively, only the part of the device object containing the parsed state). But I doubt that I could find a good REST solution for such a function. Right now I am doing it via PATCH, but I personally do not like this solution, and therefore I will not present it here. I believe there should be a good solution for this class of problems.
So the question: how should I fit my "parse" logic into the REST paradigm?
POST it to a /parsed-device-state URL, which will return a 201 Created, a Location header pointing to the place where you can get the parsed data from, and, if you like, the parsed data in the 201 response as well (along with an additional Content-Location header with the same value as the Location header). Or, if parsing takes a long time, use 202 Accepted and the same Location header. The caller can then poll the provided location until the results are ready.
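For concreteness, the exchange might look something like this (the URL, the state identifier, and the response body are placeholders built from the device object above):
POST /parsed-device-state HTTP/1.1
Content-Type: application/octet-stream
<raw bytes read from the device>
HTTP/1.1 201 Created
Location: /devices/123/parsed-state/1
Content-Location: /devices/123/parsed-state/1
Content-Type: application/json
{ "battery": "ok", "firmware": "21.0.103", "last_alarm": "2014-02-01 12:11:10" }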
So the question: how should I fit my "parse" logic into the REST paradigm?
How would you fit your parse logic into a web site?
You'd probably start with a bookmark. GET $BOOKMARK would return a representation of a form. The form might include an input control like a textarea element that would allow the consumer to input a representation, or an input control that allows the consumer to link to a file. The consumer would submit the form, and the agent would create a request from the information in the form. That would probably be a POST (you aren't likely to include an arbitrary file's representation in the query string) to whatever resource was specified as the action of the form. The server's response would provide a representation of the result.
If parsing were a particularly slow process, then the response instead might be a representation including links to resources that could be used to track the progress of the parsing. The whole protocol in this case looks a lot like putting work on a queue, and then polling for updates.
It's the right answer to a problem that is not a great fit for HTTP:
The REST interface is designed to be efficient for large-grain hypermedia data transfer, optimizing for the common case of the Web, but resulting in an interface that is not optimal for other forms of architectural interaction.
To some degree, what you are trying to do with your function is transfer compute, which may be why it feels like you are trimming corners off of the peg to fit it in the hole.
An alternative approach, which is a better fit for HTTP, is to think about transferring a representation of the behavior. The API client gets a function that understands how to parse apples into oranges, and then runs that code on the information that it keeps locally. Think JavaScript: we get a representation of the behavior from the server (which can embed into that representation any information the server has that the client will need), and then execute the result locally. Metadata in the headers describes the lifetime of the representation, in a way that is understood by any standards-compliant cache.

How does a fuzzer deal with invalid inputs?

Suppose that I have a program that takes a pointer as its input. Without prior knowledge about the structure of the pointee, how does a fuzzer create valid inputs that can actually hit the internals of the program? To make this more concrete, imagine an artificial C program:
#include <stdio.h>
int myprogram(unknown_pointer *input) {  /* the pointee's structure is unknown to the fuzzer */
    printf("%s", input->name);
    return 0;
}
In some situations, the tested program first checks the input format. If the input format is not good, it raises an exception. In such situations, how can a fuzzer reach program points beyond that exception-raising statement?
Most fuzzers don't know anything about the internal structure of the program. Different fuzzers deal with this in various ways:
Not deal with it at all. Just throw random inputs and hope to produce an input that will pass some/all checks (for example, radamsa).
Mutate a valid input: take a known valid input and mutate it (flip bits, remove parts, add parts, etc.); in many cases the result will be valid enough to pass some or all of the checks. For example, if you want to fuzz VLC, you take a valid movie file as the input for the fuzzer, which provides mutations of it to VLC. These are often called mutation-based fuzzers (for example, zzuf); see the sketch after this list.
If you have prior knowledge of the input's structure, build a model of the input, and then mutate specific fields within it. A big advantage of this method is the ability to deal with very specific types of fields: checksums, hashes, sizes, etc. These are often called generation-based fuzzers (for example, SPIKE, Sulley and their successors, Peach).
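Here is the sketch promised above: a minimal illustration of the bit-flipping mutation from the second option (Haskell here, but the idea is language-agnostic). It assumes a non-empty known-good sample; the flip count and byte-wise reassembly are arbitrary choices:
import qualified Data.ByteString as BS
import Data.Bits (complementBit)
import System.Random (randomRIO)
-- Flip nFlips random bits in a known-good sample to produce a mutated input.
mutate :: Int -> BS.ByteString -> IO BS.ByteString
mutate nFlips sample = go nFlips sample
  where
    go 0 bs = pure bs
    go n bs = do
      i   <- randomRIO (0, BS.length bs - 1)   -- pick a byte
      bit <- randomRIO (0, 7)                  -- pick a bit within it
      let flipped = BS.index bs i `complementBit` bit
      go (n - 1) (BS.take i bs <> BS.cons flipped (BS.drop (i + 1) bs))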
However, in recent years a new kind of fuzzer has evolved: feedback-based fuzzers. These fuzzers perform mutations on a valid (or invalid) input and, based on feedback they receive from the fuzzed program, decide how and what to mutate next. The feedback is obtained by instrumenting the program's execution, either by injecting tracing code at compile time, by patching the program at runtime, or by using hardware tracing mechanisms. Foremost among them is AFL (you can read more about it here).
A fuzzer throws every sort of random combination of inputs at the attack surface. The intention is to look for any opportunity for a "golden BB" to get past the input checks and get a response that can be further explored.

Proper way to communicate with socket

Is there any design pattern or something similar for network communication using sockets?
I mean, what I always do is:
I receive a message from my client or my server
I extract the type of this message (e.g. LOGIN, LOGOUT, CHECK_TICKET, etc.)
I test this type in a switch-case statement
Then I execute the suitable method for this type
This way is a little bit boring when you have a lot of message types. Each time I add a type, I have to add it to the switch case. Plus, it takes more machine operations when you have hundreds or thousands of message types in your protocol (due to the switch case).
Thanks.
You could use a loop over a set of handler classes (i.e. one for each type of message supported). This is essentially the Composite pattern. The Component and each Composite then become independently testable. Once written, the Component need never change again, and support for a new message becomes isolated to a single new class (or perhaps a lambda or function pointer, depending on the language). You can also add/remove/reorder Composites on the Component at runtime, if that is something you want from your design (alternatively, if you want to prevent this, depending on your language you could use variadic templates). You could also look at Chain of Responsibility.
However, if you thought that adding a case to a switch is a bit laborious, I suspect that writing a new class would be too.
P.S. I don't see a good way of avoiding steps 1 and 2.
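For steps 3 and 4, here is a sketch of the same idea in its simplest form: not the Composite design described above, but a dispatch table keyed by message type, so supporting a new message means adding one entry rather than another case in a switch (Haskell here, but the shape is the same in any language):
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Map.Strict as Map
import qualified Data.ByteString.Char8 as BS
type Handler = BS.ByteString -> IO ()
-- One entry per message type; LOGIN/LOGOUT/CHECK_TICKET are from the question.
handlers :: Map.Map BS.ByteString Handler
handlers = Map.fromList
  [ ("LOGIN",        \payload -> putStrLn ("login: "  ++ BS.unpack payload))
  , ("LOGOUT",       \_       -> putStrLn "logout")
  , ("CHECK_TICKET", \payload -> putStrLn ("ticket: " ++ BS.unpack payload))
  ]
-- Steps 1 and 2 (receive the message and extract its type) still happen before this call.
dispatch :: BS.ByteString -> BS.ByteString -> IO ()
dispatch msgType payload =
  case Map.lookup msgType handlers of
    Just handler -> handler payload
    Nothing      -> putStrLn ("unknown message type: " ++ BS.unpack msgType)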

Socket Protocol Fundamentals

Recently, while reading a Socket Programming HOWTO, the following section jumped out at me:
But if you plan to reuse your socket for further transfers, you need to realize that there is no "EOT" (End of Transfer) on a socket. I repeat: if a socket send or recv returns after handling 0 bytes, the connection has been broken. If the connection has not been broken, you may wait on a recv forever, because the socket will not tell you that there's nothing more to read (for now). Now if you think about that a bit, you'll come to realize a fundamental truth of sockets: messages must either be fixed length (yuck), or be delimited (shrug), or indicate how long they are (much better), or end by shutting down the connection. The choice is entirely yours, (but some ways are righter than others).
This section highlights 4 possibilities for how a socket "protocol" may be written to pass messages. My question is, what is the preferred method to use for real applications?
Is it generally best to include message size with each message (presumably in a header), as the article more or less asserts? Are there any situations where another method would be preferable?
The common protocols either specify length in the header, or are delimited (like HTTP, for instance).
Keep in mind that this also depends on whether you use TCP or UDP sockets. Since TCP sockets are reliable you can be sure that you get everything you shoved into them. With UDP the story is different and more complex.
These are indeed our choices with TCP. HTTP, for example, uses a mix of the second, third, and fourth options (a double newline ends the request/response headers, which might contain a Content-Length header, indicate chunked encoding, or say Connection: close and give you no content length at all, expecting you to rely on reading until EOF).
I prefer the third option, i.e. self-describing messages, though fixed-length is plain easy when suitable.
If you're designing your own protocol then look at other people's work first; there might already be something similar out there that you could either use as-is or repurpose and adjust. For example, ISO 8583 for financial transactions, HTTP, and POP3 all do things differently, but in ways that are proven to work... In fact it's worth looking at these things anyway, as you'll learn a lot about how real-world protocols are put together.
If you need to write your own protocol then, IMHO, prefer length prefixed messages where possible. They're easy and efficient to parse for the receiver but possibly harder to generate if it is costly to determine the length of the data before you begin sending it.
The decision should depend on the data you want to send (what it is, how it is gathered). If the data is fixed length, then fixed-length packets will probably be best. If the data can be easily split into delimited entities (no escaping needed), then delimiting may be good. If you know the data size when you start sending the piece of data, then length-prefixing may be even better. If the data sent is always single characters, or even single bits (e.g. "on"/"off"), then anything other than fixed-size one-character messages would be too much.
Also think about how the protocol may evolve. EOL-delimited strings are good as long as they do not contain EOL characters themselves. Fixed length may be good until the data needs to be extended with some optional parts, etc.
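To illustrate the delimited option, here is a sketch of the receive side for a hypothetical newline-delimited protocol (Haskell here, but the buffering pattern is the same in any language); it assumes payloads never contain the delimiter:
import qualified Data.ByteString.Char8 as BS
import Network.Socket (Socket)
import Network.Socket.ByteString (recv)
-- Return the next newline-terminated message plus any leftover bytes already
-- read past the delimiter (feed those leftovers back in on the next call).
recvLine :: Socket -> BS.ByteString -> IO (BS.ByteString, BS.ByteString)
recvLine sock buffered =
  case BS.elemIndex '\n' buffered of
    Just i  -> pure (BS.take i buffered, BS.drop (i + 1) buffered)
    Nothing -> do
      chunk <- recv sock 4096
      if BS.null chunk
        then ioError (userError "connection closed before delimiter")
        else recvLine sock (buffered <> chunk)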
I do not know if there is a preferred option. In our real-world situation (client-server application), we use the option of sending the total message length as one of the first pieces of data. It is simple and works for both our TCP and UDP implementations. It makes the logic reasonably "simple" when reading data in both situations. With TCP, the amount of code is fairly small (by comparison). The UDP version is a bit (understatement) more complex but still relies on the size that is passed in the initial packet to know when all data has been sent.

How can I get a callback when there is some data to read on a boost.asio stream without reading it into a buffer?

It seems that since Boost 1.40.0 there has been a change to the way that the async_read_some() call works.
Previously, you could pass in null_buffers and you would get a callback when there was data to read, but without the framework reading the data into any buffer (because there wasn't one!). This basically allowed you to write code that acted like a select() call, where you would be told when your socket had some data on it.
In the new code the behaviour has been changed to work in the following way:
If the total size of all buffers in the sequence mb is 0, the asynchronous read operation shall complete immediately and pass 0 as the argument to the handler that specifies the number of bytes read.
This means that my old way of detecting data on the socket (which is, incidentally, the method shown in this official example) no longer works. The problem for me is that I need a way of detecting this, because I've layered my own streaming classes on top of the asio socket streams, and as such I cannot just read data off the sockets that my streams will expect to be there. The only workaround I can think of right now is to read a single byte, store it, and when my stream classes then request some bytes, return that byte if one is set: not pretty.
Does anyone know of a better way to implement this kind of behaviour under the latest boost.asio code?
My quick test with an official example and Boost 1.41 works... so I think it should still work (if you use null_buffers).