Serialization using protobuf-net and the protobuf C# port vs XML serialization

I'm interested in a certain situation. I have an object in C# that I would like to serialize and deserialize.
I'm kind of conducting an experiment. I'm trying to see if switching protobuf libraries has any effect on the time it takes to serialize and deserialize an object. Additionally, I'm throwing XML serialization into the mix to see if it can compete as well, even though I'm pretty sure protobuf is faster.
In terms of speed, is there a clear, definite winner between protobuf and XML, assuming everything is done consistently (same code path, parallel code, nothing fancy, etc.)? And also, would the speed be affected if I switched the library that protobuf uses to serialize and deserialize (from protobuf-net to the protobuf C# port)? I'm quite new to this so I don't know the answer yet, but what I've heard is that protobuf is supposed to be smaller, faster, and easier than XML.
Any insight is greatly appreciated! Thanks! Off to write the tests now.

Will protobuf be quicker? Absolutely. I've profiled this many, many times, all with similar results; for example:
Performance Tests of Serializations used by WCF Bindings
https://stackoverflow.com/questions/1650419/protocol-buffers-net-protobuf-net-10x-slower-that-xml-serializer-how-come/1660990#1660990 (note the title here is misleading)
http://code.google.com/p/protobuf-net/wiki/Performance
http://www.servicestack.net/benchmarks/NorthwindDatabaseRowsSerialization.1000000-times.2010-02-06.html (doesn't include XmlSerializer, but includes most others for comparison)
and many passing comments from happy users. I honestly haven't compared against Jon's version in a long while, and I haven't done a direct v2 comparison, but here's a key point: in most cases the final bandwidth is the limiting factor in network performance, and the two use the same wire format, so they should be pretty much identical there. Of course, protobuf is also demonstrably cheaper to read and write, but unless you are on a mobile device that is secondary.
The big difference between protobuf-net and the port is that the ported version (Jon's) adopts (quite reasonably) the protobuf approach (immutable/generated objects, etc.), which might make it hard to retrofit to an existing type model - you would have to introduce a separate DTO layer and map to it. Which isn't a big problem - simply a consideration. And for that reason you might find it hard to do a direct comparison between XmlSerializer and the port; they both get your data across, but the routes are very different. Conversely, protobuf-net deliberately positions itself with a very similar API to XmlSerializer etc., so it is pretty easy to do a test suite using the same objects - just changing the serializer.
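For instance, a minimal timing harness along the following lines can compare the two over the same objects. This is only a sketch, assuming the protobuf-net library is referenced; the Person type, member numbers, and iteration count are made up for illustration:

    // Same object, same loop; only the serializer changes.
    using System;
    using System.Diagnostics;
    using System.IO;
    using System.Xml.Serialization;
    using ProtoBuf;

    [ProtoContract]
    public class Person
    {
        [ProtoMember(1)] public int Id { get; set; }
        [ProtoMember(2)] public string Name { get; set; }
    }

    static class Bench
    {
        static void Main()
        {
            var person = new Person { Id = 123, Name = "Fred" };
            const int iterations = 100000;

            var xml = new XmlSerializer(typeof(Person));
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++)
            {
                using (var ms = new MemoryStream())
                {
                    xml.Serialize(ms, person);
                }
            }
            Console.WriteLine("XmlSerializer: " + sw.ElapsedMilliseconds + " ms");

            sw.Restart();
            for (int i = 0; i < iterations; i++)
            {
                using (var ms = new MemoryStream())
                {
                    Serializer.Serialize(ms, person);
                }
            }
            Console.WriteLine("protobuf-net:  " + sw.ElapsedMilliseconds + " ms");
        }
    }

Deserialization can be timed the same way by serializing once to a buffer and looping over the corresponding Deserialize calls.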

Related

Exposing protocol-buffers interfaces

I am building a distributed system which consists of modules/applications with interfaces defined by protobuf messages.
Is it a good idea to expose those protobuf messages to a client directly? ... or maybe it's better to prepare a shared library which is responsible for translating a (let's assume) method-based interface into a protobuf-based one for each module, so that clients won't be aware of protobuf at all?
It's neither a "good idea" nor a bad one. It depends on whether or not you want to impose protocol buffers onto your consumers. A large part of that decision is, then:
Who are your consumers? Do you mind exposing the protobuf specifics to them?
Will the clients be written in languages which have protobuf support?
My $0.02 is that this is a perfect use case for Protocol Buffers, since they were specifically designed with cross-system, cross-language interchange in mind. The .proto file makes for a concise, language-independent, thorough description of the data format. Of course, there are other similar/competing formats & libraries out there to consider (see: Thrift, Cap'n Proto, etc.) if you decide to head down this path.
If you are planning to define interfaces that take Google Protobuf message classes as arguments, then according to this and that section in Google's Protobuf documentation it is not a good idea to expose Protobuf messages to a client directly. In short, with every version of Protobuf the generated code is likely not to be binary compatible with older code. So don't do it!
However, if you are planning to define interfaces that take byte arrays containing serialized Protobuf messages as function/method parameters then I totally agree with Matt Ball's answer.
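To make the distinction concrete, here is a rough sketch of the two interface shapes; the service and message names (IContactService*, ContactProto) are invented for illustration, not taken from the question:

    // Hypothetical generated protobuf message type (stub so the sketch compiles).
    public sealed class ContactProto { }

    // Option 1: the protobuf type leaks into the public interface; consumers must
    // link against the generated classes and track their versioning.
    public interface IContactServiceExposed
    {
        ContactProto GetContact(int id);
    }

    // Option 2: the wire format stays an implementation detail; the interface
    // deals in opaque serialized payloads (or in plain DTOs mapped internally).
    public interface IContactServiceOpaque
    {
        byte[] GetContactBytes(int id); // serialized ContactProto inside
    }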

How to do FIX+FAST using QuickFIX?

I am considering buying this connector:
MICEX FIX/FAST Market Data Adaptor http://www.b2bits.com/trading_solutions/market-data-solutions/micex-fixfast.html
But I don't like proprietary software for various reasons and would prefer to replace this connector with QuickFIX + DIY code.
A 100 µs performance difference is not critical for me, but I do care about features.
In particular, MICEX uses FIX+FAST, and the referenced connector decodes FAST automatically: "Hides FAST functionality from user, automatically applies FAST decoding."
The question is: how do I do the same with QuickFIX? Is it a good idea at all? How easy would it be to implement the referenced connector using QuickFIX?
Have you looked at http://code.google.com/p/quickfast/? I have used it and it mostly works, but it is not the best library.
I don't believe QuickFIX supports FAST. FAST is a complex compression specification for FIX messages, and implementing FAST on top of QuickFIX or any FIX engine in a performant way can be tricky.
You want to choose a FAST engine that can generate template-decoding source code; in other words, it reads the template XML file from the exchange and spits out code to parse each template. Done this way it's automatic, easy, and crucial for speed, as the generated code avoids the recursive calls otherwise necessary to parse repeating groups.
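To give a feel for the idea (this is not from any real FAST engine; the template layout, field names, and helper are invented), the generated code for a single template tends to be a flat sequence of field reads rather than a generic, recursive template walker:

    // Minimal sketch of template-specific, non-recursive decoding.
    static class GeneratedTemplate42Decoder
    {
        // FAST-style stop-bit integer: 7 data bits per byte, high bit marks the last byte.
        static uint DecodeUInt32(byte[] buf, ref int pos)
        {
            uint value = 0;
            while (true)
            {
                byte b = buf[pos++];
                value = (value << 7) | (uint)(b & 0x7F);
                if ((b & 0x80) != 0) return value; // stop bit set: last byte of the field
            }
        }

        // Flattened decoding of a hypothetical template:
        // MsgSeqNum, then a repeating group of (Price, Size).
        public static void Decode(byte[] buf)
        {
            int pos = 0;
            uint msgSeqNum = DecodeUInt32(buf, ref pos);
            uint groupLen = DecodeUInt32(buf, ref pos);
            for (uint i = 0; i < groupLen; i++)
            {
                uint price = DecodeUInt32(buf, ref pos);
                uint size = DecodeUInt32(buf, ref pos);
                // hand the values straight to the consumer; no per-field dispatch
            }
        }
    }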
Take a look at CoralFIX, which is an intuitive FIX engine with support for FAST decoding.
Disclaimer: I am one of the developers of CoralFIX.

GWT: Type of Container

I see that there are two ways of transferring objects from the server to the client:
Use the same domain object (Contact.java) as used in the service layer. (I do not use Hibernate.)
Use a HashMap to send the domain object's field values in the form of a Map, with the help of the BeanUtilsBean class. For multiple objects, use a List of such Maps. Similarly, use a Map to submit form values from the client to the server.
Is there any performance advantage for option 1 over option 2?
Is there a way to hide the class name/package name that is sent to the browser if we use option 1?
Thanks!
You have to understand that whatever option you choose, it will need to get converted to JavaScript (+ some wrappers, etc.) - this takes more time and space/bandwidth (note: I haven't done any benchmarks, this is just a [reasonable] conclusion I came to ;)) than, say, JSON. But if you used JSON, you would have to recreate the object on the server side, so it's not a silver bullet. In the end, it all depends on how much of an issue performance is for you - for more insight, see this question.
I'd go with option 1: just leave it to the GWT team to pack your domain objects and transfer them between client and server. In the future (GWT 2.1), we'll have some really nice things, including a more lightweight transfer protocol - see this year's presentation from Google I/O on architecting GWT apps - it's something worth keeping in mind.
PS: It's always good to do benchmarks yourself in this kind of situation - your configuration, the type of objects, etc. might yield different results than expected.

How can I build a generic dataset-handling Perl library?

I want to build a generic Perl module for handling and analysing biomedical character-separated datasets, one which can, most certainly, be used on any kind of dataset containing a mixture of categorical (A, B, C, ...), continuous (1.2, 3,881, ...), and identifier (XXX1, XXX2, ...) variables. The plan is to have people initialize the module and then use some arguments to point to the data file(s), the place where the analysis reports should go, and the structure of the data.
By structure of data I mean which variable is in which place and its name/type. And this is where I need some enlightenment. I am baffled as to how to do this in a clean way. Obviously, having people create a simple schema file, be it XML or some other format, would be the cleanest, but maybe not all people enjoy doing something like this.
The solutions I can think of are:
Create a configuration file in XML or similar and with a prespecified format.
Pass the information during initialization of the module.
Use the first row of the data as headers and try to guess types (ouch)
Surely there must be a "canonical" way of doing this that is also usable and efficient.
This doesn't answer your question directly, but have you checked CPAN? It might have the module you need already. If not, it might have similar modules -- related either to biomedical data or simply to delimited data handling -- that you can mine for good ideas, both concerning formats for metadata and your module's API.
Any of the approaches you've listed could make sense. It all depends on how complex the data structures and their definitions are. What will make something like this useful to people is whether it saves them time and effort. So, your decision will have to be based on which approach best satisfies the need to make:
use of the module easy
reuse of data definitions easy
the data definition language sufficiently expressive to describe all known use cases
the data definition language sufficiently simple that an infrequent user can spend minimal time with the docs before getting real work done.
For example, if I just need to enter the names of the columns and their types (and there are only 4 well defined types), doing this each time in a script isn't too bad. Unless I have 350 columns to deal with in every file.
However, if large, complicated structure definitions are common, then a more modular, reuse-oriented approach is better.
If your data description language is difficult to work with, you can mitigate the issue a bit by providing a configuration tool that allows one to create and edit data schemas.
rx might be worth looking at, as well as the Data::Rx module on the CPAN. It provides schema checking for JSON, but there is nothing inherent in the model that makes it JSON-only.

Do you create your own code generators?

The Pragmatic Programmer advocates the use of code generators.
Do you create code generators on your projects? If yes, what do you use them for?
In "Pragmatic Programmer" Hunt and Thomas distinguish between Passive and Active code generators.
Passive generators are run-once, after which you edit the result.
Active generators are run as often as desired, and you should never edit the result because it will be replaced.
IMO, the latter are much more valuable because they approach the DRY (don't-repeat-yourself) principle.
If the input information to your program can be split into two parts, the part that changes seldom (A) (like metadata or a DSL), and the part that is different each time the program is run (B) (the live input), you can write a generator program that takes only A as input, and writes out an ad-hoc program that only takes B as input.
(Another name for this is partial evaluation.)
The generator program is simpler because it only has to wade through input A, not A and B. Also, it does not have to be fast because it is not run often, and it doesn't have to care about memory leaks.
The ad-hoc program is faster because it's not having to wade through input that is almost always the same (A). It is simpler because it only has to make decisions about input B, not A and B.
It's a good idea for the generated ad-hoc program to be quite readable, so you can more easily find any errors in it. Once you get the errors removed from the generator, they are gone forever.
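As a toy illustration of that A/B split (the column list, names, and output path here are all invented), a generator might read the rarely-changing column layout (A) and emit an ad-hoc parser whose only remaining input is the live rows (B):

    // Generator: reads input A (the column layout) and writes a specialised parser.
    using System;
    using System.IO;
    using System.Text;

    static class ParserGenerator
    {
        static void Main()
        {
            // Input A: the part that rarely changes.
            var columns = new[] { "Id", "Name", "Price" };

            var code = new StringBuilder();
            code.AppendLine("// GENERATED FILE - fix the generator, not this file.");
            code.AppendLine("public static class RowParser");
            code.AppendLine("{");
            code.AppendLine("    public static string[] Parse(string line)");
            code.AppendLine("    {");
            code.AppendLine("        var parts = line.Split(',');");
            for (int i = 0; i < columns.Length; i++)
                code.AppendLine($"        var {columns[i].ToLower()} = parts[{i}]; // {columns[i]}");
            code.AppendLine("        return parts;");
            code.AppendLine("    }");
            code.AppendLine("}");

            File.WriteAllText("RowParser.generated.cs", code.ToString());
        }
    }

The generator only has to wade through A; the emitted RowParser only has to make decisions about B.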
In one project I worked on, a team designed a complex database application with a design spec two inches thick and a lengthy implementation schedule, fraught with concerns about performance. By writing a code generator, two people did the job in three months, and the source code listings (in C) were about a half-inch thick, and the generated code was so fast as to not be an issue. The ad-hoc program was regenerated weekly, at trivial cost.
So active code generation, when you can use it, is a win-win. And, I think it's no accident that this is exactly what compilers do.
Code generators, if used widely and without good justification, make code less understandable and decrease maintainability (the same goes for dynamic SQL, by the way). Personally I use them with some ORM tools, because their usage there is mostly obvious, and sometimes for things like searcher/parser algorithms and grammar analyzers which are not designed to be maintained by hand anyway. Cheers.
In hardware design, it's fairly common practice to do this at several levels of the 'stack'. For instance, I wrote a code generator to emit Verilog for various widths, topologies, and structures of DMA engines and crossbar switches, because the constructs needed to express this parameterization weren't yet mature in the synthesis and simulation tool flows.
It's also routine to emit logical models all the way down to layout data for very regular things that can be expressed and generated algorithmically, like SRAM, cache, and register file structures.
I also spent a fair bit of time writing, essentially, a code generator that would take an XML description of all the registers on a System-on-Chip, and emit HTML (yes, yes, I know about XSLT, I just found emitting it programmatically to be more time-effective), Verilog, SystemVerilog, C, Assembly, etc. "views" of that data for different teams (front-end and back-end ASIC design, firmware, documentation, etc.) to use (and keep them consistent by virtue of this single XML "codebase"). Does that count?
People also like to write code generators for e.g. taking terse descriptions of very common things, like finite state machines, and mechanically outputting more verbose imperative language code to implement them efficiently (e.g. transition tables and traversal code).
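For instance (the state names, events, and transitions are invented), the generated output for such a terse FSM description is often just a transition table plus a small traversal function:

    // Sketch of generator output for a description like "Idle -Start-> Running -Stop-> Idle".
    enum State { Idle, Running }
    enum Event { Start, Stop }

    static class TurnstileFsm
    {
        // Transitions[state, event] => next state; -1 means "no transition".
        static readonly int[,] Transitions =
        {
            //            Start                 Stop
            /* Idle    */ { (int)State.Running, -1 },
            /* Running */ { -1,                 (int)State.Idle },
        };

        public static State Step(State current, Event e)
        {
            int next = Transitions[(int)current, (int)e];
            return next < 0 ? current : (State)next;
        }
    }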
We use code generators for generating data entity classes, database objects (like triggers, stored procs), service proxies, etc. Anywhere you see a lot of repetitive code following a pattern and a lot of manual work involved, code generators can help. But you should not use them to the extent that maintainability becomes a pain. Some issues also arise if you want to regenerate the code.
Tools like Visual Studio and CodeSmith have their own templates for most of the common tasks and make this process easier. But it is easy to roll your own.
It is often useful to create a code generator that generates code from a specification - usually one that has regular tabular rules. It reduces the chance of introducing an error via a typo or omission.
Yes, I developed my own code generator for the AAA protocol Diameter (RFC 3588).
It could generate structures and APIs for Diameter messages by reading an XML file that described the Diameter application's grammar.
That greatly reduced the time to develop a complete Diameter interface (such as SH/CX/RO, etc.).
In my opinion, a good programming language would not need code generators, because introspection and runtime code generation would be part of the language, e.g. metaclasses and the new module in Python.
Code generators usually produce code that is harder to manage in long-term usage.
However, if it is absolutely imperative to use a code generator (Eclipse VE for Swing development is what I use at times), then make sure you know what code is being generated. Believe me, you wouldn't want code in your application that you are not familiar with.
Writing your own generator for a project is not efficient. Instead, use a generator such as T4, CodeSmith, or Zontroy.
T4 is more complex, and you need to know a .NET programming language. You have to write your template line by line, and you have to handle the data/relational operations on your own. You can use it from within Visual Studio.
CodeSmith is a functional tool and there are plenty of ready-to-use templates. It is based on T4, and writing your own template takes a lot of time, as it does in T4. There is a trial and a commercial version.
Zontroy is a new tool with a user-friendly interface. It has its own template language and is easy to learn. There is an online template market and it is growing. You can even publish templates and sell them online through the market.
It has a free and a commercial version. Even the free version is enough to complete a medium-scale project.
There might be a lot of code generators out there; however, I always create my own to make the code more understandable and to suit the frameworks and guidelines we are using.
We use a generator for all new code to help ensure that coding standards are followed.
We recently replaced our in-house C++ generator with CodeSmith. We still have to create the templates for the tool, but it seems ideal to not have to maintain the tool ourselves.
My most recent need for a generator was a project that read data from hardware and ultimately posted it to a 'dashboard' UI. In-between were models, properties, presenters, events, interfaces, flags, etc. for several data points. I worked up the framework for a couple data points until I was satisfied that I could live with the design. Then, with the help of some carefully placed comments, I put the "generation" in a visual studio macro, tweaked and cleaned the macro, added the datapoints to a function in the macro to call the generation - and saved several tedious hours (days?) in the end.
Don't underestimate the power of macros :)
I am also now trying to get my head around CodeRush customization capabilities to help me with some more local generation requirements. There is powerful stuff in there if you need on-the-fly decision making when generating a code block.
I have my own code generator that I run against SQL tables. It generates the SQL procedures to access the data, the data access layer and the business logic. It has done wonders in standardising my code and naming conventions. Because it expects certain fields in the database tables (such as an id column and updated datetime column) it has also helped standardise my data design.
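A rough sketch of that idea (the table name, columns, and the Id convention here are placeholders, not the poster's actual tool): given a table description, emit the matching "get by id" stored procedure text.

    using System;

    static class ProcGenerator
    {
        // Emit a SELECT-by-id stored procedure for a table that follows the
        // convention of having an Id column.
        public static string EmitGetById(string table, string[] columns)
        {
            string cols = string.Join(", ", columns);
            return
                $"CREATE PROCEDURE dbo.usp_{table}_GetById @Id int AS\n" +
                "BEGIN\n" +
                $"    SELECT {cols} FROM dbo.{table} WHERE Id = @Id;\n" +
                "END";
        }

        static void Main()
        {
            Console.WriteLine(EmitGetById("Customer",
                new[] { "Id", "Name", "UpdatedDateTime" }));
        }
    }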
How many are you looking for? I've created two major ones and numerous minor ones. The first of the major ones allowed me to generate 1500-line programs (give or take) that had a strong family resemblance but were attuned to the different tables in a database - and to do that fast, and reliably.
The downside of a code generator is that if there's a bug in the code generated (because the template contains a bug), then there's a lot of fixing to do.
However, for languages or systems where there is a lot of near-repetitious coding to be done, a good (enough) code generator is a boon (and more of a boon than a 'doggle').
In embedded systems, sometimes you need a big block of binary data in the flash. For example, I have one that takes a text file containing bitmap font glyphs and turns it into a .cc/.h file pair declaring interesting constants (such as first character, last character, character width and height) and then the actual data as a large static const uint8_t[].
Trying to do such a thing in C++ itself, so the font data would auto-generate on compilation without a first pass, would be a pain and most likely illegible. Writing a .o file by hand is out of the question. So is breaking out graph paper, hand encoding to binary, and typing all that in.
IMHO, this kind of thing is what code generators are for. Never forget that the computer works for you, not the other way around.
BTW, if you use a generator, always always always include some lines such as this at both the start and end of each generated file:
// This code was automatically generated from Font_foo.txt. DO NOT EDIT THIS FILE.
// If there's a bug, fix the font text file or the generator program, not this file.
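As a toy version of that workflow (the glyph text format, file names, and packing scheme are all invented for the sketch), the generator just reads the text file and writes out the header, warning comment included:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Text;

    static class FontGen
    {
        static void Main()
        {
            // e.g. lines of '.' and '#' characters, one line per pixel row
            string[] rows = File.ReadAllLines("Font_foo.txt");

            var h = new StringBuilder();
            h.AppendLine("// This code was automatically generated from Font_foo.txt. DO NOT EDIT THIS FILE.");
            h.AppendLine("// If there's a bug, fix the font text file or the generator program, not this file.");
            h.AppendLine($"static const int kFontWidth  = {rows[0].Length};");
            h.AppendLine($"static const int kFontHeight = {rows.Length};");
            h.AppendLine("static const uint8_t kFontData[] = {");
            foreach (string row in rows)
            {
                // Pack each row of '.'/'#' pixels into one byte per 8 pixels.
                var bytes = new List<byte>();
                byte b = 0; int bit = 0;
                foreach (char c in row)
                {
                    if (c == '#') b |= (byte)(1 << bit);
                    if (++bit == 8) { bytes.Add(b); b = 0; bit = 0; }
                }
                if (bit != 0) bytes.Add(b);
                h.AppendLine("    " + string.Join(", ", bytes.Select(x => "0x" + x.ToString("X2"))) + ",");
            }
            h.AppendLine("};");
            File.WriteAllText("font_foo.h", h.ToString());
        }
    }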
Yes I've had to maintain a few. CORBA or some other object communication style of interface is probably the general thing that I think of first. You have object definitions that are provided to you by the interface you are going to talk over but you still have to build those objects up in code. Building and running a code generator is a fairly routine way of doing that. This can become a fairly lengthy compile just to support some legacy communication channel, and since there is a large tendency to put wrappers around CORBA to make it simpler, well things just get worse.
In general, if you have a large number of structures, or rapidly changing structures that you need to use, but you can't handle the performance hit of building objects through metadata, then you're into writing a code generator.
I can't think of any projects where we needed to create our own code generators from scratch, but there are several where we used pre-existing generators. (I have used both ANTLR and the Eclipse Modeling Framework for building parsers and models in Java for enterprise software.) The beauty of using a code generator that someone else has written is that the authors tend to be experts in that area and have solved problems that I didn't even know existed yet. This saves me time and frustration.
So even though I might be able to write code that solves the problem at hand, I can generate the code a lot faster and there is a good chance that it will be less buggy than anything I write.
If you're not going to write the code, are you going to be comfortable with someone else's generated code?
Is it cheaper in both time and $$$ in the long run to write your own code or code generator?
I wrote a code generator that would build hundreds of classes (Java) that would output XML data from a database in a DTD- or schema-compliant manner. The code generation was generally a one-time thing, and the code would then be smartened up with various business rules, etc. The output was for a rather pedantic bank.
Code generators are a workaround for programming language limitations. I personally prefer reflection to code generators, but I agree that code generators are more flexible, and the resulting code is obviously faster at runtime. I hope future versions of C# will include some kind of DSL environment.
The only code generators that I use are webservice parsers. I personally stay away from code generators because of the maintenance problems for new employees or a separate team after hand off.
I write my own code generators, mainly in T-SQL, which are called during the build process.
Based on meta-model data, they generate triggers, logging, C# const declarations, INSERT/UPDATE statements, data model information to check whether the app is running on the expected database schema.
I still need to write a forms generator for increased productivity, more specs and less coding ;)
I've created a few code generators. I had a passive code generator for SQL stored procedures which used templates. This generated 90% of our stored procedures.
Since we made the switch to Entity Framework, I've created an active code generator using T4 (Text Template Transformation Toolkit) inside Visual Studio. I've used it to create basic repository partial classes for our entities. It works very nicely and saves a bunch of coding. I also use T4 for decorating the entity classes with certain attributes.
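For a flavour of what that looks like, here is a minimal T4 template in the same spirit; the entity names, namespace, and emitted members are placeholders, not the poster's actual template:

    <#@ template language="C#" #>
    <#@ output extension=".cs" #>
    <#
        // In a real template this list would come from the EF model or other metadata.
        var entities = new[] { "Customer", "Order" };
    #>
    // <auto-generated> produced by a T4 template; do not edit by hand.
    namespace MyApp.Generated
    {
    <# foreach (var e in entities) { #>
        public partial class <#= e #>Repository
        {
            // CRUD members would be emitted here.
        }
    <# } #>
    }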
I use code generation features provided by EMF - Eclipse Modeling Framework.
Code generators are really useful in many cases, especially when mapping from one format to another. I've done code generators for IDL to C++, database tables to OO types, and marshalling code just to name a few.
I think the point the authors are trying to make is that if you're a developer you should be able to make the computer work for you. Generating code is just one obvious task to automate.
I once worked with a guy who insisted that he would do our IDL to C++ mapping manually. In the beginning of the project he was able to keep up, because the rest of us were trying to figure out what to do, but eventually he became a bottleneck. I did a code generator in Perl and then we could pretty much do his "work" in a few minutes.
See our "universal" code generator based on program transformations.
I'm the architect and a key implementer.
It is worth noting that a significant fraction of this generator, is generated using this generator.
We use the Telosys code generator in our projects: http://www.telosys.org/
We created it to reduce the development time of recurrent tasks like CRUD screens, documentation, etc.
For us the most important thing is to be able to customize the generator's templates, in order to create new generation targets if necessary and to customize existing templates. That's why we have also created a template editor (for Velocity .vm files).
It works well as a Java/Spring/AngularJS code generator and can be adapted for other targets (PHP, C#, Python, etc.).