How should I store and compress a Moose object using Perl?

I have created a package using Moose and I would like to nstore (via Storable) some large instances. The resulting binary files are very large (500+ MB), so I would like to compress them.
What is the best way for doing that?
Should I open a filehandle through bzip2 etc. and then store using nstore_fd?

With MooseX::Storage, most of this is already done for you -- you just need to specify your serialization and I/O formats.
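For example, here is a minimal sketch (the class name MyBigObject and its single attribute are made up for illustration) that picks the JSON format and the plain File I/O backend:

package MyBigObject;
use Moose;
use MooseX::Storage;

# Choose a serialization format and an I/O backend; both are
# provided as roles by MooseX::Storage.
with Storage(format => 'JSON', io => 'File');

has data => (is => 'rw', isa => 'HashRef');

package main;

my $obj = MyBigObject->new(data => { a => 1, b => 2 });
$obj->store('object.json');                   # serialize to disk
my $copy = MyBigObject->load('object.json');  # ...and back again

MooseX::Storage also ships Storable-based format and I/O backends if you prefer a binary representation over JSON.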

While compression is certainly a viable option, you might also want to consider simply serializing less.
Could it be that your objects contain a lot of data that could easily be rebuilt from other data they also contain? For example, if you have attributes that are lazily built from other attributes (e.g. using Moose's lazy + builder or lazy_build), there is not much point in storing the values of those attributes at all unless the recomputation is incredibly expensive. And even then it might be worth considering, as reading lots of data off disk isn't the fastest thing either.
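As a sketch (the class and attribute names are invented), an attribute like row_totals below costs little to leave out of the serialized data, because Moose rebuilds it from raw_matrix on first access:

package Analysis;
use Moose;

has raw_matrix => (is => 'ro', isa => 'ArrayRef');

# Derived data: cheap to recompute, so there is little point in
# storing it alongside raw_matrix.
has row_totals => (
    is      => 'ro',
    isa     => 'ArrayRef',
    lazy    => 1,
    builder => '_build_row_totals',
);

sub _build_row_totals {
    my $self = shift;
    return [ map { my $sum = 0; $sum += $_ for @$_; $sum }
             @{ $self->raw_matrix } ];
}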
If you find that you want to serialize only parts of your objects, and still want to use Storable, you can define custom STORABLE_freeze and STORABLE_thaw hooks, as described in the Storable documentation.
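A minimal sketch of such hooks, reusing the hypothetical Analysis class above and assuming the default hash-based Moose instances:

package Analysis;

# Only raw_matrix is written out; Storable serializes the listed
# references for us, and row_totals is rebuilt lazily after thawing.
sub STORABLE_freeze {
    my ($self, $cloning) = @_;
    return ('v1', $self->raw_matrix);
}

sub STORABLE_thaw {
    my ($self, $cloning, $tag, $raw_matrix) = @_;
    $self->{raw_matrix} = $raw_matrix;
    return;
}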
However, there are also alternative serializers available. MooseX::Storage is one of them; it supports many serialization backends and formats, and can also easily be told which attributes to serialize and which to skip.
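For instance, marking an attribute with the DoNotSerialize trait keeps it out of the output entirely; this sketch builds on the hypothetical Analysis class from above:

package Analysis;
use Moose;
use MooseX::Storage;

with Storage(format => 'JSON', io => 'File');

has raw_matrix => (is => 'ro', isa => 'ArrayRef');

# Skipped by pack()/store() and rebuilt lazily after load().
has row_totals => (
    traits  => ['DoNotSerialize'],
    is      => 'ro',
    isa     => 'ArrayRef',
    lazy    => 1,
    builder => '_build_row_totals',
);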

Have a look at Data::Serializer. It optionally uses zlib (via Compress::Zlib) or PPMd (via Compress::PPMd) to compress your serialized data.
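A small sketch (the file name is just a placeholder) using Storable as the underlying serializer with compression switched on:

use Data::Serializer;

my $serializer = Data::Serializer->new(
    serializer => 'Storable',
    compress   => 1,          # uses Compress::Zlib under the hood
);

my $object = { big => [ 1 .. 1_000_000 ] };
my $frozen = $serializer->serialize($object);

open my $out, '>', 'object.dat' or die "open: $!";
binmode $out;
print {$out} $frozen;
close $out;

# Later:
open my $in, '<', 'object.dat' or die "open: $!";
binmode $in;
my $thawed = $serializer->deserialize(do { local $/; <$in> });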

Related

Can YAML::XS honor UseAliases=0?

UseAliases
YAML has an alias mechanism such that any given structure in memory gets serialized once. Any other references to that structure are serialized only as alias markers. This is how YAML can serialize duplicate and recursive structures.
Sometimes, when you KNOW that your data is nonrecursive in nature, you may want to serialize such that every node is expressed in full (i.e. as a copy of the original). Setting $YAML::UseAliases to 0 will allow you to do this. This also may result in faster processing because the lookup overhead is bypassed.
Looking through the source of YAML::XS's LibYAML binding, it would seem (and empirical tests confirm) that this module does not honor $YAML::UseAliases = 0.
Is there any way to get YAML::XS to not dump out aliases (and instead flatten out the entire data structure)?
No, that's not possible in YAML::XS right now.
You can create an issue on GitHub, but no guarantee that it will be implemented ;-)
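To make the difference concrete, here is a small sketch: YAML.pm expands the shared reference in full when $YAML::UseAliases is 0, while YAML::XS still emits an anchor/alias pair for it:

use YAML ();
use YAML::XS ();

my $shared = { name => 'shared node' };
my $data   = { first => $shared, second => $shared };

# YAML.pm honours the flag and writes the hash out twice in full.
$YAML::UseAliases = 0;
print YAML::Dump($data);

# YAML::XS ignores $YAML::UseAliases and still produces an anchor
# (&1) plus an alias (*1) for the shared hashref.
print YAML::XS::Dump($data);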

What's the best way to store a huge Map object populated at runtime to be reused by another tool?

I'm writing a Scala tool that encodes ~300 JSON Schema files into files of a different format and saves them to disk. I later need these schemas again when instantiating JSON data files; more precisely, I don't need the whole schemas, only a few fields of each.
I was thinking that the best solution could be to populate a Map object (while the tool encodes the schemas) containing only the info that I need, and later re-use that Map object (in another run of the tool) as an already compiled and populated map.
I've got two questions:
1. Is this really the most performant solution? and
2. How can I save the Map object, created at runtime, on disk as a file that can be later built/executed with the rest of my code?
I've read several posts about serialization and storing objects, but I'm not entirely sure whether they cover what I need. I'm also not sure this is the best solution, and I would like to hear an opinion from people with more experience than me.
What I would like to achieve is an elegant solution that allows me to lookup values from a map generated by another tool.
The whole process of compiling/building/executing is sometimes still confusing to me, so apologies if the question is trivial.
To answer your question: I think using an embedded KV store would be more efficient, considering the number of files and the amount of traversal.
Here is a small wiki page on how to use RocksJava; you can consider RocksDB a KV store: https://github.com/facebook/rocksdb/wiki/RocksJava-Basics
You can use the reference below to serialize and de-serialize an object in Scala and put it as a key-value pair into RocksDB, as I mentioned in the comment:
Convert Any type in scala to Array[Byte] and back
As for how to use RocksDB, the below dependency in your build will suffice:
"org.rocksdb" % "rocksdbjni" % "5.17.2"
Thanks.

Better way to load content from web, JSON or XML?

I have an app which will load content from a website.
There will be around 100 articles during every loading.
I would like to know which way is better to load content from web if we look at:
speed
compatibility (will there be any problems with encoding if we use special characters etc.)
your experience
JSON is better if your data is huge.
Read more here: http://www.json.org/xml.html
Strongly recommend JSON for better performance and less bandwidth consumption.
JSON all the way. The Saad's link is an excellent resource for comparing the two (+1 to the Saad), but here is my take from experience and based on your post:
speed
JSON is likely to be faster in many ways. Firstly the syntax is much simpler, so it'll be quicker to parse and to construct. Secondly, it is much less verbose. This means it will be quicker to transfer over the wire.
compatibility
In theory, there are no issues with either JSON or XML here. In terms of character encodings, I think JSON wins because you must use Unicode. XML allows you to use any character encoding you like, but I've seen parsers choke because the line at the top specifies one encoding and the actual data is in a different one.
experience
I find XML to be far more difficult to hand craft. You can write JSON in any text editor but XML really needs a special XML editor in order to get it right.
XML is more difficult to manipulate in a program. Parsers have to deal with more complexity: namespaces, attributes, entities, CDATA, etc. So if you are using a stream-based parser you need to track attributes, element content, namespace maps and so on. DOM-based parsers tend to produce complex graphs of custom objects (because they have to in order to model the complexity). I have to admit I've never used a stream-based JSON parser, but parsers producing object graphs can use the natural Objective-C collections.
On the iPhone, there is no built-in XML DOM parser in Cocoa (you can use the C-based parser, libxml2), but there is a simple-to-use JSON parser as of iOS 5.
In summary, if I have control of both ends of the link, I'll use JSON every time. On OS X, if I need a structured human readable document format, I'll use JSON.
You say you are loading "articles". If you mean documents containing rich text (stuff like italic and bold), then it's not clear that JSON is an option - JSON doesn't really do mixed content.
If it's pure, simple structured data, and if you don't have to handle complexities like the need for the software at both ends of the communication to evolve separately rather than remaining in lock step, then JSON is simpler and cheaper: you don't need the extra power or complexity of XML.

How a class that wraps and provides access to a single file should be designed?

MyClass is all about providing access to a single file. It must CheckHeader(), ReadSomeData(), UpdateHeader(WithInfo), etc.
But since the file that this class represents is very complex, it requires special design considerations.
That file contains a potentially huge folder-like tree structure with various node types and is block/cell based to handle fragmentation better. Size is usually smaller than 20 MB. It is not of my design.
How would you design such a class?
Read a ~20MB stream into memory?
Put a copy in a temp dir and keep its path as a property?
Keep a copy of big things in memory and expose them as read-only properties?
GetThings() from the file with exception-throwing code?
This class (or classes) will be used only by me at first, but if it ends up good enough I might open-source it.
(This is a question on design, but platform is .NET and class is about offline registry access for XP)
It depends what you need to do with this data. If you only need to process it linearly one time, then it might be faster to just take the performance hit of a large file in memory.
If however you need to do various things with the file beyond a single, linear parsing, I would parse the data into a lightweight database such as SQLite and then operate on that. This way all of your file's structure is preserved and all subsequent operations on the file will be faster.
Registry access is quite complex. You are basically reading a large binary tree. The class design should rely heavily on the stored data structures; only then can you choose an appropriate design. To stay flexible you should model the primitives such as REG_SZ, REG_EXPAND_SZ, DWORD, SubKey, .... Don Syme's book Expert F# has a nice section about binary parsing with binary combinators. The basic idea is that your objects know by themselves how to deserialize from a binary representation. When you have a stream of bytes which is structured like this
<Header>
  <Node1/>
  <Node2>
    <Directory1/>
  </Node2>
</Header>
you start with a BinaryReader to read the binary objects byte by byte. Since you know that the first thing must be the header, you can pass the reader to the Header object:
public class Header
{
    public static Header Deserialize(BinaryReader reader)
    {
        Header header = new Header();
        int magic = reader.ReadByte();
        if (magic == 0xf4)       // we have a node entry
            header.Insert(Node.Read(reader));
        else if (magic == 0xf3)  // directory entry
            header.Insert(DirectoryEntry.Read(reader));
        else
            throw new NotSupportedException("Invalid data");
        return header;
    }
}
To stay performant you can, for example, delay parsing the data until specific properties of this or that instance are actually accessed.
Since the registry in Windows can get quite big, it is not possible to read it completely into memory at once. You will need to chunk it. One solution that Windows applies is that the whole file is allocated in paged pool memory, which can span several gigabytes, but only the actually accessed parts are paged in from disk into memory. That allows Windows to deal with a very large registry file in an efficient manner. You will need something similar for your reader as well. Lazy parsing is one aspect, and the ability to jump around in the file without the need to read the data in between is crucial to stay performant.
More infos about paged pool and the registry can be found there:
http://blogs.technet.com/b/markrussinovich/archive/2009/03/26/3211216.aspx
Your API design will depend on how you read the data to stay efficient (e.g. use a memory-mapped file and read from different mapped regions). With .NET 4, a memory-mapped file implementation has arrived that is quite good, but wrappers around the OS APIs exist as well.
To support delayed loading from a memory-mapped file, it would make sense not to read the byte array into the object and parse it later, but to go one step further and store only the offset and length of the memory chunk from the memory-mapped file. Later, when the object is actually accessed, you can read and deserialize the data. This way you can traverse the whole file and build a tree of objects which contain only the offsets and the reference to the memory-mapped file. That should save huge amounts of memory.

How can I build a generic dataset-handling Perl library?

I want to build a generic Perl module for handling and analysing biomedical character-separated datasets, one that can, most certainly, be used on any kind of dataset containing a mixture of categorical (A, B, C, ...), continuous (1.2, 3, 881, ...), and identifier (XXX1, XXX2, ...) variables. The plan is to have people initialize the module and then use some arguments to point to the data file(s), the place where the analysis reports should be placed, and the structure of the data.
By structure of data I mean which variable is in which place, and its name and type. And this is where I need some enlightenment. I am baffled as to how to do this in a clean way. Obviously, having people create a simple schema file, be it XML or some other format, would be the cleanest, but maybe not all people enjoy doing something like this.
The solutions I can think of are:
Create a configuration file in XML or similar and with a prespecified format.
Pass the information during initialization of the module.
Use the first row of the data as headers and try to guess types (ouch)
Surely there must be a "canonical" way of doing this that is also usable and efficient.
This doesn't answer your question directly, but have you checked CPAN? It might have the module you need already. If not, it might have similar modules -- related either to biomedical data or simply to delimited data handling -- that you can mine for good ideas, both concerning formats for metadata and your module's API.
Any of the approaches you've listed could make sense. It all depends on how complex the data structures and their definitions are. What will make something like this useful to people is whether it saves them time and effort. So your decision will come down to which approach best satisfies the need to make:
use of the module easy
reuse of data definitions easy
the data definition language sufficiently expressive to describe all known use cases
the data definition language sufficiently simple that an infrequent user can spend minimal time with the docs before getting real work done.
For example, if I just need to enter the names of the columns and their types (and there are only 4 well defined types), doing this each time in a script isn't too bad. Unless I have 350 columns to deal with in every file.
However, if large, complicated structure definitions are common, then a more modular reuse oriented approach is better.
If your data description language is difficult to work with, you can mitigate the issue a bit by providing a configuration tool that allows one to create and edit data schemes.
rx might be worth looking at, as well as the Data::Rx module on the CPAN. It provides schema checking for JSON, but there is nothing inherent in the model that makes it JSON-only.
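As a rough sketch of what a per-row type definition could look like with Data::Rx (the field names are invented for illustration):

use Data::Rx;

my $rx = Data::Rx->new;

# One record per data row: an identifier, a continuous value and
# an optional categorical group.
my $row_schema = $rx->make_schema({
    type     => '//rec',
    required => {
        id    => '//str',
        value => '//num',
    },
    optional => {
        group => '//str',
    },
});

my $row = { id => 'XXX1', value => 1.2, group => 'A' };
print $row_schema->check($row) ? "row ok\n" : "row rejected\n";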