Lightweight data store in Perl

My requirement is to maintain a simple data store with some rows (~1000) & columns (6).
Over a period of time (2 years), I expect the data to grow to 1000-1500 rows.
I would like to query, insert & update records in the data store.
I need this data store because the data has to be processed by another script.
I am using Perl for programming.
I have seen some threads on Stack Overflow about this (e.g. "looking for light-weight data persistence solution in perl"), but I cannot make a decision.
Is anyone using a lightweight data store in Perl with query, insert & update capabilities?

Go for SQLite. It is powerful, tunable and lightweight.

You've already accepted an answer, but in your case I might just go with hashes and use Storable to write the structure to disk. That is, if you don't have multiple people using the data at once.
The advantage is that it's all standard Perl, so it will work with almost any Perl installation. Can't get any lighter weight than this.
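A minimal sketch of that approach, assuming a single-file store; the file name, keys and column names below are made up for illustration:

    use strict;
    use warnings;
    use Storable qw(nstore retrieve);

    # Hypothetical file name for the store.
    my $file = 'datastore.stor';

    # Load the existing store, or start with an empty hash of rows.
    my %rows = -e $file ? %{ retrieve($file) } : ();

    # Insert or update a row keyed by id; the columns live in a hashref.
    $rows{42} = { name => 'widget', qty => 3 };

    # Query: it is just a hash lookup.
    print $rows{42}{name}, "\n" if exists $rows{42};

    # Persist the whole structure back to disk (nstore writes in network
    # byte order, which sidesteps endianness issues between machines).
    nstore(\%rows, $file);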

Probably the simplest lightweight solution would be to use DBI with DBD::SQLite.
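A minimal sketch of how that could cover the question's query, insert and update needs; the database file, table and column names here are invented:

    use strict;
    use warnings;
    use DBI;

    # Hypothetical database file and table; adjust names to taste.
    my $dbh = DBI->connect('dbi:SQLite:dbname=store.db', '', '',
                           { RaiseError => 1, AutoCommit => 1 });

    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS items (
            id INTEGER PRIMARY KEY,
            name TEXT, c1 TEXT, c2 TEXT, c3 TEXT, c4 TEXT
        )
    });

    # Insert
    $dbh->do('INSERT INTO items (name, c1) VALUES (?, ?)', undef, 'foo', 'bar');

    # Update
    $dbh->do('UPDATE items SET c1 = ? WHERE name = ?', undef, 'baz', 'foo');

    # Query
    my $rows = $dbh->selectall_arrayref('SELECT id, name, c1 FROM items',
                                        { Slice => {} });
    print "$_->{id}: $_->{name} => $_->{c1}\n" for @$rows;

    $dbh->disconnect;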

If your data is relational and you are comfortable with SQL then I vote DBD::SQLite.
However, if your data is more like documents (each entry is self-contained), or if you are not comfortable with SQL, then I recommend DBM::Deep. Its interface is as easy to use as regular Perl variables.
Finally, if you want to be really modern, MongoDB is very easy to install and the new Mango Perl module is very cool, just saying :-)
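If DBM::Deep appeals, here is a tiny sketch of its object interface; the file name and keys are invented:

    use strict;
    use warnings;
    use DBM::Deep;

    # Hypothetical file name; DBM::Deep creates it if it does not exist.
    my $db = DBM::Deep->new('records.db');

    # Write nested data just like a Perl hash; changes go to disk as you make them.
    $db->{users}{alice}{email} = 'alice@example.com';

    # Read it back the same way.
    print $db->{users}{alice}{email}, "\n";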

Related

Storing and managing Forex trading tick data

I'm building a data visualization system for Forex trading and I'm exploring ways of storing the historical Forex trading tick data that I have.
The data are in the form of currency pair (e.g. USD/CAD) chronological ticks of Ask and Bid prices. At the end of the day I need my data to be indexed in Elasticsearch, and what I'm searching for is the best way to get it there.
I found a couple of approaches online; they start out simple but then get complicated. I'm wondering if adding that extra complexity is worth it. Some of my options are:
Storing tick data on PostgreSQL and then via a plugin sync them to Elasticsearch (here)
Storing tick data on PostgreSQL, push them to Logstash and then to Elasticsearch
Finally, storing tick data on PostgreSQL, push them to Redis, then to Logstash, and then to Elasticsearch
My intuition says that solution No 2 would be the ideal one, but what is considered best practice?
It's a good idea to store your data in a long-term storage DB, such as PostgreSQL or similar. That way you can decide at any time whether you need to change your mappings, add fields, remove fields, change their types, or what have you, and then you can easily rebuild your ES index/indices without too much trouble from your primary source of truth (i.e. PostgreSQL) and you always have clean data in ES.
I don't know ZomboDB (solution 1), so I can't really speak for it. All I know is that I'm generally not too fond of tying two different technologies together; it makes it hard to upgrade either of them when you need/must/want to apply patches or benefit from new features.
Unless you have big and costly transformations to do on your source data, I feel that solution 3 doesn't bring much: the additional step of storing data in an intermediary Redis doesn't add much in my opinion (your mileage may vary here). It's a good idea to use a temporary store, such as Redis or Kafka, when you might lose data along the pipeline, but in this case, since you have your data in PostgreSQL, you don't really run the risk of losing anything. If something does go wrong, you can relaunch your pipeline and rebuild a few days of data.
That leaves solution 2, which would be fine given the information at hand. Using the Logstash JDBC input, you can easily retrieve the latest changes and forward them to ES every x minutes.
Eric from ZomboDB here. I wanted to try and answer your question as it relates to ZDB.
ZomboDB is really designed for full-text searching within Postgres. It's important to note that it's not a tool to synchronize your PG data to Elasticsearch. It's a fully-functional Postgres index type (akin to the built-in types like btree, gin, and gist) that happens to be backed by Elasticsearch. The fact that ZomboDB uses Elasticsearch is really an implementation detail.
While ZDB does provide a number of UDFs that expose access to ES' aggregate facilities, again, it's really designed for text searching.
So if your data is really just pairs of numbers, you're probably better off using ES directly -- especially if you're loading in one batch per day. There's no doubt that ZDB could provide superior aggregate performance compared to standard Postgres "GROUP BY" queries (because it passes it through to Elasticsearch), but you're paying a heavy operational penalty for a limited use-case.
If, on the other hand, your ask/bid data comes with a lot of related metadata, and:
You need PG to be your source of truth,
You need to text-search that metadata (with or without aggregation support), and
You don't want to learn ES and introduce another database system to your application, then...
... ZomboDB could be right for you.
I suspect Stack Overflow isn't the place to get into this, so feel free to contact me via the ways ZDB's github page recommends.

Processing 2 million records with Perl

I have 2 million records in the database. Is it possible to bring them all in and store them in a Perl hash reference without running out of memory?
What is your reason for reading them all into memory? Speed, or ease of coding (i.e. treating the whole thing as a hashref)?
If it's the former, then sure, I think; you just need a ton of RAM.
If it's the latter, then there are interesting options. For example, there are tied interfaces for databases that look like native Perl hashes but in reality query and return data as needed. A quick search of CPAN shows Tie::DBI, Tie::Hash::DBD and several tied interfaces for specific databases, flat-file DBs, and CSV files, including my own Tie::Array::CSV.
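As a rough sketch of the tied approach, here is Tie::Hash::DBD pointed at an SQLite file, with the module's options left at their defaults; the DSN and keys are made up:

    use strict;
    use warnings;
    use Tie::Hash::DBD;

    # Each hash access becomes a database operation instead of a RAM lookup.
    tie my %h, 'Tie::Hash::DBD', 'dbi:SQLite:dbname=cache.db'
        or die "Cannot tie hash to database\n";

    $h{record_1} = 'some value';     # write goes to the database
    print $h{record_1}, "\n";        # read queries the database

    untie %h;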
On the one hand, processing two million elements in a hash isn't unheard of. However, we don't know how big your records are. At any rate, it sounds like an XY problem. It may not be the best solution for the problem you're facing.
Why not use DBIx::Class so that your tables can be treated like Perl classes (which are themselves glorified data-structures)? There's a ton of documentation at DBIx::Class::Manual::DocMap. This is really what DBIx::Class is all about; letting you abstract away the SQL details of the database and treat it like a series of classes.
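A compressed sketch of what that can look like; the schema, table and column names are all invented, and a real project would put each package in its own file:

    use strict;
    use warnings;

    # Result class describing one table (names invented for illustration).
    package My::Schema::Result::Record;
    use base 'DBIx::Class::Core';
    __PACKAGE__->table('records');
    __PACKAGE__->add_columns(qw(id name value));
    __PACKAGE__->set_primary_key('id');

    # Schema class that knows about the result class above.
    package My::Schema;
    use base 'DBIx::Class::Schema';
    __PACKAGE__->register_class(Record => 'My::Schema::Result::Record');

    package main;

    my $schema = My::Schema->connect('dbi:SQLite:dbname=app.db');
    my $rs     = $schema->resultset('Record');

    # Iterate rows as objects; DBIx::Class fetches them as you go instead of
    # pulling two million records into one giant hashref.
    while (my $row = $rs->next) {
        printf "%s => %s\n", $row->name, $row->value;
    }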
That completely depends on how much data your records contain. Perl hashes and arrays take up more memory than you'd think, although it's not crazy. But again, it totally depends on what your data looks like and how much RAM you have. Perl won't have any problems with it if you have the RAM.

looking for light-weight data persistence solution in perl

In my app I need to store some simple data both in memory and on disk. A real database would be overkill in my case, so I need something lighter to handle the simple data persistence requirement. I did some Google searching myself and found some interesting things like DBM and DBI CSV, etc., but since there are so many options it is difficult for me to make the actual choice, so I'd like to ask you here for a "best-practice" lightweight data persistence solution in Perl.
You have several options:
Storable is a core module and is very efficient. It has some problems with portability; for example, someone using an older version of Storable may not be able to read your data. Also, the endianness of the system creating and retrieving that data is important. The network-order storage options help reduce the portability issues. You can store an arbitrary nested data structure to a file or string and restore it. Storable is supported only by Perl.
YAML is a text-based format that works like Storable: you can store and restore arbitrary structures to/from YAML files. YAML is nice because there are YAML libraries for several languages. It's not quite as speedy or space-efficient as Storable.
JSON is a popular data exchange format with support in many languages (see the sketch after this list of options). It is very much like YAML in both its strengths and weaknesses.
DBD::SQLite is a database driver for the DBI that allows you to keep a whole relational database in a single file. It is powerful and allows you to work with many of the persistence tools that are aimed at other databases like MySQL and Postgres.
DBM::Deep is a convenient and powerful Perl-only module that allows efficient retrieval and modification of small parts of a large persistent data structure. It is almost as easy to use as Storable, but far more efficient when dealing with only small portions of a large data structure.
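As an example of the text-based options above, a JSON round trip needs nothing beyond a core module; the file and key names here are invented:

    use strict;
    use warnings;
    use JSON::PP;                      # in core since Perl 5.14

    my $file = 'data.json';            # hypothetical file name
    my $data = { rows => 1000, columns => [qw(id name qty)] };

    # Write the structure out as JSON text.
    open my $out, '>', $file or die "Cannot write $file: $!";
    print {$out} JSON::PP->new->pretty->encode($data);
    close $out;

    # Read it back into a Perl structure.
    open my $in, '<', $file or die "Cannot read $file: $!";
    my $json_text = do { local $/; <$in> };
    close $in;

    my $restored = JSON::PP->new->decode($json_text);
    print $restored->{rows}, "\n";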
Update: I realized that I should mention that I have used all of these modules and depending on your particular needs, any of them could be "the right choice".
You might want to try Tie::Storable. Then it's as simple as addressing a hash.
If you're not looking to store a ton of data and you're OK loading everything all at once at program startup, it might be the way to go.
If you're looking for something more sophisticated but still light weight, a lot of people (including myself) swear by SQLite.
If I had to do this I would probably go with DBI and DBD::SQLite, since it does not involve reading all the data into memory, but I'd just like to mention a few other ways, because "there's more than one way to do it":
The old way to do this was with DB_File and its cousins. It still works with modern versions of Perl. The drawback is that it is only useful for storing a one-dimensional hash (a hash which doesn't have any references in it). The advantage is that you can find nice books about it which don't cost very much money, and also online articles, and I believe it doesn't involve reading the whole file into memory.
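A bare-bones sketch of that style; the file name is made up, and values must be plain strings:

    use strict;
    use warnings;
    use DB_File;                        # ships with Perl, needs Berkeley DB

    # Tie a one-dimensional hash to a file; keys and values must be plain
    # strings, not references.
    tie my %store, 'DB_File', 'store.db'
        or die "Cannot open store.db: $!";

    $store{counter} = ($store{counter} // 0) + 1;
    print "This script has run $store{counter} times\n";

    untie %store;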
Another method is to print the output of Data::Dumper to a file to store the data, and eval the contents of the file to read it back.
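A rough sketch of that trick, using do to compile and eval the file when reading it back; the file name is made up:

    use strict;
    use warnings;
    use Data::Dumper;

    my $file = 'dump.pl';               # hypothetical file name
    my $data = { colours => [qw(red green blue)] };

    # Write: dump the structure as Perl source code.
    {
        local $Data::Dumper::Terse = 1; # emit just the data, no '$VAR1 ='
        open my $fh, '>', $file or die "Cannot write $file: $!";
        print {$fh} Dumper($data);
        close $fh;
    }

    # Read: evaluate the file to get the structure back.
    my $restored = do $file or die "Cannot load $file: ", $@ || $!;
    print "@{ $restored->{colours} }\n";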
Yet another thing which hasn't been mentioned is KiokuDB, which looks like the cutting-edge Moose-based module, if you want to be trendy.
Do you want your data to be transparently persisted, i.e. you won't have to worry about doing a commit()-type operation after every write? I just asked a very similar question: Simple, modern, robust, transparent persistence of data structures for Perl, and listed all the solutions I found.
If you do want transparent persistence (autocommit), then DBM::Deep may be easier to use than Storable. Here is example code that works out of the box:
    use DBM::Deep;
    tie my %db, 'DBM::Deep', 'file.db';
    if ( exists $db{foo}->{bar} ) {
        print $db{foo}->{bar}, "\n";
    } else {
        $db{foo}->{bar} = 'baz';
    }
Look into Tie::File and submodules like Tie::File::AsHash, or Tie::Handle::CSV. All available on CPAN, fast and easy to use.
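For example, Tie::File maps each line of a text file onto an array element; the file name is invented:

    use strict;
    use warnings;
    use Tie::File;

    # Each array element corresponds to one line of the file.
    tie my @lines, 'Tie::File', 'records.txt'
        or die "Cannot tie records.txt: $!";

    push @lines, join ',', 'alice', 42;    # append a record
    $lines[0] = 'name,score';              # rewrite the first line in place
    print scalar @lines, " lines on disk\n";

    untie @lines;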
Storable lets you serialize any Perl data structure and read it back in. For in-memory storage, just use IO::Scalar to store into a string; that way you only need to write the code once, and for writing to disk you just pass in another I/O handle.
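Storable's freeze and thaw also handle the in-memory case directly, if you would rather skip the filehandle indirection; a small sketch of both directions follows (the file name is invented):

    use strict;
    use warnings;
    use Storable qw(freeze thaw nstore);

    my $data = { rows => [ [ 1, 'a' ], [ 2, 'b' ] ] };

    # In memory: serialize to a scalar and back, no filehandle needed.
    my $frozen = freeze($data);
    my $copy   = thaw($frozen);
    print scalar @{ $copy->{rows} }, " rows\n";

    # On disk: the same structure written straight to a file.
    nstore($data, 'rows.stor');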

Which is faster when reading large amounts of data: XML or SQLite

I will be developing a dictionary app for both Android and iPhone. The data will be embedded within the app, and it consists of approximately 100,000 words, with genus and plural form. Is it better to use a SQLite database, or can I just stick to XML? Somehow SQLite sounds more efficient, but I thought I'd just ask.
Thanks!
That's a bit of an apples and oranges comparison. SQLite is a whole lot more than just a file format. The answer depends on whether you just want to load everything into memory on startup (XML, or better still, CSV will probably suffice), or you want to be able to query the data, in which case SQLite is a far better choice.
You will want to be performing searches, so SQLite will definitely be quicker. You will need some kind of function to install your data from the distributed executable into SQLite, of course...
XML is better suited for storing data trees (hierarchical data structures), and for data exchange.
SQL is a better fit for data tables.
In your situation (from what little you've shared with us), SQL (and therefore SQLite) sounds much more efficient.

How to SELECT data from SQL Server using .NET 2.0 -- as simple as possible

Easy question:
I have an app that needs to make a half dozen SELECT requests to SQL Server 2005 and write the results to a flat file. That's it.
If I could use .NET 3.5, I'd create a LINQ-To-SQL model, write the LINQ expressions and be done in an hour. What is the next best approach given that I can't use .NET 3.0 or 3.5? Are ADO.NET DataReaders/DataSets the best option, or am I forgetting something else available?
Using the SqlCommand and SqlDataReader classes is your best bet. If you need to write the results to a flat file, you should use the reader directly instead of going through a DataSet, since the latter will load the result into memory before you're able to write it out to a flat file.
The SqlDataReader allows you to read out the data in a streaming fashion, making your app a lot more scalable for this situation.
As Nick K so helpfully answered on my SQL Server 2000 question on serverfault, the bcp utility is really handy for this.
You can write a batch file or quick script that calls BCP with your queries and have it dump CSV/SQL directly to a text file!
Agree with Dave Van den Eynde's answer above, but I would say that if you're pushing a large amount of data into these files, and if your app is something that can support it, then it's worth taking a look at making an SSIS package.
Could be complete overkill for this, but it's something that is often overlooked for bulk import/export.
Alternatively, you could avoid writing code and use BCP.exe:
http://msdn.microsoft.com/en-us/library/ms162802(SQL.90).aspx