Is there a data storage where I can access data directly via array index instead of hash key? Redis? MongoDB? - mongodb

I need an external C/C++ memory efficient (!) data storage for a Java app which does not have the downside of a normal database lookup (b tree) but which uses my IDs as array index. Is there an open source solution for this? I implemented this in C++ in-memory only, but I would like to have a "storage to disc" option in case of a crash or for backup. Also Java binding would be cool.
E.g. redis looks good but when reading the docs I see that in general things are accessed by hash keys which have O(1) only in theory - or can I somehow force that the hashing scheme matches the storage index? And also lists are not appropriated as they are implemented as linked lists. Or what about mongodb?
And yes, I really need that fast read access (write can be "okayish slow" :)) - it is no premature optimization but if there is no alternative I'll try redis before rolling my own. Also Java is not possible (as I said: memory efficient ;))

With a remote key-value store, the overhead is very often dominated by the network and protocol management rather than data access itself. That's why with efficient key-value stores (like Redis for instance), almost all the operations actually have the same cost.
The Redis benchmark page contains a good illustration of this point.
In other words, in the context of an in-memory remote store, and considering only the latency, a random access array will have the same exact performance than a hash table, and even less efficient O(log n) containers like red-black trees, B-trees, etc ... will be quite close.
If you really want maximum performance, I would suggest to use an embedded (i.e. in-process) store. For instance, both BerkeleyDB and Tokyo Cabinet provide disk based random access containers for fixed-length records.

KDB is the go-to solution for this problem in the financial systems (algo trading) world. Be prepared to have your brain melted by the syntax though. Oh, and it is not open source.

Related

Which is an advantage of storing images directly in MongoDB instead of serverside folder

I suppose that storing images (or any binary data - pdfs, movies, etc. ) outside of DB (MongoDB in my case) and putting them in public server folder can be at least faster (no encoding, decoding and things around that).
But since there is such an option in MondoDB, I'd like to know advantages of using this, and use cases, when that approach is recommended.
Replication: It is pretty easy to set up a highly available replica set. So even if one machine goes down, the files would still be available. While this is possible to achieve by various means for a simple filesystem as well, the overhead for this might well eliminate the performance advantage (if there is any: MongoDB has quite sophisticated internal caching going on). Furthermore, setting up a DRBD and making sure consistency and availability requires quite more knowledge and administrative effort than with MongoDB. Plus, you'd need to have your DB be highly available as well.
Scalability: It can get quite complicated and/or costly when your files exceed the storage capacity of a single node. While in theory you can scale vertically, there is a certain point where the bang you get for the buck decreases and scaling horizontally makes more sense. However, with a filesystem approach, you'd have to manage which file is located at which node, how and when to balance and whatnot. MongoDB's GridFS in a sharded environment does this for you automatically and – more important – transparently. You neither have to reinvent the wheel nor maintain it.
Query by metadata: While in theory you can do this by an approach with a database and links to a filesystem, GridFS comes with means to insert arbitrary metadata and query by it. Again, this saves you reinventing the wheel. As an interesting example is that finding duplicates is quite easy with GridFS: a hash sum is automatically calculated for each file in GridFS. With a rather simple aggregation, you can find dupes and then deal with them accordingly.
When you have large amount of binary data and you want to take advantage of sharding, you can go with storing the binary data in mongo db using gridfs. But from performance point of view, Obviously as you pointed storing the images in a file system is a better way.

Suitable db solution for high read rate

I'll explain the use cases first.
High read rates (10000+ p/s), large dataset (lots of string codes(think promocodes) looking for matchs, strings 10 - 20chars). Needs fast response time.
First thought was memcached. However to combat downtime if memcache goes down and starts repopulating the cache from a db like mysql.... i was thinking redis for auto repopulation of cache.
Is it true that redis does not persist to the hdd but instead a flush needs to be called for it to be backed up?
My hope is to use the code string as the key making lookup super quick. Value will be an id linking it to a db record thats not needed by the api.
If i had to guess how many unique strings will be stored..... 10M + after a few months.
Iv also looked at Cassandra briefly and mongodb. Im thinking mongodb will not be enough due to it not storing entire list in memory?
Any insight into these systems is very helpful. Feel like im going around in circles.
The api is made in nodejs. (If it matters)
10K/s is definitely not a high rate for a DB like Cassandra, according that your schema is done wisely. I bet it's the same for the others.
10M unique strings per months is peanuts for modern big data systems.
Whatever big data solution you retain, you will have to design the schema acording to the type of data and operational needs.
IMO, the important ones are the following 2 questions :
What you mean by "looking for matchs"?
If you need indexing and search using substrings or regexps, you need a search engine: ElasticSearch or SOLR are great. Warning that E/S does replication and sharding but it's distribution model is still not 100% safe.
None of the systems you mentionned will provide the reactivity you seem to look for.
If you will query using static strings: a key-value store or column oriented database like Cassandra will be just the perfect fit. So all are good fit.
What is a fast response time?
With selecting the right technology and appropriate schemas all those systems will give you great response time under hundreds of milliseconds, but will it be fast enough for you?
REDIS and MemCached being in-memory will provide the faster responses.
And as a conclusion, the API being in node.js is irrelevant for the choice of your storage and indexing technology, unless you want to stick with Javascript for everything and MongoDB is more friendly for you, it can be a decent candidate depending on your search use cases.

memcached like software with disk persistence

I have an application that runs on Ubuntu Linux 12.04 which needs to store and retrieve a large number of large serialized objects. Currently the store is implemented by simply saving the serialized streams as files, where the filenames equal the md5 hash of the serialized object. However I would like to speed things up replacing the file-store by one that does in-memory caching of objects that are recently read/written, and preferably does the hashing for me.
The design of my application should not get any more complicated. Hence preferably would be a storing back-end that manages a key-value database and caching in an abstracted and efficient way. I am a bit lost with all of the key/value stores that are out there, and much of the topics/information seems to be outdated. I was initially looking at something like memcached+membase, but maybe there are better solutions out there. I looked into redis, mongodb, couchdb, but it is not quite clear to me if they fit my needs.
My most important requirements:
Transparent saving to a persistent store in a way that the most recently written/read objects are quickly available by automatically caching them in memory.
Store should survive a reboot. Hence in memory objects should be saved on disk asap.
Currently I am calculating the md5 manually. It would actually be nicer if the back-end does this for me. Hence the ability to get the hash-key when an object is stored, and be able to retrieve the object later using the hashkey.
Big plus is that if there are packages available for Ubuntu 12.04, either in universe or through launchpad or whatever.
Other than this, the software should preferably be light not be more complicated than necessary (I don't need distributed map-reduce jobs, etc)
Thanks for any advice!
I would normally suggest Redis because it will be fast and in-memory with asynch persistant store. Plus you'll find you can use their different data types for other purposes so not as single-purpose as memcached. As far as auto-hashing, I don't think it does that as you define your own keys when you store objects (as in most of them).
One downside to Redis is if you're storing a TON of binary objects, you'll be limited to available memory in RAM (unless sharding) so could reach performance limitations. In that case you may store objects on file system, hash them, and store keys in Redis and match that to filename stored on file server and you'd be fine.
--
An alternate option would be to check out ElasticSearch which is like Mongo in that it stores objects native as JSON, but it includes the Lucene search engine on top with RESTful API interface. It "warms up" data in memory for fast response, but is also a persistent store and the nicest part is it auto-shards and auto-clusters using multicast to find other nodes.
--
Hope that helps and if so, share the love! ;-)
I'd look at MongoDB. It caches things efficiently using your OS to page data in and out, and is pretty simple to setup. Redis and Memcached won't be good solutions for you because they keep everything in RAM. Other, simpler solutions like LevelDB or BDB would also probably be suitable. I don't think any database going to compute hashes automatically for you. It sounds like you already have code for this though.

Can i use RedisDB for big databases?

Can I use RedisDB for big databases instead of SQL-based DB?
I really like RedisDB architecture, and i now what RedisDB support virtual memory.
But I don't now how fast is it.
PS When I say big database, I don't mean REALLY big. Just DB of some site, that can't be placed in memory.
It all depends on your application. Redis is a Key-Value store. If your data stored in a relational format and does not lend itself well to key retrieval then it is not really the solution for you. Also, Redis has the best performance when it is stored in memory. If transactions are important for you, it is not really a good solution too. I use Redis and love it, but it seems many people are trying to shoehorn in applications that Redis is not suited for. If you are storing things like webpages, comments, tweets etc I would also consider a document store like MongoDB. You would have to add more details about your implementation.
Redis virtual memory performance depends how you use it - if your frequently used values all fit in memory, with occasional disk reads for less popular content, there won't be that much difference from having the full dataset in memory. If your access requirements are completely random and a value only gets read once after being loaded from disk it won't perform any better than reading files from disk. Of course any caching solution will have problems in that scenario.
There are a couple of things to consider when designing for redis vm though:
All keys need to fit in memory, even if their values don't. This works well if your value is a large serialized object, but not so well if your value is a single number or short string and may actually be smaller than the key.
The save process works a little differently when VM is used. I don't recall the details, but you may need to consider what is guaranteed to be written to disk when.

MongoDB vs. Redis vs. Cassandra for a fast-write, temporary row storage solution

I'm building a system that tracks and verifies ad impressions and clicks. This means that there are a lot of insert commands (about 90/second average, peaking at 250) and some read operations, but the focus is on performance and making it blazing-fast.
The system is currently on MongoDB, but I've been introduced to Cassandra and Redis since then. Would it be a good idea to go to one of these two solutions, rather than stay on MongoDB? Why or why not?
Thank you
For a harvesting solution like this, I would recommend a multi-stage approach. Redis is good at real time communication. Redis is designed as an in-memory key/value store and inherits some very nice benefits of being a memory database: O(1) list operations. For as long as there is RAM to use on a server, Redis will not slow down pushing to the end of your lists which is good when you need to insert items at such an extreme rate. Unfortunately, Redis can't operate with data sets larger than the amount of RAM you have (it only writes to disk, reading is for restarting the server or in case of a system crash) and scaling has to be done by you and your application. (A common way is to spread keys across numerous servers, which is implemented by some Redis drivers especially those for Ruby on Rails.) Redis also has support for simple publish/subscribe messenging, which can be useful at times as well.
In this scenario, Redis is "stage one." For each specific type of event you create a list in Redis with a unique name; for example we have "page viewed" and "link clicked." For simplicity we want to make sure the data in each list is the same structure; link clicked may have a user token, link name and URL, while the page viewed may only have the user token and URL. Your first concern is just getting the fact it happened and whatever absolutely neccesary data you need is pushed.
Next we have some simple processing workers that take this frantically inserted information off of Redis' hands, by asking it to take an item off the end of the list and hand it over. The worker can make any adjustments/deduplication/ID lookups needed to properly file the data and hand it off to a more permanent storage site. Fire up as many of these workers as you need to keep Redis' memory load bearable. You could write the workers in anything you wish (Node.js, C#, Java, ...) as long as it has a Redis driver (most web languages do now) and one for your desired storage (SQL, Mongo, etc.)
MongoDB is good at document storage. Unlike Redis it is able to deal with databases larger than RAM and it supports sharding/replication on it's own. An advantage of MongoDB over SQL-based options is that you don't have to have a predetermined schema, you're free to change the way data is stored however you want at any time.
I would, however, suggest Redis or Mongo for the "step one" phase of holding data for processing and use a traditional SQL setup (Postgres or MSSQL, perhaps) to store post-processed data. Tracking client behavior sounds like relational data to me, since you may want to go "Show me everyone who views this page" or "How many pages did this person view on this given day" or "What day had the most viewers in total?". There may be even more complex joins or queries for analytic purposes you come up with, and mature SQL solutions can do a lot of this filtering for you; NoSQL (Mongo or Redis specifically) can't do joins or complex queries across varied sets of data.
I currently work for a very large ad network and we write to flat files :)
I'm personally a Mongo fan, but frankly, Redis and Cassandra are unlikely to perform either better or worse. I mean, all you're doing is throwing stuff into memory and then flushing to disk in the background (both Mongo and Redis do this).
If you're looking for blazing fast speed, the other option is to keep several impressions in local memory and then flush them disk every minute or so. Of course, this is basically what Mongo and Redis do for you. Not a real compelling reason to move.
All three solutions (four if you count flat-files) will give you blazing fast writes. The non-relational (nosql) solutions will give you tunable fault-tolerance as well for the purposes of disaster recovery.
In terms of scale, our test environment, with only three MongoDB nodes, can handle 2-3k mixed transactions per second. At 8 nodes, we can handle 12k-15k mixed transactions per second. Cassandra can scale even higher. 250 reads is (or should be) no problem.
The more important question is, what do you want to do with this data? Operational reporting? Time-series analysis? Ad-hoc pattern analysis? real-time reporting?
MongoDB is a good option if you want the ability to do ad-hoc analysis based on multiple attributes within a collection. You can put up to 40 indexes on a collection, though the indexes will be stored in-memory, so watch for size. But the result is a flexible analytical solution.
Cassandra is a key-value store. You define a static column or set of columns that will act as your primary index right up front. All queries run against Cassandra should be tuned to this index. You can put a secondary on it, but that's about as far as it goes. You can, of course, use MapReduce to scan the store for non-key attribution, but it will be just that: a serial scan through the store. Cassandra also doesn't have the notion of "like" or regex operations on the server nodes. If you want to find all customers where the first name starts with "Alex", you'll have to scan through the entire collection, pull the first name out for each entry and run it through a client-side regex.
I'm not familiar enough with Redis to speak intelligently about it. Sorry.
If you are evaluating non-relational platforms, you might also want to consider CouchDB and Riak.
Hope this helps.
Just found this: http://blog.axant.it/archives/236
Quoting the most interesting part:
This second graph is about Redis RPUSH vs Mongo $PUSH vs Mongo insert, and I find this graph to be really interesting. Up to 5000 entries mongodb $push is faster even when compared to Redis RPUSH, then it becames incredibly slow, probably the mongodb array type has linear insertion time and so it becomes slower and slower. mongodb might gain a bit of performances by exposing a constant time insertion list type, but even with the linear time array type (which can guarantee constant time look-up) it has its applications for small sets of data.
I guess everything depends at least on data type and volume. Best advice probably would be to benchmark on your typical dataset and see yourself.
According to the Benchmarking Top NoSQL Databases (download here)
I recommend Cassandra.
If you have the choice (and need to move away from flat fies) I would go with Redis. Its blazingly fast, will comfortably handle the load you're talking about, but more importantly you won't have to manage the flushing/IO code. I understand its pretty straight forward but less code to manage is better than more.
You will also get horizontal scaling options with Redis that you may not get with file based caching.
I can get around 30k inserts/sec with MongoDB on a simple $350 Dell. If you only need around 2k inserts/sec, I would stick with MongoDB and shard it for scalability. Maybe also look into doing something with Node.js or something similar to make things more asynchronous.
The problem with inserts into databases is that they usually require writing to a random block on disk for each insert. What you want is something that only writes to disk every 10 inserts or so, ideally to sequential blocks.
Flat files are good. Summary statistics (eg total hits per page) can be obtained from flat files in a scalable manner using merge-sorty map-reducy type algorithms. It's not too hard to roll your own.
SQLite now supports Write Ahead Logging, which may also provide adequate performance.
I have hand-on experience with mongodb, couchdb and cassandra. I converted a lot of files to base64 string and insert these string into nosql.
mongodb is the fastest. cassandra is slowest. couchdb is slow too.
I think mysql would be much faster than all of them, but I didn't try mysql for my test case yet.