Why are Voldemort reads so fast? - voldemort

The Voldemort doc https://www.project-voldemort.com/voldemort/ says:
"Voldemort combines in memory caching with the storage system so that a separate caching tier is not required (instead the storage system itself is just fast)"
What does it do with the storage system, besides using SSDs, that makes it so fast?

The storage mechanism is a hash table of key-value pairs.
This is akin to an indexed relational table with two fields: key and data.
No transactions, atomic operations, constraints, or triggers.
Operations don't require complex locking, and conflict resolution happens on a single value.
In-memory storage or caching is efficient, since each value is stored independently rather than as part of a row, i.e. there are no empty fields in a "row".
Data partitioning and distribution allow any free server holding the data to respond, within restrictions.
As a storage system this is very efficient.
The drawback is the lack of features: the features whose absence makes it fast must be implemented on the front end, which shifts processing load between the layers.
For write-once (or rarely updated), read-many objects such as blog posts, where most of the higher-level features are not needed, this is a better solution for high traffic.
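To make the "hash table of key-value pairs" point concrete, here is a minimal conceptual sketch in Python of the access pattern described above. It is not the actual Voldemort client API (which is Java); the class, key names, and last-write-wins resolution are illustrative assumptions only.

    # Conceptual sketch of the key-value access pattern described above.
    # Not the real Voldemort client API; names and conflict handling are illustrative.
    import time


    class KeyValueStore:
        """A toy store: one opaque value per key, no rows, no joins, no transactions."""

        def __init__(self):
            self._data = {}  # key -> (timestamp, value)

        def put(self, key, value):
            # Single-value conflict resolution: last write wins on this one key.
            self._data[key] = (time.time(), value)

        def get(self, key):
            entry = self._data.get(key)
            return entry[1] if entry else None


    store = KeyValueStore()
    store.put("post:42", b'{"title": "Hello", "body": "..."}')  # whole blog post as one blob
    print(store.get("post:42"))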

Related

How is MongoDB or any NoSQL DB (HBase, Cassandra) scalable, and what advantage does it have over a traditional RDBMS?

I am still not able to see in practice how NoSQL is beneficial when we have indexes in traditional RDBMSs too. Can someone explain the advantages of columnar databases in a real application, particularly for structured, semi-structured, or unstructured data?
Largely, it depends on what you want your datastore to do. If you want to be able to scale to meet storage or operational demands, an RDBMS can only take you so far.
It comes down to how you can scale to meet demand. An RDBMS is really only capable of scaling vertically. That is, add more RAM, add more disk, etc. A distributed (NoSQL) database makes scaling easier by allowing you to add more machine instances. This is known as scaling horizontally.
Here's an example using Cassandra:
Let's say I have a 3 node cluster, and my keyspace (database) is also configured with a replication factor (RF) of 3. This means that each node is responsible for 100% of the data. I load my data, and it takes up 100GB of disk space (on each node). Now, while I might have 300GB of data total in my cluster, a single copy of my data is 100GB.
So my product team comes to me and says they need to double the amount of data they have. I know that I built their 3 node cluster with 200GB drives. If I did nothing, those drives would pretty much fill up (and even if they didn't, they wouldn't leave room for much else).
Now it's up to me to scale the cluster to meet their space demands. I'll start by adding 3 new nodes to the cluster (for a total of 6), but I'll leave my RF at 3. This makes each node responsible for 50% of the data, or 50GB. When my product team loads more data to meet their "doubling" requirement, each node should climb back up to about 100GB. A single copy of the data is now 200GB. But with each node responsible for 50%, each 200GB drive still only has 100GB.
Example #2:
Let's say that the cluster above with 6 nodes is capable of supporting an operational load of 10,000 operations per second (ops). My product team comes to me again, saying that for the holiday season they project needing to support 20,000 ops. As the current cluster can only support half of that, it will choke under the intense throughput, and one or more nodes may crash.
As Cassandra scales linearly, the way to achieve this is to (again) double the size of the cluster. So I increase it from 6 nodes to 12 nodes, while still maintaining my RF of 3. After running some performance testing, they verify that it can indeed support 20,000 ops. As a single copy of my data is 200GB, the total data footprint remains 600GB. With 12 nodes, each node is now responsible for only 25% of the data, or 50GB.
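The arithmetic behind both examples is just the replication factor spread over the node count. A quick sketch, using the numbers from the examples above:

    # Per-node storage share in a Cassandra-style cluster: each node holds RF/N of the data set.
    def per_node_gb(single_copy_gb, replication_factor, nodes):
        return single_copy_gb * replication_factor / nodes

    print(per_node_gb(100, 3, 3))   # 100.0 GB per node: 3 nodes, RF 3 -> every node holds everything
    print(per_node_gb(200, 3, 6))   # 100.0 GB per node after doubling both the data and the nodes
    print(per_node_gb(200, 3, 12))  # 50.0 GB per node in the 12-node throughput example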
So scalability is the advantage. But how about modeling the data? The main idea in distributed database modeling is two-fold:
Build a table structure which is keyed to distribute well. We don't want uneven amounts of data on each node.
Build the key on the table so that it matches our query requirements.
One of the drawbacks of a NoSQL database is that your query patterns become restricted. In an effort to cut down on network time, you want to ensure that your query can be served by a single node.
This usually means using natural keys, as those are more in-line with what you are asking of your data. Surrogate keys (alpha, numerical, or both) distribute well, but aren't really useful for querying. User "Bob Jones" might be id "3582346556230" in my system. But when I want to query Bob's data, I'll probably never want to ask for it by "3582346556230," because that doesn't mean anything to the application or the context in which the data is used.
Also, you want your data to have structure. Unstructured data is un-queryable data. Simple as that. If you want unstructured data to be queryable, you need to parse out its identifying aspects to be used as keys. You don't want to "search" or run SELECT * FROM queries. Full table scans in NoSQL databases are even more resource-consuming than their RDBMS counterparts, because they have to check each node and sort through replicas, and thus incur extra network time.
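As an illustration of keying a table so it distributes well and matches the query, here is a hedged sketch using the DataStax Python driver; the keyspace, table, and column names are made up for the example, and a local Cassandra node is assumed.

    # Sketch: a table keyed by a natural key so one query hits one partition.
    # Assumes a local Cassandra node and an existing keyspace named "shop"; names are illustrative.
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("shop")

    # Partition key = customer email (natural key), clustering key = order time,
    # so "all orders for this customer, newest first" is served by a single node.
    session.execute("""
        CREATE TABLE IF NOT EXISTS orders_by_customer (
            customer_email text,
            order_ts timestamp,
            order_id uuid,
            total decimal,
            PRIMARY KEY ((customer_email), order_ts)
        ) WITH CLUSTERING ORDER BY (order_ts DESC)
    """)

    rows = session.execute(
        "SELECT order_id, total FROM orders_by_customer WHERE customer_email = %s LIMIT 10",
        ("bob.jones@example.com",),
    )
    for row in rows:
        print(row.order_id, row.total)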
NoSQL databases give you the ability to scale (for increases in data or demand). But it's important to note that their scalability can make some things (which a RDBMS might be good at), more difficult than you're used to.
The R in RDBMS, relational, is the biggest thing missing from Mongo. There's very little to no way to make the database understand how entries in different collections (Mongo's equivalent of tables) relate to each other. One of the big strengths of RDBMSs is the ability to define constraints which the database will enforce, most typically foreign key constraints which ensure that an id in one table refers to an existing id in another table.
One requirement for the database to be able to enforce such constraints is obviously that everything needs to go through one source of truth and there needs to be one central entity cross-checking the data; it cannot be decentralised since discrepancies between two different primary sources can lead to data inconsistencies.
In Mongo, each data blob is pretty much independent. It doesn't refer to other entries in any way enforced by the database. Mongo also has weak to no ACID guarantees, meaning there's little protection against race conditions on inserts or updates. In short: Mongo makes few guarantees with regard to data consistency and mostly offloads these kinds of concerns to the application layer. That allows it to work in a more decentralised way.
E.g. a good way to scale Mongo is to have many secondary servers which replicate a primary server for read-only access. There's no guarantee that the primary and secondaries will be in sync at any given time; it may take a couple of seconds for data written to the primary to trickle to the secondaries. But this allows you to have a virtually unlimited number of secondary read-only servers, which is great for scaling a database under heavy read load.
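With pymongo, directing reads at the secondaries looks roughly like this; the host names, replica set name, database, and collection are assumptions for the sketch.

    # Sketch: scale reads by sending them to secondaries; data may lag the primary slightly.
    from pymongo import MongoClient, ReadPreference

    client = MongoClient("mongodb://db1,db2,db3/?replicaSet=rs0")  # hypothetical hosts/replica set
    posts = client.blog.posts.with_options(read_preference=ReadPreference.SECONDARY_PREFERRED)

    # Reads can be served by any secondary; writes still go to the primary.
    for doc in posts.find().sort("created_at", -1).limit(10):
        print(doc["_id"])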
The way specifically Mongo handles its clusters also allows it to have a very high uptime, as the cluster will reorganise itself into primaries and secondaries automatically if a server goes down. This even allows for rolling maintenance without any client downtime.
Not having to enforce complex constraints or transactional consistency during writes also allows a more fire-and-forget style of writing to the database, which can be much faster. Again, at the cost of allowing inconsistent data. That is why most writing boils down to atomically updating a single document in a collection, with no guarantees about other documents, which is a rather different paradigm from RDBMS transactional updates across many tables.
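A sketch of that single-document style of write with pymongo; the database, collection, and field names are illustrative assumptions.

    # Sketch: the unit of atomicity is one document; this update is atomic,
    # but nothing coordinates it with updates to other documents.
    from pymongo import MongoClient

    tweets = MongoClient().twitter_clone.tweets  # hypothetical database/collection

    tweets.update_one(
        {"_id": "tweet-123"},
        {
            "$inc": {"like_count": 1},         # counter bumped atomically
            "$push": {"liked_by": "user-42"},  # array updated in the same atomic step
        },
    )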
I would not recommend Mongo for storing things like a financial ledger, which heavily relies on transactional guarantees for consistency. However, things like Twitter are a perfect case for it: many independent snippets of data which must be read by a massive number of clients.

When to use DynamoDB - use cases

I've tried to figure out the best use cases for Amazon DynamoDB.
When I googled, most blogs say DynamoDB should be used only for large amounts of data (big data).
I have a relational DB background; NoSQL DBs are new to me, so I've tried to relate this to my normal relational DB knowledge.
Most of the concepts in DynamoDB revolve around creating a schema-less table with partition keys/sort keys and querying based on those keys. Also, there is no concept of a stored procedure, which would make queries easier and simpler.
If we are managing such huge amounts of data, is doing such complex queries every single time we retrieve data the correct approach without stored procedures?
Note: I may have misunderstood the concepts, so please help clear up my thinking here.
Thanks in advance
Jay
In short, systems like DynamoDB are designed to support big data sets (too big to fit on a single server) and high write/read throughput by scaling horizontally, as opposed to scaling vertically, which has historically been the more common approach for relational databases.
The main approach to support horizontal scalability is by partitioning data, i.e. a data set is split into multiple pieces and distributed among multiple servers. This way it may use more storage and more IOPS, allowing bigger data sets and higher read/write throughput.
However, data partitioning makes it difficult to support complex queries such as joins, as data is distributed among multiple physical servers. As for stored procedures, they are not supported for the same reason: historically, the idea behind stored procedures is data locality, i.e. they run on the server near the data without network operations; if data is distributed among multiple servers, this benefit disappears (at least in the form of a stored procedure).
Therefore the most efficient way to query data from such systems is by record key, as data partitioning is based on a key and it's easy to figure out where a record lives physically for a given key. While many such systems also support secondary indexes, they are usually restricted in some way or expensive, and may not be enough to satisfy the requirements of a complex software solution. A quite common approach is to have a complementary indexing/query solution (I've seen solutions based on Elasticsearch and Solr), which allows running complex queries over some fragments of records to figure out a record key, which is then used to load the record.
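Querying by record key with boto3 looks roughly like this; the table name, partition key, and sort key are assumptions for the sketch.

    # Sketch: key-based access in DynamoDB via boto3. A table "orders" with
    # partition key "customer_id" and sort key "order_date" is an assumption.
    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("orders")

    # Point read by full primary key.
    item = table.get_item(Key={"customer_id": "c-42", "order_date": "2021-06-01"}).get("Item")

    # Range query within one partition: all of this customer's 2021 orders.
    resp = table.query(
        KeyConditionExpression=Key("customer_id").eq("c-42") & Key("order_date").begins_with("2021-")
    )
    for order in resp["Items"]:
        print(order["order_date"])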

How does the storage backend influence Datomic?

How should I pick the backend storage service for Datomic?
Is it a matter of preference to select, say, DynamoDB instead of Postgres, or does each option have different tradeoffs? If so, what are they?
Storage Services Requirements
Datomic's storage services should generally meet 3 requirements:
Implement key-value store semantics: efficient read/write access to values by indexed key
Support consistent reads. e.g. read your own writes. Ideally, no-contention/lock-free reads.
Support conditional puts. e.g. optimistic locking + snapshot isolation.
Datomic uses storage services to store blocks of sorted, compressed datoms, similar to the way traditional database systems use file systems, and the requirements above are pretty much the API between the underlying storage service and Datomic. So the choice of storage service depends on how well it supports those three requirements.
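To make the "conditional put" requirement concrete, here is a generic compare-and-swap sketch in Python. It is not the Datomic API or any particular storage service's API, just an illustration of the contract a storage backend has to offer; all names are made up.

    # Sketch of the conditional-put (compare-and-swap) contract a storage service
    # must provide: write the new value only if the key still holds the expected one.
    import threading


    class Storage:
        def __init__(self):
            self._data = {}
            self._lock = threading.Lock()

        def get(self, key):
            return self._data.get(key)

        def conditional_put(self, key, expected, new_value):
            """Return True and store new_value only if the current value equals expected."""
            with self._lock:
                if self._data.get(key) != expected:
                    return False  # someone else won the race; the caller must re-read and retry
                self._data[key] = new_value
                return True


    storage = Storage()
    storage.conditional_put("db-root", None, "tree-v1")               # first write succeeds
    print(storage.conditional_put("db-root", "tree-v0", "tree-v2"))   # False: stale expectation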
Write Scalability
Datomic doesn't usually put a lot of write pressure on the underlying storage service since there's only one component writing to it, the Transactor. Also, Datomic uses a background indexing job to integrate novelty into storage once enough of it has been accumulated (by default ~32MB but can be configured) which further reduces the constant write load. The only thing Datomic immediately writes is the transaction log.
Read Scalability
Datomic uses multiple layers of caching, i.e. memcached and peer caches, so in ideal circumstances, i.e. when the working set fits in memory, the system won't put a lot of read pressure on storage either.
System Load
If your system doesn't require huge write scalability and your application data tends to fit in memory, then the choice of a particular storage service is irrelevant except, of course, for their operational capabilities (backups, admin tools, etc.) which have nothing to do with Datomic.
If, on the other hand, your system does require huge write scalability, or you have a great number of peers, each of them working with more data than can fit in their memory (forcing a lot of data segments to be brought in from storage), you'll require a storage system that can scale horizontally, e.g. DynamoDB. As mentioned in one of the comments, if you need arbitrary write scalability, Datomic is not the right system for you anyway.

Is there a data storage where I can access data directly via array index instead of hash key? Redis? MongoDB?

I need an external, memory-efficient (!) C/C++ data store for a Java app which does not have the downside of a normal database lookup (B-tree) but which uses my IDs as array indices. Is there an open source solution for this? I implemented this in C++ in-memory only, but I would like to have a "store to disk" option in case of a crash or for backup. A Java binding would also be cool.
E.g. Redis looks good, but when reading the docs I see that in general things are accessed by hash keys, which are O(1) only in theory - or can I somehow force the hashing scheme to match the storage index? And lists are not appropriate either, as they are implemented as linked lists. Or what about MongoDB?
And yes, I really need that fast read access (writes can be "okayish slow" :)) - it is not premature optimization, but if there is no alternative I'll try Redis before rolling my own. Also, Java is not possible (as I said: memory efficient ;))
With a remote key-value store, the overhead is very often dominated by the network and protocol management rather than data access itself. That's why with efficient key-value stores (like Redis for instance), almost all the operations actually have the same cost.
The Redis benchmark page contains a good illustration of this point.
In other words, in the context of an in-memory remote store, and considering only latency, a random-access array will have exactly the same performance as a hash table, and even less efficient O(log n) containers like red-black trees, B-trees, etc. will be quite close.
If you really want maximum performance, I would suggest using an embedded (i.e. in-process) store. For instance, both BerkeleyDB and Tokyo Cabinet provide disk-based random-access containers for fixed-length records.
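As a rough sketch of the in-process approach, here is what keyed-by-ID access looks like using Python's standard-library dbm as a stand-in for an embedded key-value store such as BerkeleyDB or Tokyo Cabinet; the file name and record format are arbitrary assumptions.

    # Sketch: an embedded (in-process) key-value store, so reads skip the network entirely.
    # dbm is a stdlib stand-in here; BerkeleyDB / Tokyo Cabinet bindings follow the same shape.
    import dbm
    import struct

    with dbm.open("records.db", "c") as db:
        # Use the numeric ID directly as the key, packed as fixed-width bytes.
        record_id = 12345
        db[struct.pack(">Q", record_id)] = b"payload bytes for record 12345"

        value = db[struct.pack(">Q", record_id)]
        print(value)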
KDB is the go-to solution for this problem in the financial systems (algo trading) world. Be prepared to have your brain melted by the syntax though. Oh, and it is not open source.

mongoDB vs relational databases when data can't fit into memory?

First of all, I apologize for my potentially shallow understanding of NoSQL architecture (and databases in general) so try to bear with me.
I'm thinking of using MongoDB to store resources associated with a UUID. The resources can be things such as large image files (tens of megabytes), so it makes sense to store them as files and store just links in my database along with the associated metadata. There's also the added flexibility of decoupling the actual location of the resource files, so I can use a different third party to store the files if I need to.
Now, one document which describes a resource would be about 1kB. At first I expect a couple hundred thousand resource documents, which would equal some hundreds of megabytes in database size, easily fitting into server memory. But in the future I might have to scale this into the order of tens of MILLIONS of documents. This would be tens of gigabytes which I can't squeeze into server memory anymore.
Only the index could still fit in memory, being around a gigabyte or two. But if I understand correctly, I'd have to read from disk every time I did a lookup on a UUID. Is there a substantial speed benefit from MongoDB over a traditional relational database in such a situation?
BONUS QUESTION: is there an existing, established way of doing what I'm trying to achieve? :)
MongoDB doesn't suddenly become slow the second the entire database no longer fits into physical memory. MongoDB currently uses a storage engine based on memory-mapped files. This means data that is accessed often will usually be in memory (OS managed, but assume an LRU scheme or something similar).
As such, it may not slow down at all at that point, or only slightly; it really depends on your data access patterns. It's a similar story with indexes: if you balance your index appropriately and your use case allows it, you can have a huge index with only a fraction of it in physical memory and still have very decent performance, with the majority of index hits happening in physical memory.
Because you're talking about UUIDs this might all be a bit hard to achieve, since there's no guarantee that the same limited group of users is generating the vast majority of throughput. In those cases sharding really is the most appropriate way to maintain quality of service.
"This would be tens of gigabytes which I can't squeeze into server memory anymore."
That's why MongoDB gives you sharding to partition your data across multiple mongod instances (or replica sets).
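Enabling sharding on such a collection might look roughly like this with pymongo, run against a mongos router; the router address, database, collection, and hashed-UUID key choice are assumptions for the sketch.

    # Sketch: shard the resources collection on a hashed UUID so documents
    # spread evenly across shards. Run against a mongos router, not a plain mongod.
    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos-host:27017")  # hypothetical router address

    client.admin.command("enableSharding", "mediadb")
    client.admin.command("shardCollection", "mediadb.resources", key={"uuid": "hashed"})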
In addition to considering sharding, or maybe even before that, you should also try to use covered indexes as much as possible, especially if they fit your use cases.
This way you do not HAVE to load entire documents into memory. Your indexes can help out.
http://www.mongodb.org/display/DOCS/Retrieving+a+Subset+of+Fields#RetrievingaSubsetofFields-CoveredIndexes
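A covered query with pymongo might look like this (database, collection, and field names are assumed): the query filters and projects only fields that are in the compound index, and excludes _id, so MongoDB can answer it from the index alone.

    # Sketch: a covered query. The compound index contains every field the query
    # touches, and _id is excluded, so no full documents need to be fetched.
    from pymongo import MongoClient, ASCENDING

    resources = MongoClient().media.resources  # hypothetical database/collection

    resources.create_index([("uuid", ASCENDING), ("url", ASCENDING), ("size_bytes", ASCENDING)])

    doc = resources.find_one(
        {"uuid": "example-uuid"},
        {"_id": 0, "url": 1, "size_bytes": 1},  # projection limited to indexed fields
    )
    print(doc)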
If you have to display your entire document all the time based on the id, then the general rule of thumb is to try to keep the working set in memory.
http://blog.boxedice.com/2010/12/13/mongodb-monitoring-keep-in-it-ram/
This is one of the resources that talks about that. There is a video on MongoDB's site too that covers it.
By sizing RAM so that the working set fits in memory, and also looking at sharding, you will not have to shard right away; you can always add sharding later. This will improve the scalability of your app over time.
Again, these are not absolute statements, just general guidelines; you should think through your usage patterns and make sure they are relevant to what you are doing.
Personally, I have not had the need to fit everything in ram.