What are the uses of a key-value database? - nosql

I've heard about highly scalable key-value databases, like Amazon DynamoDB and Kyoto Cabinet.
But I can't see a use for them in everyday problems.
Ex.: Suppose I have a "post object", which has an id, a title, a content and some tags, and I want to query the database to find all posts with some particular tag. How can I do it with a key-value database?

Key-value stores are extremely fast (many keep their data in memory) and often provide only eventual consistency. Many standard database features, such as ACID guarantees, may not exist, but neither do some of the usual limitations, such as struggling under a very high write rate.
The idea is to stop thinking of a key-value store as a plain hashtable and instead treat it as a denormalized form of storage whose layout is designed around the queries you want to run against it.
For your specific question, I would use the tags as keys and, for each tag, the list of ids of the posts carrying it as the value. Then bingo...
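A minimal sketch of that idea, assuming a plain Python dict as a stand-in for the key-value store (the `post:` and `tag:` key prefixes and the helper names are hypothetical, not part of any particular product's API):

```python
# Hypothetical in-memory stand-in for a key-value store: one flat namespace of keys.
kv = {}

def put_post(post_id, title, content, tags):
    """Store the post under its own key and index it under every tag key."""
    kv[f"post:{post_id}"] = {
        "id": post_id,
        "title": title,
        "content": content,
        "tags": list(tags),
    }
    for tag in tags:
        ids = kv.setdefault(f"tag:{tag}", [])
        if post_id not in ids:
            ids.append(post_id)

def posts_with_tag(tag):
    """One lookup for the tag key, then one lookup per post id."""
    return [kv[f"post:{pid}"] for pid in kv.get(f"tag:{tag}", [])]

put_post(1, "Intro to NoSQL", "...", ["nosql", "databases"])
put_post(2, "Tuning MySQL", "...", ["sql", "databases"])
titles = [p["title"] for p in posts_with_tag("databases")]
```

The price of this layout is that every write has to keep the tag keys in sync with the post itself, which is the denormalization trade-off described above.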

Related

How is it possible for DynamoDB to support both Key-Value and Document database properties at the same time

As per DynamoDB's documentation, it supports both key-value and document-oriented properties of NoSQL, even though other NoSQL databases fall under only one type: either key-value, document, graph, or column-oriented.
Also it says
Amazon DynamoDB is "built on the principles of Dynamo"[3] and is a hosted service within the AWS infrastructure. However, while Dynamo is based on leaderless replication, DynamoDB uses single-leader replication.
And Dynamo is
A set of techniques that together can form a highly available key-value structured storage system[1] or a distributed data store
So DynamoDB is built on the principles of Dynamo, which is not related to document-oriented storage systems, and it is mandatory for a developer to create a primary key, since the table requires a key for every item. How, and in what sense, can DynamoDB then be called a document-oriented database?
Can a DB fall under two types of NoSQL databases in the first place?
First, it is important to realize that "Dynamo" was an earlier NoSQL database designed by Amazon, and its design was made public in 2007 (e.g., see https://www.allthingsdistributed.com/2007/10/amazons_dynamo.html). Other people later took this design, and other contemporary designs (like Google's BigTable), and improved on them, resulting in projects such as Cassandra (2008). Amazon's DynamoDB was only released in 2012, based on many ideas from those other systems (and especially Cassandra), and had very little in common with the original "Dynamo". So almost anything you can say about the original "Dynamo" would not be relevant when you discuss the modern DynamoDB.
Now regarding your main question:
A key-value store holds a single value for each key. Arguably, if the value can be an entire document, you can call this database a "document store". In this sense, DynamoDB is a document store. The DynamoDB API lets you conveniently store a JSON document as the value, and also read or write part of this document directly instead of reading or writing the entire document (although you actually pay for reading and writing the entire document).
You should note that DynamoDB, like Cassandra and BigTable (and unlike the original "Dynamo"), actually gives you more than that: each so-called "partition key" can hold not just one value (or document), but a sorted list of such values. I mentioned this interesting feature, which I don't know what to call, in my question
How do you call the data model of DynamoDB and Cassandra?
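As a rough illustration of the "partition key holding a sorted list of values" model described in this answer, here is a toy in-memory sketch. It is not the DynamoDB or Cassandra API; `table`, `put_item` and `query_range` are hypothetical names chosen for the example.

```python
import bisect

# partition key -> list of (sort_key, value) pairs, kept ordered by sort_key
table = {}

def put_item(partition_key, sort_key, value):
    """Insert or overwrite the item at (partition_key, sort_key), keeping order."""
    items = table.setdefault(partition_key, [])
    keys = [k for k, _ in items]
    i = bisect.bisect_left(keys, sort_key)
    if i < len(items) and items[i][0] == sort_key:
        items[i] = (sort_key, value)        # same key: replace the value
    else:
        items.insert(i, (sort_key, value))  # new key: insert at its sorted position

def query_range(partition_key, low, high):
    """Return all items in one partition with low <= sort_key <= high."""
    items = table.get(partition_key, [])
    keys = [k for k, _ in items]
    start = bisect.bisect_left(keys, low)
    end = bisect.bisect_right(keys, high)
    return items[start:end]

put_item("user#1", "2023-01-05", "login")
put_item("user#1", "2023-01-02", "signup")
put_item("user#1", "2023-02-01", "purchase")
put_item("user#2", "2023-01-03", "signup")
january = query_range("user#1", "2023-01-01", "2023-01-31")
```

A plain key-value store would only answer "give me the value for this exact key"; the sorted list inside each partition is what makes range queries like the one above cheap.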

Designing mongodb data model: embedding vs. referencing

I'm writing an application that gathers statistics of users across multiple social networks accounts. I have a collection of users and I would like to store the statistics information of each user.
Now, I have two options:
Create a collection that stores users statistics documents, and add a reference object to each of the user documents that links it to the corresponding document in the statistics collection.
Embed a statistics document in each of the users document.
Besides query performance (which I'm less concerned about):
what are the pros and cons of each of these approaches?
What should I take into account if I choose to use references rather than embedding the information inside the user document?
The shape of the data is determined by the application itself.
There’s a good chance that whenever you work with the user data, you will also need the statistics details.
The decision about what to put in the document is pretty much determined by how the data is used by the application.
Data that is used together with the user documents is a good candidate to be pre-joined, i.e. embedded.
One limitation of this approach is document size: a single document can be at most 16 MB.
Another approach is to split data between multiple collections.
One limitation of this approach is that MongoDB has no referential constraints, so there are no foreign key constraints either.
The database does not guarantee the consistency of references between collections. It is up to you as a programmer to make sure that your data has no orphans.
Data from multiple collections can be joined with the $lookup aggregation stage. But a collection is a separate file on disk, so seeking across multiple collections means seeking in multiple files, and that is, as you are probably guessing, slow.
Generally speaking, embedded data is the preferable approach.
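To make the "no orphans" responsibility concrete, here is a small sketch in plain Python, with lists of dicts standing in for the two collections. The field name `stats_id` and the helper `find_orphans` are assumptions made for the example, not MongoDB features.

```python
# Embedded variant: one document carries everything, so nothing can dangle.
embedded_user = {"_id": 1, "name": "ann", "statistics": {"followers": 120}}

# Referenced variant: two "collections" linked by a hypothetical stats_id field.
users = [
    {"_id": 1, "name": "ann", "stats_id": 10},
    {"_id": 2, "name": "bob", "stats_id": 11},  # points at a missing document
]
statistics = [
    {"_id": 10, "followers": 120},
    {"_id": 12, "followers": 7},                # no user points here
]

def find_orphans(users, statistics):
    """Return (dangling references, unreferenced statistics ids)."""
    referenced = {u["stats_id"] for u in users}
    existing = {s["_id"] for s in statistics}
    dangling = sorted(referenced - existing)
    orphaned = sorted(existing - referenced)
    return dangling, orphaned

dangling, orphaned = find_orphans(users, statistics)
```

With references, a check like this (or careful write paths that make it unnecessary) is the application's job, because the database will happily store both inconsistent states.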

How to design the tables/entities in NoSQL DB?

I am taking my first steps with NoSQL databases, so I would like to hear the best practices for implementing the following requirement.
Let's suppose I have a messages database, which is powered by the MongoDB engine. This DB contains a collection of documents, where each document has the following fields:
time stamp;
message author/source;
message content.
Now, I want to build a list of authors/sources in order to add some metadata about each source. In the case of the classical RDBMS, I would define a table tblSources where I would store the names of the message sources and all additional meta-data (or links to the relevant tables) for each author.
What is the right approach to such task in NoSQL/MongoDB world?
It really depends on how you want to use the data. NoSQL dbs are generally not designed with fast joins in mind but they are still capable of doing joins and storing foreign keys.
Your options here are really:
Duplicate the data, i.e. store the author metadata in every document. This might be better if you are really trying to optimize lookups and use Mongo as a key-value store.
Join on a foreign key. This is pretty similar to how you would use an RDBMS.
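The two options can be sketched side by side in plain Python, with lists of dicts standing in for MongoDB collections. All field names (`source`, `name`, `homepage`, `source_meta`) and the helper `join_messages` are hypothetical.

```python
# Option 1: duplicate the author metadata inside every message document.
duplicated_message = {
    "timestamp": "2020-01-01T10:00:00Z",
    "source": "alice",
    "source_meta": {"homepage": "https://example.com/alice"},
    "content": "hello",
}

# Option 2: keep metadata in its own collection and join on the source name.
messages = [
    {"timestamp": "2020-01-01T10:00:00Z", "source": "alice", "content": "hello"},
    {"timestamp": "2020-01-01T10:05:00Z", "source": "carol", "content": "hi"},
]
sources = [
    {"name": "alice", "homepage": "https://example.com/alice"},
]

def join_messages(messages, sources):
    """Application-side join: attach each message's source metadata, or None."""
    by_name = {s["name"]: s for s in sources}
    return [{**m, "source_meta": by_name.get(m["source"])} for m in messages]

joined = join_messages(messages, sources)
```

Option 1 answers "show this message with its author details" in a single read but repeats the metadata; option 2 stores it once but needs the extra lookup shown in `join_messages`.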

When should I create a new collections in MongoDB?

So just a quick best practice question here. How do I know when I should create new collections in MongoDB?
I have an app that queries TV show data. Should each show have its own collection, or should they all be stored within one collection with the relevant data in the same document? Please explain why you chose the approach you did. (I'm still very new to MongoDB. I'm used to MySQL.)
The Two Most Popular Approaches to Schema Design in MongoDB
Embed data into documents and store them in a single collection.
Normalize data across multiple collections.
Embedding Data
There are several reasons why MongoDB was designed without traditional joins across collections (the $lookup aggregation stage was only added in v3.2), and I won't get into all of them here. But the main reason why we rarely need joins is that we can embed relevant data into a single hierarchical JSON document. We can think of it as pre-joining the data before we store it. In the relational database world, this amounts to denormalizing our data. In MongoDB, this is about the most routine thing we can do.
Normalizing Data
Even without relying on joins, we can still store related data across multiple collections and still get to it all, albeit in a roundabout way. This requires us to store a reference to a key from one collection inside another collection. It sounds similar to relational databases, but MongoDB doesn't enforce any key constraints for us like most relational databases do. Enforcing key constraints is left entirely up to us. We're good enough to manage it though, right?
Accessing all related data in this way means we're required to make at least one query for every collection the data is stored across. It's up to each of us to decide if we can live with that.
When to Embed Data
Embed data when that embedded data will be accessed at the same time as the rest of the document. Pre-joining data that is frequently used together reduces the amount of code we have to write to query across multiple collections. It also reduces the number of round trips to the server.
Embed data when that embedded data only pertains to that single document. Like most rules, we need to give this some thought before blindly following it. If we're storing an address for a user, we don't need to create a separate collection to store addresses just because the user might have a roommate with the same address. Remember, we're not normalizing here, so duplicating data to some degree is ok.
Embed data when you need "transaction-like" writes. Prior to v4.0, MongoDB did not support multi-document transactions, though it has always guaranteed that a single document write is atomic: it'll write the document or it won't. Writes across multiple collections could not be made atomic, and update anomalies could occur in any number of scenarios we can imagine. This is no longer the case since v4.0; however, it is still more typical to denormalize data to avoid the need for transactions.
When to Normalize Data
Normalize data when data that applies to many documents changes frequently. So here we're talking about "one to many" relationships. If we have a large number of documents that have a city field with the value "New York" and all of a sudden the city of New York decides to change its name to "New-New York", well then we have to update a lot of documents. Got anomalies? In cases like this where we suspect other cities will follow suit and change their name, then we'd be better off creating a cities collection containing a single document for each city.
Normalize data when data grows frequently. When documents grow, they have to be moved on disk. If we're embedding data that frequently grows beyond its allotted space, that document will have to be moved often. Since these documents are bigger each time they're moved, the process only grows more complex and won't get any better over time. By normalizing those embedded parts that grow frequently, we eliminate the need for the entire document to be moved.
Normalize data when the document is expected to grow larger than 16MB. Documents have a 16MB limit in MongoDB. That's just the way things are. We should start breaking them up into multiple collections if we ever approach that limit.
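The city rename scenario from the first rule can be sketched in plain Python to show how many writes each layout needs. The data and the helpers `rename_embedded` and `rename_normalized` are hypothetical; lists and dicts stand in for collections.

```python
# Embedded layout: every user document carries its own copy of the city name.
embedded_users = [
    {"name": "ann", "city": "New York"},
    {"name": "bob", "city": "New York"},
    {"name": "cat", "city": "Boston"},
    {"name": "dan", "city": "New York"},
]

# Normalized layout: users reference a single document in a cities collection.
cities = {1: {"name": "New York"}, 2: {"name": "Boston"}}
normalized_users = [
    {"name": "ann", "city_id": 1},
    {"name": "bob", "city_id": 1},
    {"name": "cat", "city_id": 2},
    {"name": "dan", "city_id": 1},
]

def rename_embedded(users, old, new):
    """Rewrite every document holding the old name; return how many changed."""
    touched = 0
    for user in users:
        if user["city"] == old:
            user["city"] = new
            touched += 1
    return touched

def rename_normalized(cities, city_id, new):
    """Rewrite the single city document; return how many documents changed."""
    cities[city_id]["name"] = new
    return 1

embedded_writes = rename_embedded(embedded_users, "New York", "New-New York")
normalized_writes = rename_normalized(cities, 1, "New-New York")
```

In the embedded layout the number of writes grows with the number of users in the renamed city, and a failure halfway through leaves the data inconsistent; the normalized layout always needs one write, at the cost of an extra lookup on every read.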
The Most Important Consideration to Schema Design in MongoDB is...
How our applications access and use data. This requires us to think? Ugh! What data is used together? What data is used mostly as read-only? What data is written to frequently? Let your application's data access patterns drive your schema, not the other way around.
The scope you've described is definitely not too much for "one collection". In fact, being able to store everything in a single place is the whole point of a MongoDB collection.
For the most part, you don't want to be thinking about querying across combined tables as you would in SQL. Unlike SQL, MongoDB lets you avoid thinking in terms of "JOINs"; in fact, MongoDB historically did not support them natively at all (the $lookup aggregation stage was only added in v3.2).
See this slideshare:
http://www.slideshare.net/mongodb/migrating-from-rdbms-to-mongodb?related=1
Specifically look at slides 24 onward. Note how a MongoDB schema is meant to replace the multi-table schemas customary to SQL and RDBMS.
In MongoDB a single document holds all information regarding a record. All records are stored in a single collection.
Also see this question:
MongoDB query multiple collections at once

Pseudo primary keys in MongoDB - bad idea?

http://code.flickr.com/blog/page/4/
This blog post is from the devs at Flickr, and outlines their simplified approach to generating GUIDs for photos in a sharded database environment using MySQL.
I am working on an app that uses MongoDB for data store that has a similar requirement for items stored in embedded documents. Basically, a document in the collection represents a list of items, and then individual items inside that document each need to have some kind of identifier as well for lookup purposes. I'd rather not put items in a different collection since the list keys that aren't items are really just metadata and don't need to have their own collection. Ideally it should be one document.
I was thinking the kind of approach detailed in the blog post could be implemented to solve this problem: one endpoint that generates GUIDs for these entries and saves the last used value. The problem is that I am not certain whether this approach introduces problems when sharding the data store in Mongo. I don't have any experience distributing Mongo over several machines. I assume I could have the application layer check this endpoint when the data is saved and set the _id key appropriately, but I don't know how this would affect queries against the data set.
Would setting up this kind of GUID system be a flawed idea? I realize it runs counter to some of the principles of NoSQL in general, but since the documents are embedded, what alternative is there?
I think ObjectId is the way to go. ObjectIds are stored much more compactly than GUIDs/UUIDs and maintain a roughly increasing order, which has benefits for indexing. They are also designed to be generated client-side, without the need for a ticket server as described in the article. The only real downside compared with Flickr's solution is that an ObjectId uses 12 bytes while an int64 uses 8 (GUIDs/UUIDs use 16 in binary or 32 in hex, plus a few bytes of overhead). One other potential downside (which is more likely to be a benefit in most cases) is that the creation time is encoded in the ObjectId, so if ObjectIds are used as publicly visible identifiers, they can leak possibly unwanted information to users, such as when another user signed up for your service.
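To see the creation-time leak concretely: the first 4 of an ObjectId's 12 bytes are a big-endian count of seconds since the Unix epoch. The sketch below decodes it with the standard library only, so no MongoDB driver is assumed; `objectid_timestamp` is a hypothetical helper name and the id is just a sample value.

```python
from datetime import datetime, timezone

def objectid_timestamp(oid_hex):
    """Decode the creation time embedded in a 24-character hex ObjectId."""
    raw = bytes.fromhex(oid_hex)
    if len(raw) != 12:
        raise ValueError("an ObjectId is exactly 12 bytes (24 hex characters)")
    seconds = int.from_bytes(raw[:4], "big")  # leading 4 bytes: Unix timestamp
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

created = objectid_timestamp("507f1f77bcf86cd799439011")
print(created.isoformat())  # 2012-10-17T21:13:27+00:00
```

Anyone who can see such an id can run the same decoding, which is why it matters whether ObjectIds appear in public URLs.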