I need an advice about NoSQL/MongoDb and data/models structure - mongodb

Recently I'm exploring NoSQL Databases. I need an advice about how to store data in the most optimal and efficient way for a given problem. I'm targeting MongoDB, now. However it should be the same with CouchDB.
Let's say we have these 3 Models:
Story:
id
title
User:
id
name
Vote:
id
story_id
user_id
I want to be able to ask the database these questions:
Who has voted for this Story?
What this User has Voted for?
I'm doing simple joins while working with a relational DB. The question is, how should I store the data for those objects in order to be most efficient.
For example, if I store the Vote objects as a subcollection of Stories it wont be easy to get the info - "What a user has voted for".

I would suggest storing votes as a list of story _ids in each user. That way you can find out what stories a user has voted for just by looking at the list. To get the users who have voted for a story you can do something like:
db.users.find({stories: story_id})
where story_id is the _id of the story in question. If you create an index on the stories field both of those queries will be fast.

don't worry if your queries are efficient until it starts to matter
according to below quote, you're doing it wrong
The way I have been going about the
mind switch is to forget about the
database alltogether. In the
relational db world you always have to
worry about data normalization and
your table structure. Ditch it all.
Just layout your web page. Lay them
all out. Now look at them. Your
already 2/3 there. If you forget the
notion that database size matters and
data shouldn't be duplicated than your
3/4 there and you didnt even have to
write any code! Let your views dictate
your Models. You don't have to take
your objects and make them 2
dimensional anymore as in the
relational world. You can store
objects with shape now.
how-to-think-in-data-stores-instead-of-databases

Ok, you haven given a normalized data model as you would do in an SQL setup.
In my understanding you don't do this in MongoDB. You could store references, but you do not for performance reasons in the general case.
I'm not an expert in the NoSQL area in no way, but why don't you simply follow your needs and store the user (ids) that have voted for a story in the stories collection and the story (ids) a user has voted for in the users collection?

In CouchDB this is very simple. One view emits:
function(doc) {
if(doc.type == "vote") {
emit(doc.story_id, doc.user_id);
}
}
Another view emits:
function(doc) {
if(doc.type == "vote") {
emit(doc.user_id, doc.story_id);
}
}
Both are queries extremely fast since there is no join. If you do need user data or story data, CouchDB supports multi-document fetch. Also quite fast and is one way to do a "join".

I've been looking into MongoDB and CouchDB a lot lately, but my insight is limited. Still, when thinking about storing the votes inside the story document, you might have to worry about hitting the 4MB document size limit. Even if you don't, you might be constantly increasing the size of the document enough to cause it to get moved and thus slowing down your writes (see how documents are sized in MongoDB).
As for CouchDB, these kinds of things are quite simple, elegant, and quite fast once the view indexes are calculated. Personally, however, I have hesitated to do a similar project in CouchDB because of benchmarks showing it progressively slowing down to a considerable degree as the database grows (and the view indexes grow). I'd love to see some more recent benchmarks showing CouchDB performance as database size increases. I WANT to try MongoDB or CouchDB, but SQL still seems so efficient and logical, so I'll stay with it until the project fits the temptation just right.

Related

NoSQL how to lookup id in a collection

NoSQL noob here. I'm building an app using Firestore NoSQL. I'm looping through items where every item has a owner id (creator user id).
I want to display owner's name on the listing page. In traditional SQL, i have foreign key so I can just make reference to say, Item.Owner.FirstName
What's the best practice in NoSQL? Should I be saving owner name as a field at the time of saving the item? or do a lookup of each owner id to get user object whilst i'm looping through items?
Second option sounds expensive so i'm assuming 1st way is the way to go. Unless there's a better, more accepted way?
Both will work. You either reference the data in the other document in whatever way you see fit, or you duplicate information into the document that you intend to query to build the display. You just have to decide what which problem you want to deal with:
If you duplicate data among documents (known as "denormalization"), then you'll have to put effort into keeping them all up to date with each other, if that's what you require. So, writing one document might actually turn into writing multiple documents.
If you normalize your data with no duplication, then each of your queries will require more queries to get the related data from other documents. This could result in a drop in performance and an increase in cost for apps with heavy read loads.
Since we don't know the performance requirements and usage behavior of your app, there is no way to give specific advice. You will have to think carefully about which problem you want to have, perhaps based on complexity, performance, and overall cost.

Mongodb: about performance and schema design

After learning about performance and schema design in MongoDB, I still can´t figure out how would I make the schema design in an application when performance is a must.
Let´s imagine if we have to make YouTube to work with MongoDB as its database. How would you make the schema?
OPTION 1: two collections (videos collection and comments collection)
Pros: adding, deleting and editing comments affects only the comments collection, therefore these operations would be more efficient.
Cons: Retrieving videos and comments would be 2 different queries to the database, one for videos and one for comments.
OPTION 2: single collection (videos collection with the comments embedded)
Pros: You retrieve videos and its comments with a single query.
Cons: Adding, deleting and editing comments affect the video Document, therefore these operations would be less efficient.
So what do you think? Are my guesses true?
As a caller in the desert, I have to say that embedding should only be used under very special circumstances:
The relation is a "One-To(-Very)-Few" and it is absolutely sure that no document will ever exceed this limit. A good example would be the relation between "users" and "email addresses" – a user is unlikely to have millions of them and there isn't even a problem with artificial limits: setting the maximum number of addresses as user can have to, say 50 hardly would cause a problem. It may be unlikely that a video gets millions of comments, but you don't want to impose an artificial limit on it, right?
Updates do not happen very often. If documents increase in size beyond a certain threshold, they might be moved, since documents are guaranteed to be never fragmented. However, document migrations are expensive and you want to prevent them.
Basically, all operations on comments become more complicated and hence more expensive - a bad choice. KISS!
I have written an article about the above, which describes the respective problems in greater detail.
And furthermore, I do not see any advantage in having the comments with the videos. The questions to answer would be
For a given user, what are the videos?
What are the newest videos (with certain tags)?
For a given video, what are the comments?
Note that the only connection between videos and comments here is about a given video, so you already have the _id or something else to positively identify the video. Furthermore, you don't want to load all comments at once, especially if you have a lot of them, since this would decrease UX because of long load times.
Let's say it is the _id. So, with it, you'd be able to have paged comments easily:
db.comments.find({"video_id": idToFind})
.skip( (page-1) * pageSize )
.limit( pageSize )
hth
As usual the answer is, it depends. As as a rule of thumb you should favour embedding, unless you need to regularly query the embedded objects on its own or if the embedded array is likely to get too large(>~100 records). Using this guideline, there are a few questions you need to ask regarding your application.
How is your application going to access the data ? Are you only ever going to show the comments on the same page as the associated video ? Or do you want to provide the options to show all comments for a given user across all movies ? The first scenario favours embedding (one collection), whereas you probably would be better of with two collections in the second scenario.
Secondly, how many comments do you expect for each video ? Taking the analogy of IMDB, you could easily expect more than 100 comments for a popular video, so that means you are better off creating two separate collections as the embedded array of comments would grow large quite quickly. I wouldn't be too concerned about the overhead of an application join, they are generally comparable in speed compared to a server-side join in a relational database provided your collections are properly indexed.
Finally, how often are users going to update their comments after their initial post ? If you lock the comments after 5 minutes like on StackOverflow users may not update their comments very often. In that case the overhead of updating or deleting comments in the video collection will be negligible and may even be outweigh the cost of performing a second query in a separate comments collection.
You should use embedded for better performance. Your I/O's will be lesser. In worst case? it might take a bit long to persist the document in the DB but it wont take much time to retrieve it.
You should either compromise persistence over reads or vise versa depending on your application needs.
Hence it is important to choose your db wisely.

Mass Update NoSQL Documents: Bad Practice?

I'm storing two collections in a MongoDB database:
==Websites==
id
nickname
url
==Checks==
id
website_id
status
I want to display a list of check statuses with the appropriate website nickname.
For example:
[Google, 200] << (basically a join in SQL-world)
I have thousands of checks and only a few websites.
Which is more efficient?
Store the nickname of the website within the "check" directly. This means if the nickname is ever changed, I'll have to perform a mass update of thousands of documents.
Return a multidimensional array where the site ID is the key and the nickname is the value. This is to be used when iterating through the list of checks.
I've read that #1 isn't too bad (in the NoSQL) world and may, in fact, be preferred? True?
If it's only a few websites I'd go with option 1 - not as clean and normalized as in the relational/SQL world but it works and much less painful than trying to emulate joins with MongoDB. The thing to remember with MongoDB or any other NoSQL database is that you are generally making some kind of trade off - nothing is for free. I personally really value the schema-less document oriented data design and for the applications I use it for I readily make the trade-offs (like no joins and transactions).
That said, this is a trade-off - so one thing to always be asking yourself in this situation is why am I using MongoDB or some other NoSQL database? Yes, it's trendy and "hot", but I'd make certain that what you are doing makes sense for a NoSQL approach. If you are spending a lot of time working around the lack of joins and foreign keys, no transactions and other things you're used to in the SQL world I'd think seriously about whether this is the best fit for your problem.
You might consider a 3rd option: Get rid of the Checks collection and embed the checks for each website as an array in each Websites document.
This way you avoid any JOINs and you avoid inconsistencies, because it is impossible for a Check to exist without the Website it belongs to.
This, however, is only recommended when the checks array for each document stays relatively constant over time and doesn't grow constantly. Rapidly growing documents should be avoided in MongoDB, because everytime a document doubles its size, it is moved to a different location in the physical file it is stored in, which slows down write-operations. Also, MongoDB has a 16MB limit per document. This limit exists mostly to discourage growing documents.
You haven't said what a Check actually is in your application. When it is a list of tasks you perform periodically and only make occasional changes to, there would be nothing wrong with embedding. But when you collect the historical results of all checks you ever did, I would rather recommend to put each result(set?) in an own document to avoid document growth.

What is the correct noSQL collection structure for the following case?

As someone who got used to thinking in relational terms, I am trying to get a grasp of thinking in the "noSQL way".
Assume the following scenario:
We have a blog (eg. 9gag.com) with many posts and registered users. Every post can be liked by each user. We would like to build a recommendation engine, so we need to track:
all posts viewed by a user
all posts liked by a user
Posts have: title, body, category. Users have: username, password, email, other data.
In a relational DB we would have something like: posts, users, posts_users_views (post_id, users_id, view_date), posts_users_likes (post_id, user_id, like_date).
Question
What is the "right" structure would be in a document/column oriented noSQL database?
Clarification: Should we save an array of all viewed/liked post ids in users (or user ids in posts)? If so, won't we have a problem with a row size getting huge?
In CouchDB you could have separate documents for the user, post, view and like. Showing the views/likes by user can be arranged by the "view" (materialized map/reduce query) with map function emitting an array key [user_id, post_id]. As the result you will get the sorted dictionary (ordered lexicographically by the key), so taking all the views per user='ID' is the query with keys starting from [ID] to [ID,{}]. You can optimize it, but the basic solution is very simple.
In CouchDB wiki there is a comment on using relationally modeled design and view collation mechanism (which can substitute some simple joins). To get some intuition I rather advice to study the problems of post and comments, which is also very simple, but not that much trivial as view and likes :)
There may be no NoSQL way, but I think most of the map/reduce systems share similar type of thinking. The CouchDB is a good tool to start, because it is very limited :) It is difficult to do any queries inefficient in distributed environment and its map and reduce query functions cannot have side effects (they are generating the materialized view, incrementally when the document set is changed, and the result should not depend on the order of document updates).

MongoDB - Denormalization / model opinion

I've been getting in to mongo, but coming from RDBMS background facing the probably obvious questions with regards to denormalisation and general data modelling.
If I have a document type with an array of sub docs, each sub doc has a status code.
In The relational world I would add a foreign key to the record, StatusId, simple.
In mongodb, would you denormalise the key pieces of data from the "status" e.g. Code and desc and hold objectid referencing another collection of proper status. I guess the next question is one of design, if the status doc is modified I'd then need to modified the denormalised data?
Another question on the same theme is how would you model a transaction table, say I have events and people, the events could be quite granular, say time sheets which over time may lead to many records. Based on what I've seen, this would seem like a good candidate for a child / sub array of docs, of course that could be indexed for speed.
Therefore is it possible to query / find just the sub array or part of it? And given the 16mb limit for doc size, and I just limited the transaction history of the person? Or should the transaction history be a separate collection with a onjid referencing the person?
Thanks for any input
Sam
Or should the transaction history be a separate collection with a onjid referencing the person?
Probably, I think this S/O question may help you understand why.
if the status doc is modified I'd then need to modified the denormalised data?
Yes this is standard trade-off in MongoDB. You will encounter this question a lot. You may need to leverage a Queue structure to ensure that data remains consistent across multiple collections.
Therefore is it possible to query / find just the sub array or part of it?
This is a tough one specific to MongoDB. With the basic query syntax, you have only limited support for dealing with arrays of objects. The new "Aggregration Framework" is actually much better here, but it's not available in a stable build.
All your "how to model this or that" can't really be answered, because good schema design depends on so many factors (access patters, hardware characteristics, is cluster used, etc).
if the status doc is modified I'd then need to modified the denormalised data?
Usually yes, that's the drawback of denormalisation. But sometimes you don't have to (some social network site stores user name with a photo tag and doesn't update it when user changes his name).
to query / find just the sub array or part of it?
It is not currently possible to fetch only a part of array (unless using map/reduce, of course).
And given the 4mb limit
Where did you get this from? It's 16mb at the moment.
While it's true that schema design does take into account many factors, the need to denormalize data usually comes up somewhere. I tend to take advantage of denormalization in my apps that use MongoDB because I feel it lends itself well storing denormalized data:
no additional column maintenance
support for hashes and arrays as field types (perfect for storing denormalized fields)
speedy, non-blocking writes make syncing data less expensive
document size growth only marginally affects performance up to limits (for the most part)
There are a few gems that help you manage denormalized data, including setting it up and keeping it in sync. If you're using Mongoid, you try mongoid_alize. DISCLAIMER: I am the author and maintainer of mongoid_alize.