DB Schema Design - mongodb

I'm currently trying to model a DB schema in MongoDB. The bit I'm getting stuck on is where an employee must indicate the times they are available to work.
I.e.
Monday:
9AM-12PM, 2:00PM-6:00PM
Tuesday:
8AM-10AM, 12:00PM-2:00PM, 4:00PM-6:00PM
etc.
I could just have an embedded field in my schema with a list of times, but I'm not sure that's the best solution.
Opinions?

There is no universal rule when it comes to schema design. I would store a list of numerical ranges, where a range's unit is seconds from the beginning of the workweek. This way it would be possible to have MongoDB search directly for available personnel in a single query. Date manipulation should not be a problem on a modern platform.
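A minimal Python sketch of that idea, under a few assumptions that are not in the answer: the workweek starts on Monday 00:00, each range is stored as a `start`/`end` pair, and the names `to_week_seconds`, `is_available` and `shift_filter` are made up for illustration. The filter at the end is only the shape such a query might take with `$elemMatch`; it is not run against a server here.

```python
# Sketch: availability stored as [start, end] ranges in seconds from Monday 00:00.
# All names and the document layout are illustrative assumptions.

DAY = 24 * 60 * 60
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def to_week_seconds(day, hour, minute=0):
    """Convert a weekday name and 24-hour clock time to seconds from Monday 00:00."""
    return DAYS.index(day) * DAY + hour * 3600 + minute * 60


# The question's example: Monday 9AM-12PM and 2PM-6PM, Tuesday 8AM-10AM, 12PM-2PM, 4PM-6PM.
employee = {
    "name": "example employee",
    "availability": [
        {"start": to_week_seconds("Mon", 9), "end": to_week_seconds("Mon", 12)},
        {"start": to_week_seconds("Mon", 14), "end": to_week_seconds("Mon", 18)},
        {"start": to_week_seconds("Tue", 8), "end": to_week_seconds("Tue", 10)},
        {"start": to_week_seconds("Tue", 12), "end": to_week_seconds("Tue", 14)},
        {"start": to_week_seconds("Tue", 16), "end": to_week_seconds("Tue", 18)},
    ],
}


def is_available(doc, shift_start, shift_end):
    """True if a single availability range fully covers the requested shift."""
    return any(
        r["start"] <= shift_start and shift_end <= r["end"]
        for r in doc["availability"]
    )


def shift_filter(shift_start, shift_end):
    """Query filter with the same meaning; $elemMatch keeps both conditions
    on the same array element."""
    return {
        "availability": {
            "$elemMatch": {"start": {"$lte": shift_start}, "end": {"$gte": shift_end}}
        }
    }
```

Storing plain integers keeps the comparison a simple numeric range check, which is what makes the single-query search possible.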

Related

Designing MongoDB collections vs relational approach

I have difficulties in designing my MongoDB collections to fit my requirements. I only used SQL databases in my previous projects and am quite new to the NoSQL concept of MongoDB. My current learning project to get that concept is to store and retrieve statistics of games played (enhanced leaderboard example). In a relational database I would create the following tables:
Matches
:_id
:game_id (reference to the type of game played)
:startedAt
:endedAt
Results
:match_id
:player_id (reference to the users collection)
:field_id
:value
A match can have n players and each can have n results. Depending on the type of game, multiple result values indicated by field_id need to be entered for every player (e.g. number of points and whether the user won or not -> two fields = two rows in the results table).
As I understood that in MongoDB the concept is to store related information in one collection I tried to ignore what I did with relational DBs in the past year and created the following collection structure:
Matches
:_id
:game_id
:startedAt
:endedAt
:players [{
:player_id
:results [{
:field_id
:value
}]
}]
However, I now have difficulty calculating the overall results for a specific player. Queries such as "calculate the total sum of points player A had in game B" are quite complex, and I fear that the performance is very bad. Therefore I would still prefer the relational model for this case. But as I wanted to learn the concepts of a NoSQL database, I still wonder whether I have just misunderstood the DB and there is a good way to structure the data in a single collection for these queries.
Any recommendations are highly appreciated.
I am new to MongoDB, recently started learning but here is what I know so far:
NoSQL databases, such as MongoDB, are mainly used for their scalability and flexibility. For simple and small projects, I don't see a benefit.
The case you described is a classic case where SQL should be used.
If I was chosen to create that database and I HAD to use MongoDB, I would do it this way:
1) Keep the collection you created for the matches and
2) add a new collection based on the players.
The 2nd collection would be used for the leaderboards and for everything player-based. This means that there would be duplicate data but there is no other way to deal with the player searches, the ones you will need to do for the leaderboards.
Maybe it could work if only the latest matches were saved but still I see no benefit.
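To illustrate the second, player-based collection, here is a rough Python sketch that derives per-player totals from match documents in the nested layout from the question. The sample documents, the `points`/`won` field ids and the function name `build_player_totals` are all hypothetical; in a real setup the totals would live in their own collection and be updated whenever a match is saved.

```python
from collections import defaultdict

# Hypothetical match documents in the nested layout from the question.
matches = [
    {"_id": 1, "game_id": "B", "players": [
        {"player_id": "A", "results": [{"field_id": "points", "value": 10},
                                       {"field_id": "won", "value": 1}]},
        {"player_id": "C", "results": [{"field_id": "points", "value": 7}]},
    ]},
    {"_id": 2, "game_id": "B", "players": [
        {"player_id": "A", "results": [{"field_id": "points", "value": 5}]},
    ]},
]


def build_player_totals(match_docs):
    """Sum every result value per (player_id, game_id, field_id)."""
    totals = defaultdict(int)
    for match in match_docs:
        for player in match["players"]:
            for result in player["results"]:
                key = (player["player_id"], match["game_id"], result["field_id"])
                totals[key] += result["value"]
    return dict(totals)


# The duplicated, player-centric view used for leaderboards.
player_totals = build_player_totals(matches)
```

With this view, "total points player A had in game B" becomes a single lookup, at the cost of keeping the duplicated data in sync.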
As I mentioned before I am in the learning process too, so I am not 100% sure.
Good luck with your project.

How to store query output in temp db?

I am really new to the programming but I am studying it. I have one problem which I don't know how to solve.
I have collection of docs in mongoDB and I'm using Elasticsearch to query the fields. The problem is I want to store the output of search back in mongoDB but in different DB. I know that I have to create temporary DB which has to be updated with every search result. But how to do this? Or give me documentation to read so I could learn it. I will really appreciate your help!
Mongo does not natively support "temp" collections.
A typical approach here is not to write the entire results output to another DB at all. That would be pointless, since Elasticsearch does its own caching and you don't need another layer on top.
Also, due to IO concerns, it is normally a bad idea to write, say, a result set of 10k records to Mongo or another DB.
There is a feature request for what you talk of: https://jira.mongodb.org/browse/SERVER-3215 but no planning as of yet.
Example
You could have a table of results.
Within this table you would have a doc that looks like:
{keywords: ['bok', 'mongodb']}
Each time you search and scroll through each result item you would write a row to this table populating the keywords field with keywords from that search result. This would be per search result per search result list per search. It would probably be best to just stream each search result to MongoDB as they come in. I have never programmed Python (though I wish to learn) so an example in pseudo:
var elasticResults = [{text: 'bok mongodb'}]; // stand-in for the results returned by Elasticsearch; the `text` field is an assumption
elasticResults.forEach(function (result) {
    // split the phrases in this result down into a keywords array
    var keywords = result.text.split(/\s+/);
    // just a lazy insert per result: no need to batch or shrink the data into one go, stream it in
    db.results_collection.insert({keywords: keywords});
});
So as you go through your results you basically just insert as fast as possible, creating a sort of "stream" of input to MongoDB. It can do this quite well.
This should then give you a shardable list of words and language verbs to run things like map-reduce jobs (MRs) on and to aggregate statistics about.
Without knowing more and more about your scenario this is pretty much my best answer.
This does not use the temp table concept but instead makes your data permanent which is fine by the sounds of it since you wish to use Mongo as a storage engine for further tasks.
Actually there is MongoDB river plugin to work with Elasticsearch...
db.your_table.find().forEach(function(doc) { db.another_table.insert(doc); } );

Is MongoDB a good fit for this?

The system I'm building is essentially an issue tracking system, but with various issue templates. Some issue types will have different formats than others.
I was originally planning on using MySQL with a main issues table and an issues_meta table that contains key => value pairs. However, I'm thinking NoSQL (MongoDB) might be the better option.
Can MongoDB provide me with the ability to generate "standard"
reports, like # of issues by type, # of issues by type by month, # of
issues assigned per person, etc? I ask this because I've read a few
sources that said Mongo was bad at reporting.
I'm also planning on storing my audit logs in Mongo, since I want a single "table" for all actions (Modifications to any table). In Mongo I can store each field that was changed easily, since it is schemaless. Is this a bad idea?
Anything else I should know, and will Mongo work for what I want?
I think MongoDB will be a perfect match for that use case.
MongoDB collections are heterogeneous, meaning you can store documents with different fields in the same bag. So different reporting templates won't be a show stopper. You will be able to model a full issue with a single document.
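To make the reporting part of the question concrete, here is a small Python sketch of the "# of issues by type" and "# of issues by type by month" counts over documents that do not share the same fields. The sample documents and field names (`type`, `created`, `severity`, ...) are invented for illustration. In MongoDB itself this kind of grouping would normally run server-side, for example with the aggregation framework's `$group` stage, rather than in application code.

```python
from collections import Counter
from datetime import datetime

# Hypothetical issue documents; only "type" and "created" are common to all templates.
issues = [
    {"type": "bug", "created": datetime(2013, 1, 5), "severity": "high"},
    {"type": "bug", "created": datetime(2013, 2, 9), "steps": ["open", "click"]},
    {"type": "feature", "created": datetime(2013, 1, 20), "votes": 3},
]

# Number of issues by type.
by_type = Counter(doc["type"] for doc in issues)

# Number of issues by type by month, keyed by (type, "YYYY-MM").
by_type_month = Counter(
    (doc["type"], doc["created"].strftime("%Y-%m")) for doc in issues
)
```

The reports only depend on the fields every template shares, so the template-specific fields do not get in the way.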
MongoDB would be a good fit for logging too. You may be interested in capped collections.
Should you need relational associations between documents, you can have those too.
If you are using Ruby, I can recommend Mongoid. It will make things easier. Also, it has support for versioning of documents.
MongoDB will definitely work (and you can use capped collections to automatically drop old records, if you want), but you should ask yourself: does it fit this task well? For the use case you've described, a better option is Redis (simple and fast enough) or Riak (if you care a lot about your log data).

Best NoSql for querying date ranges?

Given a store which is a collection of JSON documents in the (approximate) form of:
{
PeriodStart: 18/04/2011 17:10:49
PeriodEnd: 18/04/2011 17:15:54
Count: 12902
Max: 23041 Min: 0
Mean: 102.86 StdDev: 560.97
},
{
PeriodStart: 18/04/2011 17:15:49
PeriodEnd: 18/04/2011 17:20:54
Count: 10000
Max: 23041 Min: 0
Mean: 102.86 StdDev: 560.97
}... etc
If I want to query the collection for given date range (say all documents from last 24 hours), which would give me the easiest querying operations to do this?
To further elaborate on requirements:
It's for an application monitoring service, so strict CAP/ACID isn't necessarily required
Performance isn't a primary consideration either. Reads/writes would be at most 10s per second, which could be handled by an RDBMS anyway
Ability to handle changing document schemas would be desirable
Ease of querying ability of lists/sets is important (ad-hoc queries an advantage)
I may not have your query requirements down exactly, as you didn't specify. However, if you need to find any documents that start or end in a particular range, then you can apply most of what is written below. If that isn't quite what you're after, I can be more helpful with a bit more direction. :)
If you use CouchDB, you can create your indexes by splitting up the parts of your date into an array. ([year, month, day, hour, minute, second, ...])
Your map function would probably look similar to:
function (doc) {
  var date = new Date(doc.PeriodStart);
  // getMonth() is zero-based, so add 1 to emit calendar months (1-12)
  emit([date.getFullYear(), date.getMonth() + 1, date.getDate(), date.getHours(), date.getMinutes()], null);
}
To perform any sort of range query, you'd need to convert your start and end times into this same array structure. From there, your view query would have params called startkey and endkey. They would receive the array parameters for start and end respectively.
So, to find the documents that started in the past 24 hours, you would send a querystring like this in addition to the full URI for the view itself:
// start: Apr 17, 2011 12:30pm ("24 hours ago")
// end: Apr 18, 2011 12:30pm ("today")
startkey=[2011,4,17,12,30]&endkey=[2011,4,18,12,30]
Or if you want everything from this current year:
startkey=[2011]&endkey=[2011,{}]
Note the {}. When used as an endkey: [2011,{}] is identical to [2012] when the view is collated. (either format will work)
The extra components of the array will simply be ignored, but the further specificity you add to your arrays, the more specific your range can be. Adding reduce functions can be really powerful here, if you add in the group_level parameter, but that's beyond the scope of your question.
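For reference, a hedged Python sketch of building those startkey/endkey parameters on the client side. It assumes the view emits `[year, month, day, hour, minute]` keys with calendar months (1-12); the helper names `to_key` and `range_params` are made up for this example, and the view URL itself is left out.

```python
import json
from datetime import datetime, timedelta
from urllib.parse import urlencode


def to_key(moment):
    """Array key in the same shape as the view index (month 1-12)."""
    return [moment.year, moment.month, moment.day, moment.hour, moment.minute]


def range_params(start, end):
    """URL-encoded startkey/endkey parameters; the keys are JSON-encoded arrays."""
    return urlencode({
        "startkey": json.dumps(to_key(start), separators=(",", ":")),
        "endkey": json.dumps(to_key(end), separators=(",", ":")),
    })


# "The past 24 hours", using the timestamps from the example above.
end = datetime(2011, 4, 18, 12, 30)
start = end - timedelta(hours=24)
query = range_params(start, end)
```

The resulting string would be appended to the view URI after a `?`.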
[Update edited to match edit to original question]
Short answer, (almost) any of them will work.
BigTable databases are a great platform for monitoring services (log analysis, etc). I prefer Cassandra (Super Column Families, secondary indexes, atomic increment coming soon), but HBase will work for you too. Structure the date value so that its lexicographic ordering is the same as the date ordering. Fixed-length strings following the format "YYYYMMDDHHmmss" work nicely for this. If you use this string as your key, range queries will be very simple to perform.
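A short Python sketch of why the fixed-length "YYYYMMDDHHmmss" format works as a key: plain string comparison gives the same order as comparing the dates themselves. The function names are illustrative only, and nothing here talks to Cassandra or HBase.

```python
from datetime import datetime


def to_sortable_key(moment):
    """Fixed-length YYYYMMDDHHmmss string; string order equals time order."""
    return moment.strftime("%Y%m%d%H%M%S")


moments = [
    datetime(2011, 4, 18, 17, 15, 49),
    datetime(2011, 4, 18, 17, 10, 49),
    datetime(2010, 12, 31, 23, 59, 59),
]

# Sorting the strings yields chronological order.
keys = sorted(to_sortable_key(m) for m in moments)


def in_range(key, start_key, end_key):
    """Range check done purely on strings, as a key-ordered store would do it."""
    return start_key <= key <= end_key
```

The zero padding is what matters: without fixed-length fields, "2011-4-9" would sort after "2011-4-10".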
Handling changing schema is a breeze - just add more columns to the column family. They don't need to be defined ahead of time.
I probably wouldn't use graph databases for this problem, as it'll probably summarize to traversing a linked list. However, I don't have a ton of experience with graph databases, so take this advice with a grain of salt.
[Update: some of this is moot since the question was edited, but I'm keeping it for posterity]
Is this all you're doing with this database? The big problem with selecting a NoSQL database isn't finding one that supports one query requirement well. The problem is finding one that supports all of your query requirements well. Also, what are your operational requirements? Can you accept a single point of failure? What kind of setup/maintenance overhead are you willing to tolerate? Can you sacrifice low latency for high-throughput batch operations, or is realtime your gig?
Hope this helps!
It seems to me that the easiest way to implement what you want is performing a range query in a search engine like ElasticSearch.
I, for one, certainly would not want to write all the map/reduce code for CouchDB (because I did in the past). Also, based on my experience (YMMV), range queries will outperform CouchDB's views and use much less resources for large datasets.
Not to mention you can compute interesting statistics with "date histogram" facets in ElasticSearch.
ElasticSearch is schema-free, JSON based, so you should be able to evaluate it for your case in a very short time.
I've decided to go with Mongo for the time being.
I found that setup/deployment was relatively easy, and the C# wrapper was adequate for what we're trying to do (and in the cases where it's not we can resort to JavaScript queries easily).
What you want is whichever one gives you access to some kind of spatial index. Most of these work off of B-Trees and/or hashes, neither of which is particularly good for spatial indexing.
Now, if your definition of "last 24 hours" is simply "starts or ends within the last 24 hours" then a B-Tree may be fine (you do two queries, one on PeriodStart and then one on PeriodEnd, both being within range of the time window).
But if the PeriodStart to PeriodEnd is longer than the time window, then neither of these will be as much help to you.
Either way, that's what you're looking for.
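To show the case being described, here is a small Python sketch of an interval-overlap test. It is not the spatial index the answer recommends, just the underlying condition: a period overlaps the window when it starts before the window ends and ends after the window starts. The function name and sample periods are made up for illustration.

```python
from datetime import datetime


def overlaps(period_start, period_end, window_start, window_end):
    """True if [period_start, period_end] intersects [window_start, window_end]."""
    return period_start <= window_end and period_end >= window_start


window = (datetime(2011, 4, 18, 0, 0), datetime(2011, 4, 19, 0, 0))

# Starts and ends inside the window.
inside = (datetime(2011, 4, 18, 17, 10, 49), datetime(2011, 4, 18, 17, 15, 54))
# Starts before and ends after the window: neither endpoint lies inside it,
# so the "start in window OR end in window" pair of queries would miss it.
spanning = (datetime(2011, 4, 17, 0, 0), datetime(2011, 4, 20, 0, 0))
# Ends before the window begins.
earlier = (datetime(2011, 4, 16, 0, 0), datetime(2011, 4, 17, 0, 0))
```

The `spanning` period is the one that motivates looking beyond two independent single-field range queries.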
This question explains how to query a date range in CouchDB. You would need your data to be in a lexicographically sortable state, in all the examples I've seen.
Since this is tagged Redis and nobody has answered that aspect I'm going to put forth a solution for it.
Step one, store your documents under a given redis key, as a hash or perhaps as a JSON string.
Step two, add the redis key (let's call it a DocID) to a sorted set, with the document's time converted to a UNIX timestamp as the score. For example, where r is a redis Connection instance in the Python redis client library:
mydocs:Doc12 => [JSON string of the doc]
In Python:
r.set('mydocs:Doc12', JSONStringOfDocument)
timeindex:documents, DocID, Timestamp:
In Python:
r.zadd('timeindex:documents', {'Doc12': timestamp})  # redis-py 3.0+ takes a {member: score} mapping
In effect you are building an index of documents based on UNIX timestamps.
To get documents from a range of time, you use zrangebyscore (or zrevrangebyscore if you want the order reversed) to get the list of Document IDs in that window; plain zrange selects by rank, not by score. Then you can retrieve the documents from the db as normal. Sorted sets are pretty fast in Redis. Further advantages are that you can do set operations such as "documents in this window but not this window", and indeed even store the results in Redis automatically for later use.
One example of how this would be useful is that in your example documents you have a start and end time. If you made an index of each as above, you could get the intersection of the set of documents that start in a given range and the set of documents that end in a given range, and store the resulting set in a new key for later re-use. This would be done via zinterstore.
Hopefully, that helps someone using Redis for this.
MongoDB is very good for queries; I think it's useful because it has a lot of functions. I use MongoDB for GPS distance, text search and the pipeline model (aggregation included).

Relations in Document-oriented database?

I'm interested in document-oriented databases, and I'd like to play with MongoDB. So I started a fairly simple project (an issue tracker), but I'm having a hard time thinking in a non-relational way.
My problems:
I have two objects that relate to each other (e.g. issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}} - here I have a user related to the issue). Should I create another document 'user' and reference it in 'issue' document by its id (like in relational databases), or should I leave all the user's data in the subdocument?
If I have objects (subdocuments) in a document, can I update them all in a single query?
I'm totally new to document-oriented databases, and right now I'm trying to develop sort of a CMS using node.js and mongodb so I'm facing the same problems as you.
By trial and error I found this rule of thumb: I make a collection for every entity that may be a "subject" for my queries, while embedding the rest inside other objects.
For example, comments in a blog entry can be embedded, because usually they're bound to the entry itself and I can't think about a useful query made globally on all comments. On the other side, tags attached to a post might deserve their own collection, because even if they're bound to the post, you might want to reason globally about all the tags (for example making a list of trending topics).
In my mind this is actually pretty simple. Embedded documents can only be accessed via their master document. If you can envision a need to query an object outside the context of the master document, then don't embed it. Use a ref.
For your example
issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}}
I would make issue and reporter each their own document, and reference the reporter in the issue. You could also reference a list of issues in reporter. This way you won't duplicate reporters in issues, you can query them each separately, you can query reporter by issue, and you can query issues by reporter. If you embed reporter in issue, you can only query the one way, reporter by issue.
If you embed documents, you can update them all in a single query, but you have to repeat the update in each master document. This is another good reason to use reference documents.
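A rough Python sketch of the referenced layout, using plain dicts and lists as stand-ins for two collections. The ids (`u1`), the `reporter_id` field and the helper names are assumptions for the example, not anything MongoDB prescribes.

```python
# Hypothetical in-memory stand-ins for a reporters collection and an issues collection.
reporters = {
    "u1": {"_id": "u1", "username": "qwer", "role": "manager"},
}
issues = [
    {"code": "asdf-11", "title": "asdf", "reporter_id": "u1"},
    {"code": "asdf-12", "title": "zxcv", "reporter_id": "u1"},
]


def reporter_for(issue):
    """Reporter by issue: follow the reference."""
    return reporters[issue["reporter_id"]]


def issues_for(reporter_id):
    """Issues by reporter: filter on the reference field."""
    return [i for i in issues if i["reporter_id"] == reporter_id]


# A role change touches one reporter document, not every issue that mentions them.
reporters["u1"]["role"] = "director"
```

With the reporter embedded instead, the same role change would have to be repeated in every issue document.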
The beauty of MongoDB and other "NoSQL" products is that there isn't any schema to design. I use MongoDB and I love it, not having to write SQL queries and awful JOIN queries! So to answer your two questions.
1 - If you create multiple documents, you'll need to make two calls to the DB. Not saying it's a bad thing but if you can throw everything into one document, why not? I recall when I used to use MySQL, I would create a "blog" table and a "comments" table. Now, I append the comments to the record in the same collection (aka table) and keep building on it.
2 - Yes ...
Schema design in document-oriented DBs can seem difficult at first, but building my startup with Symfony2 and MongoDB I've found that 80% of the time it is just like with a relational DB.
At first, think of it like a normal DB:
To start, just create your schema as you would with a relational DB:
Each entity should have its own collection, especially if you'll need to paginate the documents in it.
(in Mongo you can somewhat paginate nested document arrays, but the capabilities are limited)
Then just remove overly complicated normalization:
do I need a separate category table? (simply write the category in a column/property as a string or embedded doc)
Can I store comments count directly as an Int in the Author collection? (then update the count with an event, for example in Doctrine ODM)
Embedded documents:
Use embedded documents only for:
clarity (nested documents like: addressInfo, billingInfo in the User collection)
to store tags/categories (e.g.: [ { name: "Sport", parent: "Hobby", page: "/sport" } ])
to store simple multiple values (e.g. in the User collection: list of specialties, list of personal websites)
Don't use them when:
the parent Document will grow too large
when you need to paginate them
when you feel the entity is important enough to deserve its own collection
Duplicate values across collection and precompute counts:
Duplicate some column/attribute values from one collection to another if you need to run a query with those values in the where conditions. (remember there aren't joins)
e.g.: In the Ticket collection also put the author name (not only the ID)
Also, if you need a counter (number of tickets opened by user, by category, etc.), precompute it.
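A minimal Python sketch of the duplication and precomputed-counter idea, with dicts and lists standing in for the User and Ticket collections. The field names (`author_name`, `ticket_count`) and the `open_ticket` helper are hypothetical; in MongoDB the counter bump would typically be an `$inc` update issued alongside the insert.

```python
# Hypothetical in-memory stand-ins for the User and Ticket collections.
users = {"u1": {"_id": "u1", "name": "Nigel", "ticket_count": 0}}
tickets = []


def open_ticket(author_id, title):
    """Insert a ticket, duplicating the author name and bumping the counter."""
    author = users[author_id]
    ticket = {
        "title": title,
        "author_id": author_id,
        "author_name": author["name"],  # duplicated so ticket queries need no join
    }
    tickets.append(ticket)
    author["ticket_count"] += 1  # precomputed count, maintained at write time
    return ticket


open_ticket("u1", "Printer on fire")
open_ticket("u1", "Coffee machine offline")

# Filtering by author name works on the Ticket data alone.
by_author_name = [t["title"] for t in tickets if t["author_name"] == "Nigel"]
```

The trade-off is that a renamed author has to be propagated to the duplicated `author_name` values.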
Embed references:
When you have a One-to-Many or Many-to-Many reference, use an embedded array with the list of the referenced document ids (see MongoDB DBRef).
You'll need to use an Event again to remove an id if the referenced document gets deleted.
(There is an extension for Doctrine ODM if you use it: Reference Integrity)
These kinds of references are directly managed by Doctrine ODM: Reference Many
It's easy to fix errors:
If you find out late that you have made a mistake in the schema design, it's quite simple to fix it with a few lines of JavaScript run directly in the Mongo console.
(stored procedures made easy: no need of complex migration scripts)
Warning: don't use Doctrine ODM Migrations, you'll regret that later.
Redid this answer since the original answer took the relation the wrong way round due to reading incorrectly.
issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}}
As to whether embedding some important information about the user (creator) of the ticket is a wise decision or not depends upon the system specifics.
Are you giving these users the ability to login and report issues they find? If so then it is likely you might want to factor that relation off to a user collection.
On the other hand, if that is not the case then you could easily get away with this schema. The one problem I see here is if you wish to contact the reporter and their job role has changed, that's somewhat awkward; however, that is a real world dilemma, not one for the database.
Since the subdocument represents a single one-to-one relation to a reporter you also should not suffer fragmentation problems mentioned in my original answer.
There is one glaring problem with this schema and that is duplication of changing repeating data (Normalised Form stuff).
Let's take an example. Imagine you hit the real world dilemma I spoke about earlier and a user called Nigel wants his role to reflect his new job position from now on. This means you have to update all rows where Nigel is the reporter and change his role to that new position. This can be a lengthy and resource consuming query for MongoDB.
To contradict myself again, if you were to only have maybe 100 tickets (aka something manageable) per user then the update operation would likely not be too bad and would, in fact, be manageable for the database quite easily; plus due to the lack of movement (hopefully) of the documents this would be a completely in place update.
So whether this should be embedded or not depends heavily upon your querying and documents etc, however, I would say this schema isn't a good idea; specifically due to the duplication of changing data across many root documents. Technically, yes, you could get away with it but I would not try.
I would instead split the two out.
If I have objects (subdocuments) in a document, can I update them all in a single query?
Just like the relation style in my original answer, yes and easily.
For example, let's update the role of Nigel to MD (as hinted earlier) and change the ticket status to completed:
db.tickets.update({'reporter.username':'Nigel'},{$set:{'reporter.role':'MD', status: 'completed'}},{multi:true}) // multi:true updates every matching ticket, not just the first
So a single document schema does make CRUD easier in this case.
One thing to note, going by your wording: you cannot use the positional operator to update all subdocuments under a root document. Instead it will update only the first one found.
Again hopefully that makes sense and I haven't left anything out. HTH
Original Answer
here I have a user related to the issue). Should I create another document 'user' and reference it in 'issue' document by its id (like in relational databases), or should I leave all the user's data in the subdocument?
This is a considerable question and requires some background knowledge before continuing.
First thing to consider is the size of a issue:
issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}}
Is not very big, and since you no longer need the reporter information (that would be on the root document) it could be smaller, however, issues are never that simple. If you take a look at the MongoDB JIRA for example: https://jira.mongodb.org/browse/SERVER-9548 (as a random page that proves my point) the contents of a "ticket" can actually be quite considerable.
The only way you would gain a true benefit from embedding the tickets would be if you could store ALL user information in a single 16 MB block of contiguous storage, which is the maximum size of a BSON document (as imposed by the mongod currently).
I don't think you would be able to store all tickets under a single user.
Even if you were to shrink the ticket to, maybe, a code, title and a description you could still suffer from the "swiss cheese" problem caused by regular updates and changes to documents in MongoDB, as ever this: http://www.10gen.com/presentations/storage-engine-internals is a good reference for what I mean.
You would typically witness this problem as users add multiple tickets to their root user document. The tickets themselves will change as well but maybe not in a drastic or frequent manner.
You can, of course, remedy this problem a bit by using power of 2 sizes allocation: http://docs.mongodb.org/manual/reference/command/collMod/#usePowerOf2Sizes which will do exactly what it says on the tin.
Ok, hypothetically, if you were to only have code and title then yes, you could store the tickets as subdocuments in the root user without too many problems, however, this is something that comes down to specifics that the bounty assignee has not mentioned.
If I have objects (subdocuments) in a document, can I update them all in a single query?
Yes, quite easily. This is one thing that becomes easier with embedding. You could use a query like:
db.users.update({user_id:uid,'tickets.code':'asdf-1'}, {$set:{'tickets.$.title':'Oh NOES'}})
However, to note, you can only update ONE subdocument at a time using the positional operator. As such this means you cannot, in a single atomic operation, update all ticket dates on a single user to 5 days in the future.
As for adding a new ticket, that is quite simple:
db.users.update({user_id:uid},{$push:{tickets:{code:'asdf-1',title:"Whoop"}}})
So yes, you can quite simply, depending on your queries, update the entire users data in a single call.
That was quite a long answer so hopefully I haven't missed anything out, hope it helps.
I like MongoDB, but I have to say that I will use it a lot more soberly in my next project.
Specifically, I have not had as much luck with the Embedded Document facility as people promise.
Embedded Document seems to be useful for Composition (see UML Composition), but not for aggregation. Leaf nodes are great, anything in the middle of your object graph should not be an embedded document. It will make searching and validating your data more of a struggle than you'd want.
One thing that is absolutely better in MongoDB is your many-to-X relationships. You can do a many-to-many with only two tables, and it's possible to represent a many-to-one relationship on either table. That is, you can either put 1 key in N rows, or N keys in 1 row, or both. Notably, queries to accomplish set operations (intersection, union, disjoint set, etc) are actually comprehensible by your coworkers. I have never been satisfied with these queries in SQL. I often have to settle for "two other people will understand this".
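As a loose illustration of the "N keys in 1 row" idea and the set operations mentioned, here is a Python sketch in which each issue document carries an array of tag ids. The documents, ids and the `issues_with_tag` helper are invented for the example; in MongoDB, operators such as `$all` and `$in` on array fields cover similar ground on the server side.

```python
# Hypothetical documents: each issue stores an array of the tag ids it relates to.
issues = {
    "i1": {"tag_ids": ["t1", "t2"]},
    "i2": {"tag_ids": ["t2"]},
    "i3": {"tag_ids": ["t1", "t2", "t3"]},
}


def issues_with_tag(tag_id):
    """Ids of all issues whose tag_ids array contains tag_id."""
    return {issue_id for issue_id, doc in issues.items() if tag_id in doc["tag_ids"]}


both = issues_with_tag("t1") & issues_with_tag("t2")      # intersection
either = issues_with_tag("t1") | issues_with_tag("t3")    # union
only_t2 = issues_with_tag("t2") - issues_with_tag("t1")   # difference
```

No join table is involved: the relationship lives entirely in the array on one side (or on both sides, if it needs to be followed in either direction).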
If you've ever had your data get really big, you know that inserts and updates can be constrained by how much the indexes cost. You need fewer indexes in MongoDB; an index on A-B-C can be used to query for A, A & B, or A & B & C (but not B, C, or B & C; a query on A & C can only use the A part of the index). Plus the ability to invert a relationship lets you move some indexes to secondary tables. My data hasn't gotten big enough to try, but I'm hoping that will help.
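The prefix rule for a compound index can be sketched in a few lines of Python. This is only a model of the rule, not how MongoDB's query planner is implemented, and the function name `usable_prefix` is made up: it returns the leading index fields that a query also filters on.

```python
def usable_prefix(index_fields, query_fields):
    """Leading index fields that the query also filters on, in index order."""
    prefix = []
    for field in index_fields:
        if field not in query_fields:
            break  # the first gap ends the usable prefix
        prefix.append(field)
    return prefix


# A compound index on A, B, C, as in the answer.
index = ["A", "B", "C"]
```

For example, `usable_prefix(index, {"A", "B"})` covers both fields, a query on B and C alone gets an empty prefix, and a query on A and C is served by the A part only.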