MongoDB: What's a good way to get a list of all unique tags?

What's the best way to keep track of unique tags for a collection that is millions of documents large? The normal way of doing tagging seems to be indexing multikeys. I will frequently need to get all the unique keys, though. I don't have access to MongoDB's new "distinct" command, either, since my driver, erlmongo, doesn't seem to implement it yet.

Even if your driver doesn't implement distinct, you can implement it yourself. In JavaScript (sorry, I don't know Erlang, but it should translate pretty directly) you can say:
result = db.$cmd.findOne({"distinct" : "collection_name", "key" : "tags"})
That is, you do a findOne on the "$cmd" collection of whatever database you're using, passing it the collection name and the key you want to run distinct on. The unique values come back in the "values" field of the result.
If you ever need a command your driver doesn't provide a helper for, you can look at http://www.mongodb.org/display/DOCS/List+of+Database+Commands for a somewhat complete list of database commands.
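As a rough illustration of the approach above, the sketch below builds the command document and emulates, in plain JavaScript, what distinct reports for an array-valued key. The helper names (buildDistinctCommand, distinctValues) and the sample documents are hypothetical; nothing here talks to a real server or driver.

```javascript
// Hypothetical helper: builds the raw "distinct" command document.
// A driver without a distinct helper could send this through its generic
// command mechanism (the command name must be the first key).
function buildDistinctCommand(collectionName, key) {
  return { distinct: collectionName, key: key };
}

// In-memory emulation of the result for an array-valued (multikey) field:
// each array element counts as a value, and every value is reported once.
function distinctValues(docs, key) {
  const seen = new Set();
  for (const doc of docs) {
    const value = doc[key];
    if (value === undefined) continue; // documents without the key add nothing
    const items = Array.isArray(value) ? value : [value];
    for (const item of items) seen.add(item);
  }
  return Array.from(seen);
}

// Sample data, invented for this example.
const sampleDocs = [
  { _id: 1, tags: ["erlang", "mongodb"] },
  { _id: 2, tags: ["mongodb", "nosql"] },
  { _id: 3 },
];

const command = buildDistinctCommand("collection_name", "tags");
const uniqueTags = distinctValues(sampleDocs, "tags").sort();
console.log(command, uniqueTags);
```

Translating this to another driver mostly means building the same two-field document and sending it as a command.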

I know this is an old question, but I had the same issue and could not find a real solution in PHP for it.
So I came up with this:
http://snipplr.com/view/59334/list-of-keys-used-in-mongodb-collection/

John, you may find it useful to use Variety, an open source tool for analyzing a collection's schema: https://github.com/jamescropcho/variety
Perhaps you could run Variety every N hours in the background, and query the newly-created varietyResults database to retrieve a listing of unique keys which begin with a given string (i.e. are descendants of a specific parent).
Let me know if you have any questions, or need additional advice.
Good luck!

Related

nosql wishlist models - Struggle between reference and embedded document

I have a question about modeling wishlists using MongoDB and Mongoose. The idea is that a user needs to be able to have many different wishlists, each containing many wishes, with each wish referencing a single article.
Because a wishlist belongs to only a single user, I thought about using an embedded document for that.
The same goes for a wish being embedded in a wishlist.
So I have something like this:
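// Define the inner schema first so it exists when the outer schema references it.
// WishSchema is assumed to be defined before this point.
var WishlistSchema = new Schema({
  ...
  wishes: [WishSchema]
  ...
})
var UserSchema = new Schema({
  ...
  wishlists: [WishlistSchema]
  ...
})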
My question is what to do with the article: should I use a reference, or should I copy the article's data into an embedded document?
If I use an embedded document, I have an update problem: when the article's price changes, updating every wish that references this article becomes a struggle. But accessing the article of each wish is a piece of cake.
If I use a reference, the update is no longer a problem, but I have a problem when I filter the wishes by their article's criteria (for example by price, category, etc.).
I think the second way is probably the best, but I don't know if it's possible to build a query that filters the wishes by the article's fields. I tried a lot of things using population, but nothing works very well when you need to populate depending on a nested object's field (for example, getting the wishes whose article matches certain conditions).
Is this kind of query doable?
Sorry for the long question and for my bad English, but any advice would be great!
In my experience dealing with NoSQL databases (Mongo, mainly), when designing a collection, do not think about the relations. Instead, think about how you would display, page, and retrieve the documents.
I would prefer embedding and updating multiple documents when there's a change, as opposed to using a ref, for multiple reasons:
Gets would be fast and easy, and filtering is not a problem (like you've said).
Retrieve operations usually happen a lot more often than updates, and with proper indexing you wouldn't really have to worry about performance.
It takes advantage of NoSQL's schema-less nature, and you'll be less prone to restructuring due to requirement changes (new sorting, new filters, etc.).
Paging would be a lot less of a hassle, and the UI design would not be restricted by paging and limits.
Joining could become expensive. Redundant data might be a hassle to update, but that is better than not being able to display data in a particular way because your schema is normalized and joining is difficult.
I'd say the rule of thumb is to split them only when you do not need to display them together. It is not impossible to join them back if you do, but it is definitely more troublesome.
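To make the trade-off concrete, here is a minimal in-memory sketch of the embedded approach. It is not Mongoose code: the field names (articleId, price, category) and the sample users are assumptions made up for the example.

```javascript
// Each wish carries its own copy of the article (the embedded approach).
const users = [
  {
    name: "alice",
    wishlists: [
      {
        title: "birthday",
        wishes: [
          { article: { articleId: "a1", name: "Lamp", price: 40, category: "home" } },
          { article: { articleId: "a2", name: "Book", price: 15, category: "books" } },
        ],
      },
    ],
  },
  {
    name: "bob",
    wishlists: [
      {
        title: "xmas",
        wishes: [
          { article: { articleId: "a1", name: "Lamp", price: 40, category: "home" } },
        ],
      },
    ],
  },
];

// Read path: filtering on article fields needs no join or population.
function wishesCheaperThan(user, maxPrice) {
  return user.wishlists.flatMap((list) =>
    list.wishes.filter((wish) => wish.article.price < maxPrice)
  );
}

// Write path: a price change has to be fanned out to every embedded copy.
function updateArticlePrice(allUsers, articleId, newPrice) {
  let updated = 0;
  for (const user of allUsers) {
    for (const list of user.wishlists) {
      for (const wish of list.wishes) {
        if (wish.article.articleId === articleId) {
          wish.article.price = newPrice;
          updated += 1;
        }
      }
    }
  }
  return updated;
}

const touched = updateArticlePrice(users, "a1", 10); // touches alice's and bob's copies
const cheap = wishesCheaperThan(users[0], 20);
console.log(touched, cheap.map((wish) => wish.article.name));
```

The read stays a simple filter, while the cost of the design shows up in the update loop, which has to visit every copy of the article.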

does mongodb support subcollections

Is there such a concept as subcollections in mongo, something like a subdocument but supports the full crud api? I would like to organize my db to perform queries like this:
db.games.pong.leaderbords.find({score:{'$gt':99}})
In other words is there any hierarchy to collections or must I create a fully descriptive name for each collection, as in:
db.pongLeaderboard.find({score:{'$gt':99}});
EDIT: As Nicolas pointed out the dot notation is allowable. What I meant to ask is can I do this and have games and pong be proper collections in their own right so I could still do something like
db.games.find({name:'pacman'})
It's perfectly legal for collection names to contain the . character, so it's entirely possible to organise your data such that your first query is correct.
I'm fairly sure that creating the collection games.pong.leaderbords does not create games, games.pong or pong though, so this might not answer your problem.
This type of question is terribly easy to answer for yourself though - just type your query in MongoDB and see what happens. If it fails, it's not possible. If it doesn't, it's possible. MongoDB is good that way. Not only does that give you a definite answer, it also gives you an immediate answer without having to wait for someone to try and type your query and post the result.
"What I meant to ask is can I do this and have games and pong be proper collections in their own right so I could still do something like"
If I understand you correctly, then no. Those two "subcollections" have no reference to each other as such. Unless you have actually created a games collection with data in it (db.games.pong.xx doesn't count), querying db.games will not work.
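One way to picture the answers above is to treat collection names as flat strings that merely happen to contain dots. The toy model below is only an illustration of that idea; ToyDatabase and its methods are invented for the example and are not MongoDB's API.

```javascript
// Toy model: a database is a flat map from full collection name to documents.
class ToyDatabase {
  constructor() {
    this.collections = new Map();
  }

  // Inserting creates exactly the named collection and nothing else.
  insert(name, doc) {
    if (!this.collections.has(name)) this.collections.set(name, []);
    this.collections.get(name).push(doc);
  }

  // Querying a name that was never created simply yields no documents.
  find(name, predicate) {
    return (this.collections.get(name) || []).filter(predicate);
  }

  collectionNames() {
    return Array.from(this.collections.keys());
  }
}

const db = new ToyDatabase();
db.insert("games.pong.leaderboards", { player: "ann", score: 120 });
db.insert("games.pong.leaderboards", { player: "bo", score: 80 });

const highScores = db.find("games.pong.leaderboards", (doc) => doc.score > 99);
const gamesBefore = db.find("games", (doc) => doc.name === "pacman");

// "games" only becomes a collection once something is stored under that exact name.
db.insert("games", { name: "pacman" });
const gamesAfter = db.find("games", (doc) => doc.name === "pacman");

console.log(db.collectionNames(), highScores, gamesBefore, gamesAfter);
```

In this model the dots give you a naming convention, not a hierarchy: "games" and "games.pong.leaderboards" are unrelated entries.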

Mongo pagination

I have a use case where I need to get a list of objects from Mongo based on a query. To improve performance, I am adding pagination.
So, for the first call I get a list of, say, 10 objects, and in the next call I need 10 more. But I cannot use offset and pageSize directly, because the first 10 objects displayed on the page may have been modified (deleted).
The solution is to take the ObjectId of the last object returned and retrieve the next 10 objects after that ObjectId.
How can I do this efficiently using Morphia?
Using Morphia, you can do this with the following query:
datastore.find(YourClass.class).field(id).smallerThan(lastId).limit(10).order("-ts");
Since you are querying for the items after the last retrieved id, you don't have to worry about deleted items.
One thing to keep in mind is that you will have the same problem as with skip() here unless you intend to change how your interface works.
Ranged queries like this demand a different kind of interface, since it is much harder to detect exactly what page you are on and how many pages remain, especially if you are doing this to avoid problems with conventional paging.
The interface that naturally arises from this type of paging is an infinitely scrolling page; think of YouTube video comments, the Facebook wall feed, or even Google+. There is no physical pagination or "pages"; instead you have a "get more" button.
This is the type of interface you will need to use to get ranged paging working well.
As for the query, @cubbuk gives a good example:
datastore.find(YourClass.class).field(id).smallerThan(lastId).limit(10).order("-ts");
Except it should be greaterThan(lastId), since you want to find everything after that last _id when sorting in ascending order. I would also sort by _id, unless you generate your ObjectIds some time before you insert a record; if that is the case, you can sort on a specific timestamp set on insert instead.
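The difference between ranged paging and skip-based paging can be sketched in plain JavaScript. This is not Morphia code: nextPage is a hypothetical helper, and small integers stand in for ObjectIds (both are assumed to sort in insertion order).

```javascript
// Ranged paging: return up to pageSize documents whose _id is greater than lastId.
// Pass null as lastId to get the first page.
function nextPage(docs, lastId, pageSize) {
  return docs
    .filter((doc) => lastId === null || doc._id > lastId)
    .sort((a, b) => a._id - b._id)
    .slice(0, pageSize);
}

// Seven sample documents with ids 1..7.
let docs = Array.from({ length: 7 }, (_, i) => ({ _id: i + 1 }));

const firstPage = nextPage(docs, null, 3); // ids 1, 2, 3
const lastSeenId = firstPage[firstPage.length - 1]._id;

// A document from the first page is deleted before the next request.
docs = docs.filter((doc) => doc._id !== 2);

// Ranged paging continues right after the last id the client saw.
const secondPage = nextPage(docs, lastSeenId, 3); // ids 4, 5, 6

// Skip-based paging (skip 3, limit 3) now silently skips id 4.
const skipPage = docs.slice(3, 6); // ids 5, 6, 7

console.log(secondPage.map((d) => d._id), skipPage.map((d) => d._id));
```

The client only has to remember the last _id it received, which is why this style pairs naturally with a "get more" button.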

MongoDB schema design -- Choose two collection approach or embedded document

I am trying to design a simple application in which I have two entities, Notebook and Note. A Notebook can contain multiple Notes. In an RDBMS I could have two tables with a one-to-many relationship between them. In MongoDB, I am not sure whether I should take a two-collection approach or embed the notes in the Notebook collection. What would you suggest?
That seems like a perfectly reasonable situation to use a single collection called Notebook, and each Notebook document contains embedded Notes. You can easily index on embedded documents.
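If a Notebook document has a 'notes' key whose value is a list of notes, you can index a field inside the embedded notes:
{
  "notes": [
    {"created_on": new Date(1343592000000), "text": "A note."}
  ]
}
// create a descending index on the embedded field
// (createIndex is the current name; older shells call it ensureIndex)
db.notebook.createIndex({"notes.created_on" : -1})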
My opinion is to embed as much as possible, and to reference another collection via an id only as a second option, when the reference needs to point to a more general set of data that is shared and might change. An example is a collection of category documents that many other collections reference, where a category can be updated over time. But in your case, a note should always belong to a notebook.
You should ask yourself what kind of queries you will need to run on it. The "by default" approach is to embed them, but there are cases (that will depend on how you plan on using them) where a more relational approach is applicable. So the simple answer is "probably, but you should probably think about it" :)

In what scenarios would I need to use the CREATEREF, DEREF and REF keywords?

This question is about why I would use the above keywords. I've found plenty of MSDN pages that explain how. I'm looking for the why.
What query would I be trying to write that means I need them? I ask because the examples I have found appear to be achievable in other ways...
To try and figure it out myself, I created a very simple entity model using the Employee and EmployeePayHistory tables from the AdventureWorks database.
One example I saw online demonstrated something similar to the following Entity SQL:
SELECT VALUE
DEREF(CREATEREF(AdventureWorksEntities3.Employee, row(h.EmployeeID))).HireDate
FROM
AdventureWorksEntities3.EmployeePayHistory as h
This seems to pull back the HireDate without having to specify a join?
Why is this better than the SQL below (that appears to do exactly the same thing)?
SELECT VALUE
h.Employee.HireDate
FROM
AdventureWorksEntities3.EmployeePayHistory as h
Looking at the above two statements, I can't work out what extra the CREATEREF, DEREF bit is adding since I appear to be able to get at what I want without them.
I'm assuming I have just not found the scenarios that demonstrate the purpose. I'm assuming there are scenarios where using these keywords is either simpler or is the only way to accomplish the required result.
What I can't find is the scenarios....
Can anyone fill in the gap? I don't need entire sets of SQL. I just need a starting point to play with i.e. a brief description of a scenario or two... I can expand on that myself.
Look at this post
One of the benefits of references is that it can be thought as a ‘lightweight’ entity in which we don’t need to spend resources in creating and maintaining the full entity state/values until it is really necessary. Once you have a ref to an entity, you can dereference it by using DEREF expression or by just invoking a property of the entity
TL;DR - REF/DEREF are similar to C++ pointers. They are references to persisted entities (not entities which have not been saved to a data source).
Why would you use such a thing? A reference to an entity uses less memory than having the DEREF'ed (or expanded; or filled; or instantiated) entity. This may come in handy if you have a bunch of records that hold image information and image data (4GB files stored in the database). If you didn't use a REF, and you pulled back 10 of these entities just to get the image metadata, then you'd quickly fill up your memory.
I know, I know. It'd be easier just to pull back the metadata in your query, but then you lose the point of what REF is good for :-D
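The "lightweight reference" idea can be sketched outside Entity SQL. The JavaScript below is only an analogy, not how Entity Framework implements REF/DEREF: EntityRef, the in-memory store, and the load counter are all invented for the example.

```javascript
// Stand-in for a data source holding large entities (the data field is the heavy part).
const store = new Map([
  [1, { id: 1, title: "scan.png", data: "x".repeat(1000) }],
  [2, { id: 2, title: "photo.png", data: "y".repeat(1000) }],
  [3, { id: 3, title: "chart.png", data: "z".repeat(1000) }],
]);

let loads = 0; // counts how many full entities were actually materialized

// A reference holds only the key; the full entity is fetched when deref() is called.
class EntityRef {
  constructor(key) {
    this.key = key;
  }

  deref() {
    loads += 1;
    return store.get(this.key);
  }
}

// Building references is cheap: no entity has been loaded yet.
const refs = Array.from(store.keys(), (key) => new EntityRef(key));
const loadsBeforeDeref = loads;

// Only the one entity we actually need gets materialized.
const first = refs[0].deref();

console.log(refs.length, loadsBeforeDeref, loads, first.title);
```

Holding three references costs three keys; the heavy payload is paid for only on the single deref() call.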