In MongoDB, when to use a simple subdocument, and when an array with two-field elements?

Background
I am storing table rows as MongoDB documents, with each column having a name. Let's say the table has these columns of interest: Identifier, Person, Date, Count. The MongoDB document also has some extra fields separate from the table data, represented here by a timestamp. Columns are not fixed (which is why I use a schema-free database to store them in the first place).
There will be a need to do various complex, but so far unspecified, queries. I am not very concerned about performance, though query performance may conceivably become a bottleneck. Once inserted, documents will not be modified (a new document with the same Identifier will be created instead), and insertions are not very frequent (let's say 1000 new MongoDB documents per day). So the amount of data will grow steadily over time.
Example
The straightforward approach is having a collection of MongoDB documents like:
{
_id: XXXX,
insertDate: ISODate("2012-10-15T21:26:17Z"),
flag: true,
data: {
Identifier: "AB002",
Person: "John002",
Date: ISODate("2013-11-16T21:26:17Z"),
Count: 1
}
}
Now I have seen an alternative approach (for example in the accepted answer of this question), using an array with two fields per object:
{
_id: XXXX,
insertDate: ISODate("2012-10-15T21:26:17Z"),
flag: true,
data: [
{ field: "Identifier", value: "AB002" },
{ field: "Person", value: "John001" },
{ field: "Date", value: ISODate("2013-11-16T21:26:17Z") },
{ field: "Count", value: 1 }
]
}
Questions
Does the 2nd approach make any sense at all?
If yes, then how to choose which to use? In particular, are there specific kinds of queries which are easy/cheap with one approach and hard/costly with the other? Any "rules of thumb" on which way to go, or pro-con lists for both? Real-life examples of one approach being inconvenient would be especially valuable.

In your specific example the first version is a lot more appropriate and simpler. You have to think in terms of how you would query your document.
It is a lot simpler to query your database like this: db.collection.find({"data.Identifier": "AB002"})
Although I'm not 100% sure why you even need the inner document. Why can't you structure your document like this:
{
_id: "AB002",
insertDate: ISODate("2012-10-15T21:26:17Z"),
flag: true,
Person: "John002",
Date: ISODate("2013-11-16T21:26:17Z"),
Count: 1
}
Pros of first example:
Simple to query
Enforces unique keys, but your data won't have two columns with the same name anyway
I would assume MongoDB would generate better query plans because the structure is much simpler (haven't tested)
Pros of second example:
Allows multiple entries with the same key/field, but I don't feel that is useful in your case
A single index on the array can be used for all of its entries regardless of their field name
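To make the trade-off above concrete, here is a minimal sketch in plain JavaScript of how the two layouts map onto each other and what the matching query and index documents could look like. The helper names toPairs and fromPairs are made up for illustration, and the query and index objects are only built here, not run against a server.

```javascript
// Convert the subdocument layout into the { field, value } array layout.
function toPairs(data) {
  return Object.entries(data).map(([field, value]) => ({ field, value }));
}

// Convert back. A later duplicate field overwrites an earlier one, which is
// exactly the "unique keys" property of the subdocument layout.
function fromPairs(pairs) {
  const data = {};
  for (const { field, value } of pairs) {
    data[field] = value;
  }
  return data;
}

const data = { Identifier: "AB002", Person: "John002", Count: 1 };
const pairs = toPairs(data);

// Query documents for "Identifier is AB002" in each layout.
const subdocumentQuery = { "data.Identifier": "AB002" };
const arrayQuery = {
  // $elemMatch makes both conditions apply to the same array element
  data: { $elemMatch: { field: "Identifier", value: "AB002" } },
};

// Index specifications for each layout.
const subdocumentIndex = { "data.Identifier": 1 }; // one index per queried column
const arrayIndex = { "data.field": 1, "data.value": 1 }; // covers every column

console.log(pairs, fromPairs(pairs));
```

With the array layout a single index serves columns that are not known in advance, at the price of more verbose queries. With the subdocument layout each queried column needs its own index.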

I don't think that the situation in the other example and yours are the same. In the other example, they're creating a list of items with one of two answers, which is more appropriately stored in an array, and the goal is to return a list of subdocuments that match the criteria. In your example, you're really just describing an object, since the fields all hold different types of information, and you won't need to retrieve searchable bits of the subdocuments.

Related

MongoDB - Manipulating multi-level arrays in a document

I am currently building an app with Meteor and MongoDB. I have a 3 level document structure with array in array:
{
_id: "shtZFiTeHrPKyJ8vR",
description: "Some title",
categories: [{
id: "shtZFiTeHrPKyJ8vR",
name: "Foo",
options: [{
id: "shtZFiTeHrPKyJ8vR",
name: "bar",
likes: ["abc", "bce"]
}]
}]
}
Now, the document could be manipulated at any level. That means:
1. description could be changed
2. categories can be added / removed / renamed
3. options can be added / removed / renamed
4. users can like options, so likes must be added or removed
1 and 2 are quite easy. It is also relatively easy to add or remove an option:
MyCollection.update({ _id: id, "categories.id": categoryId }, {
$push: {
"categories.$.options": {
id: Random.id(),
name: optionName
}
}
});
But manipulating the options hash has to be done on JavaScript objects. That means I first need to find my document, iterate over the options and then write them back.
At least that's what I am doing right now. But I don't like that approach.
What I was thinking about is splitting the collection, at least to put the likes into their own collection referencing the origin document.
Or is there another way? I don't really like both of my possible solutions.
For this kind of query one would normally use the Mongo positional operator. However, from the docs:
Nested Arrays
The positional $ operator cannot be used for queries
which traverse more than one array, such as queries that traverse
arrays nested within other arrays, because the replacement for the $
placeholder is a single value
Thus the only way to natively do what you want is by using specific array indexes (positions) in the path.
db.test.update({},{$pull:{"categories.0.options.0.likes":"abc"}})
Unfortunately Mongo does not make it easy to get the array index of a matched nested document.
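If you stay with the nested structure, the positions have to be worked out on the client from a document you have already fetched. A minimal sketch, assuming a hypothetical helper likesPath and a made-up sample document with distinct ids:

```javascript
// Find the array positions of a category and an option inside a fetched
// document and build the dotted path to that option's likes array.
function likesPath(doc, categoryId, optionId) {
  const c = doc.categories.findIndex((cat) => cat.id === categoryId);
  if (c === -1) return null;
  const o = doc.categories[c].options.findIndex((opt) => opt.id === optionId);
  if (o === -1) return null;
  return `categories.${c}.options.${o}.likes`;
}

// Hypothetical document, same shape as in the question but with distinct ids.
const doc = {
  _id: "doc1",
  categories: [
    { id: "cat1", options: [{ id: "opt1", likes: [] }] },
    {
      id: "cat2",
      options: [
        { id: "opt2", likes: [] },
        { id: "opt3", likes: ["abc", "bce"] },
      ],
    },
  ],
};

const path = likesPath(doc, "cat2", "opt3");
// Modifier document that could be passed to update() to remove one like.
const modifier = { $pull: { [path]: "abc" } };
console.log(path, JSON.stringify(modifier));
```

Keep in mind that the positions are only valid for the version of the document you read. If another client adds or removes a category or option between the read and the update, the path may point at the wrong element.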
I would normally say that once your queries become that difficult it's probably a good idea to revisit the way you store data. Also with that many arrays to which you will be pushing data, Mongo will probably be relocating a lot of documents. This is definitely something that you want to minimize.
So at this point you will need to separate your data out into different documents and even collections.
Your first documents would look like this:
{
_id: "shtZFiTeHrPKyJ8vR",
description: "Some title",
categories: [{
id: "shtZFiTeHrPKyJ8vR",
name: "Foo",
options: ["shtZFiTeHrPKyJ8vR"]
}]
}
This way you can easily add/remove options as you mentioned in your question. You would then need a second collection with documents that represent each option.
{
_id: "shtZFiTeHrPKyJ8vR",
name: "bar",
likes: ["abc", "bce"]
}
You can learn more about references here. This is similar to what you mentioned in your comment. The benefit of this is that you are already reducing the potential amount of relocation. Depending on how you use your data you may even be reducing network usage.
Now doing updates on the likes is easy.
MyCollection.update({ _id: id}, {
$push: {likes: "value"}
});
This does, however, require you to make two queries to the db. Although on the flip side you do a lot less on the client side and a lot less bandwidth is used.
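As a rough sketch of that second query: once the parent document has been fetched, the referenced option ids can be collected and turned into a single $in query against the options collection. The helper name optionIdsOf and the sample ids are made up for illustration.

```javascript
// Collect every referenced option id from a parent document whose
// categories hold arrays of option ids.
function optionIdsOf(doc) {
  return doc.categories.flatMap((category) => category.options);
}

// Hypothetical parent document in the referenced layout.
const parent = {
  _id: "doc1",
  description: "Some title",
  categories: [
    { id: "cat1", name: "Foo", options: ["opt1", "opt2"] },
    { id: "cat2", name: "Baz", options: ["opt3"] },
  ],
};

// Query document for the options collection: one round trip for all options.
const optionsQuery = { _id: { $in: optionIdsOf(parent) } };
console.log(JSON.stringify(optionsQuery));
```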
Another question you need to ask yourself is whether that depth of nesting is really needed. There might be an easier way to achieve your goal that doesn't require it to become so complicated.

searching with multiple parameters with mongodb

How is fine-grained search achievable with MongoDB, without the use of external engines? Take this object as an example:
{
genre: 'comedy',
pages: 380,
year: 2013,
bestseller: true,
author: 'John Doe'
}
That is being searched by the following:
db.books.find({
pages: { $gt: 100 },
year: { $gt: 2000 },
bestseller: true,
author: "John Doe"
});
Pretty straightforward so far. Now suppose that there are a bit more values in the document, and that I am making more refined searches and I have a pretty big collection.
The first thing I would do is create indexes. But how does that work? I have read that index intersection, as described in https://jira.mongodb.org/browse/SERVER-3071, is not doable. That means that if I create separate indexes on "year" and "pages" I will not really optimize the AND operations in searches.
So how can the searches be optimized for having many parameters?
Thanks in advance.
It seems like you are asking about compound indexes in mongodb. Compound indexes allow you to create a single index on multiple fields in a document. By creating compound indexes you can make these large/complex queries while still using an index.
On a more general note, if you create a basic index on a field that is highly selective, your search can end up being very quick. Using your example, if you had an index on author, the query engine would use that index to find all the entries where author == "John Doe". Presumably there are not that many books with that specific author relative to the number of books in the entire collection. So, even if the rest of your query is fairly complex, it is only evaluated over those few documents with the matching author. Thus, by structuring your indexes properly you can get a significant performance gain without having to have any complex indexes.
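The selectivity argument can be illustrated without a database. The sketch below is a toy model, not how MongoDB works internally: a Map plays the role of a single-field index on author, and the remaining conditions are evaluated only over the candidates it returns. The sample books and the compound key order at the end are assumptions for illustration.

```javascript
const books = [
  { genre: "comedy", pages: 380, year: 2013, bestseller: true, author: "John Doe" },
  { genre: "drama", pages: 90, year: 2014, bestseller: true, author: "John Doe" },
  { genre: "comedy", pages: 500, year: 2010, bestseller: false, author: "Jane Roe" },
  { genre: "crime", pages: 250, year: 1999, bestseller: true, author: "Jane Roe" },
];

// Toy "index": maps each value of one field to the documents holding it.
function buildIndex(docs, field) {
  const index = new Map();
  for (const doc of docs) {
    if (!index.has(doc[field])) index.set(doc[field], []);
    index.get(doc[field]).push(doc);
  }
  return index;
}

// Use the index for the selective field, then check the other conditions
// only on the candidates instead of on the whole collection.
function search(index, author) {
  const candidates = index.get(author) || [];
  const matches = candidates.filter(
    (b) => b.pages > 100 && b.year > 2000 && b.bestseller === true
  );
  return { examined: candidates.length, matches };
}

const byAuthor = buildIndex(books, "author");
const result = search(byAuthor, "John Doe");
console.log(result.examined, result.matches.length);

// One plausible compound index for the same query: equality fields first,
// range fields last. Treat the exact order as an assumption to verify
// with explain() on your own data.
const compoundIndex = { author: 1, bestseller: 1, year: 1, pages: 1 };
```

Here only the two books by the requested author are examined, no matter how many other books the collection holds.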

MongoDB: Speed of field ("inside record") search in comparison with speed of search in "global scope"

My question may not be very well formulated because I haven't worked with MongoDB yet, so I'd like to know one thing.
I have an object (record/document/anything else) in my database - in global scope.
And have a really huge array of other objects in this object.
So, what about speed of search in global scope vs search "inside" object? Is it possible to index all "inner" records?
Thanks beforehand.
So, like this
users: {
..
user_maria:
{
age: "18",
best_comments :
{
goodnight:"23rr",
sleeptired:"dsf3"
..
}
}
user_ben:
{
age: "18",
best_comments :
{
one:"23rr",
two:"dsf3"
..
}
}
So, how can I make it fast to find user_maria->best_comments->goodnight (index context of collections "best_comment") ?
First of all, your example schema is very questionable. If you want to embed comments (which is a big if), you'd want to store them in an array for appropriate indexing. Also, post your schema in JSON format so we don't have to parse the whole name/value thing :
db.users {
name:"maria",
age: 18,
best_comments: [
{
title: "goodnight",
comment: "23rr"
},
{
title: "sleeptired",
comment: "dsf3"
}
]
}
With that schema in mind you can put an index on name and best_comments.title, for example like so:
db.users.ensureIndex({name:1, 'best_comments.title':1})
Then, when you want the query you mentioned, simply do
db.users.find({name:"maria", 'best_comments.title':"goodnight"})
And the database will hit the index and will return this document very fast.
Now, all that said, your schema is still very questionable. You mention you want to query specific comments, but that requires either putting comments in a separate collection or filtering the comments array app-side. Additionally, having huge, ever-growing embedded arrays in documents can become a problem. Documents have a 16 MB limit, and if documents keep increasing in size, Mongo will have to continuously move them on disk.
My advice :
Put comments in a separate collection
Either do document per comment or make comment bucket documents (say, 100 comments per document)
Read up on Mongo/NoSQL schema design. You always query for root documents so if you end up needing a small part of a large embedded structure you need to reexamine your schema or you'll be pumping huge documents over the connection and require app-side filtering.
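For the bucket variant, a minimal sketch of how comments could be grouped before inserting them into a separate collection. The function name toBuckets and the bucket field names (user, bucket, count, comments) are assumptions for illustration, not a fixed convention.

```javascript
// Split a user's comments into bucket documents of at most bucketSize
// comments each, ready to be inserted into a separate comments collection.
function toBuckets(userName, comments, bucketSize = 100) {
  const buckets = [];
  for (let i = 0; i < comments.length; i += bucketSize) {
    const slice = comments.slice(i, i + bucketSize);
    buckets.push({
      user: userName,           // reference back to the owning user
      bucket: buckets.length,   // sequence number of this bucket
      count: slice.length,      // lets you find a bucket that still has room
      comments: slice,
    });
  }
  return buckets;
}

// Hypothetical data: 250 generated comments for one user.
const comments = Array.from({ length: 250 }, (_, i) => ({
  title: "comment" + i,
  comment: "text " + i,
}));
const buckets = toBuckets("maria", comments);
console.log(buckets.map((b) => b.count));
```

Each bucket stays far below the document size limit, and a read for one page of comments only transfers one small document instead of the whole user record.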
I'm not sure I understand your question but it sounds like you have one record with many attributes.
record = {'attr1':1, 'attr2':2, etc.}
You can create an index on any single attribute or any combination of attributes. Also, you can create any number of indices on a single collection (MongoDB collection == MySQL table), whether or not each record in the collection has the attributes being indexed on.
edit: I don't know what you mean by 'global scope' within MongoDB. To insert any data, you must define a database and collection to insert that data into.
Database 'Example':
Collection 'table1':
records: {a:1,b:1,c:1}
{a:1,b:2,d:1}
{a:1,c:1,d:1}
indices:
ensureIndex({a:ascending, d:ascending}) <- this will index on a, then by d; the fact that record 1 doesn't have an attribute 'd' doesn't matter, and this will increase query performance
edit 2:
Well first of all, in your table here, you are assigning multiple values to the attribute "name" and "value". MongoDB will ignore/overwrite the original instantiations of them, so only the final ones will be included in the collection.
I think you need to reconsider your schema here. You're trying to use it as a series of key value pairs, and it is not specifically suited for this (if you really want key value pairs, check out Redis).
Check out: http://www.jonathanhui.com/mongodb-query

MongoDB - Query embedded documents

I have a collection named Events. Each Event document has a collection of Participants as embedded documents.
Now my question: is there a way to query an Event and get all Participants with, for example, Age > 18?
When you query a collection in MongoDB, by default it returns the entire document which matches the query. You could slice it and retrieve a single subdocument if you want.
If all you want is the Participants who are older than 18, it would probably be best to do one of two things:
Store them in a subdocument inside of the event document called "Over18" or something. Insert them into that document (and possibly the other if you want) and then when you query the collection, you can instruct the database to only return the "Over18" subdocument. The downside to this is that you store your participants in two different subdocuments and you will have to figure out their age before inserting. This may or may not be feasible depending on your application. If you need to be able to check on arbitrary ages (i.e. sometimes it's 18 but sometimes it's 21 or 25, etc.) then this will not work.
Query the collection and retrieve the Participants subdocument and then filter it in your application code. Despite what some people may believe, this isn't terrible because you don't want your database to be doing too much work all the time. Offloading the computations to your application could actually benefit your database because it can now spend more time querying and less time filtering. It leads to better scalability in the long run.
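The second option is only a few lines of application code. A minimal sketch, assuming the event document has a participants array whose entries carry a numeric age field (both field names are assumptions, as is the sample event):

```javascript
// Filter the embedded participants of an already fetched event document.
function participantsOlderThan(event, minAge) {
  return event.participants.filter((p) => p.age > minAge);
}

// Hypothetical event as it might come back from the database.
const event = {
  _id: "event1",
  participants: [
    { firstName: "Ann", age: 17 },
    { firstName: "Bob", age: 18 },
    { firstName: "Cid", age: 25 },
  ],
};

const adults = participantsOlderThan(event, 18);
console.log(adults.map((p) => p.firstName));
```

Because the threshold is an argument, this also covers the case of arbitrary ages (21, 25 and so on) that the first option cannot handle.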
Short answer: no. I tried to do the same a couple of months back, but mongoDB does not support it (at least in version <= 1.8). The same question has been asked in their Google Group for sure. You can either store the participants as a separate collection or get the whole documents and then filter them on the client. Far from ideal, I know. I'm still trying to figure out the best way around this limitation.
For future reference: This will be possible in MongoDB 2.2 using the new aggregation framework, by aggregating like this:
db.events.aggregate(
{ $unwind: '$participants' },
{ $match: {'participants.age': {$gte: 18}}},
{ $project: {participants: 1}}
)
This will return a list of n documents where n is the number of participants > 18 where each entry looks like this (note that the "participants" array field now holds a single entry instead):
{
_id: objectIdOfTheEvent,
participants: { firstName: 'only one', lastName: 'participant'}
}
It could probably even be flattened on the server to return a list of participants. See the official documentation for more information.

Full Join/Intersection in couchdb

I have some documents which have 2 sets of attributes: tag and lieu. Here is an example of what they look like:
{
title: "doc1",
tag: ["mountain", "sunny", "forest"],
lieu: ["france", "luxembourg"]
},
{
title: "doc2",
tag: ["sunny", "lake"],
lieu: ["france", "germany"]
},
{
title: "doc3",
tag: ["sunny"],
lieu: ["belgium", "luxembourg", "france"]
}
How can I map/reduce and query my DB to be able to retrieve only the intersection of documents that match these criteria:
lieu: ["france", "luxembourg"]
tag: ["sunny"]
Returns: doc1 and doc3
I cannot figure out any format map/reduce could return that would let me do this with only one query. What I am doing now is: emit every lieu/tag as key with the related document's id as value, then reduce so that every key has an array of document ids. Then from my app I query this view, do an intersection of the documents on the app side (only take the docs that have all 3 keys: luxembourg, france and sunny) and then requery CouchDB with these document ids to retrieve the actual docs. I feel that's not the right/best way to do it?
I am using lists to do the intersection job, and it works quite well. But I still need to do another request to get the documents using the document ids. Any idea what I could do differently to retrieve the documents directly?
Thank you!
This is going to be awkward. The basic idea is that you have to build a view where the map function emits every possible combination of tags and countries as the key, and there's no reduce function. This way, looking for ["france","luxembourg"] would return all documents that emitted that key (and therefore are in the intersection), because views without a reduce function return the emitting document for every entry. This way, you only have to do one request.
This causes a lot of emits to happen, but you can lower that number by sorting the tags both when emitting and when searching (automatically turn ["luxembourg","france"] into ["france","luxembourg"]), and by taking advantage of the ability of CouchDB to query prefixes (this means that emitting ["belgium","france","luxembourg"] will let you match searches for ["belgium"] and ["belgium","france"]).
In your example above, for the countries, you would only emit:
// doc 1
emit(["luxembourg"],null);
emit(["france","luxembourg"],null);
// doc 2
emit(["germany"],null);
emit(["france","germany"],null);
// doc 3
emit(["luxembourg"],null);
emit(["belgium","luxembourg"],null);
emit(["france","luxembourg"],null);
emit(["belgium","france","luxembourg"],null);
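The list above follows a simple rule: after sorting, any combination that does not contain the last value is a prefix of the same combination with the last value appended, so only combinations that end with the last value need to be emitted. A minimal sketch of that rule, with keysToEmit and normalizeSearch as made-up helper names:

```javascript
// Return the sorted key arrays to emit for one document so that every
// sorted combination of its values is a prefix of some emitted key.
function keysToEmit(values) {
  const sorted = [...new Set(values)].sort();
  if (sorted.length === 0) return [];
  const last = sorted[sorted.length - 1];
  const rest = sorted.slice(0, -1);
  const keys = [];
  // Each bitmask selects one subset of the remaining values, kept in order.
  for (let mask = 0; mask < (1 << rest.length); mask++) {
    const subset = rest.filter((_, i) => mask & (1 << i));
    keys.push([...subset, last]);
  }
  return keys;
}

// Search terms must be normalized the same way before querying the view.
function normalizeSearch(terms) {
  return [...new Set(terms)].sort();
}

console.log(JSON.stringify(keysToEmit(["belgium", "luxembourg", "france"])));
console.log(JSON.stringify(normalizeSearch(["luxembourg", "france"])));
```

Inside a map function you would define the helper in the function body and call emit(key, null) for each returned key. The number of keys still doubles with every additional value, and older CouchDB view servers may need this rewritten in ES5 syntax (no spread or arrow functions).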
Anyway, for complex queries like this one, consider looking into a CouchDB-Lucene combination.