I have an arbitrary tree structure.
Example data structure:
root
|--node1
| |--node2
| | |--leaf1
| |
| |--leaf2
|
|--node3
|--leaf3
Each node and leaf has 2 properties: id and name.
The important queries:
1.: A leaf id is given. The query should return the whole path from root to that leaf, with each node's id and name properties.
It's not important whether the return value is a sorted array of nodes or an object where the nodes are nested.
Example: If the id of leaf2 is given, the query should return: root(id, name), node1(id, name), leaf2(id, name).
2.: Given any node id: Get the whole (sub)tree. Here it would be nice to retrieve a single object where each node has a children array.
Thoughts, trials and errors:
1.: First I tried to simply model the tree as a single JSON document, but then the query would become impossible: there's no way to find out at which nesting level the leaf is. And even if I knew the whole path of ids from root to the leaf, I'd have to use a projection with multiple positional operators, which isn't supported by MongoDB at the moment. Additionally, it's not possible to index the leaf ids because the nesting can be arbitrarily deep.
2.: Next idea was to use a flat data design, where each node has an array which contains the node's ancestor ids:
{
  id: ...,
  name: ...,
  ancestors: [ rootId, node1Id, ... ]
}
This way I'd only need 2 queries to get the whole path from root to some node or leaf, which is quite nice.
Questions:
If I choose data model 2.: How can I get the whole tree, or a subtree?
Getting all descendants is easy: find({ancestors:"myStartingNodeId"}). But those will of course not be sorted or nested.
Is there a way using the aggregation framework or a completely different data model to solve this problem?
Thank you!
MongoDB is not a graph database and doesn't provide graph traversal operations, so there is no direct solution.
You can use the data model you described in point 2 (nodes with an ancestor list), run the query find({ancestors:"myStartingNodeId"}), and sort/nest the results in your application code.
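For illustration, a minimal sketch of that two-query approach (collection and field names are assumptions):

  // 1st query: fetch the leaf itself
  var leaf = db.nodes.findOne({ id: "leaf2Id" });
  // 2nd query: fetch all of its ancestors in one go
  var path = db.nodes.find({ id: { $in: leaf.ancestors } }).toArray();
  // Sort in application code, e.g. by each node's position in leaf.ancestors.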
Another possibility is a data model where _id (or some other field) represents the full path, for example 'root.node1.node2'. Then tree queries can be transformed into prefix/substring queries, and correct ordering can be achieved (I hope) just by sorting on this _id.
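As a rough sketch of that materialized-path idea (collection name and the dot separator are assumptions):

  db.nodes.insert({ _id: "root", name: "root" });
  db.nodes.insert({ _id: "root.node1", name: "node1" });
  db.nodes.insert({ _id: "root.node1.node2", name: "node2" });
  // Whole subtree of node1: anchored prefix match, roughly depth-first when sorted by _id
  db.nodes.find({ _id: /^root\.node1(\.|$)/ }).sort({ _id: 1 });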
Update: By the way, there are some tree structure patterns described in the MongoDB docs: Model Tree Structures in MongoDB.
Here's the data structure I finally came up with. It's optimized for read queries; some write queries (like moving subtrees) can be painful.
{
  id: "...",
  ancestors: ["parent_node_id", ..., "root_node_id"], // order is important!
  children: ["child1_id", "child2_id", ...]
}
Benefits:
Easy to get all documents for a sub-tree
Easy to get all documents from some node to the root
Easy to check if some document is parent/child/ancestor/descendant of some node
Children are ordered and can easily be moved by changing the children array order
How to use it:
Get by ID: findOne({ id: "..." })
Get Parent: findOne({ children: "..." })
Get all Ancestors: first do Get by ID, then fetch all documents whose id appears in the ancestors array, e.g. find({ id: { $in: doc.ancestors } })
Get all Children: find({ 'ancestors.0': "..." })
Get all Descendants: find({ ancestors: "..." })
Get all Descendants at a fixed depth: find({ $and: [ {ancestors: "..."}, {ancestors: {$size: x}} ] }) (note that $size matches an exact array length, so x counts generations from the root, not from the starting node)
Drawbacks:
The application code has to take care of correct order.
The application code has to build nested objects (maybe this is possible using MongoDB Aggregation framework).
Every insert must be done using 2 queries (see the sketch after this list).
Moving whole sub-trees between nodes must update a lot of documents.
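As a sketch of the two-query insert (collection name is an assumption), here is adding leaf2 under node1 when the parent document is already at hand:

  // Write 1: create the new node, deriving its ancestors from the parent (parent first, root last)
  db.nodes.insert({ id: "leaf2_id", ancestors: ["node1_id", "root_id"], children: [] });
  // Write 2: register the new node in the parent's ordered children array
  db.nodes.update({ id: "node1_id" }, { $push: { children: "leaf2_id" } });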
You can use $graphLookup.
Documentation:
https://docs.mongodb.com/manual/reference/operator/aggregation/graphLookup/
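A hedged sketch, assuming a flat collection where every node additionally stores its parent's id in a parent field (not exactly the model from the question):

  // Query 1: the path from a leaf up to the root
  db.nodes.aggregate([
    { $match: { _id: "leaf2" } },
    { $graphLookup: {
        from: "nodes",
        startWith: "$parent",
        connectFromField: "parent",
        connectToField: "_id",
        as: "path",
        depthField: "depth"   // 0 = direct parent, increasing toward the root
    } }
  ]);
  // The 'path' array is not ordered; sort it by 'depth' to get leaf-to-root order.
  // Query 2 (whole subtree) is the same idea in the other direction:
  // startWith: "$_id", connectFromField: "_id", connectToField: "parent".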
The software I am currently working with can only run aggregate queries or simple find_one calls. I am new to MongoDB, so I am having difficulty figuring out if I can do what I would like to do.
The Question:
Is it possible to run a lookup query on an ObjectId when that ObjectId may be in one of many collections?
The setup:
I have a main collection; it is essentially an array of other ObjectIds that apply to this object. This collection (call it Main_Config) consists of three ObjectIds:
Client
General_Config
Role_Config
The Client, General_Config, and Main_Config can all have an enforced schema, and I would like the Role_Config to also have an enforced schema. This is where the issue comes into play: the Role_Config may take 3 or more possible schemas. My idea was to create a collection for every possible schema; however, if I do this I will not know which collection the Role_Config ObjectId belongs to. Is there a way to look up an ObjectId that may exist in one of many collections?
There is no findInAnyCollection() type of function. In your model you will have to manually code a loop and look it up.
One approach: In your main config collection, we have docs with this field:
otherIds = [ {coll: "ROLE", key: "5fb8057f08c09fb8dfe8d310"}, {coll: "GENERAL", key: "GENERAL_72f2b2922ed98800bd0e"}, ...]
Putting it all together:
db.AA.drop();
db.BB.drop();
db.CC.drop();
db.AA.insert({_id:0, otherIds: [ {coll:"BB", key:0}, {coll:"BB", key:1}, {coll:"CC", key:2}]});
db.BB.insert({_id:0, foo:"bar", baz:"bin"});
db.BB.insert({_id:1, foo:"ion", baz:"kjlkj"});
db.BB.insert({_id:2, foo:"POPPO", baz:"UHUH"});
db.CC.insert({_id:0, data: "wfwefw"});
db.CC.insert({_id:1, data: "jj"});
db.CC.insert({_id:2, data: "mm"});
doc = db.AA.findOne();
// For each reference, dispatch the lookup to the collection named in 'coll'
doc['otherIds'].forEach(function(item) {
    var other = db[item['coll']].findOne({_id: item['key']});
    printjson(other);
});
How can I get specific field values by querying with an ObjectId, given a document whose structure is as follows:
{
  _id: ObjectId(xxx),
  name: xxx,
  subs: [{
    _id: ObjectId(yyy),
    name: yyy,
    subs: [{
      _id: ObjectId(zzz),
      name: zzz,
      subs: [...]
    }]
  }]
}
I need, given a specific ObjectId, to obtain the element that has the fields _id and name.
Note: everything is stored in a single document.
Regards
It seems that, with your current data model, you would need to retrieve the top-level document and then recursively search for the object you want inside of it.
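A minimal sketch of that recursion, assuming the nesting always happens through a subs array:

  // Returns { _id, name } of the first node matching targetId, or null.
  function findNode(node, targetId) {
    if (String(node._id) === String(targetId)) {
      return { _id: node._id, name: node.name };
    }
    var subs = node.subs || [];
    for (var i = 0; i < subs.length; i++) {
      var found = findNode(subs[i], targetId);
      if (found) return found;
    }
    return null;
  }
  // Usage: fetch the top-level document first, then search it application-side.
  var doc = db.mycollection.findOne({ _id: ObjectId(xxx) });  // collection name is hypothetical
  var result = findNode(doc, ObjectId(yyy));                  // placeholders as in the question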
There are many ways to store hierarchical data, and I would suggest you take a look at the 10gen docs here; they are pretty thorough, but I will summarize them for you below.
Let's take a look at some of the ideas presented in that document starting with the good parts and drawbacks to your current approach.
Full Tree in Single Document (What you are doing now)
Pros:
Single document to fetch per page
One location on disk for whole tree
You can see full structure easily
Cons:
Hard to search
Hard to get back partial results
Can get unwieldy if you need a huge tree. Further, there is a limit on the size of documents in MongoDB – 16MB in v1.8 (the limit may rise in future versions).
Your question relates directly to the problem of searching so, depending on your use case and desire to optimize, you probably either want to take the Array of Ancestors or the Materialized Path approach instead.
Array of Ancestors
With this approach you would store the ancestors of a document in an array and create an index on that field so that it is quickly and easily searchable.
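A rough sketch (collection and field names are assumptions):

  db.nodes.ensureIndex({ ancestors: 1 });       // multikey index over the ancestors array
  db.nodes.find({ ancestors: ObjectId(yyy) });  // every descendant of that node, served by the index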
Materialized Path
With this approach you would store the full path to the document as a string of ancestors and query via regular expression.
Make sense?
If you are building a tree structure with Mongoid, I would suggest looking into
Mongoid Tree
It's a nice gem that gives you all, or at least most, of the things you need.
From the README:
class Node
include Mongoid::Document
include Mongoid::Tree
end
Node.root
Node.roots
Node.leaves
node.root
node.parent
node.children
node.ancestors
node.ancestors_and_self
node.descendants
node.descendants_and_self
node.siblings
node.siblings_and_self
node.leaves
My question may not be very well formulated because I haven't worked with MongoDB yet, so I'd just like to know one thing.
I have an object (record/document/anything else) in my database, in the global scope.
And there is a really huge array of other objects inside this object.
So, how does search speed in the global scope compare to searching "inside" the object? Is it possible to index all the "inner" records?
Thanks in advance.
So, like this
users: {
..
user_maria:
{
age: "18",
best_comments :
{
goodnight:"23rr",
sleeptired:"dsf3"
..
}
}
user_ben:
{
age: "18",
best_comments :
{
one:"23rr",
two:"dsf3"
..
}
}
So, how can I make it fast to find user_maria->best_comments->goodnight (i.e. index the contents of "best_comments")?
First of all, your example schema is very questionable. If you want to embed comments (which is a big if), you'd want to store them in an array for appropriate indexing. Also, post your schema in JSON format so we don't have to parse the whole name/value thing:
db.users {
  name: "maria",
  age: 18,
  best_comments: [
    {
      title: "goodnight",
      comment: "23rr"
    },
    {
      title: "sleeptired",
      comment: "dsf3"
    }
  ]
}
With that schema in mind you can put an index on name and best_comments.title, for example like so:
db.users.ensureIndex({name: 1, 'best_comments.title': 1})
Then, when you want the query you mentioned, simply do
db.users.find({name: "maria", 'best_comments.title': "goodnight"})
And the database will hit the index and will return this document very fast.
Now, all that said, your schema is very questionable. You mention you want to query specific comments, but that requires either putting comments in a separate collection or filtering the comments array app-side. Additionally, having huge, ever-growing embedded arrays in documents can become a problem. Documents have a 16MB limit, and if documents keep growing in size, MongoDB will have to continuously move them around on disk.
My advice:
Put comments in a separate collection
Either do one document per comment or make comment bucket documents (say, 100 comments per document); see the sketch after this list
Read up on Mongo/NoSQL schema design. You always query for root documents, so if you end up needing only a small part of a large embedded structure, you need to re-examine your schema or you'll be pumping huge documents over the connection and doing app-side filtering.
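A hedged sketch of the bucket idea (collection name, field names, and the bucket size of 100 are assumptions):

  // One bucket holds up to 100 comments for a user, in insertion order.
  db.comment_buckets.insert({
    user_name: "maria",
    count: 2,
    comments: [
      { title: "goodnight",  comment: "23rr" },
      { title: "sleeptired", comment: "dsf3" }
    ]
  });
  // Append to a bucket that still has room; the app inserts a fresh bucket when no update matched.
  db.comment_buckets.update(
    { user_name: "maria", count: { $lt: 100 } },
    { $push: { comments: { title: "goodmorning", comment: "abc1" } }, $inc: { count: 1 } }
  );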
I'm not sure I understand your question, but it sounds like you have one record with many attributes.
record = {'attr1':1, 'attr2':2, etc.}
You can create an index on any single attribute or any combination of attributes. Also, you can create any number of indices on a single collection (MongoDB collection == MySQL table), whether or not each record in the collection has the attributes being indexed on.
edit: I don't know what you mean by 'global scope' within MongoDB. To insert any data, you must define a database and collection to insert that data into.
Database 'Example':
  Collection 'table1':
    records: {a:1, b:1, c:1}
             {a:1, b:2, d:1}
             {a:1, c:1, d:1}
    indices:
      db.table1.ensureIndex({a: 1, d: 1}) <- this will index on a, then by d; the fact that record 1 doesn't have an attribute 'd' doesn't matter, and this will increase query performance
edit 2:
Well, first of all, in your table here you are assigning multiple values to the attributes "name" and "value". MongoDB will ignore/overwrite the earlier instantiations of them, so only the final ones will be included in the collection.
I think you need to reconsider your schema here. You're trying to use it as a series of key value pairs, and it is not specifically suited for this (if you really want key value pairs, check out Redis).
Check out: http://www.jonathanhui.com/mongodb-query
Here's the deal. Let's suppose we have the following data schema in MongoDB:
items: a collection with large documents that hold some data (it's absolutely irrelevant what it actually is).
item_groups: a collection with documents that contain a list of items._id called item_groups.items plus some extra data.
So, these two are tied together with a Many-to-Many relationship. But there's one tricky thing: for a certain reason I cannot store items within item groups, so -- just as the title says -- embedding is not the answer.
The query I'm really worried about is intended to find some particular groups that contain some particular items (i.e. I've got a set of criteria for each collection). In fact it also has to say how many items within each found group fit the criteria (no matching items means the group is not returned).
The only viable solution I came up with so far is to use a Map/Reduce approach with a dummy reduce function:
function map() {
  // imagine that item_criteria came from the scope.
  // it's a mongodb query object.
  item_criteria._id = {$in: this.items};
  var group_size = db.items.count(item_criteria);
  // this group holds no relevant items, skip it
  if (group_size == 0) return;
  var key = this._id.str;
  var value = {size: group_size, ...};
  emit(key, value);
}
function reduce(key, values) {
  // since the map function emits each group just once,
  // values will always be a list with length=1
  return values[0];
}
db.runCommand({
  mapreduce: "item_groups",
  map: map,
  reduce: reduce,
  query: item_groups_criteria,
  scope: {item_criteria: item_criteria},
});
The problem line is:
item_criteria._id = {$in: this.items};
What if this.items.length == 5000 or even more? My RDBMS background cries out loud:
SELECT ... FROM ... WHERE whatever_id IN (over 9000 comma-separated IDs)
is definitely not a good way to go.
Thank you sooo much for your time, guys!
I hope the best answer will be something like "you're stupid, stop thinking in RDBMS style, use $its_a_kind_of_magicSphere from the latest release of MongoDB" :)
I think you are struggling with the separation of domain/object modeling from database schema modeling. I too struggled with this when trying out MongoDb.
For the sake of semantics and clarity, I'm going to substitute Groups with the word Categories.
Essentially, your theoretical model is a "many to many" relationship in that each Item can belong to many Categories, and each Category can in turn possess many Items.
This is best handled in your domain object modeling, not in DB schema, especially when implementing a document database (NoSQL). In your MongoDb schema you "fake" a "many to many" relationship, by using a combination of top-level document models, and embedding.
Embedding is hard to swallow for folks coming from SQL persistence back-ends, but it is an essential part of the answer. The trick is deciding whether or not it is shallow or deep, one-way or two-way, etc.
Top Level Document Models
Because your Category documents contain some data of their own and are heavily referenced by a vast number of Items, I agree with you that fully embedding them inside each Item is unwise.
Instead, treat both Item and Category objects as top-level documents. Ensure that your MongoDb schema allots a collection for each one so that each document has its own ObjectId.
The next step is to decide where and how much to embed... there is no right answer as it all depends on how you use it and what your scaling ambitions are...
Embedding Decisions
1. Items
At minimum, your Item objects should have a collection property for their categories. At the very least this collection should contain the ObjectId for each Category.
My suggestion would be to add to this collection the data you use most often when interacting with the Item...
For example, say I want to list a bunch of items on my web page in a grid and show the names of the categories they are part of. Obviously I don't need to know everything about each Category, but if I only have the ObjectId embedded, a second query would be necessary to get any detail about it at all.
Instead what would make most sense is to embed the Category's Name property in the collection along with the ObjectId, so that pulling back an Item can now display its category names without another query.
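A sketch of what such an Item document might look like (field names are illustrative, not prescribed by MongoDB):

  {
    _id: ObjectId("..."),
    name: "Some item",
    categories: [
      { _id: ObjectId("..."), name: "Category A" },
      { _id: ObjectId("..."), name: "Category B" }
    ]
  }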
The biggest thing to remember is that the key/value objects embedded in your Item that "represent" a Category do not have to match the real Category document model... It is not OOP or relational database modeling.
2. Categories
In reverse you might choose to leave embedding one-way, and not have any Item info in your Category documents... or you might choose to add a collection for Item data much like above (ObjectId, or ObjectId + Name)...
In this direction, I would personally lean toward having nothing embedded... more than likely, if I want Item information for my Category, I want lots of it, more than just a name... and deep-embedding a top-level document (Item) makes no sense. I would simply resign myself to querying for all the Items that hold the ObjectId of my Category in their collection of Categories.
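With the categories list sketched earlier on Item, that query is a one-liner (collection and field names remain assumptions):

  db.items.find({ "categories._id": ObjectId("...") })   // every Item tagged with that Category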
Phew... confusing for sure. The point is, you will have some data duplication and you will have to tweak your models to your usage for best performance. The good news is that that is what MongoDb and other document databases are good at...
Why not use the opposite design?
You are storing items and item_groups. If your first idea was to store items in item_group entries, then maybe the opposite is not a bad idea :-)
Let me explain:
In each item, you store the groups it belongs to. (You are in NoSQL; data duplication is OK!)
For example, let's say you store in each item entry a list called groups, so your items look like:
{ _id : ....
, name : ....
, groups : [ ObjectId(...), ObjectId(...),ObjectId(...)]
}
Then the map/reduce idea becomes quite powerful:
map = function() {
  var doc = this; // 'this' no longer refers to the document inside the forEach callback
  this.groups.forEach(function(groupKey) {
    emit(groupKey, [doc]);
  });
};
reduce = function(key, values) {
  // flatten the arrays of documents emitted for this group key
  return [].concat.apply([], values);
};
db.runCommand({
  mapreduce: "items",
  map: map,
  reduce: reduce,
  out: { inline: 1 },                     // return the results inline
  query: {_id: {$in: [...,....,.....] }}  // put your item ids here
});
You can add some parameters (finalize, for instance, to modify the output of the map/reduce), but this might help you.
Of course you need another collection where you store the details of the item_groups if you need them, but in some cases (if this information about item_groups does not exist, doesn't change, or you don't care about having the most up-to-date version of it) you don't need it at all!
Does that give you a hint about a solution to your problem?
I have some documents which have 2 sets of attributes: tag and lieu. Here is an example of what they look like:
{
title: "doc1",
tag: ["mountain", "sunny", "forest"],
lieu: ["france", "luxembourg"]
},
{
title: "doc2",
tag: ["sunny", "lake"],
lieu: ["france", "germany"]
},
{
title: "doc3",
tag: ["sunny"],
lieu: ["belgium", "luxembourg", "france"]
}
How can I map/reduce and query my DB to be able to retrieve only the intersection of documents that match these criteria:
lieu: ["france", "luxembourg"]
tag: ["sunny"]
Returns: doc1 and doc3
I cannot figure out any format map/reduce could return that would let me do it with only one query. What I am doing now is: emit every lieu/tag as a key, with the related documents' ids as values, then reduce so that every key has an array of doc ids. Then from my app I query this view, do the intersection of the documents on the app side (only keep the docs that have all 3 keys: luxembourg, france and sunny), and then re-query CouchDB with these doc ids to retrieve the actual documents. I feel that's not the right/best way to do it?
I am now using lists to do the intersection job, and it works quite well. But I still need to do another request to get the documents using the document ids. Any idea what I could do differently to retrieve the documents directly?
Thank you!
This is going to be awkward. The basic idea is that you have to build a view where the map function emits every possible combination of tags and countries as the key, and there's no reduce function. This way, looking for ["france","luxembourg"] would return all documents that emitted that key (and therefore are in the intersection), because views without a reduce function return the emitting document for every entry. This way, you only have to do one request.
This causes a lot of emits to happen, but you can lower that number by sorting the tags both when emitting and when searching (automatically turn ["luxembourg","france"] into ["france","luxembourg"]), and by taking advantage of the ability of CouchDB to query prefixes (this means that emitting ["belgium","france","luxembourg"] will let you match searches for ["belgium"] and ["belgium","france"]).
In your example above, for the countries, you would only emit:
// doc 1
emit(["luxembourg"],null);
emit(["france","luxembourg"],null);
// doc 2
emit(["germany"],null);
emit(["france","germany"],null);
// doc 3
emit(["luxembourg"],null);
emit(["belgium","luxembourg"],null);
emit(["france","luxembourg"],null);
emit(["belgium","france","luxembourg"],null);
Anyway, for complex queries like this one, consider looking into a CouchDB-Lucene combination.