Index strategy for queries with dynamic match criteria - mongodb

I have a collection which is going to hold machine data as well as mobile data. The data is captured per channel and kept at a single level, with no embedded objects. The structure is as follows:
{
"Id": ObjectId("544e4b0ae4b039d388a2ae3a"),
"DeviceTypeId":"DeviceType1",
"DeviceTypeParentId":"Parent1",
"DeviceId":"D1",
"ChannelName": "Login",
"Timestamp": ISODate("2013-07-23T19:44:09Z"),
"Country": "India",
"Region": "Maharashtra",
"City": "Nasik",
"Latitude": 13.22,
"Longitude": 56.32,
// and 10 - 15 more fields
}
Most of the queries are aggregation queries, used for an analytics dashboard and real-time analysis. The $match pipeline stage is as follows:
{$match:{"DeviceTypeId":{"$in":["DeviceType1"]},"Timestamp":{"$gte":ISODate("2013-07-23T00:00:00Z"),"$lt":ISODate("2013-08-23T00:00:00Z")}}}
or
{$match:{"DeviceTypeParentId":{"$in":["Parent1"]},"Timestamp":{"$gte":ISODate("2013-07-23T00:00:00Z"),"$lt":ISODate("2013-08-23T00:00:00Z")}}}
Many of my DAL-layer find and findOne queries also filter mostly on DeviceType or DeviceTypeParentId.
The collection is huge and growing. I have used compound indexes to support these queries; the indexes are as follows:
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "DB.channel_data"
},
{
"v" : 1,
"key" : {
"DeviceType" : 1,
"Timestamp" : 1
},
"name" : "DeviceType_1_Timestamp_1",
"ns" : "DB.channel_data"
},
{
"v" : 1,
"key" : {
"DeviceTypeParentId" : 1,
"Timestamp" : 1
},
"name" : "DeviceTypeParentId_1_Timestamp_1",
"ns" : "DB.channel_data"
}
]
Now we are going to add support for match criteria on DeviceId. Following the same strategy I used for DeviceType and DeviceTypeParentId does not feel right, because with my current approach I would be creating many indexes that are almost all the same, and all huge.
So is there a better way to do the indexing? I have read a bit about index intersection, but I'm not sure how it would help.
If I am following a wrong approach anywhere, please point it out, as this is my first project and my first time using MongoDB.

Those indexes all look appropriate for your queries, including the new one you're proposing. Three separate indexes supporting your three kinds of queries are the overall best option in terms of fast queries. You could put indexes on each field and let the planner use index intersection, but it won't be as good as the compound indexes. The indexes are not the same since they support different queries.
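For reference, a minimal sketch of what those three compound indexes would look like, assuming the DB.channel_data collection and the field names used in the $match examples above:
// one compound index per query shape; field names taken from the question's $match stages
db.channel_data.ensureIndex({ "DeviceTypeId" : 1, "Timestamp" : 1 })
db.channel_data.ensureIndex({ "DeviceTypeParentId" : 1, "Timestamp" : 1 })
// the new index proposed for the DeviceId match criteria
db.channel_data.ensureIndex({ "DeviceId" : 1, "Timestamp" : 1 })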
I think the real question is: is the (apparently) large memory footprint of the indexes actually a problem at this point? Do you see a lot of page faults because indexes and data have to be paged in from disk?

Related

dynamic size of subdocument mongodb

I'm using MongoDB and Mongoose for my web application. The web app is used for registering for swimming competitions, and each competition can have any number of races. My data structure as of now:
{
"_id": "1",
"name": "Utmanaren",
"location": "town",
"startdate": "20150627",
"enddate": "20150627",
"race" : {
"gender" : "m",
"style" : "freestyle",
"length" : "100"
}
}
Doing this, I need to determine and define the number of races for every competition up front. A solution I tried is having a separate document with an id indicating which competition a race belongs to, like below:
{
"belongsTOId" : "1",
"gender" : "m",
"style" : "freestyle",
"length" : "100"
}
{
"belongsTOId" : "1",
"gender" : "f",
"style" : "butterfly",
"length" : "50"
}
Is there a way of creating and defining a dynamic number of races as subdocuments while using MongoDB?
Thanks!
You basically have two approaches to modelling your data structure: you can design a schema that either references or embeds the race documents.
Let's consider the following example, which maps the relationship between a swimming competition and its multiple races. It demonstrates the advantage of embedding over referencing when you need to view many data entities in the context of another. In this one-to-many relationship between competition and race data, the competition has multiple race entities:
// db.competition schema
{
"_id": 1,
"name": "Utmanaren",
"location": "town",
"startdate": "20150627",
"enddate": "20150627",
"races": [
{
"gender" : "m",
"style" : "freestyle",
"length" : "100"
},
{
"gender" : "f",
"style" : "butterfly",
"length" : "50"
}
]
}
With the embedded data model, your application can retrieve the complete swimming competition information with just one query. This design has other merits as well, one of them being data locality: since MongoDB stores data contiguously on disk, putting all the data you need in one document means spinning disks spend less time seeking to a particular location. The other advantage of embedded documents is atomicity and isolation when writing data. To illustrate this, say you want to remove a competition which has a race "style" property with value "butterfly"; this can be done with one single (atomic) operation:
db.competition.remove({"races.style": "butterfly"});
For more details on data modelling in MongoDB, please read the docs Data Modeling Introduction, specifically Model One-to-Many Relationships with Embedded Documents
The other design option is to reference documents, following a normalized schema where the race documents contain a reference to the competition document:
// db.race schema
{
"_id": 1,
"competition_id": 1,
"gender": "m",
"style": "freestyle",
"length": "100"
},
{
"_id": 2,
"competition_id": 1,
"gender": "f",
"style": "butterfly",
"length": "50"
}
The above approach gives increased flexibility in performing queries. For instance, retrieving all child race documents whose parent competition has id 1 is straightforward: simply create a query against the race collection:
db.race.find({"competition_id": 1});
The above normalized schema using the document-reference approach also has an advantage when you have one-to-many relationships with very unpredictable arity. If you have hundreds or thousands of race documents per competition, embedding has serious drawbacks as far as space constraints are concerned, because the larger the document, the more RAM it uses, and MongoDB documents have a hard size limit of 16MB.
On the other hand, if your application frequently retrieves the race data together with the competition information, it needs to issue multiple queries to resolve the references.
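For example, a rough sketch of resolving the reference with two round trips in the shell, reusing the collection and field names from the referenced schema above:
// first fetch the competition, then its races via the reference field
var competition = db.competition.findOne({ "_id": 1 });
var races = db.race.find({ "competition_id": competition._id }).toArray();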
The general rule of thumb is that if your application's query pattern is well known and the data tends to be accessed in only one way, an embedded approach works well. If your application queries data in many ways, or you are unable to anticipate the query patterns, a more normalized document-referencing model is more appropriate.
Ref:
MongoDB Applied Design Patterns: Practical Use Cases with the Leading NoSQL Database By Rick Copeland
You basically want to update the data, so you should upsert it, which is essentially an update keyed on the subdocument.
Keep an array of keys in the main document.
Insert the sub-document and add the key to the list or update the list.
To push a single item into the field:
db.yourcollection.update( { "_id" : "1" }, { $push: { "races": { "belongsTOId" : "1" , "gender" : "f" , "style" : "butterfly" , "length" : "50"} } } );
To push multiple items into the field (this allows duplicates in the field):
db.yourcollection.update( { "_id" : "1" }, { $push: { "races": { $each: [ { "belongsTOId" : "1" , "gender" : "f" , "style" : "butterfly" , "length" : "50"}, { "belongsTOId" : "2" , "gender" : "m" , "style" : "horse" , "length" : "70"} ] } } } );
To push multiple items without duplicates:
db.yourcollection.update( { "_id" : "1" }, { $addToSet: { "races": { $each: [ { "belongsTOId" : "1" , "gender" : "f" , "style" : "butterfly" , "length" : "50"}, { "belongsTOId" : "2" , "gender" : "m" , "style" : "horse" , "length" : "70"} ] } } } );
$pushAll has been deprecated since version 2.4, so we use $each with $push instead of $pushAll.
While using $push you can also sort and slice the array with the $sort and $slice modifiers; check the MongoDB manual for details.
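As a sketch (reusing the competition document with _id "1" from the earlier example; note that length is stored as a string in these examples, so the sort is lexicographic):
db.yourcollection.update(
    { "_id" : "1" },
    { $push: { "races": {
        $each: [ { "belongsTOId" : "1" , "gender" : "m" , "style" : "backstroke" , "length" : "200" } ],
        $sort: { "length" : 1 },   // keep the embedded races sorted by length
        $slice: -10                // and cap the array at the last 10 entries
    } } }
);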

Conflict in choosing the perfect index in MongoDB query optimizer

My problem is related to the query optimizer of MongoDB and how it picks the best index to use. I realized that under some conditions the optimizer doesn't pick the best existing index and instead continues using one that is close enough.
Consider having a simple dataset like:
{ "_id" : 1, "item" : "f1", "type" : "food", "quantity" : 500 }
{ "_id" : 2, "item" : "f2", "type" : "food", "quantity" : 100 }
{ "_id" : 3, "item" : "p1", "type" : "paper", "quantity" : 200 }
{ "_id" : 4, "item" : "p2", "type" : "paper", "quantity" : 150 }
{ "_id" : 5, "item" : "f3", "type" : "food", "quantity" : 300 }
{ "_id" : 6, "item" : "t1", "type" : "toys", "quantity" : 500 }
{ "_id" : 7, "item" : "a1", "type" : "apparel", "quantity" : 250 }
{ "_id" : 8, "item" : "a2", "type" : "apparel", "quantity" : 400 }
{ "_id" : 9, "item" : "t2", "type" : "toys", "quantity" : 50 }
{ "_id" : 10, "item" : "f4", "type" : "food", "quantity" : 75 }
and then want to issue a query like the following:
db.inventory.find({"type": "food","quantity": {$gt: 50}})
I go ahead and create the following index:
db.inventory.ensureIndex({"quantity" : 1, "type" : 1})
The statistics from cursor.explain() confirm that this index performs as follows: ("n" : 4, "nscannedObjects" : 4, "nscanned" : 9). It scanned more index keys than the number of matching documents. Considering the fact that "type" is a more selective attribute matched by equality, it is surely better to create the following index instead:
db.inventory.ensureIndex({ "type" : 1, "quantity" : 1})
The statistics also confirm that this index performs better: ("n" : 4, "nscannedObjects" : 4, "nscanned" : 4), meaning the second index scans exactly as many index keys as there are matched documents.
However, I observed that if I don't delete the first index, the query optimizer continues using the first index, even though the better index has been created.
According to the documentation, every time a new index is created the query optimizer considers it when building the query plan, but I don't see that happening here.
Can anyone explain how the query optimizer really works?
Considering the fact that "type" is a more selective attribute
Index selectivity is a very important aspect, but in this case, note that you're using an equality query on type and a range query on quantity, which is the more compelling reason to swap the order of the index keys, even if selectivity were lower.
However, I observed if I don't delete the first index, the query optimizer continues using the first index, although the better index is got created. [...]
The MongoDB query optimizer is largely statistical. Unlike most SQL engines, MongoDB doesn't attempt to reason what could be a more or less efficient index. Instead, it simply runs different queries in parallel from time to time and remembers which one was faster. The faster strategy will then be used. From time to time, MongoDB will perform parallel queries again and re-evaluate the strategy.
One problem of this approach (and maybe the cause of the confusion) is that there's probably not a big difference with such a tiny dataset - it's often better to simply scan elements than to use any kind of index or search strategy if the data isn't large compared to the prefetch / page size / cache size and pipeline length. As a rule of thumb, simple lists of up to maybe 100 or even 1,000 elements often don't benefit from indexing at all.
As with anything larger, designing indexes requires some forward thinking. The goals are:
Efficiency - fast read / write operations
Selectivity - minimize records scanning
Other requirements - e.g. how are sorts handled?
Selectivity is the primary factor that determines how efficiently an index can be used. Ideally, the index enables us to select only those records required to complete the result set, without the need to scan a substantially larger number of index keys (or documents) in order to complete the query. Selectivity determines how many records any subsequent operations must work with. Fewer records means less execution time.
Think about which queries the application will use most frequently. Use the explain command and look specifically at the executionStats:
nReturned
totalKeysExamined - is the number of keys examined much larger than the number of returned documents? If so, we need an index that reduces it.
Look at queryPlanner and rejectedPlans. Look at the winningPlan, whose keyPattern shows which keys need to be indexed. Whenever we see stage: SORT, it means the sort key is not part of the index, or the database was not able to return documents in the sort order specified in the query, and it had to perform an in-memory sort. If we add the key the sort happens on to the index, we will see that the winningPlan's stage changes from SORT to FETCH. The keys in the index also need to be ordered according to the range of data they cover, e.g. a class field will have fewer distinct values than a student field. This involves a trade-off: executionTimeMillis will drop sharply while docsExamined and keysExamined become relatively a little larger, but the trade-off is worth making.
There is also a way to force queries to use a particular index, but this is not recommended as part of a deployment. The command in question is .hint(), which can be chained after find() or sort(). It requires either the actual index name or the shape (key pattern) of the index.
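For example, a sketch forcing the inventory query from above onto a specific index, either by shape or by name (the name assumes the default generated index name):
// by index shape
db.inventory.find({ "type": "food", "quantity": { $gt: 50 } }).hint({ "type": 1, "quantity": 1 })
// by index name
db.inventory.find({ "type": "food", "quantity": { $gt: 50 } }).hint("type_1_quantity_1")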
In general, when building compound indexes for:
- equality field: field on which queries will perform an equality test
- sort field: field on which queries will specify a sort
- range field: field on which queries perform a range test
We should keep the following rules of thumb in mind (a worked example follows the list):
Equality fields before range fields
Sort fields before range fields
Equality fields before sort fields
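For example, a hypothetical variant of the inventory query above with an equality test on type, a sort on item (added here purely for illustration), and a range test on quantity would, by these rules, be served by an index whose keys follow that order:
// equality (type), then sort (item), then range (quantity)
db.inventory.ensureIndex({ "type": 1, "item": 1, "quantity": 1 })
db.inventory.find({ "type": "food", "quantity": { $gt: 50 } }).sort({ "item": 1 })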

Why does MongoDB show slower performance than MySQL?

My system config: OS X, 8GB RAM, 2.5GHz i5.
Both database tables have 1 million rows and the same data. I am executing the same aggregate query against both databases.
db.temp.aggregate([
{ "$match": { ITEMTYPE: 'like' } },
{ "$group" : {_id :{ cust_id2: "$ActorID", cust_id: "$ITEMTYPE"}, numberofActorID : {"$sum" : 1}}},
{ "$sort": { numberofActorID: -1 } },
{ "$limit" : 5 }
]);
I had created a covering index:
db.temp.ensureIndex( { "ITEMTYPE": 1, "ActorID": 1 } );
and the selectivity of "like" is 80%.
The time results are:
sqlWithout sqlWithIndex mongoWithout mongoWithIndex
958 644 3043 4243
I didn't tune any MongoDB system parameters (not even sharding).
Please suggest why MongoDB is slower and how I can improve this.
{
"stages" : [
{
"$cursor" : {
"query" : {
"ITEMTYPE" : "like"
},
"fields" : {
"ActorID" : 1,
"ITEMTYPE" : 1,
"_id" : 0
},
"plan" : {
"cursor" : "BtreeCursor ",
"isMultiKey" : false,
"scanAndOrder" : false,
"indexBounds" : {
"ITEMTYPE" : [
[
"like",
"like"
]
],
"ActorID" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"allPlans" : [
{
"cursor" : "BtreeCursor ",
"isMultiKey" : false,
"scanAndOrder" : false,
"indexBounds" : {
"ITEMTYPE" : [
[
"like",
"like"
]
],
"ActorID" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
}
}
]
}
}
},
{
"$group" : {
"_id" : {
"cust_id2" : "$ActorID",
"cust_id" : "$ITEMTYPE"
},
"numberofActorID" : {
"$sum" : {
"$const" : 1
}
}
}
},
{
"$sort" : {
"sortKey" : {
"numberofActorID" : -1
},
"limit" : NumberLong(5)
}
}
],
"ok" : 1
}
Structure of JSON
{ "_id" : ObjectId("5492ba51ff16cd9391a2c02d"), "POSTDBID" : 231041, "ITEMID" : 231041, "ITEMTYPE" : "post", "ITEMCREATIONDATE" : ISODate("2009-02-28T20:37:02Z"), "POSVal" : 0.327282, "NEGVal" : 0.315738, "NEUVal" : 0.356981, "LabelSentiment" : "Neutral", "ActorID" : NumberLong(1179444542), "QuarterLabel" : "2009-1\r", "rowid" : 2 }
Note: Some of the things I mention are simplified for the sake of this answer. However, to the best of my knowledge, they can be applied as described.
Misconceptions
First of all: aggregations can't utilize covered queries:
Even when the pipeline uses an index, aggregation still requires access to the actual documents; i.e. indexes cannot fully cover an aggregation pipeline.
(see the Aggregation documentation for details.)
Second: Aggregations are not meant to be used as real time queries
The aggregation pipeline provides an alternative to map-reduce and may be the preferred solution for aggregation tasks where the complexity of map-reduce may be unwarranted.
You would not want to use map/reduce for real time processing, would you? ;) While sometimes aggregations can be so fast that they can be used as real time queries, it is not the intended purpose. Aggregations are meant for precalculation of statistics, if you will.
Improvements on the aggregation
You might want to use a $project phase right after the match to reduce the data passed into the group phase to only what is processed there:
{ $project: { 'ActorID':1, 'ITEMTYPE':1 } }
This might improve the processing.
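Putting it together, a sketch of the pipeline with that extra stage, using the collection and field names from the question:
db.temp.aggregate([
    { "$match": { "ITEMTYPE": "like" } },
    { "$project": { "ActorID": 1, "ITEMTYPE": 1, "_id": 0 } },  // pass only the fields the $group stage needs
    { "$group": { "_id": { "cust_id2": "$ActorID", "cust_id": "$ITEMTYPE" }, "numberofActorID": { "$sum": 1 } } },
    { "$sort": { "numberofActorID": -1 } },
    { "$limit": 5 }
]);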
Hardware impact
As for your description, I assume you use some sort of MacBook. OS X and the programs you have running require quite some RAM. MongoDB, on the other hand, tries to keep as much of its indexes and the so-called working set (most recently accessed documents, to keep it simple) in RAM as it can. It is designed that way; it is supposed to run on one or more dedicated instances. You might want to use MMS to check whether you have a high number of page faults – which I'd expect. MySQL is much more conservative and less dependent on free RAM, although it will be outperformed by MongoDB when a certain amount of resources is available (conceptually, because the two DBMSs are very hard to compare reasonably), simply because MySQL is not optimized for situations where a lot of RAM is available. We haven't even touched on resource competition between various processes here, which is a known performance killer for MongoDB, too.
Second, in case you have a spinning disk: MongoDB has – for various reasons – sub-par read performance on spinning disks, the main problem being seek latency. Usually, the disks in MacBooks run at 5400rpm, which further increases seek latency, worsening the problem and making it a real pain in the neck for aggregations, which – as shown – access a lot of documents. The way the MongoDB storage engine works, two documents which follow each other in an index might well be saved at two entirely different locations, even in different data files. (This is because MongoDB is heavily write-optimized, so documents are written at the first position providing enough space for the document and its padding.) So depending on the number of documents in your collection, you can have a lot of disk seeks.
MySQL, on the other hand, is rather read optimized.
Data modelling
You did not show us your data model, but sometimes small changes in the model have a huge impact on performance. I'd suggest doing a peer review of the data model.
Conclusion
You are comparing two DBMSs that are designed and optimized for diametrically opposed use cases, on an environment that is pretty much the opposite of what one of these systems was specifically designed for, in a use case it wasn't optimized for, and you are expecting real-time results from a tool that isn't made for that. That might be the reason why MongoDB is outperformed by MySQL here. Side note: you didn't show us the corresponding (My)SQL query.

MongoDb performance slow even using index

We are trying to build a notification application for our users with Mongo. We created one MongoDB instance on a Xen VM with 10GB RAM, a 150GB 15K RPM SAS HDD, and a 4-core 2.9GHz Intel Xeon.
DB schema :-
{
"_id" : ObjectId("5178c458e4b0e2f3cee77d47"),
"userId" : NumberLong(1574631),
"type" : 2,
"text" : "a user connected to B",
"status" : 0,
"createdDate" : ISODate("2013-04-25T05:51:19.995Z"),
"modifiedDate" : ISODate("2013-04-25T05:51:19.995Z"),
"metadata" : "{\"INVITEE_NAME\":\"2344\",\"INVITEE\":1232143,\"INVITE_SENDER\":1574476,\"INVITE_SENDER_NAME\":\"123213\"}",
"opType" : 1,
"actorId" : NumberLong(1574630),
"actorName" : "2344"
}
DB stats :-
db.stats()
{
"db" : "UserNotificationDev2",
"collections" : 3,
"objects" : 78597973,
"avgObjSize" : 489.00035699393925,
"dataSize" : 38434436856,
"storageSize" : 41501835008,
"numExtents" : 42,
"indexes" : 2,
"indexSize" : 4272393328,
"fileSize" : 49301946368,
"nsSizeMB" : 16,
"dataFileVersion" : {
"major" : 4,
"minor" : 5
},
"ok" : 1
}
index :- userid and _id
We are trying to select the latest 21 notifications for one user.
db.userNotification.find({ "userId" : 53 }).limit(21).sort({ "_id" : -1 });
but this query is taking too much time.
Fri Apr 26 05:39:55.563 [conn156] query UserNotificationDev2.userNotification query: { query: { userId: 53 }, orderby: { _id: -1 } } cursorid:225321382318166794 ntoreturn:21 ntoskip:0 nscanned:266025 keyUpdates:0 numYields: 2 locks(micros) r:4224498 nreturned:21 reslen:10295 2581ms
Even count is taking a huge amount of time.
Fri Apr 26 05:47:46.005 [conn159] command UserNotificationDev2.$cmd command: { count: "userNotification", query: { userId: 53 } } ntoreturn:1 keyUpdates:0 numYields: 11 locks(micros) r:9753890 reslen:48 5022ms
Are we doing something wrong in the query?
Please help!!!
Also, please suggest if our schema is not right for storing user notifications. We tried embedding notifications, i.e. a user document with that user's notifications nested under it, but the document size limit restricted us to storing only ~50k notifications, so we changed to this.
You are querying by userId but not indexing it anywhere. My suggestion is to create an index on { "userId" : 1, "_id" : -1 }. This will create an index tree that starts with userId, then _id, which is almost exactly what your query is doing. This is the simplest/most flexible way of speeding up your query.
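A minimal sketch of that index and the query it then supports, using the collection name from the question:
// userId first for the equality match, then _id descending to match the sort
db.userNotification.ensureIndex({ "userId" : 1, "_id" : -1 })
// the original query can now walk the index in order and stop after 21 entries
db.userNotification.find({ "userId" : 53 }).sort({ "_id" : -1 }).limit(21)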
Another, more memory-efficient, approach is to store your userId and timestamp as a string in _id, like _id : "USER_ID:DATETIME". Ex:
{_id : "12345:20120501123000"}
{_id : "15897:20120501124000"}
{_id : "15897:20120501125000"}
Notice _id is a string, not an ObjectId. Then your query above becomes a regex:
db.userNotification.find({ "_id" : /^53:/ }).limit(21).sort({ "_id" : -1 });
As expected, this will return all notifications for userId 53 in descending order. The memory efficient part is two fold:
You only need one index field. (Indexes compete with data for memory and are often several gigs in size)
If your queries mostly fetch newer data, right-balanced indexes keep the portion you work with most often in memory even when the indexes are too large to fit in their entirety.
Re: count. Count does take time because it scans through the entire collection.
Re: your schema. I'm guessing that for your data set this is the best way to utilize your memory. When objects get large and your queries scan across multiple objects, they need to be loaded into memory in their entirety (I've had the OOM killer kill my mongod instance when I sorted 2000 2MB objects on a 2GB RAM machine). With large objects your RAM usage will fluctuate greatly (not to mention they are limited up to a point). With your current schema mongo will have a much easier time loading only the data you're querying, resulting in less swapping and more consistent memory usage patterns.
One option is to try sharding; then you can distribute notifications evenly between shards, so when you need to select you will scan a smaller subset of data. You need to decide, however, what your shard key will be. To me it looks like operationType or userName, but I do not know your data well enough. Another thing: why do you sort by _id?
I have just tried to replicate your problem by creating 140,000,000 inserts in userNotifications.
Without an index on userId I got responses of 3-4 seconds. After I created an index on userId, the time dropped to almost instant responses.
db.userNotifications.getIndexes()
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"ns" : "test.userNotifications",
"name" : "id"
},
{
"v" : 1,
"key" : {
"userId" : 1
},
"ns" : "test.userNotifications",
"name" : "userId_1"
}
]
Another thing: when your select happens, is the system constantly writing to the Mongo userNotification collection? Mongo locks the whole collection if that happens. If that is the case, I would split reads and writes between master and slave (see replication) and also do some sharding. Btw, what language do you use for your app?
The most important thing is that you currently don't seem to have an index to support the query for a user's latest notifications.
You need a compound index on userId, _id. This will support queries that filter only by userId, and it will also be used by queries that filter by userId and sort/limit by _id.
When you add the {userId:1, _id:-1} index, don't forget to drop the index on just userId, as it will become redundant.
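As a sketch (assuming the single-field index was created with its default name userId_1):
db.userNotification.ensureIndex({ "userId" : 1, "_id" : -1 })
// the plain userId index is now a redundant prefix of the compound index
db.userNotification.dropIndex("userId_1")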
As far as count() goes, make sure you are using 2.4.3 (the latest version); there were significant improvements in how count() uses indexes, which resulted in much better performance.

Suitability of MongoDB for hierarchial type queries

I have a particular data manipulation requirement that I have worked out how to do in SQL Server and PostgreSQL. However, I'm not too happy with the speed, so I am investigating MongoDB.
The best way to describe the query is as follows. Picture the hierarchical data of the USA: Country, State, County, City. Let's say a particular vendor can service the whole of California. Another can perhaps service only Los Angeles. There are potentially hundreds of thousands of vendors and they all can service from some point(s) in this hierarchy down. I am not confusing this with Geo - I am using this to illustrate the need.
Using recursive queries, it is quite simple to get a list of all vendors who could service a particular user. If he were in say Pasadena, Los Angeles, California, we would walk up the hierarchy to get the applicable IDs, then query back down to find the vendors.
I know this can be optimized. Again, this is just a simple query example.
I know MongoDB is a document store. That suits other needs I have very well. The question is how well suited is it to the query type I describe? (I know it doesn't have joins - those are simulated).
I get that this is a "how long is a piece of string" question. I just want to know if anyone has any experience with MongoDB doing this sort of thing. It could take me quite some time to go from 0 to tested, and I'm looking to save time if MongoDB is not suited to this.
EXAMPLE
A local movie store "A" can supply Blu-Rays in Springfield. A chain store "B" with state-wide distribution can supply Blu-Rays to all of IL. And a download-on-demand store "C" can supply to all of the US.
If we wanted to get all applicable movie suppliers for Springfield, IL, the answer would be [A, B, C].
In other words, there are numerous vendors attached at differing levels on the hierarchy.
I realize this question was asked nearly a year ago, but since then MongoDB has an officially supported solution for this problem, and I just used their solution. Refer to their documentation here: https://docs.mongodb.com/manual/tutorial/model-tree-structures-with-materialized-paths/
The concept relating closest to your question is named "partial path."
While it may feel a bit heavy to embed ancestor data, this approach is the most suitable way to solve your problem in MongoDB. The only pitfall I've experienced so far is that, if you're storing all of this in a single document, you can hit the (as of this time) 16MB document size limit when working with enough data (although I can only see this happening if you're using this structure to track user referrals [which could reach millions] rather than US cities [which number upwards of 26,000 according to the latest US Census]).
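A minimal sketch of the materialized-path shape from that tutorial, applied to the location hierarchy in the question (the places collection name and the sample values are illustrative assumptions):
// each document stores its ancestor chain as a single delimited string
db.places.save({ "_id" : "US", "path" : null })
db.places.save({ "_id" : "IL", "path" : ",US," })
db.places.save({ "_id" : "Springfield", "path" : ",US,IL," })
db.places.ensureIndex({ "path" : 1 })
// everything under IL: an anchored regex on the path prefix can use the index
db.places.find({ "path" : /^,US,IL,/ })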
References:
http://www.mongodb.org/display/DOCS/Schema+Design
http://www.census.gov/geo/www/gazetteer/places2k.html
Modifications:
Replaced link: http://www.mongodb.org/display/DOCS/Trees+in+MongoDB
Note that this question was also asked on the Google group. See http://groups.google.com/group/mongodb-user/browse_thread/thread/5cd5edd549813148 for that discussion.
One option is to use an array key. You can store the hierarchy as an
array of values (for example ['US','CA','Los Angeles']). Then you can
query against records based on individual elements in that array key.
For example:
First, store some documents with the array value representing the
hierarchy
> db.hierarchical.save({ location: ['US','CA','LA'], name: 'foo'} )
> db.hierarchical.save({ location: ['US','CA','SF'], name: 'bar'} )
> db.hierarchical.save({ location: ['US','MA','BOS'], name: 'baz'} )
Make sure we have an index on the location field so we can perform
fast queries against its values
> db.hierarchical.ensureIndex({'location':1})
Find all records in California
> db.hierarchical.find({location: 'CA'})
{ "_id" : ObjectId("4d9f69cbf88aea89d1492c55"), "location" : [ "US", "CA", "LA" ], "name" : "foo" }
{ "_id" : ObjectId("4d9f69dcf88aea89d1492c56"), "location" : [ "US", "CA", "SF" ], "name" : "bar" }
Find all records in Massachusetts
> db.hierarchical.find({location: 'MA'})
{ "_id" : ObjectId("4d9f6a21f88aea89d1492c5a"), "location" : [ "US", "MA", "BOS" ], "name" : "baz" }
Find all records in the US
> db.hierarchical.find({location: 'US'})
{ "_id" : ObjectId("4d9f69cbf88aea89d1492c55"), "location" : [ "US", "CA", "LA" ], "name" : "foo" }
{ "_id" : ObjectId("4d9f69dcf88aea89d1492c56"), "location" : [ "US", "CA", "SF" ], "name" : "bar" }
{ "_id" : ObjectId("4d9f6a21f88aea89d1492c5a"), "location" : [ "US", "MA", "BOS" ], "name" : "baz" }
Note that in this model, your values in the array would need to be
unique. So for example, if you had 'springfield' in different states,
then you would need to do some extra work to differentiate.
> db.hierarchical.save({location:['US','MA','Springfield'], name: 'one' })
> db.hierarchical.save({location:['US','IL','Springfield'], name: 'two' })
> db.hierarchical.find({location: 'Springfield'})
{ "_id" : ObjectId("4d9f6b7cf88aea89d1492c5b"), "location" : [ "US", "MA", "Springfield"], "name" : "one" }
{ "_id" : ObjectId("4d9f6b86f88aea89d1492c5c"), "location" : [ "US", "IL", "Springfield"], "name" : "two" }
You can overcome this by using the $all operator and specifying more
levels of the hierarchy. For example:
> db.hierarchical.find({location: { $all : ['US','MA','Springfield']} })
{ "_id" : ObjectId("4d9f6b7cf88aea89d1492c5b"), "location" : [ "US", "MA", "Springfield"], "name" : "one" }
> db.hierarchical.find({location: { $all : ['US','IL','Springfield']} })
{ "_id" : ObjectId("4d9f6b86f88aea89d1492c5c"), "location" : [ "US", "IL", "Springfield"], "name" : "two" }