Let's say I have a collection with documents that looks like this (just simplified example, but it should show the scheme):
> db.data.find()
{ "_id" : ObjectId("4e9c1f27aa3dd60ee98282cf"), "type" : "A", "value" : 11 }
{ "_id" : ObjectId("4e9c1f33aa3dd60ee98282d0"), "type" : "A", "value" : 58 }
{ "_id" : ObjectId("4e9c1f40aa3dd60ee98282d1"), "type" : "B", "value" : 37 }
{ "_id" : ObjectId("4e9c1f50aa3dd60ee98282d2"), "type" : "B", "value" : 1 }
{ "_id" : ObjectId("4e9c1f56aa3dd60ee98282d3"), "type" : "A", "value" : 85 }
{ "_id" : ObjectId("4e9c1f5daa3dd60ee98282d4"), "type" : "B", "value" : 12 }
Now I need to collect some statistics on that collection. For example:
db.data.mapReduce(function(){
emit(this.type,this.value);
},function(key,values){
var total = 0;
for(i in values) {total+=values[i]};
return total;
},
{out:'stat'})
will collect totals in 'stat' collection.
> db.stat.find()
{ "_id" : "A", "value" : 154 }
{ "_id" : "B", "value" : 50 }
At this point everything is perfect, but I've stuck on the next move:
'data' collection is constantly updated with new data (old documents stays unchanged, only inserts, no updates)
I would like to periodically update 'stat' collection, but do not want to query the whole 'data' collection every time, so I choose to run incremental mapReduce
It may seems good to just update 'stat' collection on every insert in 'data' collection and do no use mapReduce, but the real case is more complex than this example and I would like to get statistics only on demand.
To do this I should be able to query only documents, that was added after my last mapReduce
As far as I understand I cannot rely on ObjectId property, just store the last one and later select every document with ObjectId > stored because ObjectId is not equal autoincrement ids in SQL databases (for example different shards will produce different ObjectIds).
I can change ObjectId generator, but not sure how to do it better in sharded environment
So the question is:
Is it any way to select only documents, added after the last mapReduce to run incremental mapReduce or may be there is another strategy to update statistic data on constantly growing collection?
You can cache the time and use it as a barrier for your next incremental map-reduce.
We're testing this at work and it seems to be working. Correct me if I'm wrong, but you can't safely do map-reduce while an insert is happening across shards. The versions become inconsistent and your map-reduce operation will fail. (If you find a solution to this, please do let me know! :)
We use bulk-inserts instead, once every 5 minutes. Once all the bulk inserts are done, we run the map-reduce like this (in Python):
m = Code(<map function>)
r = Code(<reduce function>)
# pseudo code
end = last_time + 5 minutes
# Use time and optionally any other keys you need here
q = bson.SON([("date" : {"$gte" : last_time, "$lt" : end})])
collection.map_reduce(m, r, out=out={"reduce": <output_collection>}, query=q)
Note that we used reduce and not merge, because we don't want to override what we had before; we want to combine the old results and the new result with the same reduce function.
You can get just the time portion of the ID using _id.getTime() (from: http://api.mongodb.org/java/2.6/org/bson/types/ObjectId.html). That should be sortable across all shards.
EDIT: Sorry, that was the java docs... The JS version appears to be _id.generation_time.in_time_zone(Time.zone), from http://mongotips.com/b/a-few-objectid-tricks/
I wrote up a complete pymongo-based solution that uses incremental map-reduce and caches the time, and expects to run in a cron job. It locks itself so two can't run concurrently:
https://gist.github.com/2233072
""" This method performs an incremental map-reduce on any new data in 'source_table_name'
into 'target_table_name'. It can be run in a cron job, for instance, and on each execution will
process only the new, unprocessed records.
The set of data to be processed incrementally is determined non-invasively (meaning the source table is not
written to) by using the queued_date field 'source_queued_date_field_name'. When a record is ready to be processed,
simply set its queued_date (which should be indexed for efficiency). When incremental_map_reduce() is run, any documents
with queued_dates between the counter in 'counter_key' and 'max_datetime' will be map/reduced.
If reset is True, it will drop 'target_table_name' before starting.
If max_datetime is given, it will only process records up to that date.
If limit_items is given, it will only process (roughly) that many items. If multiple
items share the same date stamp (as specified in 'source_queued_date_field_name') then
it has to fetch all of those or it'll lose track, so it includes them all.
If unspecified/None, counter_key defaults to counter_table_name:LastMaxDatetime.
"""
We solve this issue using 'normalized' ObjectIds. Steps that we are doing:
normalize id - take timestap from the current/stored/last processed id and set other
parts of id to its min values. C# code: new ObjectId(objectId.Timestamp,
0, short.MinValue, 0)
run map-reduce with all items that have id
greater than our normalized id, skip already processed items.
store last processed id, and mark all processed items.
Note: Some boundary items will be processed several times. To fix it we set some sort of a flag in the processed items.
Related
I have a collection with text search enabled and and index of the FName column. If I perform a search like this:
db.items.find({ "FName" : /^customer\\rainfall\\geometry\ types\\customer\ 1\ manhole\ monitors\\region\ 10\\area\ 100\\catchment\ 1000\\/ }).count()
Then this is very fast and returns a value of 300 as there are 300 matching documents.
If, however, I run this query:
db.items.find([{ "FName" : /^customer\\rainfall\\geometry\ types\\customer\ 1\ manhole\ monitors\\region\ 10\\area\ 100\\catchment\ 1000\\/ }, "$text" : { "$search" : "\"\\average rainfall\"" }]).count(), it can take up to 20 seconds to complete. If I run the initial query, get all the documents and carry out the search client side it is much faster - a second or two.
How can this be faster given that the second query is actually being run on the database? It almost looks like the FName filter is not being used first but I cannot prove this.
Any ideas?
Thanks
Ian
I have a collection in mongo of 10 billion documents. Some of which have false information and require updating. The documents look like
{
"_id" : ObjectId("5567c71e2cdc06be25dbf7a0"),
"index1" : "stringIndex1",
"field" : "stringField",
"index2" : "stringIndex2",
"value" : 100,
"unique_id" : "b21fc73e-55a0-4e15-8db0-fa94e4ebcc0b",
"t" : ISODate("2015-05-29T01:55:39.092Z")
}
I want to update the value field for all documents that match criteria on index1, index2 and field. I want to do this across many combinations of the 3 fields.
In an ideal world we could create a second collection and compare the 2 before replacing the original in order to guarantee that we haven't lost any documents. But the size of the collection means that this isn't possible. Any suggestions for how to update this large amount of data without risking damaging it.
I have some problems with very slow distinct commands that use a query.
From what I have observed the distinct command only makes use of an index if you do not specify a query:
I have created a test database on my MongoDB 3.0.10 server with 1Mio objects. Each object looks as follows:
{
"_id" : ObjectId("56e7fb5303858265f53c0ea1"),
"field1" : "field1_6",
"field2" : "field2_10",
"field3" : "field3_29",
"field4" : "field4_64"
}
The numbers at the end of the field values are random 0-99.
On the collections two simple indexes and one compound-index has been created:
{ "field1" : 1 } # simple index on "field1"
{ "field2" : 1 } # simple index on "field2"
{ # compound index on all fields
"field2" : 1,
"field1" : 1,
"field3" : 1,
"field4" : 1
}
Now I execute distinct queries on that database:
db.runCommand({ distinct: 'dbtest',key:'field1'})
The result contains 100 values, nscanned=100 and it has used index on "field1".
Now the same distinct query is limited by a query:
db.runCommand({ distinct: 'dbtest',key:'field1',query:{field2:"field2_10"}})
It contains again 100 values, however nscanned=9991 and the used index is the third one on all fields.
Now the third index that was used in the last query is dropped. Again the last query is executed:
db.runCommand({ distinct: 'dbtest',key:'field1',query:{field2:"field2_10"}})
It contains again 100 values, nscanned=9991 and the used index is the "field2" one.
Conclusion: If I execute a distinct command without query the result is taken directly from an index. However when I combine a distinct command with a query only the query uses an index, the distinct command itself does not use an index in such a case.
My problem is that I need to perform a distinct command with query on a very large database. The result set is very large but only contains ~100 distinct values. Therefore the complete distinct command takes ages (> 5 minutes) as it has to cycle through all values.
What needs to be done to perform my distinct command presented above that can be answered by the database directly from an index?
The index is automatically used for distinct queries if your Mongo database version supports it.
The possibility to use an index in a distinct query requires Mongo version 3.4 or higher - it works for both storage engines MMAPv1/WiredTiger.
See also the bug ticket https://jira.mongodb.org/browse/SERVER-19507
I have a production mongo database of over 1B documents in a single collection sharded on _id across multiple servers. I'm trying to replicate recently updated records from this collection into Red Shift.
Shard keys:
db.sample_collection.ensureIndex({_id: "hashed"})
sh.shardCollection("sample_collection.sample_object", {_id: "hashed"})
Example 'sample_object' Document
{
"_id" : ObjectId("527a6c9226d6b7770ab05345"),
"p": ISODate("2013-10-27T14:30:18.000Z"),
"a" : {
"ln" : "Doe",
"id" : NumberLong(3),
"fn" : "John",
},
"co" : {
"ct" : 2,
"it" : [
{'t': 'loreum', 'u' : NumberLong(300), 'd': ISODate("2013-10-28T14:30:18.000Z")},
{'t': 'loreum', 'u' : NumberLong(400), 'd': ISODate("2013-10-29T14:30:18.000Z")},
..]
},
"li" : {
"ct" : 2,
"it" : [
{'u' : NumberLong(500), 'd': ISODate("2013-10-30T14:30:18.000Z")},
{'u' : NumberLong(501), 'd': ISODate("2013-10-29T14:30:18.000Z")},
..]
},
}
Option #1:
I'm in the process of analyzing this data and I need to query for documents that were "updated" between a period.
i.e., I want to return all the objects that have been p (published) or an li.it (item) or co.it (item) added between '2014-07-01' and '2014-07-03'.
What would be the most performant way of doing this?
Option #2:
Another option that I'm evaluating is whether I want to add an 'u' property with an updated date to account for when the document was updated
(ie., li or co item added)
If I make the change to the process to ensure new documents have this property, how would I iterate through existing documents and add this retroactively?
Would filtering on 'u' be more performant that Option 1? I'm looking at this option as using COPY FROM JSON from a mongoexport
Option #1 (multiple dates)
There isn't a good option to index this, as it looks like you would ideally want a compound index that includes p (date) plus two date arrays (lt.it and co.it). A compound index can only include at most one array field. Even if you could do this, the index would be very large given the suggested number of dates and the query would involve checking multiple fields to infer the last updated date.
Option #2 (single updated date)
Adding an indexed u (latest updated date) is definitely a better approach to allow a simple and performant query.
If I make the change to the process to ensure new documents have this property, how would I iterate through existing documents and add this retroactively?
You can use the $exists operator to find documents that do not have this field set yet.
Caveat on hashed shard key
To elaborate on Neil's comment: a hashed shard key gives you good write distribution at the expense of being able to do range queries (all queries become scatter-gather). If your common queries are range-based on date (and you are concerned about performance) then you could possibly chose a more appropriate shard key to support those queries. However, since shard keys are immutable and you want to query on an "updated" date, it doesn't sound like a change of shard key will help your use case.
I have a (hopefully quick) question about MongoDB queries on compound indexes.
Say I have a data set (for example, comments) which I want to sort descending by score, and then date:
{ "score" : 10, "date" : ISODate("2014-02-24T00:00:00.000Z"), ...}
{ "score" : 10, "date" : ISODate("2014-02-18T00:00:00.000Z"), ...}
{ "score" : 10, "date" : ISODate("2014-02-12T00:00:00.000Z"), ...}
{ "score" : 9, "date" : ISODate("2014-02-22T00:00:00.000Z"), ...}
{ "score" : 9, "date" : ISODate("2014-02-16T00:00:00.000Z"), ...}
...
My understanding thus far is that I can make a compound index to support this query, which looks like {"score":-1,"date":-1}. (For clarity's sake, I am not using a date in the index, but an ObjectID for unique, roughly time-based order)
Now, say I want to support paging through the comments. The first page is easy enough, I can just stick a .limit(n) option on the end of the cursor. What I'm struggling with is continuing the search.
I have been referring to MongoDB: The Definitive Guide by Kristina Chodorow. In this book, Kristina mentions that using skip() on large datasets is not very performant, and recommends using range queries on parameters from the last seen result (eg. the last seen date).
What I would like to do is perform a range query that acts on two fields, but treats the second field as secondary to the first (just like the index is sorted.) Since my compound index is already sorted in exactly the order I want, it seems like there should be some way to jump into the search by pointing at a specific element in the index and traversing it in the sort order. However, from my (admittedly rudimentary) understanding of queries in MongoDB this doesn't seem possible.
As far as I can see, I have three options:
Using skip() anyway
Using either an $or query or two distinct queries: {$or : [{"score" : lastScore, "date" : { $lt : lastDate}}, {'score' : {$lt : lastScore}]}
Using the $max special query option
Number 3 seems like the closest to ideal for me, but the reference text notes that 'you should generally use "$lt" instead of "$max"'.
To summarize, I have a few questions:
Is there some way to perform the operation I described, that I may have missed? (Jumping into an index and traversing it in the sort order)
If not, of the three options I described (or any I have overlooked), which would (very generally speaking) give the most consistent performance under the compound index?
Why is $lt preferred over $max in most cases?
Thanks in advance for your help!
Another option is to store score and date in a sub-document and then index the sub-document. For example:
{
"a" : { "score" : 9,
"date" : ISODate("2014-02-22T00:00:00Z") },
...
}
db.foo.ensureIndex( { a : 1 } )
db.foo.find( { a : { $lt : { score : lastScore,
date: lastDate } } } ).sort( { a : -1 } )
With this approach you need to ensure that the fields in the BSON sub-document are always stored in the same order, otherwise the query won't match what you expect since index key comparison is binary comparison of the entire BSON sub-document.
I would go with using $max to specify the upper bound, in conjunction with $hint to make sure that the database uses the index you want. The reason that $lt is in general preferred over $max is because $max selects the index using the specified index bounds. This means:
the index chosen may not necessarily be the best choice.
if multiple indexes exist on same fields with different sort orders, the selection of the index may be ambiguous.
The above points are covered in further detail here.
One last point: max is equivalent to $lte, not $lt, so using this approach for pagination you'll need to skip over the first returned document to avoid outputting the same document twice.