Update a mongo collection of 10 billion documents. Without contaminating the data.

Update a mongo collection of 10 billion documents. Without contaminating the data. - mongodb

I have a collection in mongo of 10 billion documents. Some of which have false information and require updating. The documents look like
{
"_id" : ObjectId("5567c71e2cdc06be25dbf7a0"),
"index1" : "stringIndex1",
"field" : "stringField",
"index2" : "stringIndex2",
"value" : 100,
"unique_id" : "b21fc73e-55a0-4e15-8db0-fa94e4ebcc0b",
"t" : ISODate("2015-05-29T01:55:39.092Z")
}
I want to update the value field for all documents that match criteria on index1, index2 and field. I want to do this across many combinations of the 3 fields.
In an ideal world we could create a second collection and compare the 2 before replacing the original in order to guarantee that we haven't lost any documents. But the size of the collection means that this isn't possible. Any suggestions for how to update this large amount of data without risking damaging it.

Related

How does the order of compound indexes matter in MongoDB performance-wise?

We need to create a compound index in the same order as the parameters are being queried. Does this order matter performance-wise at all?
Imagine we have a collection of all humans on earth with an index on sex (99.9% of the time "male" or "female", but string nontheless (not binary)) and an index on name.
If we would want to be able to select all people of a certain sex with a certain name, e.g. all "male"s named "John", is it better to have a compound index with sex first or name first? Why (not)?

Redsandro,
You must consider Index Cardinality and Selectivity.
1. Index Cardinality
The index cardinality refers to how many possible values there are for a field. The field sex only has two possible values. It has a very low cardinality. Other fields such as names, usernames, phone numbers, emails, etc. will have a more unique value for every document in the collection, which is considered high cardinality.
Greater Cardinality
The greater the cardinality of a field the more helpful an index will be, because indexes narrow the search space, making it a much smaller set.
If you have an index on sex and you are looking for men named John. You would only narrow down the result space by approximately %50 if you indexed by sex first. Conversely if you indexed by name, you would immediately narrow down the result set to a minute fraction of users named John, then you would refer to those documents to check the gender.
Rule of Thumb
Try to create indexes on high-cardinality keys or put high-cardinality keys first in the compound index. You can read more about it in the section on compound indexes in the book:
MongoDB The Definitive Guide
2. Selectivity
Also, you want to use indexes selectively and write queries that limit the number of possible documents with the indexed field. To keep it simple, consider the following collection. If your index is {name:1}, If you run the query { name: "John", sex: "male"}. You will have to scan 1 document. Because you allowed MongoDB to be selective.
{_id:ObjectId(),name:"John",sex:"male"}
{_id:ObjectId(),name:"Rich",sex:"male"}
{_id:ObjectId(),name:"Mose",sex:"male"}
{_id:ObjectId(),name:"Sami",sex:"male"}
{_id:ObjectId(),name:"Cari",sex:"female"}
{_id:ObjectId(),name:"Mary",sex:"female"}
Consider the following collection. If your index is {sex:1}, If you run the query {sex: "male", name: "John"}. You will have to scan 4 documents.
{_id:ObjectId(),name:"John",sex:"male"}
{_id:ObjectId(),name:"Rich",sex:"male"}
{_id:ObjectId(),name:"Mose",sex:"male"}
{_id:ObjectId(),name:"Sami",sex:"male"}
{_id:ObjectId(),name:"Cari",sex:"female"}
{_id:ObjectId(),name:"Mary",sex:"female"}
Imagine the possible differences on a larger data set.
A little explanation of Compound Indexes
It's easy to make the wrong assumption about Compound Indexes. According to MongoDB docs on Compound Indexes.
MongoDB supports compound indexes, where a single index structure
holds references to multiple fields within a collection’s documents.
The following diagram illustrates an example of a compound index on
two fields:
When you create a compound index, 1 Index will hold multiple fields. So if we index a collection by {"sex" : 1, "name" : 1}, the index will look roughly like:
["male","Rick"] -> 0x0c965148
["male","John"] -> 0x0c965149
["male","Sean"] -> 0x0cdf7859
["male","Bro"] ->> 0x0cdf7859
...
["female","Kate"] -> 0x0c965134
["female","Katy"] -> 0x0c965126
["female","Naji"] -> 0x0c965183
["female","Joan"] -> 0x0c965191
["female","Sara"] -> 0x0c965103
If we index a collection by {"name" : 1, "sex" : 1}, the index will look roughly like:
["John","male"] -> 0x0c965148
["John","female"] -> 0x0c965149
["John","male"] -> 0x0cdf7859
["Rick","male"] -> 0x0cdf7859
...
["Kate","female"] -> 0x0c965134
["Katy","female"] -> 0x0c965126
["Naji","female"] -> 0x0c965183
["Joan","female"] -> 0x0c965191
["Sara","female"] -> 0x0c965103
Having {name:1} as the Prefix will serve you much better in using compound indexes. There is much more that can be read on the topic, I hope this can offer some clarity.

I'm going to say I did an experiment on this myself, and found that there seems to be no performance penalty for using the poorly distinguished index key first. (I'm using mongodb 3.4 with wiredtiger, which may be different than mmap). I inserted 250 million documents into a new collection called items. Each doc looked like this:
{
field1:"bob",
field2:i + "",
field3:i + ""
"field1" was always equal to "bob". "field2" was equal to i, so it was completely unique. First I did a search on field2, and it took over a minute to scan 250 million documents. Then I created an index like so:
`db.items.createIndex({field1:1,field2:1})`
Of course field1 is "bob" on every single document, so the index should have to search a number of items before finding the desired document. However, this was not the result I got.
I did another search on the collection after the index finished creating. This time I got results which I listed below. You'll see that "totalKeysExamined" is 1 each time. So perhaps with wired tiger or something they have figured out how to do this better. I have read the wiredtiger actually compresses index prefixes, so that may have something to do with it.
db.items.find({field1:"bob",field2:"250888000"}).explain("executionStats")
{
"executionSuccess" : true,
"nReturned" : 1,
"executionTimeMillis" : 4,
"totalKeysExamined" : 1,
"totalDocsExamined" : 1,
"executionStages" : {
"stage" : "FETCH",
"nReturned" : 1,
"executionTimeMillisEstimate" : 0,
"works" : 2,
"advanced" : 1,
...
"docsExamined" : 1,
"inputStage" : {
"stage" : "IXSCAN",
"nReturned" : 1,
"executionTimeMillisEstimate" : 0,
...
"indexName" : "field1_1_field2_1",
"isMultiKey" : false,
...
"indexBounds" : {
"field1" : [
"[\"bob\", \"bob\"]"
],
"field2" : [
"[\"250888000\", \"250888000\"]"
]
},
"keysExamined" : 1,
"seeks" : 1
}
}
Then I created an index on field3 (which has the same value as field 2). Then I searched:
db.items.find({field3:"250888000"});
It took the same 4ms as the one with the compound index. I repeated this a number of times with different values for field2 and field3 and got insignificant differences each time. This suggests that with wiredtiger, there is no performance penalty for having poor differentiation on the first field of an index.

Note that multiple equality predicates do not have to be ordered from most selective to least selective. This guidance has been provided in the past however it is erroneous due to the nature of B-Tree indexes and how in leaf pages, a B-Tree will store combinations of all field’s values. As such, there is exactly the same number of combinations regardless of key order.
https://www.alexbevi.com/blog/2020/05/16/optimizing-mongodb-compound-indexes-the-equality-sort-range-esr-rule/
This blog article disagrees with the accepted answer. The benchmark in the other answer also shows that it doesn't matter. The author of that article is a "Senior Technical Services Engineer at MongoDB" which sounds like a creditable person to me on this topic, so I guess the order really doesn't affect performance after all on equality fields. I'll follow the ESR rule instead.
Also consider prefixes. Filtering for { a: 1234 } won't work with a index of { b: 1, a: 1 }: https://docs.mongodb.com/manual/core/index-compound/#prefixes

Can I do a second 'query' on a MongoDB cursor?

Imagine a collection with about 5,000,000 documents. I need to do a basicCursor query to select ~100 documents based on too many fields to index. Let's call this the basicCursorMatch. This will be immensely slow.
I can however to a bTreeCursor query on a few indexes that will limit my search to ~500 documents. Let's call this query the bTreeCursorMatch.
Is there a way I can do this basicCursorMatch directly on the cursor or collection resulting from the bTreeCursorMatch?
Intuitively I tried
var cursor = collection.find(bTreeCursorMatch);
var results = cursor.find(basicCursorMatch);
similar to collection.find(bTreeCursorMatch).find(basicCursorMatch), which doesn't seem to work.
Alternatively, I was hoping I could do something like this:
collection.aggregate([
{$match: bTreeCursorMatch}, // Uses index 5,000,000 -> 500 fast
{$match: basicCursorMatch}, // No index, 500 -> 100 'slow'
{$sort}
]);
.. but it seems that I cannot do this either. Is there an alternative to do what I want?
The reason I am asking is because this second query will differ a lot and there is no way I can index all the fields. But I do want to make that first query using a bTreeCursor, otherwise querying the whole collection will take forever using a basicCursor.
update
Also, through user input the subselection of 500 documents will be queried in different ways during a session with an unpredictable basicCursor query, using multiple $in $eq $gt $lt. But during this, the bTreeCursor subselection remains the same. Should I just keep doing both queries for every user query, or is there a more efficient way to keep a reference to this collection?

In practice, you rarely need to run second queries on a cursor. You specially don't need to break MongoDB's work into separate indexable / non-indexable chunks.
If you pass a query to MongoDB's find method that can be partially fulfilled by a look-up in an index, MongoDB will do that look-up first, and then do a full scan on the remaining documents.
For instance, I have a collection users with documents like:
{ _id : 4, gender : "M", ... }
There is an index on _id, but not on gender. There are ~200M documents in users.
To get an idea of what MongoDB is doing under the hood, add explain() to your cursor (in the Mongo shell):
> db.users.find( { _id : { $gte : 1, $lt : 10 } } ).explain()
{
"cursor" : "BtreeCursor oldId_1_state_1",
"n" : 9,
"nscannedObjects" : 9
}
I have cut out some of the fields returned by explain. Basically, cursor tells you if it's using an index, n tells you the number of documents returned by the query and nscannedObjects is the number of objects scanned during the query. In this case, mongodb was able to scan exactly the right number of objects.
What happens if we now query on gender as well?
> db.users.find( { _id : { $gte : 1, $lt : 10 }, gender : "F" } ).explain()
{
"cursor" : "BtreeCursor oldId_1_state_1",
"n" : 5,
"nscannedObjects" : 9
}
find returns 5 objects, but had to scan 9 documents. It was therefore able to isolate the correct 9 documents using the _id field. It then went through all 9 documents and filtered them by gender.

MongoDB with 1B documents, what is most optimum filter to return recently updated documents

I have a production mongo database of over 1B documents in a single collection sharded on _id across multiple servers. I'm trying to replicate recently updated records from this collection into Red Shift.
Shard keys:
db.sample_collection.ensureIndex({_id: "hashed"})
sh.shardCollection("sample_collection.sample_object", {_id: "hashed"})
Example 'sample_object' Document
{
"_id" : ObjectId("527a6c9226d6b7770ab05345"),
"p": ISODate("2013-10-27T14:30:18.000Z"),
"a" : {
"ln" : "Doe",
"id" : NumberLong(3),
"fn" : "John",
},
"co" : {
"ct" : 2,
"it" : [
{'t': 'loreum', 'u' : NumberLong(300), 'd': ISODate("2013-10-28T14:30:18.000Z")},
{'t': 'loreum', 'u' : NumberLong(400), 'd': ISODate("2013-10-29T14:30:18.000Z")},
..]
},
"li" : {
"ct" : 2,
"it" : [
{'u' : NumberLong(500), 'd': ISODate("2013-10-30T14:30:18.000Z")},
{'u' : NumberLong(501), 'd': ISODate("2013-10-29T14:30:18.000Z")},
..]
},
}
Option #1:
I'm in the process of analyzing this data and I need to query for documents that were "updated" between a period.
i.e., I want to return all the objects that have been p (published) or an li.it (item) or co.it (item) added between '2014-07-01' and '2014-07-03'.
What would be the most performant way of doing this?
Option #2:
Another option that I'm evaluating is whether I want to add an 'u' property with an updated date to account for when the document was updated
(ie., li or co item added)
If I make the change to the process to ensure new documents have this property, how would I iterate through existing documents and add this retroactively?
Would filtering on 'u' be more performant that Option 1? I'm looking at this option as using COPY FROM JSON from a mongoexport

Option #1 (multiple dates)
There isn't a good option to index this, as it looks like you would ideally want a compound index that includes p (date) plus two date arrays (lt.it and co.it). A compound index can only include at most one array field. Even if you could do this, the index would be very large given the suggested number of dates and the query would involve checking multiple fields to infer the last updated date.
Option #2 (single updated date)
Adding an indexed u (latest updated date) is definitely a better approach to allow a simple and performant query.
If I make the change to the process to ensure new documents have this property, how would I iterate through existing documents and add this retroactively?
You can use the $exists operator to find documents that do not have this field set yet.
Caveat on hashed shard key
To elaborate on Neil's comment: a hashed shard key gives you good write distribution at the expense of being able to do range queries (all queries become scatter-gather). If your common queries are range-based on date (and you are concerned about performance) then you could possibly chose a more appropriate shard key to support those queries. However, since shard keys are immutable and you want to query on an "updated" date, it doesn't sound like a change of shard key will help your use case.

MongoDB incremental mapReduce, select only new documents, added after last mapReduce

Let's say I have a collection with documents that looks like this (just simplified example, but it should show the scheme):
> db.data.find()
{ "_id" : ObjectId("4e9c1f27aa3dd60ee98282cf"), "type" : "A", "value" : 11 }
{ "_id" : ObjectId("4e9c1f33aa3dd60ee98282d0"), "type" : "A", "value" : 58 }
{ "_id" : ObjectId("4e9c1f40aa3dd60ee98282d1"), "type" : "B", "value" : 37 }
{ "_id" : ObjectId("4e9c1f50aa3dd60ee98282d2"), "type" : "B", "value" : 1 }
{ "_id" : ObjectId("4e9c1f56aa3dd60ee98282d3"), "type" : "A", "value" : 85 }
{ "_id" : ObjectId("4e9c1f5daa3dd60ee98282d4"), "type" : "B", "value" : 12 }
Now I need to collect some statistics on that collection. For example:
db.data.mapReduce(function(){
emit(this.type,this.value);
},function(key,values){
var total = 0;
for(i in values) {total+=values[i]};
return total;
},
{out:'stat'})
will collect totals in 'stat' collection.
> db.stat.find()
{ "_id" : "A", "value" : 154 }
{ "_id" : "B", "value" : 50 }
At this point everything is perfect, but I've stuck on the next move:
'data' collection is constantly updated with new data (old documents stays unchanged, only inserts, no updates)
I would like to periodically update 'stat' collection, but do not want to query the whole 'data' collection every time, so I choose to run incremental mapReduce
It may seems good to just update 'stat' collection on every insert in 'data' collection and do no use mapReduce, but the real case is more complex than this example and I would like to get statistics only on demand.
To do this I should be able to query only documents, that was added after my last mapReduce
As far as I understand I cannot rely on ObjectId property, just store the last one and later select every document with ObjectId > stored because ObjectId is not equal autoincrement ids in SQL databases (for example different shards will produce different ObjectIds).
I can change ObjectId generator, but not sure how to do it better in sharded environment
So the question is:
Is it any way to select only documents, added after the last mapReduce to run incremental mapReduce or may be there is another strategy to update statistic data on constantly growing collection?

You can cache the time and use it as a barrier for your next incremental map-reduce.
We're testing this at work and it seems to be working. Correct me if I'm wrong, but you can't safely do map-reduce while an insert is happening across shards. The versions become inconsistent and your map-reduce operation will fail. (If you find a solution to this, please do let me know! :)
We use bulk-inserts instead, once every 5 minutes. Once all the bulk inserts are done, we run the map-reduce like this (in Python):
m = Code(<map function>)
r = Code(<reduce function>)
# pseudo code
end = last_time + 5 minutes
# Use time and optionally any other keys you need here
q = bson.SON([("date" : {"$gte" : last_time, "$lt" : end})])
collection.map_reduce(m, r, out=out={"reduce": <output_collection>}, query=q)
Note that we used reduce and not merge, because we don't want to override what we had before; we want to combine the old results and the new result with the same reduce function.

You can get just the time portion of the ID using _id.getTime() (from: http://api.mongodb.org/java/2.6/org/bson/types/ObjectId.html). That should be sortable across all shards.
EDIT: Sorry, that was the java docs... The JS version appears to be _id.generation_time.in_time_zone(Time.zone), from http://mongotips.com/b/a-few-objectid-tricks/

I wrote up a complete pymongo-based solution that uses incremental map-reduce and caches the time, and expects to run in a cron job. It locks itself so two can't run concurrently:
https://gist.github.com/2233072
""" This method performs an incremental map-reduce on any new data in 'source_table_name'
into 'target_table_name'. It can be run in a cron job, for instance, and on each execution will
process only the new, unprocessed records.
The set of data to be processed incrementally is determined non-invasively (meaning the source table is not
written to) by using the queued_date field 'source_queued_date_field_name'. When a record is ready to be processed,
simply set its queued_date (which should be indexed for efficiency). When incremental_map_reduce() is run, any documents
with queued_dates between the counter in 'counter_key' and 'max_datetime' will be map/reduced.
If reset is True, it will drop 'target_table_name' before starting.
If max_datetime is given, it will only process records up to that date.
If limit_items is given, it will only process (roughly) that many items. If multiple
items share the same date stamp (as specified in 'source_queued_date_field_name') then
it has to fetch all of those or it'll lose track, so it includes them all.
If unspecified/None, counter_key defaults to counter_table_name:LastMaxDatetime.
"""

We solve this issue using 'normalized' ObjectIds. Steps that we are doing:
normalize id - take timestap from the current/stored/last processed id and set other
parts of id to its min values. C# code: new ObjectId(objectId.Timestamp,
0, short.MinValue, 0)
run map-reduce with all items that have id
greater than our normalized id, skip already processed items.
store last processed id, and mark all processed items.
Note: Some boundary items will be processed several times. To fix it we set some sort of a flag in the processed items.

How to improve query performance with operators like $nin, $in for Mongodb

I have a reasonably large dataset of over 3 million documents that have tags similar to StackOverflow that uses tags for each question. The schema that I use for storing the tags is as follows:
{"id": 12345, "tags":["tag1", "tag2", "tag3"]}, {"id": 12346, "tags":["tag2", "tag3"]}
I have a multi-key index created on tags field. When I am performing queries using $in or $nin operators to find the intersection, union of the tags, the performance is around 7 seconds on a server class machine. Is there anything that I can do to improve the speed of query search?
EDIT 1:
Here is the explain plan as requested. What I observed is that the queries returned much faster after I restarted my server and just ran just the mongodb server. The queries performed much faster(< 50ms). I suspect the indexes were not cached in memory, although I had ample unused ram available and my index (800MB) could easily fit in memory.
db.tagsCollection.find( { "tags" : { $in : ['tag1', 'tag2'], $nin : ['tag4', '
tag5', 'tag6', 'tag7'] } } ).explain();
{
"cursor" : "BtreeCursor tags_1 multi",
"nscanned" : 6145193,
"nscannedObjects" : 6145192,
"n" : 969386,
"millis" : 19640,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : true,
"indexOnly" : false,
"indexBounds" : {
"tags" : [
[
"tag1",
"tag1"
],
[
"tag2",
"tag2"
]
]
}
}
Note

This is what I thought of as an optimization ( though you might need to test it )
Instead of storing tags,store a small key which identifies all the tags particular document has.
say for post#125 the tags are : PHP, MongoDb , database .
a) clean the tags like convert all of them to small case etc
and then sort them alphabetically .
current tags will be : database,mongodb,php
b) Have a seperate collection which stores integer to tag mapping :
{ "_id" : 1 , "t" : "mongodb" }
{ "_id" : 2 , "t" : "php" } and so on store all the possible tags for your website
c) to store a document, create the tag key using tags to number map from previous collection.
so curent database,mongodb,php will become something like 1-12-2
d) store your document like :
{ "id" : 12345 , "tags" : [1,12,3] }
QUERYING :
The use of integers instead of strings on an indexed field would reduce the index size by great extent, and also make querying faster as compared to a string index.
Not sure about amount of performance gain, but still worth a try to compare to your current implementation.

Check the size of your multi-key tags index using db.col.stats(). If it doesn't fit in RAM then you might be disk-bound and incurring some disk IO cost. If the index fits entirely in memory then I'm not sure what else you can do, apart from throw more hardware at it, unless you can optimise the queries themselves.
Do you need to search through all the data, or can you query a subset that's filtered by another indexed field? Or can you eliminate the $nin queries, which will tend to be slower because the have to iterate every tag, where as $in only has to iterate until it finds a match.

If you want performance to be super fast and dont have space contraints, I would suggest to have separate collection of tags with video id array and have an index on tag name.

Here is another suggestion but I've had not a chance to test it.
{
tags:{
items:[ 'a', 'b', 'c' ],
mixed:{
a:1, // hash value for a tag
b:2, // hash value for b tag
c:3 // hash value for c tag
}
}
}
and search query is
db.demo.find({ 'tags.mixed.a':1, 'tags.mixed.b':2 })
if possible have to create compound index for tags.mixed

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Update a mongo collection of 10 billion documents. Without contaminating the data. - mongodb

Related

How does the order of compound indexes matter in MongoDB performance-wise?

Can I do a second 'query' on a MongoDB cursor?

MongoDB with 1B documents, what is most optimum filter to return recently updated documents

MongoDB incremental mapReduce, select only new documents, added after last mapReduce

How to improve query performance with operators like $nin, $in for Mongodb

Categories

Resources