Given a large (millions+) collection of documents similar to:
{ _id : ObjectId, "a" : 3, "b" : 5 }
What is the most efficient way to process these documents directly on the server, with the results added to each document within the same collection? For example, add a key c whose value equals a+b.
{ _id : ObjectId, "a" : 3, "b" : 5, "c" : 8 }
I'd prefer to do this in the shell.
Seems that find().forEach() would waste time in transit between the db and the shell, and mapReduce() seems intended to process groups of objects down into aggregated data (though I may be misunderstanding).
EDIT: I'd prefer a solution that doesn't block, if there is one (other than using a cursor on the client)...
From the MongoDB Docs on db.eval():
"db.eval() is used to evaluate a function (written in JavaScript) at the database server.
This is useful if you need to touch a lot of data lightly. In that scenario, network transfer of the data could be a bottleneck."
The documentation has an example of how to use it that is very similar to what you are trying to do.
forEach is your best option. I would run it on the server (from the shell) to reduce latency.
Related
I have a MongoDB collection with 8k+ documents, around 40GB. Inside it, the data follows this format:
{
_id: ...,
_session: {
_id: ...
},
data: {...}
}
I need to get all the _session._id for my application. The following approach (python) takes too long to get them:
cursor = collection.find({}, projection={'_session._id': 1})
I have created an Index in MongoDB Compass, but I'm not sure if my query is making use of it at all.
Is there a way to speed this query such that I get all the _session._id very fast?
In mongo shell you can hint() the query optimizer to use the available index as follow:
db.collection.find({},{_id:0,"_session._id":1}).hint({"_session._id":1})
Following test is confirmed to work via python:
import pymongo
db=pymongo.MongoClient("mongodb://user:pass#localhost:12345")
mydb=db["test"]
docs= mydb.test2.find( {} ).hint([ ("x.y", pymongo.ASCENDING) ])
for i in docs:
print(i)
db.test2.createIndex({"x.y":1})
{
"v" : 2,
"key" : {
"x.y" : 1
},
"name" : "x.y_1"
}
python 3.7 ,
pymongo 3.11.2 ,
mongod 5.0.5
In your case seems to be text index , btw it seems abit strange why session is text index , for text index somethink like this must work:
db.test2.find({}).hint("x.y_text").explain()
And here is working example with text index:
import pymongo
db=pymongo.MongoClient("mongodb://user:pass#localhost:123456")
print('Get first 10 docs from test.test:')
mydb=db["test"]
docs= mydb.test2.find( {"x.y":"3"} ).hint( "x.y_text" )
print("===start:====")
for i in docs:
print(i)
db.test2.createIndex({"x.y":"text"}):
{
"v" : 2,
"key" : {
"_fts" : "text",
"_ftsx" : 1
},
"name" : "x.y_text",
"weights" : {
"x.y" : 1
},
"default_language" : "english",
"language_override" : "language",
"textIndexVersion" : 3
}
There are a few points of confusion in this question and the ensuing discussion which generally come down to:
What indexes are present in the environment (and why the attempts to hint it failed)
When using indexing is most appropriate
Current Indexes
I think there are at least 5 indexes that were mentioned so far:
A standard index of {"_session._id":1} mentioned originally in #R2D2's answer.
A text index on the _session._id field (mentioned in this comment)
A text index on the _ts_meta.session field (mentioned in this comment)
A standard index of {"x.y":1} mentioned second in #R2D2's answer.
A text index of {"x.y":"text"} mentioned at the end of #R2D2's answer.
Only the first of these is likely to even really be relevant to the original question. Note that the difference a text index is a specialized index that is meant for performing more advanced text searching. Such indexes are not required for simple string matching or value retrieval. But standard indexes, { '_session._id': 1}, will also store string values and are relevant here.
What Indexing is For
Indexes are typically useful for retrieving a small subset of results from the database. The larger that set of results becomes relative to the overall size of the collection, the less helpful using an index will become. In your situation you are looking to retrieve data from all of the documents in the collection which is why the database doesn't consider using any index at all.
Now it is still possible that an index could help in this situation. That would be if we used it to perform a covered query which means that the data can be retrieved from the index alone without looking at the documents themselves. In this case the database would have to scan the full index, so it is not clear that it would be faster or not. But you could certainly try. To do so you would need to follow #R2D2's instructions, specifically by creating the index and then hinting it in the query (while also projecting out the _id field):
db.collection.createIndex({"_session._id":1})
db.collection.find({},{_id:0,"_session._id":1}).hint({"_session._id":1})
Additional Questions
There were two other things mentioned in the question that are important to address.
I have created an Index in MongoDB Compass, but I'm not sure if my query is making use of it at all.
We talked about why this was the case above. But to find out if the database is using it or not you could navigate to the Explain tab in compass to take a look. If you explain plan visualization it should indicate if the index was used. Remember that you will need to hint the index based on your query.
Is there a way to speed this query such that I get all the _session._id very fast?
What is your definition of "very fast" here?
The general answer is that your operation requires scanning either all documents in the collection or a full index. There is no way to do this more efficiently based on the current schema. Therefore how fast it happens is largely going to come down to the hardware that the database is running on and it will slow down as the collection grows.
If this operation is something that you will be running frequently or have strict performance requirements around, then it may be important to think through your intended goals to see if there are other ways of achieving them. What will you or the application be doing with this list of session IDs?
I have removed some documents in my last query by mistake, Is there any way to rollback my last query mongo collection.
Here it is my last query :
db.foo.remove({ "name" : "some_x_name"})
Is there any rollback/undo option? Can I get my data back?
There is no rollback option (rollback has a different meaning in a MongoDB context), and strictly speaking there is no supported way to get these documents back - the precautions you can/should take are covered in the comments. With that said however, if you are running a replica set, even a single node replica set, then you have an oplog. With an oplog that covers when the documents were inserted, you may be able to recover them.
The easiest way to illustrate this is with an example. I will use a simplified example with just 100 deleted documents that need to be restored. To go beyond this (huge number of documents, or perhaps you wish to only selectively restore etc.) you will either want to change the code to iterate over a cursor or write this using your language of choice outside the MongoDB shell. The basic logic remains the same.
First, let's create our example collection foo in the database dropTest. We will insert 100 documents without a name field and 100 documents with an identical name field so that they can be mistakenly removed later:
use dropTest;
for(i=0; i < 100; i++){db.foo.insert({_id : i})};
for(i=100; i < 200; i++){db.foo.insert({_id : i, name : "some_x_name"})};
Now, let's simulate the accidental removal of our 100 name documents:
> db.foo.remove({ "name" : "some_x_name"})
WriteResult({ "nRemoved" : 100 })
Because we are running in a replica set, we still have a record of these documents in the oplog (being inserted) and thankfully those inserts have not (yet) fallen off the end of the oplog (the oplog is a capped collection remember) . Let's see if we can find them:
use local;
db.oplog.rs.find({op : "i", ns : "dropTest.foo", "o.name" : "some_x_name"}).count();
100
The count looks correct, we seem to have our documents still. I know from experience that the only piece of the oplog entry we will need here is the o field, so let's add a projection to only return that (output snipped for brevity, but you get the idea):
db.oplog.rs.find({op : "i", ns : "dropTest.foo", "o.name" : "some_x_name"}, {"o" : 1});
{ "o" : { "_id" : 100, "name" : "some_x_name" } }
{ "o" : { "_id" : 101, "name" : "some_x_name" } }
{ "o" : { "_id" : 102, "name" : "some_x_name" } }
{ "o" : { "_id" : 103, "name" : "some_x_name" } }
{ "o" : { "_id" : 104, "name" : "some_x_name" } }
To re-insert those documents, we can just store them in an array, then iterate over the array and insert the relevant pieces. First, let's create our array:
var deletedDocs = db.oplog.rs.find({op : "i", ns : "dropTest.foo", "o.name" : "some_x_name"}, {"o" : 1}).toArray();
> deletedDocs.length
100
Next we remind ourselves that we only have 100 docs in the collection now, then loop over the 100 inserts, and finally revalidate our counts:
use dropTest;
db.foo.count();
100
// simple for loop to re-insert the relevant elements
for (var i = 0; i < deletedDocs.length; i++) {
db.foo.insert({_id : deletedDocs[i].o._id, name : deletedDocs[i].o.name});
}
// check total and name counts again
db.foo.count();
200
db.foo.count({name : "some_x_name"})
100
And there you have it, with some caveats:
This is not meant to be a true restoration strategy, look at backups (MMS, other), delayed secondaries for that, as mentioned in the comments
It's not going to be particularly quick to query the documents out of the oplog (any oplog query is a table scan) on a large busy system.
The documents may age out of the oplog at any time (you can, of course, make a copy of the oplog for later use to give you more time)
Depending on your workload you might have to de-dupe the results before re-inserting them
Larger sets of documents will be too large for an array as demonstrated, so you will need to iterate over a cursor instead
The format of the oplog is considered internal and may change at any time (without notice), so use at your own risk
While I understand this is a bit old but I wanted to share something that I researched in this area that may be useful to others with a similar problem.
The fact is that MongoDB does not Physically delete data immediately - it only marks it for deletion. This is however version specific and there is currently no documentation or standardization - which could enable a third party tool developer (or someone in desperate need) to build a tool or write a simple script reliably that works across versions. I opened a ticket for this - https://jira.mongodb.org/browse/DOCS-5151.
I did explore one option which is at a much lower level and may need fine tuning based on the version of MongoDB used. Understandably too low level for most people's linking, however it works and can be handy when all else fails.
My approach involves directly working with the binary in the file and using a Python script (or commands) to identify, read and unpack (BSON) the deleted data.
My approach is inspired by this GitHub project (I am NOT the developer of this project). Here on my blog I have tried to simplify the script and extract a specific deleted record from a Raw MongoDB file.
Currently a record is marked for deletion as "\xee" at the start of the record. This is what a deleted record looks like in the raw db file,
‘\xee\xee\xee\xee\x07_id\x00U\x19\xa6g\x9f\xdf\x19\xc1\xads\xdb\xa8\x02name\x00\x04\x00\x00\x00AAA\x00\x01marks\x00\x00\x00\x00\x00\x00#\x9f#\x00′
I replaced the first block with the size of the record which I identified earlier based on other records.
y=”3\x00\x00\x00″+x[20804:20800+51]
Finally using the BSON package (that comes with pymongo), I decoded the binary to a Readable object.
bson.decode_all(y)
[{u’_id': ObjectId(‘5519a6679fdf19c1ad73dba8′), u’name': u’AAA’, u’marks': 2000.0}]
This BSON is a python object now and can be dumped into a recover collection or simply logged somewhere.
Needless to say this or any other recovery technique should be ideally done in a staging area on a backup copy of the database file.
I had following mongo collections structures
{
"_id" : ObjectId("52204f5b24c8cbf03ca16f8e"),
"Date" : 1377849179,
"cpuUtilization" : 31641,
"memory" : 20623801,
"hostId" : "600.6.6.6"
}
In above collections I had 1000 hostId and every hostId produced cpuutilization and memory every 5 min. So any one suggest me I put my data into single collection or I create separate 1000 collections using hostId like collections name as 100.1.12.2,101.2.10.1....
and I also want indexing on collections for searching records.
From the structure you have shared it would be an intelligent choice to put data into separate records, since the memory and cpuUtilization would always be different. Also, if you store timestamp in Date field, that would always be different.
It would be far more easier to query your database if you store records separately and you could avoid using aggregation as well which will give you better query performance by using appropriate indexes.
So your records should look like below:
{ "_id" : ObjectId("someID1"),"Date" : 1377849179,"cpuUtilization" : 31641,"memory" : 20623801,"hostId" : "600.6.6.6"}
{ "_id" : ObjectId("someID2"),"Date" : 1377849210,"cpuUtilization" : 20141,"memory" : 28787801,"hostId" : "600.6.6.6"}
One collection will be good enough to store the information . One of the thought you have to take care is Write performance , as mongodb locks while writing at database level , Write may be slow. One suggestion I can give to have two or three database which will hold the collections for specific range of host. It help you to write faster . Beginning with version 2.2, MongoDB implements locks on a per-database basis for most read and write operations.
I'm storing timeseries in MongoDB and the strucuture is as follows:
{
"_id" : ObjectId("5128e567df6232180e00fa7d"),
"values" : [563.424, 520.231, 529.658, 540.459, 544.271, 512.641, 579.591, 613.878, 627.708, 636.239, 672.883, 658.895, 646.44, 619.644, 623.543, 600.527, 619.431, 596.184, 604.073, 596.556, 590.898, 559.334, 568.09, 568.563],
"day" : 20110628,
}
The values-array is representing a value for each hour. So the position is important since position 0 = first hour, 1 = second hour and so on.
To update the value of a specific hour is quite easy. For example, to update the 7th hour of the day I do this:
db.timeseries.update({day:20130203},{$set : {values.6 : 482.65}}, {upsert : true})
My problem is that I would like to use upsert, like this
db.timeseries.update({day:20130203},{$set : {values.6 : 482.65}})
But if the document does not exist, MongoDB will craete an embedded document intead of an embedded array. Like this:
{
"_id" : ObjectId("5128e567df6232180e00fa7d"),
"values" : {"6" : 482.65},
"day" : 20130203,
}
There is a ticket to add a feature to solve this issue here, but meanwhile I have come up with a work-around to solve this in my case.
What I do, is that I first created a uniqe-index on the day-field. And whenever I want to upsert a hourly volume I do these two commands.
db.timeseries.insert({day:20130203, values : []}); // Will be rejected if it exists
db.timeseries.update({day:20130203},{$set : {values.6 : 482.65}});
The first statement tried to create a new document - and thanks to the uniqe-index the insert will be rejected if it already exists. If not, a document with an embedded array for value-field will be created. This ensures that the update will work.
Result:
{
"_id" : ObjectId("5128e567df6232180e00fa7d"),
"values" : [null,null,null,null,null,null,482.65],
"day" : 20130203,
}
And here's is my question
In production, when several commands like this will be run simultaneously can I be sure that my update-command will be executed after my insert-command? Note that I want to run both commands in unsafe-mode, that is I will not wait for any response from the server.
(It would also be interesting to here comments about my work-around from a performance perspective.)
Generally yes, there is a way to ensure that two requests from a client use the same connection. By using the same connection you force a strict order of execution on the server.
The way to accomplish this are different for different drivers.
For the Asynchronous Java Driver you can create a "Serialized" MongoClient from the initial MongoClient instance and it will ensure that all requests use a single connection.
For the 10gen java driver it will automatically (via a ThreadLocal) try to use the same connection. You can also give a hint to the driver via the DB.requestStart()/DB.requestEnd() methods that a group of commands need to be pipe-lined.
The startRequest/endRequest applies to most of the 10gen drivers. As another example the PyMongo driver mongo_client has a start_request()/end_request() pair.
From a performance point of view, it is better using only one access to the database than two. Cannot you use $push instead of $set for updating the values field?
$rename function is available only in development version 1.7.2.
How to rename field in 1.6.5?
The simplest way to perform such an operation is to loop through the data set re-mapping the name of the field. The easiest way to do this is to write a function that performs the re-write and then use the .find().forEach() syntax in the shell.
Here's a sample from the shell:
db.foo.save({ a : 1, b : 2, c : 3});
db.foo.save({ a : 4, b : 5, c : 6});
db.foo.save({ a : 7, b : 8 });
db.foo.find();
remap = function (x) {
if (x.c){
db.foo.update({_id:x._id}, {$set:{d:x.c}, $unset:{c:1}});
}
}
db.foo.find().forEach(remap);
db.foo.find();
In the case above I'm doing an $unset and a $set in the same action. MongoDB does not support transactions across collections, but the above is a single document. So you're guaranteed that the set and unset will be atomic (i.e. they both succeed or they both fail).
The only limitation here is that you'll need to manage outside writers to keep the data consistent. My normal preference for this is simply to turn off writes while this updates. If this option is not available, then you'll have to figure out what level of consistency you want for the data. (I can provide some ideas here, but it's really going to be specific to your data and system)
db.collection_name.update({}, {$rename: {"oldname": "newname"}}, false, true);
This will rename the column for each row in the collection.
Also, I discovered that if your column (the one you're renaming) appears within the index catalog (db.collection_name.getIndexes()), then you're going to have to drop and recreate the index (using the new column name) also.