I have the following code that currently works. It goes through and finds every file that's newer than a specified date and that matches a regex, then deletes it, as well as the chunks that are pointing to it.
conn = new Mongo("<url>");
db = conn.getDB("<project>");
res = db.fs.files.find({"uploadDate" : { $gte : new ISODate("2017-04-04")}}, {filename : /.*(png)/});
while (res.hasNext()) {
    var tmp = res.next();
    db.getCollection('fs.chunks').remove({"files_id" : tmp._id});
    db.fs.files.remove({ "_id" : tmp._id});
}
It's extremely slow, and most of the time, the client I'm running it from just times out.
Also, I know that I'm deleting files from the filesystem, and not from the normal collections. It's a long story, but the code above does exactly what I want it to do.
How can I get this to run faster? It was pointed out to me earlier that I'm running this code on a client, but is it possible to run it on the server side? Before, I was trying to use the JavaScript driver, which is probably why. I assume using the Mongo shell executes everything on the server.
Any help would be appreciated. So close but so far...
I know that I'm deleting files from the filesystem, and not from the normal collections
GridFS is a specification for storing binary data in MongoDB, so you are actually deleting documents from MongoDB collections rather than files from the filesystem.
It was pointed out to me earlier that I'm running this code on a client, but is it possible to run it on the server side?
The majority of your code (queries and commands) is being executed by your MongoDB server. The client (mongo shell, in this case) isn't doing any significant processing.
It's extremely slow, and most of the time, the client I'm running it from just times out.
You need to investigate where the time is being spent.
If there is problematic network latency between your mongo shell and your deployment, you could consider running the query from a mongo shell session closer to the deployment (if possible) or use query criteria matching a smaller range of documents.
Another obvious candidate to look into would be server resources. For example, is deleting a large number of documents putting pressure on your I/O or RAM? Reducing the number of documents you delete in each script run may also help in this case.
db.fs.files.find({"uploadDate" : { $gte : new ISODate("2017-04-04")}}, {filename : /.*(png)/})
This query likely isn't doing what you intended: the filename is being provided as the second option to find() (so is used for projection rather than search criteria) and the regex matches a filename containing png anywhere (for example: typng.doc).
I assume using the Mongo shell executes everything on the server.
That's an incorrect general assumption. The mongo shell can evaluate local functions, so depending on your code there may be aspects that are executed/evaluated in a client context rather than a server context. Your example code is running queries/commands which are processed on the server, but fs.files documents returned from your find() query are being accessed in the mongo shell in order to construct the query to remove related documents in fs.chunks.
How can I get this to run faster?
In addition to comments noted above, there are a few code changes you can make to improve efficiency. In particular, you are currently removing chunk documents individually. There is a Bulk API in MongoDB 2.6+ which will reduce the round trips required per batch of deletes.
Some additional suggestions to try to improve the speed:
Add an index on {uploadDate:1, filename: 1} to support your find() query:
db.fs.files.createIndex({uploadDate:1, filename: 1})
Use the Bulk API to remove matching chunk documents rather than individual removes:
while (res.hasNext()) {
    var tmp = res.next();
    var bulk = db.fs.chunks.initializeUnorderedBulkOp();
    bulk.find( {"files_id" : tmp._id} ).remove();
    bulk.execute();
    db.fs.files.remove({ "_id" : tmp._id});
}
Add a projection to the fs.files query to only include the fields you need:
var res = db.fs.files.find(
    // query criteria
    {
        uploadDate: { $gte: new ISODate("2017-04-04") },
        // Filenames that end in png
        filename: /\.png$/
    },
    // Only include the _id field
    { _id: 1 }
)
Note: unless you've added a lot of metadata to your GridFS files (or have a lot of files to remove) this may not have a significant impact. The default fs.files documents are ~130 bytes, but the only field you require is _id (a 12 byte ObjectId).
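Putting those suggestions together, a rough sketch of the cleanup script might look like the following (collection and field names are taken from the original script; this is an untested outline rather than a drop-in replacement):
var res = db.fs.files.find(
    { uploadDate: { $gte: new ISODate("2017-04-04") }, filename: /\.png$/ },
    { _id: 1 }
);

var bulk = db.fs.chunks.initializeUnorderedBulkOp();
var fileIds = [];
while (res.hasNext()) {
    var fileId = res.next()._id;
    fileIds.push(fileId);
    // Queue the chunk removal for this file instead of removing immediately
    bulk.find({ files_id: fileId }).remove();
}

if (fileIds.length > 0) {
    bulk.execute();                                // one batched round of chunk deletes
    db.fs.files.remove({ _id: { $in: fileIds } }); // then remove the file documents
}
If the number of matching files is very large, you may want to execute the bulk operation and the fs.files removal in smaller batches rather than collecting every _id in one pass.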
Here is the issue:
If the collection has only the default "_id" index, the time to upload a set of documents stays constant as the collection gets bigger.
But if I add the following index to the collection:
db.users.createIndex({"s_id": "hashed"}, {"background": true})
then the time to upload the same set of documents increases drastically (it looks like an exponential function).
Context:
I'm trying to insert about 80 million documents into a collection. I don't use mongo's sharding; there is only one instance.
I'm using the Python API and here is my code:
import pymongo

client = pymongo.MongoClient(ip_address, 27017)
users = client.get_database('local').get_collection('users')

bulk_op = users.initialize_unordered_bulk_op()
for s in iterator:
    bulk_op.insert(s)
bulk_op.execute()

client.close()
There are 15 concurrent connections (I'm using Apache Spark, and each connection corresponds to a different partition).
The instance has 4 GB of RAM.
The total size of the indexes is about 1.5 GB when the upload is finished.
Thanks a lot for your help.
I have the following two documents in a mongo collection:
{
    _id: "123",
    name: "n1"
}
{
    _id: "234",
    name: "n2"
}
Let's suppose I read those two documents, and make changes, for example, add "!" to the end of the name.
I now want to save the two documents back.
For a single document, there's save, for new documents, I can use insert to save an array of documents.
What is the solution for saving updates to those two documents? The update command asks for a query, but I don't need one; I already have the documents and just want to save them back...
I can update one by one, but if that was 2 million documents instead of just two this would not work so well.
One thing to add: we are currently using Mongo v2.4, we can move to 2.6 if Bulk operations are the only solution for this (as that was added in 2.6)
For this you have two options (available in 2.6):
Bulk loading with tools like mongoimport and mongorestore.
An upsert command for each document.
The first option works better with a huge number of documents (which is your case). With mongoimport you can use the --upsert flag to overwrite the existing documents, or the --upsert --drop flags to drop the existing data and load the new documents.
This option scales well with large amounts of data in terms of I/O and system utilization.
The upsert command works on an in-place update principle. You can use it with a filter, but the drawback is that it works serially and shouldn't be used for huge data sizes; it performs well only with small amounts of data.
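For illustration, a mongoimport run along the lines of the first option might look like this (the database, collection, and file names are placeholders):
mongoimport --db mydb --collection mydocs --file updated_docs.json --upsert
Here updated_docs.json would contain your modified documents (one JSON document per line), and --upsert overwrites any existing document with the same _id.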
When you switch off write concerns, a save doesn't block until the database has written; it returns almost immediately. So with WriteConcern.Unacknowledged, storing 2 million documents with save is a lot quicker than you would think. But unacknowledged writes have the drawback that you won't get any errors from the database.
When you don't want to save them one-by-one, bulk operations are the way to go.
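If you do move to 2.6, a minimal sketch of writing the modified documents back with the Bulk API might look like this in the mongo shell (the collection name and the docs array are placeholders for your own data):
var bulk = db.mycollection.initializeUnorderedBulkOp();
docs.forEach(function (doc) {
    doc.name = doc.name + "!";                   // whatever modification you made
    bulk.find({ _id: doc._id }).replaceOne(doc); // match by _id, no extra query needed
});
bulk.execute();
Each batch is sent to the server in one round trip, which is what makes this scale better than updating one document at a time.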
I've got a mongo db instance with a collection in it which has around 17 million records.
I wish to alter the document structure (to add a new attribute to the document) of all 17 million documents, so that I don't have to programmatically deal with different structures, and to make queries easier to write.
I've been told though that if I run an update script to do that, it will lock the whole database, potentially taking down our website.
What is the easiest way to alter the document without this happening? (I don't mind if the update happens slowly, as long as it eventually happens)
The query I'm attempting to do is:
db.history.update(
    { type: { $exists: false } },
    { $set: { type: 'PROGRAM' } },
    { multi: true }
)
You can update the collection in batches (say, half a million per batch); this will distribute the load. A rough sketch of the batching follows at the end of this answer.
I created a collection with 20000000 records and ran your query on it. It took ~3 minutes to update on a virtual machine, and I could still read from the db in a separate console.
> for(var i=0;i<20000000;i++){db.testcoll.insert({"somefield":i});}
The locking in mongo is quite lightweight, and it is not going to be held for the whole duration of the update. Think of it like 20000000 separate updates. You can read more here:
http://docs.mongodb.org/manual/faq/concurrency/
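For the batching suggestion above, a rough sketch in the mongo shell might look like this (the collection and field names follow the question; the batch size is arbitrary and kept smaller than half a million so the $in query document stays small):
var batchSize = 50000;
while (true) {
    // Collect the _ids of the next batch of documents that still lack "type"
    var ids = db.history.find({ type: { $exists: false } }, { _id: 1 })
                        .limit(batchSize)
                        .map(function (doc) { return doc._id; });
    if (ids.length === 0) break;
    db.history.update({ _id: { $in: ids } }, { $set: { type: 'PROGRAM' } }, { multi: true });
}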
You do actually care if your update query is slow, because of the write lock problem on the database that you are aware of; the two are tightly linked. It's not a simple read query here: you really want this write query to be as fast as possible.
Optimizing the "find" part of the update is key here. First, since your collection has millions of documents, it's a good idea to keep the field name as small as possible (ideally a single character: type => t). This helps because field names are stored in every document, a consequence of the schemaless nature of MongoDB collections.
Second, and more importantly, you need to make your query use a proper index. For that you need to work around the $exists operator, which is not optimized (there are actually several ways to do it).
Third, you can work on the field values themselves. Use http://bsonspec.org/#/specification to estimate the size of the value you want to store, and possibly pick a better choice (in your case, you could replace the 'PROGRAM' string with a numeric constant, for example, and gain a few bytes in the process, multiplied by the number of documents touched by each multi update). The smaller the data you want to write, the faster the operation will be.
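As a rough illustration of the field-name and value suggestions above (a one-character field name and a numeric constant standing in for the 'PROGRAM' string), the update might become:
db.history.update(
    { t: { $exists: false } },   // still uses $exists; see the linked questions below for ways around it
    { $set: { t: 1 } },          // 1 as a stand-in for 'PROGRAM'
    { multi: true }
);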
A few links to other questions that may inspire you:
Can MongoDB use an index when checking for existence of a field with $exists operator?
Improve querying fields exist in MongoDB
I am using the free text search of MongoDB 2.4 with pymongo.
What I want is to get the number of documents containing some text. In the mongo shell, increasing the limit is a good workaround, but from Python it gets very slow since all the documents have to be sent. For an indication, the query is ~50 times slower in pymongo compared to the mongo shell.
I use a command similar to this:
>>>res=db.command('text','mytable',search='eden',limit=100000)
>>>numfound = res['stats']['nfound']
But as I said, since all documents are returned, it is really slow. Is there a command to specify that you don't need the documents, just the stats?
What is the list of all available options?
thx,
colin
I couldn't find a server ticket for this feature, so please add a feature request at jira.mongodb.org; then you'll get updates and feedback from the core server developers.
You can use a projection when doing a text query to reduce the amount of data sent over the wire, but it still sends some information for each matching document, e.g.:
db.mytable.runCommand( "text", { search: "eden", project: {_id: 0, b: 1}})
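If I recall the 2.4 text command output correctly, the stats section of the response (the same one your pymongo code reads) still carries the counters you are after, so combined with a minimal projection the round trip stays small. A sketch in the shell:
var res = db.mytable.runCommand("text", { search: "eden", limit: 100000, project: { _id: 1 } });
printjson(res.stats);   // includes nfound (the count) without needing the full documents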
I'm thinking about trying MongoDB to use for storing our stats but have some general questions about whether I'm understanding it correctly before I actually start learning it.
I understand the concept of using documents, what I'm not too clear about is how much data can be stored inside each document. The following diagram explains the layout I'm thinking of:
Website (document)
- some keys/values about the particular document
- statistics (tree)
- millions of rows where each record is inserted from a pageview (key/value array containing data such as timestamp, ip, browser, etc)
What got me excited about mongodb was the grouping functions such as:
http://www.mongodb.org/display/DOCS/Aggregation
db.test.group({
    cond: { "invoked_at.d": { $gte: "2009-11", $lt: "2009-12" } },
    key: { http_action: true },
    initial: { count: 0, total_time: 0 },
    reduce: function(doc, out) { out.count++; out.total_time += doc.response_time; },
    finalize: function(out) { out.avg_time = out.total_time / out.count; }
});
But my main concern is how hard a command like that would be on the server if there are, say, tens of millions of records across dozens of documents, on a server with 512 MB-1 GB of RAM on Rackspace, for example. Would it still run with low load?
Is there any limit to the number of documents MongoDB can have (across separate databases)? Also, is there any limit to the number of records in a tree like the one I explained above? And does the query I showed above run instantly, or is it some sort of map/reduce query? I'm not very sure if I can execute that on page load in our control panel to get those stats instantly.
Thanks!
Every document has a size limit of 4MB (which in text is A LOT).
It's recommended to run MongoDB in replication mode or to use sharding as you otherwise will have problems with single-server durability. Single-server durability is not given because MongoDB only fsync's to the disk every 60 seconds, so if your server goes down between two fsync's the data that got inserted/updated in that time will be lost.
There is no limit on the number of documents in MongoDB other than your disk space.
You should try to import a dataset that matches your data (or generate some test data) to MongoDB and analyse how fast your query executes. Remember to set indexes on those fields that you use heavily in your queries. Your above query should work pretty well even with a lot of data.
In order to analyze the speed of your query use the database profiler MongoDB comes with. On the mongo shell do:
db.setProfilingLevel(2); // to set the profiling level
[your query]
db.system.profile.find(); // to see the results
Remember to turn off profiling once you're finished (log will get pretty huge otherwise).
Regarding your database layout, I suggest changing the "schema" (yeah yeah, schemaless...) to:
website (collection):
- some keys/values about the particular document
statistics (collection)
- millions of rows where each record is inserted from a pageview (key/value array containing data such as timestamp, ip, browser, etc)
+ DBRef to website
See Database References
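A minimal sketch of that layout in the shell, with made-up field names and a plain manual reference rather than a formal DBRef:
// One document per website
var siteId = new ObjectId();
db.website.insert({ _id: siteId, domain: "example.com" });   // plus whatever keys/values you need

// One document per pageview, referencing the website document
db.statistics.insert({
    website_id: siteId,        // reference back to the website
    timestamp: new Date(),
    ip: "203.0.113.7",
    browser: "Firefox"
});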
Documents in MongoDB are limited to a size of 4MB. Let's say a single page view results in 32 bytes being stored. Then you'll be able to store about 130,000 page views in a single document.
Basically the amount of page views a page can generate is infinite, and you indicated that you expect millions of them, so I suggest you store the log entries as separate documents. Each log entry should contain the _id of the parent document.
On 32-bit systems a database is limited to 2 GB of total data, which effectively caps the number of documents it can hold. 64-bit systems don't have this limitation.
The group() function is a map-reduce query under the hood. The documentation recommends you use a map-reduce query instead of group(), because it has some limitations with large datasets and sharded environments.
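For reference, a sketch of what the same aggregation might look like written directly as map-reduce (untested, and assuming the same field names as the group() example above):
db.test.mapReduce(
    function () {
        // Map: one record per document, keyed by http_action
        emit(this.http_action, { count: 1, total_time: this.response_time });
    },
    function (key, values) {
        // Reduce: sum counts and total times per http_action
        var out = { count: 0, total_time: 0 };
        values.forEach(function (v) {
            out.count += v.count;
            out.total_time += v.total_time;
        });
        return out;
    },
    {
        query: { "invoked_at.d": { $gte: "2009-11", $lt: "2009-12" } },
        finalize: function (key, out) {
            out.avg_time = out.total_time / out.count;
            return out;
        },
        out: { inline: 1 }
    }
);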