Using MongoDB, I'm trying to remove a column from a collection that contains ~3 million records.
db.Listing.update( {}, { $unset: { Longitude: 1 } }, false, true);
When I execute this command, RAM usage on the server keeps climbing until the machine runs out of memory, and then the server is hosed and needs to be physically rebooted. Is there a better way to remove a column from a large collection that won't hose the server?
I expect your problem is the system OOM killer. You should make sure you aren't limiting the resources for mongod. See this: http://www.mongodb.org/display/DOCS/Checking+Server+Memory+Usage
http://prefetch.net/blog/index.php/2009/09/30/how-the-linux-oom-killer-works/
If you are using a virtualization system like OpenVZ, you might want to stop or adjust its memory over-committing feature.
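If the single collection-wide update keeps exhausting memory, one workaround is to unset the field in smaller batches so the server gets room to breathe between rounds. A rough sketch from the shell (the batch size is an arbitrary assumption):

// Remove Longitude in batches, relying only on the default _id index
var batchSize = 10000;
var found = true;
while (found) {
    var docs = db.Listing.find({ Longitude: { $exists: true } }, { _id: 1 })
                         .limit(batchSize)
                         .toArray();
    found = docs.length > 0;
    docs.forEach(function(doc) {
        db.Listing.update({ _id: doc._id }, { $unset: { Longitude: 1 } });
    });
    print("Processed batch of " + docs.length + " documents");
}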
I am designing a MongoDB database that looks something like this:
registry: {
    id: 1,
    duration: 123,
    score: 3,
    text: "aaaaaaaaaaaaaaaaaaaaaaaaaaaa"
}
The text field is very big compared to the rest. I sometimes need to perform analytics queries that average the duration or the score, but never use the text.
I also have more specific queries that retrieve all the information about a single document, but for those I could afford to spend more time making two queries to retrieve all the data.
My question is, if I make a query like this:
db.registries.aggregate([
    {
        $group: {
            _id: null,
            averageDuration: { $avg: "$duration" }
        }
    }
])
Would it need to read the data from the text field? That would make the query much slower and take a lot of RAM. If that is the case, would it be better to split the records in two, like this?
registry: {
    id: 1,
    duration: 123,
    score: 3
}

registry_text: {
    id: 1,
    text: "aaaaaaaaaaaaaaaaaaaaaaaaaaaa"
}
Thanks a lot!
I don't know how the server works in this case, but I expect that, for caching reasons, it will load complete documents into memory when it reads them from disk. Disk reads are very slow (that is, expensive in time taken), and I expect the server will aggressively use memory if it can to avoid reads.
An important note here is that documents are stored on disk as lists of key-value pairs comprising their contents. To avoid loading a field from disk, the server would have to rebuild the document in question as part of reading it, since there are length fields involved. I don't see this happening in practice.
So, once the documents are in memory I assume they are there with all of their fields and I don't expect you can tune this.
When you are querying, the server may or may not drop individual fields but this would only change the memory requirements for the particular query. Generally these memory requirements are dwarfed by the overall database cache size and aggregation pipelines. So I don't think it really matters at what point a large field is dropped from a document during query processing (assuming you project it out in the query).
I think this isn't a worthwhile matter to try to ponder/optimize. If you have a real system with real workloads, you'll be much more pressed to optimize something else.
If you are concerned with memory usage when the amount of available memory is consumer-sized (say, under 16 GB), just get more memory. It's insanely cheap given how much time you'd spend working around the lack of it (whether we are talking about provisioning bigger AWS instances or buying more sticks of RAM).
You should be able to use $project to limit the fields read.
As general advice, don't try to normalize data in MongoDB the way you would with SQL. Also, it is often more performant to read documents plain from the DB and do the processing on your application server.
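For instance, here is a minimal sketch of an early $project stage, using the collection and fields from the question:

db.registries.aggregate([
    // keep only the small numeric fields; later stages never see the text
    { $project: { duration: 1, score: 1 } },
    { $group: { _id: null, averageDuration: { $avg: "$duration" } } }
])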
I have found this answer, which seems to indicate that projection still requires fetching each full document on the database server; it only reduces bandwidth:
When using projection to remove unused fields, the MongoDB server will have to fetch each full document into memory (if it isn't already there) and filter the results to return. This use of projection doesn't reduce the memory usage or working set on the MongoDB server, but can save significant network bandwidth for query results depending on your data model and the fields projected.
https://dba.stackexchange.com/questions/198444/how-mongodb-projection-affects-performance
I have a shell script that creates a cursor on a collection and then updates each document with data from another collection.
When I run it against a local db it finishes in about 15 sec, but against a hosted database it runs for over 45 min.
db.col1.find().forEach(function(doc) {
    db.col2.findAndModify({
        query: { attr1: doc.attr1 },
        update: { $push: { attr2: doc.attr2 } },
        upsert: true
    });
});
So there is obviously network overhead between the client and server when processing this script. Is there a way to keep the processing all server-side? I have looked at server-side JavaScript, but from what I read here, it is not a recommended practice.
Locally, you have almost no network overhead. No interference, no routers, no switches, no limited bandwidth. Plus, in most situations your mass storage, be it SSD or HDD, more or less idles around (unless you tend to play games while developing). So when an operation requiring a lot of IO capability kicks in, that capacity is available.
When you run your script from your local shell against a server, here is what happens.
db.col1.find().forEach: The whole collection is going to be read from an unknown medium (most likely HDDs whose available IO could be shared among many instances). The documents will then be transferred to your local shell. Compared to a connection to localhost, each document retrieval is routed over dozens of hops, each adding a tiny amount of latency. Over presumably quite a few documents, this adds up. Don't forget that the complete document is sent over the network, since you did not use projection to limit the fields returned to attr1 and attr2. External bandwidth is of course slower than a connection to localhost.
db.col2.findAndModify: For each document, a query is done. Again, the shared IO might well kill performance.
{ query: { attr1: doc.attr1 }, update: { $push: { attr2: doc.attr2 } }, upsert: true }: Are you sure attr1 is indexed, by the way? And even when it is, it is not certain the index is currently in RAM. We are talking about a shared instance, right? It might also well be that your write operations have to wait before they are even processed by mongod: per the default write concern, the data has to be successfully applied to the in-memory data set before it is acknowledged, and if gazillions of operations are sent to the shared instance, yours might well be number one bazillion and one in the queue. And the network latency is added a second time, since the values transferred to your local shell need to be sent back.
What you can do
First of all, make sure that you
limit the returned values to those you need using projection:
db.col1.find({},{ "_id":0, "attr1":1, "attr2":1 })
Make sure you have attr1 indexed
db.col2.ensureIndex( { "attr1":1 } )
Use bulk operations. They are executed much faster, at the expense of reduced feedback in case of problems.
// We can use unordered here, because the operations
// each apply to only a single document
var bulk = db.col2.initializeUnorderedBulkOp();
// A counter we will use for intermediate commits
// We do the intermediate commits in order to keep RAM usage low
var counter = 0;
// We limit the result to the values we need
db.col1.find({}, { "_id": 0, "attr1": 1, "attr2": 1 }).forEach(
    function(doc) {
        // Find the matching document, update exactly that,
        // and if it does not exist, create it
        // (note: upsert() has to be called before updateOne() queues the op)
        bulk
            .find({ "attr1": doc.attr1 })
            .upsert()
            .updateOne({ $push: { "attr2": doc.attr2 } });
        counter++;
        // We have queued 1k operations and can commit them
        // MongoDB would split the bulk ops in batches of 1k operations anyway
        if (counter % 1000 === 0) {
            bulk.execute();
            print("Operations committed: " + counter);
            // Initialize a new batch of operations
            bulk = db.col2.initializeUnorderedBulkOp();
        }
    }
);
// Execute the remaining operations not committed yet
bulk.execute();
print("Operations committed: " + counter);
I am using mongodb via the mongo shell to query a large collection. For some reason after 90 seconds the mongo shell seems to be stopping my query and nothing is returned.
I have tried the following two commands, but neither returns anything. After 90 seconds the shell just gives me a new prompt so I can type another command.
db.cards.find({ "Field": "Something" }).maxTimeMS(9999999)
db.cards.find({ "Field": "Something" }).addOption(DBQuery.Option.tailable)
db.cards.find() returns results, but anything with query parameters times out at exactly 90 seconds and nothing is returned.
Any help would be greatly appreciated.
Given the level of detail in your question, I am going to focus on 'query a large collection' and guess that you are using the MMAPv1 storage engine, with no index coverage on your query.
Are you disk bound?
Given the above assumptions, you could be cycling data between RAM and disk. Mongo has a default 100MB RAM limit (per aggregation pipeline stage, for example), so if your query has to examine a lot of documents (no index coverage), paging data from disk to RAM could be the culprit. I have heard of the mongo shell behaving as you describe, or locking/terminating, when memory constraints are exceeded.
32bit systems can also impose severe memory limits for large collections.
You could look at your OS specific disk activity monitor to get a clue into whether this is your problem.
Just how large is your collection?
Next, how big is your collection? You can run show collections to see the physical size of the collection, and db.cards.count() to see your record count. This helps quantify "large collection".
NOTE: you might need the mongo-hacker extensions to see collection disk use in show collections.
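A quick alternative that needs no extensions is the collection stats helper:

// Reports document count, average object size and storage size
db.cards.stats()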
Mongo shell investigation
Within the mongo shell, you have a couple more places to look.
By default, mongo will log slow queries (> 100ms). After your 90 sec timeout:
db.adminCommand({getLog: "global" })
and look for slow query log entries.
Next look at your winning query plan.
var e = db.cards.explain()
e.find({ "Field": "Something" })
I am guessing you will see
"stage": "COLLSCAN",
Which means you are doing a full collection scan and you need index coverage for your query (good idea for queries and sorts).
Suggestions
You should have at least partial index coverage on any production query. A proper index should solve your problem (assuming you don't have documents > 16MB).
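For example, a single-field index matching the query above (a sketch; ensureIndex was the shell helper of this era, later renamed createIndex):

db.cards.ensureIndex({ "Field": 1 })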
Another approach (which I don't recommend; indexing is better) is to use a cursor instead:
var cursor = db.cards.find({ "Field": "Something" })
while (cursor.hasNext()) {
    print(tojson(cursor.next()));
}
Depending on the root cause, this may work for you.
I have 2 EC2 instances, one a mongodb server and the other a python web app (same availability zone). The python server connects to the mongo server using PyMongo, and everything works fine.
The problem is, when I profile execution time in python, some calls (less than 5%) take up to a couple of seconds to return. I was able to narrow down the problem: the time delay was actually in the db calls to the mongo server.
The two causes I suspected were:
1. The Mongo server is slow/over-loaded
2. Network latency
So I tried upgrading the mongo server to a 4x faster instance, but the issue still happens (some calls take even 3 secs to return). I assumed that since both servers are on EC2, network latency should not be a problem... but maybe I was wrong.
How can I confirm whether the issue is actually the network itself? If so, what is the best way to solve it? Is there any other possible cause?
Any help is appreciated...
Thanks,
UPDATE: The entities that I am fetching are very small (and indexed), and the calls usually take only 0.01-0.02 secs to finish.
UPDATE:
As suggested by "James Wahlin", I enabled profiling on my mongo server and got some interesting logs.
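For reference, a minimal sketch of how the profiler can be turned on (the 100 ms slow-op threshold is an assumption):

// Level 1 profiles only operations slower than the given threshold in ms
db.setProfilingLevel(1, 100)

Here are the slow entries it surfaced: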
Fri Mar 15 18:05:22 [conn88635] query db.UserInfoShared query: { $or: [ { _locked: { $exists: false } }, { _locked: { $lte: 1363370603.297361 } } ], _id: "750837091142" } ntoreturn:1 nscanned:1 nreturned:1 reslen:47 2614ms

Fri Mar 15 18:05:22 [conn88635] command db.$cmd command: { findAndModify: "UserInfoShared", fields: { _id: 1 }, upsert: true, query: { $or: [ { _locked: { $exists: false } }, { _locked: { $lte: 1363370603.297361 } } ], _id: "750837091142" }, update: { $set: { _locked: 1363370623.297361 } }, new: true } ntoreturn:1 reslen:153 2614ms
You can see these two calls took more than 2 secs to finish. The _id field is unique-indexed, and finding by it should not have taken this much time. Maybe I have to post a new question for it, but can the mongodb GLOBAL LOCK be the cause?
@James Wahlin, thanks a lot for helping me out.
As it turned out, the main cause of the latency was the mongodb GLOBAL LOCK itself. Our lock percentage was averaging 5% and sometimes peaked at 30-50%, and that resulted in the slow queries.
If you are facing this issue, the first thing to do is enable the mongodb MMS service (mms.10gen.com), which will give you a lot of insight into what exactly is happening in your db server.
In our case the LOCK PERCENTAGE was really high, and there were multiple reasons for it. The first thing to do to figure it out is to read the mongodb documentation on concurrency:
http://docs.mongodb.org/manual/faq/concurrency/
The reason for the lock can be at the application level, in mongodb, or in the hardware.
1) Our app was doing a lot of updates, and each update (more than 100 ops/sec) holds the global lock in mongodb. The issue was that when an update happens for an entry which is not in memory, mongo has to load the data into memory first and then update it (in memory), and the whole process happens while the global lock is held. If, say, the whole thing takes 1 sec to complete (0.75 sec to load the data from disk and 0.25 sec to update in memory), all the other update calls wait for the entire 1 sec, and such updates start queuing up... and you will notice more and more slow requests in your app server.
The solution (while it might sound silly) is to query for the same data before you make the update. What that effectively does is move the 'load data into memory' part (0.75 sec) out of the global lock, which greatly reduces your lock percentage.
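For illustration, a sketch of that pre-read trick, reusing the collection and fields from the logs above:

// The plain read pulls the document into RAM without holding the global
// lock, so the following update only locks for the in-memory change
db.UserInfoShared.findOne({ _id: "750837091142" });
db.UserInfoShared.update({ _id: "750837091142" },
                         { $set: { _locked: 1363370623.297361 } });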
2) The other main cause of the global lock is mongodb's data flush to disk. Basically, every 60 sec (or less) mongodb (or the OS) writes data to disk, and the global lock is held during this process. (This kind of explains the random slow queries.) In your MMS stats, look at the graph for background flush avg... if it's high, that means you need faster disks.
In our case, we moved to a new EBS-optimized instance in EC2 and also bumped our provisioned IOPS from 100 to 500, which almost halved the background flush avg; the servers are much happier now.
I'm working on the schema design of a scalable session table (for a customized authentication scheme) in MongoDB. I know MongoDB's scalability is inherited from the design, and I also have requirements. My use case is simple:
When a user logs in, a random token is generated and granted to the user, then a record is inserted into the session table using the token as the primary key (which is shard-able). The old token record is deleted if it exists.
The user accesses the service using the token.
My question is: if the system keeps deleting expired session keys, the session collection (considering a sharded situation where I need to partition on the token field) can grow very big and contain a lot of 'gaps' from expired sessions. How do I gracefully handle this problem (or is there a better design)?
Thanks in advance.
Edit: My question is about the storage level. How does mongodb manage disk space if records are frequently removed and inserted? There should be some kind of (auto-)shrink mechanism there, hopefully one that won't block reads to the collection.
TTL is good and all; however, repair is not. --repair is not designed to be run regularly on a database, in fact maybe once every 3 months or so. It does a lot of internal work that, if run often, will seriously damage your server's performance.
Now, about reuse of disk space in such an environment: when you delete a record, MongoDB will free that "block". If another document fits into that "block" it will reuse that space; otherwise it will actually create a new extent, meaning a new "block", a.k.a. more space.
So if you want to save disk space here, you will need to make sure that documents do not outgrow each other. Fortunately you have a relatively static schema here, maybe something like:
{
    _id: {},
    token: {},
    user_id: {},
    device: {},
    user_agent: ""
}
which should mean that documents, hopefully, will reuse their space.
Now you come to a tricky part if they do not. MongoDB will not automatically give back free space per collection (though it does per database, since dropping a database is the same as deleting its files), so you have to run --repair on the database or the compact command on the collection to actually get your space back.
That being said, I believe your documents will be of similar size to each other, so I am unsure whether you will see a problem here. But you could also try usePowerOf2Sizes: http://www.mongodb.org/display/DOCS/Padding+Factor#PaddingFactor-usePowerOf2Sizes. For a collection that will frequently have inserts and deletes, it should help performance on that front.
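For illustration, a sketch of enabling it from the shell (the collection name sessions is an assumption):

// Allocate record space in powers of two so freed slots are easier to reuse
db.runCommand({ collMod: "sessions", usePowerOf2Sizes: true })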
I agree with @Steven Farley. While creating an index you can set a TTL; here is how to do it in Python with the PyMongo driver:
http://api.mongodb.org/python/1.3/api/pymongo/collection.html#pymongo.collection.Collection.create_index
I would have to suggest you use TTL. You can read more about it at http://docs.mongodb.org/manual/tutorial/expire-data/; it would be a perfect fit for what you're doing. This has only been available since version 2.2.
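As a minimal sketch in the shell (field name and lifetime are assumptions):

// Documents are removed roughly 3600 seconds after their createdAt timestamp
db.sessions.ensureIndex({ "createdAt": 1 }, { expireAfterSeconds: 3600 })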
How mongo stores data: http://www.mongodb.org/display/DOCS/Excessive+Disk+Space
Way to clean up removed records:
Command Line: mongod --repair
See: http://docs.mongodb.org/manual/reference/mongod/#cmdoption-mongod--repair
Mongo Shell: db.repairDatabase()
See: http://docs.mongodb.org/manual/reference/method/db.repairDatabase/
So you could have an automated clean-up script that executes the repair; keep in mind this will block mongo for a while.
There are a few ways to achieve sessions:
Capped collections, as shown in this use case.
Expire data with a TTL index by adding expireAfterSeconds to ensureIndex.
Cleaning sessions program-side using a TTL and remove.
Faced with the same problem, I used solution 3 for the flexibility it provides.
You can find a good overview of remove and disk optimization in this answer.
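For illustration, a minimal sketch of solution 3 (collection name, field name and lifetime are assumptions); it would typically run periodically from a cron job:

// Delete sessions idle for more than one hour
var cutoff = new Date(Date.now() - 60 * 60 * 1000);
db.sessions.remove({ lastActivity: { $lt: cutoff } });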