Why does MongoDB *client* use more memory than the server in this case? - mongodb

I'm evaluating MongoDB. I have a small 20GB subset of documents. Each is essentially a request log for a social game along with some captured state of the game the user was playing at that moment.
I thought I'd try finding game cheaters. So I wrote a function that runs server side. It calls find() on an indexed collection and sorts according to the existing index. Using a cursor it goes through all documents in indexed order. The index is {user_id,time}. So I'm going through each user's history, checking if certain values (money/health/etc) increase faster than is possible in the game. The script returns the first violation found. It does not collect violations.
The ONLY thing that this script does on the client is define the function and call mymongodb.eval(myscript) on a mongod instance on another box.
The box that mongod is running on does fine. The one that the script is launched from starts losing memory and swap. Hours later: 8GB of RAM and 6GB of swap are being used on the client machine that did nothing more than launch a script on another box and wait for a return value.
Is the mongo client really that flaky? Have I done something wrong or made an incorrect assumption about mongo/mongod?

If you just want to open up a client connection to a remote database you should use the mongo command, not mongod. mongod starts up a server on your local machine. Not sure what specifying a url will do.
Try
mongo remotehost:27017

From the documentation:
Use map/reduce instead of db.eval() for long running jobs. db.eval blocks other operations!
eval is a function that blocks the entire server if you don't use a special flag. Again, from the docs:
If you don't use the "nolock" flag, db.eval() blocks the entire mongod process while running [...]
You are kind of abusing MongoDB here. Your current routine is strange, because it returns the first violation found, but it will have to re-check everything when run the next time (unless your user ids are ordered and you store the last evaluated user id).
Map/Reduce generally is the better option for a long-running task, but aggregating your data does not seem trivial. However, a map/reduce based solution would also solve the re-evaluation problem.
I'd probably return something like this from map/reduce:
user id -> suspicious actions, e.g.
------
2525454 -> [{logId: 235345435, t: ISODate("...")}]
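The "user id -> suspicious actions" shape above could be produced by logic along these lines. This is only a minimal in-memory sketch, not the asker's script: the field names `_id`, `user_id`, `time` and `money`, and the `maxGainPerSecond` threshold, are assumptions for illustration, and the input is assumed to arrive already sorted by {user_id, time}, as the index would return it.

```javascript
// Hypothetical sketch: collect suspicious jumps per user instead of
// stopping at the first one. Assumes `logs` is sorted by user_id, then time,
// and that `time` is a Date.
function findSuspicious(logs, maxGainPerSecond) {
  const result = {};   // user id -> array of suspicious actions
  let prev = null;     // previous log entry seen
  for (const log of logs) {
    // Only compare consecutive entries that belong to the same user.
    if (prev !== null && prev.user_id === log.user_id) {
      const seconds = (log.time - prev.time) / 1000;
      const gain = log.money - prev.money;
      // Flag gains that exceed what the game allows in the elapsed time.
      if (gain > maxGainPerSecond * seconds) {
        if (!result[log.user_id]) result[log.user_id] = [];
        result[log.user_id].push({ logId: log._id, t: log.time });
      }
    }
    prev = log;
  }
  return result;
}
```

Because every violation is collected per user, a later run does not have to start over just to find the next one.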

Related

Calculate WiredTiger cache miss from db.serverStatus output

Been reading the following https://medium.com/dbkoda/the-notorious-database-cache-hit-ratio-c7d432381229 article which seems to calculate WiredTiger cache miss rate from data taken in db.serverStatus() output.
However, after running the command (and checking that the Java API has no such method; I don't really know how he is using the API), I can't see in the output the properties he retrieves from the Document, which are basically 'pages requested from the cache' and 'pages read into cache'.
The only related metrics I can see are a couple within extra_fields, page_faults and page_reclaims. If I'm correct, those are cache misses and cache hits respectively, right?
I'm trying to obtain cache performance (if it's hitting the cache or not after performing certain aggregations) when using certain queries.
Is there any way to obtain this metric straight away via MongoDB commands?
The code given is intended to be run in mongo shell.
The driver equivalent is the https://docs.mongodb.com/manual/reference/command/serverStatus/ command.
You would execute it using your driver's facility to run admin commands or arbitrary commands or database commands. For Ruby driver, this is https://docs.mongodb.com/ruby-driver/current/tutorials/ruby-driver-database-tasks/#arbitrary-comands.
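As a rough sketch of the calculation, the two counters named in the question can be combined as below. It assumes the result document exposes them under `wiredTiger.cache` as plain numbers; the function name is made up for illustration.

```javascript
// Sketch: derive a cache hit ratio from a serverStatus result document.
// Assumes `status.wiredTiger.cache` holds the two counters as plain numbers.
function wiredTigerCacheHitRatio(status) {
  const cache = status.wiredTiger.cache;
  const requested = cache["pages requested from the cache"];
  const readIn = cache["pages read into cache"];
  if (!requested) return null; // no requests yet, so the ratio is undefined
  // Pages that had to be read into the cache count as misses.
  return (requested - readIn) / requested;
}
```

In the shell this would be called as `wiredTigerCacheHitRatio(db.serverStatus())`. The counters are cumulative since server start, so to judge a single aggregation you would sample them before and after it and compare the differences. If your shell or driver returns 64-bit values as wrapper objects, convert them to numbers first.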

Is it possible to see the incoming queries in mongodb to debug/trace issues?

I have mongo running on my macbook (OSX).
Is it possible to run some kind of a 'monitor' that will display any incoming requests to my mongodb?
I need to trace if I have the correct query formatting from my application.
You will find these tools (or utilities) useful for monitoring as well as diagnosing purposes. All the tools except mtools are packaged with MongoDB server (sometimes they are installed separately).
1. Database Profiler
The profiler stores every CRUD operation coming into the database; it is off, by default. Having it on is quite expensive; it turns every read into a read+insert, and every write into a write+insert. CAUTION: Keeping it on can quickly overpower the server with incoming operations - saturating the IO.
But, it is a very useful tool when used for a short time to find what is going on with database operations. It is recommended to be used in development environments.
The profiler setting can be accessed by using the command db.getProfilingLevel(). To activate the profiler, use the db.setProfilingLevel(level) command. Verify what is captured by the profiler in the db.system.profile collection; you can query it like any other collection using the find or aggregate methods. The db.system.profile document field op specifies the type of database operation; e.g., for queries it is "query".
The profiler has three levels:
0 is not capturing any info (the profiler is turned off; this is the default).
1 captures every query that takes over 100ms.
2 captures every query; this can be used to find the actual load that is coming in.
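For a short debugging session, the profiler commands above could be wrapped like this. It is a sketch only: `captureQueries` and the `work` callback are hypothetical names, and `db` is assumed to be the shell's database object.

```javascript
// Sketch: temporarily enable full profiling, run some work, then read back
// the captured query operations. `work` is a hypothetical callback that
// triggers the application's queries.
function captureQueries(db, work) {
  const previousLevel = db.getProfilingLevel();
  db.setProfilingLevel(2);               // level 2: capture everything
  try {
    work();
  } finally {
    db.setProfilingLevel(previousLevel); // always restore the old level
  }
  // system.profile can be queried like any other collection.
  return db.system.profile.find({ op: "query" }).toArray();
}
```

Restoring the previous level in `finally` keeps the expensive level 2 from staying on if the work throws.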
2. mongoreplay
mongoreplay is a traffic capture and replay tool for MongoDB that you can use to inspect and record commands sent to a MongoDB instance, and then replay those commands back onto another host at a later time. NOTE: Available for Linux and macOS.
3. mongostat
The mongostat command-line utility provides a quick overview of the status of a currently running mongod instance.
You can view the incoming operations in real time. The statistics are displayed, by default, every second. There are various options to customize the output, the time interval, etc.
4. mtools
mtools is a collection of helper scripts to parse, filter, and visualize (through graphs) MongoDB log files.
You will find the mlogfilter script useful; it reduces the amount of information from MongoDB log files using various command options. For example, mlogfilter mongod.log --operation query filters the log by query operations only.

MongoDB: Switch database/collection referenced by a given name on the fly

My application needs only read access to all of its databases. One of those databases (db_1) hosts a collection coll_1 whose entire contents* need to be replaced periodically**.
My goal is to have no or very little effect on read performance for servers currently connected to the database.
Approaches I could think of so far:
1. renameCollection
Build a temporary collection coll_tmp, then use renameCollection with dropTarget: true to move its contents over to coll_1. The downside of this approach is that as far as I can tell, renameCollection does not copy indexes, so once the collection is renamed, coll_1 would need reindexing. While I don't have a good estimate of how long this would take, I would think that query-performance will be significantly affected until reindexing is complete.
2. TTL Index
Instead of straight up replacing, use a time-to-live index to expire documents after the chosen replacement period. Insert new data every time period. This seems like a decent solution to me, except that for our specific application, old data is better than no data. In this scenario, if the cron job to repopulate the database fails for whatever reason, we could potentially be left with an empty coll_1 which is undesirable. I think this might have a negligible effect, but this solution also requires on-the-fly indexing as every document is inserted.
3. Communicate current database to read-clients
Simply use two different databases (or collections?) and inform connected clients which one is more recent. This solution would allow for finishing indexing the new coll_1_alt (and then coll_1 again) before making it available. I personally dislike the solution since it couples the read clients very closely to the database itself, and of course communication channels are always imperfect.
4. copyDatabase
Use copyDatabase to rename (designate) an alternate database db_tmp to db_1. db_tmp would also have a collection coll_1. Once reindexing is complete on db_tmp.coll_1, copyDatabase could be used to simply rename db_tmp to db_1. It seems that this would require dropping db_1 before renaming, leaving a window in which data won't be accessible.
Ideally (and naively), I'd just set db_1 to be something akin to a symlink, switching to the most current database as needed.
Does anyone have good suggestions on how to achieve the desired effect?
*There are about 10 million documents in coll_1.
** The current plan is to replace the collection once every 24 hours. The replacement interval might get as low as once every 30 minutes, but not lower.
The problem that you point out in option 4 you will also have with option 1. dropTarget will also mean that the collection is not available.
Another alternative could be to just have both the old and the new data in the same collection, and use a "version ID" that you then still have to communicate to your clients to do a query on. That at least stops you from having to do reindexing like you pointed out for option 1.
I think your best bet is actually option 3, and it's the most equivalent to changing a symlink, except it is on the client side.
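The "version ID" variant can be illustrated on in-memory data. This is a sketch under assumptions: `meta.currentVersion` is a hypothetical pointer that readers consult first, and the array stands in for coll_1, where the filter would be a query on an indexed version field.

```javascript
// Sketch of the "version ID" idea. `docs` stands in for the collection,
// `meta.currentVersion` is a hypothetical pointer read by every client.
function readCurrent(docs, meta) {
  return docs.filter((d) => d.version === meta.currentVersion);
}

function publish(docs, meta, newDocs) {
  const next = meta.currentVersion + 1;
  // 1. Load the new data next to the old; readers still see the old version.
  for (const d of newDocs) docs.push({ ...d, version: next });
  // 2. Flip the pointer in one step, much like re-pointing a symlink.
  meta.currentVersion = next;
  // 3. Only now remove the documents of older versions.
  for (let i = docs.length - 1; i >= 0; i--) {
    if (docs[i].version < next) docs.splice(i, 1);
  }
}
```

If loading the new data fails before step 2, the pointer never moves and readers keep the old data, which matches the "old data is better than no data" requirement.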

MongoDB Fail Over with one address

I would like to know if it is at all possible to have mongodb fail overs only using a single address. I know replica sets are typically used for this while relying on the driver to make the switch over, but I was hoping there may be a solution out there that would allow one address or hostname to automatically change over when the mongodb instance was recognized as being down.
Any such luck? I know there are solutions for MySQL, but I haven't had much luck with finding something for MongoDB.
Thanks!
Yes, it is possible. The driver holds a cached map of your replica set, which it queries for a new primary when the set goes through an election. This map is refreshed every so often; however, if your application restarts (the process quits, or every request is a fresh process, as in PHP fork mode), the driver has no choice but to rebuild its map. At this point you will suffer connectivity problems.
Of course the best thing to do is to add a seedlist.
Using a single IP defies the redundancy that is in-built into MongoDB.
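A seed list is simply several members given in the connection string. As a small sketch (the host names and the set name below are placeholders, and `replicaSetUri` is a made-up helper):

```javascript
// Sketch: build a replica-set connection string from a seed list.
// Host names and the replica set name are hypothetical placeholders.
function replicaSetUri(hosts, replicaSet) {
  if (hosts.length === 0) throw new Error("seed list must not be empty");
  return "mongodb://" + hosts.join(",") +
         "/?replicaSet=" + encodeURIComponent(replicaSet);
}

const uri = replicaSetUri(
  ["db1.example.com:27017", "db2.example.com:27017", "db3.example.com:27017"],
  "rs0"
);
```

With more than one seed, the driver can still discover the set after a restart even if one of the listed members is down.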

MongoDB takes long for indexing

I have the following setup:
Mac Pro with 2 GB of RAM (yes, not that much)
MongoDB 1.1.3 64-bit
8 million entries in a single collection
index for one field (integer) wanted
Calling .ensureIndex(...) takes more than an hour; actually, I killed the process after that. My impression is that it takes far too long. Also, I terminated the process, but the index can be seen with .getIndexes() afterwards.
Does anybody know what is going wrong here?
Adding an index on an existing data set is expected to take a while, as the entire BTree needs to be constructed. If you think this is taking an unreasonable amount of time, or you've seen a regression in performance, the best bet is to ask about it on the list.
I would just like to point out the command:
db.currentOp()
which prints the current operations running on the server, and also shows the indexing process.
The foreground indexing is done in 3 steps, and the background one in 2 steps (if I remember correctly), but the background one is a lot slower. The foreground one, on the other hand, locks the collection while indexing it (i.e., not very useful on a running application server).
As said before, google BTree if you are interested in how they work.
Anybody knows what is going wrong here?
Are you running via ssh or connecting remotely in some way? Sounds a bit like a broken pipe issue. Did you create the index with {background : true} or not?