Session table in MongoDB

I'm working on the schema design for a scalable session table (for a customized authentication system) in MongoDB. I know that MongoDB's scalability comes from good design and that it has its own requirements. My use case is simple:
when a user logs in, a random token is generated and granted to the user, then a record is inserted into the session table using the token as the primary key, which is shardable; the old token record is deleted if it exists.
the user accesses the service using the token
My question is: if the system keeps deleting expired session keys, the session collection (considering the sharded situation, where I need to partition on the token field) may grow very large and contain a lot of 'gaps' left by expired sessions. How do I handle this gracefully (or is there a better design)?
Thanks in advance.
Edit: My question is about the storage level. How does MongoDB manage disk space when records are frequently removed and inserted? There should be some kind of (auto-)shrink mechanism. Hopefully it won't block reads on the collection.

TTL is good and all, however repair is not. --repair is not designed to be run regularly on a database; in fact, maybe once every 3 months or so. It does a lot of internal work that, if run often, will seriously damage your server's performance.
Now, about reuse of disk space in such an environment: when you delete a record it will free that "block". If another document fits into that "block" it will reuse that space, otherwise it will actually create a new extent, meaning a new "block", a.k.a. more space.
So if you want to save disk space here, you will need to make sure that documents do not outgrow one another. Fortunately you have a relatively static schema here of maybe:
{
_id: {},
token: {},
user_id: {},
device: {},
user_agent: ""
}
which should mean that documents, hopefully, will reuse their space.
Now you come to a tricky part if they do not. MongoDB will not automatically give back free space per collection (but it does per database, since that is the same as deleting the files), so you have to run --repair on the database or compact() on the collection to actually get your space back.
That being said, I believe your documents will be of similar size to each other, so I am unsure whether you will see a problem here, but you could also try http://www.mongodb.org/display/DOCS/Padding+Factor#PaddingFactor-usePowerOf2Sizes for a collection that will frequently have inserts and deletes; it should help performance on that front.
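For reference, a minimal sketch of turning that allocation strategy on for an existing collection; the collection name sessions is an assumption, and this only applies to MongoDB versions whose collMod supports usePowerOf2Sizes (roughly the 2.2-era MMAPv1 storage engine):
db.runCommand({ collMod: "sessions", usePowerOf2Sizes: true })  // documents get power-of-two sized slots, easing reuse after deletes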

I agree with @Steven Farley. While creating an index you can set a TTL; with the pymongo driver in Python it can be done like this:
http://api.mongodb.org/python/1.3/api/pymongo/collection.html#pymongo.collection.Collection.create_index

I would have to suggest you use TTL. You can read more about it at http://docs.mongodb.org/manual/tutorial/expire-data/; it would be a perfect fit for what you're doing. It is only available since version 2.2.
How mongo stores data: http://www.mongodb.org/display/DOCS/Excessive+Disk+Space
Way to clean up removed records:
Command Line: mongod --repair
See: http://docs.mongodb.org/manual/reference/mongod/#cmdoption-mongod--repair
Mongo Shell: db.repairDatabase()
See: http://docs.mongodb.org/manual/reference/method/db.repairDatabase/
So you could have an automated clean-up script that executes the repair; keep in mind this will block Mongo for a while.

There are a few ways to handle sessions:
1. Capped collections, as shown in this use case.
2. Expiring data with a TTL index, by adding expireAfterSeconds to ensureIndex.
3. Cleaning up sessions on the application side using a TTL field and remove.
Facing the same problem, I used solution 3 for the flexibility it provides; a sketch of that approach follows below.
You can find a good overview of remove and disk optimization in this answer.
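As an illustration only, a rough sketch of solution 3 in the mongo shell, assuming a hypothetical sessions collection with a lastAccess timestamp and a 30-minute session lifetime (run periodically from the application or a scheduled job):
var cutoff = new Date(Date.now() - 30 * 60 * 1000);  // sessions idle for more than 30 minutes
db.sessions.remove({ lastAccess: { $lt: cutoff } });  // delete the expired sessions in one pass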

Related

How to choose the best mechanism for deleting logs saved to MongoDB

I'm implementing a logger using MongoDB and I'm quite new to the concept.
The logger is supposed to log each request and its response.
I'm facing the choice between using MongoDB's TTL index or just running a delete query overnight.
I think the first method might bring some overhead by using a background thread and probably rebuilding the index after each deletion, but it frees space as soon as the documents expire, and this might be beneficial.
The second approach, on the other hand, does not have this kind of overhead, but it only frees up space at the end of each day.
It seems to me that the second approach suits my case better, as my server is never on the edge of running out of disk space, but we do always need to reduce the load on the server.
I'm wondering if there are aspects of the subject that I'm missing, and I'm also not sure about the typical applications of the MongoDB TTL index.
Just my opinion:
It seems best to store logs in monthly, daily or hourly collections, depending on your application's write load, and at the end of the day to just drop() the oldest collections with a custom script. From experience, TTL indexes do not work well when there is a heavy write load on your collection, since they add additional write load based on the expiration time.
For example, imagine you insert log events at 06:00 at 100k/sec and your TTL index lifetime is set to 3h; this means that after 3h, at 09:00, you will have those 100k/sec of deletes applied to your collection, which are also stored in the oplog... The solution in such cases is to add more shards, but that becomes kind of expensive... It is far easier to just drop the expired collection...
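A rough sketch of that rotation from the mongo shell, assuming hypothetical daily collections named logs_YYYYMMDD and a seven-day retention (run from a scheduled script):
// drop any daily log collection older than seven days (collection names like logs_20240101 are assumed)
var cutoffName = "logs_" + new Date(Date.now() - 7 * 24 * 3600 * 1000).toISOString().slice(0, 10).replace(/-/g, "");
db.getCollectionNames().filter(function (name) {
  return name.indexOf("logs_") === 0 && name < cutoffName;
}).forEach(function (name) {
  db.getCollection(name).drop();
});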
Moreover, depending on your project size, for bigger collections you can additionally shard and pre-split the collections on a compound index of a hashed datetime field (every log contains a timestamp) and another field which you will often search on; this allows scalable searches across multiple distributed shards.
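As an illustration, a hedged sketch of sharding such a collection; the database name logs, the collection name events, and the field names are assumptions, and compound hashed shard keys require MongoDB 4.4 or newer:
sh.enableSharding("logs")
// the hashed timestamp spreads writes across shards; the second field keeps common lookups targeted
sh.shardCollection("logs.events", { ts: "hashed", service: 1 })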
Also note that MongoDB is a general-purpose document database and full-text search is more or less limited to expensive regex expressions, so if you need fast raw full-text search over your logs, an inverted-index search engine like Elasticsearch on top of your MongoDB backend may be a good solution to cover this functionality.

MongoDB collection size not changed after removal of fields from documents

To reduce my existing collection size, I removed some unwanted fields from the documents in the collection. Before and after, I ran the collection stats to check the size, but it never changed.
Am I missing something (like an update) that is needed for the reduced size to show up in my stats? Please advise.
I am running this on my local PC, without any other nodes.
Thank you.
This is correct behavior.
The only time mongo releases disk space is when you drop a database or do a repair. (see here)
There is a prettier explanation here.
Basically, MongoDB keeps any space it has allocated unless and until you drop a database or do a repair. It does this because allocating space is not efficient; it is time-consuming. So once it has the space, it keeps it and uses the blank space created by deletes until there is none left, and then it allocates more.
As mentioned by Daniel F in a comment on the question, this worked for me in the mongo shell:
use your-database-name
db.runCommand({ compact: "your-collection-name" })
I know this is an old question, but I'll add this here just in case. Read the documentation on compact carefully, as it causes the node it is run on to stop taking most traffic until it is done.
All the previous ideas have their use cases; I'll add another one that might be the best fit. If you are able to query the subset that you want to keep, it is a lot smaller than the entire collection, and there are tons of other collections in the database that you don't want to wait on restoring from scratch, then you can write that subset to another collection, apply indexes, and, when done, plan a short downtime if necessary and switch the collections.
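A rough sketch of that switch in the mongo shell, assuming a hypothetical logs collection of which only the last 30 days should be kept (the index and the downtime window are up to you):
var cutoff = new Date(Date.now() - 30 * 24 * 3600 * 1000);
// copy only the documents you want to keep into a new collection
db.logs.aggregate([{ $match: { ts: { $gte: cutoff } } }, { $out: "logs_trimmed" }]);
db.logs_trimmed.createIndex({ ts: 1 });
// during the planned downtime, swap the collections (dropTarget = true replaces the old one)
db.logs_trimmed.renameCollection("logs", true);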

Intercept or filter out oplog transactions from MongoDB

There is a MongoDB database with interesting data I want to examine. Unfortunately, due to size concerns, the database is purged of "old" records once every 48 hours.
I created a replica set with a secondary database system that has priority 0 and votes 0, so as not to interfere with the main database's performance. This works great, as I can query the secondary and get my data. However, on many occasions my system cannot process all the records in time and will lose some old records if I do not get to them within 48 hours.
Is there a way I can cache the oplog on another system which I can then process at my leisure, possibly filtering out the deletes until I am ready?
I considered the slavedelay parameter, but that affects all transactions. I also looked into Tungsten Replicate as a solution so I could essentially cache the oplogs; however, it does not support MongoDB as a data source.
Is the oplog stored in plain text on the secondary, such that I can read it and extract what I want from it?
Any pointers would be helpful; unfortunately I could not find much documentation on the oplog on the MongoDB website.
MongoDB oplog is stored as a capped collection called 'oplog.rs' in your local DB:
use local
db.oplog.rs.find()
If you want to store more old data in oplog for later use, you can try to increase the size of that collection. See http://docs.mongodb.org/manual/tutorial/change-oplog-size/
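As a side note beyond the original answer: on MongoDB 3.6 and newer the oplog can also be resized in place without recreating it; a minimal sketch, with the size (in megabytes) chosen arbitrarily:
// run against the member whose oplog you want to grow; size is given in MB
db.adminCommand({ replSetResizeOplog: 1, size: 16000 })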
Alternatively, you can recreate oplog.rs as an uncapped collection (though this is not recommended, since you will then have to clean up the oplog manually). Follow the same steps as for increasing the size above, but when recreating the oplog, use this command:
db.runCommand( { create: "oplog.rs", capped: false})
Another solution is to create a cron job that runs the following command to dump the oplog into a folder named YYYYMMDD:
mongodump --db local --collection oplog.rs -o $(date +%Y%m%d)
Hope that helps.
I wonder why you would do that manually. The "canonical" way to do it is to identify either the lifetime or the expiration date of a record. If it is a lifetime, you'd do something like
db.collection.insert({'foo':'bar' [...], created: ISODate("2014-10-06T09:00:05Z")})
and
db.collection.ensureIndex({'created':1},{expireAfterSeconds:172800})
By doing so, a thread called TTLMonitor will wake up every minute and remove all documents whose created field is older than two days.
If you have a fixed expiration date for each document, you'd basically do the same:
db.collection.insert({'foo':'bar' [...], expirationDate: ISODate("2100-01-01T00:00:00Z")})
and
db.collection.ensureIndex({expirationDate:1},{expireAfterSeconds:0})
This will purge the documents in the first run of TTLMonitor after the expirationDate.
You could adjust expireAfterSeconds to a value which safely allows you to process the records before they are purged, keeping the overall size at an acceptable level and making sure that the records are removed even if your application goes down during the purging work. (Not to mention that you don't need to maintain the purging logic yourself.)
That being said, and in the hope it might be useful to you, I think your problem is a conceptual one.
You have a scaling problem. Your system is unable to deal with peaks, hence it occasionally is unable to process all data in time. Instead of fiddling with the internals of MongoDB (which might be quite dangerous, as @chianh correctly pointed out), you should rather scale by identifying your bottleneck and sizing it to handle your peaks.

Do I need to reindex a MongoDB collection after some period of time, like an RDBMS?

I'd like to get some knowledge about reindexing in MongoDB. Please forgive me, as I am asking a somewhat subjective question.
The question is: does MongoDB need periodic reindexing, like we do for an RDBMS, or does Mongo manage it automatically?
Thanks for your feedback.
MongoDB takes care of indexes during routine updates. This operation may be expensive for collections that have a large amount of data and/or a large number of indexes. For most users, the reIndex command is unnecessary. However, it may be worth running if the collection size has changed significantly or if the indexes are consuming a disproportionate amount of disk space.
Call reIndex using the following form:
db.collection.reIndex();
Reference : https://docs.mongodb.com/manual/reference/method/db.collection.reIndex/
That's a good question, because nowhere in the documentation does it mention explicitly that indexes are automatically maintained*. But, they are. You rarely need to reindex manually.
*I filed a bug for that, go vote for it :)

MongoDB fast deletion best approach

My application currently uses MySQL. In order to support very fast deletion, I organize my data in partitions according to timestamp. Then, when data becomes obsolete, I just drop the whole partition.
It works great, and cleaning up my DB doesn't harm my application's performance.
I would like to replace MySQL with MongoDB, and I'm wondering if there is something similar in MongoDB, or whether I would just need to delete the records one by one (which, I'm afraid, would be really slow, keep my DB busy, and slow down query response times).
In MongoDB, if your requirement is to delete data to limit the collection size, you should use a capped collection.
On the other hand, if your requirement is to delete data based on a timestamp, then a TTL index might be exactly what you're looking for.
From official doc regarding capped collections:
Capped collections automatically remove the oldest documents in the collection without requiring scripts or explicit remove operations.
And regarding TTL indexes:
Implemented as a special index type, TTL collections make it possible to store data in MongoDB and have the mongod automatically remove data after a specified period of time.
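For illustration, a minimal sketch of both options in the mongo shell; the collection names, cap size, field name and expiry period are assumptions, not taken from the answer:
// capped collection: the oldest documents are overwritten once the 1 GB cap is reached
db.createCollection("events_capped", { capped: true, size: 1024 * 1024 * 1024 })
// TTL index: documents are removed roughly 24 hours after their createdAt timestamp
db.events.createIndex({ createdAt: 1 }, { expireAfterSeconds: 86400 })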
I thought that, even though I am late and an answer has already been accepted, I would add a little more.
The problem with capped collections is that they normally reside on one shard in a cluster. Even though, in later versions of MongoDB, capped collections are shardable, they normally are not. In addition, a capped collection MUST be allocated on the spot, so if you wish to keep a long history before clearing the data, you might find your collection uses up significantly more space than it should.
TTL is a good answer; however, it is not as fast as drop(). TTL is basically MongoDB doing the same thing, server-side, that you would do in your application: judging when a row is historical and deleting it. If done excessively it will have a detrimental effect on performance. Not only that, but it isn't good at freeing up space to your $freelists, which is key to preventing fragmentation in MongoDB.
drop()ing a collection will literally just "drop" the collection on the spot, instantly and gracefully giving that space back to MongoDB (not the OS) and leaving you with absolutely no fragmentation whatsoever. Not only that, but the operation is a lot faster, 90% of the time, than most other alternatives.
So I would stick by my comment:
You could factor the data into time series collections based on how long it takes for data to become historical, then just drop() the collection
Edit
As @Zaid pointed out, even with the _id field, capped collections are not shardable.
One solution to this is using TokuMX which supports partitioning:
https://www.percona.com/blog/2014/05/29/introducing-partitioned-collections-for-mongodb-applications/
Advantages over capped collections: capped collections use a fixed amount of space (even when you don't have this much data) and they can't be resized on-the-fly. Partitioned collections usage depends on data; you can add and remove partitions (for newly inserted data) as you see fit.
Advantages over TTL: TTL is slow, it just takes care of removing old data automatically. Partitions are fast - removing data is basically just a file removal.
HOWEVER: after the acquisition by Percona, development of TokuMX appears to have stopped (I would love to be corrected on this point). Unfortunately, MongoDB doesn't support this functionality, and with TokuMX on its way out it looks like we will be stranded without a proper solution.