pymongo performance for MongoDB

I have a small point of confusion regarding pymongo connections. I have a situation in which I have to insert a continuous stream of data into MongoDB, for which I am using pymongo. Once I create a connection to the DB, I need to keep inserting into different collections. Is it fine to switch between different collections very quickly to insert data, possibly for hours at a time? Will it cause any problems? I have tried it and it seems to work properly, but I still want to know whether rapidly inserting documents into different collections can cause issues.

It won't cause any problems as long as your active working set fits in RAM; otherwise you will start to see page faults in mongostat.
The concept remains the same regardless of storage engine (WiredTiger or MMAPv1), though you will get better write performance and a smaller on-disk footprint with WiredTiger.
The best way to figure out long-term performance is to simulate your write load and observe mongostat for that duration.
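
To make this concrete, here is a minimal pymongo sketch of the pattern from the question: one client, many collections, inserts in a loop. The database name, collection names, and the event stream are placeholders invented for the example.

# A minimal sketch (assumed names throughout): one MongoClient reused for
# inserts into several collections.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # one connection pool for the whole process
db = client["telemetry"]

def insert_event(collection_name, document):
    # Switching collections is just a dictionary-style lookup on the database
    # object; it does not open a new connection, so doing this rapidly for
    # hours is fine as far as the driver is concerned.
    db[collection_name].insert_one(document)

for event in ({"type": "click", "value": 1}, {"type": "view", "value": 2}):
    insert_event(event["type"] + "_events", dict(event))

The client maintains a single connection pool no matter how many collections you write to, so the cost of "switching" collections is negligible.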

Related

storing huge amounts of data in mongo

I am working on a front end system for a radius server.
The radius server will pass updates to the system every 180 seconds, which means that with about 15,000 clients I would get around 7,200,000 entries per day... which is a lot.
I am trying to understand the best possible way to store and retrieve this data. Obviously, as time goes on, the volume will become substantial. Will MongoDB handle this? A typical document is not large, something like this:
{
    id: 1,
    radiusId: uniqueId,
    start: 2017-01-01 14:23:23,
    upload: 102323,
    download: 1231556
}
However, there will be MANY of these records. I guess this is similar to the way SNMP NMS servers handle data, which, as far as I know, they use RRD for.
Currently in my testing I just push every document into a single collection. So I am asking,
A) Is Mongo the right tool for the job and
B) Is there a better/more preferred/more optimal way to store the data
EDIT:
OK, so just in case someone comes across this and needs some help:
I ran it for a while in Mongo and was really not satisfied with the performance. We can chalk this up to the hardware I was running on, perhaps my level of knowledge, or the framework I was using. However, I found a solution that works very well for me. InfluxDB handles pretty much all of this right out of the box; it's a time-series database, which is effectively what I am trying to store (https://github.com/influxdata/influxdb). Performance for me has been like night and day. Again, it could all be my fault, just updating this.
EDIT 2:
So after a while I think I figured out why I never got the performance I was after with Mongo. I am using sailsjs as the framework, and it was searching by id using a regex, which obviously has a huge performance cost. I will eventually try to migrate back to Mongo instead of Influx and see if it's better.
15,000 clients updating every 180 seconds = ~83 insertions / sec. That's not a huge load even for a moderately sized DB server, especially given the very small size of the records you're inserting.
I think MongoDB will do fine with that load (also, to be honest, almost any modern SQL DB would probably be able to keep up as well). IMHO, the key points to consider are these:
Hardware: make sure you have enough RAM. This will primarily depend on how many indexes you define and how many queries you're running. If this is primarily a log that will rarely be read, then you won't need much RAM for your working set (although you'll need enough for your indexes). But if you're also running queries, then you'll need considerably more resources.
If you are running extensive queries, consider setting up a replica set. That way, your master server can be reserved for writing data, ensuring reliability, while your slaves can be configured to serve your queries without affecting the write reliability.
Regarding the data structure, I think that's fine, but it'll really depend on what type of queries you wish to run against it. For example, if most queries use the radiusId to reference another table and pull in a bunch of data for each record, then you might want to consider denormalizing some of that data. But again, that really depends on the queries you run.
If you're really concerned about managing the write load reliably, consider using the Mongo front-end only to manage the writes, and then dumping the data to a data warehouse backend to run queries on. You can partially do this by running a replica set like I mentioned above, but the disadvantage of a replica set is that you can't restructure the data. The data in each member of the replica set is exactly the same (hence the name, replica set :-) Oftentimes, the best structure for writing data (normalized, small records) isn't the best structure for reading data (denormalized, large records with all the info and joins you need already done). If you're running a bunch of complex queries referencing a bunch of other tables, using a true data warehouse for the querying part might be better.
As your write load increases, you may consider sharding. I'm assuming the RadiusId points to each specific server among a pool of Radius servers. You could potentially shard on that key, which would split the writes based on which server is sending the data. Thus, as you increase your radius servers, you can increase your mongo servers proportionally to maintain write reliability. However, I don't think you need to do this right away as I bet one reasonably provisioned server should be able to manage the load you've specified.
Anyway, those are my preliminary suggestions.
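
To make the indexing point concrete, here is a hedged pymongo sketch of one way the sample documents could be written and indexed. The database/collection names, the compound index, and the batching are assumptions for illustration, not something prescribed in the question.

# Sketch (assumed names and values): index the fields you query by, and batch
# each 180-second round of updates with insert_many so ~83 inserts/sec stays cheap.
from datetime import datetime
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
usage = client["radius"]["usage"]

# Cover the common lookups: by client (radiusId) and by time range.
usage.create_index([("radiusId", ASCENDING), ("start", ASCENDING)])

batch = [
    {"radiusId": "client-0001", "start": datetime(2017, 1, 1, 14, 23, 23),
     "upload": 102323, "download": 1231556},
    {"radiusId": "client-0002", "start": datetime(2017, 1, 1, 14, 23, 25),
     "upload": 5021, "download": 88230},
]
usage.insert_many(batch, ordered=False)  # one round trip per polling interval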

Best way to create data dump when dealing with high frequency data

I am developing a listener/logger to track events and later do analytics. Since the data could potentially scale to a very large size, I want to be able to split the data into chunks and store it in a particular format after each hour. While doing this, I don't want the DB performance to be affected.
Currently I am using MongoDB and looking at "shard key" with a possibility of using timestamp (hour resolution) to be the key.
Another approach could be to have a database replica and use the replica for creating the data dump.
Please help me with this issue.
You're 100% right about the performance:
When connected to a MongoDB instance, mongodump can adversely affect mongod performance. If your data is larger than system memory, the queries will push the working set out of memory.
And to handle this issue, the recommended solution is:
use mongodump to capture backups from a secondary member of the replica set.
MongoDB Docs about Backup and Restore
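
If you script the hourly chunking yourself rather than using mongodump, the same principle applies: point the reads at a secondary so the export does not compete with writes on the primary. Below is a rough pymongo sketch; the connection string, database/collection names, the ts field, and the output format are all assumptions for illustration.

# Sketch: export the last full hour of data, reading only from a secondary.
import json
from datetime import datetime, timedelta
from pymongo import MongoClient, ReadPreference

client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")
events = client.get_database(
    "logs", read_preference=ReadPreference.SECONDARY_PREFERRED
)["events"]

end = datetime.utcnow().replace(minute=0, second=0, microsecond=0)
start = end - timedelta(hours=1)

with open("dump_{:%Y%m%d_%H}.jsonl".format(start), "w") as out:
    for doc in events.find({"ts": {"$gte": start, "$lt": end}}):
        doc["_id"] = str(doc["_id"])      # make ObjectId JSON-serializable
        doc["ts"] = doc["ts"].isoformat()
        out.write(json.dumps(doc) + "\n")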

Lock rate in mongo 3.0

I've been playing with mongo 3.0 rc7 and rc8 and I've discovered that mongostat doesn't show the lock rate column, whether I use the MMAPv1 or WiredTiger engine. Similarly, in MMS the "lock %" chart is unavailable for 3.0 systems.
We've been using lock rate in our monitoring systems, and also as one of the measurements during performance tests (we've been running the same sets of heavy-load tests via Gatling or Tsung and observing whether recent optimizations in our usage of the DB have any real impact, and also to discover whether new features introduce a regression in this area).
Is there a way to find this value in Mongo 3? Right now we mainly want to run comparison tests on 2.6.7 and 3.0.0-rc8 to see what the difference is, and while we of course get a nice set of data from the application performance standpoint, we'd also like to compare some DB stats, and lock rate was one of them. Or are we completely missing the point, and are collection-level locks in v3 MMAPv1 or document-level locks in WiredTiger now pointless to measure or compare? If so, how can we measure what the DB limit is under heavy load (in < 2.6.7 it was fairly easy: usually lock rate was the first thing to spike, and once it got above 70-80% we knew that was the upper limit), or test regressions/improvements in how we use the DB?
Many thanks
It's not pointless to compare some kind of lock statistics for mmapv1 and WiredTiger, but I think the situation right now is that it's unclear what you should be looking at in WiredTiger for comparison. The operation of the storage engine is very different from mmapv1. Presently, I think you'll want to look at other statistics like throughput, and you can expect more statistics and more guidance on using them in future versions of MongoDB with WiredTiger.
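
One practical way to compare builds in the meantime is to snapshot serverStatus counters before and after a test run and diff them. The sketch below is only a starting point: which keys exist (and what they mean) varies by version and storage engine, so treat the field names as examples to inspect rather than a stable API.

# Sketch: take two serverStatus snapshots and compare throughput/lock counters.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

def snapshot():
    status = client.admin.command("serverStatus")
    return {
        "ops": status.get("opcounters", {}),
        "locks": status.get("locks", {}),  # per-resource acquire counts in 3.0
        "tickets": status.get("wiredTiger", {}).get("concurrentTransactions", {}),
    }

before = snapshot()
time.sleep(10)
after = snapshot()
print("inserts in 10s:",
      after["ops"].get("insert", 0) - before["ops"].get("insert", 0))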

Processing large mongo collection offsite

I have a system writing logs into MongoDB (about a million logs per day). On a weekly basis I need to calculate some statistics on those logs. Since the calculations are very processor- and memory-intensive, I want to copy the collection I'm working with to a powerful offsite machine. How do I keep the offsite collection up to date without copying everything? I modify the offsite collection by storing statistics within its elements, i.e. adding fields like {"algorithm_1": "passed"} or {"stat1": 3.1415}. Is replication right for my use case, or should I investigate other alternatives?
As to your question, yes, replication does partially resolve your issue, with limitations.
So there are several ways I know to resolve your issue:
The half-database, half-application way.
Replication keeps your data up to date. However, it doesn't allow you to modify the secondary nodes (which hold what you call the "offsite collection"). So you have to do the calculation on the secondary and write the results to the primary: you need an application that runs the aggregation against the secondary and writes the output back to its primary (see the sketch after this answer).
This requires that you run an application, PHP, .NET, Python, whatever.
The full-server way
Since you are going to have multiple servers anyway, you can consider using sharding for faster storage and do the calculation directly online. This way you don't even need to run an application: Map/Reduce does the calculation and writes the output into a new collection. I DON'T recommend this solution, though, because of the Map/Reduce performance issues in current versions.
The full-application way
Basically you still use replication for reading, but the server doesn't do any calculation beyond querying the data. You can use a capped collection or a TTL index to remove expired data, and you simply enumerate the data one by one in your application and do the calculation yourself.
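
Here is a rough pymongo sketch of the first, half-database/half-application approach: aggregate against a secondary and write the summary back through the primary. The pipeline, field names, and database/collection names are invented for illustration.

# Sketch: read/aggregate from a secondary, write results back via the primary.
from pymongo import MongoClient, ReadPreference

client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")
logs = client.get_database(
    "logs", read_preference=ReadPreference.SECONDARY_PREFERRED
)["entries"]
stats = client["logs"]["weekly_stats"]  # writes always go to the primary

pipeline = [{"$group": {"_id": "$algorithm_1", "count": {"$sum": 1}}}]
for row in logs.aggregate(pipeline):
    stats.update_one({"_id": row["_id"]},
                     {"$set": {"count": row["count"]}},
                     upsert=True)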

Using memcache in front of a mongodb server

I am trying to understand how Mongo's internal cache works and whether it eliminates the need for memcache. Our database size is around 200 GB; the indexes fit in memory, but beyond the indexes there is not much free memory left on the server.
One of my colleagues says Mongo's internal cache will be as fast as memcache, so there is no need to introduce another level of complexity by using memcache.
The scenario in my head is when we read the data from db, it's saved in memcache and next time it's directly read from the cache instead of going back to db server. If the data is changed and needs to be saved/updated, it's done on both memcache server and database server.
I have been reading about this but couldn't convince myself yet. So I'd really appreciate if someone could shed some light on this.
The first thing to note is that cache storage is different from a database, so MongoDB and SQL are different in purpose and usage when compared to memcache.
Memcache is really good at lowering working-set sizes for queries. For example: imagine a huge aggregated query with subselects and CASE statements and whatnot in SQL (think of the most complex query you can); running this query in realtime all the time could cause the computer(s) to "thrash" (not to mention the problems client side).
However, as everyone knows, you need only summarise this query into another collection/table for it to be instantly faster. The real speed of memcache comes from the fact that it is an in-memory key-value store. This is where MongoDB could fall behind in speed, because it is not stored in memory; it is memory-mapped, but not memory-resident.
MongoDB does no self-caching; provided the query is "hot" and in the LRU (this is where your working set comes in), you shouldn't notice much of a difference in response times. A good way to ensure a query is "hot" is to run it. Some people have a script of their biggest queries that they run to warm up the cache.
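
For reference, a warm-up script in that spirit can be as simple as running the heaviest queries once and exhausting their cursors; the collection names and filters below are placeholders.

# Sketch: touch the hottest queries after a restart so their documents and
# index entries are faulted into RAM before real traffic arrives.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["app"]

warmup_queries = [
    ("orders",   {"status": "open"}),
    ("sessions", {"active": True}),
]

for coll, query in warmup_queries:
    count = sum(1 for _ in db[coll].find(query))  # exhaust the cursor
    print("warmed", coll, ":", count, "docs")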
As I said, memcache is a cache layer. This is why:
If the data is changed and needs to be saved/updated, it's done on both memcache server and database server.
makes me die a little inside. Many people blur the line between the DB and the cache layer.