I am developing a listener/logger to track events and later do analytics. Since the data could potentially grow very large, I want to be able to split it into chunks and store it in a particular format after each hour. While doing this, I don't want the DB performance to be affected.
Currently I am using MongoDB and am looking at a "shard key", possibly using the timestamp (hour resolution) as the key.
Another approach could be to have a database replica and use the replica for creating data dump.
Please help me with this issue.
You're 100% right about the performance concern. From the MongoDB docs:
When connected to a MongoDB instance, mongodump can adversely affect mongod performance. If your data is larger than system memory, the queries will push the working set out of memory.
To handle this issue, the docs recommend:
use mongodump to capture backups from a secondary member of the replica set.
See the MongoDB docs on Back Up and Restore with MongoDB Tools.
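For the hourly-chunk use case in the question, a rough pymongo sketch of exporting one hour of data while reading from a secondary might look like the following. The replica set name, the analytics.events namespace and the ts field are assumptions, not anything from the question:

```python
# Sketch only: export one hour of events while reading from a secondary,
# so the primary's working set is left alone. The replica set name, the
# analytics.events namespace and the "ts" field are assumptions.
from datetime import datetime, timedelta

from bson import json_util
from pymongo import MongoClient

client = MongoClient(
    "mongodb://host1:27017,host2:27017/?replicaSet=rs0&readPreference=secondary"
)
events = client.analytics.events

hour_start = datetime(2016, 1, 1, 10)
hour_end = hour_start + timedelta(hours=1)

# Dump one hour of data to a newline-delimited JSON file.
with open(hour_start.strftime("events-%Y%m%d-%H.json"), "w") as out:
    for doc in events.find({"ts": {"$gte": hour_start, "$lt": hour_end}}):
        out.write(json_util.dumps(doc) + "\n")
```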
I am using a replica set (2 mongod instances, 1 arbiter) for my Sitecore CD servers.
Assuming all MongoDB data gets flushed to the reporting SQL DB, do we need to take a backup of the MongoDB database on the production CD?
If yes, what is the best approach and frequency to do it, considering my application makes moderate use of analytics features (personalization, campaigns, etc.)?
Unfortunately, your assumption is wrong - MongoDB is the definitive source of analytics data, not the reporting DB. The reporting DB contains (mostly) only the aggregate info needed for generating reports. In fact, if (when) something goes wrong with the SQL DB, the idea is that it is rebuilt from the source MongoDB. Remember: you can't un-add two numbers after you've added them!
Backup vs Replication
A backup is a point-in-time view of the database, whereas replication is multiple active copies of a current database. I would advocate replication over backup for this type of data. Why? Glad you asked!
Currency - under what circumstances would you want to restore a 50GB MongoDB? What if it was a week old? What if it was a month? Really, the only useful data is current data, and websites are volatile places - log-data backups are out of date within an hour. And if you personalise on stale data, is that providing a good user experience?
Cost - backing up large datasets is costly in terms of time, storage capacity and compute requirements; big backups are also a pain to restore, and the bigger they are, the more likely there's a corruption somewhere.
Run of business
In a production MongoDB environment you really should have 2-3 replicas. That's going to save your arse if one of the boxes dies, which they sometimes do - MongoDB works the disks very hard.
These replicas are self-healing and always current (pretty much), so they are much better than taking backups. The chances that you lose all your replicas at once are really low, except for one particular edge case... upgrades. So a backup is really only protection against hardware failure or data corruption which, in a multi-instance replica set, is already very effectively handled. Unless you're paranoid, you're never going to use that backup, and it'll cost you plenty to have it.
Sitecore Upgrades
This is the killer edge case - always make backups (see Back Up and Restore with MongoDB Tools) before running an upgrade, because you can corrupt all of your replicas in one motion and you'll want to be able to roll back.
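For example, a pre-upgrade dump could be scripted roughly like this - a sketch only, where the secondary host name and the backup path are placeholders:

```python
# Sketch: run mongodump against a secondary member before an upgrade.
# "secondary-host:27017" and the backup path are placeholders.
import subprocess
from datetime import datetime

out_dir = datetime.now().strftime("/backups/pre-upgrade-%Y%m%d-%H%M")
subprocess.run(
    [
        "mongodump",
        "--host", "secondary-host:27017",  # dump from a secondary, not the primary
        "--out", out_dir,
    ],
    check=True,  # raise if mongodump exits with an error
)
```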
Data Trimming (side-note)
You didn't ask this, but at some point you'll be thinking "how the heck can I back up this 170GB monster db every day? this is ridiculous" - and you'll be right.
There are various schools of thought around how long this data should be persisted for - that's a question only you or your client can answer. I suggest keeping it until there's too much, then make a decision on how much you have to get rid of. Keep as much as you can tolerate.
I have a small confusion regarding pymongo connections. I have a situation in which I have to insert a continuous stream of data into MongoDB, for which I am using pymongo. The problem is that once I create a connection to the DB, I need to continuously insert into different collections. Is it okay to switch between different collections very quickly to insert data (and this goes on for a very long time, maybe for hours)? I mean, will it not create any problems? I have tried it and it seems to work properly, but I still want to know: even if I insert documents into different collections very quickly, will it cause any problem?
It won't cause any problems as long as your active (working) set fits in RAM; otherwise you will start to see page faults in mongostat.
The concept remains the same irrespective of your storage engine (WiredTiger or MMAPv1), though you will achieve better write and storage-size performance with the WiredTiger engine.
The best way to figure out long-term performance is to simulate your write load and observe mongostat for that duration.
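For illustration, here is a minimal sketch of the pattern in question; the collection names and the event stream are made up. The point is that pymongo collection objects are just lightweight handles, so switching between them costs nothing on the client side:

```python
# Minimal sketch: one client, many collections. The collection names and the
# fake event stream are made up; the routing-by-type pattern is the point.
import time

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # one connection pool, reused
db = client.eventstore

def fake_events():
    # Stand-in for the real continuous stream.
    for i in range(1000):
        yield {"type": "clicks" if i % 2 else "views", "seq": i, "ts": time.time()}

for event in fake_events():
    # db[name] just returns a handle; no extra round trip or new connection.
    db[event["type"]].insert_one(event)
```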
I have a system writing logs into MongoDB (about 1 million logs per day). On a weekly basis I need to calculate some statistics on those logs. Since the calculations are very processor- and memory-consuming, I want to copy the collection I'm working on to a powerful offsite machine. How do I keep the offsite collection up to date without copying everything? I modify the offsite collection by storing statistics within its elements, i.e. adding fields like {"algorithm_1": "passed"} or {"stat1": 3.1415}. Is replication right for my use case, or should I investigate other alternatives?
As to your question, yes, replication does partially resolve your issue, with limitations.
So there are several ways I know to resolve your issue:
The half-database, half-application way.
Replication keeps your data up to date, but it doesn't allow you to modify the secondary nodes (which hold what you call the "offsite collection"). So you have to do the calculation on the secondary and write the results to the primary: you need an application that runs the aggregation on the secondary and writes the results back to its primary.
This requires that you run an application, PHP, .NET, Python, whatever.
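A rough pymongo sketch of that read-on-secondary, write-to-primary pattern; the database and collection names and the aggregation pipeline are invented for illustration:

```python
# Sketch of the half-database, half-application approach: read the logs on a
# secondary, write the computed statistics back through the primary.
# The "logging" db, "logs"/"weekly_stats" collections and the pipeline are invented.
from pymongo import MongoClient, ReadPreference

client = MongoClient("mongodb://host1:27017,host2:27017/?replicaSet=rs0")

# Heavy reads go to a secondary so the primary keeps ingesting logs undisturbed.
logs = client.get_database(
    "logging", read_preference=ReadPreference.SECONDARY
).logs

pipeline = [{"$group": {"_id": "$level", "count": {"$sum": 1}}}]
results = list(logs.aggregate(pipeline))

# Writes are always routed to the primary by the driver.
client.logging.weekly_stats.insert_one({"week": "2016-W01", "by_level": results})
```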
The full-server way
Since you are going to have multiple servers anyway, you can consider using sharding for faster storage and do the calculation directly online. This way you don't even need to run a separate application: Map/Reduce does the calculation and writes the output into a new collection. I DON'T recommend this solution though, because of the Map/Reduce performance issues in current versions.
The full-application way
Basically you still use replication for reading, but the server doesn't do any calculations except querying data. You can use a capped collection or a TTL index for removing expired data (see the sketch below), and you enumerate the data one by one in your application and do the calculation yourself.
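As a small illustration of the expiry part, a TTL index lets mongod remove old documents for you; the field name and the 30-day retention below are assumptions:

```python
# Sketch: let mongod expire old log documents automatically via a TTL index.
# The "created_at" field and the 30-day retention are assumptions.
from pymongo import MongoClient

logs = MongoClient("mongodb://localhost:27017").logging.logs
logs.create_index("created_at", expireAfterSeconds=30 * 24 * 3600)
```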
I am trying to store records with a set of doubles and ints (around 15-20) in mongoDB. The records mostly (99.99%) have the same structure.
When I store the data in ROOT, which is a very structured data storage format, the file is around 2.5GB for 22.5 million records. In Mongo, however, the database size (from the show dbs command) is around 21GB, whereas the data size (from db.collection.stats()) is around 13GB.
This is a huge overhead (to clarify: 13GB vs 2.5GB; I'm not even talking about the 21GB), and I guess it is because Mongo stores both keys and values. So the question is: why doesn't Mongo do a better job of making it smaller?
But the main question is: what is the performance impact of this? I have 4 indexes and they come out to be 3GB, so running the server on a single 8GB machine could become a problem if I double the amount of data and try to keep a large working set in memory.
Any thoughts on whether I should be using SQL or some other DB? Or maybe just keep working with ROOT files, if anyone has tried them?
Basically, this is Mongo preparing for the insertion of data. Mongo preallocates storage for data to prevent (or minimize) fragmentation on disk. This preallocation shows up as the data files that the mongod instance creates.
First it creates a 64MB file, then 128MB, then 256MB, doubling each time until it reaches files of 2GB (the maximum size of preallocated data files).
There are some more things that Mongo does that may account for additional disk space, such as journaling...
For much, much more info on how MongoDB uses storage space, take a look at this page, in particular the section titled "Why are the files in my data directory larger than the data in my database?"
There are some things you can do to minimize the space used, but these techniques (such as using the --smallfiles option) are usually only recommended for development and testing - never for production.
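If you want to see exactly where the space goes (documents vs. indexes vs. allocated files), the server stats commands break it down. A quick sketch via pymongo, with placeholder database and collection names:

```python
# Sketch: inspect where the disk space goes. "mydb" and "records" are placeholders.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017").mydb

coll_stats = db.command("collstats", "records")
db_stats = db.command("dbstats")

print("data size:    ", coll_stats["size"])            # uncompressed BSON size
print("storage size: ", coll_stats["storageSize"])     # space allocated for documents
print("index size:   ", coll_stats["totalIndexSize"])  # all indexes combined
print("files on disk:", db_stats.get("fileSize"))      # includes preallocation (MMAPv1)
```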
Question: Should you use SQL or MongoDB?
Answer: It depends.
A better way to ask the question: should you use a relational database or a document database?
Answer:
If your data is highly structured (every row has the same fields), or you rely heavily on foreign keys and you need strong transactional integrity on operations that use those related records... use a relational database.
If your records are heterogeneous (different fields per document) or have variable length fields (arrays) or have embedded documents (hierarchical)... use a document database.
My current software project uses both. Use the right tool for the job!
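As a tiny illustration of the "heterogeneous records" point above, documents in the same collection can carry different fields and embedded structure; all names below are invented:

```python
# Sketch: documents in one collection need not share a schema.
# All names here are invented for illustration.
from pymongo import MongoClient

people = MongoClient("mongodb://localhost:27017").demo.people

people.insert_many([
    {"name": "Ada", "emails": ["ada@example.com"], "address": {"city": "London"}},
    {"name": "Grace", "rank": "Rear Admiral"},  # different fields, same collection
])
```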
Which nosql system is better equipped for handling high volume inserts out of the box?
Preferably, running on 1 physical machine (many instances allowed).
Has anyone done any benchmarks? (googling did not help)
Note: I understand that choosing a NoSQL database depends on what kind of data needs to be stored (document: MongoDB, graph: Neo4j, etc.).
If you want fast write speed, you can just insert your data into memory and flush it to disk in the background every minute or so. That should be the fastest solution.
MongoDB and Redis actually do this. For example, in MongoDB you can run without the journal enabled and writes will be very fast. But keep in mind that if you store data in memory on a single server, there is a possibility of losing the data that has not yet been flushed to disk when your server goes down.
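From the driver side, that trade-off shows up as the write concern you choose. A pymongo sketch with placeholder names: w=0 maximizes throughput at the cost of durability, while j=True waits for the journal:

```python
# Sketch: trading durability for write speed via write concerns.
# The connection string and collection name are placeholders.
from pymongo import MongoClient, WriteConcern

db = MongoClient("mongodb://localhost:27017").metrics

# Fire-and-forget: fastest, but unflushed data can be lost on a crash.
fast = db.get_collection("events", write_concern=WriteConcern(w=0))
fast.insert_one({"k": 1})

# Journaled: slower, but the write survives a mongod crash once acknowledged.
safe = db.get_collection("events", write_concern=WriteConcern(w=1, j=True))
safe.insert_one({"k": 2})
```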
In general, which database to use highly depends on the data you want to store and the task you are trying to solve.
Apache Cassandra is great in write operations, thanks to its unique persistence model. Some claim that it writes about 20 times faster than it reads but I believe it's really dependent on your usage profile.
Read about it in their FAQ and in various blog posts.
That is, of course, if you have a "classical" DB profile of large amounts of data. If your data is small, or is used temporarily and/or as a cache layer, then of course opt for Redis, which has the fastest throughput for both reads and writes, since it is memory-based (with eventual disk persistence).
If you're dealing with a complex object model for inserts your best option is an object database like Versant's:
http://www.versant.com/vision/The_Magic_Cube.aspx
According to my benchmarks, Cassandra is better than MongoDB on large arrays, but MongoDB is more flexible.