Processing large mongo collection offsite - mongodb

I have a system writing logs into mongodb (about a million logs per day). On a weekly basis I need to calculate some statistics on those logs. Since the calculations are very processor and memory consuming, I want to copy the collection I'm working with to a powerful offsite machine. How do I keep the offsite collection up to date without copying everything? I modify the offsite collection by storing statistics within its elements, i.e. adding fields like {"algorithm_1": "passed"} or {"stat1": 3.1415}. Is replication right for my use case, or should I investigate other alternatives?

As to your question, yes, replication does partially resolve your issue, with limitations.
There are several ways I know of to resolve this:
The half-database, half-application way.
Replication keeps your data up to date. It doesn't allow you to modify the secondary nodes (which you call the "offsite collection"), however. So you have to do the calculation on the secondary and write the results to the primary. You need an application that runs the aggregation on the secondary and writes the result back to the primary.
This requires that you run an application, PHP, .NET, Python, whatever.
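As a rough illustration, a minimal Python/pymongo sketch of this pattern could look like the following. The host names, collection names and the run_algorithm_1 helper are made-up placeholders, not part of the original setup:

from pymongo import MongoClient, ReadPreference, UpdateOne

client = MongoClient("mongodb://db1.example.com,db2.example.com/?replicaSet=rs0")

# Heavy reads are directed at the secondary so the primary keeps ingesting logs undisturbed.
logs = client.logs.get_collection("entries", read_preference=ReadPreference.SECONDARY)

def run_algorithm_1(doc):
    # Placeholder for the real CPU/memory-heavy analysis; returns a bool.
    return True

updates = []
for doc in logs.find({"algorithm_1": {"$exists": False}}):
    passed = run_algorithm_1(doc)
    updates.append(UpdateOne({"_id": doc["_id"]},
                             {"$set": {"algorithm_1": "passed" if passed else "failed"}}))

# Writes always go to the primary and are then replicated back to the secondary.
if updates:
    client.logs.entries.bulk_write(updates)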
The full-server way
Since you are going to have multiple servers anyway, you can consider using sharding for faster storage and doing the calculation directly online. This way you don't even need to run an application: Map/Reduce does the calculation and writes the output into a new collection. I DON'T recommend this solution though, because of the Map/Reduce performance issues in current versions.
The full-application way
Basically you still use replication for reading, but the server doesn't do any calculation beyond querying data. You can use a capped collection or a TTL index for removing expired data, and you just enumerate the data one by one in your application and do the calculation yourself.
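For example, a TTL index that removes log documents roughly 30 days after their timestamp could be created like this (a pymongo sketch; the collection name and the created_at field are assumptions):

from pymongo import MongoClient

client = MongoClient()
# Documents are deleted automatically ~30 days after the value in their "created_at" field.
client.logs.entries.create_index("created_at", expireAfterSeconds=30 * 24 * 3600)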

Related

storing huge amounts of data in mongo

I am working on a front end system for a radius server.
The radius server will pass updates to the system every 180 seconds, which means that with about 15,000 clients I would get around 7,200,000 entries per day... which is a lot.
I am trying to understand what the best possible way to store and retrieve this data will be. Obviously, as time goes on, this will become substantial. Will MongoDB handle this? A typical document isn't much, something like this:
{
    id: 1,
    radiusId: uniqueId,
    start: "2017-01-01 14:23:23",
    upload: 102323,
    download: 1231556
}
However, there will be MANY of these records. I guess this is similar to the way SNMP NMS servers handle data, which as far as I know is done with RRD.
Currently in my testing I just push every document into a single collection. So I am asking,
A) Is Mongo the right tool for the job and
B) Is there a better/more preferred/more optimal way to store the data
EDIT:
OK, so just in case someone comes across this and needs some help.
I ran it for a while in Mongo and I was really not satisfied with the performance. We can chalk this up to the hardware I was running on, perhaps my level of knowledge, or the framework I was using. However, I found a solution that works very well for me. InfluxDB pretty much handles all of this right out of the box; it's a time series database, which is effectively the kind of data I am trying to store (https://github.com/influxdata/influxdb). Performance for me has been like night and day. Again, it could all be my fault, just updating this.
EDIT 2:
So after a while I think I figured out why I never got the performance I was after with Mongo. I am using sailsjs as a framework, and it was searching by id using a regex, which obviously has a huge performance hit. I will eventually try to migrate back to Mongo instead of Influx and see if it's better.
15,000 clients updating every 180 seconds = ~83 insertions / sec. That's not a huge load even for a moderately sized DB server, especially given the very small size of the records you're inserting.
I think MongoDB will do fine with that load (also, to be honest, almost any modern SQL DB would probably be able to keep up as well). IMHO, the key points to consider are these:
Hardware: make sure you have enough RAM. This will primarily depend on how many indexes you define and how many queries you're doing. If this is primarily a log that will rarely be read, then you won't need much RAM for your working set (although you'll need enough for your indexes). But if you're also running queries, then you'll need much more resources.
If you are running extensive queries, consider setting up a replica set. That way, your master server can be reserved for writing data, ensuring reliability, while your slaves can be configured to serve your queries without affecting the write reliability.
Regarding the data structure, I think that's fine, but it'll really depend on what type of queries you wish to run against it. For example, if most queries use the radiusId to reference another table and pull in a bunch of data for each record, then you might want to consider denormalizing some of that data. But again, that really depends on the queries you run.
If you're really concerned about managing the write load reliably, consider using the Mongo front-end only to manage the writes, and then dumping the data to a data warehouse backend to run queries on. You can partially do this by running a replica set like I mentioned above, but the disadvantage of a replica set is that you can't restructure the data. The data in each member of the replica set is exactly the same (hence the name, replica set :-) Oftentimes, the best structure for writing data (normalized, small records) isn't the best structure for reading data (denormalized, large records with all the info and joins you need already done). If you're running a bunch of complex queries referencing a bunch of other tables, using a true data warehouse for the querying part might be better.
As your write load increases, you may consider sharding. I'm assuming the RadiusId points to each specific server among a pool of Radius servers. You could potentially shard on that key, which would split the writes based on which server is sending the data. Thus, as you increase your radius servers, you can increase your mongo servers proportionally to maintain write reliability. However, I don't think you need to do this right away as I bet one reasonably provisioned server should be able to manage the load you've specified.
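If sharding does become necessary, the setup for a radiusId shard key could be sketched roughly like this (pymongo against the mongos router; the radius.entries database/collection names are assumptions for illustration):

from pymongo import MongoClient

mongos = MongoClient("mongodb://mongos.example.com:27017")
mongos.admin.command("enableSharding", "radius")
mongos.radius.entries.create_index([("radiusId", 1)])  # the shard key must be indexed
mongos.admin.command("shardCollection", "radius.entries", key={"radiusId": 1})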
Anyway, those are my preliminary suggestions.

Multiple mongodb servers seen as one and data flow management

For my application I need to periodically move old data from one mongodb server to another (i.e., two distinct servers). I also want to be able to query that data as if it were a single database.
In short, I want to be able to see two mongodb instances (on two different servers) as one, and be able to control when and where the data is stored.
I read about the concept of sharding and chunks and quickly found the moveChunk function, which can easily do what I want.
The problem is that it seems to be impossible to configure such architecture in mongodb. Am I missing something here?
Archiving Deleted Documents
For the problem of keeping the deleted documents, there is no way to achieve this with built-in features/mechanisms like sharding or replication. The only way to do it is to handle that case manually, e.g. keeping a separate collection for deleted documents and simply moving documents to that collection instead of deleting them.
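A minimal sketch of that manual approach in Python/pymongo (database and collection names are placeholders):

from pymongo import MongoClient

db = MongoClient().mydb

def archive_documents(query):
    # Move matching documents into an archive collection instead of deleting them.
    for doc in db.documents.find(query):
        # Note: the two steps are not atomic; a crash in between leaves the document in both places.
        db.deleted_documents.insert_one(doc)
        db.documents.delete_one({"_id": doc["_id"]})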
For your global problem of moving data you have the following two options:
Sharding
Using sharding you will split your data into pieces which will be stored on two (in your case) different servers. In this scenario you can use the moveChunk method, as you mentioned. But this method is very tricky, because you will need to disable the built-in automatic balancer to have full manual control over your chunks. Anyway, this is not recommended by MongoDB:
Only use the moveChunk in special circumstances such as preparing your sharded cluster for an initial ingestion of data, or a large bulk import operation. In most cases allow the balancer to create and balance chunks in sharded clusters
Besides, this will only allow you to split the data, and in the end, to reach your goal, you will end up with one full and one empty shard.
Replication
The replication approach is much safer and easier to achieve. You can simply configure a replica set and add your second server to that set.
If the data is too big, you can configure your second server as hidden, so that no reads are performed against that server and no inconsistent data is received. After the data replication is finished, you will have a copy of your data on both servers.
As for using both servers as a single server: if you need to balance the requests between these two, you can configure your readPreference to secondary, which will ensure that all reads are sent to the secondary server, while writes are done on the primary by default.
In this case your code will be unaware of which server you are querying. You will just run your client methods, and the rest will be done by the driver behind the scenes.
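With most drivers this is just a connection-string option; for example with pymongo (host names and replica-set name are placeholders):

from pymongo import MongoClient

# Reads are routed to the secondary, writes still go to the primary.
client = MongoClient(
    "mongodb://server1.example.com,server2.example.com/"
    "?replicaSet=rs0&readPreference=secondary"
)
old_docs = client.mydb.mycollection.find({"archived": True})  # served by the secondary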
Conclusion
So my advice would be to use the replication approach, as it is the cleaner, less painful and safer solution.

Is MongoDB usable as shared memory for a parallell processing / multiple-instances application?

I'm planning a product that will process updates from multiple data feeds. Input data is guesstimated to be a total stream of 100 Mbps containing 100-byte messages. These messages contain several data fields that need to be checked for correlation with the existing data set within the application. If an input message correlates with an existing data record, the input message will update that record; if not, it will create a new record. It is assumed that data is updated every 3 seconds on average.
The correlation process is assumed to be a bottleneck, and thus I intend to make our product able to run balanced across multiple processes if needed (most likely on separate hardware or VMs), somewhat in the vicinity of space-based architecture. I'd then like shared storage between my processes so that all existing data records are visible to all the running processes. The shared storage will have to fetch possible candidates for correlation through a query/search based on some attributes (e.g. elevation). It will have to offer configurable warm redundancy, and the possibility to store snapshots every 5 minutes for logging.
Everything seems to be pointing towards MongoDB, but I'd like a confirmation from you that MongoDB will meet my needs. So do you think it is a go?
-Thank you
NB: I am not considering a relational database because we want to keep all the coding in our application, instead of having to write 'stored procedures'/'functions' in a separate environment to optimize the performance of our system. Furthermore, the data is diverse and I don't want to try to normalize it into a schema.
Yes, MongoDB will meet your needs. I think the following aspects of your description are particularly relevant in your DB selection decision:
1. An update happens every 3 seconds
MongoDB has a database-level write lock (usually short-lived) that blocks read operations. This means that you will want to ensure you have enough memory to fit your working set, and you will generally not run into any write-lock issues. Note that bulk inserts will hold the write lock for longer.
If you are sharding, you will want to consider shard keys that allow for write scaling, i.e. keys that distribute writes across different shards.
2. Shared storage for multiple processes
This is a pretty common scenario; in fact, many MongoDB deployments are expected to be accessed from multiple processes concurrently. Unlike the write lock, the read lock does not block other reads.
3. Warm redundancy
Supported through MongoDB replication. If you'd like to read from the secondary server(s), you will need to set the read preference to secondaryPreferred in your driver.

Frequent large, multi-record updates in MongoDB, Lucene, etc

I am working on the high-level design of a web application with the following characteristics:
Millions of records
Heavily indexed/searchable by various criteria
Variable document schema
Regular updates in blocks of 10K - 200K records at a time
Data needs to remain highly available during updates
Must scale horizontally effectively
Today, this application exists in MySQL and we suffer from a few huge problems, particularly that it is challenging to adapt to a flexible schema, and that large bulk updates lock the data for 10 - 15 seconds at a time, which is unacceptable. Some of these things could be tackled by better database design within the context of MySQL; however, I am looking for a better "next generation" solution.
I have never used MongoDB, but its feature set seemed to most closely match what I am looking for, so that was my first area of interest. It has some things I am excited about, such as data sharding, the ability to find-update-return in a single statement, and of course the schema flexibility of NoSQL.
There are two things I am not sure about, though, with MongoDB:
I can't seem to find solid information about the concurrency of updates with large data sets (see my use case above), so I have a hard time understanding how it might perform.
I do need open text search.
That second requirement brought me to Lucene (or possibly to Solr if I kept it external) as a search store. I did read a few cases where Lucene was being used in place of a NoSQL database like MongoDB entirely, which made me wonder if I am over-complicating things by trying to use both in a single app -- perhaps I should just store everything directly in Lucene and run it like that?
Given the requirements above, does it seem like a combination of MongoDB and Lucene would make this work effectively? If not, might it be better to attempt to tackle it entirely in Lucene?
Currently with MongoDB, updates are locking at the server-level. There are a few JIRAs open that address this, planned for v1.9-2.0. I believe the current plan is to yield writes to allow reads to perform better.
With that said, there are plenty of great ways to scale MongoDB for super high concurrency - many of which are similar to those for MySQL. One such example is to use RAID 10. Another is to use master-slave replication, where you write to the master and read from the slaves.
You also need to consider whether your "written" data needs to be 1) durable and 2) accessible via slaves immediately. The mongodb drivers allow you to specify whether you want the data to be written to disk immediately (or to sit in memory until the next fsync), and how many slaves the data should be written to. Both of these will slow down MongoDB writes, which, as noted above, can affect read performance.
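In current drivers these two knobs are expressed as a write concern; a pymongo sketch (database/collection names are assumptions):

from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://db.example.com/?replicaSet=rs0")

# j=True -> acknowledge only after the write reaches the on-disk journal
# w=2    -> acknowledge only after the write has reached two replica-set members
durable = client.mydb.get_collection(
    "records", write_concern=WriteConcern(w=2, j=True))

durable.insert_one({"batch": 42, "status": "imported"})  # slower, but durable and replicated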
MongoDB also does not have nearly the full-text search capability that Solr/Lucene have, and you will likely want to use both together. I am currently using both Solr and MongoDB together and am happy with it.

MongoDB vs. Redis vs. Cassandra for a fast-write, temporary row storage solution

I'm building a system that tracks and verifies ad impressions and clicks. This means that there are a lot of insert commands (about 90/second average, peaking at 250) and some read operations, but the focus is on performance and making it blazing-fast.
The system is currently on MongoDB, but I've been introduced to Cassandra and Redis since then. Would it be a good idea to go to one of these two solutions, rather than stay on MongoDB? Why or why not?
Thank you
For a harvesting solution like this, I would recommend a multi-stage approach. Redis is good at real-time communication. Redis is designed as an in-memory key/value store and inherits some very nice benefits of being an in-memory database: O(1) list operations. As long as there is RAM to use on the server, Redis will not slow down pushing to the end of your lists, which is good when you need to insert items at such an extreme rate. Unfortunately, Redis can't operate with data sets larger than the amount of RAM you have (it only writes to disk; reading from disk happens when restarting the server or after a system crash), and scaling has to be done by you and your application. (A common way is to spread keys across numerous servers, which is implemented by some Redis drivers, especially those for Ruby on Rails.) Redis also has support for simple publish/subscribe messaging, which can be useful at times as well.
In this scenario, Redis is "stage one." For each specific type of event you create a list in Redis with a unique name; for example, we have "page viewed" and "link clicked." For simplicity we want to make sure the data in each list has the same structure; "link clicked" may have a user token, link name and URL, while "page viewed" may only have the user token and URL. Your first concern is just recording the fact that it happened; whatever absolutely necessary data you need is pushed along with it.
Next we have some simple processing workers that take this frantically inserted information off of Redis' hands by asking it to take an item off the end of the list and hand it over. The worker can make any adjustments/deduplication/ID lookups needed to properly file the data and hand it off to a more permanent storage site. Fire up as many of these workers as you need to keep Redis' memory load bearable. You could write the workers in anything you wish (Node.js, C#, Java, ...) as long as it has a Redis driver (most web languages do now) and one for your desired storage (SQL, Mongo, etc.).
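A bare-bones worker along these lines, sketched in Python with the redis and pymongo drivers (list, database and collection names are made up for illustration):

import json
import redis
from pymongo import MongoClient

r = redis.Redis(host="localhost", port=6379)
events = MongoClient().tracking.link_clicked  # permanent storage for "link clicked" events

# Producers would push with: r.rpush("link_clicked", json.dumps(event))
while True:
    # BLPOP blocks until an item is available and pops from the head, so events
    # pushed with RPUSH are processed in arrival (FIFO) order.
    _, raw = r.blpop("link_clicked")
    event = json.loads(raw)
    # Any deduplication / ID lookups would happen here before filing the event.
    events.insert_one(event)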
MongoDB is good at document storage. Unlike Redis, it is able to deal with databases larger than RAM, and it supports sharding/replication on its own. An advantage of MongoDB over SQL-based options is that you don't have to have a predetermined schema; you're free to change the way data is stored however you want at any time.
I would, however, suggest Redis or Mongo for the "stage one" phase of holding data for processing, and a traditional SQL setup (Postgres or MSSQL, perhaps) to store the post-processed data. Tracking client behavior sounds like relational data to me, since you may want to go "Show me everyone who views this page" or "How many pages did this person view on this given day" or "What day had the most viewers in total?". There may be even more complex joins or queries for analytic purposes you come up with, and mature SQL solutions can do a lot of this filtering for you; NoSQL (Mongo or Redis specifically) can't do joins or complex queries across varied sets of data.
I currently work for a very large ad network and we write to flat files :)
I'm personally a Mongo fan, but frankly, Redis and Cassandra are unlikely to perform either better or worse. I mean, all you're doing is throwing stuff into memory and then flushing to disk in the background (both Mongo and Redis do this).
If you're looking for blazing fast speed, the other option is to keep several impressions in local memory and then flush them to disk every minute or so. Of course, this is basically what Mongo and Redis do for you. Not a real compelling reason to move.
All three solutions (four if you count flat-files) will give you blazing fast writes. The non-relational (nosql) solutions will give you tunable fault-tolerance as well for the purposes of disaster recovery.
In terms of scale, our test environment, with only three MongoDB nodes, can handle 2-3k mixed transactions per second. At 8 nodes, we can handle 12k-15k mixed transactions per second. Cassandra can scale even higher. A peak of 250 inserts per second is (or should be) no problem.
The more important question is: what do you want to do with this data? Operational reporting? Time-series analysis? Ad-hoc pattern analysis? Real-time reporting?
MongoDB is a good option if you want the ability to do ad-hoc analysis based on multiple attributes within a collection. You can put up to 40 indexes on a collection, though the indexes will be stored in-memory, so watch for size. But the result is a flexible analytical solution.
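For instance, an impressions collection could carry a handful of targeted indexes for the questions you expect to ask (pymongo sketch; all names are illustrative):

from pymongo import MongoClient

impressions = MongoClient().tracking.impressions
impressions.create_index([("campaignId", 1), ("ts", -1)])  # per-campaign time-series queries
impressions.create_index([("country", 1), ("ts", -1)])     # geographic breakdowns
impressions.create_index("userToken")                      # per-user lookups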
Cassandra is a key-value store. You define a static column or set of columns that will act as your primary index right up front. All queries run against Cassandra should be tuned to this index. You can put a secondary index on it, but that's about as far as it goes. You can, of course, use MapReduce to scan the store for non-key attributes, but it will be just that: a serial scan through the store. Cassandra also doesn't have the notion of "like" or regex operations on the server nodes. If you want to find all customers whose first name starts with "Alex", you'll have to scan through the entire collection, pull the first name out for each entry and run it through a client-side regex.
I'm not familiar enough with Redis to speak intelligently about it. Sorry.
If you are evaluating non-relational platforms, you might also want to consider CouchDB and Riak.
Hope this helps.
Just found this: http://blog.axant.it/archives/236
Quoting the most interesting part:
This second graph is about Redis RPUSH vs Mongo $PUSH vs Mongo insert, and I find this graph to be really interesting. Up to 5000 entries mongodb $push is faster even when compared to Redis RPUSH, then it becames incredibly slow, probably the mongodb array type has linear insertion time and so it becomes slower and slower. mongodb might gain a bit of performances by exposing a constant time insertion list type, but even with the linear time array type (which can guarantee constant time look-up) it has its applications for small sets of data.
I guess everything depends at least on data type and volume. The best advice would probably be to benchmark on your typical dataset and see for yourself.
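A tiny benchmark sketch along those lines in Python (redis and pymongo drivers; replace the sample document with one of your typical records):

import json
import time
import redis
from pymongo import MongoClient

N = 10_000
sample = {"user": "u123", "url": "/landing", "ts": 1}

r = redis.Redis()
start = time.perf_counter()
for _ in range(N):
    r.rpush("bench", json.dumps(sample))
print("Redis RPUSH:", time.perf_counter() - start)

coll = MongoClient().bench.events
start = time.perf_counter()
for _ in range(N):
    coll.insert_one(dict(sample))  # copy so each insert gets its own _id
print("Mongo insert_one:", time.perf_counter() - start)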
According to the Benchmarking Top NoSQL Databases report (download here), I recommend Cassandra.
If you have the choice (and need to move away from flat files) I would go with Redis. It's blazingly fast, it will comfortably handle the load you're talking about, and, more importantly, you won't have to manage the flushing/IO code. I understand it's pretty straightforward, but less code to manage is better than more.
You will also get horizontal scaling options with Redis that you may not get with file based caching.
I can get around 30k inserts/sec with MongoDB on a simple $350 Dell. If you only need around 2k inserts/sec, I would stick with MongoDB and shard it for scalability. Maybe also look into doing something with Node.js or something similar to make things more asynchronous.
The problem with inserts into databases is that they usually require writing to a random block on disk for each insert. What you want is something that only writes to disk every 10 inserts or so, ideally to sequential blocks.
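One way to approximate that batching from the application side, sketched in Python with pymongo (the buffer size and the collection names are arbitrary placeholders):

from pymongo import MongoClient

impressions = MongoClient().ads.impressions
buffer = []

def record(event, flush_every=100):
    # Buffer impressions in memory and write them in one batch instead of one-by-one.
    buffer.append(event)
    if len(buffer) >= flush_every:
        impressions.insert_many(buffer, ordered=False)  # one round trip, fewer random writes
        buffer.clear()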
Flat files are good. Summary statistics (e.g. total hits per page) can be obtained from flat files in a scalable manner using merge-sorty, map-reducy type algorithms. It's not too hard to roll your own.
SQLite now supports Write Ahead Logging, which may also provide adequate performance.
I have hands-on experience with mongodb, couchdb and cassandra. I converted a lot of files to base64 strings and inserted those strings into nosql.
mongodb was the fastest, cassandra was the slowest, and couchdb was slow too.
I think mysql would be much faster than all of them, but I haven't tried mysql for my test case yet.