To give you an idea of the data:
The DB has collections/tables with over a hundred million documents/records each, and each document contains more than 100 attributes/columns. The data size is expected to grow a hundredfold soon.
Operations on the data:
There are mainly the following types of operations on the data:
Validating the data and then importing it into the DB, which happens multiple times daily
Aggregations on this imported data
Searches/finds
Updates
Deletes
Tools/software used:
MongoDB for the database: a replica set with PSS architecture, plus indexes (most of the queries are index scans)
NodeJS using Koa.js
Problems:
However, the tool is very slow when it comes to aggregations, finds, etc.
What have I implemented for performance so far?:
DB Indexing
Caching
Pre-aggregations (using MongoDB aggregate to aggregate the data beforehand and store it in different collections during importing, to avoid aggregations at runtime)
Increased RAM and CPU cores on the DB server
Separate server for NodeJS server and Front-end build
PM2 to manage NodeJS server application and for spawning clusters
However from my experience, even after implementing all the above, the application is not performant enough. I feel that the reason for this is that the data is pretty huge. I am not aware of how Big Data applications are managed to deliver high performance. Please advise.
Also, is the selection of technology not suitable or will changing the technology/tools help? If yes, what is advised under such scenarios?
I'm requesting your advice to help me improve the performance of the application.
It is not easy to give a correct answer because we do not really have that many details. What I would do is set up detailed monitoring of at least the following:
Machine Level:
monitor the overall CPU load (for all cores) and RAM usage on your DB machine
monitor disk IO on the disks where the data is stored
this should show whether the machine specs are a bottleneck
Database & DB Process Level (my first guess is that this is the critical part):
what is the overall size of your data at the moment? (I know it will increase drastically, but if it is already too slow now, this could be interesting information, especially in relation to the current RAM size and number of CPU cores)
monitor memory usage and CPU load for your mongo DB process...
did a look at the query plans (while doing aggregations) show you what improvements can be made?
have a look at the caching strategy. What strategy are you using?
this should give more detailed results on where to make improvements on a DB level. Is it just because of hardware bottlenecks or is it an aggregation problem...
Node.JS APP Level:
node.js app: how much RAM and CPU usage does this one take ...?
if there are multiple instances of the node.js app, track this for all instances
does the data import also happen through the Node.js app? Does the load on the app increase drastically while importing data?
if you see a high load on this app, then there is a need to act here (increasing instances, or splitting it into separate apps, e.g. import as a separate app)
I will soon need to perform some operations on our (azure-deployed) production database:
2 large MongoRestore operations
2 index rebuilds
2 collection counts
I need to ensure that these operations will not crash our servers by using up too many resources. I currently don't know how "resource-hungry" my operations are/will be, and how much load they will generate. I want to know how likely they are to crash the database.
My plan is to first execute these operations in our development environment and monitor the resulting load to get an idea of the "strain" that these operations will create.
I basically have two questions:
What tool(s) is/are best for monitoring the job? Specifically for monitoring the state/health of the database as these operations are running on it, to get an idea of how "safe" the operations are. I currently have a paid "Core" version of Studio 3T. I also have the free version of Mongo Compass.
What metrics should I be watching for? I'm new to this, so I assume I have to watch RAM/CPU usage to make sure it doesn't go above a certain threshold. How do I know "how high is too high"? What else should I look for?
Monitor RAM, CPU, IOPS and storage usage at the infrastructure level.
mongotop and mongostat give a general idea of usage.
Monitor the mongod/mongos logs (the most valuable information is there).
--noIndexRestore (add this so that your members are not blocked by index re-creation immediately after the load; you can create the indexes later in the background)
--numInsertionWorkersPerCollection=num1: a bigger num1 gives a faster restore (and more resource usage)
--numParallelCollections=num2 (default=4): a bigger num2 gives a faster restore (and more resource usage)
--writeConcern="{w:'majority'}" (keep it for safety; reducing it will significantly improve load speed)
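To see how these options fit together on one command line, here is a rough sketch that only assembles and prints the command. It assumes the tool is mongorestore (as in the question); the dump path and the worker counts are placeholder values, not recommendations, and `build_mongorestore_cmd` is a made-up helper name.

```python
import shlex


def build_mongorestore_cmd(dump_dir, insertion_workers=2, parallel_collections=4):
    """Assemble a mongorestore argument list from the options listed above.

    dump_dir and both numeric defaults are illustrative assumptions.
    """
    return [
        "mongorestore",
        "--noIndexRestore",
        f"--numInsertionWorkersPerCollection={insertion_workers}",
        f"--numParallelCollections={parallel_collections}",
        "--writeConcern={w:'majority'}",
        dump_dir,
    ]


cmd = build_mongorestore_cmd("/path/to/dump")
# Print a shell-quoted version for review; nothing is executed here.
print(shlex.join(cmd))
```

Trying it first in the development environment with small worker counts, and raising them while watching mongostat, is probably the safest way to find values the servers can tolerate.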
I seem to be having real issues trying to get performance anywhere near that stated in the docs (~700 - 2000 tps with a VM of: 2 vCPUs 4GB RAM). I have tried on a local VM, a local machine and a few AWS VMs and I can't get anywhere close. The maximum I have achieved is 80 tps on an AWS VM.
I have tried changing the -dbPoolSize and the -reqPoolSize for Orion and playing with ulimit to set it to the values suggested by MongoDB, but nothing I change seems to get me anywhere close.
I have set indexes on the _id.id, _id.type and _id.servicePath as suggested in the docs - the latter of which gave me an increase from 40 tps to 80 tps.
Are there any config options for Orion or Mongo that I should be setting away from the default which will get me any closer? Are there any other tips for performance? The link in the docs to the test scripts doesn't work so I haven't been able to see the examples.
I have created my own test scripts using Node.js and I have tested update and queries using a variable amount of concurrent connections and between 1 and 2 load injectors.
From looking at the output from "top" the load is with Mongo as it almost maxes out the CPU but adding more cores to the VM doesn't change the stats. The VM has 7.5GB or 15GB of RAM so mongo should be able to put all the data into memory for blazing fast performance?
I have used mongostat to see that the connections from orion to mongo change with the -dbPoolSize option, but this doesn't yield any better performance.
Any help you can provide would be much appreciated.
I have tried using CentOS 6.5 and 6.7 with Orion 0.25 and 0.26 and MongoDB 2.6 with ~500,000 entities
My test scripts and data are on GitHub
I have only tested without subscriptions so far, but I have scripts ready to test with subscriptions - but I wanted to get a good baseline before adding subscriptions.
My data is modeled around parking spaces in the UK countries their regions and their outcodes (first part of the postcode). This is using servicePaths to split them down to parking lot in an outcode.
Here is a gist with the requests and mongo shell output
Performance is a complex topic which depends on many factors (deployment setup of Orion and MongoDB, startup configuration of Orion and MongoDB, hardware profile of the systems hosting the processes, network communications, overprovisioning level in the case of virtualization, injected load, etc.), so there isn't any general answer to this kind of problem. However, I'll try to provide some hints and recommendations that I hope may help.
Regarding versions, Orion 0.26.0 (or 0.26.1) is recommended over 0.25.0. We have included a lot of performance-related improvements in Orion 0.26.0. Regarding MongoDB, we have also found that 3.0 could be much better than 2.6, especially in update-intensive scenarios.
Having said that, first of all you should locate the bottleneck. Useful tools to do this are top, mongostat and mongotop. It could be Orion, MongoDB or the network connecting them. If the bottleneck is Orion (the context broker), the performance tuning hints provided in this document may help. Slow-query information in MongoDB could also point to bottlenecks in Orion. If the bottleneck is MongoDB, taking into account the large number of entities you have (500,000), maybe you should consider implementing sharding. If the bottleneck is the network, colocating Orion and MongoDB may help.
Finally, some things you can also try in order to get more insight into the problem:
Run some tests outside AWS (i.e. virtual machines on local premises) to compare. I don't know too much about the overprovisioning policy in AWS, but based on my previous experience with other cloud providers, VM overprovisioning (especially if it varies over time) could impact performance.
Analyze if the performance is related to the number of entities. E.g. run tests with 500, 5,000, 50,000 and 500,000 entities and get the performance figure in each case.
Analyze if the performance is related to the usage of servicePath, e.g. put all the 500,000 entities in the default service path / (moving the current content of the servicePath to another place, e.g. an entity attribute or part of the entity ID string) and test. Currently Orion uses a regex to filter for servicePath and that could be slow.
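For the entity-count comparison suggested above, a small harness along these lines could help. It is only a sketch: `load_entities` and `operation` are hypothetical hooks (they are not part of Orion or MongoDB) that you would replace with code that resets the database to a given number of entities and issues one real query or update.

```python
import time


def measure_throughput(operation, n_requests):
    """Return requests per second for calling operation() n_requests times."""
    start = time.perf_counter()
    for _ in range(n_requests):
        operation()
    elapsed = time.perf_counter() - start
    return n_requests / elapsed if elapsed > 0 else float("inf")


def run_scaling_test(load_entities, operation,
                     entity_counts=(500, 5_000, 50_000, 500_000),
                     n_requests=1_000):
    """Measure throughput once per data-set size.

    load_entities(count) is a hypothetical hook that resets the store
    so that it holds exactly `count` entities.
    """
    results = {}
    for count in entity_counts:
        load_entities(count)
        results[count] = measure_throughput(operation, n_requests)
    return results


# Stand-ins so the sketch runs on its own; a dict plays the role of the database.
store = {}


def load_entities(count):
    store.clear()
    store.update((i, {"id": i}) for i in range(count))


def operation():
    store.get(0)


for count, tps in run_scaling_test(load_entities, operation, entity_counts=(500, 5_000)).items():
    print(f"{count:>7} entities: {tps:,.0f} requests/s")
```

If throughput drops sharply between two sizes, that is a hint that an index is missing or that the working set no longer fits in RAM.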
A part of the application I am working on allows a user to build a query (visually) to export data from the system. Currently, I am building an Elasticsearch query based on the user input and writing a web service that will collect all the results of that query (after applying some basic filters).
After doing some basic benchmarking on the production Elasticsearch cluster, I am starting to doubt if Elasticsearch is the right tool for this. It takes about 22 minutes to export 1 million contacts (from an index of 11 million). There are 3 nodes in the cluster, each with 4 cores, 16 GB RAM (heap size 8GB), and EBS storage (I know this is not the most efficient storage for ES).
Is Elasticsearch really not well suited for this kind of large-volume (and frequent) export? In my setup Elasticsearch follows Mongo (using the transporter plugin), so it's possible to get data out from Mongo as well. Would Mongo be a better option? I am a bit skeptical about using Mongo considering its not-so-good memory management and also potentially polluting the working set when running export(s).
Currently I am pulling data from Elasticsearch over HTTP REST using scroll (and the scan API). It's possible that I might get more throughput using Elasticsearch's Java node client or using a plugin like (https://github.com/jprante/elasticsearch-knapsack). Although with the knapsack plugin I lose the ability to record time (and other book-keeping) for each export.
Any suggestions would be highly appreciated.
You can experiment with the scroll and size parameters. It depends on the size of your documents, but you could try something like 30s for scroll and 1000 for size.
You retrieve docs at a rate of ~760 docs/s which is not great but as you say, EBS is not fast. Did you try to calculate what the current throughput is and compare it to EBS bandwidth?
A Java client could be a bit faster but I would guess the overhead is somewhere else in this case.
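The throughput figure above comes straight from the numbers in the question, and the same arithmetic gives a rough bytes-per-second figure to hold against the EBS bandwidth. The average document size below is purely an assumed value; measure your own before drawing conclusions.

```python
docs_exported = 1_000_000
export_minutes = 22

docs_per_second = docs_exported / (export_minutes * 60)
print(f"{docs_per_second:.0f} docs/s")  # 758

# Assumed average document size in bytes (not given in the question).
avg_doc_bytes = 2_000
mb_per_second = docs_per_second * avg_doc_bytes / 1_000_000
print(f"{mb_per_second:.2f} MB/s")
```

If the resulting MB/s is far below what the volumes can deliver, the disk is probably not the limiting factor and the scroll/size settings or the client are better places to look.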
We are building a system that will need to serve loads of small requests from day one. By "loads" I mean ~5,000 queries per second. For each query we need to retrieve ~20 records from noSQL database. There will be two batch reads - 3-4 records at first and then 16-17 reads instantly after that (based on the result of first read). That would be ~100,000 objects to read per second.
Until now we were thinking about using DynamoDB for this as it's really easy to start with.
Storage is not something I would be worried about as the objects will be really tiny.
What I am worried about is the cost of reads. DynamoDB costs $0.0113 per hour per 100 eventually consistent (which is fine for us) reads per second. That is $11.30 per hour for us, provided that all objects are up to 1KB in size. And that would be $5424 per month based on 16 hours/day average usage.
So... $5424 per month.
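For anyone who wants to check or tweak the estimate, the arithmetic can be written out as below. The prices and rates are the ones quoted in the question; the 30-day month is an assumption that happens to reproduce the $5424 figure.

```python
queries_per_second = 5_000
records_per_query = 20
reads_per_second = queries_per_second * records_per_query  # 100,000

price_per_100_reads_hour = 0.0113  # USD, eventually consistent, items up to 1 KB
hours_per_day = 16
days_per_month = 30  # assumed month length

cost_per_hour = reads_per_second / 100 * price_per_100_reads_hour
cost_per_month = cost_per_hour * hours_per_day * days_per_month

print(f"${cost_per_hour:.2f}/hour, ${cost_per_month:.2f}/month")
```

Changing `records_per_query` shows quickly how much would be saved if the second batch of 16-17 reads could be folded into fewer, larger items.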
I would consider other options but I am worried about maintenance issues, costs etc. I have never worked with such setups before so your advice would be really valuable.
What would be the most cost-effective (yet still hassle-free) solution for such read/write intensive application?
From your description above, I'm assuming that your 5,000 queries per second are entirely read operations. This is essentially what we'd call a data warehouse use case. What are your availability requirements? Does it have to be hosted on AWS and friends, or can you buy your own hardware to run in-house? What does your data look like? What does the logic which consumes this data look like?
You might get the sense that there really isn't enough information here to answer the question definitively, but I can at least offer some advice.
First, if your data is relatively small and your queries are simple, save yourself some hassle and make sure you're querying from RAM instead of disk. Any modern RDBMS with support for in-memory caching/tablespaces will do the trick. Postgres and MySQL both have features for this. In the case of Postgres make sure you've tuned the memory parameters appropriately as the out-of-the-box configuration is designed to run on pretty meager hardware. If you must use a NoSQL option, depending on the structure of your data Redis is probably a good choice (it's also primarily in-memory). However in order to say which flavor of NoSQL might be the best fit we'd need to know more about the structure of the data that you're querying, and what queries you're running.
If the queries boil down to SELECT * FROM table WHERE primary_key = {CONSTANT} - don't bother messing with NoSQL - just use an RDBMS and learn how to tune the dang thing. This is doubly true if you can run it on your own hardware. If the connection count is high, use read slaves to balance the load.
Long-after-the-fact Edit (5/7/2013):
Something I should've mentioned before: EC2 is a really really crappy place to measure performance of self-managed database nodes. Unless you're paying out the nose, your I/O perf will be terrible. Your choices are to either pay big money for provisioned IOPS, RAID together a bunch of EBS volumes, or rely on ephemeral storage whilst syncing a WAL off to S3 or similar. All of these options are expensive and difficult to maintain. All of these options have varying degrees of performance.
I discovered this for a recent project, so I switched to Rackspace. The performance increased tremendously there, but I noticed that I was paying a lot for CPU and RAM resources when really I just need fast I/O. Now I host with Digital Ocean. All of DO's storage is SSD. Their CPU performance is kind of crappy in comparison to other offerings, but I'm incredibly I/O bound so I Just Don't Care. After dropping Postgres' random_page_cost to 2, I'm humming along quite nicely.
Moral of the story: profile, tune, repeat. Ask yourself what-if questions and constantly validate your assumptions.
Another long-after-the-fact-edit (11/23/2013): As an example of what I'm describing here, check out the following article for an example of using MySQL 5.7 with the InnoDB memcached plugin to achieve 1M QPS: http://dimitrik.free.fr/blog/archives/11-01-2013_11-30-2013.html#2013-11-22
By "loads" I mean ~5,000 queries per second.
Ah that's not so much, even SQL can handle that. So you are already easily within the limits of what most modern DBs can handle. However they can only handle this with the right:
Indexes
Queries
Server Hardware
Splitting of large data (you might require a large number of shards with relatively little data each; it depends on your case, hence "might")
That would be ~100,000 objects to read per second.
Now that's more of a high-load scenario. Must you read these in such a fragmented manner? If so then (as I said) you may need to look into spreading the load across replicated shards.
Storage is not something I would be worried about as the objects will be really tiny.
Mongo is aggressive with disk allocation, so even with small objects it will still pre-allocate a lot of space; this is something to bear in mind.
So... $5424 per month.
Oh yea the billing thrills of Amazon :\.
I would consider other options but I am worried about maintenance issues, costs etc. I have never worked with such setups before so your advice would be really valuable.
Now you hit the snag of it all. You can set up your own cluster, but then you might end up paying that much in money and time (or way more) for the servers, people, admins and your own maintenance time. This is one reason why DynamoDB really shines here: it suits large setups that want to take the load, pain and stress of server management off the company (trust me, it is really painful; if you're a dev you might as well change your job title to server admin from now on).
If you were to set this up yourself, you would need:
A considerable number of EC2 instances (dependent upon data and index size, but I would say close to maybe 30?)
A server admin (maybe 2, maybe freelance?)
Both of these could set you back hundreds of thousands of pounds a year. I would personally go for the managed approach if it fits your needs and budget. When your needs grow beyond what Amazon's managed DB can give you, then move to your own infrastructure.
Edit
I should add that the cost-effectiveness estimate was made with quite a few unknowns, for example:
I am unsure of the amount of data you have
I am unsure of writes
Both of these led me to assume a scenario of:
Massive writes (about as many as you're reading)
Massive data (lots)
Here is what I recommend in sequence.
Identify your use case and choose the correct DB. We test MySQL and MongoDB regularly for all kinds of workloads (OLTP, analytics, etc.). In all the cases we have tested, MySQL outperforms MongoDB and is cheaper ($/TPS) than MongoDB. MongoDB has other advantages but that is another story ... since we are talking about performance here.
Try to cache your queries in RAM (by provisioning adequate RAM).
If you are bottlenecked on RAM, then you can try an SSD caching solution which takes advantage of ephemeral SSD. This works if your workload is cache friendly. You can save loads of money as ephemeral SSD is typically not charged by the cloud provider.
Try PIOPS/RAID or a combination to create adequate IOPS for your application.
I'm building a system that tracks and verifies ad impressions and clicks. This means that there are a lot of insert commands (about 90/second average, peaking at 250) and some read operations, but the focus is on performance and making it blazing-fast.
The system is currently on MongoDB, but I've been introduced to Cassandra and Redis since then. Would it be a good idea to go to one of these two solutions, rather than stay on MongoDB? Why or why not?
Thank you
For a harvesting solution like this, I would recommend a multi-stage approach. Redis is good at real-time communication. Redis is designed as an in-memory key/value store and inherits some very nice benefits of being a memory database: O(1) list operations. For as long as there is RAM to use on a server, Redis will not slow down pushing to the end of your lists, which is good when you need to insert items at such an extreme rate. Unfortunately, Redis can't operate with data sets larger than the amount of RAM you have (it only writes to disk; reading from disk is for restarting the server or recovering from a system crash) and scaling has to be done by you and your application. (A common way is to spread keys across numerous servers, which is implemented by some Redis drivers, especially those for Ruby on Rails.) Redis also has support for simple publish/subscribe messaging, which can be useful at times as well.
In this scenario, Redis is "stage one." For each specific type of event you create a list in Redis with a unique name; for example we have "page viewed" and "link clicked." For simplicity we want to make sure the data in each list has the same structure; link clicked may have a user token, link name and URL, while page viewed may only have the user token and URL. Your first concern is just making sure that the fact that it happened, plus whatever absolutely necessary data you need, gets pushed.
Next we have some simple processing workers that take this frantically inserted information off of Redis' hands, by asking it to take an item off the end of the list and hand it over. The worker can make any adjustments/deduplication/ID lookups needed to properly file the data and hand it off to a more permanent storage site. Fire up as many of these workers as you need to keep Redis' memory load bearable. You could write the workers in anything you wish (Node.js, C#, Java, ...) as long as it has a Redis driver (most web languages do now) and one for your desired storage (SQL, Mongo, etc.)
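A minimal sketch of this stage-one queue plus worker is below. To keep it runnable without a server it uses a tiny in-memory stand-in, `InMemoryListStore`, whose `rpush` and `lpop` methods only mimic the Redis RPUSH and LPOP commands; with a real deployment you would use a Redis client library instead. The key names and event fields are made up for illustration, and the worker pops from the opposite end to where events are pushed, so events are processed first-in, first-out.

```python
import json
from collections import defaultdict, deque


class InMemoryListStore:
    """Stand-in for a Redis client; only mimics RPUSH and LPOP on named lists."""

    def __init__(self):
        self._lists = defaultdict(deque)

    def rpush(self, key, value):
        self._lists[key].append(value)
        return len(self._lists[key])

    def lpop(self, key):
        items = self._lists[key]
        return items.popleft() if items else None


def record_event(store, event_type, **fields):
    """Stage one: push the bare minimum about an event onto its list."""
    store.rpush(f"events:{event_type}", json.dumps(fields))


def drain(store, event_type, save):
    """Worker: pop events until the list is empty and hand each one to save()."""
    handled = 0
    while (raw := store.lpop(f"events:{event_type}")) is not None:
        save(event_type, json.loads(raw))
        handled += 1
    return handled


store = InMemoryListStore()
permanent = []  # stands in for the SQL/Mongo storage the workers write to

record_event(store, "link_clicked", user="u1", link="signup", url="/pricing")
record_event(store, "page_viewed", user="u1", url="/pricing")
record_event(store, "page_viewed", user="u2", url="/")

drain(store, "link_clicked", lambda kind, data: permanent.append((kind, data)))
drain(store, "page_viewed", lambda kind, data: permanent.append((kind, data)))
print(len(permanent))  # 3
```

In a real worker the `save` callback is where deduplication and ID lookups would go before the row is written to permanent storage.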
MongoDB is good at document storage. Unlike Redis, it is able to deal with databases larger than RAM and it supports sharding/replication on its own. An advantage of MongoDB over SQL-based options is that you don't have to have a predetermined schema; you're free to change the way data is stored however you want at any time.
I would, however, suggest Redis or Mongo for the "step one" phase of holding data for processing and use a traditional SQL setup (Postgres or MSSQL, perhaps) to store post-processed data. Tracking client behavior sounds like relational data to me, since you may want to go "Show me everyone who views this page" or "How many pages did this person view on this given day" or "What day had the most viewers in total?". There may be even more complex joins or queries for analytic purposes you come up with, and mature SQL solutions can do a lot of this filtering for you; NoSQL (Mongo or Redis specifically) can't do joins or complex queries across varied sets of data.
I currently work for a very large ad network and we write to flat files :)
I'm personally a Mongo fan, but frankly, Redis and Cassandra are unlikely to perform either better or worse. I mean, all you're doing is throwing stuff into memory and then flushing to disk in the background (both Mongo and Redis do this).
If you're looking for blazing fast speed, the other option is to keep several impressions in local memory and then flush them to disk every minute or so. Of course, this is basically what Mongo and Redis do for you. Not a really compelling reason to move.
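To make the "keep them in memory and flush every minute or so" idea concrete, here is a rough sketch. The class name, the JSON-lines file format and the thresholds are all assumptions for illustration; note that anything still in the buffer is lost if the process dies before the next flush.

```python
import json
import os
import tempfile
import time


class BufferedImpressionLog:
    """Collects impressions in memory and appends them to a flat file in batches."""

    def __init__(self, path, flush_interval_s=60.0, max_buffered=1000):
        self.path = path
        self.flush_interval_s = flush_interval_s
        self.max_buffered = max_buffered
        self._buffer = []
        self._last_flush = time.monotonic()

    def record(self, impression):
        self._buffer.append(json.dumps(impression))
        overdue = time.monotonic() - self._last_flush >= self.flush_interval_s
        if overdue or len(self._buffer) >= self.max_buffered:
            self.flush()

    def flush(self):
        if self._buffer:
            # One sequential append per batch instead of one write per impression.
            with open(self.path, "a", encoding="utf-8") as f:
                f.write("\n".join(self._buffer) + "\n")
            self._buffer.clear()
        self._last_flush = time.monotonic()


log_path = os.path.join(tempfile.mkdtemp(), "impressions.log")
log = BufferedImpressionLog(log_path, flush_interval_s=60.0, max_buffered=3)
for i in range(7):
    log.record({"ad": "banner-1", "n": i})
log.flush()  # write whatever is still buffered before shutdown

with open(log_path, encoding="utf-8") as f:
    print(sum(1 for _ in f))  # 7
```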
All three solutions (four if you count flat-files) will give you blazing fast writes. The non-relational (nosql) solutions will give you tunable fault-tolerance as well for the purposes of disaster recovery.
In terms of scale, our test environment, with only three MongoDB nodes, can handle 2-3k mixed transactions per second. At 8 nodes, we can handle 12k-15k mixed transactions per second. Cassandra can scale even higher. 250 reads is (or should be) no problem.
The more important question is, what do you want to do with this data? Operational reporting? Time-series analysis? Ad-hoc pattern analysis? real-time reporting?
MongoDB is a good option if you want the ability to do ad-hoc analysis based on multiple attributes within a collection. You can put up to 40 indexes on a collection, though the indexes will be stored in-memory, so watch for size. But the result is a flexible analytical solution.
Cassandra is a key-value store. You define a static column or set of columns that will act as your primary index right up front. All queries run against Cassandra should be tuned to this index. You can put a secondary on it, but that's about as far as it goes. You can, of course, use MapReduce to scan the store for non-key attribution, but it will be just that: a serial scan through the store. Cassandra also doesn't have the notion of "like" or regex operations on the server nodes. If you want to find all customers where the first name starts with "Alex", you'll have to scan through the entire collection, pull the first name out for each entry and run it through a client-side regex.
I'm not familiar enough with Redis to speak intelligently about it. Sorry.
If you are evaluating non-relational platforms, you might also want to consider CouchDB and Riak.
Hope this helps.
Just found this: http://blog.axant.it/archives/236
Quoting the most interesting part:
This second graph is about Redis RPUSH vs Mongo $PUSH vs Mongo insert, and I find this graph to be really interesting. Up to 5000 entries mongodb $push is faster even when compared to Redis RPUSH, then it becames incredibly slow, probably the mongodb array type has linear insertion time and so it becomes slower and slower. mongodb might gain a bit of performances by exposing a constant time insertion list type, but even with the linear time array type (which can guarantee constant time look-up) it has its applications for small sets of data.
I guess everything depends at least on data type and volume. The best advice would probably be to benchmark on your typical dataset and see for yourself.
According to the Benchmarking Top NoSQL Databases report, I recommend Cassandra.
If you have the choice (and need to move away from flat files) I would go with Redis. It's blazingly fast and will comfortably handle the load you're talking about, but more importantly you won't have to manage the flushing/IO code. I understand it's pretty straightforward, but less code to manage is better than more.
You will also get horizontal scaling options with Redis that you may not get with file based caching.
I can get around 30k inserts/sec with MongoDB on a simple $350 Dell. If you only need around 2k inserts/sec, I would stick with MongoDB and shard it for scalability. Maybe also look into doing something with Node.js or something similar to make things more asynchronous.
The problem with inserts into databases is that they usually require writing to a random block on disk for each insert. What you want is something that only writes to disk every 10 inserts or so, ideally to sequential blocks.
Flat files are good. Summary statistics (e.g. total hits per page) can be obtained from flat files in a scalable manner using merge-sorty map-reducy type algorithms. It's not too hard to roll your own.
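As a rough illustration of rolling your own, the sketch below counts hits per page in each log file separately (the "map" step) and then adds the partial counts together (the "reduce" step). The one-page-per-line file format and the sample data are assumptions made up for the example.

```python
import os
import tempfile
from collections import Counter


def count_hits(path):
    """Map step: count hits per page in one log file (one page path per line)."""
    with open(path, encoding="utf-8") as f:
        return Counter(line.strip() for line in f if line.strip())


def merge_counts(partials):
    """Reduce step: add up the per-file counters."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Two small sample files so the sketch runs on its own.
log_dir = tempfile.mkdtemp()
samples = {"a.log": ["/home", "/pricing", "/home"], "b.log": ["/home", "/docs"]}
paths = []
for name, pages in samples.items():
    path = os.path.join(log_dir, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(pages) + "\n")
    paths.append(path)

totals = merge_counts(count_hits(p) for p in paths)
print(totals.most_common())  # [('/home', 3), ('/pricing', 1), ('/docs', 1)]
```

Because each file is counted independently, the map step can be spread over several processes or machines and only the small counters need to be merged.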
SQLite now supports Write Ahead Logging, which may also provide adequate performance.
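For reference, switching a SQLite database to write-ahead logging is a single pragma. The sketch below uses Python's built-in sqlite3 module; the table layout and file name are made up for the example, and whether the resulting performance is adequate for your insert rate is something you would have to measure.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "hits.db")
conn = sqlite3.connect(db_path)

# WAL needs a file-backed database; the pragma returns the mode actually in effect.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # wal

conn.execute("CREATE TABLE hits (page TEXT NOT NULL, ts REAL NOT NULL)")
with conn:  # one transaction for the whole batch
    conn.executemany(
        "INSERT INTO hits (page, ts) VALUES (?, ?)",
        [("/home", 1.0), ("/pricing", 2.0), ("/home", 3.0)],
    )

count = conn.execute("SELECT COUNT(*) FROM hits WHERE page = ?", ("/home",)).fetchone()[0]
print(count)  # 2
conn.close()
```

Grouping many inserts into one transaction, as done here, usually matters at least as much as the journal mode.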
I have hands-on experience with MongoDB, CouchDB and Cassandra. I converted a lot of files to base64 strings and inserted these strings into each NoSQL database.
MongoDB is the fastest. Cassandra is the slowest. CouchDB is slow too.
I think MySQL would be much faster than all of them, but I haven't tried MySQL for my test case yet.