Large (and frequent) exports from Elasticsearch or MongoDB

Part of the application I am working on allows a user to visually build a query to export data from the system. Currently, I build an Elasticsearch query from the user input and am writing a web service that collects all the results of that query (after applying some basic filters).
After doing some basic benchmarking on the production Elasticsearch cluster, I am starting to doubt whether Elasticsearch is the right tool for this. It takes about 22 minutes to export 1 million contacts (from an index of 11 million). There are 3 nodes in the cluster, each with 4 cores, 16 GB RAM (8 GB heap), and EBS storage (I know this is not the most efficient storage for ES).
Is Elasticsearch really not well suited for this kind of large-volume (and frequent) export? In my setup Elasticsearch follows Mongo (using the transporter plugin), so it's possible to get the data out of Mongo as well. Would Mongo be a better option? I am a bit skeptical about using Mongo, considering its not-so-good memory management and the potential for polluting the working set when running exports.
Currently I am pulling data from Elasticsearch over HTTP REST using scroll (and the scan API). It's possible that I might get more throughput using Elasticsearch's Java node client or a plugin like knapsack (https://github.com/jprante/elasticsearch-knapsack), although with the knapsack plugin I lose the ability to record time (and other book-keeping) for each export.
Any suggestions would be highly appreciated.

You can experiment with the scroll and size parameters. It depends on the size of your documents, but you could try something like 30s for scroll and 1000 for size.
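For reference, here is a minimal sketch of such a scrolled export using the official Python client's scan helper (the host, index name and match-all query are assumptions, not taken from your setup):

    import json
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(["http://es-node1:9200"])  # placeholder host

    # helpers.scan wraps the scroll API; size is per shard per batch,
    # scroll is how long the scroll context is kept alive between batches.
    docs = helpers.scan(
        es,
        index="contacts",                       # placeholder index name
        query={"query": {"match_all": {}}},
        size=1000,
        scroll="30s",
    )

    with open("contacts-export.jsonl", "w") as out:
        for hit in docs:
            out.write(json.dumps(hit["_source"]) + "\n")

Streaming to a file this way keeps memory usage flat regardless of the export size.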
You retrieve docs at a rate of ~760 docs/s, which is not great, but as you say, EBS is not fast. Did you try to calculate the current throughput and compare it to the EBS bandwidth?
A Java client could be a bit faster, but I would guess the overhead is somewhere else in this case.

Related

How to improve application performance? [Updated]

To give you an idea of the data:
The DB has collections/tables with over a hundred million documents/records each; every document contains more than 100 attributes/columns. The data size is expected to grow a hundredfold soon.
Operations on the data:
There are mainly the following types of operations on the data:
Validating the data and then importing it into the DB; this happens multiple times daily
Aggregations on this imported data
Searches/finds
Updates
Deletes
Tools/software used:
MongoDB for the database: a PSS-architecture replica set, with indexes (most of the queries are index scans)
NodeJS using Koa.js
Problems:
However, the tool is very slow when it comes to aggregations, finds, etc.
What I have implemented for performance so far:
DB Indexing
Caching
Pre-aggregations (using the MongoDB aggregation pipeline to aggregate the data beforehand during import and store it in separate collections, to avoid aggregations at runtime; see the sketch after this list)
Increased RAM and CPU cores on the DB server
Separate server for NodeJS server and Front-end build
PM2 to manage NodeJS server application and for spawning clusters
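Regarding the pre-aggregation item above, a minimal sketch of that pattern with PyMongo (the connection string, collection names, fields and grouping key are placeholders): the rollup is computed once at import time and written into its own collection, so runtime reads never touch the raw data.

    from pymongo import MongoClient

    db = MongoClient("mongodb://db-host:27017")["appdb"]  # placeholder URI/DB

    db.records.aggregate([
        {"$group": {
            "_id": {"day": "$day", "category": "$category"},
            "count": {"$sum": 1},
            "total": {"$sum": "$amount"},
        }},
        # $merge (MongoDB 4.2+) upserts into a separate summary collection;
        # on older versions $out, which replaces the target collection, is the alternative.
        {"$merge": {"into": "records_daily_summary", "whenMatched": "replace"}},
    ])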
However, in my experience, even after implementing all of the above, the application is not performant enough. I feel the reason is that the data is quite large. I am not aware of how big-data applications are architected to deliver high performance. Please advise.
Also, is the choice of technology unsuitable, or would changing the technology/tools help? If so, what is advised in such scenarios?
I'm requesting your advice to help me improve the performance of the application.
It's not easy to give a correct answer because we don't really have that many details. What I would do is detailed monitoring, covering at least the following:
Machine Level:
monitor the overall CPU load (for all cores) and RAM usage on your DB machine
monitor disk IO on the disks where the data is stored
this should show whether the machine specs are a bottleneck
Database & DB Process Level (my first guess is that this is the critical part):
what is the overall size of your data at the moment? (I know it will increase drastically, but if it is already too slow now, this is interesting information, especially in relation to the current RAM size and number of CPU cores)
monitor memory usage and CPU load for your MongoDB process
did looking at the query plans (while running the aggregations) show you what improvements can be made? (see the sketch after this list)
have a look at the caching strategy: what strategy are you using?
this should give more detailed results on where to make improvements at the DB level: is it just hardware bottlenecks, or is it an aggregation problem?
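For the query-plan check, a small PyMongo sketch (the connection string, collection and pipeline are placeholders); the thing to look for is an IXSCAN rather than a COLLSCAN stage in the winning plan:

    from pymongo import MongoClient

    db = MongoClient("mongodb://db-host:27017")["appdb"]  # placeholder URI/DB

    # Ask the server to explain the aggregation instead of running it.
    plan = db.command(
        "aggregate", "records",                 # placeholder collection
        pipeline=[
            {"$match": {"status": "active"}},   # placeholder filter
            {"$group": {"_id": "$category", "count": {"$sum": 1}}},
        ],
        explain=True,
    )
    print(plan)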
Node.js App Level:
how much RAM and CPU does the Node.js app use?
if there are multiple instances of the Node.js app, track this for all instances
does the data import also happen through the Node.js app? Does the load on the app increase drastically while importing data?
if you see a high load on this app, there is a need to act here (increasing instances, or splitting it into separate apps, e.g. running the import as a separate app)

AWS database solution for storing non-relational data

What's the best AWS database for the requirements below?
I need to store around 50,000 - 100,000 entries in the database.
Each entry would have a String as the key and a JSON array as the value.
I should be able to retrieve the JSON array using the key.
The size of the JSON data is around 20-30 KB
I expect around 10,000 - 40,000 reads per hour.
Around 50,000 - 100,000 writes/week
I have to consider the cost as well.
Ease of integration/development
I am a bit confused between MongoDB, DynamoDB and PostgreSQL. Please share your thoughts on this.
DynamoDB:-
DynamoDB is a fully managed proprietary NoSQL database service that supports key-value and document data structures. For the typical use case that you have described in OP, it would serve the purpose.
DynamoDB can handle more than 10 trillion requests per day and support
peaks of more than 20 million requests per second.
DynamoDB has a good AWS SDK for all operations. Read and write capacity units can be configured for the table.
DynamoDB tables using on-demand capacity mode automatically adapt to
your application’s traffic volume. On-demand capacity mode instantly
accommodates up to double the previous peak traffic on a table. For
example, if your application’s traffic pattern varies between 25,000
and 50,000 strongly consistent reads per second where 50,000 reads per
second is the previous traffic peak, on-demand capacity mode instantly
accommodates sustained traffic of up to 100,000 reads per second. If
your application sustains traffic of 100,000 reads per second, that
peak becomes your new previous peak, enabling subsequent traffic to
reach up to 200,000 reads per second.
One point to note is that it doesn't allow you to query the table on non-key attributes. This means that if you don't know the hash key of the table, you may need to do a full table scan to get the data. However, there is a secondary index option which you can explore to get around the problem. You should have all the query access patterns of your use case in hand before you design the table, so you can make an informed decision.
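To illustrate how the key/JSON-array pattern from the question maps onto DynamoDB, here is a minimal boto3 sketch (the table name, key attribute and region are assumptions):

    import json
    import boto3

    # Placeholder table with a string partition key named "entryKey".
    table = boto3.resource("dynamodb", region_name="us-east-1").Table("Entries")

    # Write: a 20-30 KB JSON array fits comfortably under DynamoDB's 400 KB item limit.
    table.put_item(Item={
        "entryKey": "user-1234",
        "payload": json.dumps([{"a": 1}, {"b": 2}]),
    })

    # Read: fetch by key and decode the array.
    item = table.get_item(Key={"entryKey": "user-1234"}).get("Item")
    payload = json.loads(item["payload"]) if item else None
    print(payload)

The array could equally be stored as a native DynamoDB list attribute; storing it as a string just keeps the sketch short.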
MongoDB:-
MongoDB is not a fully managed service on AWS. However, you can set up the database using AWS services such as EC2, VPC, IAM, EBS, etc. This requires some AWS cloud experience. The other option is to use the MongoDB Atlas service.
MongoDB is more flexible in terms of querying. It also has powerful aggregation functions. There are lots of tools available to query the database directly and explore the data, much as you would with SQL.
In terms of a Java API, Spring Data MongoDB can be used to perform typical database operations. There are also lots of open-source frameworks for MongoDB in various languages (for example, Mongoose for Node.js).
MongoDB has support for many programming languages, and the APIs are mature as well.
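For comparison, the same key/JSON-array lookup in MongoDB is a one-liner with PyMongo, and the richer querying means you can also filter inside the stored array (the connection string, collection and field names are placeholders):

    from pymongo import MongoClient

    coll = MongoClient("mongodb://db-host:27017")["appdb"]["entries"]

    # Upsert the JSON array under its key.
    coll.replace_one(
        {"_id": "user-1234"},
        {"payload": [{"a": 1}, {"b": 2}]},
        upsert=True,
    )

    print(coll.find_one({"_id": "user-1234"}))   # lookup by key
    print(list(coll.find({"payload.a": 1})))     # query inside the stored array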
PostgreSQL:-
PostgreSQL is available as a fully managed database on AWS through Amazon RDS.
PostgreSQL has become the preferred open source relational database
for many enterprise developers and start-ups, powering leading
geospatial and mobile applications. Amazon RDS makes it easy to set
up, operate, and scale PostgreSQL deployments in the cloud.
I don't think I need to write much about this database and its API. It is a very mature database and has good APIs.
Points to consider:-
Query Access Pattern
Easy setup
Database maintenance
API and frameworks
Community support

FIWARE Orion/MongoDB Performance on AWS

I seem to be having real issues trying to get performance anywhere near that stated in the docs (~700-2000 tps on a VM with 2 vCPUs and 4 GB RAM). I have tried a local VM, a local machine and a few AWS VMs, and I can't get anywhere close: the maximum I have achieved is 80 tps on an AWS VM.
I have tried changing the -dbPoolSize and -reqPoolSize options for Orion and playing with ulimit to set it to the values suggested by MongoDB, but nothing I change seems to get me any closer.
I have set indexes on the _id.id, _id.type and _id.servicePath as suggested in the docs - the latter of which gave me an increase from 40 tps to 80 tps.
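For reference, the same indexes can also be created from a script; a PyMongo sketch, assuming Orion's default orion database and entities collection (adjust the names if your deployment differs):

    from pymongo import MongoClient

    # Assumes the default "orion" database and "entities" collection.
    entities = MongoClient("mongodb://localhost:27017")["orion"]["entities"]

    entities.create_index("_id.id")
    entities.create_index("_id.type")
    entities.create_index("_id.servicePath")

    print(entities.index_information())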
Are there any config options for Orion or Mongo that I should be setting away from the default which will get me any closer? Are there any other tips for performance? The link in the docs to the test scripts doesn't work so I haven't been able to see the examples.
I have created my own test scripts using Node.js and I have tested update and queries using a variable amount of concurrent connections and between 1 and 2 load injectors.
From looking at the output of "top", the load is with Mongo, as it almost maxes out the CPU, but adding more cores to the VM doesn't change the stats. The VMs have 7.5 GB or 15 GB of RAM, so Mongo should be able to fit all the data in memory for blazing-fast performance?
I have used mongostat to see that the connections from orion to mongo change with the -dbPoolSize option, but this doesn't yield any better performance.
Any help you can provide would be much appreciated.
I have tried using CentOS 6.5 and 6.7 with Orion 0.25 and 0.26 and MongoDB 2.6 with ~500,000 entities
My test scripts and data are on GitHub
I have only tested without subscriptions so far, but I have scripts ready to test with subscriptions - but I wanted to get a good baseline before adding subscriptions.
My data is modelled around parking spaces in the UK: countries, their regions and their outcodes (the first part of the postcode). It uses servicePaths to split them down to the parking lots in an outcode.
Here is a gist with the requests and mongo shell output
Performance is a complex topic which depends on many factors (the deployment setup of Orion and MongoDB, the startup configuration of Orion and MongoDB, the hardware profile of the systems hosting the processes, network communications, the overprovisioning level in the case of virtualization, the injected load, etc.), so there isn't a general answer for this kind of problem. However, I'll try to provide some hints and recommendations that I hope will help.
Regarding versions, Orion 0.26.0 (or 0.26.1) is recommended over 0.25.0. We included a lot of performance-related improvements in Orion 0.26.0. Regarding MongoDB, we have also found that 3.0 can be much better than 2.6, especially in update-intensive scenarios.
Having said that, first of all you should locate the bottleneck. Useful tools for this are top, mongostat and mongotop. The bottleneck could be Orion, MongoDB or the network connecting them. If the bottleneck is the Context Broker (Orion), the performance tuning hints in the Orion documentation may help. Slow-query information in MongoDB could also point to bottlenecks at Orion. If the bottleneck is MongoDB, taking into account the large number of entities you have (500,000), maybe you should consider implementing sharding. If the bottleneck is the network, colocating Orion and MongoDB may help.
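One concrete way to get at the slow-query information mentioned above is MongoDB's built-in profiler; a sketch with PyMongo (the database name and the 100 ms threshold are assumptions):

    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["orion"]  # placeholder DB name

    # Level 1 records operations slower than slowms into db.system.profile.
    db.command("profile", 1, slowms=100)

    # ... run some load against Orion, then inspect the slowest captured operations:
    for op in db["system.profile"].find().sort("millis", -1).limit(5):
        print(op.get("op"), op.get("ns"), op.get("millis"), "ms")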
Finally, some things you can also try in order to get more insight into the problem:
Run some tests outside AWS (i.e. virtual machines on local premises) to compare. I don't know much about the overprovisioning policy in AWS, but based on my previous experience with other cloud providers, VM overprovisioning (especially if it varies over time) can impact performance.
Analyze whether the performance is related to the number of entities, e.g. run tests with 500, 5,000, 50,000 and 500,000 entities and get the performance figure in each case.
Analyze whether the performance is related to the use of servicePath, e.g. put all 500,000 entities in the default service path / (moving the current servicePath content somewhere else, e.g. to an entity attribute or part of the entity ID string) and test. Currently Orion uses a regex to filter by servicePath, and that can be slow.

ElasticSearch vs. MongoDB for Caching User Data

Up to this point, I have been using MongoDB (Node.js + Mongoose) to save posts which belong to a user, so that I can later retrieve them to display in a stream (just like Facebook, Twitter, etc.)
It recently became necessary to allow the user to deeply search his stream; MongoDB's search was insufficient, so I implemented ElasticSearch on my servers (Amazon EC2 m1.large instances running CentOS, FWIW).
My question: I'm now in a position that I'm duplicating the data between MongoDB (where the user's stream is cached) and ElasticSearch (where it is searched).
Is there any disadvantage to moving my cache ENTIRELY into ElasticSearch and getting rid of MongoDB altogether? It seems a waste to double the storage, and there's no other place that I'm accessing this data (it is only used when presenting/searching the stream of posts).
Specifically, I want to make sure I'm not overlooking anything regarding performance. I like the idea of removing MongoDB as a bottleneck, yet I worry about the memory overhead of ElasticSearch. MongoDB runs on its own server in my cloud setup, whereas ElasticSearch is running on the same instances as node.js. This means I would have MORE ElasticSearch servers (the node.js servers are in an auto-scaling array), but none of them is a DEDICATED server (unlike MongoDB).
The only big obstacle to using ES as a "primary datasource" is that there isn't a good backup mechanism right now. The ES team is working on it and expects it to be out by the end of the year, but in the meantime you'll have to implement your own backup scripts.
As far as performance, it's really hard to say because almost every situation is unique. ES benefits from memory - so more is always better. In particular, sorts/filters/facets/geo all like to eat memory. If you aren't doing much in the way of faceting, for example, you may be fine with less memory.
ES doesn't need to run on a dedicated node...but it will happily use as many resources as you give it.
Another option is to use just the Elasticsearch indexes. You can choose not to store the data in a readable form in ES, so you search in ES and then retrieve the documents from MongoDB for your user as needed.
The question below comments on exactly that:
Storing only selected fields and not storing _all in pyes/elasticsearch
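A sketch of that approach with the Python clients: search in Elasticsearch without returning (or even storing) the source, then hydrate the full posts from MongoDB by id. The index, collection and field names are placeholders, and it assumes the ES document ids were set to the MongoDB ObjectId strings at index time.

    from bson import ObjectId
    from elasticsearch import Elasticsearch
    from pymongo import MongoClient

    es = Elasticsearch(["http://localhost:9200"])
    posts = MongoClient("mongodb://localhost:27017")["appdb"]["posts"]

    # Only hit metadata (_id, _score) comes back from ES.
    res = es.search(
        index="posts",
        body={"query": {"match": {"body": "coffee"}}, "_source": False, "size": 20},
    )
    ids = [ObjectId(hit["_id"]) for hit in res["hits"]["hits"]]

    # MongoDB remains the system of record; fetch the full documents there.
    for doc in posts.find({"_id": {"$in": ids}}):
        print(doc["_id"], doc.get("body", "")[:80])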

MongoDB on EC2 server or AWS SimpleDB?

Which scenario makes more sense: hosting several EC2 instances with MongoDB installed, or using the Amazon SimpleDB web service?
With several EC2 instances running MongoDB, I have the problem of setting the instances up myself.
When using SimpleDB, I have the problem of locking myself into Amazon's data structure, right?
What differences are there development-wise? Shouldn't I be able to just switch the DAO in my service layer to write to either MongoDB or AWS SimpleDB?
SimpleDB has some scalability limitations. You can only scale by sharding (manually), it has higher latency than MongoDB or Cassandra, it has a throughput limit, and it is priced higher than the other options.
If you need wider query options, have a high read rate, and don't have that much data, MongoDB is better. But for durability you need to use at least two MongoDB server instances as master/slave, otherwise you can lose the last minute of your data. Scaling is manual. It's much faster than SimpleDB. Auto-sharding is implemented as of version 1.6.
Cassandra has weak query options but is as durable as PostgreSQL. It is as fast as Mongo and faster at larger data sizes. Writes are faster than reads in Cassandra. It can scale automatically by firing up EC2 instances, but you have to modify the config files a bit (if I remember correctly). If you have terabytes of data, Cassandra is your best bet. There's no need to shard your data; it was designed to be distributed from day one. You can keep any number of copies of all your data, and if some servers are dead it will automatically return results from the live ones and redistribute the dead servers' data to others. It's highly fault tolerant. You can add any number of instances; it's much easier to scale than the other options. It has strong .NET and Java client options, with connection pooling, load balancing, marking of dead servers, and so on.
Another option is Hadoop for big data, but it's not as real-time as the others; you can use Hadoop for data warehousing. Neither Cassandra nor Mongo has transactions, so if you need transactions PostgreSQL is a better fit. Another option is Amazon RDS, but its performance is poor and its price is high. If you use relational databases or SimpleDB, you may also need a data cache (e.g. memcached).
For web apps, if your data is small I recommend Mongo; if it is large, Cassandra is better. You don't need a caching layer with Mongo or Cassandra; they are already fast. I don't recommend SimpleDB; it also locks you into Amazon, as you said.
If you are using C#, Java or Scala, you can write an interface for your data access layer and implement it for Mongo, MySQL, Cassandra or anything else. It's even simpler in dynamic languages (e.g. Ruby, Python, PHP). You can write a provider for two of them if you want and switch storage, perhaps even at runtime, with only a configuration change (a minimal sketch of the interface idea follows below). Development with Mongo, Cassandra and SimpleDB is easier than with a relational database, and they are schema-free, though it also depends on the client library/connector you're using. The simplest is Mongo. There's only one index per table in Cassandra, so you have to manage other indexes yourself, although as far as I know secondary indexes become possible with the 0.7 release of Cassandra. You can also start with any of them and replace it in the future if you have to.
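As a rough illustration of the interface idea (names are purely illustrative; a real DAO would cover more than get/put):

    from abc import ABC, abstractmethod
    from typing import Optional

    from pymongo import MongoClient


    class PostStore(ABC):
        """Storage interface the service layer depends on."""

        @abstractmethod
        def put(self, key: str, doc: dict) -> None: ...

        @abstractmethod
        def get(self, key: str) -> Optional[dict]: ...


    class MongoPostStore(PostStore):
        """MongoDB implementation; a SimpleDB or Cassandra one would expose the same methods."""

        def __init__(self, uri: str = "mongodb://localhost:27017"):
            self._coll = MongoClient(uri)["appdb"]["posts"]  # placeholder names

        def put(self, key: str, doc: dict) -> None:
            self._coll.replace_one({"_id": key}, doc, upsert=True)

        def get(self, key: str) -> Optional[dict]:
            return self._coll.find_one({"_id": key})


    # Swapping the backend becomes a configuration choice, not a code change.
    store: PostStore = MongoPostStore()
    store.put("post-1", {"title": "hello"})
    print(store.get("post-1"))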
I think you have a question of both time and speed.
MongoDB/Cassandra are going to be much faster, but you will have to invest $$$ to get them going. This means you'll need to run and set up server instances for all of them and figure out how they work.
On the other hand, you don't pay a "per transaction" cost directly; you just pay for the hardware, which is probably more efficient for larger services.
In the Cassandra/MongoDB fight, here's what you'll find (based on testing I've personally been involved with over the last few days).
Cassandra:
Scaling / Redundancy is very core
Configuration can be very intense
To do reporting you need map-reduce, and for that you need to run a Hadoop layer. This was a pain to get configured and a bigger pain to make performant.
MongoDB:
Configuration is relatively easy (even for the new sharding, this week)
Redundancy is still "getting there"
Map-reduce is built-in and it's easy to get data out.
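For example, a sketch of the built-in map-reduce through the mapReduce command with PyMongo (collection and field names are placeholders; on current MongoDB versions the aggregation pipeline is the usual replacement):

    from bson.code import Code
    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["analytics"]  # placeholder DB

    result = db.command({
        "mapReduce": "events",                                        # placeholder collection
        "map": Code("function () { emit(this.userId, 1); }"),
        "reduce": Code("function (key, values) { return Array.sum(values); }"),
        "out": {"inline": 1},   # return results inline instead of writing a collection
    })

    for row in result["results"]:
        print(row["_id"], row["value"])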
Honestly, given the configuration time required for our tens of GBs of data, we went with MongoDB on our end. I can imagine using SimpleDB for "must get this running" cases. But configuring a node to run MongoDB is so ridiculously simple that it may be worth skipping the SimpleDB route.
In terms of DAO, there are tons of libraries already for Mongo. The Thrift framework for Cassandra is well supported. You can probably write some simple logic to abstract away connections. But it will be harder to abstract away things more complex than simple CRUD.
Now, 5 years later, it is not hard to set up Mongo on any OS. The documentation is easy to follow, so I do not see setting up Mongo as a problem. Other answers have addressed the question of scalability, so I will try to address the question from the point of view of a developer (what limitations each system has):
I will use S for SimpleDB and M for Mongo.
M is written in C++, S is written in Erlang (not the fastest language)
M is open source and can be installed anywhere; S is proprietary and runs only on Amazon AWS. With S you also pay for a whole bunch of stuff.
S has a whole bunch of strange limitations; M's limitations are far more reasonable. The strangest are:
maximum size of domain (table) is 10 GB
attribute value length (size of field) is 1024 bytes
maximum items in Select response - 2500
maximum response size for Select (the maximum amount of data S can return you) - 1 MB
S supports only a few languages (java, php, python, ruby, .net), M supports way more
both support REST
S has a query syntax very similar to SQL (but far less powerful). With M you need to learn a new syntax which looks like JSON (though it is straightforward to learn the basics)
with M you have to learn how to architect your database. Many people think schemaless means you can throw any junk into the database and extract it with ease, so they may be surprised that the "junk in, junk out" maxim still applies. I assume the same is true of S, but cannot claim it with certainty.
neither allows case-insensitive search. In M you can use a regex to work around this (ugly, and it cannot use an index) without introducing an additional lowercased field or extra application logic (see the sketch after this list).
in S sorting can be done only on one field
because of the 5-second time limit, count in S can behave strangely: if 5 seconds pass and the query has not finished, you end up with a partial count and a token which allows you to continue the query. Application logic is responsible for collecting all this data and summing it up.
in S everything is a UTF-8 string, which makes it a pain in the ass to work with non-string values (like numbers and dates). M's type support is far richer.
neither has transactions or joins
M supports compression, which is really helpful for NoSQL stores, where the same field names are stored over and over again.
S supports just a single kind of index; M has single-field, compound, multikey, geospatial indexes, etc.
both support replication and sharding
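On the case-insensitive search point above, a small PyMongo sketch of the regex workaround (collection and field names are placeholders); note that a case-insensitive regex generally cannot use a normal index efficiently, which is why a lowercased shadow field is the common alternative:

    from pymongo import MongoClient

    users = MongoClient("mongodb://localhost:27017")["appdb"]["users"]  # placeholders

    # Anchored, case-insensitive exact match; the "i" option defeats efficient index use.
    matches = users.find({"name": {"$regex": "^john smith$", "$options": "i"}})
    print(list(matches))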
One of the most important things you should consider is that SimpleDB has a very rudimentary query language. Even basic things like group by, sum, average and distinct, as well as data manipulation, are not supported, so the functionality is not really much richer than Redis/memcached. Mongo, on the other hand, supports a rich query language.
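To make the contrast concrete, a PyMongo sketch of the kind of query SimpleDB cannot express: a group-by with sum and average, plus a distinct (collection and field names are placeholders):

    from pymongo import MongoClient

    orders = MongoClient("mongodb://localhost:27017")["appdb"]["orders"]  # placeholders

    summary = orders.aggregate([
        {"$group": {
            "_id": "$customerId",                 # group by
            "orderCount": {"$sum": 1},            # count per group
            "totalSpent": {"$sum": "$amount"},    # sum
            "avgOrder": {"$avg": "$amount"},      # average
        }},
        {"$sort": {"totalSpent": -1}},
    ])
    print(list(summary))

    print(orders.distinct("status"))              # distinct values of a field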