Best way to host many orient databases - orientdb

For our architecture we are contemplating something like multi-tenancy. In our approach each tenant would get their own database. When I say database, I don't mean server. I mean a database within an OrientDB server.
The question is: is there a best-practice way to do this? The three options we see are:
1. Stand up an entire OrientDB server to host a single database. This seems inefficient, especially since we are going to look towards a clustered / replicated architecture.
2. Put multiple databases into a single OrientDB server. Here I am curious as to scalability. Is there a practical limit to how many databases a single OrientDB cluster can hold? Each tenant may make many connections to the database. If, say, each tenant makes 20 or so database connections and we have 1,000 tenants, I now have 20,000 connections going to the database. Obviously we would have many servers supporting this load, so that would be distributed.
3. Some middle ground where we have a certain number of tenants hosted in each clustered instance of OrientDB. Not sure how to draw the line here.
Just wondering if there are best practices around this? Thanks and keep up the good work.

The physical limits are given by memory size, number of transactions managed per second and number of open files on the OS.
Every db in OrientDB is just a folder on the filesystem. If you never access the db, it does not use system resources, but as soon as you access and query it, OrientDB has to keep files open, establish connections to the clients, allocate disk cache and so on.
My suggestion is to have at most a few tens of small databases on the same OrientDB instance.
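To put rough numbers on that suggestion, here is a minimal sizing sketch using the figures from the question (1,000 tenants, about 20 connections each). The value of 30 databases per instance is only an assumed reading of "a few tens", and `plan_instances` is a hypothetical helper, not part of OrientDB.

```python
import math


def plan_instances(tenants: int, dbs_per_instance: int, conns_per_tenant: int) -> dict:
    """Estimate instance count and connection load for one-database-per-tenant."""
    return {
        # Instances needed if each hosts at most dbs_per_instance databases.
        "instances": math.ceil(tenants / dbs_per_instance),
        # Connections across the whole deployment.
        "total_connections": tenants * conns_per_tenant,
        # Upper bound on connections hitting one fully loaded instance.
        "connections_per_instance": dbs_per_instance * conns_per_tenant,
    }


# 30 databases per instance is an assumption standing in for "a few tens".
plan = plan_instances(tenants=1000, dbs_per_instance=30, conns_per_tenant=20)
print(plan)
```

Under these assumptions the 20,000 connections are spread thinly across instances, so the real constraints are more likely the memory and open-file limits mentioned above than the raw connection count.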

Related

AWS database solution for storing non-relational data

What's the best AWS database for the requirements below?
I need to store around 50,000 - 100,000 entries in the database.
Each entry would have a string as a key and a JSON array as the value.
I should be able to retrieve the JSON array using the key.
The size of the JSON data is around 20-30 KB.
I expect around 10,000 - 40,000 reads per hour.
Around 50,000 - 100,000 writes per week.
I have to consider the cost as well.
Ease of integration/development matters too.
I am a bit confused between MongoDB, DynamoDB and PostgreSQL. Please share your thoughts on this.
DynamoDB:-
DynamoDB is a fully managed proprietary NoSQL database service that supports key-value and document data structures. For the typical use case that you have described in OP, it would serve the purpose.
DynamoDB can handle more than 10 trillion requests per day and support
peaks of more than 20 million requests per second.
DynamoDB has a good AWS SDK for all operations. The read and write capacity units can be configured for the table.
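As a rough illustration of what configuring capacity units would mean for the workload in the question, the sketch below converts the stated peak rates into provisioned units. It assumes each item is about 30 KB, that traffic is spread evenly over time, that reads are strongly consistent (one read capacity unit covers one read per second of up to 4 KB), and that one write capacity unit covers one write per second of up to 1 KB. The variable names are illustrative only.

```python
import math

ITEM_KB = 30                # assumed item size (upper end of 20-30 KB)
READS_PER_HOUR = 40_000     # upper end of the stated read rate
WRITES_PER_WEEK = 100_000   # upper end of the stated write rate

# Average request rates, assuming evenly spread traffic.
reads_per_second = READS_PER_HOUR / 3600
writes_per_second = WRITES_PER_WEEK / (7 * 24 * 3600)

# Capacity consumed by a single request on a 30 KB item.
rcu_per_read = math.ceil(ITEM_KB / 4)   # strongly consistent read, 4 KB units
wcu_per_write = math.ceil(ITEM_KB / 1)  # write, 1 KB units

required_rcu = math.ceil(reads_per_second * rcu_per_read)
required_wcu = math.ceil(writes_per_second * wcu_per_write)

print(f"reads/s={reads_per_second:.2f}, RCU={required_rcu}")
print(f"writes/s={writes_per_second:.3f}, WCU={required_wcu}")
```

Real traffic is rarely uniform, so these averages are a lower bound for sizing rather than a recommendation.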
DynamoDB tables using on-demand capacity mode automatically adapt to
your application’s traffic volume. On-demand capacity mode instantly
accommodates up to double the previous peak traffic on a table. For
example, if your application’s traffic pattern varies between 25,000
and 50,000 strongly consistent reads per second where 50,000 reads per
second is the previous traffic peak, on-demand capacity mode instantly
accommodates sustained traffic of up to 100,000 reads per second. If
your application sustains traffic of 100,000 reads per second, that
peak becomes your new previous peak, enabling subsequent traffic to
reach up to 200,000 reads per second.
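The doubling rule in the quoted passage can be traced with a few lines of arithmetic. This is only a sketch of the behaviour as the quote describes it; `instant_capacity` and `updated_peak` are hypothetical names, not DynamoDB API calls.

```python
def instant_capacity(previous_peak: int) -> int:
    """Traffic that on-demand mode accommodates instantly: double the previous peak."""
    return 2 * previous_peak


def updated_peak(previous_peak: int, sustained: int) -> int:
    """Sustained traffic above the old peak becomes the new previous peak."""
    return max(previous_peak, sustained)


peak = 50_000                     # previous peak, reads per second
first = instant_capacity(peak)    # capacity available right away
peak = updated_peak(peak, first)  # the application sustains that level
second = instant_capacity(peak)   # capacity available after the new peak

print(first, second)
```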
One point to note is that it doesn't let you query the table based on non-key attributes. This means that if you don't know the hash key of an item, you may need to do a full table scan to get the data. However, there is a Secondary Index option which you can explore to get around the problem. You should know all the query access patterns of your use case before you design the table, so that you can make an informed decision.
MongoDB:-
MongoDB is not a fully managed service on AWS. However, you can set up the database yourself using AWS services such as EC2, VPC, IAM, EBS etc. This requires some AWS cloud experience. The other option is to use the MongoDB Atlas service.
MongoDB is more flexible in terms of querying. It also has powerful aggregation functions. There are lots of tools available to query the database directly and explore the data, much as you would with SQL.
In terms of Java API, Spring Data MongoDB can be used to perform typical database operations. There are lots of open source frameworks available in various languages for MongoDB as well (for example, Mongoose for Node.js).
MongoDB has support for many programming languages, and the APIs are mature as well.
PostgreSQL:-
PostgreSQL is available as a fully managed database on AWS through Amazon RDS.
PostgreSQL has become the preferred open source relational database
for many enterprise developers and start-ups, powering leading
geospatial and mobile applications. Amazon RDS makes it easy to set
up, operate, and scale PostgreSQL deployments in the cloud.
I don't think I need to write much about this database and its API. It is a very mature database and has good APIs.
Points to consider:-
Query Access Pattern
Easy setup
Database maintenance
API and frameworks
Community support

mongodb connection handling for multi-tenant architecture

We are currently designing a SaaS application that has a subscriber/user based mode of operation. For example, a single subscriber can have 5, 10 or up to 25 users in their account dependent on what type of package they are on.
At the moment we are going with a single database per tenant approach. This has several advantages for us from the standpoint of the application.
I have read about the connection limits associated with Mongo and I am a little confused and worried. I was hoping someone could clarify it for me in simple terms, as I haven't worked much with Mongo.
From what I understand, there is a hard limit of 20,000 connections available on the mongod process and the mongos processes.
How does that translate to this multi-tenant approach? I am basically trying to assess how I would deploy the application in terms of replica sets, and whether sharding is necessary, so that I don't hit these limits. How does one handle such a scenario, for example if you have 10,000 tenants with multiple users, which would exceed the limit?
Our application generally wouldn't need sharding, as each tenant's collections wouldn't reach the point where they would need to be sharded. From what I understand though, MongoDB will create databases in a round-robin fashion on each shard and will distribute the load, which may help with the connection issues.
This is me just trying to make sense of what I've read and I'm hoping someone can clear this up for me.
Thanks in advance
EDIT
If I just add replica sets, will this alleviate this problem? Even though only the primary can accept writes from what I understand?
You just have to store a database connection in a pool and reuse it if you access the same database again. This will limit the number of connections to a reasonable number unless you are using tens of thousands of databases, which wouldn't be a good idea anyway.
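A minimal sketch of that reuse pattern follows. It is driver-agnostic: `TenantConnectionRegistry` and `fake_connect` are hypothetical names, and the factory stands in for whatever call your MongoDB driver uses to open a connection or pool.

```python
import threading
from typing import Callable, Dict, Generic, List, TypeVar

T = TypeVar("T")


class TenantConnectionRegistry(Generic[T]):
    """Opens one connection (or pool) per tenant database and reuses it."""

    def __init__(self, factory: Callable[[str], T]) -> None:
        self._factory = factory
        self._lock = threading.Lock()
        self._handles: Dict[str, T] = {}

    def get(self, db_name: str) -> T:
        # The lock keeps two threads from opening the same database twice.
        with self._lock:
            handle = self._handles.get(db_name)
            if handle is None:
                handle = self._factory(db_name)
                self._handles[db_name] = handle
            return handle

    def open_count(self) -> int:
        with self._lock:
            return len(self._handles)


# Stand-in factory for illustration; a real one would call the database driver.
created: List[str] = []


def fake_connect(db_name: str) -> dict:
    created.append(db_name)
    return {"db": db_name}


registry = TenantConnectionRegistry(fake_connect)
a = registry.get("tenant_a")
b = registry.get("tenant_a")  # reused, no new connection opened
c = registry.get("tenant_b")
print(registry.open_count(), created)
```

The number of open handles then grows with the number of distinct databases touched, not with the number of requests. A production version would probably also evict idle entries, which this sketch leaves out.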

Should I split NoSQL database on separate servers?

We are still in the early stages of determining whether we go with RDBMS or NoSQL.
One of the areas of interest is, if we go with NoSQL (probably CouchDB, although it could be MongoDB), would separate NoSQL databases on different servers be better than one instance of a NoSQL server?
We will be building a file management system where certain files/videos will be grouped on different servers. Accounts related files/videos will be stored on the accounts server etc. To query for an accounts related file we will most likely search the database on the accounts server.
I can see that in the future someone will say "why can't I search over all servers for a type of file or video?"
Clearly having one database would be better here. However apart from the latency in http requests to query the servers, are there better ways to do this or pros and cons of having a large database?
JD
The idea of (most) NoSQL products is that they provide horizontal scalability. This means that your single logical instance could be spread over dozens of servers. In MongoDB, for example, you can use auto-sharding. For your program, this is completely transparent: your code is (almost) the same as you'd use for a single database server, but the data resides on, say, 5 servers.
There are many upsides: you have a central spot to administer your database, you can query across all databases if necessary, you don't need to mess around w/ multiple db connections in your code, the db automatically balances those collections that need to be balanced, map/reduce operations can run in parallel if the query allows it, etc.
The most important ones for me: there's not much administration overhead, and you don't have to put too much thought into this now, because you can add auto-sharding later.
I would not try to do sharding myself, because it is re-inventing the wheel, and it's not easy either. This was one of the key drivers for NoSQL in the first place.

Best NoSQL approach to handle 100+ million records

I am working on a project where we are batch loading and storing a huge volume of data in an Oracle database, which is constantly queried via Hibernate against a table of 100+ million records (reads are much more frequent than writes).
To speed things up we are using Lucene for some of the queries (especially geo bounding box queries) and the Hibernate second-level cache, but that's still not enough. We still have a bottleneck in Hibernate queries against Oracle (we don't cache the 100+ million table entities in the Hibernate second-level cache because we don't have that much memory).
What additional NoSQL solutions (apart from Lucene) can I leverage in this situation?
Some options I am thinking of are:
Use distributed ehcache (Terracotta) for Hibernate second level to leverage more memory across machines and reduce duplicate caches (right now each VM has its own cache).
Switch completely to an in-memory SQL database like H2, but unfortunately those solutions require loading the 100+ million row tables into a single VM.
Use Lucene for querying and BigTable (or distributed hashmap) for entity lookup by id.
What BigTable implementation will be suitable for this? I was considering HBase.
Use MongoDB for storing data and for querying and lookup by id.
I recommend Cassandra with Elasticsearch for a scalable system (100 million is nothing for them). Use Cassandra for all your data and ES for ad hoc and geo queries. Then you can kill your entire legacy stack. You may need an MQ system like RabbitMQ for data sync between Cassandra and ES.
It really depends on your data sets. The number one rule of NoSQL design is to define your query scenarios first. Once you really understand how you want to query the data, you can look into the various NoSQL solutions out there. The default unit of distribution is the key. You therefore need to be able to split your data between your node machines effectively, otherwise you will end up with a horizontally scalable system with all the work still being done on one node (albeit with better queries, depending on the case).
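The point about key choice can be seen in a small simulation. This is a sketch under simple assumptions (plain hash-modulo placement over 4 nodes, which real systems refine in various ways); `node_for` and `distribution` are hypothetical helpers.

```python
import hashlib
from collections import Counter
from typing import Iterable, List


def node_for(key: str, nodes: int) -> int:
    """Map a key to a node with a stable hash (hash-modulo placement)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % nodes


def distribution(keys: Iterable[str], nodes: int) -> List[int]:
    """Count how many keys land on each node."""
    counts = Counter(node_for(k, nodes) for k in keys)
    return [counts.get(n, 0) for n in range(nodes)]


NODES = 4
# High-cardinality key (e.g. a record id): spreads across all nodes.
by_id = distribution((f"record-{i}" for i in range(10_000)), NODES)
# Low-cardinality key (e.g. one country code shared by every record):
# every record lands on the same node.
by_country = distribution(("US" for _ in range(10_000)), NODES)

print("by id:     ", by_id)
print("by country:", by_country)
```

With the second key, adding nodes does nothing for throughput because one node still does all the work, which is the failure mode described above.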
You also need to think back to the CAP theorem: most NoSQL databases are either CP or AP (and many of the AP ones are eventually consistent), while traditional relational DBMSs are CA. This will impact the way you handle data and the creation of certain things; for example, key generation can become tricky.
Also remember that in some systems such as HBase there is no indexing concept. All your indexes will need to be built by your application logic, and any updates and deletes will need to be managed there as well. With Mongo you can actually create indexes on fields and query them relatively quickly; there is also the possibility of integrating Solr with Mongo. In Mongo you are not limited to querying by ID as you are in HBase, which is a column-family database (aka Google BigTable style) where you essentially have nested key-value pairs.
So once again it comes down to your data: what you want to store, how you plan to store it, and most importantly how you want to access it. The Lily project looks very promising. In the work I am involved with, we take a large amount of data from the web and we store it, analyse it, strip it down, parse it, analyse it, stream it, update it, etc. We don't use just one system but many, each best suited to the job at hand. For this process we use different systems at different stages, as it gives us fast access where we need it, provides the ability to stream and analyse data in real time and, importantly, lets us keep track of everything as we go (as data loss in a prod system is a big deal). I am using Hadoop, HBase, Hive, MongoDB, Solr, MySQL and even good old text files. Remember that productionizing a system using these technologies is a bit harder than installing Oracle on a server; some releases are not as stable, and you really need to do your testing first. At the end of the day it really depends on the level of business resistance and the mission-critical nature of your system.
Another path that no one has mentioned thus far is NewSQL, i.e. horizontally scalable RDBMSs. There are a few out there, like MySQL Cluster (I think) and VoltDB, which may suit your cause.
Again it comes down to understanding your data and the access patterns. NoSQL systems are also non-relational, and are therefore better suited to non-relational data sets. If your data is inherently relational and you need SQL query features that do things like Cartesian products (aka joins), then you may well be better off sticking with Oracle and investing some time in indexing, sharding and performance tuning.
My advice would be to actually play around with a few different systems. Look at:
MongoDB - Document - CP
CouchDB - Document - AP
Redis - In memory key-value (not column family) - CP
Cassandra - Column Family - Available & Partition Tolerant (AP)
HBase - Column Family - Consistent & Partition Tolerant (CP)
Hadoop/Hive
VoltDB - A really good looking product, a relational database that is distributed and might work for your case (may be an easier move). They also seem to provide enterprise support, which may be better suited to a prod env (i.e. give business users a sense of security).
Anyway, that's my 2c. Playing around with the systems is really the only way you're going to find out what really works for your case.
As you suggest, MongoDB (or any similar NoSQL persistence solution) is an appropriate fit for you. We've run tests on MongoDB with significantly larger data sets than the one you're describing, and it works fine. Especially if you're read-heavy, MongoDB's sharding and/or distributing reads across replica set members will allow you to speed up your queries significantly. If your use case allows for keeping your indexes right-balanced, your goal of getting close to 20ms queries should become feasible without further caching.
You should also check out the Lily project (lilyproject.org). They have integrated HBase with Solr. Internally they use message queues to keep Solr in sync with HBase. This allows them to have the speed of solr indexing (sharding and replication), backed by a highly reliable data storage system.
You could group requests, split them by the set of data they touch, and have a single server (or a group of servers) process each set; that way the data can be kept in the cache to improve performance.
For example, say employee and availability data are handled using 10 tables; these can be handled by a small group of servers when you configure the Hibernate cache to load and handle those requests.
For this to work you need a load balancer (one that balances load by business scenario).
I'm not sure how much of this can be implemented here.
At 100M records your bottleneck is likely Hibernate, not Oracle. Our customers routinely have billions of records in the individual fact tables of our Oracle-based data warehouse and it handles them fine.
What kind of queries do you execute on your table?

What is the recommended approach towards multi-tenant databases in MongoDB?

I'm thinking of creating a multi-tenant app using MongoDB. I don't have any guesses in terms of how many tenants I'd have yet, but I would like to be able to scale into the thousands.
I can think of three strategies:
All tenants in the same collection, using tenant-specific fields for security
1 Collection per tenant in a single shared DB
1 Database per tenant
The voice in my head is suggesting that I go with option 2.
Thoughts and implications, anyone?
I have the same problem to solve and am also considering the variants.
As I have years of experience creating SaaS multi-tenant applications, I was also going to select the second option, based on my previous experience with relational databases.
While doing my research I found this article on a MongoDB support site (Wayback Machine link, since the original is gone):
https://web.archive.org/web/20140812091703/http://support.mongohq.com/use-cases/multi-tenant.html
They said to avoid the 2nd option at any cost, which as I understand it is not specific to MongoDB. My impression is that this applies to most of the NoSQL DBs I researched (CouchDB, Cassandra, Couchbase Server, etc.) due to the specifics of the database design.
Collections (or buckets, or whatever they are called in different DBs) are not the same thing as security schemas in an RDBMS. Although they behave as containers for documents, they are useless for applying good tenant separation. I couldn't find a NoSQL database that can apply security restrictions based on collections.
Of course you can use mongodb role based security to restrict the access on database/server level. (http://docs.mongodb.org/manual/core/authorization/)
I would recommend the 1st option when:
You have enough time and resources to deal with the complexity of the design, implementation and testing of this scenario.
You are not going to have many differences in structure and functionality in the database for different tenants.
Your application design will allow tenants to make only minimal customizations at runtime.
You want to optimize space and minimize usage of hardware resources.
You are going to have thousands of tenants.
You want to scale out fast and at good cost.
You are NOT going to back up data per tenant (keep separate backups for each tenant). It is possible to do that even in this scenario, but the effort will be huge.
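Part of the complexity of the 1st option is that every query must be scoped to the calling tenant. A minimal sketch of that idea, assuming a hypothetical `tenant_id` field on every document and plain dictionary filters, could look like this:

```python
from typing import Any, Dict

TENANT_FIELD = "tenant_id"  # hypothetical field name stored on every document


def scoped_filter(tenant_id: str, query: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the query filter restricted to one tenant."""
    # Refuse filters that explicitly target another tenant's data.
    if TENANT_FIELD in query and query[TENANT_FIELD] != tenant_id:
        raise ValueError("query targets a different tenant")
    return {**query, TENANT_FIELD: tenant_id}


open_tickets = scoped_filter("acme", {"status": "open"})
print(open_tickets)
```

This only checks the top level of the filter; nested operators would need their own handling, which is the kind of design and testing effort the list above refers to.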
I would go for variant 3 if:
You are going to have a small list of tenants (several hundred).
The specifics of the business require you to support big differences in the database structure for different tenants (e.g. integration with 3rd-party systems, import-export of data).
Your application design will allow customers (tenants) to make significant changes to the application at runtime (adding modules, customizing fields, etc.).
You have enough resources to scale out with new hardware nodes quickly.
You are required to keep versions/backups of data per tenant. Restores will also be easy.
There are legal/regulatory restrictions that force you to keep different tenants in different databases (or even data centers).
You want to fully utilize the out-of-the-box security features of MongoDB, such as roles.
There are big differences in size between tenants (you have many small tenants and a few very large tenants).
If you post additional details about your application, perhaps I can give you more detailed advice.
I found a good answer in the comments in this link:
http://blog.boxedice.com/2010/02/28/notes-from-a-production-mongodb-deployment/
Basically option #2 seems to be the best way to go.
Quote from David Mytton's comment:
We decided not to have a database per customer because of the way MongoDB allocates its data files. Each database uses its own set of files:
The first file for a database is dbname.0, then dbname.1, etc. dbname.0 will be 64MB, dbname.1 128MB, etc., up to 2GB. Once the files reach 2GB in size, each successive file is also 2GB.
Thus if the last datafile present is say, 1GB, that file might be 90% empty if it was recently reached.
from the manual.
As users sign up to the trial and give things a go, we’d get more and more databases that were at least 2GB in size, even if the whole of the data file wasn’t used. We found this used a massive amount of disk space compared to having several databases for all customers where the disk space can be used to maximum efficiency.
Sharding will be on a per collection basis as standard which presents a problem where the collection never reaches the minimum size to start sharding, as is the case for quite a few of ours (e.g. collections just storing user login details). However, we have requested that this will also be able to be done on a per database level. See
http://jira.mongodb.org/browse/SHARDING-41
There are no performance tradeoffs using lots of collections. See
http://www.mongodb.org/display/DOCS/Using+a+Large+Number+of+Collections
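The disk-space argument in the quote can be checked with a short calculation. The sketch below models only the file-size sequence quoted from the manual (64MB, doubling up to 2GB, then 2GB each) and ignores any other per-database overhead; `allocated_mb` is a hypothetical helper, and the tenant figures are made-up inputs for illustration.

```python
def allocated_mb(data_mb: float, first_file_mb: int = 64, max_file_mb: int = 2048) -> int:
    """Disk space allocated for one database holding data_mb of data.

    Files are added in the quoted sequence (64, 128, 256, ... capped at
    max_file_mb) until they can hold the data. At least one file always exists.
    """
    allocated = 0
    size = first_file_mb
    while allocated == 0 or data_mb > allocated:
        allocated += size
        size = min(size * 2, max_file_mb)
    return allocated


TENANTS = 1000       # illustrative input
DATA_PER_TENANT = 10  # MB per tenant, illustrative input

# One database per tenant: every tenant pays for at least the first file.
per_tenant_total = TENANTS * allocated_mb(DATA_PER_TENANT)
# One shared database holding everyone's data.
shared_total = allocated_mb(TENANTS * DATA_PER_TENANT)

print(per_tenant_total, shared_total)
```

Under this simplified model the per-tenant layout allocates several times the space of the shared one for the same data, which matches the direction of the argument in the quote.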
There is a reasonable article on MSDN about multi-tenant data architecture which you might wish to refer to. Some key topics touched on by this article:
Economic considerations
Security
Tenant considerations
Regulatory (legal)
Skill set concerns
Also touched upon are some patterns for Software as a Service (SaaS) configuration.
Additionally, worth a gander is an interesting write-up from the SQL Anywhere guys.
My own personal take: unless you are certain of enforced security / trust, I would go with option 3 or, if scalability concerns prohibit that, fall back to option 2 at a minimum. That said... I'm no pro with MongoDB. I get pretty nervous using a shared "schema", but I will happily defer to more experienced practitioners.
I would go for option 2.
However, you could set the mongod.exe command line option --smallfiles. This means that the largest extent file will be 0.5 GB rather than 2 GB. I tested this with mongo 1.42. So option 3 is not impossible.
According to my research (see "MongoDB. Trucos y consejos. Aplicaciones multitenant."), that option is not recommended if you do not know how many tenants you will have: there could be thousands, which would be complicated when it comes to sharding, and imagine having thousands of collections in a single database. So in your case it is recommended to use option one. If you are going to have a limited number of users, it is a different matter, and yes, you could use option two as you thought.
While the discussion here is on NoSQL and primarily MongoDB, we at Citus are using PostgreSQL and building a distributed/sharded multi-tenant database.
Our use-case guide walks through an example app, covering the schema and various multi-tenant specific features.
For more unstructured data, we use PostgreSQL's JSONB column type to store it alongside tenant-specific data.