Since I have zero experience developing web applications that can be scaled horizontally, I need someone with experience to guide me in the right direction.
I had difficulty figuring out the right way to store login sessions in the database, which led me to the question: is it even right to store them in a database when I am planning to use replication in the future? And if not, what are the alternatives?
I need different clients (Android, Windows, ...) to connect to the server with their own sessions tied to the same user, and I am using:
1 - CentOS as the OS
2 - PostgreSQL as the DBMS
3 - TomEE as the HTTP server and servlet container
4 - Partitioned tables (inherited tables in PostgreSQL) to improve performance, increase the chance of in-memory index scans, prevent fragmentation, etc.
My problem arises from the fact that I need to check session validity on every request received from clients (every session has its own encryption keys), and it is possible to have millions of sessions. In a distributed environment I cannot be sure that a newly created session will already be available in a replicated database at the right time.
Thanks for helping
Storing user sessions in an RDBMS will eventually degrade the performance of your application. You should take a look at distributed caching mechanisms to store and read user sessions. I strongly recommend a NoSQL solution for user sessions, such as Redis, which gives you in-memory storage of your key-value pairs and responds very quickly. In addition, you can tune the configuration file to persist the in-memory key-value pairs to disk. You also need to think about distributing your key-value pairs across multiple NoSQL instances and about internal data structures such as hash maps. You can use your own hashing algorithm to spread your key-value pairs over multiple instances, which gives you high availability. Keep in mind that sharding your data horizontally brings its own issues; see the CAP theorem (availability and partition tolerance) for details.
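As a rough illustration of the approach, here is a minimal sketch of session storage in Redis using the Jedis client for Java. The key layout (`session:<id>`), the stored fields, and the 30-minute TTL are assumptions for the example, not something prescribed above.

```java
import java.util.UUID;
import redis.clients.jedis.Jedis;

// Minimal sketch: each session is a Redis hash with a TTL.
// Key layout and field names are illustrative assumptions.
public class RedisSessionStore {

    private final Jedis jedis = new Jedis("localhost", 6379);

    public String createSession(String userId, String encryptionKey) {
        String sessionId = UUID.randomUUID().toString();
        String key = "session:" + sessionId;
        jedis.hset(key, "userId", userId);        // which user owns this session
        jedis.hset(key, "encKey", encryptionKey); // per-session encryption key
        jedis.expire(key, 1800);                  // evict idle sessions after 30 minutes
        return sessionId;
    }

    // Returns null if the session does not exist or has expired.
    public String lookupUser(String sessionId) {
        return jedis.hget("session:" + sessionId, "userId");
    }
}
```

On every incoming request you would call something like `lookupUser` (or fetch the whole hash) before doing any work; because Redis keeps the data in memory, that check stays cheap even with millions of sessions.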
What's the best AWS database for the below requirements?
I need to store around 50,000-100,000 entries in the database.
Each entry would have a String as the key and a JSON array as the value.
I should be able to retrieve the JSON array using the key.
The size of the JSON data is around 20-30 KB.
I expect around 10,000-40,000 reads per hour.
Around 50,000-100,000 writes per week.
I have to consider the cost as well.
Ease of integration/development matters too.
I am a bit confused between MongoDB, DynamoDB and PostgreSQL. Please share your thoughts on this.
DynamoDB:-
DynamoDB is a fully managed proprietary NoSQL database service that supports key-value and document data structures. For the typical use case you have described in the OP, it would serve the purpose.
DynamoDB can handle more than 10 trillion requests per day and support
peaks of more than 20 million requests per second.
DynamoDB has good AWS SDK for all operations. The read and write capacity units can be configured for the table.
DynamoDB tables using on-demand capacity mode automatically adapt to
your application’s traffic volume. On-demand capacity mode instantly
accommodates up to double the previous peak traffic on a table. For
example, if your application’s traffic pattern varies between 25,000
and 50,000 strongly consistent reads per second where 50,000 reads per
second is the previous traffic peak, on-demand capacity mode instantly
accommodates sustained traffic of up to 100,000 reads per second. If
your application sustains traffic of 100,000 reads per second, that
peak becomes your new previous peak, enabling subsequent traffic to
reach up to 200,000 reads per second.
One point to note is that it doesn't let you query the table on non-key attributes. This means that if you don't know the hash key of an item, you may need to do a full table scan to get the data. However, there is a secondary index option which you can explore to get around this problem. You should know all the query access patterns of your use case before you design the table, so you can make an informed decision.
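To make the key-value access pattern concrete, here is a hedged sketch using the AWS SDK for Java v2. The table name `Entries` and the partition key `entryKey` are assumptions for the example; the table would need to exist with that key schema.

```java
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.GetItemRequest;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

public class DynamoDbSketch {
    public static void main(String[] args) {
        DynamoDbClient dynamo = DynamoDbClient.create();
        String table = "Entries"; // hypothetical table, partition key "entryKey" (String)

        // Write the JSON array as a string value under its key.
        dynamo.putItem(PutItemRequest.builder()
                .tableName(table)
                .item(Map.of(
                        "entryKey", AttributeValue.builder().s("user-42").build(),
                        "payload", AttributeValue.builder().s("[{\"a\":1},{\"b\":2}]").build()))
                .build());

        // Read it back by key: a single GetItem call, no table scan.
        Map<String, AttributeValue> item = dynamo.getItem(GetItemRequest.builder()
                .tableName(table)
                .key(Map.of("entryKey", AttributeValue.builder().s("user-42").build()))
                .build()).item();

        System.out.println(item.get("payload").s());
    }
}
```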
MongoDB:-
MongoDB is not a fully managed service on AWS. However, you can set up the database yourself using AWS services such as EC2, VPC, IAM, EBS, etc. This requires some AWS cloud experience. The other option is to use the MongoDB Atlas service.
MongoDB is more flexible in terms of querying. It also has a powerful aggregation framework. There are lots of tools available to query the database directly and explore the data, much as you would with SQL.
In terms of a Java API, Spring Data MongoDB can be used to perform typical database operations. There are lots of open-source frameworks available in various languages for MongoDB as well (for example, Mongoose for Node.js).
MongoDB has support for many programming languages, and the APIs are mature as well.
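As a rough sketch of the same key-to-JSON-array use case with the plain MongoDB Java driver (the database, collection, and field names are made up for the example):

```java
import static com.mongodb.client.model.Filters.eq;

import java.util.List;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class MongoSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> entries =
                    client.getDatabase("app").getCollection("entries");

            // Store a JSON array under a string key.
            entries.insertOne(new Document("entryKey", "user-42")
                    .append("payload", List.of(new Document("a", 1), new Document("b", 2))));

            // Retrieve it by key (add a unique index on entryKey for fast lookups).
            Document found = entries.find(eq("entryKey", "user-42")).first();
            System.out.println(found.toJson());
        }
    }
}
```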
PostgreSQL:-
PostgreSQL is available as a fully managed database on AWS through Amazon RDS.
PostgreSQL has become the preferred open source relational database
for many enterprise developers and start-ups, powering leading
geospatial and mobile applications. Amazon RDS makes it easy to set
up, operate, and scale PostgreSQL deployments in the cloud.
I don't think I need to write much about this database and its API. It is a very mature database and has good APIs.
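For completeness, a hedged sketch of the same key-to-JSON-array pattern on PostgreSQL using plain JDBC and a `jsonb` column; the connection details and table name are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PostgresJsonSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and credentials.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/app", "app", "secret")) {

            try (PreparedStatement ddl = conn.prepareStatement(
                    "CREATE TABLE IF NOT EXISTS entries (entry_key text PRIMARY KEY, payload jsonb)")) {
                ddl.execute();
            }

            // Upsert the JSON array under its key.
            try (PreparedStatement upsert = conn.prepareStatement(
                    "INSERT INTO entries (entry_key, payload) VALUES (?, ?::jsonb) "
                    + "ON CONFLICT (entry_key) DO UPDATE SET payload = EXCLUDED.payload")) {
                upsert.setString(1, "user-42");
                upsert.setString(2, "[{\"a\":1},{\"b\":2}]");
                upsert.executeUpdate();
            }

            // Retrieve it by key.
            try (PreparedStatement query = conn.prepareStatement(
                    "SELECT payload FROM entries WHERE entry_key = ?")) {
                query.setString(1, "user-42");
                try (ResultSet rs = query.executeQuery()) {
                    if (rs.next()) {
                        System.out.println(rs.getString("payload"));
                    }
                }
            }
        }
    }
}
```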
Points to consider:-
Query Access Pattern
Easy setup
Database maintenance
API and frameworks
Community support
I am still not able to see, in practical terms, how NoSQL is beneficial when we have indexes in traditional RDBMSs too. Can someone explain the advantages of columnar databases in a real application, particularly for structured, semi-structured, or unstructured data?
Largely, it depends on what you want your datastore to do. If you want to be able to scale to meet storage or operational demands, an RDBMS can only take you so far.
It comes down to how you can scale to meet demand. An RDBMS is really only capable of scaling vertically: add more RAM, add more disk, etc. A distributed (NoSQL) database makes scaling easier by allowing you to add more machine instances. This is known as scaling horizontally.
Here's an example using Cassandra:
Let's say I have a 3 node cluster, and my keyspace (database) is also configured with a replication factor (RF) of 3. This means that each node is responsible for 100% of the data. I load my data, and it takes up 100GB of disk space (on each node). Now, while I might have 300GB of data total in my cluster, a single copy of my data is 100GB.
So my product team comes to me and says they need to double the amount of data they have. I know that I built their 3-node cluster with 200GB drives. If I did nothing, those drives would pretty much fill up (and even if they didn't, they wouldn't leave room for much else).
Now it's up to me to scale the cluster to meet their space demands. I'll start by adding 3 new nodes to the cluster (for a total of 6), but I'll leave my RF at 3. This makes each node responsible for 50% of the data, or 50GB. When my product team loads more data to meet their "doubling" requirement, each node should climb back up to about 100GB. A single copy of the data is now 200GB. But with each node responsible for 50%, each 200GB drive still only has 100GB.
Example #2:
Let's say that the cluster above with 6 nodes is capable of supporting an operational load of 10,000 operations per second (ops). My product team comes to me again, saying that for the holiday season they project needing to support 20,000 ops. As the current cluster can only support half of that, it will choke under the intense throughput, and one or more nodes may crash.
As Cassandra scales linearly, the way to achieve this is to (again) double the size of the cluster. So I increase it from 6 nodes to 12 nodes, while still maintaining my RF of 3. After running some performance testing, they verify that it can indeed support 20,000 ops. As a single copy of my data is 200GB, the total data footprint remains 600GB. With 12 nodes, each node is now responsible for only 25% of the data, or 50GB.
So scalability is the advantage. But how about modeling the data? The main idea in distributed database modeling is two-fold:
Build a table structure which is keyed to distribute well. We don't want uneven amounts of data on each node.
Build the key on the table so that it matches our query requirements.
One of the drawbacks of a NoSQL database, is that your query patterns become restricted. In an effort to cut down on network time, you want to ensure that your query can be served by a single node.
This usually means using natural keys, as those are more in line with what you are asking of your data. Surrogate keys (alphabetic, numeric, or both) distribute well, but aren't really useful for querying. User "Bob Jones" might be id "3582346556230" in my system, but when I want to query Bob's data, I'll probably never ask for it by "3582346556230," because that doesn't mean anything to the application or the context in which the data is used.
Also, you want your data to have structure. Unstructured data is un-queryable data, simple as that. If you want unstructured data to be queryable, you need to parse out its identifying aspects to use as keys. You don't want to "search" or run SELECT * FROM queries. Full table scans in NoSQL databases are even more resource-consuming than their RDBMS counterparts, because they have to check each node and sort through replicas, and thus incur extra network time.
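To make the keying idea concrete, here is a rough sketch using the DataStax Java driver. The keyspace, table, and column names are invented for the example, the datacenter name `datacenter1` is also an assumption, and it presumes a reachable local Cassandra node.

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;

public class CassandraKeySketch {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Keyspace with RF = 3, matching the three-copies examples above.
            session.execute("CREATE KEYSPACE IF NOT EXISTS shop "
                    + "WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': 3}");

            // Keyed by a natural key (user_name) so one user's orders live together
            // and a query for them is served without touching the whole cluster.
            session.execute("CREATE TABLE IF NOT EXISTS shop.orders_by_user ("
                    + "user_name text, order_id timeuuid, total decimal, "
                    + "PRIMARY KEY (user_name, order_id))");

            session.execute("INSERT INTO shop.orders_by_user (user_name, order_id, total) "
                    + "VALUES ('Bob Jones', now(), 42.50)");

            // Query by the natural key, never by a surrogate like '3582346556230'.
            for (Row row : session.execute(
                    "SELECT order_id, total FROM shop.orders_by_user WHERE user_name = 'Bob Jones'")) {
                System.out.println(row.getUuid("order_id") + " -> " + row.getBigDecimal("total"));
            }
        }
    }
}
```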
NoSQL databases give you the ability to scale (for increases in data or demand). But it's important to note that their scalability can make some things (which a RDBMS might be good at), more difficult than you're used to.
The R in RDBMS, relational, is the biggest thing missing from Mongo. There is little to no way to make the database understand how entries in different collections (tables) relate to each other. One of the big strengths of RDBMSs is the ability to define constraints which the database will enforce, most typically foreign key constraints, which ensure that an id in one table refers to an existing id in another table.
One requirement for the database to be able to enforce such constraints is obviously that everything needs to go through one source of truth and there needs to be one central entity cross-checking the data; it cannot be decentralised since discrepancies between two different primary sources can lead to data inconsistencies.
In Mongo, each data blob is pretty much independent. It doesn't refer to other entries in any way enforced by the database. Mongo also has weak to no ACID guarantees, meaning there's little protection against race conditions on inserts or updates. In short: Mongo makes few guarantees with regard to data consistency and mostly offloads these concerns to the application layer. That allows it to work in a more decentralised way.
For example, a good way to scale Mongo is to have many secondary servers which replicate a primary server for read-only access. There's no guarantee that the primary and secondaries will be in sync at any given time; it may take a couple of seconds for data written to the primary to trickle down to the secondaries. But this allows you to have a virtually unlimited number of secondary read-only servers, which is great for scaling a database under heavy read load.
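A minimal sketch of what that looks like from the Java driver, assuming a hypothetical replica set `rs0` with placeholder host, database, and collection names:

```java
import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import com.mongodb.ReadPreference;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

public class SecondaryReadSketch {
    public static void main(String[] args) {
        MongoClientSettings settings = MongoClientSettings.builder()
                .applyConnectionString(new ConnectionString(
                        "mongodb://db1.example:27017,db2.example:27017/?replicaSet=rs0"))
                // Route reads to secondaries when one is available,
                // accepting that the data may lag the primary slightly.
                .readPreference(ReadPreference.secondaryPreferred())
                .build();

        try (MongoClient client = MongoClients.create(settings)) {
            Document doc = client.getDatabase("app").getCollection("tweets")
                    .find(new Document("author", "bob")).first();
            System.out.println(doc);
        }
    }
}
```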
The way specifically Mongo handles its clusters also allows it to have a very high uptime, as the cluster will reorganise itself into primaries and secondaries automatically if a server goes down. This even allows for rolling maintenance without any client downtime.
Not having to enforce complex constraints or transactional consistency during writes also allows a more fire-and-forget style of writing to the database, which can be much faster, again at the cost of allowing inconsistent data. This is why most writes boil down to atomically updating a single document in a collection with no guarantees about other documents, which is a rather different paradigm from RDBMS transactional updates across many tables.
I would not recommend Mongo for storing things like a financial ledger, which heavily relies on transactional guarantees for consistency. However, things like Twitter are a perfect case for it: many independent snippets of data which must be read by a massive number of clients.
I'm reading about ArangoDB and it looks interesting, but I can't find in the documentation how ArangoDB scales. Does ArangoDB scale, and can it use sharding like MongoDB or CouchDB?
EDIT
ArangoDB supports sharding since Version 2.0.
Version 3.0 will bring VelocyPack, which is a binary JSON representation optimized for compactness, parseability and composeability. It supersedes the shape concept / shaped JSON.
/EDIT
I am the chief architect of ArangoDB.
monkegjinni is right: ArangoDB did not support sharding, only replication. Why?
Short version:
Offering support for fairly complex data models like graphs and documents conflicts with how sharding works. However, given the efficiency of modern SSDs and computers, we believe that almost all projects no longer need sharding. Today's computers can easily store all the data on a single node. What these projects need is replication for load distribution, which ArangoDB supports.
Long version:
There are actually two separate scaling issues.
The first issue is distributing the request over several servers to balance the request load.
ArangoDB will support this through synchronous replication of writes and distribution of the read requests.
Note that most database systems follow a very similar path, i.e., they either support distributing the requests with restricted consistency guarantees, or they allow writes only on one node and distribute the read requests. They have this restriction because distributing write requests while supporting full consistency is impossible to do efficiently, and doing it inefficiently would negate the gain we wanted to achieve through distribution.
The second issue is distributing the data over several servers to allow larger datasets.
ArangoDB does not support distributing the data over several servers.
We have made this decision, because distributing the data over several servers always comes at a price.
This price can be very explicit. For example, the data model may be very limited. This is the route that key-value stores such as Dynamo or Riak have taken. Here the data model and the supported queries are so simple that it is always possible to direct a query to the server (or the small number of servers) on which the requested value lives.
Note that we do believe this approach is valid for some applications (e.g. Amazon's database). But we believe that the number of applications that truly need to store so much data that they must distribute it over a large number of servers, and must therefore restrict the access pattern to key-value, is very small.
Or the price can be hidden. This is for example the case if the data is distributed and the database system allows general queries. In that case the query must be distributed over all servers (because the data you are looking for may live on any of the servers). That makes the queries inefficient.
The ArangoDB approach is rather to squeeze the most out of one server (ArangoDB does support multiple servers, but for availability). For this it uses two main strategies.
One strategy is to make use of SSDs. Note that the capacity of SSDs is growing at an incredible rate (you can buy a terabyte of SSD for far less money than a second server would cost you). And endurance (the total amount of data that can be written to an SSD) goes up to petabytes now that vendors finally get the wear-leveling algorithms right, so reliability of SSDs is no longer an issue. And the performance of those SSDs is very good (closer to main memory than to ordinary disks).
The other strategy is to store the data efficiently. ArangoDB uses shapes to store documents: a shape is the information about which attributes and attribute types a document has, and all documents with the same shape share the representation of this information. This means that documents can be stored in less space than the JSON or BSON representation would require.
As I understand it, ArangoDB does not support sharding (prior to version 2.0), only replication.
From the link:
AvocadoDB effortlessly permits replications. We like the “zero-admin principle”. Making replications with AvocadoDB is really easy: Insert IP address and go!
Following replication types are intended for version 2:
master-master synchronous,
master-master asynchronous,
master-slave synchronous,
master-slave asynchronous
I am working on a project where we are batch loading and storing a huge volume of data in an Oracle database, which is constantly queried via Hibernate against this 100+ million record table (reads are much more frequent than writes).
To speed things up we are using Lucene for some of the queries (especially geo bounding box queries) and the Hibernate second-level cache, but that's still not enough. We still have a bottleneck in Hibernate queries against Oracle (we don't cache the 100+ million table entities in the Hibernate second-level cache due to lack of that much memory).
What additional NoSQL solutions (apart from Lucene) can I leverage in this situation?
Some options I am thinking of are:
Use distributed Ehcache (Terracotta) for the Hibernate second-level cache to leverage more memory across machines and reduce duplicate caches (right now each VM has its own cache).
Completely switch to an in-memory SQL database like H2, but unfortunately those solutions require loading the 100+ million record table into a single VM.
Use Lucene for querying and BigTable (or distributed hashmap) for entity lookup by id.
What BigTable implementation will be suitable for this? I was considering HBase.
Use MongoDB for storing data and for querying and lookup by id.
I recommend Cassandra with Elasticsearch for a scalable system (100 million records is nothing for them). Use Cassandra for all your data and ES for ad hoc and geo queries. Then you can retire your entire legacy stack. You may need an MQ system like RabbitMQ for data sync between Cassandra and ES.
It really depends on your data sets. The number one rule of NoSQL design is to define your query scenarios first. Once you really understand how you want to query the data, then you can look into the various NoSQL solutions out there. The default unit of distribution is the key. Therefore you need to be able to split your data between your node machines effectively; otherwise you will end up with a "horizontally scalable" system where all the work is still being done on one node (albeit with better queries, depending on the case).
You also need to think back to the CAP theorem: most NoSQL databases favour partition tolerance and trade off either consistency or availability (CP or AP), while traditional relational DBMSs behave like CA systems. This will impact the way you handle data and create certain things; for example, key generation can become tricky.
Also remember that in some systems, such as HBase, there is no indexing concept. All your indexes will need to be built by your application logic, and any updates and deletes will need to be managed as such. With Mongo you can actually create indexes on fields and query them relatively quickly, and there is also the possibility of integrating Solr with Mongo. In Mongo you are not limited to querying by ID as you are in HBase, which is a column-family (aka Google BigTable-style) database where you essentially have nested key-value pairs.
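For instance, a hedged sketch of creating and using a secondary index on a non-key field with the MongoDB Java driver (the collection and field names are made up); the equivalent lookup in HBase would mean maintaining your own index table or scanning the column family:

```java
import static com.mongodb.client.model.Filters.eq;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class MongoIndexSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> pages =
                    client.getDatabase("crawler").getCollection("pages");

            // Secondary index on a non-key field.
            pages.createIndex(Indexes.ascending("domain"));

            // This query is now an index lookup rather than a collection scan.
            Document hit = pages.find(eq("domain", "example.com")).first();
            System.out.println(hit);
        }
    }
}
```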
So once again it comes down to your data: what you want to store, how you plan to store it, and most importantly how you want to access it. The Lily project looks very promising.
In the work I am involved with, we take a large amount of data from the web and we store it, analyse it, strip it down, parse it, analyse it again, stream it, update it, and so on. We don't just use one system but many that are best suited to the job at hand. For this process we use different systems at different stages, as this gives us fast access where we need it, provides the ability to stream and analyse data in real time, and, importantly, lets us keep track of everything as we go (data loss in a prod system is a big deal). I am using Hadoop, HBase, Hive, MongoDB, Solr, MySQL and even good old text files. Remember that productionizing a system using these technologies is a bit harder than installing Oracle on a server; some releases are not as stable, and you really need to do your testing first. At the end of the day it really depends on the level of business resistance and the mission-critical nature of your system.
Another path that no one has mentioned so far is NewSQL, i.e. horizontally scalable RDBMSs. There are a few out there, like MySQL Cluster (I think) and VoltDB, which may suit your cause.
Again, it comes down to understanding your data and the access patterns. NoSQL systems are also non-relational and are therefore better suited to non-relational data sets. If your data is inherently relational and you need SQL query features that really need to do things like Cartesian products (aka joins), then you may well be better off sticking with Oracle and investing some time in indexing, sharding and performance tuning.
My advice would be to actually play around with a few different systems. Look at:
MongoDB - Document - CP
CouchDB - Document - AP
Redis - In memory key-value (not column family) - CP
Cassandra - Column Family - Available & Partition Tolerant (AP)
HBase - Column Family - Consistent & Partition Tolerant (CP)
Hadoop/Hive
VoltDB - A really good-looking product, a relational database that is distributed and might work for your case (it may be an easier move). They also seem to provide enterprise support, which may be more suited for a prod env (i.e. give business users a sense of security).
Anyway, that's my 2c. Playing around with the systems is really the only way you're going to find out what really works for your case.
As you suggest, MongoDB (or any similar NoSQL persistence solution) is an appropriate fit for you. We've run tests with significantly larger data sets than the one you're describing on MongoDB, and it works fine. Especially if you're read-heavy, MongoDB's sharding and/or distributing reads across replica set members will allow you to speed up your queries significantly. If your use case allows for keeping your indexes well balanced, your goal of getting close to 20ms queries should become feasible without further caching.
You should also check out the Lily project (lilyproject.org). They have integrated HBase with Solr. Internally they use message queues to keep Solr in sync with HBase. This allows them to have the speed of solr indexing (sharding and replication), backed by a highly reliable data storage system.
You could group requests and split them by a specific set of data, and have a single server (or a group of servers) process that group; the relevant data can then be kept in a cache to improve performance.
For example, say employee and availability data are handled using 10 tables; these can be handled by a small group of servers once you configure the Hibernate cache to load and handle those requests.
For this to work you need a load balancer (one that balances load by business scenario).
I'm not sure how much of this can be implemented here.
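As a rough sketch of the Hibernate side of this, the second-level cache can be enabled with settings along these lines. The Ehcache region factory shown assumes the `hibernate-ehcache` module is on the classpath, entities still need to be marked cacheable, and exact property values depend on your Hibernate version.

```java
import java.util.Properties;

public class HibernateCacheSketch {

    // Illustrative settings enabling Hibernate's second-level and query caches
    // backed by Ehcache; adjust for your Hibernate version and cache provider.
    public static Properties secondLevelCacheProperties() {
        Properties props = new Properties();
        props.setProperty("hibernate.cache.use_second_level_cache", "true");
        props.setProperty("hibernate.cache.use_query_cache", "true");
        props.setProperty("hibernate.cache.region.factory_class",
                "org.hibernate.cache.ehcache.EhCacheRegionFactory");
        return props;
    }
}
```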
At 100M records, your bottleneck is likely Hibernate, not Oracle. Our customers routinely have billions of records in the individual fact tables of our Oracle-based data warehouse, and it handles them fine.
What kind of queries do you execute on your table?
I'm evaluating a storage platform for an upcoming project and keep coming back to Cassandra. For this project, losing any amount of data is unacceptable. So far we've used a relational database (Microsoft SQL Server), but the data is so varied and large that it has become an issue to store and query.
Is Cassandra robust enough to use as a primary data store? Or should it only be used to mirror existing data to speed up access?
Anecdotally: yes, Twitter, Digg, Ooyala, SimpleGeo, Mahalo, and others are using or moving to Cassandra for a primary data store (http://n2.nabble.com/Cassandra-users-survey-td4040068.html).
Technically: yes; besides supporting replication (including to multiple datacenters), each Cassandra node has an fsync'd commit log to make sure writes are durable; from there writes are turned into SSTables which are immutable until compaction (which combines multiple SSTables to GC old versions). Snapshotting is supported at any time, including automatic snapshot-before-compaction.
Whether to use Cassandra for your application or not depends purely on your data workloads. Cassandra is optimised for write-intensive workloads; therefore, it is suitable for applications where a large amount of data needs to be inserted (such as infrastructure logging information at Facebook).
If, however, you require fast retrieval and insertion speed is not an issue, then perhaps you should have a look at HBase (which is optimised for read-intensive workloads).