I want to design a product that allows customers to create their own websites. A customer will be able to maintain his website's data model on the fly, run queries against it and display the output on an HTML page. I doubt a traditional RDBMS is the right choice, for two reasons: with every customer the amount of data grows, so the RDBMS might reach its limits even when scaled, and because the data model is highly dynamic, the many DDL statements will slow down the whole system.
I'm trying to figure out which database/storage system might be the best option for such a system. Recently I have read a lot about NoSQL solutions like Cassandra and MongoDB, and they look promising in terms of performance, but they come with a flaw: the data is not relational, so it has to be denormalized.
I don't know what the impact of denormalizing a dynamic, customer-defined data model will be, because the customer models and inserts data first (in a relational way) and only then runs queries on it. The denormalization has to happen automatically, which leads to another problem: can I create one table for each query, even if some queries are similar? There might be a lot of redundant data after a while.
Does creating and updating tables on the fly have any performance impact?
Every time the customer changes data, the same data has to be changed in every table that holds a copy of the same entity (e.g. an employee's name has to be changed in "team member" and also in "project task"). Are those updates costly?
Is it possible to nest data with unlimited depth like {"team": {"members": [{"name": "Ben"}]}}?
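From what I've read, something like this should work in MongoDB (a pymongo sketch of what I mean; apparently the nesting isn't literally unlimited, since documents are capped at 100 nesting levels and 16 MB):

```python
# A sketch of the nesting I have in mind, using the pymongo driver
# (assuming a local MongoDB instance; names are made up for illustration).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["site_builder"]

# Documents can nest objects and arrays to (practically) arbitrary depth.
db.teams.insert_one({
    "team": {
        "members": [
            {"name": "Ben", "tasks": [{"title": "Deploy", "subtasks": []}]}
        ]
    }
})

# Dot notation reaches into nested fields and arrays.
doc = db.teams.find_one({"team.members.name": "Ben"})
print(doc)
```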
There might be even better or entirely different approaches; I'm happy for any hints.
Adding clarification to the requirements
My question actually is: how can I use a NoSQL DB like Cassandra to maintain relational data, and will the solution still perform better than an RDBMS?
The customer thinks relationally (because in my opinion data is always relational), no matter what DBMS is used. And this service is not about letting the customer choose the underlying data storage; there can only be one.
A customer can define his own relational data model using a management frontend provided by the application. The data model may be changed at any time by the customer. In an RDBMS, running DDL on a production system is not a good idea. On top of the data schema, the customer can define named queries and use them as a data source on any web page he creates.
An example would be a query for news, given the name "news"; in a web page it would be used like <ul><li query="news"><h1>[news.title]</h1></li></ul>, which would execute the query, iterate over the results and repeat the <li> on each iteration. That is the simplest example, though.
In more complex examples, SQL might require extensive use of subqueries, which perform badly. With NoSQL there seems to be the option to denormalize first, preparing a table with exactly the data a query needs, and then just read from that table. Any change to the involved data would trigger an update of that table. That means for every query the customer creates, the system will automatically create and maintain a table and its data, so there will be a lot of data redundancy. Benchmarks state that Cassandra is fast at writing, so that might be an option.
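To make that concrete, here is roughly what I imagine the system would generate for the "news" query, sketched in Python with the DataStax Cassandra driver (the keyspace, table and column names are my own invention):

```python
# A sketch of the "one table per named query" idea with the DataStax
# Cassandra driver. Keyspace, table, and column names are illustrative.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("site_builder")

# A table materialized for the customer's "news" query, keyed the way
# the query reads it (newest first per site).
session.execute("""
    CREATE TABLE IF NOT EXISTS q_news (
        site_id uuid,
        published timestamp,
        news_id uuid,
        title text,
        body text,
        PRIMARY KEY (site_id, published)
    ) WITH CLUSTERING ORDER BY (published DESC)
""")

def update_news(site_id, published, news_id, title, body):
    # Every write fans out to each query table that copies this entity;
    # Cassandra's cheap writes are what would make this bearable.
    session.execute(
        "INSERT INTO q_news (site_id, published, news_id, title, body) "
        "VALUES (%s, %s, %s, %s, %s)",
        (site_id, published, news_id, title, body),
    )

def render_news(site_id):
    # The page's <li query="news"> loop maps to a single partition read.
    return session.execute(
        "SELECT title FROM q_news WHERE site_id = %s", (site_id,)
    )
```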
Let me put my 2 cents in.
The ability for users to have their own data models is not really about SaaS.
In the pure SaaS paradigm, each user gets the same functionality and the same data model. He can add his own objects, but not new classes of objects.
So scaling in this paradigm has a fairly obvious (though, frankly, not necessarily trivial) solution. You can use a cloud DB with built-in multi-tenant support (Azure, for example), you can use Amazon's RDS and add more instances as the user count grows, you can use sharding (for instance, partitioning by user) if the database supports it, and so on.
But a custom data model for each user is more like IaaS (infrastructure). It is a lower-level thing, where you just say: "OK, guys, you may build any data model you want, whatever."
And I believe that if you move the responsibility for data model creation to the user, you should also move the responsibility for database selection to him, as IaaS does. So the user would say: "OK, I need a key-value database here", and you provide him a Cassandra table, for example. If he wants an RDBMS, you provide him one as well.
Otherwise, you have to consider not only the data model itself, but also the data strategy your customer needs. One customer may need key-value storage (backed by some NoSQL DB), another may need an RDBMS. How would you know?
For instance, consider the entity from your example: {"team": {"members": [{"name": "Ben"}]}}. One user would use this model for a single type of query, something like "get the members of the team" and "add a member to the team". Another user may frequently query for statistics (average team player age, games played). These two scenarios could demand different database types: the first suits key-value lookup, the second an RDBMS. How would you guess the right database type and structure, given that key-value stores are modeled around their queries?
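To illustrate, here is a rough sketch of the two access patterns, using Python's sqlite3 purely as a stand-in (the schema and data are invented):

```python
# A sketch of the two usage patterns; sqlite3 is used only to keep the
# example self-contained.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE members (team TEXT, name TEXT, age INTEGER)")
db.executemany(
    "INSERT INTO members VALUES (?, ?, ?)",
    [("tigers", "Ben", 24), ("tigers", "Ana", 31), ("lions", "Joe", 28)],
)

# Pattern 1: key-style lookup ("get the members of the team"),
# a natural fit for a key-value store keyed by team.
members = db.execute(
    "SELECT name FROM members WHERE team = ?", ("tigers",)
).fetchall()

# Pattern 2: ad-hoc stats ("average team player age"), trivial in SQL
# but cutting across keys, which is exactly what a query-first
# key-value model is bad at.
avg_age = db.execute("SELECT AVG(age) FROM members").fetchone()[0]
```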
Technically, you could even try to guess the database type from the user's data model and queries, but you would need to put some restrictions on users' creativity; otherwise it becomes a very nontrivial task.
As for scaling: since each model is unique, you will need to add database instances as the user base grows. Of course, you can host multiple users in a single database instance in different schemas, and you will need to determine the number of users per instance by experiments or performance testing.
You may also look at document-oriented databases, but I think you need to review your concept and make some changes. Maybe you already have some restrictions in mind; I just didn't get that from your post.
Related
I'm not sure if Stack Overflow is the right platform to ask the title question. I'm in a dilemma as to which front-end and back-end stack I should consider for developing a health-related web application.
I heartily appreciate any suggestions or recommendations. Thanks.
You will need to have a look at your data. If it is relational, I would personally go for a SQL server such as Microsoft SQL Server, MySQL or Postgres. If your data is non-relational, you can go for something like Mongo.
I'm not saying that MongoDB is bad; it all depends on your data and how you would like to structure it. Obviously, when you're working with healthcare data such as patient data, there are certain laws you need to adhere to, especially in the United States with HIPAA, but I am sure almost every country has something similar.
The implication might be that you need to encrypt any data stored in the database, and that's one of the benefits of a relational database: most of them offer TDE (Transparent Data Encryption) or some other form of encryption at rest, which means your data is encrypted on disk when it is not in use.
When it comes to the front-end, you can look at JavaScript frameworks such as Angular, Vue or React, and for your backend you can choose pretty much anything you know well, such as NodeJS, .NET Core or Go. Pick your poison: each of them has its advantages and drawbacks, so you will need to investigate your options before committing to one or the other.
It depends on your data structures. MongoDB's dynamic schemas eliminate the need for a predefined data structure, so you can use MongoDB when you have a dynamic, less relational dataset (see the sketch after the two lists below). MongoDB is also natively scalable, so you can store large amounts of data without much trouble.
Use a relational DB system when you have highly relational entities. SQL enables complex transactions between entities with high reliability.
MongoDB/NoSQL is a good fit for:
- a high write load
- unstable schemas
- large amounts of data
- high availability

SQL is a good fit for:
- data structures that fit tables and rows
- strict relationships among entities
- complex queries
- frequent updates to a large number of records
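As a small illustration of the "unstable schemas" point, here is a sketch with pymongo (the collection and field names are invented); two documents of completely different shape can live in the same collection without any DDL:

```python
# A minimal sketch of MongoDB's schema flexibility with pymongo.
from pymongo import MongoClient

items = MongoClient()["shop"]["items"]

# Two documents with different shapes share one collection;
# no DDL, no migration.
items.insert_one({"name": "T-shirt", "size": "M", "color": "red"})
items.insert_one({"name": "Hard drive", "capacity_gb": 512, "rpm": 7200})

# Queries simply ignore documents that lack the field.
for doc in items.find({"capacity_gb": {"$gte": 256}}):
    print(doc["name"])
```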
I was recently introduced to MongoDB and I've come to like it a lot (compared to MySQL, which I used for all my projects).
However, in certain situations, storing my data as documents "linking" to each other with simple IDs makes more sense (to reduce duplicated data).
For example, I may have Country and User documents, where a user's location is actually an ID referencing a Country (since a Country document includes more data, duplicating the country data in each user makes no sense).
What I am curious about is: why would MongoDB be inferior here compared to using a proper relational database?
Is it because I save a round trip by doing a join (as opposed to issuing two separate queries with MongoDB)?
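To make the pattern concrete, here is roughly what I mean, sketched with pymongo (the names are invented):

```python
# A sketch of the "linking by ID" pattern in MongoDB.
from pymongo import MongoClient

db = MongoClient()["app"]

ca = db.countries.insert_one(
    {"code": "CA", "name": "Canada", "population": 38_000_000}
)
db.users.insert_one({"name": "Ben", "country_id": ca.inserted_id})

# Resolving the "link" takes a second round trip instead of a JOIN.
user = db.users.find_one({"name": "Ben"})
country = db.countries.find_one({"_id": user["country_id"]})
```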
That's a good question!
I would say there is definitely nothing wrong with using a NoSQL DB for the type of data you have described. For simple use cases it will work perfectly well.
The only point is that relational databases were designed a long time ago to store and query well-structured data with properly defined relations. So for large amounts of well-structured data, the performance and the features they provide will exceed what a NoSQL database offers, simply because they are more mature; it's their ball game.
On the other hand, NoSQL databases were designed to handle very large amounts of unstructured data and have out-of-the-box support for scaling in a distributed environment. So that's a completely different ball game.
They simply treat data differently and hence have different strategies and execution plans for fetching a given piece of data.
MongoDB was designed from the ground up to be scalable over multiple servers. When a MongoDB database gets too slow or too big for a single server, you can add additional servers by making the larger collections "sharded": the collection is divided between different servers, each responsible for managing a different part of it.
The reason MongoDB doesn't do JOINs is that it is impossible to make JOINs perform well when one or both collections are sharded over multiple nodes. A JOIN requires comparing entries of table/collection A with those of table/collection B. There are shortcuts for this when all the data lives on one server, but when the data is distributed over multiple servers, large amounts of data need to be compared and shipped between them. That would require a lot of network traffic and make the operation very slow and expensive.
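For illustration, sharding a collection looks roughly like this from Python (a sketch assuming a running sharded cluster behind mongos; the database and shard key are invented):

```python
# A sketch of sharding a collection via pymongo admin commands.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # a mongos router

client.admin.command("enableSharding", "shop")
client.admin.command(
    "shardCollection", "shop.orders", key={"customer_id": "hashed"}
)
# From here on, mongos routes each query to the shard(s) owning the
# relevant key range; a JOIN across two such collections would have to
# shuttle candidate rows between shards, which is why it isn't offered.
```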
Is it correct that you have only two tables, country and user? If so, it seems to me the only duplicated data is a foreign key, which is not a big deal. If more than that is duplicated, then I would question the DB design itself.
In concept you can do it in NoSQL, but why? Just because NoSQL is new? OK, then do it to learn, but remember: "if it ain't broke, don't fix it." Apparently the application is already running on a relational database. If the data is stored in separate documents in MongoDB and you want to interrelate them, you will need to use a link, which is more work than a join and slower. You would have to store that link, which is no better than storing the foreign key. Alternatively, you can embed one document in another in MongoDB, which might even increase duplication.
If it is currently running on MySQL, then it is not running on distributed servers, so Mongo's use of distributed servers is irrelevant; you would have to add servers to take advantage of it. And if the tables are properly indexed in the relational database, it will not have to search through large amounts of data.
However, this is not a complex application, and you can use either. If the data is stored in an MPP environment with a relational database, it will run very well and will not need to search through large amounts of data at all. There are two requirements when choosing a partitioning key in MPP: 1. pick one that achieves an even distribution of data; and 2. pick a key that allows collocation of data. I recommend you use the same partitioning key (shard key) in both tables.
As much as I love MongoDB, I don't see the value in moving your app.
Let's say I have an e-commerce website with millions of products that gets millions of page views a day, mostly on product detail pages.
Let's say that I currently have all my data in a relational DB, the good old way.
What would be the pros and cons of keeping the data in the relational DB for queries, aggregating and filtering products and all that, but using flat JSON files for the product details?
So, having one file per product, with all details serialized to JSON. These files would be placed behind a high-performance CDN, geographically distributed and all that. When the user goes to
www.mysite.com/prods/00123
the server (or even the client) would load a template file for the layout, and then fill it with the data it reads from something like cdn.mysite.com/prods/00123.json
So I basically don't need to run queries in this case; I jump straight to the file named after the product ID. I guess it should be very fast, and I would delegate scalability/caching/geographic distribution to a strong external partner (a CDN like Akamai, Amazon, etc.) instead of building my own (expensive and hard to maintain) distributed DB servers.
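The publishing step I have in mind would be roughly this (a Python sketch; the paths and the product dicts are made up):

```python
# A sketch of the publish step: serialize each product to a flat JSON
# file named after its id, then sync the directory to the CDN origin.
import json
from pathlib import Path

OUT = Path("cdn_root/prods")
OUT.mkdir(parents=True, exist_ok=True)

def publish(products):
    for p in products:  # p is a dict with at least an "id" key
        (OUT / f"{p['id']:05d}.json").write_text(json.dumps(p))

publish([{"id": 123, "name": "Blue widget", "price": 9.90}])
# The page at /prods/00123 then fetches cdn.mysite.com/prods/00123.json
```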
I look forward to your suggestions and feedback, especially if it comes from real-world experience. :)
Thanks!
As per your requirements,
It is better to store product descriptions in a schema-free database like MongoDB, since your products can have very different fields, with wide variation in the number of attributes (and corresponding fields). Also, such information is written far less often than it is read. MongoDB has collection-level write locks, which hurt write-heavy applications that need consistent writes; the reads, however, are very fast, because you don't have to do joins or fetch field values from an EAV schema table. Needless to say, depending on your data volume, sharding and replication need to be set up in a production environment.
It is also better than storing in a flat file, since MongoDB's read performance is very good thanks to memory-mapped files, and you get replication/sharding as well.
However, if the filesystem (or the filesystem network) provides the security, speed and accessibility a database would, then storing the data in the filesystem is not a bad idea. The traditional DB-vs-flat-file argument does not hold if the flat files are configured to be served efficiently.
However, you should not store information like the shopping cart or checkout transactions in MongoDB, since you don't have ACID transactions there, and frequent writes and updates "with consistency" are not MongoDB's cup of tea.
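To illustrate the contrast, here is a minimal sketch of why checkout wants ACID, with sqlite3 standing in for the relational side (the schema is invented):

```python
# A minimal ACID checkout sketch: both rows change, or neither does.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE stock  (product_id INTEGER PRIMARY KEY, qty INTEGER);
    CREATE TABLE orders (product_id INTEGER, qty INTEGER);
    INSERT INTO stock VALUES (1, 10);
""")

try:
    with db:  # one atomic transaction
        db.execute("UPDATE stock SET qty = qty - 3 WHERE product_id = 1")
        db.execute("INSERT INTO orders VALUES (1, 3)")
except sqlite3.Error:
    pass  # rolled back automatically; stock is never decremented alone
```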
Does it make sense to split an application's data model across different database systems? For example, the application stores all user data and relationships in a graph database (ideal for storing relationships), while storing other data in a document database such as CouchDB or MongoDB. This would require the graph database to reference unique IDs in the document database and vice versa.
Is this overcomplicating the data model and the application? Or is it putting both types of database systems to their best use for scaling your application?
It definitely can make sense; it depends entirely on the requirements of your application. Use other database systems for the things they are really good at.
Take full-text search, for example. Of course you can do more or less complex full-text searches with a relational database like MySQL, but there are systems such as Lucene/Solr that are optimized for this and can search millions of documents fast. So you could use such a system for its special task (here: a nifty full-text search), have it return identifiers, and then load the relationally structured data from the RDBMS.
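The pattern looks roughly like this (a Python sketch; full_text_search() is a hypothetical stand-in for the Solr/Lucene call, and the documents table is invented):

```python
# A sketch of the "search engine for ids, RDBMS for the rows" pattern.
import sqlite3

db = sqlite3.connect("app.db")

def full_text_search(query):
    # ...call Solr/Lucene here; assume it returns matching document ids.
    return [3, 17, 42]

ids = full_text_search("nifty search terms")
placeholders = ",".join("?" * len(ids))
rows = db.execute(
    f"SELECT * FROM documents WHERE id IN ({placeholders})", ids
).fetchall()
```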
Or CouchDB. I use CouchDB in some projects as a caching system, in combination with a relational database. Of course I need to take care of consistency, but it's definitely worth the effort. It boosted performance in those projects a lot, decreasing the load average on the server from 2 to 0.2, for example. :)
Something like this is called cross-store persistence. As you mentioned, you would store certain data in your relational database, social relationships in a graph DB, user-generated documents in a document DB, and user-provided multimedia files (pictures, audio, video) in a blob store like S3.
It is mainly about looking at the use cases and making sure that, wherever you need it, you can access the primary or index key of each store (back and forth). You can encapsulate the actual lookup in your domain or DAO layer.
Some frameworks, like the Spring Data projects, provide an initial kind of cross-store persistence out of the box, mostly integrating JPA with a different NoSQL datastore. For instance, Spring Data Graph allows you to store your entities via JPA and to add social graphs or other highly interconnected data as a secondary concern, leveraging a graph DB for the typical traversals and other graph operations (e.g. ranking, suggestions, etc.).
Another term for this is polyglot persistence.
Here are two contrary positions on the question:
Pro:
"Contrary to that, I’m a big fan of polyglot persistence. This simply means using the right storage backend for each of your usecases. For example file storages, SQL, graph databases, data ware houses, in-memory databases, network caches, NoSQL. Today there are mostly two storages used, files and SQL databases. Both are not optimal for every usecase."
http://codemonkeyism.com/nosql-polyglott-persistence/
Con:
"I don’t think I need to say that I’m a proponent of polyglot persistence. And that I believe in Unix tools philosophy. But while adding more components to your system, you should realize that such a system complexity is “exploding” and so will operational costs grow too (nb: do you remember why Twitter started to into using Cassandra?) . Not to mention that the more components your system has the more attention and care must be invested figuring out critical aspects like overall system availability, latency, throughput, and consistency."
http://nosql.mypopescu.com/post/1529816758/why-redis-and-memcached-cassandra-lucene
I would like to test the NoSQL world. This is just curiosity, not an absolute need (yet).
I have read a few things about the differences between SQL and NoSQL databases. I'm convinced of the potential advantages, but I'm a little worried about cases where NoSQL is not applicable. If I understand correctly, NoSQL databases essentially lack ACID properties.
Can someone give an example of a real-world operation (for an e-commerce site, a scientific application, ...) that an ACID relational database can handle but where a NoSQL database could fail miserably, either systematically through some kind of race condition or because of a power outage, etc.?
The perfect example would be something with no workaround short of modifying the database engine. Examples where a NoSQL database merely performs poorly can eventually be another question; here I would like to see when, theoretically, we just can't use such technology.
Maybe finding such an example is database-specific. If so, let's take MongoDB to represent the NoSQL world.
Edit:
To clarify this question: I don't want a debate about which kind of database is better for certain cases. I want to know whether this technology can be an absolute dead end in some cases, because no matter how hard we try, some feature a SQL database provides cannot be implemented on top of NoSQL stores.
Since there are many NoSQL stores available, I can accept picking an existing one as a basis, but what interests me most is the minimum subset of features a store must provide so that higher-level features can be implemented on top of it (e.g., can transactions be implemented on a store that doesn't provide X?).
This question is a bit like asking what kind of program cannot be written in an imperative or functional language. Any Turing-complete language can express every program a Turing machine can compute. The question is whether you, as a programmer, really want to write an accounting system for a Fortune 500 company in non-portable machine instructions.
In the end, NoSQL can do anything SQL-based engines can; the difference is that in something like Redis you, as the programmer, may be responsible for logic that MySQL gives you for free. SQL databases take a very conservative view of data integrity; the NoSQL movement relaxes those standards to gain better scalability and to make tasks common to web applications easier.
MongoDB (my current preference) makes replication and sharding (horizontal scaling) easy, makes inserts very fast and drops the requirement for a strict schema. In exchange, MongoDB users must code around slower queries when an index is not present, implement transactional logic in the app (perhaps with three-phase commits), and take a hit on storage efficiency.
CouchDB makes similar trade-offs, but also sacrifices ad-hoc queries for the ability to work with data offline and then sync with a server.
Redis and other key-value stores require the programmer to write much of the index and join logic that is built into SQL databases. In exchange, an application can exploit domain knowledge about its data to make indexes and joins more efficient than the general solution SQL would require. Redis also requires all data to fit in RAM, but in exchange gives performance on par with Memcache.
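As an illustration, a hand-rolled secondary index in Redis might look like this with redis-py (the key names are invented):

```python
# A sketch of application-maintained "index" logic in Redis.
import redis

r = redis.Redis()

def add_user(user_id, name, color):
    r.hset(f"user:{user_id}", mapping={"name": name, "color": color})
    # The application, not the database, maintains the secondary index:
    r.sadd(f"users:color:{color}", user_id)

add_user(1, "Ben", "green")
green_users = r.smembers("users:color:green")  # the "query"
```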
In the end, you really can do everything MySQL or Postgres does with nothing more than OS file system calls (after all, that is how the people who wrote those database engines did it). It all comes down to what you want the data store to do for you and what you are willing to give up in return.
Good question. First, a clarification: while the field of relational stores rests on a rather solid foundation of shared principles, with each vendor adding value through features or pricing, the non-relational (NoSQL) field is far more heterogeneous.
There are document stores (MongoDB, CouchDB) that are great for content management and similar situations where you have a flat set of variable attributes to build around a topic. Take site customization: using a document store to manage the custom attributes that define how a user wants to see his or her page suits the platform well. Despite their marketing hype, these stores don't tend to scale into the terabytes that well; it can be done, but it's not ideal. MongoDB has a lot of features found in relational databases, such as dynamic indexes (up to 40 per collection/table). CouchDB is built to be absolutely recoverable in the event of failure.
There are key/value stores (Cassandra, HBase, ...) that are great for highly distributed storage: Cassandra for low latency, HBase for higher latency. The trick with these is that you have to define your query needs before you start putting data in; they are not efficient for dynamic queries against arbitrary attributes. For instance, if you are building a customer event-logging service, you would key on the customer's unique attribute. From there, you can push various log structures into your store and retrieve all logs by customer key on demand. It would be far more expensive, however, to go through the logs looking for events whose type was "failure", unless you made that your secondary key. One other thing: the last time I looked at Cassandra, you couldn't run a regexp inside the M/R query, meaning that if you wanted to look for patterns in a field, you had to pull all instances of that field and run them through a regexp yourself to find the tuples you wanted.
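A sketch of that event-logging example with the DataStax Python driver (the names are invented); the key is chosen up front for the one query you plan to run:

```python
# A sketch of a query-first Cassandra table for customer event logs.
import uuid
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("logs")
session.execute("""
    CREATE TABLE IF NOT EXISTS events (
        customer_id uuid,
        ts timestamp,
        type text,
        payload text,
        PRIMARY KEY (customer_id, ts)
    )
""")

cid = uuid.uuid4()
# Cheap: all logs for one customer, straight off the partition key.
session.execute("SELECT * FROM events WHERE customer_id = %s", (cid,))

# Expensive: "all failures" ignores the key; without a secondary index
# this would mean scanning every partition, so Cassandra rejects it:
#   SELECT * FROM events WHERE type = 'failure'
```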
Graph databases are very different from the two above. The relations between items (objects, tuples, elements) are fluid. They don't scale into the terabytes, but that's not what they are designed for. They are great for asking questions like "Hey, how many of my users like the color green? Of those, how many live in California?" With a relational database, you would have a static structure. With a graph database (I'm oversimplifying, of course), you have attributes and objects, and you connect them as makes sense, without schema enforcement.
I wouldn't put anything critical into a non-relational store. Commerce, for instance, where you want guarantees that a transaction is complete before delivering the product: you want guaranteed integrity (or at least the best chance of it). If a user loses his or her site-customization settings, no big deal. If you lose a commerce transaction, big deal. There may be some who disagree.
I also wouldn't put complex structures into any of the above non-relational stores. They don't do joins well at scale, and that's okay, because it's not how they're supposed to work. Where you might put an identifier for address_type into a customer_address table in a relational system, you would instead embed the address_type information in a customer tuple stored as a document or key/value. Data efficiency is not the domain of the document or key/value store; the point is distribution and pure speed. The sacrifice is footprint.
There are other subtypes of the family of stores labeled "NoSQL" that I haven't covered here; there are a ton (122 at last count) of different projects focused on non-relational solutions to data problems of various types. Riak is yet another one that I keep hearing about and can't wait to try out.
And here's the thing: the big-dollar relational vendors have been watching, and chances are they're all building or planning their own non-relational solutions to tie in with their products. Over the next couple of years, if not sooner, we'll see the movement mature, large companies buy up the best of breed, and relational vendors start offering integrated solutions, where they haven't already.
It's an extremely exciting time to work in the field of data management. You should try a few of these out. You can download Couch or Mongo and have them up and running in minutes; HBase is a bit harder.
In any case, I hope I've informed without confusing and enlightened without significant bias or error.
RDBMSes are good at joins; NoSQL engines usually aren't.
NoSQL engines are good at distributed scalability; RDBMSes usually aren't.
RDBMSes are good at data validation constraints; NoSQL engines usually aren't.
NoSQL engines are good at flexible, schema-less approaches; RDBMSes usually aren't.
Both approaches can solve either set of problems; the difference is in efficiency.
Probably the answer to your question is that MongoDB can handle any task (and so can SQL), but in some cases it's better to choose MongoDB, in others a SQL database. You can read about the advantages and disadvantages here.
Also, as @Dmitry said, MongoDB opens the door to easy horizontal and vertical scaling with replication and sharding.
RDBMSes enforce strong consistency, while most NoSQL stores are only eventually consistent. So at a given point in time, data read from a NoSQL DB might not be the most up-to-date copy.
A common example is a bank transaction: when a user withdraws money, node A is updated with this event; if node B is queried for this user's balance at the same time, it can return an outdated balance. This can't happen in an RDBMS, where the consistency guarantee ensures data is updated before it can be read.
RDBMSes are really good at quickly computing sums, averages, etc. over tables, e.g. SELECT SUM(x) FROM y WHERE z. That is surprisingly hard to do in most NoSQL databases if you want an answer immediately. Some NoSQL stores provide map/reduce as a way of solving the same problem, but it is not real-time in the way it is in the SQL world.
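For illustration, here is the same sum in both worlds, sketched in Python (sqlite3 and pymongo; the table, collection and field names are invented):

```python
# The same aggregate: a one-liner in SQL, a pipeline in MongoDB.
import sqlite3
from pymongo import MongoClient

sql = sqlite3.connect(":memory:")
sql.execute("CREATE TABLE y (x INTEGER, z INTEGER)")
total = sql.execute("SELECT SUM(x) FROM y WHERE z = 1").fetchone()[0]

mongo = MongoClient()["db"]["y"]
result = list(mongo.aggregate([
    {"$match": {"z": 1}},
    {"$group": {"_id": None, "total": {"$sum": "$x"}}},
]))
```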