I am trying to create a redis based datastore with multiple fields that can be used to fetch the entity based on its value. The data would be something like:
Person<Entity>
Name
Address
Purchases<Another Entity>
Reviews<list of another Entity>
The same will also exist in other entities, as there will be many-to-many relationships between the different entities.
I am not considering traditional databases as I am looking for scalability and fault tolerance in such example.
What I am creating is the following
Hash of Entity id mapped to each entity object
Sets containing the association of say Person to Purchases and another for Purchases to Person and so on - one for both sides of a many to many relationship.
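The layout described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: `TinyStore` is a hypothetical in-memory stand-in for Redis hashes and sets (so the sketch runs without a server), and the key names such as `person:1:purchases` are assumptions.

```python
from collections import defaultdict


class TinyStore:
    """Hypothetical in-memory stand-in for Redis hashes and sets."""

    def __init__(self):
        self.hashes = defaultdict(dict)
        self.sets = defaultdict(set)

    def hset(self, key, mapping):
        self.hashes[key].update(mapping)

    def hgetall(self, key):
        return dict(self.hashes.get(key, {}))

    def sadd(self, key, member):
        self.sets[key].add(member)

    def smembers(self, key):
        return set(self.sets.get(key, set()))


def link(store, person_id, purchase_id):
    # Write both directions so either side can be traversed by key.
    store.sadd(f"person:{person_id}:purchases", purchase_id)
    store.sadd(f"purchase:{purchase_id}:persons", person_id)


store = TinyStore()
store.hset("person:1", {"name": "Ann", "address": "1 Main St"})
store.hset("purchase:10", {"item": "book"})
link(store, "1", "10")

# Follow the association set, then load each entity hash by id.
purchases_of_ann = [store.hgetall(f"purchase:{pid}")
                    for pid in store.smembers("person:1:purchases")]
```

Every association costs two writes (one per direction), which is the overhead the design trades for key-based lookups from either side.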
Since this design will involve a lot of overhead, I suspect there is some flaw in keeping this unnormalized. As for the choice of using a memory store over a database, I am considering query response time to be of critical value. I am looking for suggestions about my design as I am implementing this example to learn how to handle bigdata challenges.
I am looking for suggestions about my design as I am implementing this
example to learn how to handle bigdata challenges.
On what basis do you believe your challenges are Big Data? How much data are we talking about? You need to ask yourself that question first, before discounting relational databases as a solution that may well meet your needs.
I am not considering traditional databases as I am looking for
scalability and fault tolerance in such example.
Redis and relational databases have the same scalability issue: they don't scale well horizontally unless you either implement or use a custom sharding technique. Redis Cluster is meant to address this, but it's a work in progress and not yet production ready. In the meantime, you can use twemproxy, a proxy developed by Twitter that distributes keys across a cluster of Redis servers.
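To make the idea of distributing keys across servers concrete, here is a minimal sketch of hash-based sharding. It is not twemproxy's actual implementation; the server names are hypothetical, and plain modulo hashing is assumed for simplicity.

```python
import zlib


def shard_for(key: str, servers: list) -> str:
    """Pick a server by hashing the key (simple modulo distribution)."""
    return servers[zlib.crc32(key.encode("utf-8")) % len(servers)]


servers = ["redis-a:6379", "redis-b:6379", "redis-c:6379"]  # hypothetical hosts
placement = {key: shard_for(key, servers)
             for key in ("person:1", "person:2", "purchase:10")}
```

With modulo hashing, changing the number of servers remaps most keys; consistent hashing schemes are commonly used to reduce that churn.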
I am trying to create a redis based datastore with multiple fields
that can be used to fetch the entity based on its value.
Redis is not designed to query based on values, period; read up on this and this to better understand why.
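If value-based lookups are needed anyway, the usual workaround is to maintain the index yourself: one set of entity ids per field value, updated on every write. The sketch below models this with plain Python dicts standing in for Redis hashes and sets; the `field:value` key format and the helper names are assumptions.

```python
from collections import defaultdict

entities = {}             # entity id -> fields (one Redis hash per entity)
index = defaultdict(set)  # "field:value" -> entity ids (one Redis set per value)


def save(entity_id, fields):
    # Remove stale index entries before overwriting the entity.
    for field, value in entities.get(entity_id, {}).items():
        index[f"{field}:{value}"].discard(entity_id)
    entities[entity_id] = dict(fields)
    for field, value in fields.items():
        index[f"{field}:{value}"].add(entity_id)


def find_by(field, value):
    return sorted(index.get(f"{field}:{value}", set()))


save("person:1", {"name": "Ann", "city": "Oslo"})
save("person:2", {"name": "Bob", "city": "Oslo"})
save("person:1", {"name": "Ann", "city": "Rome"})  # Ann moves; index must follow
```

The application, not the store, is responsible for keeping the index consistent with the data, which is part of why this is more work than a query in a database with secondary indexes.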
Related
I am designing an enterprise application and there is a big question for me if it is ok to use one Database per each aggregates in Domain-Driven Design and apply CQRS for them.
For example, I have one Domain that contains several Bounded Contexts, and each BC has two or more aggregates. Can I use a relational database like MSSQL and a NoSQL database like MongoDB for one or more aggregates?
The concept (Domain-Driven Design) does not discuss exact implementations, so it does not limit the choice of database. Go ahead with what you are trying if it suits your use case. The only requirement is to plan the design ahead, and that design can change later if required. I would recommend adding event sourcing to the blend as well. It really helps with denormalization when combined with CQRS.
The main concern is making sure that commands are reflected consistently across all databases. For example, if you have one aggregate root with an entity and some value objects spread over multiple databases, make sure that all the adapters behave similarly, so that the domain has no concern over how the data is stored (separated) across databases. If that is achieved neatly, the domain is free to contain only domain logic. I mean this in terms of how the interfaces are designed for multiple databases. If the NoSQL DB interface exposes methods that deal in documents and the SQL DB interface exposes methods that work on tables, the domain will definitely take a hit switching between documents and tables. Abstract that logic (perhaps using Hexagonal architecture) and you're in a good position with multiple DBs.
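A minimal sketch of that abstraction, assuming a hypothetical `Order` aggregate and two in-memory stand-ins for the SQL and document adapters (no real database driver is used here):

```python
import json
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class Order:  # hypothetical aggregate root
    order_id: str
    total: float


class OrderRepository(Protocol):
    """Port: the only storage interface the domain knows about."""

    def save(self, order: Order) -> None: ...
    def get(self, order_id: str) -> Optional[Order]: ...


class TableOrderRepository:
    """Stand-in for a SQL adapter: stores rows as tuples."""

    def __init__(self):
        self._rows = {}

    def save(self, order):
        self._rows[order.order_id] = (order.order_id, order.total)

    def get(self, order_id):
        row = self._rows.get(order_id)
        return Order(*row) if row else None


class DocumentOrderRepository:
    """Stand-in for a document-store adapter: stores JSON documents."""

    def __init__(self):
        self._docs = {}

    def save(self, order):
        self._docs[order.order_id] = json.dumps(
            {"_id": order.order_id, "total": order.total})

    def get(self, order_id):
        raw = self._docs.get(order_id)
        if raw is None:
            return None
        doc = json.loads(raw)
        return Order(order_id=doc["_id"], total=doc["total"])


def apply_discount(repo: OrderRepository, order_id: str, percent: float) -> Order:
    # Domain logic only sees Order objects, never rows or documents.
    order = repo.get(order_id)
    if order is None:
        raise KeyError(order_id)
    order.total = round(order.total * (1 - percent / 100), 2)
    repo.save(order)
    return order
```

Because both adapters accept and return `Order`, `apply_discount` works unchanged whichever store backs the aggregate.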
For the use case of a shopping cart (and checkout process) in an e-commerce web application, which is better to use: a relational DB (RDBMS) or a NoSQL DB such as MongoDB, Cassandra or others?
From the catalog perspective, NoSQL is an ideal fit, with its flexible schema and horizontal scaling of data/nodes.
What are the pros/cons of each approach for Shopping Cart use-case?
There are many differences between SQL and noSQL databases. Those differences are what give each storage type its pros and cons in different situations.
Since both database types would work in the end, it all really depends on the context or on your implementation.
In this specific case (shopping cart), the pros and cons are probably all related to the consistency of your data and scalability.
noSQL databases are better suited (pros) for more "dynamic" applications (data analysis, IoT, multimedia, etc.). Such applications use data that usually doesn't have a rigid structure and comes in very large volumes. This means that there's no need to develop a complex database model, and it's cheaper to store large amounts of data across separate "nodes". This also makes noSQL databases easier to expand and scale. The main problem (cons) is the lack of structure. This will make it harder for you to run analysis and to keep track of every detail of the database.
Meanwhile, SQL databases are useful (pros) when your data is well-structured and mostly consistent. As you know, SQL stores data in columns and rows. This gives SQL an advantage if you want to generate detailed statistics from your data and also if you want to keep an organized record of everything that happens in your app. The main downside (cons) is that designing an SQL database takes more time, and it's probably more expensive to maintain (scalability and physical storage require more hardware).
Performance-wise, I would argue that in this use case there wouldn't be any major difference.
If you think about all of what I just wrote, I would say that in the context of a shopping cart, the SQL model is the way to go. A shopping cart won't require lots of upgrades and changes (scalability), its data is always structured (name of item, price, etc.), and you might want to keep track of every transaction a user makes in your e-commerce application (for accountability and safety reasons).
tl;dr: use SQL, because the data in a shopping cart use case is structured and consistent.
good luck!
The general pros/cons of something like Cassandra vs postgres/mysql look like:
Cassandra handles multi-DC HA much better.
Cassandra handles high write volume much better.
Cassandra allows you to reboot hosts without downtime because you'll have multiple replicas (and you won't have to worry about WAL replay or binlog replay or weird master-master replication problems, though some RDBMS add-ons make this easier for MySQL and Postgres than it used to be).
Cassandra allows you to scale better (linear scaling with number of instances up to ~1200 or so instances)
MySQL/Postgres allow you to build queries as your business requirements evolve by adding indices to existing tables; Cassandra expects you to know the queries in advance and do data modeling before you start writing data.
MySQL/Postgres tends to be easier to use, and you'll find a ton of libraries/UIs/etc to help you get started
MySQL/Postgres offer real transactions / MVCC - Cassandra has lightweight transactions limited to operations on a single key with much weaker isolation/atomicity guarantees.
Ultimately, though, unless you believe your shopping cart is going to handle thousands of concurrent users, it probably doesn't matter (as long as you use something with real data durability guarantees): use what you're most comfortable using. I'd use Cassandra because I know Cassandra very well, but if you're not great with Cassandra (or whatever), use what you know best.
I want to design a product which allows customers to create their own websites. A customer will be able to maintain his website's data model on the fly, run queries on it and display the output on an HTML page. I doubt a traditional RDBMS is the right choice, for two reasons: with every customer the amount of data will grow, and the RDBMS might reach its limits even if scaled; and as the data model is highly dynamic, running many DDL queries will slow down the whole system.
I'm trying to figure out which database/data storage system might be the best option for such a system. Recently I have read a lot about NoSQL solutions like Cassandra and MongoDB. They look promising in terms of performance but come with a flaw: the data is not relational, so it has to be denormalized.
I don't know what will be the impact of denormalizing a dynamic customer defined data model, because the customer models and inserts data first (in a relational way) and then does the queries afterwards. The denormalization has to happen automatically which leads to another problem: Can I create one table for each query, even if some queries might be similar? There might exist a high redundancy of data after a while.
Does creating/updating tables on the fly have any impact?
Every time the customer changes data the same data has to be changed in all tables which hold a copy of the same entity (like the name of an employee has to be changed in "team member" and also in "project task"). Are those updates costly?
Is it possible to nest data with unlimited depth like {"team": {"members": [{"name": "Ben"}]}}?
There might be even better/other approaches, I'm happy for any hints.
Adding clarification to the requirements
My question actually is, how can I use a NoSQL DB like Cassandra to maintain relational data and will the solution still perform better compared to a RDBMS?
The customer thinks relationally (because in fact, data are always relational in my opinion) no matter what DBMS is used. And this service is not about letting the customer choose the underlying data storage. There can only be one.
A customer can define his own relational data model by using a management frontend provided by the application. The data model may be changed at any time by the customer. In RDBMS a DDL on a production system is not a good idea. On top of the data schema the customer can add named queries and use them as a data source on any web page he creates.
An example would be a query for News given the name "news" and in a web page it would be used like <ul><li query="news"><h1>[news.title]</h1></li></ul>, which would execute the query and iterate through the data and repeat the <li> on each iteration. That is the most simple example though.
In more complex examples using SQL, there might be extensive use of subqueries, which perform badly. In NoSQL there seems to be the option to first denormalize and prepare a table with the data needed by the query and then just query that table. Any change to the data involved would lead to an update of that table. That means that for every query the customer creates, the system will automatically create and maintain a table and its data, so there will be a lot of data redundancy. Benchmarks state that Cassandra is fast at writing, so that might be an option.
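The cost of keeping such per-query tables in sync can be sketched with the employee example from above. This is an illustration only: the tables are plain dicts, the table and field names are hypothetical, and a full scan is used where a real system would look up the affected rows by key.

```python
employees = {}      # normalized source: employee id -> name
team_members = {}   # query table 1: (team, employee id) -> row with copied name
project_tasks = {}  # query table 2: (task, employee id) -> row with copied name


def rename_employee(emp_id, new_name):
    """Update the source and every denormalized copy; return rows written."""
    employees[emp_id] = new_name
    touched = 1
    for table in (team_members, project_tasks):
        for key, row in table.items():
            if key[1] == emp_id:
                row["name"] = new_name
                touched += 1
    return touched


employees["e1"] = "Ben"
team_members[("core", "e1")] = {"name": "Ben"}
project_tasks[("task-7", "e1")] = {"name": "Ben", "title": "Write docs"}
project_tasks[("task-9", "e1")] = {"name": "Ben", "title": "Review"}

rows_written = rename_employee("e1", "Benjamin")
```

One logical change becomes one write per copy, so the write amplification grows with the number of query tables that include the entity.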
Let me put my 2 cents in.
Letting users have their own data models is not really what SaaS is about.
In the pure SaaS paradigm, each user has the same functionality and data model. He could add his own objects, but not the classes of objects.
So scaling in this paradigm has rather obvious solutions (though frankly, they may not be trivial). You can get a cloud DB with built-in multi-tenant support (like Azure, for example), you can use Amazon's RDS and add more instances as the number of users grows, you can use sharding (for instance, partitioning by user) if the database supports it, etc.
But a custom data model for each user is more like IaaS (infrastructure). It is a lower-level thing, and you just say: "Ok, guys, you may build any data model you want, whatever".
And I believe that if you move the responsibility for creating the data model to the user, you should also move the responsibility for selecting the database, as IaaS does. So the user would say: "Ok, I need a key-value database here" and you provide him with a Cassandra table, for example. If he wants an RDBMS, you provide him with one as well.
Otherwise, you have to consider not only the data model itself, but also the data strategy that your customer needs. Some customers may need key-value storage (backed by some noSQL DB), while others may need an RDBMS. How would you know?
For instance, consider the entity from your example: {"team": {"members": [{"name": "Ben"}]}}. One user would use this model for a single type of query, something like "get the members of the team" and "add a member to the team". Another user may need to query frequently for statistics (average team player age, games played). These two scenarios could demand different database types: the first suits key-value search, the second an RDBMS. How would you guess the database type and structure, given that key-value stores are modeled around queries?
Technically, you may even try to guess the database type from the users' data model and queries, but you need to put some restrictions on users' creativity. Otherwise, it would be a very nontrivial task.
As for scaling, since each model is unique, you need to add database instances as the number of users grows. Of course, you can host multiple users in a single database instance under different schemas, and you will need to determine the number of users per instance by experiment or performance testing.
You may also look at document-oriented databases, but I think that you need to review your concept and make some changes. Maybe you already have some obvious restrictions in mind, but I didn't get them from your post.
I am a newbie in MongoDB.
Currently I am working on a project using the MEAN stack. I am using a schema-less approach for storing data, i.e. the mongodb driver as the client and not mongoose.
From my research on the internet, I found that a schema-less database performs well (in terms of speed) compared to a schema-based NoSQL database, and hence decided on the schema-less approach.
Scenario I am facing:
I have a few entities that share common properties. Say Entity A has name, location and phone number, and Entity B has those same properties plus some additional ones.
Suppose my application will scale up to 1 billion users.
Question for the above scenario discussed
1) Is it better to store different types in different collections, or to prefer an inheritance-style approach for storing those entities?
If the inheritance style is preferred, how is it done in MongoDB?
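For reference, one common way to express the inheritance-style option is a single collection with a discriminator field. The sketch below uses plain Python dicts in place of a MongoDB collection; the `type` field name and the `find` helper are assumptions for illustration, not MongoDB API.

```python
contacts = [  # one collection; "type" is a hypothetical discriminator field
    {"_id": 1, "type": "A", "name": "Ann", "location": "Oslo", "phone": "555-0100"},
    {"_id": 2, "type": "B", "name": "Bob", "location": "Rome", "phone": "555-0101",
     "company": "Acme"},
]


def find(collection, **criteria):
    """Return documents whose fields equal all given criteria."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]


only_b = find(contacts, type="B")
in_oslo = find(contacts, location="Oslo")  # shared fields are queryable across types
```

The alternative, one collection per type, keeps each collection homogeneous but requires querying several collections whenever the shared fields are searched.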
A few other general questions (considering large scalability):
1) Is my schema-less approach the right choice?
2) Is it better to use an ODM tool, or to write code directly in my DAO layer to access the database without using an object approach?
Many may feel that this is totally out of scope for MongoDB, but I am asking this question from a design and performance perspective.
So I need advice from experts who have really worked on large-scale application development using MongoDB.
What are the advantages of using NoSQL databases? I've read a lot about them lately, but I'm still unsure why I would want to implement one, and under what circumstances I would want to use one.
Relational databases enforce ACID. So, you will have schema-based, transaction-oriented data stores. It's proven and suitable for 99% of real-world applications. You can practically do anything with relational databases.
But there are limitations on speed and scaling when it comes to massive high-availability data stores. For example, Google and Amazon have terabytes of data stored in big data centers. Querying and inserting are not performant in these scenarios because of the blocking/schema/transaction nature of the RDBMS. That's the reason they have implemented their own databases (actually, key-value stores) for massive performance gains and scalability.
NoSQL databases have been around for a long time - just the term is new. Some examples are graph, object, column, XML and document databases.
For your 2nd question: Is it okay to use both on the same site?
Why not? Both serve different purposes, right?
NoSQL solutions are usually meant to solve a problem that relational databases are either not well suited for, too expensive to use (like Oracle) or require you to implement something that breaks the relational nature of your db anyway.
Advantages are usually specific to your usage, but unless you have some sort of problem modeling your data in a RDBMS I see no reason why you would choose NoSQL.
I myself use MongoDB and Riak for specific problems where a RDBMS is not a viable solution, for all other things I use MySQL (or SQLite for testing).
If you need a NoSQL db you usually know about it, possible reasons are:
client wants 99.999% availability on a high traffic site.
your data makes no sense in SQL, you find yourself doing multiple JOIN queries for accessing some piece of information.
you are breaking the relational model, you have CLOBs that store denormalized data and you generate external indexes to search that data.
If you don't need a NoSQL solution, keep in mind that these solutions weren't meant as replacements for an RDBMS but rather as alternatives where the former fails. More importantly, they are relatively new, and as such they still have a lot of bugs and missing features.
Oh, and regarding the second question: it is perfectly fine to use any technology in conjunction with another. Just to be complete, in my experience MongoDB and MySQL work fine together as long as they aren't on the same machine.
Martin Fowler has an excellent video which gives a good explanation of NoSQL databases. The link goes straight to his reasons to use them, but the whole video contains good information.
You have large amounts of data - especially if you cannot fit it all on one physical server as NoSQL was designed to scale well.
Object-relational impedance mismatch - Your domain objects do not fit well in a relational database schema. NoSQL allows you to persist your data as documents (or graphs) which may map much more closely to your data model.
NoSQL refers to database systems where data is organized as documents (MongoDB), key-value pairs (Memcached, Redis), or graph structures (Neo4j).
Here are some possible questions and answers for "When to go for NoSQL":
Require flexible schema or deal with tree-like data?
Generally, in agile development we start designing systems without knowing all the requirements upfront, and later in development the database may need to accommodate frequent design changes, for example while showcasing an MVP (minimum viable product).
Or you are dealing with a data schema that is dynamic in nature.
E.g. system logs; a very precise example is AWS CloudTrail logs.
Data set is vast/big?
Yes, NoSQL databases are the better candidate for applications where the database needs to manage millions or even billions of records without compromising performance and availability, possibly at the cost of consistency (though modern databases are an exception here, as they allow tunable consistency versus availability, e.g. Cassandra and cloud provider databases such as CosmosDB and DynamoDB).
Trade-off between scaling and consistency
Unlike an RDBMS, NoSQL databases may make the dataset consistent across other nodes only eventually, which is the default behavior, but they are easy to scale in terms of performance and availability.
Example: This may be good for storing people who are online in the instant messaging app, API tokens in DB, and logging website traffic stats.
Performing Geolocation Operations:
MongoDB has rich support for geo-querying and geolocation operations. I really loved this feature of MongoDB. So does PostgreSQL, but ease of implementation depends on the use case.
In a nutshell, MongoDB is a great fit for applications where you store dynamically structured data on a large scale.
Edits:
Updated the answer about the consistency of the database.
Some essential information is missing to answer the question: Which use cases must the database be able to cover? Do complex analyses have to be performed from existing data (OLAP) or does the application have to be able to process many transactions (OLTP)? What is the data structure? That is far from the end of question time.
In my view, it is wrong to make technology decisions on the basis of bold buzzwords without knowing exactly what is behind them. NoSQL is often praised for its scalability. But you also have to know that horizontal scaling (over several nodes) also has its price and is not free. Then you have to deal with issues like eventual consistency and define how to resolve data conflicts if they cannot be resolved at the database level. However, this applies to all distributed database systems.
Developers are initially delighted by the term "schema less" in NoSQL. The buzzword quickly loses its charm after technical analysis: it is true that no schema is required when writing, but one comes into play when reading. That is why the more accurate term is "schema on read". It may be tempting to be able to write data at one's own discretion. But how do I deal with the situation where existing data is stored in one shape and the new version of the application expects a different schema?
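One way applications cope with "schema on read" is to upgrade old documents while reading them. The sketch below assumes a hypothetical `schema_version` field and a hypothetical change from a single `name` field (version 1) to `first_name`/`last_name` (version 2).

```python
def read_user(doc):
    """Normalize any stored version of a user document to the current shape."""
    version = doc.get("schema_version", 1)  # documents without the field are v1
    if version == 1:
        # v1 stored a single "name" string; v2 splits it in two.
        first, _, last = doc["name"].partition(" ")
        return {"schema_version": 2, "first_name": first, "last_name": last}
    if version == 2:
        return dict(doc)
    raise ValueError(f"unknown schema_version: {version}")


stored = [
    {"name": "Ada Lovelace"},  # written by the old application version
    {"schema_version": 2, "first_name": "Alan", "last_name": "Turing"},
]
users = [read_user(doc) for doc in stored]
```

The schema has not disappeared; it has moved into application code, which now has to know every shape that was ever written.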
The document model (as in MongoDB, for example) is not suitable for data models with many relationships between the data. Joins have to be done at the application level, which is additional effort, and why should I program things that the database should do for me?
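For a sense of what an application-level join involves, here is a minimal sketch over two hypothetical collections represented as lists of dicts. It is the hand-written equivalent of a join between orders and customers on the customer id.

```python
orders = [
    {"_id": 1, "customer_id": "c1", "total": 30},
    {"_id": 2, "customer_id": "c2", "total": 45},
    {"_id": 3, "customer_id": "c1", "total": 12},
]
customers = [
    {"_id": "c1", "name": "Ann"},
    {"_id": "c2", "name": "Bob"},
]


def join_orders_with_customers(orders, customers):
    # Build a lookup once (hash join), then attach the customer name to each order.
    by_id = {c["_id"]: c for c in customers}
    return [{**o, "customer_name": by_id[o["customer_id"]]["name"]} for o in orders]


joined = join_orders_with_customers(orders, customers)
```

In a real deployment both collections also have to be fetched over the network first, and missing references have to be handled by the application.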
If you make the argument that Google and Amazon have developed their own databases because conventional RDBMS can no longer handle the flood of data, you can only say: You are not Google and Amazon. These companies are the spearhead, some 0.01% of scenarios where traditional databases are no longer suitable, but for the rest of the world they are.
What's not insignificant: SQL has been around for over 40 years, and millions of hours of development have gone into large systems such as Oracle or Microsoft SQL Server. Newer databases still have to match that. Sometimes it is also easier to find an SQL admin than someone for MongoDB. Which brings us to the question of maintenance and management, a subject that is not exactly sexy but is part of the technology decision.
Handling A Large Number Of Read Write Operations
Look towards NoSQL databases when you need to scale fast. And when do you generally need to scale fast?
NoSQL databases fit best when there are a large number of read-write operations on your website and when you are dealing with a large amount of data. Since they have the ability to add nodes on the fly, they can handle more concurrent traffic and larger amounts of data with minimal latency.
Flexibility With Data Modeling
The second cue comes during the initial phases of development, when you are not sure about the data model or the database design and things are expected to change at a rapid pace. NoSQL databases offer us more flexibility.
Eventual Consistency Over Strong Consistency
It’s preferable to pick NoSQL databases when it’s OK for us to give up on Strong consistency and when we do not require transactions.
A good example of this is a social networking website like Twitter. When a tweet of a celebrity blows up and everyone is liking and re-tweeting it from around the world. Does it matter if the count of likes goes up or down a bit for a short while?
The celebrity would definitely not care if instead of the actual 5 million 500 likes, the system shows the like count as 5 million 250 for a short while.
When a large application is deployed on hundreds of servers spread across the globe, the geographically distributed nodes take some time to reach a global consensus.
Until they reach a consensus, the value of the entity is inconsistent. The value of the entity eventually gets consistent after a short while. This is what Eventual Consistency is.
Though the inconsistency does not mean that there is any sort of data loss. It just means that the data takes a short while to travel across the globe via the internet cables under the ocean to reach a global consensus and become consistent.
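The behaviour described above can be simulated in a few lines. This is a toy model, not how any particular database replicates: `Replica` and `like` are hypothetical names, and an in-memory queue stands in for asynchronous replication between regions.

```python
from collections import deque


class Replica:
    def __init__(self):
        self.likes = 0
        self.inbox = deque()  # increments received but not yet applied

    def apply_pending(self):
        while self.inbox:
            self.likes += self.inbox.popleft()


def like(origin, others):
    origin.likes += 1            # visible locally right away
    for replica in others:
        replica.inbox.append(1)  # delivered later (asynchronous replication)


europe, asia = Replica(), Replica()
for _ in range(3):
    like(europe, [asia])
like(asia, [europe])

before = (europe.likes, asia.likes)  # replicas disagree for a while
europe.apply_pending()
asia.apply_pending()
after = (europe.likes, asia.likes)   # both converge on the same total
```

No like is lost at any point; the replicas simply report different counts until the pending updates have been applied.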
We experience this behaviour all the time. Especially on YouTube. Often you would see a video with 10 views and 15 likes. How is this even possible?
It’s not. The actual views are already more than the likes. It’s just that the view count is inconsistent and takes a short while to get updated.
Running Data Analytics
NoSQL databases also fit best for data analytics use cases, where we have to deal with an influx of massive amounts of data.
I came across this question while looking for convincing grounds to deviate from RDBMS design.
There is a great post by Julian Brown which sheds light on the constraints of distributed systems. The concept is called Brewer's CAP Theorem, which in summary goes:
The three requirements of distributed systems are : Consistency, Availability and Partition tolerance (CAP in short). But you can only have two of them at a time.
And this is how I summarised it for myself:
You'd better go for NoSQL if Consistency is what you are willing to sacrifice.
I designed and implemented solutions with NoSQL databases and here is my checkpoint list to make the decision to go with SQL or document-oriented NoSQL.
DON'Ts
SQL is not obsolete and remains a better tool in some cases. It's hard to justify use of a document-oriented NoSQL when
Need OLAP/OLTP
It's a small project / simple DB structure
Need ad hoc queries
Can't avoid immediate consistency
Unclear requirements
Lack of experienced developers
DOs
If you don't have those conditions or can mitigate them, then here are 2 cases where you may benefit from NoSQL:
Need to run at scale
Convenience of development (better integration with your tech stack, no need for an ORM, etc.)
More info
In my blog posts I explain the reasons in more detail:
7 reasons NOT to NoSQL
2 reasons to NoSQL
Note: the above is applicable to document-oriented NoSQL only. There are other types of NoSQL, which require other considerations.
Ran into this thread and wanted to add my experience. Many SQL databases support JSON data in columns and support querying of this JSON. So what I have used is a hybrid: a relational database with columns containing JSON.
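A minimal sketch of that hybrid, using SQLite from the Python standard library. It assumes the SQLite build includes the JSON functions (`json_extract`), which is typical for recent versions but not guaranteed; the table and field names are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Fixed relational columns plus one free-form JSON column.
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, profile TEXT)")
conn.executemany(
    "INSERT INTO person (name, profile) VALUES (?, ?)",
    [
        ("Ann", json.dumps({"city": "Oslo", "tags": ["vip"]})),
        ("Bob", json.dumps({"city": "Rome", "tags": []})),
    ],
)

# Query inside the JSON column with a JSON path expression.
rows = conn.execute(
    "SELECT name FROM person WHERE json_extract(profile, '$.city') = ? ORDER BY name",
    ("Oslo",),
).fetchall()
```

The stable attributes keep their relational columns, constraints and joins, while the parts of the model that vary live in the JSON column.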