This question is purely for understanding purposes; it might be a noob question.
Assume that I have a tabular or document NoSQL database that does not support transactions, and that I have a locations table/document with an is_active column. A user can have multiple locations, but only a SINGLE location with is_active: true. Now consider two cases:
The user wants to change their is_active location. How do we handle that? I need to set is_active to false for one row and to true for another.
The user wants to create a new location and set that location as is_active.
How do I handle this logic without transactions in a NoSQL database? Do I need to model my tables some other way?
Let's assume that:
I cannot use a SQL database
I should not use the transaction support provided by DBs like MongoDB
NOTE: These are a lot of assumptions and might not be real-world use cases, but I am just trying to understand HOW we should model NoSQL databases.
This is a typical data-modelling question for NoSQL systems. I won't reproduce the whole theory here, but these are links to check: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-relational-modeling.html and https://cassandra.apache.org/doc/latest/cassandra/data_modeling/index.html
The key takeaway is that you don't want to recreate a normalized, SQL-like schema in a NoSQL system. In your particular case, you have to explore the access patterns for reads and writes; those will help you define your data structures. For example, you may want to get rid of the "locations" table and keep locations as properties of users; this is almost a classical use case in the NoSQL world (see the sketch below).
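A minimal sketch of that idea, assuming pymongo and a users collection in which each user document embeds its locations plus a single active_location_id pointer (the collection and field names are illustrative, not prescriptive). Because everything lives in one document, both of your use cases become single-document updates, which MongoDB applies atomically without any multi-document transaction:

# A sketch under the assumptions above, not a definitive implementation.
from pymongo import MongoClient

users = MongoClient()["app"]["users"]

def set_active_location(user_id, location_id):
    # One document, one write: atomic even without transactions.
    # The filter also checks the location actually exists for this user.
    users.update_one(
        {"_id": user_id, "locations.location_id": location_id},
        {"$set": {"active_location_id": location_id}},
    )

def add_location_and_activate(user_id, location):
    # Push the new location and repoint active_location_id in the
    # same single-document update.
    users.update_one(
        {"_id": user_id},
        {"$push": {"locations": location},
         "$set": {"active_location_id": location["location_id"]}},
    )

With this model there is no per-row is_active flag to keep consistent: the active location is simply whichever one active_location_id points to.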
I have a monolithic application that currently uses a PostgreSQL DB, with the schemas set up as you would expect for most relational databases: various tables' data is linked back to the user via FKs on user_id.
I'm trying to learn more about microservices and am trying to migrate my Python API to a microservice architecture. I have a reasonable understanding of how I'm going to break up the larger app into smaller parts; however, I'm not entirely clear on how I'm supposed to deal with the data side of things.
I understand that one single large DB is against the general design principles of microservices, but I'm not clear on what the alternative would be.
My biggest concern is cascading across the individual databases that would hold microservice data. In a simple relational DB, I can just cascade on delete and the DB will handle the work across the various tables. In the case of microservices, how would that work? Would I need a separate service that handles deleting user data across the other services' DBs?
I don't really understand how I would migrate a traditional application with a relational DB to a microservice architecture.
EDIT:
To clarify - a specific architectural/design problem I'm facing is as follows:
I have split up my application into a few microservices. The ones that, in my mind, are still relational are:
Geolocation - A service that checks geometry data, records it in PostGIS, and returns certain information. A primary purpose is to record the location of a particular user for later reference.
Image - A simple upload service to upload images and store their metadata in the DB.
Load-Image - A simple service that returns a random set of images based on parameters such as location and on user profile data such as Age, Gender, etc.
Profile - A service that simply manages user data such as Age, Gender, etc.
Normally, these three items would each have a table in one larger DB rather than their own individual DBs. Filtering images by, say, location and age is a very simple JOIN and filter.
How would something like that work in a microservice architecture? If the data is held in entirely different DBs, how would I set up the logic to filter it? I could duplicate data that doesn't change often, like profile info, and add it to a MongoDB document containing the image data along with user_id and the profile data; however, location data can change regularly, and constant updates don't sound practical.
What would be the best approach? Or should I stick with a shared RDBMS for just those few services?
It comes down to the duplication of data: why we want it, and how we manage it.
Early in our careers we were taught to duplicate data for redundancy, for example in database replication or backups. We were also taught that data can be modelled in a relational manner, with constraints enforcing the integrity of the model. In fact, the integrity of the model is sacrosanct. Without integrity, how can you have consistency? The answer is that you can't. Kinda.
When you work with distributed systems and service orientation, you do so because you want to minimise interactions, thereby reducing coupling between components. However, there is a cost to this. The more distributed your architecture, the less coupling it has, and the more duplication of data will be necessary. This is taken to an extreme with microservices, where effectively the same data may be present in many different places, in varying degrees of consistency.
Far from being bad, in this context data duplication is an essential feature of your system. It is an enabler of an architectural style with many great benefits. Put another way: without duplication of data you get less distribution and more coupling, which makes your system more expensive to build, own, and change.
So, now that we understand duplication of data and why we want it, let's move on to how we manage having lots of it. Let's try an example:
In a relational database, say we have a table called Customers, which contains a customer ID and customer details, and another table called Orders, which contains the order ID, customer ID, and order details. Let's also say we have an ordering application which, for GDPR, needs to delete all of a customer's orders when the customer is deleted.
Because we are migrating our system to microservices, we decide to create a service called Customers.
So we create a service with the following operation:
DELETE /customers/{customerId} - deletes a customer
We create another service called Orders with the following operations:
GET /orders/customers/{customerId} - gets all the orders for a customer
DELETE /orders/{orderId} - deletes an order
We build a UX screen for deleting a customer. The UX first calls the orders service to get all the orders for the customer. It then iterates over the list of orders, calling the orders service to delete each one. Finally, it calls the customers service to delete the customer.
This example is very simplistic, but as you can see, there is no option but to orchestrate the "Delete Customer" operation from the caller, which in this case is the user interface. Of course, what would be a single atomic transaction in a database does not translate to multiple HTTP(S) calls, so it is possible that some of the calls will not succeed, leaving the system as a whole in an inconsistent state. In that case the inconsistency would need to be resolved via some recovery mechanism. A sketch of this orchestration follows.
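A minimal sketch of that orchestration in Python with requests, assuming hypothetical base URLs for the two services; error handling is reduced to raising on the first failed call, which is exactly the point at which a recovery mechanism would need to take over:

import requests

# Hypothetical service locations.
ORDERS = "https://orders.example.com"
CUSTOMERS = "https://customers.example.com"

def delete_customer(customer_id):
    # 1. Get all the orders for the customer.
    resp = requests.get(f"{ORDERS}/orders/customers/{customer_id}")
    resp.raise_for_status()

    # 2. Delete each order. A failure here leaves some orders deleted
    #    and some not: the inconsistent state described above.
    for order in resp.json():
        requests.delete(f"{ORDERS}/orders/{order['orderId']}").raise_for_status()

    # 3. Only once every order is gone, delete the customer itself.
    requests.delete(f"{CUSTOMERS}/customers/{customer_id}").raise_for_status()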
In a microservice architecture we have both options: database per service, or a shared database. There are advantages and disadvantages to both patterns. Database per service is the best practice, but when the monolithic application has lots of functions, procedures, or other database-specific features at the database level, we can use the shared-database approach. I know this is not the best practice; if you have the time and bandwidth, you should go for database per service.
As your concern is cascading over individual databases, you need to remove the cascading from the database, implement global transaction handling in your application, and execute all cascade-related queries from that transaction.
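A sketch of what that could look like for the shared-database case, in Python with psycopg2; the table names are made up, and the point is simply that the hand-rolled cascade runs inside one transaction, so a failure anywhere rolls the whole delete back:

import psycopg2

def delete_user(user_id):
    conn = psycopg2.connect("dbname=app")
    try:
        with conn:  # commits on success, rolls back on any exception
            with conn.cursor() as cur:
                # Cascade "by hand": children first, parent last.
                cur.execute("DELETE FROM images WHERE user_id = %s", (user_id,))
                cur.execute("DELETE FROM locations WHERE user_id = %s", (user_id,))
                cur.execute("DELETE FROM users WHERE id = %s", (user_id,))
    finally:
        conn.close()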
I am developing a multi-tenant store web application (software as a service) which will be used by many customers. I would like to use just one database. I would appreciate suggestions/feedback on how to go about this in the database:
Separate schemas for each customer. Whenever a new customer signs up, I create a separate schema.
A single schema for all the customers, with a CUSTOMER table whose customerId is referenced in all other tables (e.g. orders, payments, etc.). Whenever a new customer signs up, I create an entry in the CUSTOMER table.
In case you want to know what technologies are being used:
Postgres, Spring Boot MVC, REST, Maven, JPA.
Thanks.
There are major tradeoffs here. With customer IDs, your foreign keys become more complex (the customer ID should probably be part of every foreign key), and that means additional indexes. It also means you have to have some means of enforcing this restriction. The big issue is that bugs in your application can quite easily disclose material belonging to other customers.
With multiple schemas you have many more tables, which can cause performance problems for pg_dump in particular. On the other hand, with appropriate search paths it is a bit harder to compromise other clients' data. It is, however, harder to use with a connection pool.
In general I think the schema approach is better, because you can always scale out by partitioning by customer set, and the better security is important. However, it means you must have a good understanding of search_path and set it to a sensible value in every database transaction, as sketched below.
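A sketch of that discipline with psycopg2, assuming one schema per tenant named like tenant_<id>; SET LOCAL scopes the setting to the current transaction, which is what keeps pooled connections from leaking one tenant's search_path into another tenant's queries:

import psycopg2
from psycopg2 import sql

def fetch_orders(conn, tenant_schema):
    with conn:  # one transaction
        with conn.cursor() as cur:
            # SET LOCAL lasts only until the end of this transaction.
            cur.execute(
                sql.SQL("SET LOCAL search_path TO {}").format(
                    sql.Identifier(tenant_schema)))
            # Unqualified names now resolve inside the tenant's schema.
            cur.execute("SELECT * FROM orders")
            return cur.fetchall()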
I have only looked at Azure Table, but this may well apply to other NoSQL databases as well.
Say I have an entity consisting of the following properties:
First name - Last name - Hometown - Country
In Azure Table there is no concept of relations, so if I have thousands of rows and I want to change every entity that has 'Canada' in it to some other country, the store may have to go through thousands of rows to find the entities with 'Canada' and change them.
I wonder: is the benefit of NoSQL only for data that is static and never changed after it has been written? Or can this problem be solved for NoSQL stores?
In the case of NoSQL data stores, the advantages are different from those of SQL databases. Things like scalability and availability can be better in a NoSQL store like Azure Table, but there are tradeoffs. For example, you are generally unable to efficiently query on any part of a record, only on the keys.
In designing your schema for Azure Table, you have to consider the use cases of your data layer and let them dictate the schema. In this example, if I thought I would have to update all records in a given country, I would make the country part of the partition or row key. That way the query for all data in a given country is fast, and the data can be updated quickly; a sketch follows.
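A sketch with the azure-data-tables SDK, assuming country is the PartitionKey as suggested, and with a placeholder connection string and table name; note that keys are immutable in Azure Table, so "changing" an entity's country means re-inserting it under the new partition and deleting the old row:

from azure.data.tables import TableClient

table = TableClient.from_connection_string("<connection-string>", "people")

def move_country(old_country, new_country):
    # A partition-key query is served efficiently: no full-table scan.
    for entity in list(table.query_entities(f"PartitionKey eq '{old_country}'")):
        moved = dict(entity)
        moved["PartitionKey"] = new_country
        table.create_entity(moved)
        table.delete_entity(partition_key=old_country,
                            row_key=entity["RowKey"])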
I'm just exploring the whole NoSQL concept. I've been playing around with Amazon's DynamoDB and I really like the concept. That said, I am not quite sure how the data should be separated. By this I mean: should I create a new table for related data, like you would in a relational database, or should I use a single table to store all the application's data?
As an example, in a relational DB I might have a table called users and a table called users_details, and I would then create a 1:1 relationship between the two tables. With the NoSQL approach I could theoretically create two tables as well, but it strikes me as more efficient to have all the data in a single table.
If that is the case, then where do you stop? Is the idea to store all the application data for a given user in a single table?
First ask yourself: why did I separate the users from the user details in the RDBMS in the first place?
On a more general note, when working with NoSQL you really shouldn't look at relationships between tables. You shouldn't think about "joining" information from different tables, but rather about preparing your data in a way that lets it be retrieved optimally.
In the users/user_details scenario, put all the information in a single users table, and query what you want by specifying the attributes to get, as in the sketch below.
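A sketch of that with boto3, assuming a single users table keyed on user_id (the table, key, and attribute names are illustrative); ProjectionExpression pulls back only the attributes you ask for, which stands in for the users/users_details split you would have made in an RDBMS:

import boto3

users = boto3.resource("dynamodb").Table("users")

# Fetch just the "detail" attributes for one user; the rest of the
# item stays on the server.
response = users.get_item(
    Key={"user_id": "u-123"},
    ProjectionExpression="first_name, last_name, hometown",
)
item = response.get("Item")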