I'm accessing a MongoDB database using the repository pattern. Where should I check for data integrity? - mongodb

I'm kind of new to MongoDB and NoSQL data design in general.
I'm building a MongoDB database that will have some denormalized data. For example, my "User" documents contain references (just the ids) to zero or more "Article" documents, and my Article documents contain references to zero or more users.
Since I'm using the repository pattern, no single part of my Data Access Layer knows about both Articles and Users. Where in my code should I check to make sure that all my documents are consistent with each other? Should I simply let the users of the DAL do the checks?
Would it be a good idea to have a Data Integrity Script run once in a while to check if everything is consistent?

Here is Microsoft's write-up on the Repository Pattern. From that document:
Use a repository to separate the logic that retrieves the data and maps it to the entity model from the business logic that acts on the model.
You have a couple of questions:
Where in my code should I check to make sure that all my documents are consistent with each other?
Based on the statement above, I think it's clear that this logic belongs in the Repository. The relationship between these objects only exists at the business-logic layer; the database cannot enforce these types of rules.
Should I simply let the users of the DAL do the checks?
How could they? As the writer of the repository, you are the DAL user. For MongoDB, the DAL is basically the driver.
You could possibly write a wrapper around the driver that would wrap the multiple writes in some form of transaction. But you would have to write this yourself; MongoDB itself has no notion of transactions (multi-document transactions were only added much later, in MongoDB 4.0).
Would it be a good idea to have a Data Integrity Script run once in a while to check if everything is consistent?
At the end of the day, whoever writes the repository is going to be responsible for the integrity of the data. Such a script might be useful, but it would definitely suck a lot of CPU cycles.
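If it helps, here is a rough sketch of what such a check could look like with PyMongo. The collection and field names (users.article_ids, articles.user_ids) are just assumptions for illustration, not something taken from your schema:

# Periodic integrity check for the denormalized User <-> Article references.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["app"]

def find_dangling_references():
    """Yield (user_id, article_id) pairs where one side points at a missing document."""
    article_ids = {a["_id"] for a in db.articles.find({}, {"_id": 1})}
    for user in db.users.find({}, {"article_ids": 1}):
        for article_id in user.get("article_ids", []):
            if article_id not in article_ids:
                yield user["_id"], article_id

    user_ids = {u["_id"] for u in db.users.find({}, {"_id": 1})}
    for article in db.articles.find({}, {"user_ids": 1}):
        for user_id in article.get("user_ids", []):
            if user_id not in user_ids:
                yield user_id, article["_id"]

if __name__ == "__main__":
    for user_id, article_id in find_dangling_references():
        print(f"dangling reference: user={user_id} article={article_id}")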
My suggestion for N:M mappings is to start building some basic blocks for handling the multiple writes that are required to keep these two in sync. One idea is to queue the changes and let a background job make the updates. This way you don't have to worry about multiple writes and roll-backs causing bad data.
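And a minimal sketch of the queue idea, again with PyMongo and made-up names (pending_link_updates, article_ids, user_ids): the repository only records the intended link change, and a background job applies both writes. Because the updates use $addToSet, the job can be re-run safely after a failure:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["app"]

def enqueue_link(user_id, article_id):
    # The repository only records the intent to link the two documents.
    db.pending_link_updates.insert_one(
        {"user_id": user_id, "article_id": article_id, "done": False}
    )

def process_pending_links():
    # Background job: $addToSet makes both writes idempotent, so the job
    # can be retried safely if it crashes half way through.
    for job in db.pending_link_updates.find({"done": False}):
        db.users.update_one(
            {"_id": job["user_id"]},
            {"$addToSet": {"article_ids": job["article_id"]}},
        )
        db.articles.update_one(
            {"_id": job["article_id"]},
            {"$addToSet": {"user_ids": job["user_id"]}},
        )
        db.pending_link_updates.update_one({"_id": job["_id"]}, {"$set": {"done": True}})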

How to query axon aggregates

Is there a way to see the current state of the aggregates stored in axon?
Our application uses an Oracle-backed Axon event store.
I tried querying the domainevententry and snapshotevententry tables, but they are empty.
Is there a way to see the current state of the aggregates stored in axon?
In short, yes, although it is not recommended, at least if you are planning to employ CQRS. CQRS, or Command Query Responsibility Segregation, dictates that the Command Model and the Query Model are kept separate.
The aggregate support Axon delivers supplies an easy means to construct a Command Model. As the name suggests, it's intended for commands. On the flip side, you have Query Models, which are designed for queries. AxonIQ has this to say on CQRS; maybe that clarifies some things.
I tried querying the domainevententry and snapshotevententry tables, but they are empty.
That's interesting in its own right! When you publish events in Axon, either through the AggregateLifecycle#apply(Object...) or the EventGateway#publish(Object...) method, the published event should end up in your domain_event_entry table. If that's not the case, then either your JPA/JDBC configuration is missing something or some other exception is occurring in your application.
Would you be able to update your issue with samples of your configuration and/or stack traces that you are seeing?
Replaying production issues locally
What I've done in the past to be able to replay behaviour occurring in a production environment is to load the Aggregate's event stream from that environment into a local dev/test event store. To be able to query this, you only need the aggregate identifier. As the aggregate identifier is indexed, retrieving all events for a specific aggregate (otherwise known as the aggregate stream) is straightforward.
By doing so, I could run the application locally to flow through the aggregate step-by-step. This gave the benefit of knowing exactly which event caused what state change, leading to the problematic scenario.
However, why your events are not present in your domainevententry is unclear to me. If you're still facing issues with that, I still recommend that you update the question with more specifics on your project.

Relational DB in microservices

I have a monolithic application that currently uses a PostgreSQL DB, and the schemas are set up as you would expect for most relational databases, with various tables linked back to the user via FKs on user_id.
I'm trying to learn more about microservices and am trying to migrate my Python API to a microservice architecture. I have a reasonable understanding of how I'm going to break up the larger app into smaller parts; however, I'm not entirely clear on how I'm supposed to deal with the data side of things.
I understand that one single large DB is against general design principles of microservices but I'm not clear on what the alternative would be.
My biggest concern is cascading across the individual databases that would hold microservice data. In a single relational DB, I can just cascade on delete and the DB will handle the work across the various tables. In the case of microservices, how would that work? Would I need to have a separate service that handles deleting user data across the other services' DBs?
I don't really understand how I would migrate a traditional application with a relational DB to a microservice architecture.
EDIT:
To clarify - a specific architectural/design problem I'm facing is as follows:
I have split up my application into a few microservices. The ones that are in my mind still relational are:
Geolocation - A service that checks geometry data, records it in PostGIS, and returns certain information. A primary purpose is to record the location of a particular user for referencing later
Image - A simple upload service to upload images and store metadata in the DB.
Load-Image - A simple service that returns a random set of images based on parameters such as location and user profile data such as age, gender, etc.
Profile - A service that simply manages user data such as age, gender, etc.
Normally, these would each have a table in one larger DB rather than their own individual DBs. Filtering images by, say, location and age is a very simple JOIN and filter.
How would something like that work in a microservice architecture? If the data is held in entirely different DBs, how would I set up the logic to filter the data? I could duplicate data that doesn't change often, like profile info, and add it to a MongoDB document that would contain image data including user_id and profile data - however, location data can change regularly, and constant updates don't sound practical.
What would be the best approach? Or should I stick with a shared RDBMS for just those few services?
It comes down to the duplication of data, why we want it, and how we manage it.
Early in our careers we were taught about the duplication of data to make it redundant, for example in database replication or backups. We were also taught that data can be modelled in a relational manner, with constraints enforcing the integrity of the model. In fact, the integrity of the model is sacrosanct. Without integrity, how can you have consistency? The answer is that you can't. Kinda.
When you work with distributed systems and service orientation, you do so because you want to minimise interactions thereby reducing coupling between components. However, there is a cost to this. The more distributed your architecture, the less coupling it has, and the more duplication of data will be necessary. This is taken to an extreme with microservices, where effectively the same data may be present in many different places, in varying degrees of consistency.
Far from being bad, in this context data duplication is an essential feature of your system. It is an enabler of an architectural style with many great benefits. Put another way, without duplication of data you get less distribution and more coupling, which makes your system more expensive to build, own, and change.
So, now that we understand duplication of data and why we want it, let's move on to how we manage having lots of duplication. Let's try an example:
In a relational database, let's say we have a table called Customers, which contains a customer ID and customer details, and another table called Orders, which contains the order ID, customer ID, and the order details. Let's say we also have an ordering application, which needs to delete all of a customer's orders if the customer is deleted for GDPR.
Because we are migrating our system to microservices, we decide to create a service called Customers.
So we create a service with the following operation:
DELETE /customers/{customerId} - deletes a customer
We create another service called Orders with the following operations:
GET /orders/customers/{customerId} - gets all the orders for a customer
DELETE /orders/{orderId} - deletes an order
We build a UX screen for deleting a customer. The UX first calls the orders service to get all the orders for the customer. Then it iterates over the list of orders, calling the orders service to delete each order. Then it calls the customers service to delete the customer.
This example is very simplistic, but as you can see, there is no option but to orchestrate the "Delete Customer" operation from the caller, which in this case is the user interface. Of course, what would be a single atomic transaction in a database does not translate to multiple HTTP/s calls, so it is possible that some of the calls may not succeed, leaving the system as a whole in an inconsistent state. In this instance the inconsistency would need to be resolved via some recovery mechanism.
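To make that concrete, here is a rough sketch of the orchestration in Python with the requests library. The service base URLs and the "id" field on an order are made up; the routes simply mirror the operations listed above:

import requests

# Made-up base URLs for the two services described above.
ORDERS_URL = "http://orders-service"
CUSTOMERS_URL = "http://customers-service"

def delete_customer(customer_id):
    # 1. Get all the orders for the customer.
    resp = requests.get(f"{ORDERS_URL}/orders/customers/{customer_id}")
    resp.raise_for_status()
    orders = resp.json()

    # 2. Delete each order. If one of these calls fails, the system is left
    #    inconsistent and some recovery mechanism (retry, compensation) is needed.
    for order in orders:
        requests.delete(f"{ORDERS_URL}/orders/{order['id']}").raise_for_status()

    # 3. Finally delete the customer itself.
    requests.delete(f"{CUSTOMERS_URL}/customers/{customer_id}").raise_for_status()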
In a microservice architecture, we have both options: database per service or a shared database. There are advantages and disadvantages to both patterns. Database per service is the best practice, but when the monolithic application has lots of functions, stored procedures, or other database-specific features at the database level, then we can use the shared database approach. I know this is not the best practice; if you have the time and bandwidth, you should go for database per service.
As your concern is cascading across individual databases, you need to remove the cascading from the database, implement global transaction handling in your application, and execute all the cascade-related queries within that transaction.
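If you do end up with the shared-database approach for these services, an application-managed cascade could look roughly like this sketch with psycopg2 (the table names are assumptions):

import psycopg2

def delete_user(user_id):
    # Application-level cascade inside one transaction instead of ON DELETE CASCADE.
    conn = psycopg2.connect("dbname=app user=app")
    try:
        with conn:  # commits on success, rolls back if anything raises
            with conn.cursor() as cur:
                cur.execute("DELETE FROM orders WHERE user_id = %s", (user_id,))
                cur.execute("DELETE FROM images WHERE user_id = %s", (user_id,))
                cur.execute("DELETE FROM users WHERE id = %s", (user_id,))
    finally:
        conn.close()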

MongoDB data modelling: any drawbacks in using lots of databases?

I have recently moved part of the back-end of a web app to MongoDB. The web app itself is a validation tool, and the workflow looks like:
the user uploads a file (typically hundreds of thousands of lines)
the validator checks it, outputting a lot of messages (possibly more than one per line)
...and finally provides a few statistics
I modelled my application so that each user has its own DB containing:
The file (saved through GridFS)
A collection containing the messages (possibly over a million lines, in some cases)
A collection with the statistics
We have a few hundred users, so MongoDB will end up having a few hundred DBs.
Of course I could have held all the data in the same DB, using namespaces to separate data from different users. However, I felt it was handy to send the DB in the connection URI, and I found it more intuitive to issue a "drop database" statement to purge a user, rather than searching for and removing their data in one large DB.
I am pretty new to MongoDB, so my question is: is there any drawback in having several DBs in the same MongoDB instance? Or is there any special consideration that I should give to the problem?
I'm not familiar with MongoDB specifically. In general, opening a connection to a database is a relatively slow operation and ties up system resources. Whether this is enough to matter in your case I can't say.
Having a different db for each user would make it difficult to perform queries that access data for multiple users. Maybe you have no need to do this.
Still, I would think it would be a whole lot simpler in general to just put a user id in each record rather than create a separate database. What's the gain of separate databases? Okay, deleting a user means saying "drop database". But deleting a user from a single database should mean saying "delete from tableX where user=?; delete from tableY where user=?" etc for however many relevant tables you have. I can't imagine it's hundreds, right? Maybe half a dozen lines of code or so?
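To put the two options side by side, here is a PyMongo sketch; the collection names are made up, and note that a file stored through GridFS actually lives in the fs.files and fs.chunks collections, which would need their own cleanup:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

def purge_user_with_own_db(user_id):
    # One database per user: a single drop removes everything at once.
    client.drop_database(f"user_{user_id}")

def purge_user_in_shared_db(user_id):
    # One shared database: one delete per collection that carries a user_id field.
    db = client["validator"]
    for name in ("messages", "statistics"):
        db[name].delete_many({"user_id": user_id})
    # Files saved through GridFS live in fs.files / fs.chunks and need their
    # own cleanup (e.g. via the gridfs module), which is omitted here.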

Entity Framework Self Tracking Entities - Synchronize between 2 databases

I am using Self Tracking Entities with Entity Framework 4. I have 2 databases with the exact same schema. However, tables in one database will be added to/edited, etc. (and I mean data will be added/edited, not the actual table definitions), and at certain points of the day I will need to synchronize all the changes between this database and the other database.
I can create a separate context for both of them. But if I read a large graph from one database, how can I update the other database with the graph? Is there an easy way?
My database model is large and complex and fully relational. So it would be a big job to go through every single entity and do a read from the other database to see if it exists or not, update/insert it if need be, and then carry this on through the full object graph!
Any ideas?
This is not a use case for EF. In EF you will have to do exactly what you've described. Self-tracking entities are able to track changes to these object instances - they know nothing about changes made to their own database over time, and they will not know anything about the state of your second database either.
Try looking at SQL Server's native features (including mirroring, transaction log shipping, or SSIS) and the MS Sync Framework. Depending on your detailed requirements, these tools may suit you better.

CQRS: Synchronizing the Write and Read databases

Can anyone please give me some direction with regard to the various ways to synchronize the Write and Read databases?
What are the different technologies out there, and how do you evaluate each in terms of reliability, performance, cost to implement, etc.?
Typically in CQRS, the write DB is used to store transitional data for long-running processes (sagas). If you are synchronizing the read and write DB (I'm assuming you mean both ways), you might be doing something wrong.
For a long-running process where a service expects multiple messages, it needs a way to temporarily store data before all the messages arrive. An example of this is customer registration where an approval from a manager, which takes a week to process, is required. The service needs a way to temporarily store the customer information before the approval arrives. This is where the write DB is used to store this piece of temporary data. Note that before the customer is approved, nothing is written to the read DB yet.
When the approval finally arrives, the service will take the customer information from the write DB, complete the registration process and write it to the read DB. At this time, the temporary customer information in the write DB has done its job and can be removed from the write DB. Notice that there isn't any two-way sync'ing involved.
For a simpler process, such as changing a customer's first name, the change can be written to the read DB right away. Writing to the write DB is not required because there is no temporary data in this case.
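A minimal sketch of that registration flow, assuming two PyMongo databases named write_db and read_db and made-up collection names, might look like this:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
write_db = client["write_db"]
read_db = client["read_db"]

def handle_registration_requested(customer):
    # Nothing is written to the read DB yet; park the data while waiting for approval.
    write_db.pending_registrations.insert_one(customer)

def handle_manager_approved(customer_id):
    pending = write_db.pending_registrations.find_one({"_id": customer_id})
    if pending is None:
        return  # already processed, or never requested

    # Complete the registration: project the customer into the read DB ...
    read_db.customers.insert_one(pending)
    # ... and the temporary record in the write DB has done its job.
    write_db.pending_registrations.delete_one({"_id": customer_id})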
The query model need not be immediately consistent; it needs to be eventually consistent. The query model is also the view model, i.e. its tables are already joined as per the requirements of the user interface. So you can even use an in-memory cache, or something like Redis.
The command side deals with command objects, which contain all the relevant information needed to update the database. These objects may fill up a messaging queue. The command objects are processed by a command processor, which transactionally updates the query cache and the write database. The write database can be an RDBMS, but, as is apparent, it should be write-optimized, like MongoDB.
You can update the read database via a messaging system too.
Some good messaging systems for this purpose are RabbitMQ and 0MQ.
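As a rough sketch of that shape in Python, with a plain in-process queue standing in for RabbitMQ/0MQ and dictionaries standing in for the write database and the query cache:

import queue
from dataclasses import dataclass

@dataclass
class ChangeCustomerName:
    customer_id: str
    first_name: str

commands = queue.Queue()   # stand-in for RabbitMQ / 0MQ
write_store = {}           # stand-in for the write-optimized database
query_cache = {}           # stand-in for Redis / an in-memory view model

def command_processor():
    # Drains the queue and applies each command to both stores. In a real
    # system these two updates would need to happen transactionally.
    while True:
        cmd = commands.get()
        if cmd is None:  # sentinel used to stop the worker
            break
        customer = write_store.setdefault(cmd.customer_id, {})
        customer["first_name"] = cmd.first_name
        query_cache[cmd.customer_id] = dict(customer)
        commands.task_done()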
If you, like me, see the read store as the DB that the query service uses (and which is denormalized), and the write DB as the database where the domain events are stored, then if you need to sync them to a particular moment, what you can do is just replay the events that you have stored.
In the case where you want to be as up to date as possible, you don't need to restrict the replay by version.
If you are using CQRS, then you will probably have a repository that looks somewhat like this:
public interface IRepository<T> where T : AggregateRoot, new()
{
    void Save(AggregateRoot aggregate, int expectedVersion);
    T GetById(Guid id);
    T GetById(Guid id, int version);
}
Hope this helps
Cheers