Commit to a log like Kafka + database with ACID properties? - postgresql

I'm planning in test how make this kind of architecture to work:
http://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/
Where all the data is stored as facts in a log, but the validations when posted a change must be against a table. For example, If I send a "Create Invoice with Customer 1" I will need to validate if the customer exist and other stuff, then when the validation pass commit to the log and put the current change to the table, so the table have the most up-to-date information yet I have all the history of the changes.
I could put the logs into the database in a table (I use PostgreSql). However I'm concerned about the scalability of doing that, also, I wish to suscribe to the event stream from multiple clients and PG neither other RDBMS I know let me to do this without polling.
But if I use Kafka I worry about the ACID between both storages, so Kafka could get wrong data that PG rollback or something similar.
So:
1- Is possible to keep consistency between a RDBMS and a log storage OR
2- Is possible to suscribe in real time and tune PG (or other RDBMS) for fast event storage?

Easy(1) answers for provided questions:
Setting up your transaction isolation level properly may be enough to achieve consistency and not worry about DB rollbacks. You still can occasionally create inconsistency, unless you set isolation level to 'serializable'. Even then, you're guaranteed to be consistent, but still could have undesirable behaviors. For example, client creates a customer and puts an invoice in a rapid succession using an async API, and invoice event hits your backed system first. In this case invoice event would be invalidated and a client will need to retry hoping that customer was created by that time. Easy to avoid if you control clients and mandate them to use sync API.
Whether it is possible to store events in a relational DB depends on your anticipated dataset size, hardware and access patterns. I'm a big time Postgres fan and there is a lot you can do to make event lookups blazingly fast. My rule of thumb -- if your operating table size is below 2300-300GB and you have a decent server, Postgres is a way to go. With event sourcing there are typically no joins and a common access pattern is to get all events by id (optionally restricted by time stamp). Postgres excels at this kind of queries, provided you index smartly. However, event subscribers will need to pull this data, so may not be good if you have thousands of subscribers, which is rarely the case in practice.
"Conceptually correct" answer:
If you still want to pursue streaming approach and fundamentally resolve race conditions then you have to provide event ordering guarantees across all events in the system. For example, you need to be able to order 'add customer 1' event and 'create invoice for customer 1' event so that you can guarantee consistency at any time. This is a really hard problem to solve in general for a distributed system (see e.g. vector clocks). You can mitigate it with some clever tricks that would work for your particular case, e.g. in the example above you can partition your events by 'customerId' early as they hit backend, then you can have a guarantee that all event related to the same customer will be processed (roughly) in order they were created.
Would be happy to clarify my points if needed.
(1) Easy vs simple: mandatory link

Related

Understanding CQRS and EventSourcing

I read several blogs and watched video about usefulness of CQRS and ES. I am left with implementation confusion.
CQRS: when use separate table, one for "Write, Update and delete" and other for Read operation. So then how the data sync from write table to read table. Do we required to use cron job to sync data to read only table from write table or any other available options ?
Event Sourcing: Do we store only all Immutable sequential operation as record for each update happened upon once created in one storage. Or do we also store mutable record I mean the same record is updated in another storage
And Please explain RDBMS, NoSQL and Messaging to be used and where they fit into it
when use separate table, one for "Write, Update and delete" and other for Read operation. So then how the data sync from write table to read table.
You design an asynchronous process that understands how to transform the data from its "write" representation to its "read" representation, and you design a scheduler to decide when that asynchronous process runs.
Part of the point is that it's just plumbing, and you can choose whatever plumbing you want that satisfies your operational needs.
Event Sourcing
On the happy path, each "event stream" is a append only sequence of immutable events. In the case where you are enforcing a domain invariant over the contents of the stream, you'll normally have a "first writer wins" conflict policy.
But "the" stream is the authoritative copy of the events. There may also be non-authoritative copies (for instance, events published to a message bus). They are typically all immutable.
In some domains, where you have to worry about privacy and "the right to be forgotten", you may need affordances that allow you to remove information from a previously stored event. Depending on your design choices, you may need mutable events there.
RDBMS
For many sorts of queries, especially those which span multiple event streams, being able to describe the desired results in terms of relations makes the programming task much easier. So a common design is to have asynchronous process that read information from the event streams and update the RDBMS. The usual derived benefit is that you get low latency queries (but the data returned by those queries may be stale).
RDBMS can also be used as the core of the design of the event store / message store itself. Events are common written as blob data, with interesting metadata exposed as additional columns. The message store used by eventide-project is based on postgresql.
NoSQL
Again, can potentially be used as your cache of readable views, or as your message store, depending on your needs. Event Store would be an example of a NoSQL message store.
Messaging
Messaging is a pattern for temporal decoupling; the ability to store/retrieve messages in a stable central area affords the ability to shut down a message producer without blocking the message consumer, and vice versa. Message stores also afford some abstraction - the producer of a message doesn't necessarily know all of the consumers, and the consumer doesn't necessarily know all of the producers.
My Question is about Event Sourcing. Do we required only immutable sequence events to be stored and where to be stored ?
In event sourcing, the authoritative representation of the state is the sequence of events - your durable copy of that event sequence is the book of truth.
As for where they go? Well, that is going to depend on your architecture and storage choices. You could manage files on disk yourself, you could write them in to your own RDBMS; you could use an RDBMS designed by somebody else, you could use a NoSQL document store, you could use a dedicated message store.
There could be multiple stores -- for instance, in a micro service architecture, the service that accepts orders might be different from the service that tracks order fulfillment, and they could each be writing events into different storage appliances.

Do Firebase/Firestore Transactions create internal queues?

I'm wondering if transactions (https://firebase.google.com/docs/firestore/manage-data/transactions) are viable tools to use in something like a ticketing system where users maybe be attempting to read/write to the same collection/document and whoever made the request first will be handled first and second will be handled second etc.
If not what would be a good structure for such a need with firestore?
Transactions just guarantee atomic consistent update among the documents involved in the transaction. It doesn't guarantee the order in which those transactions complete, as the transaction handler might get retried in the face of contention.
Since you tagged this question with google-cloud-functions (but didn't mention it in your question), it sounds like you might be considering writing a database trigger to handle incoming writes. Cloud Functions triggers also do not guarantee any ordering when under load.
Ordering of any kind at the scale on which Firestore and other Google Cloud products operate is a really difficult problem to solve (please read that link to get a sense of that). There is not a simple database structure that will impose an order where changes are made. I suggest you think carefully about your need for ordering, and come up with a different solution.
The best indication of order you can get is probably by adding a server timestamp to individual documents, but you will still have to figure out how to process them. The easiest thing might be to have a backend periodically query the collection, ordered by that timestamp, and process things in that order, in batch.

Transactional guarantee in mongodb

So, I am doing research on MongoDB, in line with upper management decision to embrace open source and to migrate existing product database from SQL Server to MongoDB and revamp the entire thing. Do note that our database should focus on data consistency and transactional guarantee.
And i discover this post: Click here. A summary of the post is as follow:
MongoDB claims to be strongly consistent, but a lot of evidence
recently has shown this not to be the case in certain scenarios (when
network partitioning occurs, which can happen under heavy load). This
means that you can potentially lose records that MongoDB has
acknowledged as "successfully written".
In terms of your application, if you have a need to have transactional
guarantees (meaning if you can't make a durable write, you need the
transaction to fail), you should avoid MongoDB. Example scenarios
where strong consistency and durability are essential include "making
a deposit into a bank account" or "creating a record of birth". To put
it another way, these are scenarios where you would get punched in the
face by your customer if you indicated an operation succeeded and it
didn't.
So, my question is as follow:
1) To what extend does "lost data" still valid in current version of MongoDB?
2) What approach can be take to ensure transactional guarantee in MongoDB?
I am pretty sure that if company like PayPal do use MongoDB, there is certainly a way of overcoming these issue.
The references in that post have been discussed here before (for example, here is one: MongoDB: Does Write Concern guarantee the write on primary plus atleast one secondary ). No need to duplicate your question.
The blog "Aphyr" mostly uses these articles to tout it's own tech (if you read the entire blog you will realise they have their own database which they market). Every database they show loses data except their own.
2) What approach can be take to ensure transactional guarantee in MongoDB?
I agree you should be handling database problems in client code, if not then how is your client side ever going to remain consistent in the event of partitions?
Since you are not Harry Potter (are you?) I will say that you need to check for exceptions thrown in your client code and react to them as needed.
1) To what extend does "lost data" still valid in current version of MongoDB?
As for the bug he mentions in 2.4.3: he fails (as I mention in the linked post) to state the bug reference so again, no comment still.
Besides 2 writes in 6,000? That's less data loss than I have seen in MySQL on a partition! So not too shabby.
I have not noticed such behaviour in my own app and, from small to extremely large sites, I have not noticed anyone reproduce benchmark type scenarios as displayed in that article, I doubt very much you will.
I am pretty sure that if company like PayPal do use MongoDB, there is certainly a way of overcoming these issue.
They would have tight coding to ensure consistency in distributed environments.
Of course, they would start by choosing the right tech for the situation...
Write Concern Reference
Write concern describes the level of acknowledgement requested from MongoDB for write operations to a standalone mongod or to replica sets or to sharded clusters. In sharded clusters, mongos instances will pass the write concern on to the shards.
https://docs.mongodb.org/v3.0/reference/write-concern/

SQL vs NoSQL for an inventory management system

I am developing a JAVA based web application. The primary aim is to have inventory for products being sold on multiple websites called channels. We will act as manager for all these channels.
What we need is:
Queues to manage inventory updates for each channel.
Inventory table which has a correct snapshot of allocation on each channel.
Keeping Session Ids and other fast access data in a cache.
Providing a facebook like dashboard(XMPP) to keep the seller updated asap.
The solutions i am looking at are postgres(our db till now in a synchronous replication mode), NoSQL solutions like Cassandra, Redis, CouchDB and MongoDB.
My constraints are:
Inventory updates cannot be lost.
Job Queues should be executed in order and preferably never lost.
Easy/Fast development and future maintenance.
I am open to any suggestions. thanks in advance.
Queues to manage inventory updates for each channel.
This is not necessarily a database issue. You might be better off looking at a messaging system(e.g. RabbitMQ)
Inventory table which has a correct snapshot of allocation on each channel.
Keeping Session Ids and other fast access data in a cache.
session data should probably be put in a separate database more suitable for the task(e.g. memcached, redis, etc)
There is no one-size-fits-all DB
Providing a facebook like dashboard(XMPP) to keep the seller updated asap.
My constraints are:
1. Inventory updates cannot be lost.
There are 3 ways to answer this question:
This feature must be provided by your application. The database can guarantee that a bad record is rejected and rolled back, but not guarantee that every query will get entered.
The app will have to be smart enough to recognize when an error happens and try again.
some DBs store records in memory and then flush memory to disk peridocally, this could lead to data loss in the case of a power failure. (e.g Mongo works this way by default unless you enable journaling. CouchDB always appends to the records(even a delete is a flag appended to the record so data loss is extremely difficult))
Some DBs are designed to be extremely reliable, even if an earthquake, hurricane or other natural disaster strikes, they remain durable. these include Cassandra, Hbase, Riak, Hadoop, etc
Which type of durability are your referring to?
Job Queues should be executed in order and preferably never lost.
Most noSQL solutions prefer to run in parallel. so you have two options here.
1. use a DB that locks the entire table for every query(slower)
2. build your app to be smarter or evented(client side sequential queuing)
Easy/Fast development and future maintenance.
generally, you will find that SQL is faster to develop at first, but changes can be harder to implement
noSQL may require a little more planning, but is easier to do ad hoc queries or schema changes.
The questions you probably need to ask yourself are more like:
"Will I need to have intense queries or deep analysis that a Map/Reduce is better suited to?"
"will I need to my change my schema frequently?
"is my data highly relational? in what way?"
"does the vendor behind my chosen DB have enough experience to help me when I need it?"
"will I need special feature such as GeoSpatial indexing, full text search, etc?"
"how close to realtime will I need my data? will it hurt if I don't see the latest records show up in my queries until 1sec later? what level of latency is acceptable?"
"what do I really need in terms of fail-over"
"how big is my data? will it fit in memory? will it fit on one computer? is each individual record large or small?
"how often will my data change? is this an archive?"
If you are going to have multiple customers(channels?) each with their own inventory schemas, a document based DB might have it's advantages. I remember one time I looked at an ecommerce system with inventory and it had almost 235 tables!
Then again, if you have certain relational data, a SQL solution can really have some advantages too.
I can certainly see how I could build a solution using mongo, couch, riak or orientdb with the given constraints. But as for which is the best? I would try talking directly DB vendors, and maybe watch the nosql tapes
Addressing your constraints:
Most NoSQL solutions give you a configurable tradeoff of consistency vs. performance. In MongoDB, for instance, you can decide how durable a write should be. If you want to, you can force the write to be fsync'ed on all your replica set servers. On the other extreme, you can choose to send the command and don't even wait for the server's response.
Executing job queues in order seems to be an application code issue. I'd say a timestamp in the db and an order by type of query should do for most applications. If you have multiple application servers and your queues need to be perfect, you'd have to use a truly distributed algorithm that provides ordering, but that is not a typical requirement, and it's very tricky indeed.
We've been using MongoDB for some time now, and I'm convinced this gives your app development speed a real boost. There's no big difference in maintenance, maintaining data is a pain either way. Not having a schema gives you added flexibility (lazy migrations), but it's more elaborate and requires some care.
In summary, I'd say you can do it both ways. The NoSQL is more code driven, and transactions and relational integrity are mostly managed by your code. If you're uncomfortable with that, go for a relational DB.
However, if you're data grows huge, you'll have to code some of this logic manually because you probably wouldn't want to do real-time joins on a 10B row database. Still, you can implement that with SQL as well.
A good way to find the boundary for different databases is to consider what you can cache. Data that can be cached and reconstructed at any time are a great way to start introducing a new layer, because there's no big risks there. Also, cached data usually doesn't keep any relations so you're not sacrificing any consistency here.
NoSQL is not correct for this application.
I mean, you can use it sure, but you will end up re-implementing a lot of what SQL offers for you. For example I see a lot of relations there. You also want ACID (although some NoSQL solutions do offer that).
There is no reason you can't use both - keep relational data in relational databases, and non-relational data in key/value stores.

Is a document/NoSQL database a good candidate for storing a balance sheet?

If I were to create a basic personal accounting system (because I'm like that - it's a hobby project about a domain I'm familiar enough with to avoid getting bogged-down in requirements), would a NoSQL/document database like RavenDB be a good candidate for storing the accounts and more importantly, transactions against those accounts? How do I choose which entity is the "document"?
I suspect this is one of those cases were actually a SQL database is the right fit and trying to go NoSQL is the mistake, but then when I think of what little I know of CQRS and event sourcing, I wonder if the entity/document is actually the Account, and the transactions are Events stored against it, and that when these "events" occur, maybe my application also then writes out to a easily queryable read store like a SQL database.
Many thanks in advance.
Personally think it is a good idea, but I am a little biased because my full time job is building an accounting system which is based on CQRS, Event Sourcing, and a document database.
Here is why:
Event Sourcing and Accounting are based on the same principle. You don't delete anything, you only modify. If you add a transaction that is wrong, you don't delete it. You create an offset transaction. Same thing with events, you don't delete them, you just create an event that cancels out the first one. This means you are publishing a lot of TransactionAddedEvent.
Next, if you are doing double entry accounting, recording a transaction is different than the way your view it on a screen (especially in a balance sheet). Hence, my liking for cqrs again. We can store the data using correct accounting principles but our read model can be optimized to show the data the way you want to view it.
In a balance sheet, you want to view all entries for a given account. You don't want to see the transaction because the transaction has two sides. You only want to see the entry that affects that account.
So in your document db you would have an entries collection.
This makes querying very easy. If you want to see all of the entries for an account you just say SELECT * FROM Entries WHERE AccountId = 1. I know that is SQL but everyone understands the simplicity of this query. It just as easy in a document db. Plus, it will be lightning fast.
You can then create a balance sheet with a query grouping by accountid, and setting a restriction on the date. Notice no joins are needed at all, which makes a document db a great choice.
Theory and Architecture
If you dig around in accounting theory and history a while, you'll see that the "documents" ought to be the source documents -- purchase order, invoice, check, and so on. Accounting records are standardized summaries of those usually-human-readable source documents. An accounting transaction is two or more records that hit two or more accounts, tied together, with debits and credits balancing. Account balances, reports like a balance sheet or P&L, and so on are just summaries of those transactions.
Think of it as a layered architecture -- the bottom layer, the foundation, is the source documents. If the source is electronic, then it goes into the accounting system's document storage layer -- this is where a nosql db might be useful. If the source is a piece of paper, then image it and/or file it with an index number that is then stored in the accounting system's document layer. The next layer up is digital records summarizing those documents; each document is summarized by one or more unbalanced transaction legs. The next layer up is balanced transactions; each transaction is composed of two or more of those unbalanced legs. The top layer is the financial statements that summarize those balanced transactions.
Source Documents and External Applications
The source documents are the "single source of truth" -- not the records that describe them. You should always be able to rebuild the entire db from the source documents. In a way, the db is just an index into the source documents in the first place. Way too many people forget this, and write accounting software in which the transactions themselves are considered the source of truth. This causes a need for a whole 'nother storage and workflow system for the source documents themselves, and you wind up with a typical modern corporate mess.
This all implies that any applications that write to the accounting system should only create source documents, adding them to that bottom layer. In practice though, this gets bypassed all the time, with applications directly creating transactions. This means that the source document, rather than being in the accounting system, is now way over there in the application that created the transaction; that is fragile.
Events, Workflow, and Digitizing
If you're working with some sort of event model, then the right place to use an event is to attach a source document to it. The event then triggers that document getting parsed into the right accounting records. That parsing can be done programatically if the source document is already digital, or manually if the source is a piece of paper or an unformatted message -- sounds like the beginnings of a workflow system, right? You still want to keep that original source document around somewhere though. A document db does seem like a good idea for that, particularly if it supports a schema where you can tie the source documents to their resulting parsed and balanced records and vice versa.
You can certainly create such a system.
In that scenario, you have the Account Aggregate, and you also have the TimePeriod Aggregate.
The time period is usually a Month, a Quarter or a Year.
Inside each TimePeriod, you have the Transactions for that period.
That means that loading the current state is very fast, and you have the full log in which you can go backward.
The reason for TimePeriod is that this is usually the boundary in which you actually think about such things.
In this case, a relational database is the most appropriate, since you have relational data (eg. rows and columns)
Since this is just a personal system, you are highly unlikely to have any scale or performance issues.
That being said, it would be an interesting exercise for personal growth and learning to use a document-based DB like RavenDB. Traditionally, finance has always been a very formal thing, and relational databases are typically considered more formal and rigorous than document databases. But, like you said, the domain for this application is under your control, and is fairly straight forward, so complexity and requirements would not get in the way of designing the system.
If it was my own personal pet project, and I wanted to learn more about a new-ish technology and see if it worked in a particular domain, I would go with whatever I found interesting and if it didn't work very well, then I learned something. But, your mileage may vary. :)