Using both XA and non-XA datasource for the same database - jboss5.x

There are some places in our project where we need to use XA transactions. But in most of the project the regular non-XA datasource will do. I've been wondering, do I need to define 2 versions of the datasource, XA and non-XA, for the same database? I'm afraid that XA transactions could be costly, therefore I'd like to avoid them if possible.

That's a very reasonable approach if you are worried about performance, as the two-phase commit (2PC) protocol can be up to 4 times slower. Of course, this needs to be qualified:
The slowdown depends on the number of resource managers participating in the XA transaction.
2PC itself is more expensive, as it requires more round trips.
In addition transactions can go into the in-doubt state when a failure occurs, which essentially locks all records that were changed as part of the transaction until they are recovered. By not using XA this can be avoided.
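To make the round-trip argument concrete, here is a toy sketch of a 2PC coordinator in Python. It is not how JBoss or any real transaction manager is implemented; the `Participant` class, its methods and the message counter are all hypothetical names made up for illustration. It only shows that every resource is contacted twice (prepare, then commit), whereas a local commit needs one message. A participant sitting in the "prepared" state is the in-doubt window mentioned above.

```python
class Participant:
    """Toy resource manager that counts the messages it receives."""

    def __init__(self, name, vote_yes=True):
        self.name = name
        self.vote_yes = vote_yes
        self.round_trips = 0
        self.state = "active"

    def prepare(self):
        # Phase 1: the resource promises it can commit and keeps its locks.
        self.round_trips += 1
        self.state = "prepared" if self.vote_yes else "aborted"
        return self.vote_yes

    def commit(self):
        self.round_trips += 1
        self.state = "committed"

    def rollback(self):
        self.round_trips += 1
        self.state = "aborted"


def two_phase_commit(participants):
    """Phase 1: ask everyone to prepare. Phase 2: commit only if all voted yes."""
    votes = [p.prepare() for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


def local_commit(participant):
    """A non-XA commit: a single message to a single resource."""
    participant.commit()
    return True
```

With two participants, `two_phase_commit` costs two messages per participant, while `local_commit` costs one in total. If the coordinator fails between the two phases, the prepared participants are the ones left waiting.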

How much does transactions for all write operations increase data consistency?

I use gorm for ORM with PostgreSQL and I noticed this in its documentation:
GORM perform write (create/update/delete) operations run inside a transaction to ensure data consistency, you can disable it during initialization if it is not required, you will gain about 30%+ performance improvement after that
https://gorm.io/docs/transactions.html
Consistency is important for my use case but I'm wondering if this is really necessary, and worth the performance hit
There is a saying: you can make it arbitrarily fast, if you don't have to do it correctly. What good are broken data?
But in this case, I have to doubt the claim. In an ACID compliant relational database, you always pay the price for transactional processing.
By default, every statement in PostgreSQL runs in its own transaction. So if you start an explicit transaction that spans several data modifying statements, you actually gain performance, since you don't have to pay the price for a commit as often.
The only consideration is the network latency you incur twice when sending BEGIN; and COMMIT;. But if you have high network latency, you can say goodbye to OLTP performance anyway.
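As a small illustration of the difference between per-statement commits and one explicit transaction, here is a sketch in Python. It uses the standard-library sqlite3 module as a stand-in, purely so that it runs without a server; that choice and the `items` table are assumptions, not part of the question. With PostgreSQL, the same BEGIN and COMMIT statements would go through your driver.

```python
import sqlite3

# isolation_level=None puts the sqlite3 module in autocommit mode,
# so transactions are controlled only by explicit BEGIN/COMMIT.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE items (name TEXT)")

# Autocommit: each statement is its own transaction (one commit per row).
conn.execute("INSERT INTO items VALUES ('a')")
conn.execute("INSERT INTO items VALUES ('b')")

# Explicit transaction: two statements, but only one commit.
conn.execute("BEGIN")
conn.execute("INSERT INTO items VALUES ('c')")
conn.execute("INSERT INTO items VALUES ('d')")
conn.execute("COMMIT")

row_count = conn.execute("SELECT count(*) FROM items").fetchone()[0]
```

Both variants store the same rows; the second one pays for a single commit and additionally guarantees that 'c' and 'd' become visible together or not at all.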

Making multiple users access to PSQL database

I'm a rookie in this topic. All I have ever done is connect to a database as a single user, so I'm not familiar with giving multiple users access to a database.
My case is: 10 facilities will use my program to record when workers arrive and leave. The database will be on the main server, and so far I have created only one database user, which I used while programming and testing. My question is: can multiple remote locations connect to the database as the same user (there should be no collisions, because they all write different data, although to the same tables)? If not, what should I do?
Good relational databases handle this quite well. It is the “I” in the so-called ACID properties of transactions in relational databases; it stands for isolation.
Concurrent processes are protected from simultaneously writing the same table row by locks that block other transactions until one transaction is done writing.
Readers are protected from concurrent writing by means of multiversion concurrency control (MVCC), which keeps old versions of the data around to serve readers without blocking anybody.
If you have enclosed all data modifications that belong together into a transaction, so that they happen atomically (the “A” in ACID), and your transactions are simple and short, your application will probably work just fine.
Problems may arise if these conditions are not satisfied:
If your data modifications are not protected by transactions, a concurrent session may see intermediate, incomplete results of a different session and thus work with inconsistent data.
If your transactions are complicated, later statements inside a transaction may rely on results of previous statements in indirect ways. This assumption can be broken by concurrent activity that modifies the data. There are three approaches to that:
Pessimistic locking: lock all data the first time you use them with something like SELECT ... FOR UPDATE so that nobody can modify them until your transaction is done.
Optimistic locking: don't lock, but whenever you access the data a second time, check that nobody else has modified them in the meantime. If that has been the case, roll the transaction back and try it again.
Use high transaction isolation levels like REPEATABLE READ and SERIALIZABLE which give better guarantees that the data you are using don't get modified concurrently. You have to be prepared to receive serialization errors if the database cannot keep the guarantees, in which case you have to roll the transaction back and retry it.
These techniques achieve the same goal in different ways. The discussion when to use which one exceeds the scope of this answer.
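As an illustration of the optimistic approach, here is a minimal Python sketch. It uses a version column, which is one common way to detect that a row changed since it was read; the answer does not prescribe this particular mechanism. The `account` table, the helper names and the use of the standard-library sqlite3 module (instead of PostgreSQL) are all assumptions made so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO account VALUES (1, 100, 0)")
conn.commit()


def read_account(conn, account_id):
    """Return (balance, version) as currently stored."""
    return conn.execute(
        "SELECT balance, version FROM account WHERE id = ?", (account_id,)
    ).fetchone()


def write_if_unchanged(conn, account_id, new_balance, seen_version):
    """Update only if the row still has the version we read; report success."""
    cur = conn.execute(
        "UPDATE account SET balance = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_balance, account_id, seen_version),
    )
    conn.commit()
    return cur.rowcount == 1


balance, version = read_account(conn, 1)
# Simulate a concurrent session changing the row after we read it.
write_if_unchanged(conn, 1, 150, version)
# Our write is based on the stale version, so it is rejected.
stale_ok = write_if_unchanged(conn, 1, balance - 30, version)
# Re-read and retry with the fresh version.
balance, version = read_account(conn, 1)
retry_ok = write_if_unchanged(conn, 1, balance - 30, version)
```

The pessimistic variant would instead take the lock up front (SELECT ... FOR UPDATE in PostgreSQL), so the conflicting session waits rather than being rejected.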
If your transactions are complicated and/or take a long time (long transactions are to be avoided as much as possible, because they cause all kinds of problems in a database), you may encounter a deadlock, which is two transactions locking each other in a kind of “deadly embrace”.
The database will detect this condition and interrupt one of the transactions with an error.
There are two ways to deal with that:
Avoid deadlocks by always locking resources in a certain order (e.g., always update the account with the lower account number first).
When you encounter a deadlock, your code has to retry the transaction.
Contrary to common belief, a deadlock is not necessarily a bug.
I recommend that you read the chapter about concurrency control in the PostgreSQL documentation.
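The "retry the transaction" advice, which applies both to deadlocks and to serialization errors, can be sketched as a small wrapper. This is only an outline under stated assumptions: `RetryableTransactionError` is a hypothetical stand-in for whatever your driver raises (PostgreSQL reports these conditions as SQLSTATE 40P01 and 40001), and `run_with_retry` and `flaky_transfer` are made-up names.

```python
import time


class RetryableTransactionError(Exception):
    """Hypothetical stand-in for a driver's deadlock or serialization error."""


def run_with_retry(transaction, max_attempts=3, base_delay=0.01):
    """Run `transaction()`; retry it when it fails with a retryable error."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except RetryableTransactionError:
            if attempt == max_attempts:
                raise
            # Back off a little before retrying. Real code would also
            # roll the failed transaction back before this point.
            time.sleep(base_delay * attempt)


attempts = []


def flaky_transfer():
    """Simulated transaction that hits a deadlock on its first two runs."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RetryableTransactionError("deadlock detected")
    return "committed"


result = run_with_retry(flaky_transfer)
```

The callable passed in has to contain the whole transaction, from the first statement to the commit, because after a rollback all of it must run again.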

Commit to a log like Kafka + database with ACID properties?

I'm planning to test how to make this kind of architecture work:
http://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/
All the data is stored as facts in a log, but a posted change must be validated against a table. For example, if I send a "Create Invoice with Customer 1", I need to validate that the customer exists (among other things). When the validation passes, I commit to the log and apply the change to the table, so the table has the most up-to-date information and I still have the full history of changes.
I could put the log into a table in the database (I use PostgreSQL). However, I'm concerned about the scalability of doing that. Also, I wish to subscribe to the event stream from multiple clients, and neither PG nor any other RDBMS I know lets me do this without polling.
But if I use Kafka, I worry about ACID across both stores: Kafka could receive data that PG then rolls back, or something similar.
So:
1- Is it possible to keep consistency between an RDBMS and a log storage, OR
2- Is it possible to subscribe in real time and tune PG (or another RDBMS) for fast event storage?
Easy(1) answers for provided questions:
Setting up your transaction isolation level properly may be enough to achieve consistency and not worry about DB rollbacks. You can still occasionally create inconsistency, unless you set the isolation level to 'serializable'. Even then, you're guaranteed to be consistent, but could still see undesirable behaviors. For example, a client creates a customer and submits an invoice in rapid succession using an async API, and the invoice event hits your backend system first. In this case the invoice event would be rejected, and the client would need to retry, hoping that the customer has been created by that time. This is easy to avoid if you control the clients and mandate that they use a sync API.
Whether it is possible to store events in a relational DB depends on your anticipated dataset size, hardware and access patterns. I'm a big Postgres fan, and there is a lot you can do to make event lookups blazingly fast. My rule of thumb: if your operating table size is below 200-300GB and you have a decent server, Postgres is the way to go. With event sourcing there are typically no joins, and a common access pattern is to get all events by id (optionally restricted by time stamp). Postgres excels at this kind of query, provided you index smartly. However, event subscribers will need to pull this data, so it may not be a good fit if you have thousands of subscribers, which is rarely the case in practice.
"Conceptually correct" answer:
If you still want to pursue the streaming approach and fundamentally resolve race conditions, then you have to provide event ordering guarantees across all events in the system. For example, you need to be able to order the 'add customer 1' event and the 'create invoice for customer 1' event so that you can guarantee consistency at any time. This is a really hard problem to solve in general for a distributed system (see e.g. vector clocks). You can mitigate it with some clever tricks that would work for your particular case, e.g. in the example above you can partition your events by 'customerId' early as they hit the backend; then you have a guarantee that all events related to the same customer will be processed (roughly) in the order they were created.
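The partitioning trick can be sketched in a few lines of Python. This is an in-memory illustration only: the `partition_for` and `route` helpers, the CRC32 hash and the event dictionaries are assumptions made for the example, not Kafka's API (Kafka applies the same idea by sending messages with the same key to the same partition).

```python
import zlib
from collections import defaultdict


def partition_for(customer_id, num_partitions):
    """Stable hash so the same customer always maps to the same partition."""
    return zlib.crc32(str(customer_id).encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Append each event to its customer's partition, preserving arrival order."""
    partitions = defaultdict(list)
    for event in events:
        index = partition_for(event["customerId"], num_partitions)
        partitions[index].append(event)
    return partitions


events = [
    {"customerId": 1, "type": "add customer"},
    {"customerId": 2, "type": "add customer"},
    {"customerId": 1, "type": "create invoice"},
]
partitions = route(events, num_partitions=4)

# All events of customer 1 sit in one partition, in the order they arrived.
customer_1 = [
    e["type"] for e in partitions[partition_for(1, 4)] if e["customerId"] == 1
]
```

Ordering is only guaranteed within a partition, which is why the key has to be chosen so that events that depend on each other share it.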
Would be happy to clarify my points if needed.
(1) Easy vs simple: mandatory link

Concurrency, Atomicity, and Isolation in Entity Framework

Based on some periodically and concurrently incoming data, I'm performing an operation that will either insert a new row into a table, or update an existing row in the same table. Whether it inserts or updates a row is dependent on the states of the existing rows. So, the result of this operation will be affected by previous runs of this operation, and affect subsequent runs. I need to ensure atomicity/isolation using transactions, or locks, or something. There seems to be so many options and caveats with Entity Framework (and I'm a complete newbie with database stuff in general too) that I have no idea what direction I should be headed. TransactionScope, BeginTransaction, ambient transactions? Serializable or RepeatableRead? SaveChanges and AcceptAllChanges? Do I even need to do anything special? The fact that a new row can be added makes me worry especially about phantom rows, though I barely understand what that means. Any guidance on the subject would be greatly appreciated.
This tutorial may be helpful to you - http://www.asp.net/mvc/tutorials/getting-started-with-ef-using-mvc/handling-concurrency-with-the-entity-framework-in-an-asp-net-mvc-application
Quote:
Pessimistic Concurrency (Locking)
If your application does need to prevent accidental data loss in
concurrency scenarios, one way to do that is to use database locks.
This is called pessimistic concurrency. For example, before you read a
row from a database, you request a lock for read-only or for update
access. If you lock a row for update access, no other users are
allowed to lock the row either for read-only or update access, because
they would get a copy of data that's in the process of being changed.
If you lock a row for read-only access, others can also lock it for
read-only access but not for update. Managing locks has some
disadvantages. It can be complex to program. It requires significant
database management resources, and it can cause performance problems
as the number of users of an application increases (that is, it
doesn't scale well). For these reasons, not all database management
systems support pessimistic concurrency. The Entity Framework provides
no built-in support for it, and this tutorial doesn't show you how to
implement it.
Optimistic Concurrency
The alternative to pessimistic concurrency is optimistic concurrency.
Optimistic concurrency means allowing concurrency conflicts to happen,
and then reacting appropriately if they do. For example, John runs the
Departments Edit page, changes the Budget amount for the English
department from $350,000.00 to $100,000.00. (John administers a
competing department and wants to free up money for his own
department.)
There are code examples for both models in the tutorial.

Is there any method to guarantee transaction from the user end

Since MongoDB does not support transactions, is there any way to guarantee transaction?
What do you mean by "guarantee transaction"?
There are two concepts in MongoDB that are similar:
Atomic operations
Using safe mode / getlasterror ...
http://www.mongodb.org/display/DOCS/Last+Error+Commands
If you simply need to know if there was an error when you run an update for example you can use the getlasterror command, from the docs ...
getlasterror is primarily useful for write operations (although it is set after a command or query too). Write operations by default do not have a return code: this saves the client from waiting for client/server turnarounds during write operations. One can always call getLastError if one wants a return code.
If you're writing data to MongoDB on multiple connections, then it can sometimes be important to call getlasterror on one connection to be certain that the data has been committed to the database. For instance, if you're writing to connection #1 and want those writes to be reflected in reads from connection #2, you can assure this by calling getlasterror after writing to connection #1.
Alternatively, you can use atomic operations for cases where you need to increment a value for example (like an upvote, etc.) more about that here:
http://www.mongodb.org/display/DOCS/Atomic+Operations
As a side note, the default storage engine in MySQL 5.1 (MyISAM) doesn't have transactions either! :)
http://dev.mysql.com/doc/refman/5.1/en/myisam-storage-engine.html
MongoDB only supports atomic operations. There is no way to implement transactions in the ACID sense on top of MongoDB; such transaction support must be implemented in the core. But you will never see full transaction support due to the CAP theorem. You cannot have speed, durability and consistency at the same time.
I think it's one of the things you choose to forego when you choose a NoSQL solution.
If transactions are required, perhaps NoSQL is not for you. Time to go back to ACID relational databases.
Unfortunately, MongoDB doesn't support transactions out of the box, but you can actually implement ACID optimistic transactions on top of it. I wrote an example and some explanation on a GitHub page.