I am currently working with a very large database (>50GB) and trying to understand the most efficient, usable approach that plays well with Akka's inherent threading.
Regarding the "wrapping everything inside withSession{ }" approach, while this would be an easier fix, I am concerned that this would restrict Akka's threading between actors. I am not that knowledgeable on how Akka's threading works, and how wrapping an entire actor system inside of withSession would effect it.
Another approach is to call withSession whenever the database is accessed, which is too inefficient. The "withSession {" code segment takes ~6ms to execute, and we are making millions of queries.
Essentially: what is the best way to rapidly access a database with Slick and Akka without breaking threading?
I have heard of approaches using implicit sessions and transactions, but I am struggling to find documentation on either of these.
Better late than never:
The recommended way is to use a JDBC connection pool (e.g. c3p0). You need to make sure a session is acquired and returned on the same thread, and kept on that thread in between. withSession lazily acquires a connection from the pool and returns it at the end of the scope. Acquire connections quickly when you need them and return them to the pool immediately afterwards.
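For illustration, here is a minimal sketch of that setup, assuming Slick 2.x (where withSession lives) and c3p0; the JDBC URL, credentials and the orders table are placeholders:

import com.mchange.v2.c3p0.ComboPooledDataSource
import scala.slick.driver.PostgresDriver.simple._
import scala.slick.jdbc.StaticQuery

object Db {
  // Build the pool once at startup; withSession then only borrows from it.
  private val pool = new ComboPooledDataSource()
  pool.setJdbcUrl("jdbc:postgresql://localhost/mydb")
  pool.setUser("user")
  pool.setPassword("pass")

  val database = Database.forDataSource(pool)

  // Borrow a connection only for the duration of the query, on the calling thread:
  def countOrders(): Int =
    database.withSession { implicit session =>
      StaticQuery.queryNA[Int]("select count(*) from orders").first
    }
}

This way the expensive part, establishing a physical connection, is paid only when the pool creates one, not on every withSession call.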
I'm building a Scala REST API to handle orders and I have been following https://github.com/alexandru/scala-best-practices to ensure I am following best practices as it's been a while since I wrote any Scala code.
So far, all of my code is thread safe and non-blocking, using Futures where applicable and immutability throughout. Now I've integrated a data store that handles the creation of orders; my problem is: how do I deal with transactions (such as two identical orders being created at the same time)?
Are there any patterns or examples that I can follow for handling transactions whilst preserving thread safety and non-blocking code? I come from a Java background and typically would use synchronised blocks, but that would not respect:
4.8. https://github.com/alexandru/scala-best-practices/blob/master/sections/4-concurrency-parallelism.md
Meet Amdahl's Law. Synchronizing with locks drastically limits the parallelization possible and thus vertical scalability. Reads are embarrassingly parallelizable, so avoid at all times doing this:
def fetch = synchronized { someValue }
Come up with better synchronization schemes that do not involve synchronizing reads, like atomic references or STM. If you aren't able to do that, then avoid this altogether by using proper abstractions.
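For reference, a minimal sketch of such a scheme with an atomic reference (the Holder class is invented for the example): reads never take a lock, and writes retry with compare-and-set instead of blocking readers.

import java.util.concurrent.atomic.AtomicReference

final class Holder[A](initial: A) {
  private val ref = new AtomicReference[A](initial)

  // Lock-free read: just dereference the current value.
  def fetch: A = ref.get()

  // Lock-free write: recompute and retry if another thread won the race.
  @annotation.tailrec
  def transform(f: A => A): A = {
    val current = ref.get()
    val updated = f(current)
    if (ref.compareAndSet(current, updated)) updated
    else transform(f)
  }
}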
I would appreciate guidance and/or examples of how this can be achieved.
It's my understanding that you can use prepared statements or connection pooling (with tools like pgPool/pgBouncer) with PostgreSQL, but can benefit from only one of them at a time (at least with the Npgsql driver for .NET; plus, the library author suggests turning off client-side connection pooling when using PgBouncer). Am I right?
If so, is this true for other runtimes and languages, like Java, Python, Go? Or is it an implementation-specific issue?
It's a complex question, but here are some answers.
As @laurenz-albe writes, you can use pgbouncer and prepared statements, but you need to use session pooling. This allows you to use prepared statements for the duration of your connection (i.e. as long as your NpgsqlConnection instance is open). However, if you're in a short-lived connection scenario (e.g. a web app which opens and closes a connection for each HTTP request), you're out of luck. In this sense, one could say that pooling and prepared statements aren't compatible.
However, if you use Npgsql's internal pooling mechanism (on by default) instead of pgbouncer, then your prepared statements are automatically persisted across connection open/close. In other words, when you call NpgsqlCommand.Prepare(), if the physical connection happened to have already prepared the SQL, then the prepared statement is reused. This is done specifically to unlock the speed advantages of prepared statements for short-lived connection scenarios. This is a pretty unique behavior of Npgsql, see the docs for more info.
This is one of the advantages of an in-process connection pool, as opposed to an out-of-process pool such as pgbouncer - Npgsql retains information on the physical connection as it is passed around, in this case a table of which statements are prepared (name and SQL).
I think this is a generic question, so I'll give a generic answer. What aspect is applicable to a specific connection pool implementation will probably vary.
There are several modes of connection pooling:
A thread retains a connection for the duration of a session (session pooling):
In that case, persistent state like prepared statements can be held for the duration of the session, but you should clean the state when the session is returned to the pool.
A thread retains a connection for the duration of a database transaction (transaction pooling):
In that case, you'd have to clean the state after each transaction, so prepared statements don't make much sense.
A thread retains a connection for the duration of a statement (statement pooling):
This is only useful in very limited cases where you don't need transactions spanning more than a single statement. Obviously, no state like prepared statements can be shared.
It depends on what kind of connection pool you use. Basically, the longer a thread retains a connection, the more sense it makes to use prepared statements.
Of course, if you know what you are doing, you can also create a prepared statement right after the database connection is established and never deallocate it. This will only work if all threads need the same prepared statements. It is easy to screw up with a setup like that.
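To make the session-pooling case concrete, here is a hedged JDBC sketch in Scala (the URL, credentials and the workers table are invented): the statement is prepared once, right after the connection is established, and reused for that connection's lifetime.

import java.sql.{Connection, DriverManager, PreparedStatement}

object PreparedPerSession {
  val conn: Connection =
    DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "pass")

  // Prepared once, right after the connection is established:
  val byId: PreparedStatement =
    conn.prepareStatement("SELECT name FROM workers WHERE id = ?")

  def lookup(id: Int): Option[String] = {
    byId.setInt(1, id) // only the parameter changes between executions
    val rs = byId.executeQuery()
    try { if (rs.next()) Some(rs.getString(1)) else None }
    finally rs.close()
  }
}

Under transaction or statement pooling, byId would instead have to be prepared (and cleaned up) inside each transaction or statement scope, which is why prepared statements make little sense there.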
I'm a rookie in this topic; all I have ever done is make a connection to a database for one user, so I'm not familiar with multiple users accessing a database.
My case is: 10 facilities will use my program for recording when workers are coming and leaving. The database will be on the main server, and all I made was one user while I was programming/testing the program. My question is: can multiple remote locations connect to the database with that one user (there should be no collisions, because they are all writing different data, but to the same tables)? And if that's not the case, what should I do?
Good relational databases handle this quite well; it is the “I” in the so-called ACID properties of transactions in relational databases, and it stands for isolation.
Concurrent processes are protected from simultaneously writing the same table row by locks that block other transactions until one transaction is done writing.
Readers are protected from concurrent writing by means of multiversion concurrency control (MVCC), which keeps old versions of the data around to serve readers without blocking anybody.
If you have enclosed all data modifications that belong together into a transaction, so that they happen atomically (the “A” in ACID), and your transactions are simple and short, your application will probably work just fine.
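As a minimal JDBC sketch of that in Scala (the attendance and workers tables are invented for the example), both changes below belong to one check-in, so other sessions see either both or neither:

import java.sql.Connection

object CheckIn {
  def recordArrival(conn: Connection, workerId: Int): Unit = {
    conn.setAutoCommit(false) // start an explicit transaction
    try {
      val ins = conn.prepareStatement(
        "INSERT INTO attendance (worker_id, arrived_at) VALUES (?, now())")
      ins.setInt(1, workerId)
      ins.executeUpdate()

      val upd = conn.prepareStatement("UPDATE workers SET on_site = true WHERE id = ?")
      upd.setInt(1, workerId)
      upd.executeUpdate()

      conn.commit() // both rows become visible to other sessions atomically
    } catch {
      case e: Exception =>
        conn.rollback() // nobody ever sees the half-done state
        throw e
    }
  }
}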
Problems may arise if these conditions are not satisfied:
If your data modifications are not protected by transactions, a concurrent session may see intermediate, incomplete results of a different session and thus work with inconsistent data.
If your transactions are complicated, later statements inside a transaction may rely on results of previous statements in indirect ways. This assumption can be broken by concurrent activity that modifies the data. There are three approaches to that:
Pessimistic locking: lock all data the first time you use them with something like SELECT ... FOR UPDATE so that nobody can modify them until your transaction is done.
Optimistic locking: don't lock, but whenever you access the data a second time, check that nobody else has modified them in the meantime. If that has been the case, roll the transaction back and try it again.
Use high transaction isolation levels like REPEATABLE READ and SERIALIZABLE, which give better guarantees that the data you are using don't get modified concurrently. You have to be prepared to receive serialization errors if the database cannot keep the guarantees, in which case you have to roll the transaction back and retry it.
These techniques achieve the same goal in different ways; the discussion of when to use which one exceeds the scope of this answer. A sketch of the first two follows.
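Here is a sketch of the first two approaches over plain JDBC in Scala; the accounts table and its version column are invented for illustration.

import java.sql.Connection

object LockingSketch {
  // Pessimistic: lock the row up front; concurrent writers block until commit.
  def debitPessimistic(conn: Connection, id: Long, amount: Long): Unit = {
    conn.setAutoCommit(false)
    val sel = conn.prepareStatement("SELECT balance FROM accounts WHERE id = ? FOR UPDATE")
    sel.setLong(1, id)
    val rs = sel.executeQuery(); rs.next()
    val balance = rs.getLong(1)
    val upd = conn.prepareStatement("UPDATE accounts SET balance = ? WHERE id = ?")
    upd.setLong(1, balance - amount); upd.setLong(2, id)
    upd.executeUpdate()
    conn.commit()
  }

  // Optimistic: no lock; a version column detects concurrent modification,
  // and the caller retries when the update matched zero rows.
  def debitOptimistic(conn: Connection, id: Long, amount: Long): Boolean = {
    conn.setAutoCommit(false)
    val sel = conn.prepareStatement("SELECT balance, version FROM accounts WHERE id = ?")
    sel.setLong(1, id)
    val rs = sel.executeQuery(); rs.next()
    val (balance, version) = (rs.getLong(1), rs.getLong(2))
    val upd = conn.prepareStatement(
      "UPDATE accounts SET balance = ?, version = version + 1 WHERE id = ? AND version = ?")
    upd.setLong(1, balance - amount); upd.setLong(2, id); upd.setLong(3, version)
    val changed = upd.executeUpdate() == 1
    if (changed) conn.commit() else conn.rollback()
    changed
  }
}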
If your transactions are complicated and/or take a long time (long transactions are to be avoided as much as possible, because they cause all kinds of problems in a database), you may encounter a deadlock, which is two transactions locking each other in a kind of “deadly embrace”.
The database will detect this condition and interrupt one of the transactions with an error.
There are two ways to deal with that:
Avoid deadlocks by always locking resources in a certain order (e.g., always update the account with the lower account number first).
When you encounter a deadlock, your code has to retry the transaction; a sketch of such a retry loop follows.
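In Scala over JDBC, such a retry loop could look like this (40P01, deadlock_detected, and 40001, serialization_failure, are PostgreSQL's SQLSTATEs for the two retryable errors discussed in this answer):

import java.sql.{Connection, SQLException}

object Retry {
  @annotation.tailrec
  def inTransaction[A](conn: Connection, attemptsLeft: Int = 3)(tx: Connection => A): A = {
    conn.setAutoCommit(false)
    val outcome =
      try Right(tx(conn))
      catch {
        case e: SQLException if e.getSQLState == "40P01" || e.getSQLState == "40001" =>
          conn.rollback() // we were chosen as the victim: undo and try again
          Left(e)
      }
    outcome match {
      case Right(result)               => conn.commit(); result
      case Left(_) if attemptsLeft > 1 => inTransaction(conn, attemptsLeft - 1)(tx)
      case Left(e)                     => throw e
    }
  }
}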
Contrary to common belief, a deadlock is not necessarily a bug.
I recommend that you read the chapter about concurrency control in the PostgreSQL documentation.
While building a large multi-threaded application for the financial services industry, I utilized immutable classes and an Actor model for workflow everywhere I could. I'm pretty pleased with the outcome. It uses a fair amount of heap space (it's in Java, btw), but the JVM's GC works pretty well with short-lived immutable classes.
I am just wondering if there are any downsides to using this kind of pattern going forward? When debugging a teammate's code, I often find myself recommending this pattern in one way or another. I guess once one has a hammer, everything looks like a nail. So the question is: when would this design pattern (paradigm?) work poorly?
My hunch is that it's when memory usage is a big issue, or when the project restrictions require something along the lines of low-level C, etc.
Many scientific simulation codes are really memory-intensive. For cellular automaton models, for example, fast memory access is more important than CPU power. In that case, accessing and modifying a mutable array in place is always faster (at least in all my trials).
It all depends on your project design.
If you have some resource and a lot of actors use it, the common pattern is to design an accessor actor. When some other actor needs the resource, it asks the accessor actor, and the answer is copied back through the message channel.
Now imagine you have a really heavy resource (e.g. a Map[String, BigObject]) and other actors frequently ask for some BigObject: you waste your bandwidth copying it around.
A better idea would be to share the resource with all actors in read-only mode and make one actor perform the writes, as sketched below.
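A minimal Akka sketch of that split (classic actors; BigObject and the messages are invented for the example): one actor owns all writes, and readers get a reference to an immutable snapshot instead of per-key copies.

import akka.actor.Actor

final case class BigObject(payload: String)
final case class Put(key: String, value: BigObject)
case object GetSnapshot

class ResourceOwner extends Actor {
  // Only this actor ever mutates the reference; what it hands out is an
  // immutable Map, so sharing the snapshot across actors is safe and cheap.
  private var resource = Map.empty[String, BigObject]

  def receive = {
    case Put(key, value) => resource += (key -> value)
    case GetSnapshot     => sender() ! resource
  }
}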
Another example would be a database connector that connects to a database with a lot of blob data. When the database connector is thread-safe (as it normally is), it's better to share the connector object reference with all actors than to design some actor that provides the access.
All you need to remember is that communication between actors is done by copying messages.
I'm working on a .NET program that executes arbitrary scripts against a database.
When a colleague started writing the database access code, he simply exposed one command object to the rest of the application, which is re-used (setting CommandText/Type, calling ExecuteNonQuery() etc.) for each statement.
I imagine this is a big performance hit for repeated, identical statements, because they are parsed anew each time.
What I'm wondering about, though, is: will this also degrade execution speed if each statement is different from the previous one (not only different parameters, but an entirely different statement)? I couldn't easily find an answer on that in the documentation.
Btw, the RDBMS used is Oracle, but I guess this question is not really database specific.
P.S. I know exposing the same Command object is not thread safe, but that's not an issue here.
There is some overhead involved in creating new command objects, so in certain circumstances it can make sense to re-use the same command. But as a general rule enforced for an entire application, it seems more than a little odd.
The performance hit usually comes from establishing a connection to the database, but ADO.NET creates a connection pool to help here.
If you wish to avoid parsing statements each time anew, you can put them into stored procedures.
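A related technique (not the same as stored procedures, but addressing the same re-parse cost) is to use bind variables, so the statement text stays identical across executions and the database can reuse the cached parse. A hedged JDBC sketch in Scala, with an invented log table:

import java.sql.Connection

object LogWriter {
  def insertLines(conn: Connection, lines: Seq[String]): Unit = {
    // Parsed once; only the bind value changes between executions.
    val stmt = conn.prepareStatement("INSERT INTO log (line) VALUES (?)")
    try {
      for (line <- lines) {
        stmt.setString(1, line)
        stmt.executeUpdate()
      }
    } finally stmt.close()
  }
}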
I imagine your colleague just uses some old-style approach that he inherited from working on other platforms, where reusing a command object did make a difference.