should i activate c3p0 statement pooling? - postgresql

we are running java6/hibernate/c3p0/postgresql stack.
Our JDBC Driver is 8.4-701.jdbc3
I have a few questions about Prepared Statements. I have read
excellent document about Prepared Statements
But i still have a question how to configure c3p0 with postgresql.
At the moment we have
c3p0.maxStatements = 0
c3p0.maxStatementsPerConnection = 0
In my understanding the prepared statements and statement pooling are two different things:
Our hibernate stack uses prepared statements. Postgresql is caching the
execution plan. Next time the same statement is used, postgresql reuses the
execution plan. This saves time planning statements inside DB.
Additionally c3p0 can cache java instances of "java.sql.PreparedStatement"
which means it is caching the java object. So when using
c3p0.maxStatementsPerConnection = 100 it caches at most 100 different
objects. It saves time on creating objects, but this has nothing to do with
the postgresql database and its prepared statements.
Right?
As we use about 100 different statements I would set
c3p0.maxStatementsPerConnection = 100
But the c3p0 docs say in c3p0 known shortcomings
The overhead of Statement pooling is
too high. For drivers that do not
perform significant preprocessing of
PreparedStatements, the pooling
overhead outweighs any savings.
Statement pooling is thus turned off
by default. If your driver does
preprocess PreparedStatements,
especially if it does so via IPC with
the RDBMS, you will probably see a
significant performance gain by
turning Statement pooling on. (Do this
by setting the configuration property
maxStatements or
maxStatementsPerConnection to a value
greater than zero.).
So: Is it reasonable to activate maxStatementsPerConnection with c3p0 and Postgresql?
Is there a real benefit activating it?
kind regards
Janning

I don't remember offhand if Hibernate actually stores PreparedStatement instances itself, or relies on the connection provider to reuse them. (A quick scan of BatcherImpl suggests it reuses the last PreparedStatement if executing the same SQL multiple times in a row)
I think the point that the c3p0 documentation is trying to make is that for many JDBC drivers, a PreparedStatement isn't useful: some drivers will end up simply splicing the parameters in client-side and then passing the built SQL statement to the database anyway. For these drivers, PreparedStatements are no advantage at all, and any effort to reuse them is wasted. (The Postgresql JDBC FAQ says this was the case for Postgresql before sever protocol version 3 and there is more detailed information in the documentation).
For drivers that do handle PreparedStatements usefully, it's still likely necessary to actually reuse PreparedStatement instances to get any benefit. For example if the driver implements:
Connection.prepareStatement(sql) - create a server-side statement
PreparedStatement.execute(..) etc - execute that server-side statement
PreparedStatement.close() - deallocate the server-side statement
Given this, if the application always opens a prepared statement, executes it once and then closes it again, there's still no benefit; in fact, it might be worse since there are now potentially more round-trips. So the application needs to hang on to PreparedStatement instances. Of course, this leads to another problem: if the application hangs on to too many, and each server-side statement consumes some resources, then this can lead to server-side issues. In the case where someone is using JDBC directly, this might be managed by hand- some statements are known to be reusable and hence are prepared; some aren't and just use transient Statement instances instead. (This is skipping over the other benefit of prepared statements: handling argument escaping)
So this is why c3p0 and other connection pools also have prepared statement caches- it allows application code to avoid dealing with all this. The statements are usually kept in some limited LRU pool, so common statements reuse a PreparedStatement instance.
The final pieces of the puzzle are that JDBC drivers may themselves decide to be clever and do this; and servers may themselves also decide to be clever and detect a client submitting a statement that is structurally similar to a previous one.
Given that Hibernate doesn't itself keep a cache of PreparedStatement instances, you need to have c3p0 do that in order to get the benefit of them. (Which should be reduced overhead for common statements due to reusing cached plans). If c3p0 doesn't cache prepared statements, then the driver will just see the application preparing a statement, executing it, and then closing it again. Looks like the JDBC driver has a "threshold" setting for avoiding the prepare/execute server overhead in the case where the application always does this. So, yes, you need to have c3p0 do statement caching.
Hope that helps, sorry it's a bit long winded. The answer is yes.

Remember that statements have to be cached per connection which will mean you're going to have to consume quite a chunk of memory and it will take a long time before you'll see any benefit. So if you set it to use 100 statements to be cached, that's actually 100*number of connections or else 100/no of connections but you will still need to take quite some time until your cache will have any meaningful effect.

Related

Postgresql - prepared statements vs connection pooling - is it a tradeoff?

It's my understanding that you can use prepared statements or connection pooling (with tools like pgPool/pgBouncer) with Postgresql, but can benefit from only one at the same time (at least with Npgsql driver for .NET, plus library author suggests turning off clients-side connection pooling when using PgBouncer). Am I right?
If so - is this true for other runtimes and languages, like Java, Python, Go? Or is it a implementation-specific issue?
It's a complex question, but here are some answers.
As #laurenz-albe writes, you can use pgbouncer and prepared statements but need to use session pooling. This allows you to use prepared statements for the duration of your connection (i.e. as long as your NpgsqlConnection instance is open). However, if you're in a short-lived connection scenario (e.g. web app which opens and closes a connection for each HTTP request), you're out of luck. In this sense, one could say that pooling and prepared statements aren't compatible.
However, if you use Npgsql's internal pooling mechanism (on by default) instead of pgbouncer, then your prepared statements are automatically persisted across connection open/close. In other words, when you call NpgsqlCommand.Prepare(), if the physical connection happened to have already prepared the SQL, then the prepared statement is reused. This is done specifically to unlock the speed advantages of prepared statements for short-lived connection scenarios. This is a pretty unique behavior of Npgsql, see the docs for more info.
This is one of the advantages of an in-process connection pool, as opposed to an out-of-process pool such as pgbouncer - Npgsql retains information on the physical connection as it is passed around, in this case a table of which statements are prepared (name and SQL).
I think this is a generic question, so I'll give a generic answer. What aspect is applicable to a specific connection pool implementation will probably vary.
There are several modes of connection pooling:
A thread retains a connection for the duration of a session (session pooling):
In that case, persistent state like prepared statements can be held for the duration of the session, but you should clean the state when the session is returned to the pool.
A thread retains a connection for the duration of a database transaction (transaction pooling):
In that case, you'd have to clean the state after each transaction, so prepared statements don't make much sense.
A thread retains a connectoin for the duration of a statement (statement poling):
This is only useful in very limited cases where you don't need transactions spanning more than a single statement. Obviously, no state like prepared statements can be shared.
It depends what kind of connection pool you use. Basically, the longer a thread retains a connection, the more sense it makes to use prepared statements.
Of course, if you know what you are doing, you can also create a prepared statement right after the database connection is established and never deallocate it. This will only work if all threads need the same prepared statements. It is easy to screw up with a setup like that.

Multiple connections to PostgreSQL with huge number of INSERTs

This question is involved by this one: How to speed up insertion performance in PostgreSQL
So, I have java application which is doing a lot of (aprox. billion) INSERTs into PostgreSQL database. It opens several JDBC connections to the same DB for doing these inserts in parallel. As I read in mentioned question-answer:
INSERT or COPY in parallel from several connections. How many depends
on your hardware's disk subsystem; as a rule of thumb, you want one
connection per physical hard drive if using direct attached storage.
But in my case I have only one disk storage for my DB.
So, my question is: does it really have sence to open several connections in this case? Could it reduce perfomance instead of desired increasing due to I/O operations competitions?
For clarifying, here is the picture with actual postgresql processes load:
Since you mentioned INSERT in Java application, I'd assume (utilizing plain JDBC) COPY is not what you're looking for. Without using API like JPA or framework such as Spring-data, may I introduce addBatch() and executeBatch() in case you haven't heard of these:
/*
the whole nine yards
*/
Connection c = ...;
PreparedStatement ps = c.prepareStatement("INSERT INTO table1(columnInt2,columnVarchar)VALUES(?,?)");
Then read data in a loop:
ps.setShort(1, someShortValue);
ps.setString(2, someStringValue);
ps.addBatch(); // one row at a time from human's perspective
When data of all rows are prepared:
ps.executeBatch();
May I also recommend:
Connection pooling which saves you a lot of resources; check out Commons DBCP, c3p0 and BoneCP.
When doing multiple CUD (create, update, delete) operations, one should think about transaction (so you can rollback in case any row goes wrong).

Does PostgreSQL cache Prepared Statements like Oracle

I have just moved to PostgreSQL after having worked with Oracle for a few years.
I have been looking into some performance issues with prepared statements in the application (Java, JDBC) with the PostgreSQL database.
Oracle caches prepared statements in its SGA - the pool of prepared statements is shared across database connections.
PostgreSQL documentation does not seem to indicate this. Here's the snippet from the documentation (https://www.postgresql.org/docs/current/static/sql-prepare.html) -
Prepared statements only last for the duration of the current database
session. When the session ends, the prepared statement is forgotten,
so it must be recreated before being used again. This also means that
a single prepared statement cannot be used by multiple simultaneous
database clients; however, each client can create their own prepared
statement to use.
I just want to make sure that I am understanding this right, because it seems so basic for a database to implement some sort of common pool of commonly executed prepared statements.
If PostgreSQL does not cache these that would mean every application that expects a lot of database transactions needs to develop some sort of prepared statement pool that can be re-used across connections.
If you have worked with PostgreSQL before, I would appreciate any insight into this.
Yes, your understanding is correct. Typically if you had a set of prepared queries that are that critical then you'd have the application call a custom function to set them up on connection.
There are three key reasons for this afaik:
There's a long todo list and they get done when a developer is interested/paid to tackle them. Presumably no-one has thought it worth funding yet or come up with an efficient way of doing it.
PostgreSQL runs in a much wider range of environments than Oracle. I would guess that 99% of installed systems wouldn't see much benefit from this. There are an awful lot of setups without high-transaction performance requirement, or for that matter a DBA to notice whether it's needed or not.
Planned queries don't always provide a win. There's been considerable work done on delaying planning/invalidating caches to provide as good a fit as possible to the actual data and query parameters.
I'd suspect the best place to add something like this would be in one of the connection pools (pgbouncer/pgpool) but last time I checked such a feature wasn't there.
HTH

PostgreSQL. Slow queries in log file are fast in psql

I have an application written on Play Framework 1.2.4 with Hibernate(default C3P0 connection pooling) and PostgreSQL database (9.1).
Recently I turned on slow queries logging ( >= 100 ms) in postgresql.conf and found some issues.
But when I tried to analyze and optimize one particular query, I found that it is blazing fast in psql (0.5 - 1 ms) in comparison to 200-250 ms in the log. The same thing happened with the other queries.
The application and database server is running on the same machine and communicating using localhost interface.
JDBC driver - postgresql-9.0-801.jdbc4
I wonder what could be wrong, because query duration in the log is calculated considering only database processing time excluding external things like network turnarounds etc.
Possibility 1: If the slow queries occur occasionally or in bursts, it could be checkpoint activity. Enable checkpoint logging (log_checkpoints = on), make sure the log level (log_min_messages) is 'info' or lower, and see what turns up. Checkpoints that're taking a long time or happening too often suggest you probably need some checkpoint/WAL and bgwriter tuning. This isn't likely to be the cause if the same statements are always slow and others always perform well.
Possibility 2: Your query plans are different because you're running them directly in psql while Hibernate, via PgJDBC, will at least sometimes be doing a PREPARE and EXECUTE (at the protocol level so you won't see actual statements). For this, compare query performance with PREPARE test_query(...) AS SELECT ... then EXPLAIN ANALYZE EXECUTE test_query(...). The parameters in the PREPARE are type names for the positional parameters ($1,$2,etc); the parameters in the EXECUTE are values.
If the prepared plan is different to the one-off plan, you can set PgJDBC's prepare threshold via connection parameters to tell it never to use server-side prepared statements.
This difference between the plans of prepared and unprepared statements should go away in PostgreSQL 9.2. It's been a long-standing wart, but Tom Lane dealt with it for the up-coming release.
It's very hard to say for sure without knowing all the details of your system, but I can think of a couple of possibilities:
The query results are cached. If you run the same query twice in a short space of time, it will almost always complete much more quickly on the second pass. PostgreSQL maintains a cache of recently retrieved data for just this purpose. If you are pulling the queries from the tail of your log and executing them immediately this could be what's happening.
Other processes are interfering. The execution time for a query varies depending on what else is going on in the system. If the queries are taking 100ms during peak hour on your website when a lot of users are connected but only 1ms when you try them again late at night this could be what's happening.
The point is you are correct that the query duration isn't affected by which library or application is calling it, so the difference must be coming from something else. Keep looking, good luck!
There are several possible reasons. First if the database was very busy when the slow queries excuted, the query may be slower. So you may need to observe the load of the OS at that moment for future analysis.
Second the history plan of the sql may be different from the current session plan. So you may need to install auto_explain to see the actual plan of the slow query.

Is it a good idea to re-use ADO.NET command objects?

I'm working on a .NET program that executes arbitrary scripts against a database.
When a colleage started writing the database access code, he simply exposed one command object to the rest of the application which is re-used (setting CommandText/Type, calling ExecuteNonQuery() etc.) for each statement.
I imagine this is a big performance hit for repeated, identical statements, because they are parsed anew each time.
What I'm wondering about, though, is: will this also degrade execution speed if each statement is different from the previous one (not only different parameters, but an entirely different statement)? I couldn't easily find an answer on that in the documentation.
Btw, the RDBMS used is Oracle, but I guess this question is not really database specific.
P.S. I know exposing the same Command object is not thread safe, but that's not an issue here.
There is some overhead involved in creating new command objects, and so in certain circumstances it can make sense to re-use the same command. But as the general case enforced for an entire application it seems more than a little odd.
The performance hit usually comes from establishing a connection to the database, but ADO.NET creates a connection pool to help here.
If you wish to avoid parsing statements each time anew, you can put them into stored procedures.
I imagine your colleague just uses some old style approach that he's inherited from working on other platforms where reusing a command object did make a difference.