When inserting or updating data in PostgreSQL, it is easy to execute multiple statements in one transaction. (My goal here is to avoid a server round-trip for each statement, although the transactional isolation is often useful.)
When querying, I'm unclear if this is possible. I'd somehow need to know what function is going to consume each bit and how to separate the bits.
connection c("dbname=test user=postgres hostaddr=127.0.0.1");
work w(c);
w.exec("SELECT a, b FROM my_table WHERE c = 3;");
w.exec("SELECT x, y, z FROM my_other_table WHERE c = 'dog';");
w.commit();
Assume I've got functions my_parse_function() and my_other_parse_function() that can read rows from each of these queries, were I doing them separately.
If your goal is to avoid round trips, transactions don't help.
Transaction isolation in Postgres (as with most RDBMSs) doesn't rely on the server executing all of your statements at once. Each statement in your transaction will be sent and executed at the point of the exec() call; isolation is provided by the engine's concurrency control model, allowing multiple clients to issue commands simultaneously, and presenting each with a different view of the database state.
If anything, wrapping a sequence of statements in a transaction will add more communication overhead, as additional round-trips are required to issue the BEGIN and COMMIT commands.
If you want to issue several commands in one round-trip, you can do so by calling exec() with a single semicolon-separated multi-statement string. These statements will be implicitly treated as a single transaction, provided that there is no explicit transaction already active, and that the string doesn't include any explicit BEGIN/COMMIT commands.
If you want to send multiple queries, the protocol does allow for multiple result sets to be returned in response to a multi-query string, but exec() doesn't give you access to them; it's just a wrapper for libpq's PQexec(), which discards all but the last result.
Instead, you can use a pipeline, which lets you issue asynchronous queries via insert(), and then retrieve() the results at your leisure (blocking until they arrive, if necessary). Setting a retain() limit will allow the pipeline to accumulate statements, and then send them together as a multi-command string.
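The question of which function consumes which result can be handled by remembering a parser for each query as it is queued. The sketch below is plain Python, not libpqxx: FakePipeline is a hypothetical stand-in that only mimics the insert()/retrieve() shape described above, keying results by a query id is an assumption of the sketch, and the canned rows are made up for illustration.

```python
from collections import deque


class FakePipeline:
    """Hypothetical stand-in for a query pipeline; NOT the libpqxx API."""

    def __init__(self, canned_results):
        self._canned = canned_results  # query text -> list of row tuples
        self._pending = deque()
        self._next_id = 0

    def insert(self, query):
        """Queue a query and hand back an id for matching its result later."""
        qid = self._next_id
        self._next_id += 1
        self._pending.append((qid, query))
        return qid

    def retrieve(self):
        """Return (id, rows) for the oldest queued query."""
        qid, query = self._pending.popleft()
        return qid, self._canned[query]


def my_parse_function(rows):
    return [{"a": a, "b": b} for a, b in rows]


def my_other_parse_function(rows):
    return [{"x": x, "y": y, "z": z} for x, y, z in rows]


def run_batch(pipe, jobs):
    """Queue every (query, parser) pair, then feed each result to its parser."""
    parsers = {}
    for query, parser in jobs:
        parsers[pipe.insert(query)] = parser
    parsed = []
    for _ in jobs:
        qid, rows = pipe.retrieve()
        parsed.append(parsers[qid](rows))
    return parsed


Q1 = "SELECT a, b FROM my_table WHERE c = 3"
Q2 = "SELECT x, y, z FROM my_other_table WHERE c = 'dog'"
pipe = FakePipeline({Q1: [(1, 2)], Q2: [(7, 8, 9)]})
results = run_batch(pipe, [(Q1, my_parse_function), (Q2, my_other_parse_function)])
```

The same bookkeeping (a map from query id to parser) should carry over to the C++ code, with the real pipeline object in place of the fake one.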
The PREPARE and EXECUTE combination in PostgreSQL permits the use of bound parameters. However, PREPARE does not produce a plan optimized for one set of parameter bindings that can be reused with a different set of parameter bindings. Does anybody have pointers on implementing such functionality? With this, the plan would be optimized for the given set of parameter bindings but could be reused for another set. The plan might not be efficient for the subsequent set, but if the plan cost was recomputed using the new parameter bindings, it might be found to be efficient.
Reading and using parameter binding values for cardinality estimation is called "parameter sniffing" in SQL Server and "bind peeking" in Oracle. Basically, has anybody done anything similar in PostgreSQL?
PostgreSQL uses a heuristic to decide whether to do "bind peeking". It peeks for the first 5 executions (I think it is) of a prepared statement, and if none of those produce plans that are expected to be better than the generic plan, it stops checking in the future.
Starting in v12, you can change this heuristic by setting plan_cache_mode.
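The heuristic can be pictured with a toy model. The sketch below is Python, not PostgreSQL code: the class name, the cost numbers, and the exact comparison rule (switch to the generic plan once it is no more expensive than the average of the first five custom plans) are simplifying assumptions that mirror the description above. The three mode strings are the values plan_cache_mode accepts.

```python
class PlanCacheModel:
    """Toy model of the custom-vs-generic plan choice for one prepared statement."""

    CUSTOM_TRIALS = 5  # assumption: executions that are always planned with real values

    def __init__(self, generic_cost, mode="auto"):
        self.generic_cost = generic_cost  # estimated cost of the parameter-agnostic plan
        self.mode = mode                  # "auto", "force_custom_plan", "force_generic_plan"
        self.custom_costs = []            # estimated costs seen during the trial phase

    def choose(self, custom_cost):
        """Return which kind of plan this execution would use."""
        if self.mode == "force_custom_plan":
            return "custom"
        if self.mode == "force_generic_plan":
            return "generic"
        if len(self.custom_costs) < self.CUSTOM_TRIALS:
            # Trial phase: plan with the actual parameter values and record the cost.
            self.custom_costs.append(custom_cost)
            return "custom"
        # After the trials: keep peeking only if it has been paying off.
        average = sum(self.custom_costs) / len(self.custom_costs)
        return "generic" if self.generic_cost <= average else "custom"


even = PlanCacheModel(generic_cost=100.0)
even_choices = [even.choose(custom_cost=100.0) for _ in range(7)]

skewed = PlanCacheModel(generic_cost=100.0)
skewed_choices = [skewed.choose(custom_cost=10.0) for _ in range(7)]

forced = PlanCacheModel(generic_cost=100.0, mode="force_generic_plan")
forced_choice = forced.choose(custom_cost=10.0)
```

In this model, a statement whose custom plans are no cheaper than the generic one settles on the generic plan after the trial phase, while a statement with skewed data keeps getting custom plans.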
Note that some drivers implement their own heuristics--just because you call the driver's prepare method doesn't mean it actually transmits this to the server as a PREPARE. It might instead stash the statement text away, wait until you execute, and then quote/escape your parameters and bundle them up with your previously pseudo-prepared statement and send them to the server in one packet. That is, they might treat the prepare/execute separation simply as a way to prevent SQL injections, not as a way to increase performance.
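A client-side "pseudo-prepare" of this kind might look roughly like the following Python sketch. It is not taken from any real driver: the class and method names are hypothetical, placeholders are found by naively splitting on "?", and the quoting rule (doubling single quotes) is only an illustration, not something to rely on for real escaping.

```python
class PseudoPreparedStatement:
    """Toy model of a driver that 'prepares' purely on the client side."""

    def __init__(self, sql):
        # "Prepare" only stashes the text; nothing is sent to the server.
        self.sql = sql

    @staticmethod
    def _quote(value):
        """Render one Python value as an SQL literal (illustrative only)."""
        if value is None:
            return "NULL"
        if isinstance(value, bool):
            return "TRUE" if value else "FALSE"
        if isinstance(value, (int, float)):
            return str(value)
        return "'" + str(value).replace("'", "''") + "'"

    def render(self, params):
        """Build the single statement text that 'execute' would send."""
        parts = self.sql.split("?")  # naive: ignores '?' inside string literals
        if len(parts) - 1 != len(params):
            raise ValueError("parameter count mismatch")
        out = [parts[0]]
        for value, tail in zip(params, parts[1:]):
            out.append(self._quote(value))
            out.append(tail)
        return "".join(out)


stmt = PseudoPreparedStatement("SELECT a, b FROM my_table WHERE name = ? AND c = ?")
text = stmt.render(["O'Brien", 3])
```

Because the server only ever sees the finished text, it plans the query as an ordinary unprepared statement with literal values.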
Today I used the approach in this answer to great success, to replace names, insurance numbers and addresses with randomized garbage in multiple instances of the same database schema, depending on a "test" / "production" flag in the data.
Background: Trying to do
CREATE FUNCTION dbo.FailsToCreate()
RETURNS uniqueidentifier
AS
BEGIN
RETURN NEWID()
END
inevitably fails with
Msg 443, Level 16, State 1, Procedure FailsToCreate, Line 6 [Batch Start Line 27]
Invalid use of a side-effecting operator 'newid' within a function.
Now we can be badass enough to do
CREATE VIEW dbo.vwGuessWhat AS SELECT NEWID() Fooled
which surprisingly allows us to make it work with
CREATE FUNCTION dbo.SuddenlyWorks()
RETURNS uniqueidentifier
AS
BEGIN
RETURN (SELECT Fooled FROM vwGuessWhat)
END
Documentation is silent about consequences. It merely lists the functions that cannot be used, and does not mention a possibility to bypass the limitation.
Can I safely continue to use this approach in production code, or is there a danger in bypassing SQL Server's validation that will cause it to malfunction?
There is no risk per se. You present an interesting way to circumvent a limitation with SQL functions.
On a separate note, one of the many problems with scalar UDFs is that they kill parallelism. In other words, queries that use dbo.SuddenlyWorks() will always run serially, even if you use Adam Machanic's make_parallel() or trace flag 8649.
If you wanted a parallel plan, you would need to make dbo.SuddenlyWorks() an inline table-valued function.
I need to update a KDB table with new/updated/deleted rows while it is being read by other threads. Since writing to K structures while other threads access will not be thread safe, the only way I can think of is to clone the whole table and apply new changes to that. Even to do that, I need to first clone the table, then find a way to insert/update/delete rows from it.
I'd like to know if there are functions in C to:
1. Clone the whole table
2. Delete existing rows
3. Insert new rows easily
4. Update existing rows
Appreciate suggestions on new approaches to the same problem as well.
Based on the comments...
You need to do a set of operations on the KDB database "atomically"
You don't have "control" of the database, so you can't set functions (though you don't actually need to be an admin to do this, but that's a different story)
You have a separate C process that is connecting to the database to do the operations you require. (Given you said you don't have "access" to the database as admin, you can't get KDB to load your C binary to use within-process anyway).
Firstly, I'm going to assume you know how to connect to KDB+ and issue queries via the C API.
All you need to do then is to concatenate your "atomic" operation into a set of statements that you are going to issue in one call from C. For example say you want to update a table and then delete some entry. This is what your call might look like:
{update name:`me from `table where name=`you; delete from `table where name=`other;}[]
(Caution: this is just a dummy example, I've assumed your table is in-memory so that the delete operation here would work just fine, and not saved to disk, etc. If you need specific help with the actual statements you require in your use case then that's a different question for this forum).
Notice that this is an anonymous function that will get called immediately on issue ([]). There is the assumption that your operations within the function will succeed. Again, if you need actual q query help it's a different question for this forum.
Even if your KDB database is multithreaded (started with -s or negative port number), it will not let you update global variables inside a peach thread. Therefore your operation should work just fine. But just in case there's something else that could interfere with your new anonymous function, you can wrap the function with protected evaluation.
In PostgreSQL, when are (SELECT) queries planned?
Is it:
at statement-prepare time, or
at the start of processing the SELECT, or
something else
The reason I ask is that there is a Stack Overflow question: same query, two different ways, vastly different performance
A lot of people seem to be thinking that the query is planned differently because in one case the query contains a string literal ('foo') and in another case it's a placeholder (?).
Now my thinking is that this is a red herring, because the query isn't planned at statement-prepare time, but is actually planned at SELECT time.
So, say, I could prepare a statement with a placeholder, then run the query multiple times with different bound values, and the query planner will be run for each different bound value.
I suspect that the question linked above boils down to the PostgreSQL data type of the value, which in the case of a 'foo' literal is known to be a string, but in the case of a placeholder, the type can't be divined, so is coming through to the query planner as some strange type, which it can't create an efficient plan for. In which case, the issue is not that the query is being planned differently because the value is a placeholder (at statement preparation time) per se but that the value is coming through to the query as a different PostgreSQL type, and that is what is influencing the query planner. To fix this would simply be a matter of binding the placeholder with an appropriate explicit type declaration.
I cannot talk about the client-side Perl interface itself but I can shed some light on the PostgreSQL server side.
PostgreSQL has prepared statements and unprepared statements. Unprepared statements are parsed, planned and executed immediately. They also do not support parameter substitution. On a plain psql shell you can show their query plan like this:
tmpdb> explain select * from sometable where flag = true;
On the other hand there are prepared statements: They are usually (see "exception" below) parsed and planned in one step and executed in a second step. They can be re-executed several times with different parameters, because they do support parameter substitution. The equivalent in psql is this:
tmpdb> prepare foo as select * from sometable where flag = $1;
tmpdb> explain execute foo(true);
You may see that the plan is different from the plan in the unprepared statement, because planning already took place in the prepare phase, as described in the doc for PREPARE:
When the PREPARE statement is executed, the specified statement is parsed, rewritten, and planned. When an EXECUTE command is subsequently issued, the prepared statement need only be executed. Thus, the parsing, rewriting, and planning stages are only performed once, instead of every time the statement is executed.
This also means that the plan is NOT optimized for the substituted parameters: in the first example, PostgreSQL might use an index for flag because it knows that within a million entries only ten have the value true. This reasoning is impossible when PostgreSQL uses a prepared statement. In that case a plan is created which will work as well as possible for all possible parameter values. This might exclude the mentioned index, because fetching the better part of the complete table via random access (due to the index) is slower than a plain sequential scan. The PREPARE doc confirms this:
In some situations, the query plan produced for a prepared statement will be inferior to the query plan that would have been chosen if the statement had been submitted and executed normally. This is because when the statement is planned and the planner attempts to determine the optimal query plan, the actual values of any parameters specified in the statement are unavailable. PostgreSQL collects statistics on the distribution of data in the table, and can use constant values in a statement to make guesses about the likely result of executing the statement. Since this data is unavailable when planning prepared statements with parameters, the chosen plan might be suboptimal.
BTW - Regarding plan caching the PREPARE doc also has something to say:
Prepared statements only last for the duration of the current database session. When the session ends, the prepared statement is forgotten, so it must be recreated before being used again.
Also there is no automatic plan caching and no caching/reuse over multiple connections.
EXCEPTION: I have mentioned "usually". The shown psql examples are not the stuff a client adapter like Perl DBI really uses. It uses a certain protocol. Here the term "simple query" corresponds to the "unprepared query" in psql, the term "extended query" corresponds to "prepared query" with one exception: There is a distinction between (one) "unnamed statement" and (possibly multiple) "named statements". Regarding named statements the doc says:
Named prepared statements can also be created and accessed at the SQL command level, using PREPARE and EXECUTE.
and also:
Query planning for named prepared-statement objects occurs when the Parse message is processed.
So in this case planning is done without parameters as described above for PREPARE - nothing new.
The mentioned exception is the "unnamed statement". The doc says:
The unnamed prepared statement is likewise planned during Parse processing if the Parse message defines no parameters. But if there are parameters, query planning occurs every time Bind parameters are supplied. This allows the planner to make use of the actual values of the parameters provided by each Bind message, rather than use generic estimates.
And here is the benefit: Although the unnamed statement is "prepared" (i.e. can have parameter substitution), it also can adapt the query plan to the actual parameters.
BTW: The exact handling of the unnamed statement has changed several times in past releases of the PostgreSQL server. You can look up the old docs for details if you really want.
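To make the difference between named and unnamed statements concrete, here is a toy Python model of when planning happens according to the doc passages quoted above. It is not server code: ToyServer and its methods are hypothetical, it only counts planner runs, and it reflects the quoted behaviour rather than that of any particular server release.

```python
class ToyServer:
    """Counts planner runs for Parse/Bind messages, per the quoted docs."""

    def __init__(self):
        self.plan_count = 0
        self.statements = {}

    def parse(self, name, sql, n_params):
        """Handle a Parse message; name '' means the unnamed statement."""
        # Named statements, and unnamed ones without parameters, are planned now.
        planned_at_parse = bool(name) or n_params == 0
        if planned_at_parse:
            self.plan_count += 1
        self.statements[name] = {"sql": sql, "planned_at_parse": planned_at_parse}

    def bind(self, name, params):
        """Handle a Bind message and report which kind of plan is used."""
        if self.statements[name]["planned_at_parse"]:
            return "generic"
        # Unnamed statement with parameters: plan again using the actual values.
        self.plan_count += 1
        return "custom"


SQL = "select * from sometable where flag = $1"

named = ToyServer()
named.parse("foo", SQL, n_params=1)
named_kinds = [named.bind("foo", [flag]) for flag in (True, False, True)]

unnamed = ToyServer()
unnamed.parse("", SQL, n_params=1)
unnamed_kinds = [unnamed.bind("", [flag]) for flag in (True, False, True)]
```

In the model, the named statement is planned once no matter how often it is bound, while the unnamed statement is planned once per Bind.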
Rationale - Perl / any client:
How a client like Perl uses the protocol is a completely different question. Some clients like the JDBC driver for Java basically say: Even if the programmer uses a prepared statement, the first five (or so) executions are internally mapped to a "simple query" (i.e. effectively unprepared), after that the driver switches to "named statement".
So a client has these choices:
Force (re)planning each time by using the "simple query" protocol.
Plan once, execute multiple times by using the "extended query" protocol and the "named statement" (plan might be bad because planning is done without parameters).
Parse once, plan for each execution (with current PostgreSQL version) by using the "extended query" protocol and the "unnamed statement" and obeying some more things (provide some params during "parse" message)
Play completely different tricks like the JDBC driver.
What Perl does currently: I don't know. But the mentioned "red herring" is not very unlikely.
By default, the prepared statement threshold is set to 5, instead of 1. That is,
((PGStatement) my_statement).getPrepareThreshold()
always returns 5 by default.
What would be the reason for that? Why would I want not to have to use the server-side prepared statement for the first 4 times the query is executed? I fail to understand why I would ever set this to another value than 1 and why this isn't by default set to 1.
Can you explain? Thanks a lot.
Server side prepared statements consume server side resources to store the execution plan for the statement. The threshold provides a heuristic that causes statements that are actually used "often" to be prepared. The definition of "often" defaults to 5.
Note that server side prepared statements can cause poor execution plans because they are not based on the parameters passed during the prepare. If the parameters passed to a prepared statement have a different selectivity on a particular index (for example), then the general query plan of the prepared statement may be suboptimal. As another example, if the cost of executing the query is much greater than the cost of creating a query plan, and the plan is poorly chosen because the bind parameters were not available, you may be better off not using server side prepared statements.
When the driver reaches the threshold, it will prepare the statement as follows:
if (!oneShot)
{
// Generate a statement name to use.
statementName = "S_" + (nextUniqueID++);
// And prepare the new statement.
// NB: Must clone the OID array, as it's a direct reference to
// the SimpleParameterList's internal array that might be modified
// under us.
query.setStatementName(statementName);
query.setStatementTypes((int[])typeOIDs.clone());
}
The statement name is sent as part of the wire protocol, which tells Postgres to prepare it server side.
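The threshold behaviour can be mimicked in a few lines of Python. This is a toy model, not the driver's code: the class is hypothetical, the "S_" naming simply echoes the snippet above, and it assumes the switch to a named statement happens on the execution that reaches the threshold (so the first 4 executions are one-shot with the default of 5). Only positive thresholds are modelled.

```python
import itertools

_unique_ids = itertools.count(1)  # stands in for the driver's nextUniqueID counter


class ThresholdStatement:
    """Toy model of a statement that goes server-side after N executions."""

    def __init__(self, sql, prepare_threshold=5):
        if prepare_threshold < 1:
            raise ValueError("this sketch only models positive thresholds")
        self.sql = sql
        self.prepare_threshold = prepare_threshold
        self.use_count = 0
        self.statement_name = None  # set once the statement is prepared server side

    def execute(self):
        """Return 'oneshot' or the server-side statement name used for this run."""
        self.use_count += 1
        if self.statement_name is None and self.use_count >= self.prepare_threshold:
            # Threshold reached: generate a name, which triggers a server-side prepare.
            self.statement_name = "S_" + str(next(_unique_ids))
        return self.statement_name or "oneshot"


default_stmt = ThresholdStatement("SELECT a, b FROM my_table WHERE c = ?")
default_modes = [default_stmt.execute() for _ in range(6)]

eager_stmt = ThresholdStatement("SELECT a, b FROM my_table WHERE c = ?", prepare_threshold=1)
eager_mode = eager_stmt.execute()
```

A statement that is executed only a handful of times never gets a name, so it never occupies a plan slot on the server.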
5 was picked at random and can be configured by the user.
In addition to using resources on the server, a named statement requires an extra round trip by the driver to describe the parameters for the statement. This is the primary reason the driver does not use a named statement by default.