Is it possible to use a stable function in an index in Postgres? - postgresql

I've been working on a project at work and have come to the realization that I must invoke a function in several of the queries' WHERE clauses. The performance isn't terrible exactly, but I would love to improve it. So I looked at the docs for indexes which mentioned that:
An index field can be an expression computed from the values of one or more columns of the table row.
Awesome. So I tried creating an index:
CREATE INDEX idx_foo ON foo_table (stable_function(foo_column));
And received an error:
ERROR: functions in index expression must be marked IMMUTABLE
So then I read about Function Volatility Categories which had this to say about stable volatility:
In particular, it is safe to use an expression containing such a function in an index scan condition.
Based on the phrasing "index scan condition" I'm guessing it doesn't mean an actual index. So what does it mean? Is it possible to utilize a stable function in an index? Or do we have to go all the way and ensure this would work as an immutable function?
We're using Postgres v9.0.1.

An "index scan condition" is a search condition that is evaluated against an existing index. A stable function is safe there because the comparison value is computed once per scan; a volatile function, by contrast, has to be called for each row processed and cannot serve as an index scan condition. An index definition can only use a function if it is immutable -- that is, the function will always return the same value when called with any given set of arguments, and has no user-visible side effects. If you think about it a little, you should be able to see what kind of trouble you could get into if the function might return a different value than it did when the index entry was created.
You might be tempted to lie to the database and declare a function as immutable which isn't really; but if you do, the database will probably do surprising things that you would rather it didn't.
9.0.1 has bugs for which fixes are available. Please upgrade to 9.0.somethingrecent.
http://www.postgresql.org/support/versioning/
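To make the distinction concrete, here is a minimal sketch. It assumes the function in the question can be rewritten so that its result depends only on its arguments. The names normalize_foo, created_at, and idx_foo_created are invented for illustration; only foo_table, foo_column, and idx_foo come from the question.

```sql
-- Hypothetical schema for illustration only.
CREATE TABLE foo_table (
    foo_column text,
    created_at timestamptz DEFAULT now()
);

-- Declared IMMUTABLE because the result depends only on the argument.
-- $1 is used instead of a parameter name so this also works on 9.0.
CREATE FUNCTION normalize_foo(text) RETURNS text
    LANGUAGE sql IMMUTABLE
    AS $$ SELECT lower(btrim($1)) $$;

-- An immutable function is accepted in an index definition.
CREATE INDEX idx_foo ON foo_table (normalize_foo(foo_column));

-- The planner can consider the index when the query repeats the same expression.
SELECT * FROM foo_table WHERE normalize_foo(foo_column) = 'abc';

-- A stable function such as now() cannot appear in an index definition,
-- but it is fine as the comparison value in an index scan condition.
CREATE INDEX idx_foo_created ON foo_table (created_at);
SELECT * FROM foo_table WHERE created_at > now() - interval '1 day';
```

Whether the planner actually chooses either index depends on table size and statistics; EXPLAIN shows the plan it picked.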


kdb - How to pass a table by reference to kdb function

Define the question
Given an empty table myt defined by
myt:([] id:`int$(); score:`int$())
It is trivial to insert one or more records into it, for example
`myt upsert `id`score!1 100
But defining a function that inserts into a given table turns out to be trickier.
A first attempt could be
upd:{[t] t upsert `id`score!42 314;}
upd[myt]
Apparently this does not update myt itself, only a local copy of it.
Difficulties of Possible solutions
Possible solution 1: using the global variable instead
Let myt be a global variable, the variable will then be accessed inside a function.
upd:{`myt upsert `id`score!42 314;}
upd[]
It looks like a good solution, except when many myt tables are required. In that situation, one has to provide many copies of the upd function, as follows
upd0:{`myt0 upsert `id`score!42 314;}
upd1:{`myt1 upsert `id`score!42 314;}
upd2:{`myt2 upsert `id`score!42 314;}
...
So, the global variable solution is not a good solution here.
Possible solution 2: amending table outside function
One can also solve the problem by amending myt just outside the function, returning the modified result by removing the ending ;.
upd:{[t] t upsert `id`score!42 314} / return the updated table
myt:upd[myt]
It works! But after running this code millions of times, it gets slower and slower. Because this solution discards the "in-place" property of the upsert operator, the copy overhead increases as the table grows.
Pass argument by reference?
Maybe the concept of "pass-by-reference" is the solution here. Or maybe q has its own solution for this problem and I have not grasped the essential idea.
[UPDATE] Solved by adding "`" to call-by-name
As cillianreilly answers, it is enough to add a "`" symbol in front of myt when passing it into the function, so that the table is passed by name and the global variable is updated. So the solution is straightforward.
upd:{[t] t upsert `id`score!42 314;}
upd[`myt] / it works
Your first version should achieve what you want. If you pass the table name as a symbol, it will update the global variable and return the table name. If you pass the table itself, it will return the updated table, which you can use in an assignment, as you found in possible solution 2. Note that the actual table will not have been updated by this operation.
q){[t;x]t upsert x}[myt;`id`score!42 314]
id score
--------
42 314
q)count myt
0
q){[t;x]t upsert x}[`myt;`id`score!42 314]
`myt
q)count myt
1
For possible solution 1, why would you need hundreds of myt tables? Regardless, there is no need to hardcode the table name into the function. You can just pass the table name as a symbol as demonstrated above, which will update the global for you. The official kx kdb tick example given on their github uses insert for exactly this scenario, but in practice a lot of developers use upsert. https://github.com/KxSystems/kdb-tick/blob/master/tick/r.q#L6
Hope this helps.

Postgres LIKE query triggers full table scan

I have a Postgres query where we have several indices set up, including one on a text field where we have a GIN index. My understanding of this based on the pg_trgm documentation is that it's only applicable if the search string is made up of alphanumeric text. Testing bears this out and in a database with tens of millions of records, doing something like the following works great:
SELECT * FROM my_table WHERE target_field LIKE '%foo%'
I've read in various places that anything that's not an alphanumeric string is treated as a separate word in the trigram search, so something like the following also works quite well:
SELECT * FROM my_table WHERE target_field LIKE '%foo & bar%'
However, someone ran a search that was literally just three question marks in a row and it triggered a full table scan. For some reason, when multiple ampersands or question marks are used alone in the query, they're being treated differently than a single one placed next to or among actual alphanumeric characters.
The research I've done has implied that it might be how some database drivers handle the question mark, sometimes interpreting it as a parameter that needs to be supplied, but then gets confused because it can't find the parameters and triggers a table scan. I don't really believe this is the case. I might be inclined to believe it would throw an error rather than completing the query, but running it anyway seems like a design flaw.
What makes more sense is that a question mark isn't an alpha-numeric character and thus it's treated differently. In some technologies, common symbols such as & are considered alpha-numeric, but I don't think that's the case with Postgres. In fact, the documentation suggests that non-alphanumeric characters are treated as word boundaries in a GIN-based index.
What's weird is that I can search for %foo & bar%, which seems to work fine. I can even search for %&% and it returns quickly, though not with the results I wanted. But if I put (for example) three of them together like this: %&&&%, it triggers a full table scan.
After running various experiments, here's what I've seen:
%%: uses the index
%&%: uses the index
%?%: uses the index
%foo & bar%: uses the index
%foo ? bar%: uses the index
%foo && bar%: uses the index
%foo ?? bar%: uses the index
%&&%: triggers a full table scan
%??%: triggers a full table scan
%foo&bar%: uses the index, but returns no results
I think that all of those make sense until you get to #8 and #9. And if the ampersand were a word boundary, shouldn't #10 return results?
Anyone have an explanation of why multiple consecutive punctuation characters would be treated differently than a single punctuation character?
I can't reproduce this in v11 on a table full of md5 hashes: I get seq scans (full table scans) for the first 3 of your patterns.
If I force them to use the index by setting enable_seqscan=false, then I do get it to use the index, but it is actually slower than doing the seq scan. So it made the right call there. How about for you? You shouldn't force it to use the index just on principle when it is actually slower.
It would be interesting to see the estimated number of rows it thinks it will return for all of those examples.
In fact, the documentation suggests that non-alphanumeric characters are treated as word boundaries in a GIN-based index.
The G in GIN is for "generalized". You can't make blanket statements like that about something which is generalized. They don't even need to operate on text at all. But in your case, you are using the LIKE operator, and the LIKE operator doesn't care about word boundaries. Any GIN index which claims to support the LIKE operator must return the correct results for the LIKE operator. If it can't do that, then it is a bug for it to claim to support it.
It is true that pg_trgm treats & and ? the same as white space when extracting trigrams, but it is obliged to insulate LIKE from the effects of this decision. It does this by two methods. One is that it returns "MAYBE" results, meaning all the tuples it reports must be rechecked to see if they actually satisfy the LIKE. So '%foo&bar%' and '%foo & bar%' will return the same set of tuples to the heap scan, but the heap scan will recheck them and so finally return a different set to the user, depending on which ones survive the recheck. The second thing is, if pg_trgm can't extract any trigrams at all out of the query string, then it must return the entire table to then be rechecked. This is what would happen with '%%', '%?%', '%??%', etc. Of course rechecking all rows is slower than just doing the seq scan in the first place.
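One way to see this for yourself is pg_trgm's show_trgm function, which lists the trigrams extracted from a plain string. This is only a rough sketch: show_trgm pads the string the way indexed values are padded, while extraction from a LIKE pattern differs next to wildcards, so treat the output as an approximation of what the index is probed with. The table and column names are the ones from the question.

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Punctuation is treated like white space, so these two strings
-- should produce the same set of trigrams.
SELECT show_trgm('foo & bar');
SELECT show_trgm('foo&bar');

-- A string made up only of punctuation leaves nothing to extract.
SELECT show_trgm('??');

-- Compare the planner's row estimates and chosen plan for a pattern
-- with trigrams against one without.
EXPLAIN SELECT * FROM my_table WHERE target_field LIKE '%foo & bar%';
EXPLAIN SELECT * FROM my_table WHERE target_field LIKE '%??%';
```

The estimated row counts in the EXPLAIN output are what drive the choice between the index and a seq scan.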

Weak points while comparing columns in trigger as hstore-data

In a BEFORE UPDATE trigger I'd like to find out which columns have changed. To make it as generic as possible, I don't want to use schema, table or column names in the function. I found some solutions here on SO and in other places, and particularly liked the idea of using hstore from this answer
The widely cited downside of hstore is that I lose data types this way; everything is stringified.
But using it in the context of a trigger (with no complex columns like json or hstore), where both NEW and OLD have the same set of columns with the same data types, I could think of just one problem: NULL and empty values will not be distinguishable.
What other problems might I face when I detect changes in a trigger function like this:
changes := hstore(NEW) - hstore(OLD);
The alternative seems to be to use jsonb and then write some jsonb_diff function to discover changes. hstore's built-in subtraction operator seems far more robust, but maybe I have not considered all of its weak points.
I'm using Postgres 9.6.
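For context, a minimal sketch of the kind of generic trigger described above might look like the following. The table some_table, the function log_changed_columns, and the trigger name are hypothetical; only the hstore(NEW) - hstore(OLD) expression comes from the question.

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

-- Hypothetical table used only to attach the trigger to.
CREATE TABLE some_table (id int PRIMARY KEY, name text, note text);

CREATE OR REPLACE FUNCTION log_changed_columns() RETURNS trigger
LANGUAGE plpgsql AS $$
DECLARE
    changes hstore;
BEGIN
    -- Keeps only the key/value pairs of NEW that have no identical pair in OLD.
    changes := hstore(NEW) - hstore(OLD);

    IF changes = ''::hstore THEN
        RETURN NEW;  -- no column changed
    END IF;

    -- akeys() returns the names of the changed columns as text[].
    RAISE NOTICE 'changed columns: %', akeys(changes);
    RETURN NEW;
END;
$$;

CREATE TRIGGER trg_log_changes
    BEFORE UPDATE ON some_table
    FOR EACH ROW EXECUTE PROCEDURE log_changed_columns();
```

Because the function never names a table or column, the same function can be attached to other tables with additional CREATE TRIGGER statements.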

PostgreSQL Probabilities: EXPLAIN on CREATE INDEX

I am using PostgreSQL to compute empirical probability density functions for pairs of variables across all my data. I am trying to determine if/when it is more effective to index before computing the PDF. I run EXPLAIN CREATE INDEX like,
EXPLAIN CREATE INDEX AB ON xrootd ("F.mName", "F.mOpenTime");
CREATE INDEX AB ON xrootd ("F.mName", "F.mOpenTime");
But PSQL complains,
psql:sql/stats.sql:3: ERROR: syntax error at or near "INDEX"
LINE 2: EXPLAIN CREATE INDEX AB ON xrootd ("F.mName", "F.mOpenTime")...
Is there a better way of doing what I am trying to do? At the very least, I would like to know if constructing the indexes is useful. I have a lot of variables and a lot of data, so speeding this up is crucial.
Checking the cost of CREATE INDEX would tell me whether building the index is too expensive for the gain of using it.
There is no EXPLAIN CREATE INDEX, as EXPLAIN only works for SELECTs and DML queries (INSERT/DELETE/UPDATE).
What you are probably looking for is called hypothetical indexes. There is a PostgreSQL extension for that: https://github.com/dalibo/hypopg
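A minimal sketch of how that extension is typically used, assuming it is installed on the server. The SELECT is a made-up example query, not the asker's actual workload. Note that this shows whether the planner would use such an index, not what it would cost to build it.

```sql
CREATE EXTENSION hypopg;

-- Register a hypothetical index; nothing is built on disk.
SELECT * FROM hypopg_create_index(
    'CREATE INDEX ON xrootd ("F.mName", "F.mOpenTime")'
);

-- Plain EXPLAIN (without ANALYZE) takes hypothetical indexes into account,
-- so the plan shows whether the planner would pick the index.
EXPLAIN
SELECT "F.mName", "F.mOpenTime"
FROM xrootd
ORDER BY "F.mName", "F.mOpenTime";

-- Drop all hypothetical indexes created in this session.
SELECT hypopg_reset();
```

Comparing the estimated cost of the same EXPLAIN before and after hypopg_create_index gives a rough idea of the potential gain.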

PL/pgSQL - %TYPE and ARRAY

Is it possible to use the %TYPE and array together?
CREATE FUNCTION role_update(
IN id "role".role_id % TYPE,
IN name "role".role_name % TYPE,
IN user_id_list "user".user_id % TYPE[],
IN permission_id_list INT[]
)
I get a syntax error from this, but I don't want to duplicate any column type. I want to use "user".user_id % TYPE instead of simply INT, because then it is easier to modify a column type later.
As the manual explains here:
The type of a column is referenced by writing table_name.column_name%TYPE. Using this feature can sometimes help make a function independent of changes to the definition of a table.
The same functionality can be used in the RETURNS clause.
But there is no simple way to derive an array type from a referenced column, at least none that I would know of.
About modifying any column type later:
You are aware that this type of syntax is only a syntactical convenience to derive the type from a table column? Once created, there is no link whatsoever to the table or column involved.
It helps to keep a whole create script in sync. But it doesn't help with later changes to live objects in the database.
Related answer on dba.SE:
Array of template type in PL/pgSQL function using %TYPE
Using referenced types in a function's parameters makes little sense in PostgreSQL, because they are translated immediately to the actual types and stored as such. Sorry, PostgreSQL doesn't support this functionality. Using referenced types inside a function body is a different matter: there the actual type is resolved on the first execution in each session.
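A small sketch that illustrates this. The "role" table definition and the function body are assumptions made up for the example, using the column names from the question without the array parameters.

```sql
-- Hypothetical table definition matching the column names in the question.
CREATE TABLE "role" (role_id int PRIMARY KEY, role_name text);

-- %TYPE is accepted for scalar parameters; PostgreSQL resolves it
-- to the current column type when the function is created.
CREATE FUNCTION role_update(
    id   "role".role_id%TYPE,
    name "role".role_name%TYPE
) RETURNS void
LANGUAGE sql AS $$
    UPDATE "role" SET role_name = $2 WHERE role_id = $1
$$;

-- The stored signature contains the resolved types, not the %TYPE reference.
SELECT pg_get_function_arguments('role_update'::regproc);
```

Altering the column type afterwards leaves the stored signature unchanged, so the function would have to be dropped and recreated to pick up the new type.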