Time of first occurrence of the query from pg_stat_statements. Possible to get? - postgresql

I'm debugging a DB performance issue. There's a lead that the issue was introduced after a certain deploy, e.g. when the DB started serving some new queries.
I'm looking to correlate deployment time with the performance issues, and would like to identify the queries that are causing this.
Using pg_stat_statements has been very handy so far. Unfortunately it does not store the timestamp of the first occurrence of each query.
Are there any auxiliary tables I could look into to see the time of first occurrence of queries?
Ideally, if this information were available in pg_stat_statements, I'd write a query like this:
select queryid from pg_stat_statements where date(first_run) = '2020-04-01';
Additionally, it'd be cool to see last_run as well, so as to filter out old queries that no longer execute at all but remain in pg_stat_statements. That's more of a nice-to-have than a necessity, though.

This information is not stored anywhere, and indeed it would not be very useful. If the problem statement is a new one, you can easily identify it in your application code. If it is not a new statement, but something made the query slower, the first time the query was executed won't help you.
Is your source code not under version control?
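That said, if you want to track this going forward, nothing stops you from snapshotting pg_stat_statements yourself. A rough sketch (the table name is made up; this is not a built-in feature): the first snapshot in which a queryid appears approximates its first occurrence, and comparing calls between snapshots tells you which queries are still executing.
CREATE TABLE pgss_snapshots AS
SELECT now() AS captured_at, queryid, calls
FROM pg_stat_statements;

-- run periodically, e.g. from cron
INSERT INTO pgss_snapshots
SELECT now(), queryid, calls
FROM pg_stat_statements;

-- approximate first occurrence per queryid
SELECT queryid, min(captured_at) AS first_seen
FROM pgss_snapshots
GROUP BY queryid;
The resolution is limited by how often you snapshot and by pg_stat_statements evicting or resetting entries.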

Related

Postgres query plan changes for identical query when using pg_hint_plan

(Postgres 11.7)
I'm using the Rows pg_hint_plan hint to dynamically fix a bad row-count estimate.
My query accepts an array of arguments, which get unnested and joined onto the rest of the query as a predicate. By default, the query planner always assumes this array-argument contains 100 records, whereas in reality this number could be very different. This bad estimate was resulting in poor query plans. I set the number of rows definitively from within the calling application, by changing the hint text per query.
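For illustration, the injected hint looks something like this (the alias, table name, and row count here are placeholders, not my real query); the Rows hint overrides the planner's row estimate for the join of the listed relations:
/*+ Rows(args t #250) */
SELECT t.*
FROM unnest($1::int[]) AS args(id)
JOIN t ON t.id = args.id;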
This approach seems to work sometimes, but I see some strange behaviour when testing the query (in DBeaver).
If I start with a brand new connection, when I explain the query (or indeed just run it), the hint seems to be ignored for the first 6 executions, but thereafter it starts getting interpreted correctly. This is consistently reproducible: I see the offending row count estimates change on the 7th execution on a new connection.
More interestingly, the query also uses some (immutable) functions to do some lookup operations. If I remove these and replace them with an equivalent CTE or sub-select, this strange behaviour seems to disappear, and the hints are evaluated correctly all the time, even on a brand new connection.
What could be causing it to not honour the pg_hint_plan hints until after 6 requests have been made in that session? Why does the presence of the functions have a bearing on the hints?
Since you are using JDBC, try setting the prepareThreshold connection parameter to 1, as detailed in the documentation.
That will make the driver use a server-side prepared statement as soon as possible, and it seems like this extension only works in that case.
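For example, the parameter can go straight into the JDBC URL (host and database name are placeholders here):
jdbc:postgresql://localhost:5432/mydb?prepareThreshold=1
It can also be set as a connection property wherever the driver is configured, e.g. in DBeaver's driver properties.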

How to cache duplicate queries in PostgreSQL

I have some expensive queries (each of them takes around 90 seconds). The good news is that my queries don't change much; as a result, most of my queries are duplicates. I am looking for a way to cache the query results in PostgreSQL. I have searched for an answer but could not find one (some answers are outdated and some of them are not clear).
I use an application which is connected to the Postgres directly.
The query is a simple SQL query which returns thousands of rows.
SELECT * FROM Foo WHERE field_a<100
Is there any way to cache a query result for at least a couple of hours?
It is possible to cache expensive queries in Postgres using a technique called a "materialized view"; however, given how simple your query is, I'm not sure this will give you much gain.
You may be better off caching this information directly in your application, in memory, or if possible caching a further-processed set of data rather than the raw rows.
ref:
https://www.postgresql.org/docs/current/rules-materializedviews.html
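A minimal sketch using the query from the question (the view name is arbitrary):
CREATE MATERIALIZED VIEW foo_cache AS
SELECT * FROM Foo WHERE field_a < 100;

-- re-run the expensive query and update the cached result,
-- e.g. every couple of hours from cron
REFRESH MATERIALIZED VIEW foo_cache;
Reads then go against foo_cache instead of Foo and see the data as of the last refresh.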
Depending on what your application looks like, a TEMPORARY TABLE might work for you. It is only visible to the connection that created it and it is automatically dropped when the database session is closed.
CREATE TEMPORARY TABLE tempfoo AS
SELECT * FROM Foo WHERE field_a<100;
The downside to this approach is that you get a snapshot of Foo when you create tempfoo. You will not see any new data that gets added to Foo when you look at tempfoo.
Another approach: if you have access to the database, you may be able to significantly speed up your queries by adding an index on field_a.
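For instance (the index name is chosen arbitrarily):
CREATE INDEX foo_field_a_idx ON Foo (field_a);
With an index on field_a the planner can use an index scan for field_a < 100 instead of reading the whole table, which may make caching unnecessary.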

Implement interval analysis on top of PostgreSQL

I have a couple of million entries in a table with start and end timestamps. I want to implement an analysis tool which determines unique entries for a specific interval, let's say between yesterday and 2 months before yesterday.
Depending on the interval, the queries take between a couple of seconds and 30 minutes. How would I implement an analysis tool for a web front-end that allows querying this data quite quickly, similar to Google Analytics?
I was thinking of moving the data into Redis and doing something clever with intervals, sorted sets, etc., but I was wondering if there's something in PostgreSQL which would allow executing aggregated queries and re-using old results, so that, for instance, after querying the first couple of days it does not start from scratch again when looking at a different interval.
If not, what should I do? Export the data to something like Apache Spark or DynamoDB, do the analysis there, and fill Redis for quicker retrieval?
Either will do.
Aggregation is a basic task they all can do, and your data is small enough to fit into main memory. So you don't even need a database (although the aggregation functions of a database may still be better implemented than anything you'd rewrite yourself, and SQL is quite convenient to use).
Just do it. Give it a try.
P.S. Make sure to enable data indexing, and choose the right data types. Maybe check query plans, too.
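As a concrete sketch in PostgreSQL (table and column names are assumed from the description): counting unique entries whose lifetime overlaps the requested interval is a single aggregate, and an index on the timestamps helps the planner avoid a full scan for selective intervals.
-- assumed schema: entries(id, start_ts, end_ts)
CREATE INDEX entries_start_end_idx ON entries (start_ts, end_ts);

-- unique entries overlapping the interval from ~2 months ago up to yesterday
SELECT count(DISTINCT id)
FROM entries
WHERE start_ts <  now() - interval '1 day'
  AND end_ts   >= now() - interval '61 days';
If the same sub-intervals are queried repeatedly, precomputing coarser aggregates (for example in a materialized view refreshed daily) can avoid recomputing the full range each time.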

Query log equivalent for Progress/OpenEdge

Short story: A report running against a Progress database (OpenEdge Release 10.1C03) takes hours to complete. I suspect that it does not take advantage of existing data indexes. I would like to understand how it scans the data so that I can then try to add an index that will make it run faster.
Source code of the report is not available. The code is native Progress 4GL, not SQL.
If it were an SQL database I would try to do a dump of the SQL queries and go from there. With 4GL I did not find any such functionality. Is it possible to somehow peek at what gets executed at the low level?
What else can be done if there is no source code?
Thanks!
There are several things you can do:
If I recall correctly, 10.1C should have the _usertablestat and _userindexstat virtual system tables available. These allow you to observe, at runtime, what tables and indexes are being accessed by a particular session. You can either write your own 4GL program to query them or use the screens in PROMON: R&D, 3 "Other Displays", 5 "I/O Operations by User by Table" and 6 "I/O Operations by User by Index". That will show you which tables and indexes are actually in use and how much use they are getting. If the observed data seems wrong, it will probably give you a clue. (If the VSTs are missing it might be because the db was upgraded from an older version -- add them with proutil dbname -C updatevsts.)
You could also use the session startup parameters -clientlog "filename" and -logentrytypes QryInfo to obtain more detailed information about the queries being executed.
Keep in mind that Progress is not SQL. Unlike most SQL databases, the 4GL uses a static, compile-time optimizer. Index selection happens when the code is compiled. So unless you can recompile (and since you don't seem to have source, that seems unlikely) you won't be able to improve things by adding missing indexes. You might, however, at least be able to show the person who does have the source where the problem is.
Another tool that can help is the profiler. This will identify where in the code the time is being spent. That can also be good information to provide to the original vendor if they need help finding the problem. For more information on the profiler: http://dbappraise.com/ppt/profiler.pptx

Sphinx UPDATE performance

Sphinx 2.0.1 brings with it the ability to call UPDATE and update an individual item in an index.
Does anyone know what kind of performance this brings to Sphinx when called VERY frequently (as frequently as several hundred times a second)? The reason for this would be to keep a real-time index of trending item scores which get updated every time a user performs an action. Obviously, when there are lots of users this value can be updated quite frequently.
EDIT:
I should mention that I am not using SphinxSE.
You are talking about Sphinx RT indexes... Updates are fast, but remember, this type of index does not support enable_star. This means you can't perform searches like appl*.
Such attributes are stored in memory. So updates should be really fast.
But I've never benchmarked it. So try benchmarking it!
... although to be honest I would still be tempted to 'batch process' it. Write the actions to a log "file", and then process that log in batches. Maybe every 10 seconds. All actions on the same record can be run as one update statement.
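For reference, a SphinxQL attribute update of the kind discussed above looks roughly like this (the index name, attribute, and id are made up):
UPDATE items_rt SET trend_score = 42 WHERE id = 123;
UPDATE can only change attribute values, not full-text fields, so with the batched approach you would compute the new score in the application and issue one such statement per record.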