Partial index not being used in PostgreSQL 8.2 - postgresql

I would like to run a query on a large table along the lines of:
SELECT DISTINCT user FROM tasks
WHERE ctime >= '2012-01-01' AND ctime < '2013-01-01' AND parent IS NULL;
There is already an index on tasks(ctime), but most (75%) of rows have a non-NULL parent, so that's not very effective.
I attempted to create a partial index for those rows:
CREATE INDEX CONCURRENTLY task_ctu_np ON tasks (ctime, user)
WHERE parent IS NULL;
but the query planner continues to choose the tasks(ctime) index instead of my partial index.
I'm using postgresql 8.2 on the server, and my psql client is 8.1.

First, I second Richard's suggestion that upgrading should be at the top of your priority list. Areas such as partial indexes have, as I understand it, improved significantly since 8.2.
The second thing is that you really need the actual query plans with timing information (EXPLAIN ANALYZE), because without these we can't talk about selectivity, etc.
So my order of business if I were you would be to upgrade first and then tune after that.
Now, I understand that 8.3 is a big upgrade (it is the only one that caused us issues in LedgerSMB). You may need some time to address that, but the alternative is to fall further behind and to be asking questions about a version that fewer and fewer people understand well as time goes on.
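For reference, here is the kind of thing you would need to run and post; a minimal sketch using the names from the question (user is a reserved word in PostgreSQL, so it is quoted here):
-- Capture the actual plan with row counts and timings, so the choice between
-- the tasks(ctime) index and the partial index can be discussed.
EXPLAIN ANALYZE
SELECT DISTINCT "user"
FROM tasks
WHERE ctime >= '2012-01-01'
  AND ctime < '2013-01-01'
  AND parent IS NULL;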

Related

Postgresql best index for datetime ranges

I have a Postgres table "tasks" with the fields "start": timestamptz, "finish": timestamptz, "type": int (and a lot of others). It contains about 200m records. The start, finish and type fields each have a separate B-tree index.
I'd like to build a "Tasks for a period" report and need to get all tasks which lie (fully or partially) inside the reporting period. The report could be built for all task types or for a specific one.
So I wrote the SQL:
SELECT * FROM tasks
WHERE start<={report_to}
AND finish>={report_from}
AND ({report_tasktype} IS NULL OR type={report_tasktype})
and it runs for ages even on short reporting periods.
Please advise: is there a way to improve performance by altering the query or by creating new indexes on the table? For various reasons I can't change the structure of the "tasks" table.
You would want a GiST index on the range. Since you already have it stored as two end points rather than as a range, you could do a functional index to convert them on the fly.
CREATE INDEX ON tasks USING gist (tstzrange(start, finish));
And then compare the ranges for overlap with &&
It may also improve things to add "type" as a second column to the index, which would require the btree_gist extension.
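Putting that together, a sketch under the assumptions above (the {report_*} placeholders come from the question; note that the tstzrange(...) expression in the query must match the index expression exactly for the index to be usable, and the default '[)' bounds may need changing to '[]' to mirror the original <= / >= comparisons):
-- Overlap test against the expression index instead of separate start/finish comparisons.
SELECT *
FROM tasks
WHERE tstzrange(start, finish) && tstzrange({report_from}, {report_to})
  AND ({report_tasktype} IS NULL OR type = {report_tasktype});
-- With btree_gist installed, type could be added as a second index key:
-- CREATE EXTENSION btree_gist;
-- CREATE INDEX ON tasks USING gist (tstzrange(start, finish), type);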

kdb: getting one row from HDB

For a normal table, we can select one row using select[1] from t. How can I do this for HDB?
I tried select[1] from t where date=2021.02.25, but it gives an error:
Not yet implemented: it probably makes sense, but it’s not defined nor implemented, and needs more thinking about as the language evolves
The select[n] syntax works only if the table is already loaded into memory.
The easiest way to get the first row of an HDB table is:
1#select from t where date=2021.02.25
select[n] will work if applied to already-loaded data, e.g.
select[1] from select from t where date=2021.02.25
I've done this before for ad-hoc queries by using the virtual index i, which should avoid the cost of pulling all data into memory just to select a couple of rows. If your query needs to map constraints in first before pulling a subset, this is a reasonable solution.
It will however pull N rows for each date partition selected due to the way that q queries work under the covers. So YMMV and this might not be the best solution if it was behind an API for example.
/ 5 rows (i[5] is the 6th row)
select from t where date=2021.02.25, sym=`abcd, price=1234.5, i<i[5]
If your table is date partitioned, you can simply run
select col1,col2 from t where date=2021.02.25,i=0
That will get the first record from 2021.02.25's partition, and avoid loading every record into memory.
As for your first request (which is different from the above), select[1] from t, you can achieve that with
.Q.ind[t;enlist 0]

where column in (single value) performance

I am writing dynamic SQL code and it would be easier to use a generic where column in (<comma-separated values>) clause, even when the clause might have only one term (it will never have zero).
So, does this query:
select * from table where column in (value1)
have any different performance than
select * from table where column=value1
?
All my tests result in the same execution plans, but if there is some knowledge/documentation that sets it in stone, that would be helpful.
This might not hold true for every RDBMS, nor for every query with its specific circumstances.
The engine will translate WHERE id IN(1,2,3) to WHERE id=1 OR id=2 OR id=3.
So your two ways to articulate the predicate will (probably) lead to exactly the same interpretation.
As always: we should not really worry about the way the engine "thinks"; that part was done pretty well by the developers :-) We tell it - through a statement - what we want to get, not how we want to get it.
Some more details here, especially the first part.
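If you want to confirm it on your own system, comparing the plans side by side is quick; the table and column names below are placeholders:
-- Both statements should produce identical plans on engines that rewrite
-- a single-element IN list into a plain equality predicate.
EXPLAIN SELECT * FROM my_table WHERE my_column IN (1);
EXPLAIN SELECT * FROM my_table WHERE my_column = 1;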
I think this will depend on the platform you are using (the optimizer of the given SQL engine).
I did a little test using MySQL Server and:
When I query select * from table where id = 1; I get 1 row in total, and the query took 0.0043 seconds.
When I query select * from table where id IN (1); I get 1 row in total, and the query took 0.0039 seconds.
I know this depends on the server and the machine, but the results are very close.
But you have to remember that IN is non-sargable (not Search ARGument-able): it will not use the index to resolve the query, whereas = is sargable and supports the index.
If you want to know which is best to use, you should test them in your environment, because they both work so well!

Are idx_scan statistics reset automatically (by default)?

I was looking at the tables (pg_stat_user_indexes and pg_stat_user_tables) and discovered many indices that are not being used.
But before I think about doing any operations to remove these indices, I need to understand what period this data (idx_scan) covers: has it been collected since the database was created?
In the pg_stat_database table (column stats_reset) there is a date that is normally today or up to 15 days ago, but does this process interfere with the tables I mentioned above?
The pg_stat_reset() command has not been executed.
Does the pg_stat_reset() command clear the tables (pg_stat_user_indexes and pg_stat_user_tables)?
My goal is to understand the period of data collected so that I can make a decision.
Statistics are cumulative and are kept from the time of cluster creation on.
So if you see the pg_stat_database.stats_reset change regularly, there must be somebody or something doing that explicitly with the pg_stat_reset() function.
Doing so is somewhat problematic, because this resets all statistics, including those in pg_stat_user_tables which govern when autovacuum and autoanalyze take place. So after a reset these will be a little out of whack until autoanalyze has collected new statistics.
The better way is to take regular snapshots and calculate the difference.
You are right that you should collect data over a longer time before you determine if an index can be canned or not. For example, some activity may only take place once a month, but require certain indexes.
Before dropping indexes, consider that indexes also serve other purposes besides being scanned:
They can be UNIQUE or back a constraint, in which case they serve a purpose even when they are never scanned.
Indexes on expressions make PostgreSQL collect statistics on the distribution of the indexed expression, which can have a notable effect on query planning and the quality of your execution plans.
You could use the query in this blog to find all the indexes that serve no purpose at all.
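Since the blog itself is not reproduced here, a rough sketch of a query along those lines, written from the system catalogs (treat it as a starting point rather than the blog's exact query):
SELECT s.schemaname, s.relname, s.indexrelname,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0                 -- never scanned since the last stats reset
  AND NOT i.indisunique              -- not a unique index
  AND NOT EXISTS (SELECT 1 FROM pg_constraint c
                  WHERE c.conindid = s.indexrelid)   -- not backing a constraint
ORDER BY pg_relation_size(s.indexrelid) DESC;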
Only a superuser is allowed to reset statistics. The query planner depends on statistics.
Use snapshots:
CREATE TABLE stat_idx_snap_m10_d29_16_12 AS SELECT * FROM pg_stat_user_indexes;
CREATE TABLE stat_idx_snap_m10_d29_16_20 AS SELECT * FROM pg_stat_user_indexes;
Analyze the difference at any time later:
SELECT
s2.relid, s2.indexrelid, s2.schemaname, s2.relname, s2.indexrelname,
s2.idx_scan - s1.idx_scan as idx_scan,
s2.idx_tup_read - s1.idx_tup_read as idx_tup_read,
s2.idx_tup_fetch - s1.idx_tup_fetch as idx_tup_fetch
FROM stat_idx_snap_m10_d29_16_20 s2
FULL OUTER JOIN stat_idx_snap_m10_d29_16_12 s1
ON s2.relid = s1.relid AND s2.indexrelid = s1.indexrelid
ORDER BY s2.idx_scan - s1.idx_scan ASC;

Understanding SQL query complexity

I'm currently having trouble understanding why a seemingly simple query is taking much longer to return results than a much more complicated (looking) query.
I have a view, performance_summary (which in turn selects from another view). Currently, within psql, when I run a query like
SELECT section
FROM performance_summary
LIMIT 1;
it takes a minute or so to return a result, whereas a query like
SELECT section, version, weighted_approval_rate
FROM performance_summary
WHERE version in ('1.3.10', '1.3.11') AND section ~~ '%WEST'
ORDER BY 1,2;
gets results almost instantly. Without knowing how the view is defined, is there any obvious or common reason why this is?
Not really, without knowing how the view is defined. It could be that the "more complex" query uses an index to select just two rows and then performs some trivial grouping and sorting on them. The query without the WHERE clause might have Postgres operating on millions of rows, performing trillions of operations, and producing a single row only after discarding 999999999 rows; we just don't know unless you post the view definition and the EXPLAIN plan output for both queries.
You might be falling into the trap of thinking that a view is somehow a cache of information. It isn't: it's a stored query that is inserted into the larger query whenever you select from it or include it in another query, which means the whole thing must be planned and executed from scratch. There is no notion that creating a view does any pre-planning onto which further improvements are layered. It's more as if the definition of the view were pasted into any query that uses it, and that query then run as if it had just been written there and then.
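One way to see this is to compare the actual plans for both queries; the view body is inlined and planned from scratch each time (query text taken from the question):
EXPLAIN ANALYZE
SELECT section
FROM performance_summary
LIMIT 1;
EXPLAIN ANALYZE
SELECT section, version, weighted_approval_rate
FROM performance_summary
WHERE version IN ('1.3.10', '1.3.11') AND section ~~ '%WEST'
ORDER BY 1, 2;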