Does PostgreSQL use indexes for queries with OR? - postgresql

For instance, I've got a Rails app with an index on a column called archived_at. The query is an OR that checks if archived_at IS NULL OR archived_at is in the future (it's a timestamp column).
Does using an OR bypass indices? I've heard something about that before.

PostgreSQL can combine multiple index scans with a Bitmap Index Scan (a BitmapOr node in the plan) for this kind of query, so an OR does not by itself prevent index use.
If your rows are (more or less) well ordered in the physical table (for example, archived_at follows insertion order and there are few deletions), this process is extremely efficient.
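As a rough illustration (the table name here is made up, and the exact plan depends on your data and statistics), an EXPLAIN on such a query can show the same index scanned once per branch of the OR, with the results combined by a BitmapOr:

CREATE INDEX posts_archived_at_idx ON posts (archived_at);

EXPLAIN
SELECT * FROM posts
WHERE archived_at IS NULL OR archived_at > now();

-- Expected shape of the plan (abridged):
--   Bitmap Heap Scan on posts
--     ->  BitmapOr
--           ->  Bitmap Index Scan on posts_archived_at_idx
--                 Index Cond: (archived_at IS NULL)
--           ->  Bitmap Index Scan on posts_archived_at_idx
--                 Index Cond: (archived_at > now())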

Related

Is there a workaround to ensure Index Only Scan on jsondoc->>'....' equality queries?

I architected an entire application around the idea of storing unprocessed JSON documents in jsonb and then selectively creating covering indexes using field access expressions of the form (jsondoc->>'fieldname') to ensure all required queries can execute as Index Only Scans. The idea of storing the raw json is very attractive because it removes an entire ETL (Extract, Transform, Load) layer which has been a constant source of bugs and operational headaches.
The pgsql manual contains this very unfortunate fact:
In principle, index-only scans can be used with expression indexes.
For example, given an index on f(x) where x is a table column, it
should be possible to execute
SELECT f(x) FROM tab WHERE f(x) < 1;
as an index-only scan; and this is very attractive if f() is an
expensive-to-compute function. However, PostgreSQL's planner is
currently not very smart about such cases. It considers a query to be
potentially executable by index-only scan only when all columns needed
by the query are available from the index. In this example, x is not
needed except in the context f(x), but the planner does not notice
that and concludes that an index-only scan is not possible. If an
index-only scan seems sufficiently worthwhile, this can be worked
around by declaring the index to be on (f(x), x), where the second
column is not expected to be used in practice but is just there to
convince the planner that an index-only scan is possible. An
additional caveat, if the goal is to avoid recalculating f(x), is that
the planner won't necessarily match uses of f(x) that aren't in
indexable WHERE clauses to the index column. It will usually get this
right in simple queries such as shown above, but not in queries that
involve joins. These deficiencies may be remedied in future versions
of PostgreSQL.
My question is very specifically:
How can the workaround described for 'x' be applied to jsonb columns, without replicating the enormous json payload in the index?
You can extract the fields you want index-only scans on into generated columns and create the corresponding covering indexes on the generated columns. The query planner can then use an index-only scan.
Assuming that you only need to do this for a few fields, it should be much more efficient than an index that includes the full jsondoc column.
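A minimal sketch of that approach (generated columns need PostgreSQL 12 or later; the table name docs, the field name fieldname, and the index name are hypothetical):

ALTER TABLE docs
    ADD COLUMN fieldname text
    GENERATED ALWAYS AS (jsondoc->>'fieldname') STORED;

CREATE INDEX docs_fieldname_idx ON docs (fieldname);

-- This query only references the generated column, so the planner can
-- satisfy it with an index-only scan (once the visibility map is set by vacuum):
SELECT fieldname FROM docs WHERE fieldname = 'some value';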

Selecting from JSONB field slow

I have a relatively small table (~50k rows). When I select all records, it takes ~40s. The table has 3 JSONB columns. When I select every column except for the JSONBs, the query takes ~700ms.
If I add in just one of the JSONB fields, the query time jumps to nearly 10s.
I'm never using a where clause referencing something inside the JSONB, just selecting *. Even so, I tried adding GIN indexes because I saw them frequently mentioned as a performance booster for JSONB.
I've run a full vacuum.
Postgres version 9.6
explain (analyze, buffers) select * from message;
Seq Scan on message  (cost=0.00..5541.69 rows=52969 width=834) (actual time=1.736..116.183 rows=52969 loops=1)
  Buffers: shared hit=64 read=4948
Planning time: 0.151 ms
Execution time: 133.555 ms
jsonb is a PostgreSQL varlena data type, which means that when a value is longer than about 2 KB it is stored out of line in an auxiliary table (the TOAST table), and only a pointer to the TOAST table is kept in the main table. So as long as you don't touch the jsonb column, the value is never read.
A GIN index doesn't help in this case; it only speeds up searching inside the jsonb values, not returning them.
10 seconds for 50K rows is a long time. Maybe your jsonb values are quite large, or your I/O system performs poorly. Check the size of your table and the performance of your I/O; cheap cloud machines often have terrible I/O.
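To see how much of the table actually lives out of line, you can compare the sizes (this assumes the message table from the question; pg_table_size also counts the free space and visibility maps, but the difference from the bare heap is usually dominated by TOAST):

SELECT pg_size_pretty(pg_relation_size('message'))       AS heap_only,
       pg_size_pretty(pg_table_size('message'))          AS heap_plus_toast,
       pg_size_pretty(pg_total_relation_size('message')) AS including_indexes;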
Another possible reason for the slowdown is the complexity of the jsonb data type. jsonb is a serialized tree of JSON sub-objects. If you don't need jsonb-specific features, use the json data type instead: json is just text that is validated on input, so producing output requires no extra work, whereas jsonb output has to be serialized back to text, which is more expensive.

SQLite ANALYZE breaks indexes

I have a table that contains about 500K rows. The table has an index on the 'status' column. So I run the following explain command:
EXPLAIN QUERY PLAN SELECT * FROM my_table WHERE status = 'ACTIVE'
Results in a predictable 'explanation'...
SEARCH TABLE my_table USING INDEX IDX_my_table_status (status=?) (~10 rows)
After many additional rows are added to the table, I call 'ANALYZE'. Afterwards, queries seemed much slower so I re-ran my explain and now see the following:
SCAN TABLE my_table (~6033 rows)
First thing I notice is that BOTH row estimates are WAY off. The biggest concern is the fact that the index seems to be skipped once ANALYZE is run. I tried REINDEX - to no avail. The only way I can get the indexes back is to drop them, then re-create them. Has anyone seen this? Is this a bug? Any ideas what I am doing wrong? I have tried this on multiple databases and I see the same results. This is on my PC, on Mac, and on the iPhone/iPad - all with the same results.
When SQLite fetches rows from a table using an index, it has to read the index pages first, and then read all the table's pages that contain one or more matching records.
If there are many matching records, almost all the table's pages are likely to contain one, so going through the index would require reading more pages.
However, SQLite's query planner does not have information about the record sizes in the index or the table, so it's possible that its estimates are off.
The information collected by ANALYZE is stored in the sqlite_stat1 and maybe sqlite_stat3 tables.
Please show what it contains for your table.
If that information does not reflect the true distribution of your data, you can try running ANALYZE again, or just delete it from the sqlite_stat* tables.
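Something along these lines (using the table name from the question) shows what ANALYZE recorded and lets you refresh or discard it:

SELECT * FROM sqlite_stat1 WHERE tbl = 'my_table';

ANALYZE my_table;  -- recompute the statistics
-- or, to fall back to the planner's defaults
-- (manual changes may only be noticed after the database is reopened):
DELETE FROM sqlite_stat1 WHERE tbl = 'my_table';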
You can force going through an index if you use ORDER BY on the indexed field.
(INDEXED BY is, as its documentation says, not intended for use in tuning the performance of a query.)
If you do not need to select all fields of the table, you can speed up specific queries by creating an index on those queries' fields so that you have a covering index.
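For example, if a query only needs status and one other column (the column name here is hypothetical), a two-column index lets SQLite answer it from the index alone:

CREATE INDEX idx_my_table_status_name ON my_table(status, name);

-- every referenced column is in the index, so this is a covering-index query:
SELECT name FROM my_table WHERE status = 'ACTIVE';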
It's not uncommon for a query execution plan to avoid using an existing index on a low-cardinality column like "status", which probably only has a few distinct values. It's often faster for the lookups to be performed by scanning the db table. (Some DBAs recommend never indexing low-cardinality columns.)
However, based on the wildly varying row counts in the explain plan, I'm guessing that SQLite's 'analyze' performs similarly to MySQL's 'analyze' when using the InnoDB storage engine. MySQL's 'analyze' does a random set of dives into the table data to determine row count, index cardinality, etc. Because of the random dives, the statistics may vary after each 'analyze' is run, and result in differing query execution plans. Low-cardinality columns are even more susceptible to incorrect stats, as, for example, the random dives may indicate that the majority of the rows in your table have an "active" status, making it more efficient to table scan rather than use the index. (I'm no SQLite expert, so someone please chime in if my hunch about the 'analyze' behavior is incorrect.)
You can try testing the use of the index in the query using "indexed by" (see http://www.sqlite.org/lang_indexedby.html), although forcing the use of indexes is usually a last resort. Different RDBMSs have different solutions to the low-cardinality problem, such as partitioning, using bitmap indexes, etc. I would recommend researching SQLite-specific solutions to querying/indexing on low-cardinality columns.
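If you do end up forcing the index as a last resort, the INDEXED BY form (with the index name from the question) looks like this:

SELECT * FROM my_table INDEXED BY IDX_my_table_status
WHERE status = 'ACTIVE';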

Delete Takes a Long Time

I've got a table which has about 5.5 million records. I need to delete some records from it based on date. My query looks like this:
DELETE FROM Table WHERE [Date] between '2011-10-31 04:30:23' and '2011-11-01 04:30:42'
It's about 9000 rows, but this operation takes a very long time. How can I speed it up? Date is of type datetime2, and the table has a clustered int primary key. Update and delete triggers are disabled.
It's very possible that [Date] is being cast to a string on every row, resulting in a scan of the entire table.
You should try casting your parameters to a date instead:
DELETE FROM Table WHERE [Date] between convert(datetime, '2011-10-31 04:30:23') and convert(datetime, '2011-11-01 04:30:42')
Also, make sure there's an index on [Date]
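If no such index exists yet, something like this (the index name is made up) gives the delete a range to seek on:

CREATE NONCLUSTERED INDEX IX_Table_Date ON [Table] ([Date]);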
Firstly make sure you have an index on date.
If there is an index check the execution plan and make sure it is using it. Notice that it doesn't always follow that using an index is the most efficient method of processing a delete because if you are deleting a large proportion of records (rule of thumb is in excess of 10%) the additional overhead of the index look-up can be greater than a full scan.
With a large table it's also well worth making sure that the statistics are up to date (run sp_updatestats) because if the database has an incorrect understanding of the number of rows in the table it will make inappropriate choices in its execution plan. For example if the statistics are incorrect the database may decide to ignore your index even if it exists because it thinks there are far fewer records in the table than there are. Odd distributions of dates might have similar effects.
I'd probably try dropping the index on date and then recreating it. Indexes are B-trees, and to work efficiently they need to be balanced. If your data has accumulated over time, the index may well be lopsided and queries might take a long time to find the appropriate data. Both this and the statistics issue should be handled automatically by your database maintenance job, but that is often overlooked.
Finally, you don't say whether there are many other indexes on the table. If there are, you might be running into issues with the database having to reorganize those indexes as it processes the delete, on top of updating them. It's a bit drastic, but one option is to drop all other indexes on the table before running the delete, then create them again afterwards.
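If you go the drop-and-recreate route, the pattern is simply the following (the index and column names are placeholders for whatever nonclustered indexes the table actually has):

DROP INDEX IX_Table_SomeColumn ON [Table];

DELETE FROM [Table]
WHERE [Date] BETWEEN '2011-10-31 04:30:23' AND '2011-11-01 04:30:42';

CREATE NONCLUSTERED INDEX IX_Table_SomeColumn ON [Table] (SomeColumn);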

alternative to bitmap index in postgresql

I have a table with hundreds of millions of rows, with a schema like this:
create table aa (
    id integer primary key,
    prop0 boolean not null,
    prop1 boolean not null,
    prop2 smallint not null,
    ...
);
The each "property" field (prop0, prop1, ...) has a small number of distinct values. And I usually query to find "id" from the given conditions of properties fields. I think Bitmap index is best for this query. But postgresql seems not support bitmap index.
I tried b-tree index on each field but these indexes are not used according to the query explain.
Is there a good alternative way to do this?
(I'm using PostgreSQL 9)
Your real problem is a bad schema design, not the index. The properties should be placed in a different table, and your current table should link to it using a many-to-many relation.
The BIT datatype might also be of use, just check the manual.
Create a multicolumn index on properties which are always or almost always queried. Or several multicolumn indexes if needed.
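A sketch with the column names from the question (which columns to include, and in what order, depends on which properties your queries actually filter on):

CREATE INDEX aa_props_idx ON aa (prop0, prop1, prop2);

-- usable by queries that constrain a leading subset of the indexed columns:
SELECT id FROM aa WHERE prop0 = true AND prop1 = false AND prop2 = 4;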
The alternative, when you do not almost always query the same properties, is to make a tsvector column with words describing your data, maintained using a trigger. For example,
prop0=true
prop1=false
prop2=4
would be
'propzero nopropone proptwo4'::tsvector
index it using GIN and then use full text search for searching:
where tsv @@ 'propzero & nopropone & proptwo4'::tsquery
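A rough sketch of that setup (the word spellings follow the example above; function, trigger, and index names are made up):

ALTER TABLE aa ADD COLUMN tsv tsvector;

CREATE OR REPLACE FUNCTION aa_tsv_update() RETURNS trigger AS $$
BEGIN
  NEW.tsv := (
      (CASE WHEN NEW.prop0 THEN 'propzero' ELSE 'nopropzero' END)
      || ' ' ||
      (CASE WHEN NEW.prop1 THEN 'propone' ELSE 'nopropone' END)
      || ' proptwo' || NEW.prop2::text
  )::tsvector;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER aa_tsv_trigger
  BEFORE INSERT OR UPDATE ON aa
  FOR EACH ROW EXECUTE PROCEDURE aa_tsv_update();

CREATE INDEX aa_tsv_idx ON aa USING gin (tsv);

-- full text search over the combined property "words":
SELECT id FROM aa WHERE tsv @@ 'propzero & nopropone & proptwo4'::tsquery;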
An index is only used if it actually speeds up the query which is not necessarily always the case. Especially with smallish tables (say thousands of rows) a full table scan ("seq scan" in the Postgres execution plan) might indeed be a lot faster.
How many rows did the table have when you tried the statement?
What did the query look like? Maybe there are other conditions that prevent index usage.
Did you analyze the table to have up-to-date statistics?
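If in doubt, refresh the statistics and look at the plan again (using the table from the question):

ANALYZE aa;
EXPLAIN SELECT id FROM aa WHERE prop0 AND NOT prop1 AND prop2 = 4;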