How to force PostgreSQL to use a particular index when all fields match

Consider a table T with columns a, b, c, d and two indexes: the first on (a, b, c) and the second on (a, b, d). Columns a, b, c and d are all of type integer. Both indexes are almost identical: in production each is about 2 GB in size, they were created at the same time, they have the same usage statistics, and the table has about 60 million rows overall.
I make two queries:
select * from T where a=... and b=... and c=...;
select * from T where a=... and b=... and d=...;
I expect the index on (a, b, c) to be used for the first query and the index on (a, b, d) for the second. However, that is not the case: the first index is used for both queries, in the second case with a "Filter" step (I used EXPLAIN ANALYZE to find this out). For me such behavior is unacceptable, because in some circumstances the number of rows removed by the filter grows very fast, and autovacuum/analyze (which eventually helps the planner pick the right index) works too slowly to prevent unexpected latencies and downtime.
So my question is: how can I stop PostgreSQL from using the wrong index with a filter, and instead make it use the index whose columns exactly match the fields in the query's WHERE clause?
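For reference, a minimal setup matching the description above (the index names are made up), which anyone can use to reproduce the plans:
CREATE TABLE T (a integer, b integer, c integer, d integer);
CREATE INDEX t_abc_idx ON T (a, b, c);
CREATE INDEX t_abd_idx ON T (a, b, d);
-- compare which index each plan uses
EXPLAIN ANALYZE SELECT * FROM T WHERE a = 1 AND b = 2 AND c = 3;
EXPLAIN ANALYZE SELECT * FROM T WHERE a = 1 AND b = 2 AND d = 4;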

Finally I found a solution. It is not perfect and I will not mark it as the best one, but it works and may help someone.
What I actually did was change the indexes: instead of (a, b, c) and (a, b, d) I now have (c, a, b) and (d, a, b).
One problem appeared: I also needed an index on a alone, because some queries rely on it. However, when I added an index solely on a, the problem from the original post came back (the index on a is chosen whenever the planner thinks its cost is lower than that of (c, a, b) or (d, a, b)). So I added a new column to the table that is a copy of column a (it holds the same data), let's call it a1, and created an index on that column. Now, whenever I need to filter on a, I filter on a1 instead. It is weird, but I couldn't find another solution. A sketch of the whole workaround is shown below.
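A minimal sketch of that workaround, reusing the hypothetical index names from the setup above; how a1 is kept in sync with a (for example via a trigger) is not described in the post and is left out here:
DROP INDEX t_abc_idx;
DROP INDEX t_abd_idx;
-- reorder the indexes so the non-shared column comes first
CREATE INDEX t_cab_idx ON T (c, a, b);
CREATE INDEX t_dab_idx ON T (d, a, b);
-- duplicate column a so queries on a alone get their own index
ALTER TABLE T ADD COLUMN a1 integer;
UPDATE T SET a1 = a;
CREATE INDEX t_a1_idx ON T (a1);
-- queries that used to filter only on a now filter on a1 instead
SELECT * FROM T WHERE a1 = 1;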

Related

get postgres to use an index when querying timestamps in a function

I have a system with a large number of tables that contain historical data. Each table has ts_from and ts_to columns of type timestamptz. These represent the time period in which the data for a particular row was valid.
These columns are indexed.
If I want to query all rows that were valid at a particular timestamp, it is trivial to write the ts_from <= @at_timestamp AND ts_to >= @at_timestamp WHERE clause to utilise the index.
However, I wanted to create a function called Temporal.at which would take the @at_timestamp value and the ts_from / ts_to columns and hide the complexity of the comparison from the query that uses it. You might think this is trivial, but I would also like to extend the concept to a function called Temporal.between which would take @from_timestamp and @to_timestamp and select all rows that were valid between those two points. That function would not be trivial, as one would have to check whether rows partially overlap the period rather than always being fully enclosed by it.
The issue is this: I have written these functions but they do not cause the index to be used. The query performance is woefully slow on the history tables, some of which have hundreds of millions of rows.
The questions therefore are:
a) Is there a way to write these functions so that we can be sure the indexes will be used?
b) Am I going about this completely the wrong way and is there a better way to proceed?
This is complicated if you model ts_from and ts_to as two different timestamp columns. Instead, you should use a range type: tstzrange. Then everything will become simple:
for containment in an interval, use @at_timestamp <@ from_to
for interval overlap, use tstzrange(@from_timestamp, @to_timestamp) && from_to
Both queries can be supported by a GiST index on the range column.
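A minimal sketch of that approach, using a hypothetical history table with from_to as the range column:
-- one range column instead of separate ts_from / ts_to columns
CREATE TABLE history (
    id bigint,
    payload text,
    from_to tstzrange NOT NULL
);
CREATE INDEX history_from_to_idx ON history USING gist (from_to);
-- rows valid at a given instant (the planner can use the GiST index)
SELECT * FROM history
WHERE timestamptz '2024-06-01 00:00:00+00' <@ from_to;
-- rows whose validity overlaps a given period
SELECT * FROM history
WHERE from_to && tstzrange('2024-06-01 00:00:00+00', '2024-07-01 00:00:00+00');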

Postgresql: optimal use of multicolumn-index when subset of index is missing from the where clause

I will be having queries on my database with where clauses similar to this:
SELECT * FROM table WHERE a = 'string_value' AND b = 'other_string_value' AND t > <timestamp>
and less often to this:
SELECT * FROM table WHERE a = 'string_value' AND t > <timestamp>
I have created a multicolumn index on a, b and t, in that order. However, I am not sure it will be optimal for my second, less frequent, query.
Will this index do an index scan on b, or skip it and move straight to the t column? (To be honest, I'm not sure how index scans work exactly.) Should I create a second multicolumn index on a and t only, for the second query?
The docs state that
'the index is most efficient when there are constraints on the leading (leftmost) columns'
But the example there doesn't cover my case, where the b equality column is missing from the WHERE clause.
The 2nd query will be much less effective with the btree index on (a,b,t) because the absence of b means t cannot be used efficiently (it can still be used as an in-index filter, but that is not nearly as good as being used as a start/stop point). An index on (a,t) will be able to support the 2nd query much more efficiently.
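Concretely, the suggested extra index would look something like this (the table name is a placeholder, since the question just calls it table):
CREATE INDEX my_table_a_t_idx ON my_table (a, t);
-- the less frequent query can then use start/stop points on both a and t
EXPLAIN SELECT * FROM my_table WHERE a = 'string_value' AND t > '2024-01-01';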
But that doesn't mean you have to create that index as well. Indexes take space and must be maintained, so are far from free. It might be better to just live with less-than-optimal plans for the 2nd query, since that query is used "less often". On the other hand, you did bother to post about it, so maybe "less often" is still pretty often. So you might be better off just to build the extra index and spend your time worrying about something else.
A btree index can be thought of like a phonebook, which is sorted on last name, then first name, then middle name. Your first query is like searching for "people named Mary Smith with a middle name less than Cathy". You can use binary search to efficiently find the first "Mary Smith", then you scan through those until the middle name is > 'Cathy', and you are done. Compare that to "people surnamed Smith with a middle name less than Cathy". Now you have to scan all the Smiths. You can't stop at the first middle name > Cathy, because any change in first name resets the order of the middle names.
Given that b only has 10 distinct values, you could conceivably use the (a,b,t) index in a skip scan quite efficiently. But PostgreSQL doesn't yet implement skip scans natively. You can emulate them, but that is fragile, ugly, a lot of work, and easy to screw up. Nothing you said here makes me think it would be worthwhile to do.
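For completeness, a sketch of the kind of emulation meant here: a recursive CTE that walks the distinct values of b for a given a, so each step can use the (a,b,t) index with start/stop points. The table name, the literals, and the assumption that b is a text column are all hypothetical:
WITH RECURSIVE b_values AS (
    SELECT min(b) AS b FROM my_table WHERE a = 'string_value'
    UNION ALL
    SELECT (SELECT min(b) FROM my_table
            WHERE a = 'string_value' AND b > bv.b)
    FROM b_values bv
    WHERE bv.b IS NOT NULL
)
SELECT mt.*
FROM b_values bv
JOIN my_table mt
  ON mt.a = 'string_value'
 AND mt.b = bv.b          -- the join drops the terminating NULL row
 AND mt.t > '2024-01-01';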

PostgreSQL - Compare ts_vector fields

I have two tables with data coming from two different sources. One of the fields in each table contains the title of a movie, but for reasons out of my control the titles are not always exactly the same.
So I use tsvector values to get rid of all the minor differences (stop words, plurals and so on).
See an example here: http://sqlfiddle.com/#!17/5ccbc/3
My problem is how to compare the two tsvector values without taking the numeric position values into account, just the text content. If I compare the two fields directly, I only get exact matches between values, including the position of each word. The only solution I have found is the strip() function, which removes positions and weights from a tsvector, leaving only the text content.
I was wondering if there is a faster way to compare tsvectors.
You could create an index on the stripped vector:
create index on tbl1 (strip(ts_title));
create index on tbl2 (strip(ts_title));
But given that your query has to fetch every row of each table, it is unlikely this would serve much of a point. Doing a merge join between the precomputed stripped vectors could be faster, but probably not once you include the overhead of building and maintaining the indexes. If the real WHERE clause is more restrictive (selecting only a few rows from one or the other of the tables) then please share the real query.
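For illustration, the comparison described in the question would look something like this; the table and column names follow the index statements above and are assumptions:
-- match titles by text content only, ignoring positions and weights
SELECT t1.ts_title, t2.ts_title
FROM tbl1 t1
JOIN tbl2 t2 ON strip(t1.ts_title) = strip(t2.ts_title);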

Creating the optimum index for my database

I have a table in postgresql with the following information:
rawData (fileID integer references otherTable, lineNum integer, data1 double, ...)
When I am searching this table, I do so with the following query:
SELECT lineNum, data1, ...other data FROM rawData WHERE
fileID = ? AND data1 < ? ORDER BY lineNum;
In general, the data in this table consists of a number of entries for each fileID, and each fileID has lineNum values from 0 to x, with lineNum never repeating within a fileID (but it does repeat across different fileIDs). data1 is effectively a random number that may or may not overlap.
In order to speed up the reading of this data, I am trying to create an index on it, but am having trouble figuring out the best way to index it. Currently I am looking at one of the following two index methods, and am wondering which would be better for my search, or if there is another option that I haven't thought of that would be better than either of them.
index idea 1:
CREATE INDEX searchIndex ON rawData (fileID, data1, lineNum);
index idea 2:
CREATE INDEX searchIndex ON rawData (fileID, lineNum, data1);
Note that at this time, this and a search not constrained by data1 are the only searches that I run on this table, so I'm not too concerned about this index slowing down other searches.
Lastly, would I have to change my search query to use the index, or would it automatically use that index when I search the table?
You should look at using this instead:
CREATE INDEX searchIndex ON rawData (fileID, lineNum);
A few things:
In particular, as the docs note, "indexes with more than three columns are unlikely to be helpful unless the usage of the table is extremely stylized".
Since your second search query filters without the data1 column, keeping lineNum as the second column should be sufficient (since you mention data1 is quasi-random), and in the rare case that there are repeats, the table fetches ensure correctness. What this buys you is an index roughly a third smaller, which is a big win (think of an index small enough to stay in memory, index-only scans, etc.).
Either index can be used. Which is faster will depend on many things: how many rows are in the table, how many lineNum values there are per fileID, how selective the data1 < ? clause is, what your hardware is, what your config settings are, which version of PostgreSQL you are using, what physical order the table rows lie in, and so on.
The only way to know for sure is to try it with your own data on your own system and see.
I'd just build an index on (fileID, lineNum, data1), or even just (fileID, lineNum), because that seems more natural, and then forget about it. Most likely it will be fast enough. Once there is a demonstrable performance problem, then you will have the test case at hand that is needed to come to a real conclusion.
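A minimal way to run that comparison yourself; the index names and literal values are placeholders:
CREATE INDEX rawdata_f_d_l_idx ON rawData (fileID, data1, lineNum);
CREATE INDEX rawdata_f_l_d_idx ON rawData (fileID, lineNum, data1);
-- see which index each plan picks and how long it runs
EXPLAIN (ANALYZE, BUFFERS)
SELECT lineNum, data1 FROM rawData
WHERE fileID = 1 AND data1 < 0.5
ORDER BY lineNum;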

100 columns vs Array of length 100

I have a table with 100+ values corresponding to each row, so I'm exploring different ways to store them.
Without any indexes, would I lose anything if I store these 100 values in an integer[] column in postgresql? As compared to storing them in separate columns.
Plus, since we can add indexes on array elements,
CREATE INDEX test_index on test ((foo[1]));
Would there be a performance difference between queries using such an index and queries using a regular index on a column?
As far as I've read, this performance difference comes into the picture with arrays of variable-length elements, but I'm not sure about fixed-length ones.
Don't go for the lazy way.
If you need to store 100 or more values as an array, that is OK, provided an array makes sense for your application and your data.
If you need to query for a specific element of the array, then this design is not good, regardless of performance, and you should use columns. That will also help the moment you have to delete a "column" in the middle or redesign it.
Anyway, as Frank wrote in the comments, if the values are all of the same type (and have the same meaning), consider modelling them as rows in another table, as sketched below.
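A sketch of that normalised design; the table and column names are hypothetical, and it assumes the parent table is test with primary key id:
-- one row per value instead of 100 columns or a 100-element array
CREATE TABLE test_value (
    test_id  integer  NOT NULL REFERENCES test (id),
    position smallint NOT NULL,
    value    integer  NOT NULL,
    PRIMARY KEY (test_id, position)
);
-- fetching a specific "column" is then an ordinary indexed lookup
SELECT value FROM test_value WHERE test_id = 1 AND position = 1;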