Any method in PostgreSQL to create multiple index entries on the same datum

I'm indexing some spatial objects in PostgreSQL, and at the moment I'm using PostGIS to build an R-tree. However, some objects are too large to be indexed efficiently, so I want to first split them into smaller parts and index those parts instead.
However, I have read the GiST documentation and found no way to split one datum at the index level unless I first split it in the underlying relation.
Is there any way to solve this problem with GiST? Is there another framework that solves it? Or why is it impossible in PostgreSQL?

It depends on what exactly you want to do with the index.
One option would be not to index the objects themselves, but a simplified version:
CREATE INDEX ON atable USING gist (st_simplify(geom, 1, TRUE));
And then use that as a pre-filter:
SELECT ... FROM atable
WHERE st_intersects(st_simplify(geom, 1, TRUE), GEOMETRY '...')
AND st_intersects(geom, GEOMETRY '...');
That query can use the index. The catch is that the simplified geometry might not intersect where the original does, so you could get false negatives in corner cases.
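If you do decide to split the objects in the relation itself, as the question contemplates, one way is a side table of pieces. A minimal sketch, assuming PostGIS 2.2+ for ST_Subdivide; the table and column names (atable_pieces, id) are illustrative:
-- Store subdivided pieces in a side table and index those.
-- ST_Subdivide splits a geometry into parts with at most the given
-- number of vertices each (256 here).
CREATE TABLE atable_pieces AS
SELECT id, ST_Subdivide(geom, 256) AS geom FROM atable;
CREATE INDEX ON atable_pieces USING gist (geom);
-- Query the small pieces, then join back to the original rows.
SELECT DISTINCT a.*
FROM atable a
JOIN atable_pieces p ON p.id = a.id
WHERE st_intersects(p.geom, GEOMETRY '...');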

Related

PostgreSQL - Compare ts_vector fields

I have two tables holding data that comes from two different sources. One field in each table contains the title of a movie, but for reasons outside my control, the titles are not always exactly the same.
So I use to_tsvector to get rid of the minor differences (stop words, plurals, and so on).
See an example here: http://sqlfiddle.com/#!17/5ccbc/3
My problem is how to compare the two tsvectors based only on the text content, ignoring the numeric position values. If I compare the two fields directly, I only get exact matches between values, including the position of each word. The only solution I have found is the strip() function, which removes positions and weights from a tsvector, leaving only the text content.
I was wondering if there is a faster way to compare tsvectors.
You could create an index on the stripped vector:
create index on tbl1 (strip(ts_title));
create index on tbl2 (strip(ts_title));
But given that your query has to fetch every row of each table, it is unlikely this would serve much of a point. Doing a merge join between the precomputed stripped vectors could be faster, but probably not once you include the overhead of building and maintaining the indexes. If the real WHERE clause is more restrictive (selecting only a few rows from one or the other of the tables) then please share the real query.
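For illustration, the join on the precomputed stripped vectors could look like this (a sketch only, reusing the tbl1/tbl2/ts_title names from the indexes above):
-- Equality join on the stripped vectors; the expression indexes
-- above make a merge or index-based join plan possible.
SELECT t1.*, t2.*
FROM tbl1 t1
JOIN tbl2 t2 ON strip(t1.ts_title) = strip(t2.ts_title);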

How to force PostgreSQL to use a particular index when all fields match

Consider a table T with fields a, b, c, d and two indexes: the first on fields (a, b, c) and the second on (a, b, d). The types of a, b, c, and d are integer. Both indexes are almost the same (in production they are both about 2 GB in size, have the same creation time, and show the same usage statistics; the table has about 60 million rows overall).
I make two queries:
select * from T where a=... and b=... and c=...;
select * from T where a=... and b=... and d=...;
I expect the index on (a, b, c) to be used for the first query and the index on (a, b, d) for the second. However, that is not the case: the first index is used for both queries, in the second case with a "Filter" step (I used EXPLAIN ANALYZE to learn this). For me such behavior is unacceptable, because in some circumstances the number of rows removed by the filter grows very fast, and autovacuum/analyze (which normally helps the planner pick the right index) works too slowly to prevent unexpected latencies and downtime.
So my question is: how can I force PostgreSQL not to use the wrong index with filtering, but rather use the right index when all the fields in the query's WHERE clause match that index?
I finally found a solution. It is not perfect, and I will not mark it as the best one, but it works and could help someone.
What I actually did was change the indexes: instead of (a, b, c) and (a, b, d), I now have (c, a, b) and (d, a, b).
One problem appeared: I still needed an index on 'a', because some queries rely on it. However, when I added an index solely on 'a', the problem from the original post appeared again (the index on 'a' is used whenever the planner thinks its cost is lower than that of (c, a, b) or (d, a, b)). So I added a new field to the table that is a copy of field 'a' (it holds the same data); let's call it 'a1', and I added the index on that field. Now when I need to filter on the contents of 'a', I filter on 'a1' instead. It's weird, but I couldn't find another solution.
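A minimal sketch of that workaround (index names are illustrative, and keeping a1 in sync with a via a trigger or application logic is omitted here):
-- Reordered composite indexes so that c and d lead.
CREATE INDEX t_cab_idx ON T (c, a, b);
CREATE INDEX t_dab_idx ON T (d, a, b);
-- Copy column a into a1 and index only the copy, so the planner
-- can never prefer a bare index on a over the composite indexes.
ALTER TABLE T ADD COLUMN a1 integer;
UPDATE T SET a1 = a;
CREATE INDEX t_a1_idx ON T (a1);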

Creating the optimum index for my database

I have a table in postgresql with the following information:
rawData (fileID integer references otherTable, lineNum integer, data1 double precision, ...)
When I am searching this table, I do so with the following query:
SELECT lineNum, data1, ...other data FROM rawData WHERE
fileID = ? AND data1 < ? ORDER BY lineNum;
In general, this table holds a number of entries for each fileID; each fileID has lineNum values from 0 to x, and lineNum never repeats within a fileID (though it does repeat across different fileIDs). data1 is effectively a random number that may or may not overlap.
In order to speed up the reading of this data, I am trying to create an index on it, but am having trouble figuring out the best way to index it. Currently I am looking at one of the following two index methods, and am wondering which would be better for my search, or if there is another option that I haven't thought of that would be better than either of them.
index idea 1:
CREATE INDEX searchIndex ON rawData (fileID, data1, lineNum);
index idea 2:
CREATE INDEX searchIndex ON rawData (fileID, lineNum, data1);
Note that, at this time, this query and a search not constrained by data1 are the only searches I run on this table, so I'm not too concerned about the index slowing down other searches.
Lastly, would I have to change my search query to use the index, or would it automatically use that index when I search the table?
You should look at using this instead:
CREATE INDEX searchIndex ON rawData (fileID, lineNum);
A few things:
In particular, per the docs, "indexes with more than three columns are unlikely to be helpful unless the usage of the table is extremely stylized."
Since your second kind of search filters without the data1 column, keeping lineNum as the second column should be sufficient (you mention data1 is quasi-random), and in the rare occurrence that there are repeats, table fetches will still ensure correctness. What this means is that the index would be about one-third smaller, which is a big win (think of an index small enough to stay in memory, index-only scans, etc.).
Either index can be used. Which is faster will depend on many things: how many rows are in the table, how many lineNum values there are per fileID, how selective the data1 < ? clause is, what your hardware is, what your config settings are, which version of PostgreSQL you are using, what physical order the table rows lie in, etc.
The only way to know for sure is to try it with your own data on your own system and see.
I'd just build an index on (fileID, lineNum, data1), or even just (fileID, lineNum), because that seems more natural, and then forget about it. Most likely it will be fast enough. Once there is a demonstrable performance problem, then you will have the test case at hand that is needed to come to a real conclusion.
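As for the last question: no query change is needed; the planner uses an index automatically whenever it estimates that to be cheaper. You can verify which plan you get with EXPLAIN, for example (placeholder literals stand in for the ? parameters):
EXPLAIN (ANALYZE, BUFFERS)
SELECT lineNum, data1 FROM rawData
WHERE fileID = 42 AND data1 < 0.5
ORDER BY lineNum;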

100 columns vs Array of length 100

I have a table with 100+ values corresponding to each row, so I'm exploring different ways to store them.
Without any indexes, would I lose anything if I store these 100 values in an integer[] column in postgresql? As compared to storing them in separate columns.
Plus, since we can add indexes on array elements,
CREATE INDEX test_index on test ((foo[1]));
Would there be a performance difference for queries using such an index as compared to a regular index on a column?
As far as I've read, this performance difference would come into the picture with arrays of variable-length elements, but I'm not sure about fixed-length ones.
Don't go for the lazy way.
If you need to store 100 or more values, an array is OK if it makes sense as an array for your application and your data.
If you need to query for specific elements of the array, then this design is not good, regardless of performance, and you should use columns. That will also help the moment you must delete a "column" in the middle or redesign the structure.
Anyway, as Frank wrote in the comments, if the values are all of the same type (and the same meaning), consider modeling them in a separate table.
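A minimal sketch of that separate-table design (all names are illustrative; test is the table from the question's index example):
-- One row per (parent row, slot) instead of 100 columns or an array.
CREATE TABLE test_value (
    test_id  integer REFERENCES test (id),
    position integer NOT NULL,  -- which of the ~100 slots
    value    integer NOT NULL,
    PRIMARY KEY (test_id, position)
);
-- A plain two-column index then replaces per-element expression
-- indexes like the ((foo[1])) example above.
CREATE INDEX ON test_value (position, value);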

Postgres full text search with multiple columns, why concat in index and not at runtime?

I've come across full text search in postgres in the last few days, and I am a little confused about indexing when searching across multiple columns.
The Postgres docs talk about creating a tsvector index on concatenated columns, like so:
CREATE INDEX pgweb_idx ON pgweb
USING gin(to_tsvector('english', title || ' ' || body));
which I can search like so:
... WHERE
(to_tsvector('english', title||' '||body) @@ to_tsquery('english', 'foo'))
However, if I wanted to sometimes search just the title, sometimes just the body, and sometimes both, I would need 3 separate indexes. And if I added in a third column, that could potentially be 6 indexes, and so on.
An alternative, which I haven't seen in the docs, is to index the two columns separately and then use a normal WHERE ... AND query:
... WHERE
(to_tsvector('english', title) @@ to_tsquery('english','foo'))
AND
(to_tsvector('english', body) @@ to_tsquery('english','foo'))
Benchmarking the two on ~1 million rows shows basically no difference in performance.
So my question is:
Why would I want to concatenate indexes like this, rather than just indexing columns individually? What are the advantages/disadvantages of both?
My best guess is that if I knew in advance I would only ever want to search both columns together (never one at a time), I would only need the one concatenated index, which would use less memory.
Edit
moved to: https://dba.stackexchange.com/questions/15412/postgres-full-text-search-with-multiple-columns-why-concat-in-index-and-not-at
1. Using one index is easier / faster for the database.
2. It will be quite difficult to properly rank results when using two indexes.
3. You can assign relative weights to columns when creating a single index, so that a match in title is worth more than a match in body.
4. You are searching for a single word here; what happens if you search for several and they appear separately in different columns?
To answer the question of how to implement #3, see https://www.postgresql.org/docs/9.1/textsearch-controls.html:
a weight is one of the letters A, B, C, or D
UPDATE tt SET ti =
setweight(to_tsvector(coalesce(title,'')), 'A') ||
setweight(to_tsvector(coalesce(keyword,'')), 'B') ||
setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
setweight(to_tsvector(coalesce(body,'')), 'D');
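For completeness, a sketch of how the weighted vector would then be searched and ranked, reusing the tt/ti names from the docs example above:
-- ts_rank takes the A/B/C/D weights set above into account,
-- so a match in title outranks the same match in body.
SELECT title, ts_rank(ti, q) AS rank
FROM tt, to_tsquery('english', 'foo & bar') q
WHERE ti @@ q
ORDER BY rank DESC;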