get postgres to use an index when querying timestamps in a function - postgresql

I have a system with a large number of tables that contain historical data. Each table has a ts_from and ts_to column which are of type timestamptz. These represent the time period in which the data for a particular row was valid.
These columns are indexed.
If I want to query all rows that were valid at a particular timestamp, it is trivial to write the ts_from <= #at_timestamp AND ts_to >= #at_timestamp WHERE clause to utilitise the index.
However, I wanted to create a function called Temporal.at which would take the #at_timestamp column and the ts_from / ts_to columns and do this by hiding the complexity of the comparison from the query that uses it. You might think this is trivial, but I would also like to extend the concept to create a function called Temporal.between which would take a #from_timestamp and #to_timestamp and select all rows that were valid between those two periods. That function would not be trivial, as one would have to check where rows partially overlap the period rather than always being fully enclosed by it.
The issue is this: I have written these functions but they do not cause the index to be used. The query performance is woefully slow on the history tables, some of which have hundreds of millions of rows.
The questions therefore are:
a) Is there a way to write these functions so that we can be sure the indexes will be used?
b) Am I going about this completely the wrong way and is there a better way to proceed?

This is complicated if you model ts_from and ts_to as two different timestamp columns. Instead, you should use a range type: tstzrange. Then everything will become simple:
for containment in an interval, use #at_timestamp <# from_to
for interval overlap, use tstzinterval(#from_timestamp, #to_timestamp) && from_to
Both queries can be supported by a GiST index on the range column.

Related

Postgresql best index for datetime ranges

I have a Postgre table “tasks” with the fields “start”:timestamptz, “finish”:timestamptz, “type”:int (and a lot of others). It contains about 200m records. Start, finish and type fields have a separate b-tree indexes.
I’d like to build a report “Tasks for a period” and need to get all tasks which lay (fully or partially) inside the reporting period. Report could be built for all task types or for the specific one.
So I wrote the SQL:
SELECT * FROM tasks
WHERE start<={report_to}
AND finish>={report_from}
AND ({report_tasktype} IS NULL OR type={report_tasktype})
and it runs for ages even on short reporting periods.
Please advice if there a way to improve performance by altering the query or by creating new indexes on the table? For some reasons I can’t change the structure of the “tasks” table
You would want a GiST index on the range. Since you already have it stored as two end points rather than as a range, you could do a functional index to convert them on the fly.
ON task USING GIST (tstzrange(start,finish))
And then compare the ranges for overlap with &&
It may also improve things to add "type" as a second column to the index, which would require the btree_gist extension.

PostgreSQL - Compare ts_vector fields

I have two tables in which I have data coming from two different sources. One of the field of each table contains the title of a movie, but for some reason out of my control, the titles are not always exactly the same.
So I use the ts_vector to get rid of all the minor differences (stop words, plurals and so on).
See an example here: http://sqlfiddle.com/#!17/5ccbc/3
My problem is how to compare the two ts_vector without taking into account the numberic values, but just the text content. If I compare directly the two fields, I only get the exact match between values, including position of each word. The only solution I have found is using the strip() function, that remove positions and weights from tsvector, leaving only the text content.
I was wondering if there is a fastest way to compare ts_vectors.
You could create in index on the stripped vector:
create index on tbl1 (strip(ts_title));
create index on tbl2 (strip(ts_title));
But given that your query has to fetch every row of each table, it is unlikely this would serve much of a point. Doing a merge join between the precomputed stripped vectors could be faster, but probably not once you include the overhead of building and maintaining the indexes. If the real WHERE clause is more restrictive (selecting only a few rows from one or the other of the tables) then please share the real query.

kdb+/q optimize union function

To give you a bit of background. I have a process which does this large complex calculation which takes a while to complete. It runs on a timer. After some investigation I realise that what is causing the slowness isn't the actual calculation but the internal q function, union.
I am trying to union two simple tables, table A and table B. A is approximately 5m rows and B is 500. Both tables have only two columns. First column is a symbol. Table A is actually a compound primary key of a table. (Also, how do you copy directly from the console?)
n:5000000
big:([]n?`4;n?100)
small:([]500?`4;500?100)
\ts big union small
I tried keying both columns and upserting, join and then distinct, "big, small where not small in big" but nothing seems to work :(
Any help will be appreciated!
If you want to upsert the big table it has to be keyed and upsert operator should be used. For example
n:5000000
//big ids are unique numbers from 0 to 499999
//table is keyed with 1! operator
big:1!([]id:(neg n)?n;val:n?100)
//big ids are unique numbers. 250 from 0-4999999 and 250 from 500000-1000000 intervals
small:([]id:(-250?n),(n+-250?n);val:500?100)
If big is global variable it is efficient to upsert it as
`big upsert small
if big is local
big: big upsert small
As the result big will have 500250 elements, because there are 250 common keys (id column) in big and small tables
this may not be relevant, but just a quick thought. If your big table has a column which has type `sym and if this column does not really show up that much throughout your program, why not cast it to string or other value? if you are doing this update process every single day then as the data gets packed in your partitioned hdb, whenever the new data is added, kdb+ process has to reassign/rewrite its sym file and i believe this is the part that actually takes a lot of time, not the union calculation itself..
if above is true, i'd suggest either rewriting your schema for the table which minimises # of rehashing(not sure if this is the right term though!) on your symfile. or, as the above person mentioned, try to assign attribute to your table.. this may reduce the time too.

100 columns vs Array of length 100

I have a table with 100+ values corresponding to each row, so I'm exploring different ways to store them.
Without any indexes, would I lose anything if I store these 100 values in an integer[] column in postgresql? As compared to storing them in separate columns.
Plus, since we can add indexes to array elemnets,
CREATE INDEX test_index on test ((foo[1]));
Would there be a performance difference queries using such an index as compared to regular index on a column?
As far as I've read, this performance difference would come into picture in arrays with variable length elements; but I'm not sure about fixed length ones.
Don't go for the lazy way.
If you need to store 100 and more values as array, it is ok, if it has sense has array for your application, your data.
If you need to query for a specific element of the array, then this design is not good, regardless of performances, and you must use columns. This will help you in the moment you must delete a "column" in the middle or redesign it.
Anyway, as wrote by Frank in comments, if values are all same type, consider to model them to another table (if also the meaning is the same).

Stored Procedure in postgresql, multiple queries w/ aggregates

I'm trying to write a store procedure that can take some input parameters (obviously), run multiple queries against those, taking the output from those and doing calculations, and from those calculations and the original queries, outputting a formatted text string like:
Number of Rows for max(Z) matching condition x and y of total rows matching x (x&y/x*100).
To explain the max(Z) bit, this will be the username field, it won't matter which actual entry is picked, because the where clause will filter the results by user id, is there a saner way to do this?
For starters break the code up into multiple procedures. Don't create one procedure that does all of these things.