PostgreSQL: create a function index on dynamic values

I have a query like this:
select r.timestamp, r.value
from result_table r
where timestamp > ( NOW() - INTERVAL '120 hour' )
and r.id%10=1
where id is the auto-incrementing primary key.
The values 120 and 10 can be any other numbers (chosen by the user depending on their needs). Basically, the user wants data for some time interval with some decimation.
Unsurprisingly, it runs too slowly on a large amount of data. What index(es) should I create here?

PostgreSQL supports indexes on expressions (function indexes).
The condition
timestamp > ( NOW() - INTERVAL '120 hour' )
and r.id % 10 = 1
can use an index on (timestamp, (id % 10)) for better performance.
Query
CREATE INDEX
timestamp__idmod10
ON
result_table
(timestamp, (id % 10))
see demos
with index http://sqlfiddle.com/#!17/8e63b/6
without index http://sqlfiddle.com/#!17/9be99/3
Edited because of this comment:
"Thanks, Raymond. However, (id % 10) is not that good, since instead of 10 it can be any other number: 9, 11, 100, 1, etc."
Another approach: use generate_series() in a derived table to generate the list of ids matching id % number = 1,
and use that result set in an IN clause.
P.S. This statement assumes an id column of type SERIAL and a table with at most 1 million records. Also keep in mind that generate_series() itself takes some time.
SQL statement
SELECT
numbers.number FROM (
SELECT
generate_series(1, 1000000) as number
) AS numbers
WHERE
numbers.number % 10 = 1 -- replace 10 with the user-chosen divisor
Then you can use the index
CREATE INDEX timestamp_id ON result_table(timestamp, id);
And the query
SELECT
*
FROM
result_table
WHERE
timestamp > ( NOW() - INTERVAL '120 hour' )
AND
id IN (
SELECT
numbers.number FROM (
SELECT
generate_series(1, 1000000) as number
) AS numbers
WHERE
numbers.number % 10 = 1
)
see demo http://sqlfiddle.com/#!17/5013c0/6 with example data.
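As a possible shortcut, generate_series() accepts a step as a third argument, so the matching ids can be generated directly instead of generating every number and filtering with %. A minimal sketch, assuming ids start at 1, a divisor of at least 2, and the same assumed upper bound of 1,000,000 used above:

```sql
-- Sketch only: 10 is the user-chosen divisor and 1000000 an assumed upper bound on id.
-- generate_series(1, 1000000, 10) yields 1, 11, 21, ... i.e. every id with id % 10 = 1.
SELECT r.*
FROM result_table r
WHERE r."timestamp" > now() - interval '120 hour'
  AND r.id IN (SELECT generate_series(1, 1000000, 10));
```

Whether this beats a plain index on timestamp plus a filter depends on the data; comparing the plans with EXPLAIN ANALYZE is the way to find out.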

Related

How to get timestamp associated with percentile(x) value using timescale db time_bucket

I need to find the 50th percentile (P50) value and its timestamp using TimescaleDB's time_bucket. Finding P50 is easy, but I don't know how to get the timestamp.
Select time_bucket('120 sec',timestamp_utc) as interval_size,
first(timestamp_utc,int_val) as minTime,
min(int_val) as minVal,
last(timestamp_utc,int_val) as maxTime,
max(int_val) as maxVal,
-- timestamp of percentile value below.
percentile_disc(0.5) within group (order by int_val) as medianVal
from timeseries.raw
where timestamp_utc > NOW() - INTERVAL '10 min'
AND tag_id = 59560544877390423
group by interval_size
order by interval_size desc
I think what you're looking for can be done by selecting, in a LATERAL subquery, the rows where int_val equals the median value. percentile_disc guarantees that at least one row has exactly that value; there may be more than one, and you can handle ties in different ways depending on what you want. Building on a previous answer and making it work a bit better, I think it would look something like this:
WITH p50 AS (
Select time_bucket('120 sec',timestamp_utc) as interval_size,
first(timestamp_utc,int_val) as minTime,
min(int_val) as minVal,
last(timestamp_utc,int_val) as maxTime,
max(int_val) as maxVal,
-- timestamp of percentile value below.
percentile_disc(0.5) within group (order by int_val) as medianVal
from timeseries.raw
where timestamp_utc > NOW() - INTERVAL '10 min'
AND tag_id = 59560544877390423
group by interval_size
order by interval_size desc
) SELECT p50.*, rmed.*
FROM p50, LATERAL (SELECT * FROM timeseries.raw r
-- copy over the same where clause from above so we're dealing with the same subset of data
WHERE timestamp_utc > NOW() - INTERVAL '10 min'
AND tag_id = 59560544877390423
-- add a where clause on the median value
AND r.int_val = p50.medianVal
-- now add a where clause to account for the time bucket
AND r.timestamp_utc >= p50.interval_size
AND r.timestamp_utc < p50.interval_size + '120 sec'::interval
-- Can add an order by something desc limit 1 if you want to avoid ties
) rmed;
Note that this performs a second scan of the table. It should be reasonably efficient, especially if you have an index on that column, but I don't know of a good way to do it without the second scan.
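One way to keep that second scan cheap might be a composite index covering the columns the LATERAL subquery filters on. This is only a sketch: the index name is made up, and whether the planner actually uses the index depends on your data and TimescaleDB setup.

```sql
-- Hypothetical index name; the columns are the ones filtered on in the query above.
CREATE INDEX IF NOT EXISTS raw_tag_time_val_idx
    ON timeseries.raw (tag_id, timestamp_utc, int_val);
```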

How to rewrite SQL joins into window functions?

Database is HP Vertica 7 or PostgreSQL 9.
create table test (
id int,
card_id int,
tran_dt date,
amount int
);
insert into test values (1, 1, '2017-07-06', 10);
insert into test values (2, 1, '2017-06-01', 20);
insert into test values (3, 1, '2017-05-01', 30);
insert into test values (4, 1, '2017-04-01', 40);
insert into test values (5, 2, '2017-07-04', 10);
For each payment card used in the last day, what is the maximum amount charged on that card in the last 90 days?
select t.card_id, max(t2.amount) max
from test t
join test t2 on t2.card_id=t.card_id and t2.tran_dt>='2017-04-06'
where t.tran_dt>='2017-07-06'
group by t.card_id
order by t.card_id;
Results are correct
card_id max
------- ---
1 30
I want to rewrite the query into sql window functions.
select card_id, max(amount) over(partition by card_id order by tran_dt range between '60 days' preceding and current row) max
from test
where card_id in (select card_id from test where tran_dt>='2017-07-06')
order by card_id;
But result set does not match, how can this be done?
Test data here:
http://sqlfiddle.com/#!17/db317/1
I can't try PostgreSQL, but in Vertica, you can apply the ANSI standard OLAP window function.
But you'll need to nest two queries: The window function only returns sensible results if it has all rows that need to be evaluated in the result set.
But you only want the row from '2017-07-06' to be displayed.
So you'll have to filter for that date in an outer query:
WITH olap_output AS (
SELECT
card_id
, tran_dt
, MAX(amount) OVER (
PARTITION BY card_id
ORDER BY tran_dt
RANGE BETWEEN '90 DAYS' PRECEDING AND CURRENT ROW
) AS the_max
FROM test
)
SELECT
card_id
, the_max
FROM olap_output
WHERE tran_dt='2017-07-06'
;
card_id|the_max
1| 30
As far as I know, PostgreSQL window functions (before version 11) don't support a bounded RANGE offset, so range between '90 days' preceding won't work there. They do support bounded ROWS offsets such as rows between 90 preceding, but then you would need to assemble a time-series query similar to the following, so that the window function operates on one row per day:
SELECT c.card_id, t.amount, g.d as d_series
FROM generate_series(
'2017-04-06'::timestamp, '2017-07-06'::timestamp, '1 day'::interval
) g(d)
CROSS JOIN ( SELECT distinct card_id from test ) c
LEFT JOIN test t ON t.card_id = c.card_id and t.tran_dt = g.d
ORDER BY c.card_id, d_series
For what you need (based on your question description), I would stick to using group by.
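For readers on PostgreSQL 11 or later, RANGE frames do accept an offset such as an interval, so the nested window query shown for Vertica can be written almost unchanged. A sketch against the test table from the question (the alias w is arbitrary):

```sql
-- Requires PostgreSQL 11+ (RANGE with an offset is not available in 9.x).
SELECT card_id, the_max
FROM (
    SELECT card_id,
           tran_dt,
           max(amount) OVER (
               PARTITION BY card_id
               ORDER BY tran_dt
               RANGE BETWEEN interval '90 days' PRECEDING AND CURRENT ROW
           ) AS the_max
    FROM test            -- the window must see all rows, so filter afterwards
) w
WHERE tran_dt >= date '2017-07-06'
ORDER BY card_id;
```

With the sample data this should return card_id 1 with a maximum of 30: the 90-day window ending on 2017-07-06 starts on 2017-04-07, so the 40 charged on 2017-04-01 falls outside it.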

Looping SQL query - PostgreSQL

I'm trying to get a query to loop through a set of pre-defined integers:
I've made the query very simple for this question. This is pseudocode, obviously!
my_id = 0
WHILE my_id < 10
SELECT * from table where id = :my_id
my_id += 1
END
I know that for this query I could just do something like where id < 10. But the actual query I'm performing is about 60 lines long, with quite a few window clauses all referring to the variable in question.
It works and gets me the results I want when I have the variable set to a single value. I just need to be able to re-run the query 10 times with different values, hopefully ending up with one single set of results.
So far I have this:
CREATE OR REPLACE FUNCTION stay_prices ( a_product_id int ) RETURNS TABLE (
pid int,
pp_price int
) AS $$
DECLARE
nights int;
nights_arr INT[] := ARRAY[1,2,3,4];
j int;
BEGIN
j := 1;
FOREACH nights IN ARRAY nights_arr LOOP
-- query here..
END LOOP;
RETURN;
END;
$$ LANGUAGE plpgsql;
But I'm getting this back:
ERROR: query has no destination for result data
HINT: If you want to discard the results of a SELECT, use PERFORM instead.
So do I need to get my query to SELECT ... INTO the returning table somehow? Or is there something else I can do?
EDIT: this is an example of the actual query I'm running:
\x auto
\set nights 7
WITH x AS (
SELECT
product_id, night,
LAG(night, (:nights - 1)) OVER (
PARTITION BY product_id
ORDER BY night
) AS night_start,
SUM(price_pp_gbp) OVER (
PARTITION BY product_id
ORDER BY night
ROWS BETWEEN (:nights - 1) PRECEDING
AND CURRENT ROW
) AS pp_price,
MIN(spaces_available) OVER (
PARTITION BY product_id
ORDER BY night
ROWS BETWEEN (:nights - 1) PRECEDING
AND CURRENT ROW
) AS min_spaces_available,
MIN(period_date_from) OVER (
PARTITION BY product_id
ORDER BY night
ROWS BETWEEN (:nights - 1) PRECEDING
AND CURRENT ROW
) AS min_period_date_from,
MAX(period_date_to) OVER (
PARTITION BY product_id
ORDER BY night
ROWS BETWEEN (:nights - 1) PRECEDING
AND CURRENT ROW
) AS max_period_date_to
FROM products_nightlypriceperiod pnpp
WHERE
spaces_available >= 1
AND min_group_size <= 1
AND night >= '2016-01-01'::date
AND night <= '2017-01-01'::date
)
SELECT
product_id as pid,
CASE WHEN x.pp_price > 0 THEN x.pp_price::int ELSE null END as pp_price,
night_start as from_date,
night as to_date,
(night-night_start)+1 as duration,
min_spaces_available as spaces
FROM x
WHERE
night_start = night - (:nights - 1)
AND min_period_date_from = night_start
AND max_period_date_to = night;
That will get me all the :nights-night periods available for all my products in 2016, along with the price for each period and the maximum number of spaces I could fill in that period.
I'd like to be able to run this query to get all the periods available between 2 and 30 days for all my products.
This is likely to produce a table with millions of rows. The plan is to re-create this table periodically to enable a very quick look up of what's available for a particular date. The products_nightlypriceperiod represents a night of availability of a product - e.g. Product X has 3 spaces left for Jan 1st 2016, and costs £100 for the night.
Why use a loop? You can do something like this (using your first query):
with params as (
select generate_series(1, 10) as id
)
select t.*
from params cross join
your_table t -- "table" is a reserved word; substitute your real table name
where t.id = params.id;
You can modify params to have the values you really want. Then just use cross join and let the database "do the looping."
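If you do want to keep the loop in PL/pgSQL, the "query has no destination for result data" error usually goes away once each iteration's rows are appended to the function result with RETURN QUERY. A minimal sketch based on the function in the question; the SELECT inside the loop is only a placeholder for the real 60-line query:

```sql
CREATE OR REPLACE FUNCTION stay_prices(a_product_id int)
RETURNS TABLE (pid int, pp_price int) AS $$
DECLARE
    nights     int;
    nights_arr int[] := ARRAY[1, 2, 3, 4];
BEGIN
    FOREACH nights IN ARRAY nights_arr LOOP
        -- RETURN QUERY appends rows to the result set and keeps looping.
        -- Placeholder query: substitute the real query that uses "nights".
        RETURN QUERY
        SELECT a_product_id, nights * 100;
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;

-- Usage: one row per element of nights_arr.
SELECT * FROM stay_prices(1);
```

The loop may be the more practical route for the real query here, since window frame offsets such as ROWS BETWEEN n PRECEDING cannot reference a column of a joined params table, whereas a PL/pgSQL variable is passed in as a parameter.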

postgresql find preceding and following timestamp to arbitrary timestamp

Given an arbitrary timestamp such as 2014-06-01 12:04:55-04 I can find in sometable the timestamps just before and just after. I then calculate the elapsed number of seconds between those two with the following query:
SELECT EXTRACT (EPOCH FROM (
(SELECT time AS t0
FROM sometable
WHERE time < '2014-06-01 12:04:55-04'
ORDER BY time DESC LIMIT 1) -
(SELECT time AS t1
FROM sometable
WHERE time > '2014-06-01 12:04:55-04'
ORDER BY time ASC LIMIT 1)
)) as elapsedNegative;
It works, but I was wondering if there is a more elegant or astute way to achieve the same result. I am using 9.3. Here is a toy database.
CREATE TABLE sometable (
id serial,
time timestamp
);
INSERT INTO sometable (id, time) VALUES (1, '2014-06-01 11:59:37-04');
INSERT INTO sometable (id, time) VALUES (2, '2014-06-01 12:02:22-04');
INSERT INTO sometable (id, time) VALUES (3, '2014-06-01 12:04:49-04');
INSERT INTO sometable (id, time) VALUES (4, '2014-06-01 12:07:35-04');
INSERT INTO sometable (id, time) VALUES (5, '2014-06-01 12:09:53-04');
Thanks for any tips...
Update: Thanks to both @Joe Love and @Clément Prévost for interesting alternatives. Learned a lot on the way!
Your original query can't get much more efficient: provided the sometable.time column is indexed, the execution plan should show only two index scans, which is very efficient (index-only scans on PostgreSQL 9.2 and above).
Here is a more readable way to write it
WITH previous_timestamp AS (
SELECT time AS time
FROM sometable
WHERE time < '2014-06-01 12:04:55-04'
ORDER BY time DESC LIMIT 1
),
next_timestamp AS (
SELECT time AS time
FROM sometable
WHERE time > '2014-06-01 12:04:55-04'
ORDER BY time ASC LIMIT 1
)
SELECT EXTRACT (EPOCH FROM (
(SELECT * FROM next_timestamp)
- (SELECT * FROM previous_timestamp)
)) AS elapsed; -- next minus previous, so the result is positive
Using CTEs lets you give meaning to a subquery by naming it. Explicit naming is a well-known and recognised coding best practice (use explicit names, don't abbreviate, and don't use overly generic names like "data" or "value").
Be warned that, before PostgreSQL 12, CTEs are optimisation "fences" and sometimes get in the way of the planner.
Here is the SQLFiddle.
Edit: Moved the extract from the CTE to the final query so that PostgreSQL can use an index-only scan.
This solution will likely perform better if the timestamp column does not have an index. When 9.4 comes out, we can make it a little shorter by using aggregate filters.
It should be a bit faster because it runs one full table scan instead of two; however, it may perform worse if your timestamp column is indexed and you have a large dataset.
Here's the example without the epoch conversion, to make it easier to read.
select
min(
case when start_timestamp > current_timestamp
then
start_timestamp
else 'infinity'::timestamp
end
),
max(
case when t1.start_timestamp < current_timestamp
then
start_timestamp
else '-infinity'::timestamp
end
)
from my_table as t1
And here's the example including the math and epoch extraction:
select
extract (EPOCH FROM (
min(
case when start_timestamp > current_timestamp
then
start_timestamp
else 'infinity'::timestamp
end
)-
max(
case when start_timestamp < current_timestamp
then
start_timestamp
else '-infinity'::timestamp
end
)))
from snap.offering_event
Please let me know if you need further details-- I'd recommend trying my code vs yours and seeing how it performs.
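The aggregate filters mentioned above arrived with PostgreSQL 9.4 as the FILTER clause. A sketch of the same single-scan idea against the sometable toy table from the question, with the reference timestamp written as a literal:

```sql
-- Requires PostgreSQL 9.4+. "time" is quoted only to make clear it is the column.
SELECT
    max("time") FILTER (WHERE "time" < timestamp '2014-06-01 12:04:55') AS previous_time,
    min("time") FILTER (WHERE "time" > timestamp '2014-06-01 12:04:55') AS next_time,
    extract(epoch FROM
        min("time") FILTER (WHERE "time" > timestamp '2014-06-01 12:04:55')
      - max("time") FILTER (WHERE "time" < timestamp '2014-06-01 12:04:55')
    ) AS elapsed_seconds
FROM sometable;
```

With the toy data the neighbours are 12:04:49 and 12:07:35, which are 166 seconds apart. Unlike the CASE version, a side with no matching row comes back as NULL rather than as infinity.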

Postgresql - get closest datetime row relative to given datetime value

I have a postgres table with a unique datetime field.
I would like to use/create a function that takes as argument a datetime value and returns the row id having the closest datetime relative (but not equal) to the passed datetime value. A second argument could specify before or after the passed value.
Ideally, some combination of native datetime functions could handle this requirement. Otherwise it'll have to be a custom function.
Question: What are methods for querying relative datetime over a collection of rows?
select id, passed_ts - ts_column difference
from t
where
passed_ts > ts_column and positive_interval
or
passed_ts < ts_column and not positive_interval
order by abs(extract(epoch from passed_ts - ts_column))
limit 1
passed_ts is the timestamp parameter and positive_interval is a boolean parameter. If it is true, only rows where the timestamp column is lower than the passed timestamp are considered; if false, the inverse.
Simply use the - operator.
Assuming you have a table your_table with attributes Key, Attr and T (timestamp with or without time zone),
you can search with
select min(T - TimeValue) from your_table where (T - TimeValue) > interval '0';
This will give you the minimum difference. You can combine this value with a join to the same table to get the tuple you are interested in:
select * from (select *, T - TimeValue as diff from your_table) as T1 NATURAL JOIN
( select min(T - TimeValue) as diff from your_table where (T - TimeValue) > interval '0') as T2;
That should do it.
--dmg
You want the first row of a select statement producing all the rows below (or above) the given datetime in descending (or ascending) order.
Pseudo code for the function body:
SELECT id
FROM table
WHERE IF(@above, datecol > @param, datecol < @param)
ORDER BY IF(@above, datecol ASC, datecol DESC)
LIMIT 1
However, this does not work: one cannot condition the ordering direction.
The second idea is to do both queries, and select afterwards:
SELECT *
FROM (
(
SELECT 'below' AS dir, id
FROM table
WHERE datecol < @param
ORDER BY datecol DESC
LIMIT 1
) UNION (
SELECT 'above' AS dir, id
FROM table
WHERE datecol > @param
ORDER BY datecol ASC
LIMIT 1)
) AS t
WHERE dir = @dir
That should be pretty fast with an index on the datetime column.
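Wrapped into a function, the two-branch idea could look like the sketch below. The names closest_id, p_ts and p_above are made up for illustration, and the table t(id, ts_column) is assumed from the first answer; only the branch whose direction flag matches can return a row.

```sql
-- Assumed table layout, for illustration only.
CREATE TABLE IF NOT EXISTS t (
    id        serial PRIMARY KEY,
    ts_column timestamp NOT NULL UNIQUE
);

CREATE OR REPLACE FUNCTION closest_id(p_ts timestamp, p_above boolean)
RETURNS integer
LANGUAGE sql STABLE AS $$
    (SELECT id FROM t
      WHERE p_above AND ts_column > p_ts
      ORDER BY ts_column ASC
      LIMIT 1)
    UNION ALL
    (SELECT id FROM t
      WHERE NOT p_above AND ts_column < p_ts
      ORDER BY ts_column DESC
      LIMIT 1)
$$;

-- Usage: id of the nearest row strictly after the given timestamp.
SELECT closest_id('2013-04-30 12:00', true);
```

The function returns NULL when no row exists on the requested side.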
-- test rig
DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;
CREATE TABLE lutser
( dt timestamp NOT NULL PRIMARY KEY
);
-- populate it
INSERT INTO lutser(dt)
SELECT gs
FROM generate_series('2013-04-30', '2013-05-01', '1 min'::interval) gs
;
DELETE FROM lutser WHERE random() < 0.9;
--
-- The query:
WITH xyz AS (
SELECT dt AS hh
, LAG (dt) OVER (ORDER by dt ) AS ll
FROM lutser
)
SELECT *
FROM xyz bb
WHERE '2013-04-30 12:00' BETWEEN bb.ll AND bb.hh
;
Result:
NOTICE: drop cascades to table tmp.lutser
DROP SCHEMA
CREATE SCHEMA
SET
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "lutser_pkey" for table "lutser"
CREATE TABLE
INSERT 0 1441
DELETE 1288
hh | ll
---------------------+---------------------
2013-04-30 12:02:00 | 2013-04-30 11:50:00
(1 row)
Wrapping it into a function is left as an exercise for the reader.
UPDATE: here is a second one with the sandwiched-not-exists-trick (TM):
SELECT lo.dt AS ll
FROM lutser lo
JOIN lutser hi ON hi.dt > lo.dt
AND NOT EXISTS (
SELECT * FROM lutser nx
WHERE nx.dt < hi.dt
AND nx.dt > lo.dt
)
WHERE '2013-04-30 12:00' BETWEEN lo.dt AND hi.dt
;
You have to join the table to itself with the where condition looking for the smallest nonzero (negative or positive) interval between the base table row's datetime and the joined table row's datetime. It would be good to have an index on that datetime column.
P.S. You could also look for the max() of the previous or the min() of the subsequent.
Try something like:
SELECT *
FROM your_table
WHERE (dt_time > argument_time and search_above = 'true')
OR (dt_time < argument_time and search_above = 'false')
ORDER BY CASE WHEN search_above = 'true'
THEN dt_time - argument_time
ELSE argument_time - dt_time
END
LIMIT 1;