PostgreSQL: optimization of query for select with overlap

I created the following query to select data with overlapping periods (for campaigns that share the same business identifier!):
select
    campaign_instance_1.campaign_id,
    campaign_instance_1.start_time
from campaign_instance as campaign_instance_1
inner join campaign_instance as campaign_instance_2
    on campaign_instance_1.campaign_id = campaign_instance_2.campaign_id
    and (
        (campaign_instance_1.start_time between campaign_instance_2.start_time and campaign_instance_2.finish_time)
        or (campaign_instance_1.finish_time between campaign_instance_2.start_time and campaign_instance_2.finish_time)
        or (campaign_instance_1.start_time < campaign_instance_2.start_time and campaign_instance_1.finish_time > campaign_instance_2.finish_time)
        or (campaign_instance_1.start_time > campaign_instance_2.start_time and campaign_instance_1.finish_time < campaign_instance_2.finish_time))
The index is created as follows:
CREATE INDEX IF NOT EXISTS camp_inst_idx_campaign_id_and_finish_time
ON public.campaign_instance_without_index USING btree
(campaign_id ASC NULLS LAST, finish_time DESC NULLS LAST)
TABLESPACE pg_default;
Even with just 100,000 rows it runs very slowly: 43 seconds!
To optimize it, I tried adding start_time to the index:
(campaign_id ASC NULLS LAST, finish_time DESC NULLS LAST, start_time DESC NULLS LAST)
But the result is the same.
As far as I understand the EXPLAIN ANALYZE output, start_time is not used as an Index Condition.
I tried the query with this index on both 10,000 and 100,000 rows, so it does not seem to depend on the sample size (at least at these scales).
The source table has the following structure:
campaign_id bigint,
fire_time bigint,
start_time bigint,
finish_time bigint,
recap character varying,
details json
Why is my index not used, and what are possible ways to improve the query?

Joining campaign_instance to itself doesn't really serve any purpose here other than an "existence" check, and presumably your intention is not to get back duplicates for matching records. You could therefore simplify the query with EXISTS or a LATERAL join. Your join condition on time can also be simplified; you seem to be looking for overlapping periods:
select campaign_id,start_time
from campaign_instance c1
where exists( select * from campaign_instance c2
where c1.campaign_id = c2.campaign_id
and (c1.start_time <= c2.finish_time and c1.finish_time >= c2.start_time));
The time overlap check would probably use < and > instead of <= and >=, but I don't know your exact requirements; BETWEEN implicitly means <= and >=.
EDIT: Ensure that the match is not the row itself:
(This table should have a primary key to make things easier, but as it doesn't, I will assume there is no duplication on campaign_id, start_time and finish_time, so those columns can be used as a composite key.)
select campaign_id,start_time
from campaign_instance c1
where exists( select * from campaign_instance c2
where c1.campaign_id = c2.campaign_id
and (c1.start_time != c2.start_time or c1.finish_time != c2.finish_time)
and (c1.start_time <= c2.finish_time and c1.finish_time >= c2.start_time));
This takes around 230-250 milliseconds on my system (iMac i5 7500, 3.4 Ghz, 64 Gb mem).
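For reference, a strict-overlap variant (where periods that merely touch at an endpoint do not count as overlapping), still excluding the row itself, would look like this; a sketch under the same assumptions as above:
select campaign_id,start_time
from campaign_instance c1
where exists( select * from campaign_instance c2
where c1.campaign_id = c2.campaign_id
and (c1.start_time != c2.start_time or c1.finish_time != c2.finish_time)
-- strict inequalities: intervals that only touch at an endpoint do not match
and (c1.start_time < c2.finish_time and c1.finish_time > c2.start_time));
A composite index such as (campaign_id, start_time) is the natural candidate to support the inner lookup, but whether the planner uses it depends on the data distribution.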

Related

Postgres join vs aggregation on very large partitioned tables

I have a large table with 100s of millions of rows. Because it is so big, it is partitioned by date range first, and then that partition is also partitioned by a period_id.
CREATE TABLE research.ranks
(
security_id integer NOT NULL,
period_id smallint NOT NULL,
classificationtype_id smallint NOT NULL,
dtz timestamp with time zone NOT NULL,
create_dt timestamp with time zone NOT NULL DEFAULT now(),
update_dt timestamp with time zone NOT NULL DEFAULT now(),
rank_1 smallint,
rank_2 smallint,
rank_3 smallint
) PARTITION BY RANGE (dtz);
CREATE TABLE zpart.ranks_y1990 PARTITION OF research.ranks
FOR VALUES FROM ('1990-01-01 00:00:00+00') TO ('1991-01-01 00:00:00+00')
PARTITION BY LIST (period_id);
CREATE TABLE zpart.ranks_y1990p1 PARTITION OF zpart.ranks_y1990
FOR VALUES IN ('1');
Every year has a partition, and each year is further split into another dozen partitions by period_id.
I needed to see the result for security_ids side by side for different period_ids.
So the join I initially used was one like this:
select c1.security_id, c1.dtz,c1.rank_2 as rank_2_1, c9.rank_2 as rank_2_9
from research.ranks c1
left join research.ranks c9 on c1.dtz=c9.dtz and c1.security_id=c9.security_id and c9.period_id=9
where c1.period_id =1 and c1.dtz>now()-interval'10 years'
which was slow, but acceptable. I'll call this the JOIN version.
Then, we wanted to show two more period_ids and extended the above to add additional joins on the new period_ids.
This slowed down the join enough for us to look at a different solution.
We found that the following type of query runs about 6 or 7 times faster:
select c1.security_id, c1.dtz
,sum(case when c1.period_id=1 then c1.rank_2 end) as rank_2_1
,sum(case when c1.period_id=9 then c1.rank_2 end) as rank_2_9
,sum(case when c1.period_id=11 then c1.rank_2 end) as rank_2_11
,sum(case when c1.period_id=14 then c1.rank_2 end) as rank_2_14
from research.ranks c1
where c1.period_id in (1,11,14,9) and c1.dtz>now()-interval'10 years'
group by c1.security_id, c1.dtz;
We can use the sum because the table has unique indexes so we know there will only ever be one record that is being "summed". I'll call this the SUM version.
The speed is so much better that I'm questioning half of the code I have written previously! Two questions:
Should I be trying to use the SUM version rather than the JOIN version everywhere or is the efficiency likely to be a factor of the specific structure and not likely to be as useful in other circumstances?
Is there a problem with the logic of the SUM version in cases that I haven't considered?
To be honest, I don't think your "join" version was ever a good idea anyway. You only have one (partitioned) table so there never was a need for any join.
SUM() is the way to go, but I would use SUM(...) FILTER(WHERE ..) instead of a CASE:
SELECT
security_id,
dtz,
SUM(rank_2) FILTER (WHERE period_id = 1) AS rank_2_1,
SUM(rank_2) FILTER (WHERE period_id = 9) AS rank_2_9,
SUM(rank_2) FILTER (WHERE period_id = 11) AS rank_2_11,
SUM(rank_2) FILTER (WHERE period_id = 14) AS rank_2_14
FROM
research.ranks
WHERE
period_id IN ( 1, 11, 14, 9 )
AND dtz > now( ) - INTERVAL '10 years'
GROUP BY
security_id,
dtz;

How to write a SQL query which selects rows where a column value changed from the previous row

CREATE TABLE status( id serial NOT NULL,
val integer,
plan smallint,
time timestamp without time zone,
CONSTRAINT data_pkey PRIMARY KEY (id))
WITH (OIDS=FALSE);
ALTER TABLE data
OWNER TO postgres;
Index: data_idx
CREATE INDEX data_idx
ON data
USING btree
(time, id);
I have a table like this
id val plan time
1 8300 1 2011-01-01
2 8300 1 2011-01-02
3 8300 2 2011-01-03
4 9600 1 2011-01-04
5 9600 2 2011-01-05
How do I select the rows where sigplan changed from the previous row for that siteId?
In the example above, the query should return the rows
2011-01-03 (sigplan changed from 1 to 2 between 2011-01-01 and 2011-01-03 for 8300),
2011-01-05 (sigplan changed from 1 to 2 between 2011-01-04 and 2011-01-05 for 9600).
The table contains a lot of data, so the query should be optimized.
SELECT siteId, sigplan, MAX(server_time) FROM traffview.status_data
GROUP BY siteId, sigplan
HAVING COUNT(1) > 1 AND MAX(server_time) > 'XXXXX' AND MAX(server_time) < 'XXXXX'
The annoying part is figuring out which is the previous row id with the same siteId. After that it is pretty easy by joining the table with itself.
SELECT t1.* FROM table t1, table t2
WHERE t1.sigplan != t2.sigplan
AND t1.siteId = t2.siteId
AND t2.id = (SELECT MAX(t3.id) FROM table t3 WHERE t3.id < t1.id AND t3.siteId = t1.siteId)
If the table is moderately (not extremely) large I would consider doing this in application code instead, or by storing the change flag in its own column when writing a new row. A subquery for each row in the table has very poor performance.
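A minimal sketch of the "store the change flag in its own column" idea, assuming the status table above; the plan_changed column, function and trigger names are illustrative:
ALTER TABLE status ADD COLUMN plan_changed boolean NOT NULL DEFAULT false;
CREATE OR REPLACE FUNCTION mark_plan_change() RETURNS trigger AS $$
DECLARE
    prev_plan smallint;
BEGIN
    -- look up the latest earlier row for the same val
    SELECT s.plan INTO prev_plan
    FROM status s
    WHERE s.val = NEW.val AND s.time < NEW.time
    ORDER BY s.time DESC
    LIMIT 1;
    -- the first row for a given val counts as "not changed"
    NEW.plan_changed := prev_plan IS NOT NULL AND prev_plan IS DISTINCT FROM NEW.plan;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER status_plan_change
BEFORE INSERT ON status
FOR EACH ROW EXECUTE PROCEDURE mark_plan_change();
With that in place, finding the rows where the plan changed becomes a simple filter on plan_changed instead of a per-row subquery.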
This version doesn't have a sub-query, but does assume that you have consecutive IDs.
SELECT t1.*
FROM traffview AS t1, traffview AS t2
WHERE
t1.siteId = t2.siteId
AND t1.sigplan <> t2.sigplan
AND t1.id - t2.id = 1
ORDER BY
t1.server_time
When comparing with previous rows, it is useful to use the LAG window function, which does the job for you:
SELECT sub.*
FROM (
    SELECT
        plan AS curr_plan,
        LAG(plan) OVER (PARTITION BY val ORDER BY time) AS prev_plan,
        val,
        time
    FROM status  -- table name taken from the CREATE TABLE above
) sub
WHERE
    sub.prev_plan IS NOT NULL
    AND sub.prev_plan <> sub.curr_plan;
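Since the table contains a lot of data, an index matching the window's partitioning and ordering may let PostgreSQL feed LAG without a separate sort; a sketch, assuming the column names above (the index name is illustrative):
CREATE INDEX status_val_time_idx ON status (val, time);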

PostgreSQL: Statistics on partial index?

PostgreSQL version: 9.3.13
Consider the following tables, index and data:
CREATE TABLE orders (
order_id bigint,
status smallint,
owner int,
CONSTRAINT orders_pkey PRIMARY KEY (order_id)
)
CREATE INDEX owner_index ON orders
USING btree
(owner) WHERE status > 0;
CREATE TABLE orders_appendix (
order_id bigint,
note text
)
Data
orders:
(IDs, 0, 1337) * 1000000 rows
(IDs, 10, 1337) * 1000 rows
(IDs, 10, 777) * 1000 rows
orders_appendix:
one row for each order
My problem is:
select * from orders where owner=1337 and status>0
The query planner estimated the number of rows to be 1000000, but the actual number of rows is 1000.
In the following, more complicated query:
SELECT note FROM orders JOIN orders_appendix using (order_id)
WHERE owner=1337 AND status>0
Instead of using an inner join (which is preferable for a small number of rows), it picks a bitmap join plus a full table scan on orders_appendix, which is very slow.
If the condition is "owner=777", it will choose the preferable inner join instead.
I believe it is because of the statistics, as AFAIK Postgres can only collect and consider stats for each column independently.
However, if I...
CREATE INDEX owner_abs ON orders (abs(owner)) where status>0;
Now, a slightly changed query...
SELECT note FROM orders JOIN orders_appendix using (order_id)
WHERE abs(owner)=1337 AND status>0
will result in the inner join that I wanted.
Is there a better solution? Perhaps "statistics on partial index"?
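(As a side note: 9.3 offers no way to attach statistics to a partial index, but PostgreSQL 10 and later support extended statistics on column combinations. They mostly apply to equality-style clauses, so they may or may not cover a status > 0 filter, but they do address cross-column correlation in general. A sketch of that syntax, not applicable to 9.3, with an illustrative statistics name:)
CREATE STATISTICS orders_owner_status_stats (dependencies) ON owner, status FROM orders;
ANALYZE orders;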

How can I execute a least cost routing query in postgresql, without temporary tables?

How can I execute a telecoms least cost routing query in PostgreSQL?
The purpose is to generate a result set ordered by the lowest price across the carriers. The table structure is below:
SQL Fiddle
CREATE TABLE tariffs (
trf_tariff_id integer,
trf_carrier_id integer,
trf_prefix character varying,
trf_destination character varying,
trf_price numeric(15,6),
trf_connect_charge numeric(15,6),
trf_billing_interval integer,
trf_minimum_interval integer
);
For instance, to check the cost of a call if passed through a particular carrier (carrier_id), the query is:
SELECT trf_price, trf_prefix as lmp FROM tariffs WHERE SUBSTRING(dialled_number,1, LENGTH(trf_prefix)) = trf_prefix and trf_carrier_id = carrier_id ORDER BY trf_prefix DESC limit 1
For the cost of the call for each carrier, i.e. the least cost query, the query is:
-- select * from tariffs
select distinct banana2.longest_prefix, banana2.trf_carrier_id_2, apple2.trf_carrier_id,
       apple2.lenprefix, apple2.trf_price, apple2.trf_destination
from
  (select banana.longest_prefix, banana.trf_carrier_id_2
   from (select max(length(trf_prefix)) as longest_prefix, trf_carrier_id as trf_carrier_id_2
         from (select *, length(trf_prefix) as lenprefix
               from tariffs
               where substring('35567234567', 1, length(trf_prefix)) = trf_prefix) as apple
         group by apple.trf_carrier_id) as banana) as banana2,
  (select *, length(trf_prefix) as lenprefix
   from tariffs
   where substring('35567234567', 1, length(trf_prefix)) = trf_prefix) as apple2
-- group by apple2.trf_carrier_id
where banana2.trf_carrier_id_2 = apple2.trf_carrier_id
  and banana2.longest_prefix = apple2.lenprefix
order by trf_price
The query works on the basis that, for each carrier, the longest matching prefix for a dialled number is unique. So a join involving the longest prefix and carrier gives the set for all the carriers.
I have one problem with my query:
I don't want to do the apple(X) query twice
(select *, length(trf_prefix) as lenprefix from tariffs where substring('35567234567', 1, length(trf_prefix) )= trf_prefix) as apple
There must be a more elegant way, probably declaring it once and using it twice.
What I want to do is run the query on the single carrier for each carrier:
SELECT trf_price, trf_prefix as lmp FROM tariffs WHERE SUBSTRING(dialled_number,1, LENGTH(trf_prefix)) = trf_prefix and trf_carrier_id = carrier_id ORDER BY trf_prefix DESC limit 1
and combine them into one set which will be sorted by price.
In fact I want to generalize the method for any such query where the output for the various values for a particular column or set of columns are combined into one set for further querying. I am told that CTEs are the way to accomplish that kind of query but I find the docs rather confusing. It is much easier with your own use cases.
PS. I am aware that the prefix length can be precomputed and stored.
Common Table Expressions:
with apple as (
select *, length(trf_prefix) as lenprefix
from tariffs
where substring('35567234567', 1, length(trf_prefix)) = trf_prefix
)
select distinct banana2.longest_prefix, banana2.trf_carrier_id_2,
apple.trf_carrier_id, apple.lenprefix, apple.trf_price,
apple.trf_destination
from (select banana.longest_prefix, banana.trf_carrier_id_2
from (select max(length(trf_prefix)) as longest_prefix,
trf_carrier_id as trf_carrier_id_2
from apple
group by apple.trf_carrier_id) as banana) as banana2,
apple
where banana2.trf_carrier_id_2 = apple.trf_carrier_id
and banana2.longest_prefix = apple.lenprefix
order by trf_price
You can just pull out the repeated table definition. Even if I'm just using one of those sub-select-in-a-from things a single time, I still use CTEs. I find the style you're using basically unreadable.
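As a side note, the "longest matching prefix per carrier" part can also be expressed directly with PostgreSQL's DISTINCT ON; a sketch against the same tariffs table (the per_carrier alias is illustrative):
select per_carrier.*
from (select distinct on (trf_carrier_id)
             trf_carrier_id, trf_prefix, length(trf_prefix) as lenprefix,
             trf_price, trf_destination
      from tariffs
      where substring('35567234567', 1, length(trf_prefix)) = trf_prefix
      -- within each carrier keep the longest prefix, cheapest price on ties
      order by trf_carrier_id, length(trf_prefix) desc, trf_price) as per_carrier
order by trf_price;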

postgresql find preceding and following timestamp to arbitrary timestamp

Given an arbitrary timestamp such as 2014-06-01 12:04:55-04 I can find in sometable the timestamps just before and just after. I then calculate the elapsed number of seconds between those two with the following query:
SELECT EXTRACT (EPOCH FROM (
(SELECT time AS t0
FROM sometable
WHERE time < '2014-06-01 12:04:55-04'
ORDER BY time DESC LIMIT 1) -
(SELECT time AS t1
FROM sometable
WHERE time > '2014-06-01 12:04:55-04'
ORDER BY time ASC LIMIT 1)
)) as elapsedNegative;
It works, but I was wondering if there is another, more elegant or astute way to achieve the same result? I am using 9.3. Here is a toy database.
CREATE TABLE sometable (
id serial,
time timestamp
);
INSERT INTO sometable (id, time) VALUES (1, '2014-06-01 11:59:37-04');
INSERT INTO sometable (id, time) VALUES (1, '2014-06-01 12:02:22-04');
INSERT INTO sometable (id, time) VALUES (1, '2014-06-01 12:04:49-04');
INSERT INTO sometable (id, time) VALUES (1, '2014-06-01 12:07:35-04');
INSERT INTO sometable (id, time) VALUES (1, '2014-06-01 12:09:53-04');
Thanks for any tips...
update Thanks to both #Joe Love and #Clément Prévost for interesting alternatives. Learned a lot on the way!
Your original query can't be much more efficient. Given that the sometable.time column is indexed, your execution plan should show only 2 index scans, which is very efficient (index-only scans if you have PG 9.2 and above).
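(The index assumed here would be something along these lines; the name is illustrative:)
CREATE INDEX sometable_time_idx ON sometable (time);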
Here is a more readable way to write it
WITH previous_timestamp AS (
SELECT time AS time
FROM sometable
WHERE time < '2014-06-01 12:04:55-04'
ORDER BY time DESC LIMIT 1
),
next_timestamp AS (
SELECT time AS time
FROM sometable
WHERE time > '2014-06-01 12:04:55-04'
ORDER BY time ASC LIMIT 1
)
SELECT EXTRACT (EPOCH FROM (
(SELECT * FROM next_timestamp)
- (SELECT * FROM previous_timestamp)
))as elapsedNegative;
Using CTEs allows you to give meaning to a subquery by naming it. Explicit naming is a well-known and recognised coding best practice (use explicit names, don't abbreviate, and don't use overly generic names like "data" or "value").
Be warned that CTEs are optimisation "fences" and can sometimes get in the way of planner optimisation.
Here is the SQLFiddle.
Edit: Moved the extract from the CTE to the final query so that PostgreSQL can use an index-only scan.
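(Side note on the "fence" warning above: this answer targets 9.3, but from PostgreSQL 12 onward single-use CTEs are inlined by default and you can control materialisation explicitly; a sketch of the syntax, reusing the previous_timestamp CTE:)
WITH previous_timestamp AS NOT MATERIALIZED (
SELECT time
FROM sometable
WHERE time < '2014-06-01 12:04:55-04'
ORDER BY time DESC LIMIT 1
)
SELECT * FROM previous_timestamp;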
This solution will likely perform better if the timestamp column does not have an index. When 9.4 comes out we can do it a little shorter by using aggregate filters.
This should be a bit faster as it runs 1 full table scan instead of 2; however, it may perform worse if your timestamp column is indexed and you have a large dataset.
Here's the example without the epoch conversion, to make it easier to read.
select
min(
case when start_timestamp > current_timestamp
then
start_timestamp
else 'infinity'::timestamp
end
),
max(
case when t1.start_timestamp < current_timestamp
then
start_timestamp
else '-infinity'::timestamp
end
)
from my_table as t1
And here's the example including the math and epoch extraction:
select
extract (EPOCH FROM (
min(
case when start_timestamp > current_timestamp
then
start_timestamp
else 'infinity'::timestamp
end
)-
max(
case when start_timestamp < current_timestamp
then
start_timestamp
else '-infinity'::timestamp
end
)))
from snap.offering_event
Please let me know if you need further details-- I'd recommend trying my code vs yours and seeing how it performs.