Optimizing a slow PostgreSQL query against multiple tables

One of our PostgreSQL queries started getting slow (~15 seconds) so we looked at migrating to a Graph database. Early tests show significantly faster speeds, so AWESOME.
Here's the problem: we still need to store a backup of the data in Postgres for non-analytics needs. The Graph database is just for analytics, and we'd prefer for it to remain a secondary data store. Because our business logic changed quite a bit during this migration, two existing tables turned into four, and the current 'backup' selects in Postgres take anywhere from 1 to 6 minutes to run.
I've tried a few ways to optimize this, and the best seems to be turning this into two queries. If anyone can spot obvious mistakes here, I'd love to hear suggestions. I've tried switching up left/right/inner joins with little difference in the query planner. The join order does make a difference; I think I'm just not getting this right.
I'll go into details.
Goal: Retrieve the last 10 attachments sent to a given person
Database structure:
CREATE TABLE message (
-- id (referenced below) and other columns omitted
body_raw TEXT
);
CREATE TABLE attachments (
-- id (referenced below) and other columns omitted
body_raw TEXT
);
CREATE TABLE message_2_attachments (
message_id INT NOT NULL REFERENCES message(id) ,
attachment_id INT NOT NULL REFERENCES attachments(id)
);
CREATE TABLE mailings (
-- id and other columns omitted
event_timestamp TIMESTAMP not null ,
recipient_id INT NOT NULL ,
message_id INT NOT NULL REFERENCES message(id)
);
sidenote: the reason why a mailing is abstracted from the message is that a mailing often has more than one recipient /and/ a single message can go out to multiple recipients
This query takes about 5 minutes on a relatively small dataset (the query planner figure is in the comment above each query):
-- 159374.75
EXPLAIN ANALYZE SELECT attachments.*
FROM attachments
JOIN message_2_attachments ON attachments.id = message_2_attachments.attachment_id
JOIN message ON message_2_attachments.message_id = message.id
JOIN mailings ON mailings.message_id = message.id
WHERE mailings.recipient_id = 1
ORDER BY mailings.event_timestamp desc limit 10 ;
Splitting it into two queries takes only about 1/8 of the time:
-- 19123.22
EXPLAIN ANALYZE SELECT message_2_attachments.attachment_id
FROM mailings
JOIN message ON mailings.message_id = message.id
JOIN message_2_attachments ON message.id = message_2_attachments.message_id
JOIN attachments ON message_2_attachments.attachment_id = attachments.id
WHERE mailings.recipient_id = 1
ORDER BY mailings.event_timestamp desc limit 10 ;
-- 1.089
EXPLAIN ANALYZE SELECT * FROM attachments WHERE id IN ( results of above query )
I've tried rewriting the queries a handful of times: different join orders, different types of joins, etc. I can't seem to make this anywhere near as efficient in a single query as it is in two.
UPDATE: GitHub has better formatting, so the full EXPLAIN output is here: https://gist.github.com/jvanasco/bc1dd38ca06e52c9a090

I plugged the output of your EXPLAIN in here: http://explain.depesz.com/s/hqPT
As you can see, this node:
Hash Join (cost=96588.85..158413.71 rows=44473 width=3201) (actual time=22590.630..30761.213 rows=44292 loops=1)
Hash Cond: (message_2_attachment.attachment_id = attachment.id)
is taking a good amount of time. I'd try adding indexes on the foreign keys as well:
CREATE INDEX idx_message_2_attachments_attachment_id ON "message_2_attachments" USING btree (attachment_id);
CREATE INDEX idx_message_2_attachments_message_id ON "message_2_attachments" USING btree (message_id);
CREATE INDEX idx_mailings_message_id ON "mailings" USING btree (message_id);
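Beyond indexing the foreign keys, one option is to let a single statement do what the two-query version does by hand: find the ten newest attachment ids using only the narrow tables, then fetch the wide attachments rows for just those ids. The following is a minimal sketch under the schema from the question, not a tested fix. The index name idx_mailings_recipient_ts is made up for illustration. Dropping the join to message assumes that mailings.message_id and message_2_attachments.message_id are both NOT NULL foreign keys to message(id), as declared above.

```sql
-- Hypothetical index: lets the planner read one recipient's mailings newest-first.
CREATE INDEX idx_mailings_recipient_ts
    ON mailings (recipient_id, event_timestamp DESC);

SELECT a.*
FROM (
    -- Pick the 10 newest attachment ids using only the narrow tables.
    SELECT m2a.attachment_id, ml.event_timestamp
    FROM mailings AS ml
    JOIN message_2_attachments AS m2a ON m2a.message_id = ml.message_id
    WHERE ml.recipient_id = 1
    ORDER BY ml.event_timestamp DESC
    LIMIT 10
) AS latest
-- Fetch the wide attachment rows for those 10 ids only.
JOIN attachments AS a ON a.id = latest.attachment_id
ORDER BY latest.event_timestamp DESC;
```

Whether the planner actually uses the new index depends on the data distribution, so it is worth comparing the EXPLAIN ANALYZE output before and after.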

The junction table is missing a primary key. It is also advisable to add a second index with the PK columns in reverse order:
CREATE TABLE message_2_attachments (
message_id INT NOT NULL REFERENCES message(id) ,
attachment_id INT NOT NULL REFERENCES attachments(id)
, PRIMARY KEY (message_id,attachment_id) -- <<== here
);
CREATE UNIQUE INDEX ON message_2_attachments(attachment_id,message_id); -- <<== here
For the mailings table, the situation is not so clear. It looks like some combination of {event_timestamp, recipient_id, message_id} could function as a candidate key. The id field merely functions as a surrogate.


Postgres: Optimization of query with simple where clause

I have a table with the following columns:
CUSTOMER_ID (VARCHAR)
STATUS (VARCHAR) (4 different statuses possible)
other columns that are not relevant
I am trying to find all the rows with a given customer_id and one of two statuses.
The query looks like:
SELECT *
FROM my_table
WHERE customer_id = '12345678' AND status IN ('STATUS1', 'STATUS2');
The table contains about 1 million rows. I added two indexes, one on customer_id and one on status. The query still needs about 1 second to run.
The explain plan is:
1. Gather
2. Seq Scan on my_table
Filter: (((status)::text = ANY ('{SUBMITTED,CANCELLED}'::text[])) AND ((customer_id)::text = '12345678'::text))
I ran ANALYZE my_table after creating the indexes. What could I do to improve the performance of this quite simple query?
You need a compound (multi-column) index to help satisfy your query.
Guessing, it seems like the most selective column (the one with the most distinct values) is customer_id. status probably has only a few distinct values. So customer_id should go first in the index. Try this.
CREATE INDEX customer_id_status ON my_table (customer_id, status);
This creates a BTREE index. A useful mental model for such an index is an old-fashioned telephone book. It's sorted in order. You look up the first matching entry in the index, then scan it sequentially for the items you want.
You may also want to try running ANALYZE my_table; to update the statistics (about selectivity) used by the query planner to choose an appropriate index.
Pro tip: Avoid SELECT * if you can. Instead, name the columns you want. This can help performance a lot.
Pro tip: Your question said some of your columns aren't relevant to query optimization. That's probably not true; index design is a weird art. SELECT * makes it less true.
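To check whether the compound index is actually used, it may help to look at the plan again after creating it. This is a sketch in PostgreSQL syntax: the index name is made up, IF NOT EXISTS assumes PostgreSQL 9.5 or later, and the status values are taken from the filter shown in the question's plan. Selecting only the indexed columns gives the planner the option of an index-only scan, which is one reason the SELECT * advice matters.

```sql
-- Hypothetical index name; the column order follows the advice above.
CREATE INDEX IF NOT EXISTS my_table_customer_id_status_idx
    ON my_table (customer_id, status);

-- Refresh planner statistics after creating the index.
ANALYZE my_table;

-- Naming only indexed columns gives the planner the option of an index-only scan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, status
FROM my_table
WHERE customer_id = '12345678'
  AND status IN ('SUBMITTED', 'CANCELLED');
```

If the plan still shows a Seq Scan, the planner may be estimating that too many rows match for the index to pay off, which is worth confirming against the row counts in the output.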

Best way to model state changes for point in time queries

I'm working on a system that needs to be able to find the "state" of an item at a particular time in history. The state is binary (either on or off). In this case it's to determine where to direct (to a particular "keyspace") a piece of timestamped data as determined by the timestamp of the data. I'm having a hard time deciding what the best way to model the data is.
Method 1 is to use the tstzrange with state being implied by the bounds of the range:
create extension btree_gist;
create table core.range_director (
range tstzrange,
directee_id text,
keyspace text,
-- allow a directee to be directed to multiple keyspaces at once
exclude using gist (directee_id with =, keyspace with =, range with &&)
);
insert into core.range_director values
('[2021-01-15 00:00:00 -0:00,2021-01-20 00:00:00 -0:00)', 'THING_ID', 'KEYSPACE_1'),
('[2021-01-15 00:00:00 -0:00,)', 'THING_ID', 'KEYSPACE_2');
select keyspace from core.range_director
where directee_id = 'THING_ID' and range_director.range @> '2021-01-15'::timestamptz;
-- returns KEYSPACE_1 and KEYSPACE_2
select keyspace from core.range_director
where directee_id = 'THING_ID' and range_director.range @> '2021-01-21'::timestamptz;
-- returns KEYSPACE_2
Method 2 is to have explicit state changes:
create table core.status_director (
status_time timestamptz,
status text,
directee_id text,
keyspace text
); -- not sure what pk to use for this method
insert into core.status_director values
('2021-01-15 00:00:00 -0:00','Open','THING_ID','KEYSPACE_1'),
('2021-01-20 00:00:00 -0:00','Closed','THING_ID','KEYSPACE_1'),
('2021-01-15 00:00:00 -0:00','Open','THING_ID','KEYSPACE_2');
select distinct on(keyspace) keyspace, status from core.status_director
where directee_id = 'THING_ID'
and status_time < '2021-01-16'
order by keyspace, status_time desc;
-- returns KEYSPACE_1:Open KEYSPACE_2:Open
select distinct on(keyspace) keyspace, status from core.status_director
where directee_id = 'THING_ID'
and status_time < '2021-01-21'
order by keyspace, status_time desc;
-- returns KEYSPACE_1:Closed, KEYSPACE_2:Open
-- so, client code has to ensure that it only directs to status=Open keyspaces
Maybe there are other methods that would work as well, but these two seem to make the most sense to me. The benefit of the first method is the really easy query, but the downside is that you now have to update rows to close the state, whereas in the second method you can just post new states, which seems easier.
The table could conceivably grow into thousands or tens of thousands of rows, but will probably not grow into millions (but does the best method change depending on the expected row count?). I have a couple of similar tables with the same point-in-time "state" queries so it's really important that I get the model for them right.
My instinct is to go with Method 1, but are there any footguns or performance considerations that I'm not thinking of that would urge the use case towards Method 2 (or another method I haven't considered?)
No footguns with Method 1, just great big huge cannons. With that method, how do you determine the current status? You need to scan each status change and toggle the status for each one, or perhaps use something like "count(*)%2", where odd gives one state and even the other. What happens if a row gets deleted, or data is purged, and you no longer know how many state transitions there were? With Method 2 you retrieve the greatest date and directly obtain the status.
For myself, I would do Method 3, that being Method 1 + Method 2. Yes, I would have a date range for the status and the status value itself. That gives me complex historical analysis, as I have the complete history, as well as direct access to the current status at any time.
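One possible reading of this combined approach, as a sketch only: the table name status_range_director and the column valid_during are made up for illustration, and the core schema is assumed to exist as in the question. Each row carries both the status and the period during which it held, and the exclusion constraint keeps periods for the same directee and keyspace from overlapping.

```sql
create extension if not exists btree_gist;

-- Hypothetical table combining Method 1 (range) with Method 2 (explicit status).
create table core.status_range_director (
    directee_id  text      not null,
    keyspace     text      not null,
    status       text      not null check (status in ('Open', 'Closed')),
    valid_during tstzrange not null,
    -- at most one status per directee/keyspace at any instant
    exclude using gist (directee_id with =, keyspace with =, valid_during with &&)
);

-- Changing state: close the open-ended row and start the next one in one transaction.
begin;
update core.status_range_director
   set valid_during = tstzrange(lower(valid_during), now())
 where directee_id = 'THING_ID'
   and keyspace = 'KEYSPACE_1'
   and upper_inf(valid_during);
insert into core.status_range_director
values ('THING_ID', 'KEYSPACE_1', 'Closed', tstzrange(now(), null));
commit;

-- Point-in-time lookup: which keyspaces were open at the given instant?
select keyspace
  from core.status_range_director
 where directee_id = 'THING_ID'
   and status = 'Open'
   and valid_during @> '2021-01-16 00:00:00+00'::timestamptz;
```

Because now() returns the same value throughout a transaction and ranges built with tstzrange() default to an exclusive upper bound, the closed row and its successor touch without overlapping.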
So after doing a bunch of research on the topic I found that my case is a variation of a "Valid-Time State Table". See ch. 2 and ch. 5 of Developing Time-Oriented Database Applications in SQL by Richard Snodgrass.
The support for these tables isn't great but it's not terrible either (at least PostgreSQL has tstzranges to work with). Method 1 of my post is largely sufficient - the main wrinkle is between the state table and other tables.
Since PostgreSQL doesn't have native support for these kinds of temporal tables, you have to build referential integrity yourself. There's a bunch of ways to do this, but for anyone in the future looking for some direction, here is an example of what that might look like for a referential query on two bitemporal tables:
create table a (
row_id bigserial, -- to track individual rows
id int,
pov tstzrange, -- period of validity
pop tstzrange -- period of presence
);
create table b (
row_id bigserial,
id int,
pov tstzrange,
pop tstzrange,
a_id int
);
-- are we good?
with each_pov as (
select bool_or(a.pov @> b.pov) as ok
from a
join b on a.id = b.a_id
and upper(a.pop) is null
and upper(b.pop) is null
group by b.pov
) select coalesce(
(select count(*) = 0 from b where upper(pop) is null)
) from each_pov;
You can put the query into a constraint trigger on both the main table and the referenced table to get something approaching sequenced referential integrity for the current period of presence.
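As a rough illustration of the constraint-trigger idea, here is a sketch using tables a and b from above. It is a simplified check written for this example, not the exact query shown earlier: it rejects the transaction if any current row of b has no single current row of a whose period of validity contains it. The function and trigger names are made up, and EXECUTE FUNCTION assumes PostgreSQL 11 or later (older versions spell it EXECUTE PROCEDURE).

```sql
-- Hypothetical check: every current row of b must be covered by a current row of a.
create function check_b_covered_by_a() returns trigger
language plpgsql as $$
begin
    if exists (
        select 1
        from b
        where upper(b.pop) is null              -- b row is currently present
          and not exists (
              select 1
              from a
              where a.id = b.a_id
                and upper(a.pop) is null        -- a row is currently present
                and a.pov @> b.pov              -- and covers b's validity period
          )
    ) then
        raise exception 'row in b is not covered by a current row in a';
    end if;
    return null;  -- the return value is ignored for AFTER triggers
end;
$$;

-- Re-check when the referencing table changes.
create constraint trigger b_covered_by_a
    after insert or update on b
    deferrable initially deferred
    for each row execute function check_b_covered_by_a();

-- Re-check when the referenced table changes.
create constraint trigger a_still_covers_b
    after update or delete on a
    deferrable initially deferred
    for each row execute function check_b_covered_by_a();
```

Deferring the triggers lets a transaction modify both tables before the check runs at commit. The function rescans b on every fired row, which is acceptable for a sketch but would likely need narrowing to the affected ids on larger tables.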

PostgreSQL - How to make a condition with records between the current record date and the same date plus 5 min?

I have something like this. With this piece of code I detect whether a vehicle stopped for at least 5 minutes.
It works, but with a large amount of data it starts to be slow.
I did a lot of tests and I'm sure that my problem is in the not exists block.
My table:
CREATE TABLE public.messages
(
id bigint PRIMARY KEY DEFAULT nextval('messages_id_seq'::regclass),
messagedate timestamp with time zone NOT NULL,
vehicleid integer NOT NULL,
driverid integer NOT NULL,
speedeffective double precision NOT NULL,
-- ... few nonsense properties
);
ALTER TABLE public.messages OWNER TO postgres;
CREATE INDEX idx_messages_1 ON public.messages
USING btree (vehicleid, messagedate);
And my query:
select ...
from messages m
where m.speedeffective > 0
and m.next_speedeffective = 0
and not exists( -- my problem
select id
from messages
where vehicleid = m.vehicleid
and speedeffective > 5 -- I forgot this condition
and messagedate > m.messagedate
and messagedate <= m.messagedate + interval '5 minutes'
);
I can't figure out how to build the condition in a more performant way.
Edit DAY2:
I added a CTE like this, to use in place of the second reference to the table:
WITH messagesx as (
select ... from messages
where speedeffective > 5
)
and now it works better. I think that I'm still missing a little detail.
A NOT EXISTS can slow your query down when it ends up being evaluated once for each of the outer rows. Try to express the same logic as a join (I'm rewriting the query without knowing the table, so I might make a mistake here):
select ...
from messages m1
left join messages m2
on m1.vehicleid = m2.vehicleid and m2.speedeffective > 5 and m2.messagedate > m1.messagedate and m2.messagedate <= m1.messagedate + interval '5 minutes'
where m1.speedeffective > 0
and m1.next_speedeffective = 0
and m2.vehicleid is null
Take note that the NOT EXISTS is rewritten as the non-hit of the join condition.
Based on this answer: https://stackoverflow.com/a/36445233/5000827
and reading about NOT IN, NOT EXISTS and LEFT JOIN (where join is NULL)
For PostgreSQL, NOT EXISTS and LEFT JOIN (where the joined row is NULL) are both planned as anti-joins and work the same way. (This is why @CountZukula's answer performs almost the same as mine.)
The problem was the kind of join operation chosen: nested loop or hash.
So, based on this: https://www.postgresql.org/docs/9.6/static/routine-vacuuming.html
PostgreSQL's VACUUM command has to process each table on a regular basis for several reasons:
To recover or reuse disk space occupied by updated or deleted rows.
To update data statistics used by the PostgreSQL query planner.
To update the visibility map, which speeds up index-only scans.
To protect against loss of very old data due to transaction ID wraparound or multixact ID wraparound.
I ran VACUUM ANALYZE on the messages table and the same query now runs much faster.
So, after the VACUUM, PostgreSQL can choose a better plan.
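For anyone checking whether stale statistics are the cause of a similar slowdown, a sketch of the commands involved (the table name messages is from the question; the rest is standard PostgreSQL):

```sql
-- Reclaim dead rows and refresh planner statistics in one pass.
VACUUM (ANALYZE, VERBOSE) messages;

-- Check when the table was last vacuumed/analyzed, manually or by autovacuum.
SELECT relname, n_live_tup, n_dead_tup,
       last_vacuum, last_autovacuum,
       last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'messages';
```

If last_autoanalyze is old or NULL on a table that receives many inserts, the autovacuum settings for that table may be worth a look.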

Join time series Cassandra tables in Spark

I have two tables (agg_count_1 and agg_count_2) in Cassandra both with the same schema:
CREATE TABLE agg_count_1 (
pk_1 text,
pk_2 text,
pk_3 text,
window_start timestamp,
count counter,
PRIMARY KEY (( pk_1, pk_2, pk_3 ), window_start)
);
window_start is a timestamp rounded to the nearest 15 minutes, which means its value is exactly the same in both tables; however, rows for some time windows may be missing.
I would like to efficiently (inner) join these two tables on the primary key into a third table with very much the same schema, storing the value of agg_count_1.count in counter_1 and agg_count_2.count in counter_2:
CREATE TABLE agg_joined (
pk_1 text,
pk_2 text,
pk_3 text,
window_start timestamp,
counter_1 int,
counter_2 int,
PRIMARY KEY (( pk_1, pk_2, pk_3 ), window_start)
);
This can be done in many ways using combination of Scala, Spark and Spark-Cassandra connector features. What is the recommended way?
I would also appreciate hearing about approaches to avoid. Joins are in general expensive, but I would expect this kind of "zipping" of time series to be fairly efficient if you (actually me) don't do anything wrong.
Based on the Spark-Cassandra connector documentation, using joinWithCassandraTable sounds suboptimal because it executes a single query for every partition:
joinWithCassandraTable utilizes the java driver to execute a single query for every partition required by the source RDD so no un-needed data will be requested or serialized.

SARGable way to find records near each other based on time window?

We have events inserted into a table: a start event and an end event. Related events have the same internal_id number, and are inserted within a 90 second window. We frequently do a self-join on the table:
create table mytable (id bigint identity, internal_id bigint,
internal_date datetime, event_number int, field_a varchar(50))
select * from mytable a inner join mytable b on a.internal_id = b.internal_id
and a.event_number = 1 and b.event_number = 2
However, we can have millions of linked events each day. Our clustered key is the internal_date, so we can filter down to a partition level, but the performance can still be mediocre:
and a.internal_date >='20120807' and a.internal_date < '20120808'
and b.internal_date >='20120807' and b.internal_date < '20120808'
Is there a SARGable way to narrow it down further?
Adding this doesn't work - non-SARGable:
and a.internal_date <= b.internal_date +.001 --about 90 seconds
and a.internal_date > b.internal_date - .001 --make sure they're within the window
This isn't for a point query, so doing one-offs doesn't help - we're searching for thousands of records and need event details from the start event and the end event.
With this index your query will be much cheaper:
CREATE UNIQUE INDEX idx_iid on mytable(event_number, internal_id)
INCLUDE (id, internal_date, field_a);
The index allows you to seek on event_number rather than doing a clustered index scan, as well as enables you to do a merge join on internal_id rather than a hash join. The uniqueness constraint makes merge join even cheaper by eliminating possibility of many-to-many join.
See this for a more detailed explanation of merge join.