Get max timestamps efficiently for large table for a set of ids - postgresql

I have a large PostgreSQL table (actually many partition tables, divided up by yearly quarters) that, for simplicity's sake, is defined something like this:
id bigint
ts (timestamp)
value (float)
For a particular set of ids, what is an efficient way of finding the last timestamp in the table for each specified id?
The table is indexed by (id, timestamp).
If I do something naive like
SELECT sensor_id, MAX(ts)
FROM sensor_values
WHERE ts >= (NOW() + INTERVAL '-100 days') :: TIMESTAMPTZ
GROUP BY 1;
Things are pretty slow.
Is there a way of narrowing down the times first, perhaps by a binary search on one id? (I can assume the timestamps are similar for a particular set of ids.)
I am accessing the db through psycopg, so the solution can be in code or SQL, in case I am missing something easy that would speed this up.
The explain for the query can be seen here: https://explain.depesz.com/s/PVqg
Any ideas appreciated.
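One option worth trying, sketched here under some assumptions: instead of grouping the whole time range, look up each id separately, so that the (id, timestamp) index can return the newest row per id by reading the index backwards. The helper below only builds the SQL text and the parameters to hand to a psycopg cursor. The function name build_latest_ts_query is hypothetical, the table and column names are taken from the query above, and whether the planner really uses per-partition index scans should be checked with EXPLAIN.

```python
from datetime import datetime, timedelta, timezone

def build_latest_ts_query(sensor_ids, since):
    """Return (sql, params) for a per-id "latest timestamp" lookup.

    Assumes a table sensor_values(sensor_id, ts, ...) with an index on
    (sensor_id, ts), as in the question. Hypothetical helper, not a psycopg API.
    """
    # De-duplicate and sort the ids so the parameter list is deterministic.
    ids = sorted(set(sensor_ids))
    sql = """
        SELECT ids.sensor_id, latest.ts
        FROM unnest(%s::bigint[]) AS ids(sensor_id)
        CROSS JOIN LATERAL (
            SELECT sv.ts
            FROM sensor_values AS sv
            WHERE sv.sensor_id = ids.sensor_id
              AND sv.ts >= %s          -- lower bound; may allow older partitions to be pruned
            ORDER BY sv.ts DESC
            LIMIT 1
        ) AS latest
    """
    return sql, (ids, since)

sql, params = build_latest_ts_query(
    [42, 7, 42], datetime.now(timezone.utc) - timedelta(days=100)
)
# With an open psycopg connection this would be run as: cur.execute(sql, params)
```

The ids are passed as a Python list, which psycopg adapts to a PostgreSQL array. Ids with no row newer than the lower bound are simply absent from the result.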

Related

Index for self-join on timestamp range and user_id

I have a table in a postgresql (10.2) database something like this...
create table test (user_id text, event_time timestamp, ...);
I'd like to use this table in a self join, to match records to other records from the same user_id and an event_time within the next 5 minutes. Something like this...
select
*
from
test as a
inner join
test as b
on
a.user_id = b.user_id
and a.event_time < b.event_time
and a.event_time > b.event_time - interval '5 minutes';
This works fine, but I'd ideally like to make it faster. I've gotten the join to use an index on user_id, but I'm wondering if it's possible to make an index to use both user_id AND the timestamp?
I've tried making a gist index on a tsrange from the event time to the event time plus 5 minutes, but Postgres seemed to just use the user_id index in that case. I tried making a multi-column index on the user_id and the tsrange, but that doesn't seem supported.
Finally, I tried making an index on just the timestamp.
None of that seemed to help.
However, the timestamps cover a long time period, and I'm only interested in a 5-minute window, which intuitively feels like something a good index should help with.
Can this be accomplished?
A multi-column index on the user_id text and the event_time timestamp should work. A gist index on the range would need to include the user id as well, and it would be less versatile since it would work only with the fixed interval of 5 minutes. I wouldn't use it unless what you actually want is to establish an exclusion constraint on the table.
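To see why a multi-column index fits this join, here is a small Python sketch that treats a sorted list of (user_id, event_time) tuples as a stand-in for a B-tree index on those two columns. It is an illustration only, not how PostgreSQL is implemented; the function name and the sample data are made up.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta

def later_events_in_window(index, user_id, event_time, window=timedelta(minutes=5)):
    """Return entries for user_id whose time lies in (event_time, event_time + window).

    `index` must be a list of (user_id, event_time) tuples sorted ascending,
    mimicking the ordering of a B-tree index on (user_id, event_time).
    """
    # First entry strictly after (user_id, event_time).
    lo = bisect_right(index, (user_id, event_time))
    # First entry at or after the end of the window (excluded from the result).
    hi = bisect_left(index, (user_id, event_time + window))
    return index[lo:hi]

t0 = datetime(2018, 3, 1, 12, 0, 0)
index = sorted([
    ("alice", t0),
    ("alice", t0 + timedelta(minutes=2)),
    ("alice", t0 + timedelta(minutes=7)),
    ("bob", t0 + timedelta(minutes=1)),
])
matches = later_events_in_window(index, "alice", t0)
```

Both bounds are found by searching inside one user's contiguous block of entries, which an index on user_id alone cannot offer: there, every row of the user has to be fetched and filtered by time afterwards.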

Table specifically built for a dashboard has several filters.... best way to index?

I have created a materialized view for the purposes of feeding into a dashboard.
My goal is to make this table selectable in the fastest way possible and I'm not sure how to approach it. I was hoping that if I describe the table and how it will be used, someone could offer some direction.
The context is a website with funnel steps. Each row is an instance of a user triggering a funnel step such as add to cart, checkout, payment details and, finally, transaction.
Since the table is for the purposes of analytics, it will be refreshed automatically with cron once a day only, in the morning, so I'm not worried about real time update speed, only select speed with various where clauses.
Suppose I have the fields described below:
(N = ~13M rows, expected to be ~20M by January, growing by several million per month)
The table is unique on the combination of session id, user id and funnel step.
- Session Id (Id, so some duplication but generally very very granular - Varchar)
- User Id (Id, so some duplication but generally very very granular - Varchar)
- Date (Date)
- Funnel Step (10 distinct values - Varchar)
- Device Category (3 distinct values - Varchar)
- Country (~ 100 distinct values - varchar)
- City (~1000+ distinct values - varchar)
- Source (several thousand distinct values, nevertheless, stakeholder would like a filter - varchar)
Should I index each field individually? Or should I index all fields in a single index? Per the documentation, I think I can index up to 32 fields at once. But would that be advisable here, given that my primary goal is select query speed over everything else?
The table will feed into dashboard that reads the table and dynamically translates filter inputs into where clauses. Each time the user adjusts a filter, the table will be read and grouped and aggregated based on the filter / where clause inputs.
Example query:
select
event_action,
count(distinct user_id) as users
from website_data.ecom_funnel
where date >= $input_start_date
and date <= $input_end_date
and device_category in ($mobile, $desktop, $tablet)
and country in ($list of all countries minus any not selected)
and source in ($list of all sources minus any not selected)
group by 1 order by users desc
This will result in a funnel shaped table of data.
I cannot aggregate before hand because the primary metric of concern is users, not sessions. These must be de-duped from the underlying table. Classic example... Suppose a person visits a website once a day for a week. Then the sum of unique visitors for that week is 1, however if I summed visitors by day I would get 7. Similar with my table, some users take multiple sessions to complete the funnel. So, this is why I cannot pre aggregate the table, since I need to apply filters to the underlying data and then count(distinct user id).
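The week-of-visits example can be written out in a few lines of Python. This is only an illustration of the counting problem, with a made-up user id and dates.

```python
from datetime import date, timedelta

# One hypothetical user visiting once a day for a week: (visit_date, user_id).
start = date(2019, 11, 4)
visits = [(start + timedelta(days=i), "user-1") for i in range(7)]

# Pre-aggregated per day: count distinct users within each day.
users_by_day = {}
for day, user_id in visits:
    users_by_day.setdefault(day, set()).add(user_id)
sum_of_daily_uniques = sum(len(users) for users in users_by_day.values())

# De-duplicated over the whole week, as count(distinct user_id) would do.
weekly_uniques = len({user_id for _, user_id in visits})

print(sum_of_daily_uniques, weekly_uniques)  # prints "7 1"
```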
Here's explain on a subset of fields if it is useful:
QUERY PLAN
Sort (cost=862194.66..862194.68 rows=9 width=24)
Sort Key: (count(DISTINCT client_id)) DESC
-> GroupAggregate (cost=847955.01..862194.51 rows=9 width=24)
Group Key: event_action
-> Sort (cost=847955.01..852701.48 rows=1898589 width=37)
Sort Key: event_action
-> Seq Scan on ecom_funnel (cost=0.00..589150.14 rows=1898589 width=37)
Filter: ((device_category = ANY ('{mobile,desktop}'::text[])) AND (source = 'google'::text))
My overarching, specific question is, given my use case, should I index each field individually or should I create one single index? Does it matter?
On top of that, any tips for optimising this materialized view to run a select query faster would be appreciated.
Looking at your filter conditions, you should check the cardinality of the device_category field by running
select device_category, count(*) from website_data.ecom_funnel group by device_category
and looking at the values to determine whether an index should lead with this column. A possible index here (without knowing the cardinality) would be multicolumn and include:
(device_category, date)
That said, there is no benefit in creating an index on each separate column, as your query wouldn't use them all, so it does matter. You would also slow down the write operations (inserts, updates, deletes) on the table.
Creating one index on all columns probably won't speed things up much either, but that depends on the data in the table and on how selective your filters are (the cardinality of the values in the filtered columns). It would most likely add a large overhead: walking the index tree and then fetching the matching rows from the table.
Summing up, I would try to narrow the index down to the columns that matter most in your filtering, meaning the ones that cut out most of the data being retrieved. If your query is meant to return the majority of rows in the table, an index will not speed things up, and unfortunately you would need to pre-aggregate instead.
Hope it helps.
Edit: I've just read that you already posted the count of distinct values in your table. I'm not sure what Funnel Step corresponds to in your table, but assuming it's the column named event_action, it might be beneficial to instead create an index that also helps with the grouping:
(date, event_action)
Your query groups with the positional form GROUP BY 1; writing GROUP BY event_action explicitly is clearer, since that is the column your select list returns.
If you narrow the date range down to several days or months every time you run a select query, it might be a huge benefit to create an index with date as its first column.
Remember that the position of a column in an index matters.
If you query values spanning several months, say, you should pre-aggregate and store the precalculated values for each month in another table, and then UNION ALL that data with the current query, which would only select data from the current (still being updated) period.

Looping through unique dates in PostgreSQL

In Python (pandas) I read from my database and then I use a pivot table to aggregate data each day. The raw data I am working on is about 2 million rows per day and it is per person and per 30 minutes. I am aggregating it to be daily instead so it is a lot smaller for visualization.
So in pandas, I would read each date into memory and aggregate it and then load it into a fresh table in postgres.
How can I do this directly in postgres? Can I loop through each unique report_date in my table, groupby, and then append it to another table? I am assuming doing it in postgres would be fast compared to reading it over a network in python, writing a temporary .csv file, and then writing it again over the network.
Here's an example: Suppose that you have a table
CREATE TABLE post (
posted_at timestamptz not null,
user_id integer not null,
score integer not null
);
representing the scores various users have earned from posts they made in an SO-like forum. Then the following query
SELECT user_id, posted_at::date AS day, sum(score) AS score
FROM post
GROUP BY user_id, posted_at::date;
will aggregate the scores per user per day.
Note that the cast to date uses the session's TimeZone setting, so the day changes at 00:00 UTC (like SO does) only if the session time zone is UTC. If you want a specific time zone, say midnight Paris time, then you can do it like so:
SELECT user_id, (posted_at AT TIME ZONE 'Europe/Paris')::date AS day, sum(score) AS score
FROM post
GROUP BY user_id, (posted_at AT TIME ZONE 'Europe/Paris')::date;
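To get good performance for the above queries, you might want to create an expression index on the grouping columns. Note that PostgreSQL only accepts immutable expressions in an index, and casting a timestamptz directly to date depends on the session time zone, so the index needs an explicit zone, for example (user_id, ((posted_at AT TIME ZONE 'Europe/Paris')::date)) for the second query. The query has to use the same expression for the index to be considered.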
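The effect of the day boundary can be checked outside the database with a short Python sketch. It sums scores per user and calendar day for two made-up posts, once with UTC days and once with a fixed UTC+2 offset standing in for Paris in summer. A fixed offset is an assumption made to keep the example self-contained; a real time zone also has to handle daylight saving changes.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

UTC_PLUS_2 = timezone(timedelta(hours=2))  # fixed offset standing in for Paris in summer

def daily_scores(posts, tz):
    """Sum score per (user_id, local calendar day) for timezone-aware timestamps."""
    totals = defaultdict(int)
    for posted_at, user_id, score in posts:
        # Convert to the target zone before taking the date, like AT TIME ZONE ... ::date.
        day = posted_at.astimezone(tz).date()
        totals[(user_id, day)] += score
    return dict(totals)

# Two hypothetical posts by user 1, one hour apart around local midnight.
posts = [
    (datetime(2023, 7, 1, 21, 30, tzinfo=timezone.utc), 1, 10),
    (datetime(2023, 7, 1, 22, 30, tzinfo=timezone.utc), 1, 5),
]
by_utc_day = daily_scores(posts, timezone.utc)
by_paris_day = daily_scores(posts, UTC_PLUS_2)
```

With UTC days both posts fall on 1 July and are summed together; with the UTC+2 offset the second post lands on 2 July, so the same data yields two daily rows.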

use of redshift keys to make query efficient

I have a redshift table with hundreds of millions of rows. My typical query looks like this...
select * from table where senddate > '2015-01-01 00:00:00' and senddate < '2015-08-01 00:00:00' and username = 'xyz'
I am not sure how sort and distribution keys work. I would like to know what the best option is to make the query efficient.
I have around 3,000 unique usernames and senddate is a date within last 5 years.
I have one more question:
I am not using any compression for this table. Does that make the query slow?
Never use select * in a columnar DB, only pull the columns which are needed.
If this is the only query you want to run, distribution keys don't matter. You can use DISTSTYLE ALL, but it will take n times the storage, where n is the number of nodes. That said, if you are going to join tables, distribute them on the joining keys.
You can have a sortkey on senddate, username to avoid reading all the records (similar to a table scan in row-stores).
Read through the following to get a basic understanding of these points:
http://docs.aws.amazon.com/redshift/latest/dg/c-optimizing-query-performance.html

PostgreSQL does not order timestamp column correctly

I have a table in a PostgreSQL database with a column of TIMESTAMP WITHOUT TIME ZONE type. I need to order the records by this column and apparently PostgreSQL has some trouble doing it as both
...ORDER BY time_column
and
...ORDER BY time_column DESC
give me the same order of elements for my 3-element sample of records, which have the same time_column value apart from the milliseconds.
It seems that while sorting, it does not consider milliseconds in the value.
I am sure the milliseconds are in fact stored in the database because when I fetch the records, I can see them in my DateTime field.
When I first load all the records and then order them by the time_column in memory, the result is correct.
Am I missing some option to make the ordering behave correctly?
EDIT: I was apparently missing a lot. The problem was not in PostgreSQL, but in NHibernate stripping the milliseconds off the DateTime property.
It's a foolish notion that PostgreSQL wouldn't be able to sort timestamps correctly.
Run a quick test and rest assured:
CREATE TEMP TABLE t (x timestamp without time zone);
INSERT INTO t VALUES
('2012-03-01 23:34:19.879707')
,('2012-03-01 23:34:19.01386')
,('2012-03-01 23:34:19.738593');
SELECT x FROM t ORDER by x DESC;
SELECT x FROM t ORDER by x;
q.e.d.
Then try to find out what's really happening in your query. If you can't, post a test case and you will be helped presto pronto.
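The reported symptom, the same order for ascending and descending, is what a stable sort produces when the sub-second part is lost before comparison. The Python sketch below shows this with the three values from the test above. It illustrates the symptom only and is not a reconstruction of what NHibernate does; strip_fraction is a made-up helper.

```python
from datetime import datetime

# The three sample timestamps, in insertion order.
rows = [
    datetime(2012, 3, 1, 23, 34, 19, 879707),
    datetime(2012, 3, 1, 23, 34, 19, 13860),
    datetime(2012, 3, 1, 23, 34, 19, 738593),
]

def strip_fraction(ts):
    """Drop sub-second precision, as a lossy client-side mapping might."""
    return ts.replace(microsecond=0)

# Full precision: ascending and descending are exact mirrors of each other.
asc_full = sorted(rows)
desc_full = sorted(rows, reverse=True)

# Truncated keys: all three compare equal, so the stable sort keeps the
# input order in both directions.
asc_truncated = sorted(rows, key=strip_fraction)
desc_truncated = sorted(rows, key=strip_fraction, reverse=True)
```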
Try casting your column to timestamp like this (your_table is a placeholder for the actual table name):
SELECT * FROM your_table
ORDER BY time_column::timestamp;