Count distinct users over n-days - postgresql

My table consists of two fields, CalDay a timestamp field with time set on 00:00:00 and UserID.
Together they form a compound key but it is important to have in mind that we have many rows for each given calendar day and there is no fixed number of rows for a given day.
Based on this dataset I would need to calculate how many distinct users there are over a set window of time, say 30d.
Using postgres 9.3 I cannot use COUNT(DISTINCT UserID) OVER ..., nor can I work around the issue using DENSE_RANK() OVER (... RANGE BETWEEN), because RANGE only accepts UNBOUNDED.
So I went the old fashioned way and tried with a scalar subquery:
SELECT
    xx.*
    ,(
        SELECT COUNT(DISTINCT UserID)
        FROM data_table AS yy
        WHERE yy.CalDay BETWEEN xx.CalDay - interval '30 days' AND xx.CalDay
    ) AS rolling_count
FROM data_table AS xx
ORDER BY xx.CalDay
In theory, this should work, right? I am not sure yet because I started the query about 20 mins ago and it is still running. Here lies the problem, the dataset is still relatively small (25000 rows) but will grow over time. I would need something that scales and performs better.
I was thinking that maybe - just maybe - using the unix epoch instead of the timestamp could help but it is only a wild guess. Any suggestion would be welcome.

This should work. I can't comment on speed, but it should be a lot faster than your current query. Hopefully you have indexes on both of these fields.
SELECT t1.calday, COUNT(DISTINCT t1.userid) AS daily, COUNT(DISTINCT t2.userid) AS last_30_days
FROM data_table t1
JOIN data_table t2
ON t2.calday BETWEEN t1.calday - '30 days'::INTERVAL AND t1.calday
GROUP BY t1.calday
UPDATE
Tested it with a lot of data. The above works but is slow. Much faster to do it like this:
SELECT t1.*, COUNT(DISTINCT t2.userid) AS last_30_days
FROM (
SELECT calday, COUNT(DISTINCT userid) AS daily
FROM data_table
GROUP BY calday
) t1
JOIN data_table t2
ON t2.calday BETWEEN t1.calday - '30 days'::INTERVAL AND t1.calday
GROUP BY 1, 2
So instead of building up a massive table for all the JOIN combinations and then grouping/aggregating, it first gets the "daily" data, then joins the 30 day on that. Keeps the join much smaller and returns quickly (just under 1 second for 45000 rows in the source table on my system).
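To make the two-step idea concrete outside the database, here is a minimal Python sketch. It is not a translation of the query: it first collapses the rows to one set of users per day, then unions the sets that fall inside the window. The function name rolling_distinct_users and the sample rows are made up for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def rolling_distinct_users(rows, window_days=30):
    """rows: iterable of (calday, userid). Returns {calday: (daily, rolling)}."""
    # Step 1: collapse to one set of users per day (the "daily" step).
    users_by_day = defaultdict(set)
    for calday, userid in rows:
        users_by_day[calday].add(userid)

    result = {}
    for day, todays_users in users_by_day.items():
        window_start = day - timedelta(days=window_days)
        # Step 2: union the users of every day in [day - window, day].
        # Both ends are inclusive, like BETWEEN in the SQL.
        seen = set()
        for other_day, other_users in users_by_day.items():
            if window_start <= other_day <= day:
                seen |= other_users
        result[day] = (len(todays_users), len(seen))
    return result


# Hypothetical sample data: (calendar day, user id)
rows = [
    (date(2024, 1, 1), "a"),
    (date(2024, 1, 1), "b"),
    (date(2024, 1, 15), "a"),
    (date(2024, 1, 15), "c"),
    (date(2024, 2, 20), "d"),
]
counts = rolling_distinct_users(rows)
for day in sorted(counts):
    print(day, counts[day])
```

The work grows with the number of distinct days rather than with the number of raw row pairs, which is the same reason the second query is faster than the first.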

Related

On-demand Median Aggregation on a Large Dataset

TLDR: I need to make several median aggregations on a large dataset for a webapp, but the performance is poor. Can my query be improved/is there a better DB than AWS Redshift for this use-case?
I'm working on a team project which involves on-demand aggregations of a large dataset for visualization through our web-app. We're using Amazon Redshift loaded with almost 1,000,000,000 rows, dist-key by date (we have data from 2014 up to today's date, with 900,000 data points being ingested every day) and sort-key by a unique id. The unique id has a possibly one-to-many relationship with other unique ids, where the 'many' side can be thought of as the id's 'children'.
Due to confidentiality, think of the table structures like this
TABLE NAME: meal_nutrition
DISTKEY(date),
SORTKEY(patient_id),
patient_name varchar,
calories integer,
fat integer,
carbohydrates integer,
protein integer,
cholesterol integer,
sodium integer,
calories integer
TABLE NAME: patient_hierarchy
DISTKEY(date date),
SORTKEY(patient_id integer),
parent_id integer,
child_id integer,
distance integer
Think of this as a world for which there's a hierarchy of doctors. Patients are encapsulated as both actual patients and the doctors themselves, for which doctors can be the patient of other doctors. Doctors can transfer ownership of patients/doctors at any time, so the hierarchy is constantly changing.
DOCTOR (id: 1)
/ \
PATIENT(id: 2) DOCTOR (id: 3)
/ \ \
P (id: 4) D (id: 8) D(id: 20)
/ \ / \ / \ \
................
One visualization that we're having trouble with (due to performance) is a time-series graph showing the day-to-day median of several metrics for which the default date-range must be 1 year. So in this example, we want the median of fats, carbohydrates, and proteins of all meals consumed by a patient/doctor and their 'children', given a patient_id. The query used would be:
SELECT patient_name,
date,
max(median_fats),
max(median_carbs),
max(median_proteins)
FROM (SELECT mn.date date,
ph.patient_name patient_name,
MEDIAN(fats) over (PARTITION BY date) AS median_fats,
MEDIAN(carbohydrates) over (PARTITION BY date) AS median_carbs,
MEDIAN(proteins) over (PARTITION BY date) AS median_proteins
FROM meal_nutrition mn
JOIN patient_hierarchy ph
ON (mn.patient_id = ph.child_id)
WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
AND ph.parent_id = ?
AND date >= '2016-12-17' and date <= '2017-12-17'
)
GROUP BY date, patient_name
The heaviest operations in this query are the sorts for each of the medians (each requires sorting ~200,000,000 rows), but we cannot avoid this. As a result, this query takes ~30s to complete, which translates to bad UX. Can the query I'm making be improved? Is there a better DB for this kind of use-case? Thanks!
As said in the comments, the sorting and distribution of your data is very important. With distribution by date, if you take just one date slice of patient_hierarchy, all the data you're using sits on one node. It's better to distribute by meal_nutrition.patient_id and patient_hierarchy.child_id, so that rows that are joined likely sit on the same node, and to sort the tables by date,patient_id and date,child_id respectively, so you can find the necessary date slices/ranges efficiently and then look up patients efficiently.
As for the query itself, there are some options that you can try:
1) Approximate median like this:
SELECT mn.date date,
ph.patient_name patient_name,
APPROXIMATE PERCENTILE_DISC (0.5) WITHIN GROUP (ORDER BY fats) AS median_fats
FROM meal_nutrition mn
JOIN patient_hierarchy ph
ON (mn.patient_id = ph.child_id)
WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
AND ph.parent_id = ?
AND date >= '2016-12-17' and date <= '2017-12-17'
GROUP BY 1,2
Notes: this might not work if the memory stack is exceeded. Also, you can have only one such function per subquery, so you can't get fats, carbs and proteins in the same subquery, but you can calculate them separately and then join. If this works, you can then test the accuracy by running your 30s statement for a few IDs and comparing results.
2) Binning. First group by each value, or set reasonable bins, then find the group/bin that is in the middle of the distribution. That will be your median. One variable example would be:
WITH
groups as (
    SELECT mn.date date,
        ph.patient_name patient_name,
        fats,
        count(1) as cnt
    FROM meal_nutrition mn
    JOIN patient_hierarchy ph
        ON (mn.patient_id = ph.child_id)
    WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
        AND ph.parent_id = ?
        AND mn.date >= '2016-12-17' and mn.date <= '2017-12-17'
    GROUP BY 1,2,3
)
,running_groups as (
    SELECT *
        ,sum(cnt) over (partition by date, patient_name order by fats rows between unbounded preceding and current row) as running_total
        ,sum(cnt) over (partition by date, patient_name) as total
    FROM groups
)
,distance_from_median as (
    SELECT *
        ,row_number() over (partition by date, patient_name order by abs(0.5-(1.0*running_total/total))) as distance_from_median
    FROM running_groups
)
SELECT
    date,
    patient_name,
    fats
FROM distance_from_median
WHERE distance_from_median=1
That would likely allow grouping values on each individual node, and subsequent operations on the bins will be more lightweight and avoid sorting the raw sets. Again, you have to benchmark. The fewer unique values you have, the higher your performance gain will be, because you'll have a small number of bins out of a big number of raw values and sorting will be much cheaper. The result is accurate except in the case of an even number of distinct values (for 1,2,3,4 it would return 2, not 2.5), but this is solvable by adding another layer if it's critical. The main question is whether the approach itself improves performance significantly.
3) Materialize the calculation for every date/patient id. If your only parameter is patient and you always calculate medians for the last year, you can run the query overnight into a summary table and query that one. It's the better option even if (1) or (2) helps to optimize performance. You can also copy the summary table to a Postgres instance after materializing and use it as the backend for your app; you'll have better ping (Redshift is good for materializing large amounts of data but not good as a web app backend). It comes with the cost of maintaining a data transfer job, so if materializing/optimization did a good enough job you can leave it in Redshift.
I'm really interested in getting feedback if you try any of the suggested options; this is a good use case for Redshift.
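For option 2 (binning), a minimal Python sketch of the idea may help when checking results by hand. Note that it swaps in a slightly different selection rule: instead of ranking bins by their distance from 0.5, it returns the first bin whose running total reaches half of all rows, which is the PERCENTILE_DISC(0.5) definition and stays correct when the data is heavily skewed. The function name binned_median is made up for illustration.

```python
from collections import Counter


def binned_median(values):
    """Discrete (lower) median computed from value counts, not from a full sort."""
    # Step 1: one bin per distinct value with its frequency (the "groups" step).
    bins = sorted(Counter(values).items())
    total = sum(count for _, count in bins)

    # Step 2: walk the bins in value order, keeping a running total.
    running_total = 0
    for value, count in bins:
        running_total += count
        # Step 3: the first bin that covers at least half of the rows
        # contains the median. Integer arithmetic avoids float ties.
        if 2 * running_total >= total:
            return value
    raise ValueError("binned_median() needs at least one value")


print(binned_median([1, 2, 3, 4]))        # even count: lower of the two middle values
print(binned_median([1] * 4 + [2] * 11))  # skewed data
```

Only the distinct values are sorted, so the cost depends on the number of bins rather than on the number of raw rows.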

How to count timestamps that have an interval bigger than xx seconds to the next row in PostgreSQL

I have a table with 3 columns (postgres 9.6): serial, timestamp, clock_name
Usually there is a 1 second difference between each row, but sometimes the interval is bigger.
I'm trying to get the number of occasions on which the timestamp interval between 2 rows was bigger than 10 seconds (let's say I limit this to 1000 rows).
I would like to do this in one query (probably a select from a select), but I have no idea how to write such a query; my SQL knowledge is very basic.
Any help will be appreciated.
You can use window functions to retrieve the next record given the current record.
By using ORDER BY in the window function to ensure rows are in timestamp order, and PARTITION BY to keep the clocks separate, you can find for each row the row that follows it.
WITH links AS
(
SELECT
id, ts, clock, LEAD(ts) OVER (PARTITION BY clock ORDER BY ts) AS next_ts
FROM myTable
)
SELECT * FROM links
WHERE
EXTRACT(EPOCH FROM (next_ts - ts)) > 10
You can then just compare the time stamps.
Window functions https://www.postgresql.org/docs/current/static/functions-window.html
Or if you prefer to use derived tables instead of WITH clause.
SELECT * FROM (
SELECT
id, ts, clock, LEAD(ts) OVER (PARTITION BY clock ORDER BY ts) AS next_ts
FROM myTable
) links
WHERE
EXTRACT(EPOCH FROM (next_ts - ts)) > 10
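Since the question asks for the number of such occasions rather than the rows themselves, here is a minimal Python sketch of the same pairing logic that ends in a count. It pairs each timestamp with its successor within the same clock, which is the role LEAD plays above. The function name count_gaps and the sample rows are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def count_gaps(rows, threshold=timedelta(seconds=10)):
    """rows: iterable of (clock_name, timestamp).

    Returns how many consecutive pairs, within one clock, are further
    apart than threshold.
    """
    by_clock = defaultdict(list)
    for clock_name, ts in rows:
        by_clock[clock_name].append(ts)  # PARTITION BY clock

    gaps = 0
    for stamps in by_clock.values():
        stamps.sort()  # ORDER BY ts
        # zip pairs each row with the one that follows it, like LEAD(ts).
        for current, following in zip(stamps, stamps[1:]):
            if following - current > threshold:
                gaps += 1
    return gaps


# Hypothetical sample data
t0 = datetime(2024, 1, 1, 12, 0, 0)
rows = [("A", t0 + timedelta(seconds=s)) for s in (0, 1, 2, 15, 16, 40)]
rows += [("B", t0 + timedelta(seconds=s)) for s in (0, 5)]
total_gaps = count_gaps(rows)
print(total_gaps)
```

In SQL, the equivalent step would be to select COUNT(*) instead of * from the filtered rows.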

PostgreSQL row diff timestamp, and calculate stddev for group

I have a table with an ID column called mmsi and another column of timestamp, with multiple timestamps per mmsi.
For each mmsi I want to calculate the standard deviation of the difference between consecutive timestamps.
I'm not very experienced with SQL but have tried to construct a function as follows:
SELECT
mmsi, stddev(time_diff)
FROM
(SELECT mmsi,
EXTRACT(EPOCH FROM (timestamp - lag(timestamp) OVER (ORDER BY mmsi ASC, timestamp ASC)))
FROM ais_messages.ais_static
ORDER BY mmsi ASC, timestamp ASC) AS time_diff
WHERE time_diff IS NOT NULL
GROUP BY mmsi;
Your query looks on the right track, but it has several problems. You labelled your subquery, which looks almost right, with an alias which you then select. But this subquery returns multiple rows and columns so this doesn't make any sense. Here is a corrected version:
SELECT
t.mmsi,
STDDEV(t.time_diff) AS std
FROM
(
SELECT
mmsi,
EXTRACT(EPOCH FROM (timestamp - LAG(timestamp) OVER
(PARTITION BY mmsi ORDER BY timestamp))) AS time_diff
FROM ais_messages.ais_static
ORDER BY mmsi, timestamp
) t
WHERE t.time_diff IS NOT NULL
GROUP BY t.mmsi
This approach should be fine, but there is one edge case where it might not behave as expected. If a given mmsi group has only one record, then it will not appear in the result set of standard deviations at all. This is because the LAG calculation returns NULL for that single record, and the row is filtered out.
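The edge case can be reproduced with a minimal Python sketch of the same calculation. It assumes the sample standard deviation (statistics.stdev), which needs at least two differences, so a group with exactly two records gets None here. The function name stddev_of_gaps and the sample rows are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import stdev


def stddev_of_gaps(rows):
    """rows: iterable of (mmsi, timestamp). Returns {mmsi: stddev in seconds}."""
    by_id = defaultdict(list)
    for mmsi, ts in rows:
        by_id[mmsi].append(ts)  # PARTITION BY mmsi

    result = {}
    for mmsi, stamps in by_id.items():
        stamps.sort()  # ORDER BY timestamp
        # Difference between each timestamp and the previous one, like LAG.
        diffs = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
        if not diffs:
            # A single record has no previous row, so the group drops out,
            # just as it does after the IS NOT NULL filter.
            continue
        # stdev() raises for fewer than two data points, so report None instead.
        result[mmsi] = stdev(diffs) if len(diffs) > 1 else None
    return result


# Hypothetical sample data
t0 = datetime(2024, 1, 1)
rows = [(1, t0 + timedelta(seconds=s)) for s in (0, 10, 30, 60)]
rows += [(2, t0)]                               # only one record
rows += [(3, t0), (3, t0 + timedelta(seconds=5))]  # only one difference
result = stddev_of_gaps(rows)
print(result)
```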

Use generate_series to create a table

In Amazon Redshift, generate_series() seems to be supported on the leader node, but not on the compute nodes. Is there a way to use generate_series to create a table on the leader node, and then push it to the compute nodes?
This query runs fine, running on the leader node:
with
date_table as (select now()::date - generate_series(0, 7 * 10) as date),
hour_table as (select generate_series(0, 24) as hour),
time_table as (
select
date_table.date::date as date,
extract(year from date_table.date) as year,
extract(month from date_table.date) as month,
extract(day from date_table.date) as day,
hour_table.hour
from date_table CROSS JOIN hour_table
)
SELECT *
from time_table
However, this query fails:
create table test
diststyle all
as (
with
date_table as (select now()::date - generate_series(0, 7 * 10) as date),
hour_table as (select generate_series(0, 24) as hour),
time_table as (
select
date_table.date::date as date,
extract(year from date_table.date) as year,
extract(month from date_table.date) as month,
extract(day from date_table.date) as day,
hour_table.hour
from date_table CROSS JOIN hour_table
)
SELECT *
from time_table
);
The only solution I can think of right now is to pull the query results into another program (e.g. python) and then insert the result into the database, but that seems hackish.
For those of you who've never used Redshift, it's a heavily modified variant of PostgreSQL and has lots of its own idiosyncrasies. The query below is completely valid and runs fine:
create table test diststyle all as (select 1 as a, 2 as b);
select * from test
yields:
a b
1 2
The problem stems from the difference between leader-node-only functions and compute-node functions on Redshift. I'm pretty sure it's not due to a bug in my query.
I have not found a way to use leader-node only functions to create tables. There is not (AFAICT) any magic syntax that you can use to make them load their output back to a table.
I ended up using number tables to achieve a similar outcome. Even a huge number table will take up very little space on your Redshift cluster with run-length compression.
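A minimal sketch of the number-table idea follows, using Python's built-in sqlite3 as a stand-in for Redshift so that it can be run anywhere. The table names numbers and time_table, the fixed date, and the SQLite date functions are assumptions for illustration; on Redshift the date arithmetic would be written with its own functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A plain numbers table: one row per integer, created once and reused.
conn.execute("CREATE TABLE numbers (n INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO numbers (n) VALUES (?)", [(i,) for i in range(100)])

TODAY = "2024-03-10"  # fixed hypothetical date so the result is reproducible

# Build the day/hour grid with ordinary joins and filters on the numbers
# table, so no set-returning function is involved.
conn.execute("CREATE TABLE time_table (day TEXT, hour INTEGER)")
conn.execute(
    """
    INSERT INTO time_table (day, hour)
    SELECT date(julianday(?) - d.n), h.n
    FROM numbers AS d
    CROSS JOIN numbers AS h
    WHERE d.n <= 70   -- today and the 70 days before it
      AND h.n < 24    -- hours 0..23
    """,
    (TODAY,),
)

row_count, first_day, last_day = conn.execute(
    "SELECT COUNT(*), MIN(day), MAX(day) FROM time_table"
).fetchone()
print(row_count, first_day, last_day)
```

Because the grid is produced by a regular SELECT over a regular table, the same statement shape can feed a CREATE TABLE ... AS or INSERT ... SELECT.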

PostgreSQL - get records with null values

I'm trying to write a query that shows distributors that haven't sold anything in 90 days, but the problem I get is with NULL values. It seems PostgreSQL ignores NULL values, even when I query to show them (or maybe I did it the wrong way).
Let's say there are 1000 distributors. With this query I get only 1 distributor, but there should be more that didn't sell anything, because if I write a query to show distributors that sold any amount in the last 90 days, it shows about 500. So I wonder where the other 499 are. If I understand correctly, those other 499 didn't have any sales, so all their records are NULL and are not shown in the query.
Does anyone know how to make it show NULL values from one table when the related row in the other table is not NULL (e.g. the partners table (res_partner) is not NULL, but the sale_order table (sales) is NULL)? I also tried filtering with so.id IS NULL, but that way I get an empty result.
Code of my query:
(
SELECT
min(f1.id) as id,
f1.partner as partner,
f1.sum1
FROM
(
SELECT
min(f2.id) as id,
f2.partner as partner,
sum(f2.null_sum) as sum1
FROM
(
SELECT
min(rp.id) as id,
rp.search_name as partner,
CASE
WHEN
sol.price_subtotal IS NULL
THEN
0
ELSE
sol.price_subtotal
END as null_sum
FROM
sale_order as so,
sale_order_line as sol,
res_partner as rp
WHERE
sol.order_id=so.id and
so.partner_id=rp.id
and
rp.distributor=TRUE
and
so.date_order <= now()::timestamp::date
and
so.date_order >= date_trunc('day', now() - '90 day'::interval)::timestamp::date
and
rp.contract_date <= date_trunc('day', now() - '90 day'::interval)::timestamp::date
GROUP BY
partner,
null_sum
)as f2
GROUP BY
partner
) as f1
WHERE
sum1=0
GROUP BY
partner,
sum1
)as fld
EDIT: 2012-09-18 11 AM.
I think I understand why PostgreSQL behaves like this. It is because of the time interval. It checks whether there is any non-NULL value in that interval. So it found only one record, because that record had a sale order with zero (it was not converted from NULL to zero), and the part that checked for NULL values was just skipped. If I delete the time interval, I see all distributors that didn't sell anything at all. But with the time interval, for some reason it stops checking NULL values and only looks for non-NULL values.
So does anyone know how to make it check for NULL values too in a given interval (the last 90 days, to be exact)?
Aggregates like sum() and min() do ignore NULL values. This is required by the SQL standard and every DBMS I know behaves like that.
If you want to treat a NULL value as e.g. a zero, then use something like this:
sum(coalesce(f2.null_sum, 0)) as sum1
But as far as I understand your question and your invalid query, you actually want an outer join between res_partner and the sales tables.
Something like this:
SELECT min(rp.id) as id,
    rp.search_name as partner,
    sum(coalesce(sol.price_subtotal,0)) as price_subtotal
FROM res_partner as rp
    -- date filters on sale_order belong in the ON clause; in WHERE they would
    -- discard the NULL rows the outer join produces for partners without sales
    LEFT JOIN sale_order as so ON so.partner_id=rp.id
        and so.date_order <= CURRENT_DATE
        and so.date_order >= date_trunc('day', now() - '90 day'::interval)::timestamp::date
    LEFT JOIN sale_order_line as sol ON sol.order_id=so.id
WHERE rp.distributor=TRUE
    and rp.contract_date <= date_trunc('day', now() - '90 day'::interval)::timestamp::date
GROUP BY rp.search_name
I'm not 100% sure I understood your problem correctly, but it might give you a headstart.
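One detail of outer joins is easy to miss: a condition on the outer-joined table behaves differently in WHERE than in ON. The sketch below shows this with Python's built-in sqlite3 standing in for PostgreSQL. The tables are cut down to a few hypothetical columns (amount replaces the sale_order_line subtotal) and the cutoff date is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE res_partner (id INTEGER PRIMARY KEY, search_name TEXT);
    CREATE TABLE sale_order (id INTEGER PRIMARY KEY, partner_id INTEGER,
                             date_order TEXT, amount REAL);
    INSERT INTO res_partner VALUES
        (1, 'active'), (2, 'old sales only'), (3, 'never sold');
    INSERT INTO sale_order VALUES
        (10, 1, '2012-09-01', 100.0),   -- inside the window
        (11, 2, '2012-01-15', 50.0);    -- outside the window
""")

WINDOW_START = "2012-06-20"  # hypothetical cutoff standing in for now() - 90 days

# Filter in WHERE: partners without a matching order have NULL date_order,
# the comparison is not true for them, and they vanish from the result.
in_where = conn.execute("""
    SELECT rp.search_name, COALESCE(SUM(so.amount), 0)
    FROM res_partner AS rp
    LEFT JOIN sale_order AS so ON so.partner_id = rp.id
    WHERE so.date_order >= ?
    GROUP BY rp.search_name
    ORDER BY rp.search_name
""", (WINDOW_START,)).fetchall()

# Filter in ON: it only limits which orders are joined; every partner stays.
in_on = conn.execute("""
    SELECT rp.search_name, COALESCE(SUM(so.amount), 0)
    FROM res_partner AS rp
    LEFT JOIN sale_order AS so
           ON so.partner_id = rp.id AND so.date_order >= ?
    GROUP BY rp.search_name
    ORDER BY rp.search_name
""", (WINDOW_START,)).fetchall()

print(in_where)
print(in_on)
```

With the filter in ON, partners with no sales in the window come back with a total of 0, so a HAVING clause on that total would list exactly the distributors without recent sales.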
Try to name subqueries, and reference their columns as q1.col, q2.col etc., so it is clear which query/subquery each column comes from. Maybe it's something simple, e.g. it merges several rows containing only NULLs into one row? Also, at least for debugging purposes, it's smart to add , count(*) to the select list of each query/subquery to get an explicit number of rows in the result. It's hard to guess what exactly happened.