On-demand Median Aggregation on a Large Dataset - postgresql

TLDR: I need to make several median aggregations on a large dataset for a webapp, but the performance is poor. Can my query be improved/is there a better DB than AWS Redshift for this use-case?
I'm working on a team project which involves on-demand aggregations of a large dataset for visualization through our web-app. We're using Amazon Redshift loaded with almost 1,000,000,000 rows, with a dist-key on date (we have data from 2014 up to today's date, with 900,000 data points being ingested every day) and a sort-key on a unique id. The unique id has a possibly one-to-many relationship with other unique ids, for which the 'many' relationship can be thought of as the id's 'children'.
Due to confidentiality, think of the table structures as something like this:
TABLE NAME: meal_nutrition
DISTKEY(date),
SORTKEY(patient_id),
patient_name varchar,
calories integer,
fats integer,
carbohydrates integer,
proteins integer,
cholesterol integer,
sodium integer
TABLE NAME: patient_hierarchy
DISTKEY(date),
SORTKEY(patient_id),
patient_name varchar,
parent_id integer,
child_id integer,
distance integer
Think of this as a world for which there's a hierarchy of doctors. Patients are encapsulated as both actual patients and the doctors themselves, for which doctors can be the patient of other doctors. Doctors can transfer ownership of patients/doctors at any time, so the hierarchy is constantly changing.
                DOCTOR (id: 1)
               /              \
      PATIENT (id: 2)       DOCTOR (id: 3)
        /         \                 \
   P (id: 4)   D (id: 8)        D (id: 20)
    /     \     /     \          /   |   \
   ................................
One visualization that we're having trouble with (due to performance) is a time-series graph showing the day-to-day median of several metrics, where the default date range must be one year. So in this example, we want the median of fats, carbohydrates, and proteins of all meals consumed by a patient/doctor and their 'children', given a patient_id. The query used would be:
SELECT patient_name,
       date,
       max(median_fats),
       max(median_carbs),
       max(median_proteins)
FROM (SELECT mn.date date,
             ph.patient_name patient_name,
             MEDIAN(fats) over (PARTITION BY mn.date) AS median_fats,
             MEDIAN(carbohydrates) over (PARTITION BY mn.date) AS median_carbs,
             MEDIAN(proteins) over (PARTITION BY mn.date) AS median_proteins
      FROM meal_nutrition mn
      JOIN patient_hierarchy ph
        ON (mn.patient_id = ph.child_id)
      WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
        AND ph.parent_id = ?
        AND mn.date >= '2016-12-17' AND mn.date <= '2017-12-17'
     ) sub
GROUP BY date, patient_name
The heaviest operations in this query are the sorts for each of the medians (each requiring a sort of ~200,000,000 rows), and we cannot avoid them. As a result, this query takes ~30s to complete, which translates to bad UX. Can the query I'm making be improved? Is there a better DB for this kind of use case? Thanks!

As said in the comments, the sorting/distribution of your data is very important. With distribution by date, if you take just one date slice of patient_hierarchy, all the data you're using sits on a single node. It's better to distribute by meal_nutrition.patient_id and patient_hierarchy.child_id, so that rows being joined are likely to sit on the same node, and to sort the tables by date,patient_id and date,child_id respectively, so you can find the necessary date slices/ranges efficiently and then look up patients efficiently.
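A DDL sketch of that layout (column types are assumptions based on the abstracted schema above; in Redshift you would have to recreate and reload/deep-copy the tables to change keys):
CREATE TABLE meal_nutrition (
    date          date,
    patient_id    integer,
    patient_name  varchar,
    calories      integer,
    fats          integer,
    carbohydrates integer,
    proteins      integer,
    cholesterol   integer,
    sodium        integer
)
DISTKEY (patient_id)
SORTKEY (date, patient_id);

CREATE TABLE patient_hierarchy (
    date         date,
    patient_id   integer,
    patient_name varchar,
    parent_id    integer,
    child_id     integer,
    distance     integer
)
DISTKEY (child_id)
SORTKEY (date, child_id);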
As for the query itself, there are some options that you can try:
1) Approximate median like this:
SELECT mn.date date,
       ph.patient_name patient_name,
       APPROXIMATE PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY fats) AS median_fats
FROM meal_nutrition mn
JOIN patient_hierarchy ph
  ON (mn.patient_id = ph.child_id)
WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
  AND ph.parent_id = ?
  AND mn.date >= '2016-12-17' AND mn.date <= '2017-12-17'
GROUP BY 1,2
Notes: this might not work if the memory stack is exceeded. Also, you can have only one such function per (sub)query, so you can't get fats, carbs and proteins in the same subquery, but you can calculate them separately and then join. If this works, you can then test the accuracy by running your 30s statement for a few IDs and comparing the results.
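One way to read "calculate them separately and then join" is one CTE per metric, roughly like this (a sketch; if Redshift also refuses multiple approximate percentiles within a single statement, materialize each part into its own temp table and join those instead):
WITH f AS (
    SELECT mn.date date, ph.patient_name patient_name,
           APPROXIMATE PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY fats) AS median_fats
    FROM meal_nutrition mn
    JOIN patient_hierarchy ph ON (mn.patient_id = ph.child_id)
    WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
      AND ph.parent_id = ?
      AND mn.date >= '2016-12-17' AND mn.date <= '2017-12-17'
    GROUP BY 1, 2
),
c AS (
    SELECT mn.date date, ph.patient_name patient_name,
           APPROXIMATE PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY carbohydrates) AS median_carbs
    FROM meal_nutrition mn
    JOIN patient_hierarchy ph ON (mn.patient_id = ph.child_id)
    WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
      AND ph.parent_id = ?
      AND mn.date >= '2016-12-17' AND mn.date <= '2017-12-17'
    GROUP BY 1, 2
),
p AS (
    SELECT mn.date date, ph.patient_name patient_name,
           APPROXIMATE PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY proteins) AS median_proteins
    FROM meal_nutrition mn
    JOIN patient_hierarchy ph ON (mn.patient_id = ph.child_id)
    WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
      AND ph.parent_id = ?
      AND mn.date >= '2016-12-17' AND mn.date <= '2017-12-17'
    GROUP BY 1, 2
)
SELECT f.date, f.patient_name, f.median_fats, c.median_carbs, p.median_proteins
FROM f
JOIN c ON (c.date = f.date AND c.patient_name = f.patient_name)
JOIN p ON (p.date = f.date AND p.patient_name = f.patient_name)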
2) Binning. First group by each value (or set reasonable bins), then find the group/bin that sits in the middle of the distribution; that will be your median. A one-variable example would be:
WITH
groups as (
    SELECT mn.date date,
           ph.patient_name patient_name,
           fats,
           count(1) as cnt
    FROM meal_nutrition mn
    JOIN patient_hierarchy ph
      ON (mn.patient_id = ph.child_id)
    WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
      AND ph.parent_id = ?
      AND mn.date >= '2016-12-17' AND mn.date <= '2017-12-17'
    GROUP BY 1,2,3
)
,running_groups as (
    SELECT *
          ,sum(cnt) over (partition by date, patient_name order by fats rows between unbounded preceding and current row) as running_total
          ,sum(cnt) over (partition by date, patient_name) as total
    FROM groups
)
,distance_from_median as (
    SELECT *
          ,row_number() over (partition by date, patient_name order by abs(0.5-(1.0*running_total/total))) as distance_from_median
    FROM running_groups
)
SELECT date,
       patient_name,
       fats
FROM distance_from_median
WHERE distance_from_median=1
That would likely allow grouping values on each individual node, and the subsequent operations on bins will be more lightweight and avoid sorting the raw sets. Again, you have to benchmark. The fewer unique values you have, the higher your performance gain will be, because you'll have a small number of bins out of a big number of raw values and sorting will be much cheaper. The result is exact except in the case of an even number of distinct values (for 1,2,3,4 it would return 2, not 2.5), but this is solvable by adding another layer if it's critical. The main question is whether the approach itself improves performance significantly.
3) Materialize the calculation for every date/patient id. If your only parameter is the patient and you always calculate medians for the last year, you can run the query overnight into a summary table and query that one. This is worth doing even if (1) or (2) helps to optimize performance. You can also copy the summary table to a Postgres instance after materializing and use it as the backend for your app; you'll have better ping (Redshift is good for materializing large amounts of data but not good as a web app backend). It comes with the cost of maintaining a data transfer job, so if materializing/optimization did a good enough job you can leave it in Redshift.
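For example, a nightly job could rebuild a summary table along these lines (median_summary is a made-up name; this simply extends the original query to every parent_id over the trailing year):
DROP TABLE IF EXISTS median_summary;

CREATE TABLE median_summary
DISTKEY (parent_id)
SORTKEY (parent_id, date)
AS
SELECT parent_id,
       date,
       patient_name,
       max(median_fats)     AS median_fats,
       max(median_carbs)    AS median_carbs,
       max(median_proteins) AS median_proteins
FROM (SELECT ph.parent_id,
             mn.date date,
             ph.patient_name patient_name,
             MEDIAN(fats) over (PARTITION BY ph.parent_id, mn.date) AS median_fats,
             MEDIAN(carbohydrates) over (PARTITION BY ph.parent_id, mn.date) AS median_carbs,
             MEDIAN(proteins) over (PARTITION BY ph.parent_id, mn.date) AS median_proteins
      FROM meal_nutrition mn
      JOIN patient_hierarchy ph
        ON (mn.patient_id = ph.child_id)
      WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
        AND mn.date >= dateadd(year, -1, current_date)
     ) sub
GROUP BY parent_id, date, patient_name;
The web app (or the Postgres copy) then only needs SELECT * FROM median_summary WHERE parent_id = ? ORDER BY date.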
I'm really interested in getting feedback if you try any of the suggested options; this is a good use case for Redshift.

Related

clickhouse downsample into OHLC time bar intervals

For a table containing, e.g., a (date, price) time series with prices every millisecond, how can this be downsampled into groups of open/high/low/close (OHLC) rows with a time interval of, e.g., one minute?
While the option with arrays will work, the simplest option here is to use a combination of GROUP BY over time intervals with the min, max, argMin, argMax aggregate functions.
SELECT
id,
minute,
max(value) AS high,
min(value) AS low,
avg(value) AS avg,
argMin(value, timestamp) AS first,
argMax(value, timestamp) AS last
FROM security
GROUP BY id, toStartOfMinute(timestamp) AS minute
ORDER BY minute
In ClickHouse you solve this kind of problem with arrays. Let's assume a table like the following:
CREATE TABLE security (
timestamp DateTime,
id UInt32,
value Float32
)
ENGINE=MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (id, timestamp)
You can downsample to one-minute intervals with a query like the following:
SELECT
id, minute, max(value) AS high, min(value) AS low, avg(value) AS avg,
arrayElement(arraySort((x,y)->y,
groupArray(value), groupArray(timestamp)), 1) AS first,
arrayElement(arraySort((x,y)->y,
groupArray(value), groupArray(timestamp)), -1) AS last
FROM security
GROUP BY id, toStartOfMinute(timestamp) AS minute
ORDER BY minute
The trick is to use array functions. Here's how to decode the calls:
groupArray gathers column data within the group into an array.
arraySort sorts the values using the timestamp order. We use a lambda function to provide the timestamp array as the sorting key for the first array of values.
arrayElement allows us to pick the first and last elements respectively.
To keep the example simple I used DateTime for the timestamp, which only has one-second resolution. You can use a UInt64 column to get any precision you want. I added an average to my query to help check results.

Count distinct users over n-days

My table consists of two fields: CalDay, a timestamp field with the time set to 00:00:00, and UserID.
Together they form a compound key, but it is important to keep in mind that we have many rows for each given calendar day, and there is no fixed number of rows for a given day.
Based on this dataset I would need to calculate how many distinct users there are over a set window of time, say 30 days.
Using Postgres 9.3 I cannot use COUNT(DISTINCT UserID) OVER ..., nor can I work around the issue using DENSE_RANK() OVER (... RANGE BETWEEN), because RANGE only accepts UNBOUNDED.
So I went the old-fashioned way and tried with a scalar subquery:
SELECT
    xx.*
    ,(
        SELECT COUNT(DISTINCT UserID)
        FROM data_table AS yy
        WHERE yy.CalDay BETWEEN xx.CalDay - interval '30 days' AND xx.CalDay
    ) AS rolling_count
FROM data_table AS xx
ORDER BY xx.CalDay
In theory, this should work, right? I am not sure yet, because I started the query about 20 minutes ago and it is still running. Here lies the problem: the dataset is still relatively small (25,000 rows) but will grow over time. I would need something that scales and performs better.
I was thinking that maybe - just maybe - using the unix epoch instead of the timestamp could help but it is only a wild guess. Any suggestion would be welcome.
This should work. I can't comment on speed, but it should be a lot faster than your current query. Hopefully you have indexes on both these fields.
SELECT t1.calday, COUNT(DISTINCT t1.userid) AS daily, COUNT(DISTINCT t2.userid) AS last_30_days
FROM data_table t1
JOIN data_table t2
ON t2.calday BETWEEN t1.calday - '30 days'::INTERVAL AND t1.calday
GROUP BY t1.calday
UPDATE
Tested it with a lot of data. The above works but is slow. Much faster to do it like this:
SELECT t1.*, COUNT(DISTINCT t2.userid) AS last_30_days
FROM (
SELECT calday, COUNT(DISTINCT userid) AS daily
FROM data_table
GROUP BY calday
) t1
JOIN data_table t2
ON t2.calday BETWEEN t1.calday - '30 days'::INTERVAL AND t1.calday
GROUP BY 1, 2
So instead of building up a massive table for all the JOIN combinations and then grouping/aggregating, it first gets the "daily" data, then joins the 30-day window onto that. That keeps the join much smaller and returns quickly (just under 1 second for 45,000 rows in the source table on my system).
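The indexes mentioned at the start of the answer could be created along these lines (a sketch; the index names are made up, the table/column names are taken from the question):
CREATE INDEX idx_data_table_calday ON data_table (calday);
CREATE INDEX idx_data_table_userid ON data_table (userid);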

Handling of multiple queries as one result

Let's say I have this table:
CREATE TABLE device_data_by_year (
year int,
device_id uuid,
sensor_id uuid,
nano_since_epoch bigint,
unit text,
value double,
source text,
username text,
PRIMARY KEY (year, device_id, nano_since_epoch,sensor_id)
) WITH CLUSTERING ORDER BY (device_id desc, nano_since_epoch desc);
I need to query data for a particular device and sensor in a period between 2017 and 2018. In this case 2 queries will be issued:
select * from device_data_by_year where year = 2017 AND device_id = ? AND sensor_id = ? AND nano_since_epoch >= ? AND nano_since_epoch <= ?
select * from device_data_by_year where year = 2018 AND device_id = ? AND sensor_id = ? AND nano_since_epoch >= ? AND nano_since_epoch <= ?
Currently I iterate over the result sets and build a List with all the results. I am aware that this could (and will) run into OOM problems some day. Is there a better approach to handling/merging the query results into one set?
Thanks
You can use IN to specify a list of years, but this is not a very optimal solution: because the year field is the partition key, the data will most probably be on different machines, so one of the nodes will act as the "coordinator" and will need to ask the other machines for results and aggregate the data. From a performance point of view, 2 async requests issued in parallel could be faster, with the merge then done on the client side.
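For reference, the IN form (mirroring the shape of the original queries) would be roughly:
select * from device_data_by_year
where year IN (2017, 2018)
  AND device_id = ?
  AND sensor_id = ?
  AND nano_since_epoch >= ?
  AND nano_since_epoch <= ?
Note that, as with the original two queries, restricting sensor_id after a range on nano_since_epoch may require ALLOW FILTERING with this primary key.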
P.S. Your data model has quite serious problems: you partition by year, which means:
Data isn't distributed well across the cluster - only N=RF machines will hold the data;
The partitions will be huge, even if you have only a hundred devices reporting one measurement per minute;
Only one partition will be "hot" - it will receive all writes during the current year, while the other partitions won't be used very often.
You can use months, or even days, as the partition key to decrease the partition size, but it still won't solve the problem of the "hot" partitions.
If I remember correctly, the Data Modelling course at DataStax Academy has an example of a data model for a sensor network.
Changed the table structure to:
CREATE TABLE device_data (
week_first_day timestamp,
device_id uuid,
sensor_id uuid,
nano_since_epoch bigint,
unit text,
value double,
source text,
username text,
PRIMARY KEY ((week_first_day, device_id), nano_since_epoch, sensor_id)
) WITH CLUSTERING ORDER BY (nano_since_epoch desc, sensor_id desc);
according to @AlexOtt's proposal. Some changes to the application logic are required - for example, findAllByYear now needs to iterate over single weeks, issuing one query per week along the lines of the sketch below.
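Each weekly query against the new table would look roughly like this (a sketch; the application supplies the week_first_day values that cover the requested range):
select * from device_data
where week_first_day = ?
  AND device_id = ?
  AND nano_since_epoch >= ?
  AND nano_since_epoch <= ?
As before, an additional equality restriction on sensor_id after the range on nano_since_epoch would likely need ALLOW FILTERING, so that filter may be better applied client-side.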
Coming back to the original question: would you rather send 52 queries (getDataByYear, one query per week) or would you use the IN operator here?

number of points within a radius of another set of points

I have two tables. One is a list of stores (with lat/long). The other is a list of customer addresses (with lat/long). What I want is a query that will return the number of customers within a certain radius for each store in my table. The query below gives me the total number of customers within 10,000 meters of ANY store, but I'm not sure how to loop it to return one row for each store with a count.
Note that I'm doing these queries using CartoDB, where the_geom is basically long/lat.
SELECT COUNT(*) as customer_count FROM customer_table
WHERE EXISTS(
SELECT 1 FROM store_table
WHERE ST_Distance_Sphere(store_table.the_geom, customer_table.the_geom) < 10000
)
This results in a single row:
customer_count
4009
Any suggestions on how to make this work for my problem? I'm open to doing this in other ways that might be more efficient (faster).
For reference, the column with the store names is store_identifier in store_table.
I'll assume that you use the_geom to represent the coordinates (lat/lon) of stores and customers. I will also assume that the_geom is of the geography type. Your query will be something like this:
select s.id, count(*) as customer_count
from customers c
inner join stores s
on st_dwithin(c.the_geom, s.the_geom, 10000)
group by s.id
This should give you a neat table with a store id and the count of customers within 10,000 meters of the store.
If the_geom is of type geometry, your query will be very similar, but you should use st_distance_sphere() instead in order to express the distance in meters (not degrees).
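A sketch of that geometry variant, using the same assumed table and column names as above:
select s.id, count(*) as customer_count
from customers c
inner join stores s
  on st_distance_sphere(c.the_geom, s.the_geom) < 10000
group by s.id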

Calculate Mode - "Highest frequency row" DB2

What would be the most efficient way to calculate the mode across tables with joins in DB2?
I am trying to get the value with the highest frequency (count) for a given column (ID - a candidate key for the joined table) on a given date.
The idea is to get the most common value from the table, which has different values for some accounts (for the same ID and date). We need to make it unique for use in another table.
You can use common table expressions (CTEs), indicated by WITH, to break the logic down into logical steps. First we'll build the summary rows, then we'll assign a ranking to the rows within each group, then pick out the ones with the highest count of records.
Let's say we want to know which flavor of each item sells the most frequently on each date (perhaps assuming each record is a quantity of one).
WITH s as
(
SELECT itemID, saleDate, flavor, count(*) as tally
FROM sales
GROUP BY itemID, saleDate, flavor
), r as
(
SELECT itemID, saleDate, flavor, tally,
RANK() OVER (PARTITION BY itemID, saleDate ORDER BY tally desc) as pri
FROM s
)
SELECT itemID, saleDate, flavor, tally
FROM r
WHERE pri = 1
Here the names "s" and "r" refer to the result sets from their respective CTEs. These names can then be used to represent a table in another part of the statement.
The pri column will hold the RANK() of the tally value on each summary row from the first section "s", within the window of itemID and saleDate. Tally is ordered descending because we want the largest value first, which will get a RANK() of 1. Then in the main SELECT we simply pick those summary records which were first in their partition.
By using RANK() or DENSE_RANK() we could get back multiple flavors for an itemID, saleDate if they are tied for first place. This could be eliminated by replacing RANK() with ROW_NUMBER(), but it would arbitrarily pick one of the tied flavors as the winner, and that may not be the correct answer for the problem at hand.
If we had a sales quantity column in the table, we could replace COUNT(*) with SUM(salesqty) and find what had sold the most units.
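With the hypothetical salesqty column from the previous sentence, only the first CTE changes:
WITH s as
(
  SELECT itemID, saleDate, flavor, SUM(salesqty) as tally
  FROM sales
  GROUP BY itemID, saleDate, flavor
), r as
(
  SELECT itemID, saleDate, flavor, tally,
         RANK() OVER (PARTITION BY itemID, saleDate ORDER BY tally desc) as pri
  FROM s
)
SELECT itemID, saleDate, flavor, tally
FROM r
WHERE pri = 1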