Find closest positive value based on multiple criteria - postgresql

First of all, I am still learning SQL / PostgreSQL, so I am eagerly looking for explanations and the thought process / strategy rather than just the raw answer. And I apologize in advance for any future misunderstandings or "stupid" questions.
Also, if you know a great site that offers exercises or challenges for mastering SQL / PostgreSQL, I'll take anything :)
I am looking for a way to return the closest value, based on other specific results in the same table.
In the same table, I am tracking different events:
ESESS = end-of-session event. Gives me a new timestamp (ts) every time Georges (id) finishes a session (let's say Georges is using a computer, so ending a session = shutting the computer down).
USD = balance update event. Each time Georges spends/earns money, those three columns give me the new balance (v), as well as his id and the timestamp (ts) at which the balance was updated.
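To make this concrete, here is a minimal reconstruction of the table's shape; the column names (n, mid, ts, v) come from the query further down, but the types are assumptions, not something stated in the question:

create table table1 (
    n   text,        -- event name: 'esess' or 'usd'
    mid int,         -- id of the person (Georges)
    ts  timestamp,   -- when the event was recorded
    v   numeric      -- new balance; only meaningful for 'usd' rows
);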
What I am trying to get is the balance at the end of each session.
My plan was to return esess.id and usd.v only where (esess.ts - usd.ts) equals the smallest positive value.
So, some sort of lookup from usd.ts where (esess.ts - usd.ts) matches that condition... but I'm struggling with that part.
Here is the query:
SELECT
    sessId, moneyV
FROM
    (
        SELECT
            ts as sessTs,
            mid as sessId
        FROM
            table1
        WHERE
            n = 'esess'
    ) as sess
    INNER JOIN
    (
        SELECT
            ts as moneyTs,
            mid as moneyId,
            v as moneyV
        FROM
            table1
        WHERE
            n = 'usd'
    ) as balance
    ON sessId = moneyId
WHERE
    sessTs - moneyTs =
    (
        SELECT
            sessTs - moneyTs as timeDiff
        FROM
            table1
        WHERE
            sessTs - moneyTs > 0
        ORDER BY
            timeDiff ASC
        LIMIT 1
    )
;
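For reference, a common PostgreSQL pattern for this kind of "closest preceding row per group" lookup is a LATERAL join, which, for each session-end row, picks the single latest balance row before it. A minimal sketch against the table shape above (one possible approach, not the asker's original query):

SELECT sess.mid AS sessId, bal.v AS moneyV
FROM table1 sess
CROSS JOIN LATERAL (
    SELECT b.v
    FROM table1 b
    WHERE b.n = 'usd'
      AND b.mid = sess.mid
      AND b.ts < sess.ts           -- mirrors the (esess.ts - usd.ts) > 0 condition
    ORDER BY b.ts DESC             -- closest preceding balance first
    LIMIT 1
) AS bal
WHERE sess.n = 'esess';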
So how should I proceed?
Also, I dug around for answers and found this post in particular, but I did not understand everything and did not manage to make it work properly...
Thanks in advance!

Related

PostgreSQL 11.5 doing sequential scan for SELECT EXISTS query

I have a multi tenant environment where each tenant (customer) has its own schema to isolate their data. Not ideal I know, but it was a quick port of a legacy system.
Each tenant has a "reading" table, with a composite index of 4 columns:
site_code char(8), location_no int, sensor_no int, reading_dtm timestamptz.
When a new reading is added, a function is called which first checks if there has already been a reading in the last minute (for the same site_code.location_no.sensor_no):
IF EXISTS (
    SELECT
    FROM reading r
    WHERE r.site_code = p_site_code
      AND r.location_no = p_location_no
      AND r.sensor_no = p_sensor_no
      AND r.reading_dtm > p_reading_dtm - INTERVAL '1 minute'
) THEN
    RETURN;
END IF;
Now, bear in mind there are many tenants, all behaving fine except one. In that one tenant, the call is taking nearly half a second rather than the usual few milliseconds, because it is doing a sequential scan on a table with nearly 2 million rows instead of an index scan.
My random_page_cost is set to 1.5.
I could understand a sequential scan if the query were possibly returning many rows, but it is only checking for the existence of any.
I've tried ANALYZE on the table, VACUUM FULL, etc but it makes no difference.
If I put "SET LOCAL enable_seqscan = off" before the query, it works perfectly... but it feels wrong, but it will have to be a temporary solution as this is a live system and it needs to work.
What else can I do to help Postgres make what is clearly the better decision of using the index?
EDIT: If I do a similar query manually (outside of a function) it chooses an index.
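(For anyone trying to reproduce this, a useful diagnostic is to EXPLAIN the check with concrete values and compare the planner's row estimate against reality; the literal values below are placeholders, not values from the original post.)

-- Placeholder values; substitute a real site/location/sensor combination.
EXPLAIN (ANALYZE, BUFFERS)
SELECT 1
FROM reading r
WHERE r.site_code = 'SITE0001'
  AND r.location_no = 1
  AND r.sensor_no = 1
  AND r.reading_dtm > now() - INTERVAL '1 minute';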
My guess is that the engine is evaluating the predicate and considers it not selective enough (it thinks too many rows will be returned), so it decides to use a table scan instead.
I would do two things:
Make sure you have the correct index in place:
create index ix1 on reading (site_code, location_no,
sensor_no, reading_dtm);
Trick the optimizer by making the selectivity look better. You can do that by adding the extra [redundant] predicate and r.reading_dtm < :p_reading_dtm:
select 1
from reading r
where r.site_code = :p_site_code
and r.location_no = :p_location_no
and r.sensor_no = :p_sensor_no
and r.reading_dtm > :p_reading_dtm - interval '1 minute'
and r.reading_dtm < :p_reading_dtm
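Another angle, given the "works outside of a function" observation in the question: on PostgreSQL 11, PL/pgSQL may cache a generic plan for a static query, while EXECUTE replans with the actual parameter values on every call. A hedged sketch of the check rewritten with dynamic SQL; the function name and body framing are assumptions based on the parameters in the question:

CREATE OR REPLACE FUNCTION add_reading(
    p_site_code char(8),
    p_location_no int,
    p_sensor_no int,
    p_reading_dtm timestamptz
) RETURNS void AS $$
DECLARE
    found_recent boolean;
BEGIN
    -- Dynamic SQL is planned per call with the real values,
    -- sidestepping a potentially poor cached generic plan.
    EXECUTE 'SELECT EXISTS (
                 SELECT FROM reading r
                 WHERE r.site_code = $1
                   AND r.location_no = $2
                   AND r.sensor_no = $3
                   AND r.reading_dtm > $4 - INTERVAL ''1 minute'')'
        INTO found_recent
        USING p_site_code, p_location_no, p_sensor_no, p_reading_dtm;
    IF found_recent THEN
        RETURN;
    END IF;
    -- ... rest of the original function (inserting the new reading) ...
END;
$$ LANGUAGE plpgsql;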

Creating Date Cutoffs on continuous processes

I'm an intern putting some CNC machine metrics into a database, and I'm a little stuck on a particular query. I only started SQL last week, so please forgive me if this is a dumb question.
If a machine is running (state=on) past 23:59 of a given day, and I want to collect machine hours for that day, there is no logged off time, as the state=off row has not been recorded yet, so I cannot collect that machine's data. To work around this, I want to record a state-off time of 23:59:59 and then create a new row with the same entity ID and a state-on time of 00:00:01 on day+1.
Here is what I have written so far; where am I going wrong? What combination of trigger, insert, procedure, case, etc. should I use? Any suggestions are welcome. I've tried to look at some reference material, and I want the first bit to look something like this:
CASE
    WHEN min(stoff.last_changed) IS NULL
         AND now() = '____-__-__ 23:59:59.062538+13'
    THEN min(stoff.last_changed) IS now()
    ELSE min(stoff.last_changed)
END
I know this is only the first component, but it fits into a larger select used within a view. Let me know if I need to post anything else.
This is a fairly complex query because there are a few possibilities you need to consider (given the physical setup of the CNC machine, these may not both apply):
The machine might be running at midnight so you need to 'insert' a midnight start time (and treat midnight as the stop time for the previous day).
The machine might be running all day (so there are no records in your table for the day at all)
There are a number of ways you can implement this; I have chosen one that I think will be easiest for you to understand (and have avoided window functions). The response does use CTEs, but I think these are easy enough to understand (it's really just a chain of queries, each one using the result of the previous one).
Let's set up some example data:
create table states (
entity_id int,
last_changed timestamp,
state bool
);
insert into states(entity_id, last_changed, state) values
(1, '2019-11-26 01:00', false),
(1, '2019-11-26 20:00', true),
(1, '2019-11-27 01:00', false),
(1, '2019-11-27 02:00', true),
(1, '2019-11-27 22:00', false);
Now the query (it's not as bad as it looks!):
-- Let's start with a range of dates that you want the report for (needed because it's possible that there are
-- no entries at all in the states table for a particular date)
with date_range as (
select i::date from generate_series('2019-11-26', '2019-11-29', '1 day'::interval) i
),
-- Get all combinations of machine (entity_id) and date
allMachineDays as (
select distinct states.entity_id, date_range.i as started
from
states cross join date_range),
-- Work out what the state at the start of each day was (if no earlier state available then assume false)
stateAtStartOfDay as (
select
entity_id, started,
COALESCE(
(
select state
from states
where states.entity_id = allMachineDays.entity_id
and states.last_changed<=allMachineDays.started
order by states.last_changed desc limit 1
)
,false) as state
from allMachineDays
)
,
-- Now we can add in the state at the start of each day to the other state changes
statesIncludingStartOfDay as (
select * from stateAtStartOfDay
union
select * from states
),
-- Next we add the time that the state changed
statesWithEnd as (
select
entity_id,
state,
started,
(
select started from statesIncludingStartOfDay substate
where
substate.entity_id = states.entity_id and
substate.started > states.started
order by started asc
limit 1
) as ended
from
statesIncludingStartOfDay states
)
-- finally lets work out the duration
select
entity_id,
state,
started,
ended,
ended - started as duration
from
statesWithEnd states
where
ended is not null -- cut off the last midnight row as it's no longer needed
order by entity_id,started
Hopefully this makes sense and there are no errors in my logic! Note that I have made some assumptions about your data (i.e. last_changed is the time that the state began). If you just want a runtime for each day, it's pretty easy to add a group by to the last query and sum up the duration, as sketched below.
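For example, replacing the final select above with something like this (a sketch reusing the same CTE chain) gives the total running time per machine per day, counting only the periods where the state was on:

-- Per-day runtime: sum the "on" durations, grouped by machine and day
select
    entity_id,
    started::date as day,
    sum(ended - started) as runtime
from statesWithEnd
where state                 -- only periods where the machine was on
  and ended is not null
group by entity_id, started::date
order by entity_id, day;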
It might help you to understand this if you run it one step at a time; for example, start with the following and then add in the extra with clauses one at a time:
with date_range as (
select i::date from generate_series('2019-11-26', '2019-11-29', '1 day'::interval) i
)
select * from date_range

On-demand Median Aggregation on a Large Dataset

TLDR: I need to make several median aggregations on a large dataset for a webapp, but the performance is poor. Can my query be improved/is there a better DB than AWS Redshift for this use-case?
I'm working on a team project that involves on-demand aggregations of a large dataset for visualization through our web app. We're using Amazon Redshift loaded with almost 1,000,000,000 rows, dist-keyed by date (we have data from 2014 up to today's date, with 900,000 data points being ingested every day) and sort-keyed by a unique id. The unique id has a possibly one-to-many relationship with other unique ids, for which the 'many' side can be thought of as the id's 'children'.
Due to confidentiality, think of the table structures like this:
TABLE NAME: meal_nutrition
DISTKEY(date),
SORTKEY(patient_id),
patient_name varchar,
calories integer,
fats integer,
carbohydrates integer,
proteins integer,
cholesterol integer,
sodium integer
TABLE NAME: patient_hierarchy
DISTKEY(date date),
SORTKEY(patient_id integer),
parent_id integer,
child_id integer,
distance integer
Think of this as a world in which there is a hierarchy of doctors. Patients are encapsulated as both actual patients and the doctors themselves, since doctors can be patients of other doctors. Doctors can transfer ownership of patients/doctors at any time, so the hierarchy is constantly changing.
DOCTOR (id: 1)
/ \
PATIENT(id: 2) DOCTOR (id: 3)
/ \ \
P (id: 4) D (id: 8) D(id: 20)
/ \ / \ / \ \
................
One visualization that we're having trouble with (due to performance) is a time-series graph showing the day-to-day median of several metrics for which the default date-range must be 1 year. So in this example, we want the median of fats, carbohydrates, and proteins of all meals consumed by a patient/doctor and their 'children', given a patient_id. The query used would be:
SELECT patient_name,
date,
max(median_fats),
max(median_carbs),
max(median_proteins)
FROM (SELECT mn.date date,
ph.patient_name patient_name,
MEDIAN(fats) over (PARTITION BY date) AS median_fats,
MEDIAN(carbohydrates) over (PARTITION BY date) AS median_carbs,
MEDIAN(proteins) over (PARTITION BY date) AS median_proteins
FROM meal_nutrition mn
JOIN patient_hierarchy ph
ON (mn.patient_id = ph.child_id)
WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
AND ph.parent_id = ?
AND date >= '2016-12-17' and date <= '2017-12-17'
) AS sub
GROUP BY date, patient_name
The heaviest operations in this query are the sorts for each of the medians (each requiring a sort of ~200,000,000 rows), but we cannot avoid them. As a result, this query takes ~30s to complete, which translates to bad UX. Can the query I'm making be improved? Is there a better DB for this kind of use case? Thanks!
As said in the comments, the sorting/distribution of your data is very important. If you take just one date slice of the patient hierarchy, then with distribution by date all the data you're using sits on one node. It's better to distribute by meal_nutrition.patient_id and patient_hierarchy.child_id, so that data being joined is likely to sit on the same node, and to sort the tables by (date, patient_id) and (date, child_id) respectively, so you can find the necessary date slices/ranges efficiently and then look up patients efficiently. A sketch of that DDL follows below.
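Spelled out as Redshift DDL, the suggestion would look roughly like this (a sketch using the illustrative column names from the question; patient_hierarchy would analogously get DISTKEY(child_id) and COMPOUND SORTKEY(date, child_id)):

-- Sketch of the suggested keys for meal_nutrition
CREATE TABLE meal_nutrition (
    date          date,
    patient_id    integer,
    patient_name  varchar,
    calories      integer,
    fats          integer,
    carbohydrates integer,
    proteins      integer,
    cholesterol   integer,
    sodium        integer
)
DISTSTYLE KEY
DISTKEY (patient_id)
COMPOUND SORTKEY (date, patient_id);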
As for the query itself, there are some options that you can try:
1) Approximate median like this:
SELECT mn.date date,
ph.patient_name patient_name,
APPROXIMATE PERCENTILE_DISC (0.5) WITHIN GROUP (ORDER BY fats) AS median_fats
FROM meal_nutrition mn
JOIN patient_hierarchy ph
ON (mn.patient_id = ph.child_id)
WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
AND ph.parent_id = ?
AND date >= '2016-12-17' and date <= '2017-12-17'
GROUP BY 1,2
Notes: this might not work if the memory stack is exceeded. Also, you can have only one such function per subquery, so you can't get fats, carbs, and proteins in the same subquery, but you can calculate them separately and then join them (see the sketch below). If this works, you can then test the accuracy by running your 30s statement for a few IDs and comparing the results.
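A hedged sketch of that "calculate separately, then join" workaround, reusing the question's names (whether the one-function-per-subquery restriction is satisfied per CTE is something to verify against your cluster):

-- Base join done once, then one approximate median per CTE
WITH base AS (
    SELECT mn.date, ph.patient_name, mn.fats, mn.carbohydrates, mn.proteins
    FROM meal_nutrition mn
    JOIN patient_hierarchy ph ON mn.patient_id = ph.child_id
    WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
      AND ph.parent_id = ?
      AND mn.date BETWEEN '2016-12-17' AND '2017-12-17'
),
f AS (
    SELECT date, patient_name,
           APPROXIMATE PERCENTILE_DISC (0.5) WITHIN GROUP (ORDER BY fats) AS median_fats
    FROM base GROUP BY 1, 2
),
c AS (
    SELECT date, patient_name,
           APPROXIMATE PERCENTILE_DISC (0.5) WITHIN GROUP (ORDER BY carbohydrates) AS median_carbs
    FROM base GROUP BY 1, 2
),
p AS (
    SELECT date, patient_name,
           APPROXIMATE PERCENTILE_DISC (0.5) WITHIN GROUP (ORDER BY proteins) AS median_proteins
    FROM base GROUP BY 1, 2
)
SELECT f.date, f.patient_name, f.median_fats, c.median_carbs, p.median_proteins
FROM f
JOIN c USING (date, patient_name)
JOIN p USING (date, patient_name);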
2) Binning. First group by each value, or set reasonable bins, then find the group/bin that sits in the middle of the distribution. That will be your median. An example for one variable would be:
WITH
groups as (
    SELECT mn.date date,
           ph.patient_name patient_name,
           fats,
           count(1) as cnt
    FROM meal_nutrition mn
    JOIN patient_hierarchy ph
        ON (mn.patient_id = ph.child_id)
    WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
      AND ph.parent_id = ?
      AND date >= '2016-12-17' and date <= '2017-12-17'
    GROUP BY 1,2,3
)
,running_groups as (
    SELECT *
           ,sum(cnt) over (partition by date, patient_name order by fats rows between unbounded preceding and current row) as running_total
           ,sum(cnt) over (partition by date, patient_name) as total
    FROM groups
)
,distance_from_median as (
    SELECT *
           ,row_number() over (partition by date, patient_name order by abs(0.5-(1.0*running_total/total))) as distance_from_median
    FROM running_groups
)
SELECT
    date,
    patient_name,
    fats
FROM distance_from_median
WHERE distance_from_median=1
That would likely allow grouping values on each individual node, and the subsequent operations on bins will be more lightweight and avoid sorting the raw sets. Again, you have to benchmark. The fewer unique values you have, the higher your performance gain will be, because you'll have a small number of bins out of a big number of raw values and sorting will be much cheaper. The result is accurate except in the case of an even number of distinct values (for 1,2,3,4 it would return 2, not 2.5), but this is solvable by adding another layer if it's critical. The main question is whether the approach itself improves performance significantly.
3) Materialize the calculation for every date/patient ID. If your only parameter is the patient and you always calculate medians for the last year, you can run the query overnight into a summary table and query that one (see the sketch below). This is worth doing even if (1) or (2) helps to optimize performance. You can also copy the summary table to a Postgres instance after materializing and use it as the backend for your app; you'll get better ping (Redshift is good for materializing large amounts of data but not good as a web app backend). That comes with the cost of maintaining a data transfer job, so if the materializing/optimization alone does a good enough job, you can leave it in Redshift.
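A hedged sketch of such a nightly materialization for one metric; the summary table name and scheduling are assumptions, and the other metrics could be handled the same way or joined in as in the earlier sketch:

-- Hypothetical nightly job: last year's daily median fats per parent
CREATE TABLE median_fats_summary AS
SELECT ph.parent_id,
       mn.date,
       MEDIAN(mn.fats) AS median_fats
FROM meal_nutrition mn
JOIN patient_hierarchy ph ON mn.patient_id = ph.child_id
WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
  AND mn.date >= dateadd(year, -1, current_date)
GROUP BY 1, 2;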
I'm really interested in getting feedback if you try any of the suggested options; this is a good use case for Redshift.

Reform a postgreSQL query using one (or more) index - Example

I am a beginner in PostgreSQL and, after understanding the very basic things, I want to find out how I can get better performance (on a query) by using an index (one or more). I have read some documentation, but I would like a specific example so as to "catch" it.
MY EXAMPLE: Let's say I have just a table (MyTable) with three columns (Customer (text), Time (timestamp), Consumption (integer)), and I want to find the customer(s) with the maximum consumption at '2013-07-01 02:00:00'. MY SOLUTION (without index usage):
SELECT Customer FROM MyTable WHERE Time='2013-07-01 02:00:00'
AND Consumption=(SELECT MAX(consumption) FROM MyTable);
----> What would be the exact full code, using at least one index, for the query example above?
The correct query (using a correlated subquery) would be:
SELECT Customer
FROM MyTable
WHERE Time = '2013-07-01 02:00:00' AND
Consumption = (SELECT MAX(t2.consumption) FROM MyTable t2 WHERE t2.Time = '2013-07-01 02:00:00');
The above is very reasonable. An alternative approach if you want exactly one row returned is:
SELECT Customer
FROM MyTable
WHERE Time = '2013-07-01 02:00:00'
ORDER BY Consumption DESC
LIMIT 1;
And the best index is MyTable(Time, Consumption, Customer).
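Written out as DDL, that index would look like this (the index name is illustrative):

CREATE INDEX mytable_time_consumption_customer_idx
    ON MyTable (Time, Consumption, Customer);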

Why is performance of CTE worse than temporary table in this example

I recently asked a question regarding CTEs and using data with no true root records (i.e. instead of the root record having a NULL Parent_Id, it is parented to itself).
The question link is here: Creating a recursive CTE with no root record
That question has now been answered and I have the data I require; however, I am interested in the difference between the two approaches that I THINK are available to me.
The approach that yielded the data I required was to create a temp table with cleaned-up parenting data and then run a recursive CTE against it. It looked like this:
Select CASE
WHEN Parent_Id = Party_Id THEN NULL
ELSE Parent_Id
END AS Act_Parent_Id
, Party_Id
, PARTY_CODE
, PARTY_NAME
INTO #Parties
FROM DIMENSION_PARTIES
WHERE CURRENT_RECORD = 1;
WITH linkedParties
AS
(
Select Act_Parent_Id, Party_Id, PARTY_CODE, PARTY_NAME, 0 AS LEVEL
FROM #Parties
WHERE Act_Parent_Id IS NULL
UNION ALL
Select p.Act_Parent_Id, p.Party_Id, p.PARTY_CODE, p.PARTY_NAME, Level + 1
FROM #Parties p
inner join
linkedParties t on p.Act_Parent_Id = t.Party_Id
)
Select *
FROM linkedParties
Order By Level
I also attempted to retrieve the same data by defining two CTEs: one to emulate the creation of the temp table above, and the other to do the same recursive work but referencing the initial CTE rather than a temp table:
WITH Parties
AS
(Select CASE
WHEN Parent_Id = Party_Id THEN NULL
ELSE Parent_Id
END AS Act_Parent_Id
, Party_Id
, PARTY_CODE
, PARTY_NAME
FROM DIMENSION_PARTIES
WHERE CURRENT_RECORD = 1),
linkedParties
AS
(
Select Act_Parent_Id, Party_Id, PARTY_CODE, PARTY_NAME, 0 AS LEVEL
FROM Parties
WHERE Act_Parent_Id IS NULL
UNION ALL
Select p.Act_Parent_Id, p.Party_Id, p.PARTY_CODE, p.PARTY_NAME, Level + 1
FROM Parties p
inner join
linkedParties t on p.Act_Parent_Id = t.Party_Id
)
Select *
FROM linkedParties
Order By Level
Now, these two scripts are run on the same server; however, the temp table approach yields the results in approximately 15 seconds.
The multiple-CTE approach takes upwards of 5 minutes (so long, in fact, that I have never waited for the results to return).
Is there a reason why the temp table approach would be so much quicker?
For what it is worth, I believe it has to do with the record counts. The base table has 200k records in it, and from memory CTE performance is severely degraded when dealing with large data sets, but I cannot seem to prove that, so I thought I'd check with the experts.
Many Thanks
Well, as there appears to be no clear answer to this, some further research into the general subject turned up a number of other threads with similar problems.
This one seems to cover many of the variations between temp tables and CTEs, so it is most useful for people looking to read around their issues:
Which are more performant, CTE or temporary tables?
In my case, it would appear that the large amount of data in my CTEs causes the issue: a CTE is not cached anywhere, so recreating it each time it is referenced later has a large impact.
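(A related follow-up, which is an assumption on my part rather than something from the thread: once the cleaned-up data is in a temp table, the recursive join can also benefit from an index on the column it probes.)

-- The recursive member joins on p.Act_Parent_Id = t.Party_Id,
-- so an index on #Parties (Act_Parent_Id) can speed up each recursion step.
CREATE INDEX IX_Parties_ActParentId ON #Parties (Act_Parent_Id);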
This might not be exactly the same issue you experienced, but I came across a similar one just a few days ago, and the queries did not even process that many records (a few thousand).
And yesterday my colleague had a similar problem.
Just to be clear, we are using SQL Server 2008 R2.
The pattern that I identified, and that seems to throw the SQL Server optimizer off the rails, is using temporary tables in CTEs that are joined with other temporary tables in the main select statement.
In my case I ended up creating an extra temporary table.
Here is a sample.
I ended up doing this:
SELECT DISTINCT st.field1, st.field2
into #Temp1
FROM SomeTable st
WHERE st.field3 <> 0
select x.field1, x.field2
FROM #Temp1 x inner join #Temp2 o
on x.field1 = o.field1
order by 1, 2
I tried the following query but it was a lot slower, if you can believe it.
with temp1 as (
SELECT DISTINCT st.field1, st.field2
FROM SomeTable st
WHERE st.field3 <> 0
)
select x.field1, x.field2
FROM temp1 x inner join #Temp2 o
on x.field1 = o.field1
order by 1, 2
I also tried to inline the first query in the second one and the performance was the same, i.e. VERY BAD.
SQL Server never ceases to amaze me. Once in a while I come across issues like this one that remind me it is a Microsoft product after all, but in the end you can say that other database systems have their own quirks.