Postgres Custom Range Type

Question:
How to make a custom range type using time (or time with tz) as a base?
What I have so far:
create type timerange as range (
subtype = time,
subtype_diff = ??? )
I think subtype_diff needs a function. For time types in pg, the minus function (difference) should work, but I can't seem to find the documentation that describes the correct syntax.
Background:
I am trying to make a scheduling app, where a service supplier would be able to show their availability and fees for different times of day, and a customer could see the price and book in real-time. The service supplier needs to be able to set different prices for different days or times of day. For example, a plumber might want, for a one hour visit:
$100 monday 0900-1800
$200 monday 1800-2200
$500 monday 2200-0000
To support this, the solution I am working on is as follows (any thoughts on better ways of doing this gratefully received)
I want to make a table that contains 'fee_rules'. I want to be able to lookup a given date, time and duration, and be able to check the associated fee based on a set of fee rules based on ranges. My proposed table schema:
id sequence
day_of_week integer [where 0 = Sunday, 1 = Monday..]
time_range [I want to make a custom time-range using only hours:minutes of the day]
fee integer
fee_schedule_id (foreign key) (reference to a specific supplier, who is the 'owner' of that specific fee rule)
An example of a fee rule would be as follows:
id day_of_week time_range fee fee_schedule_id
12 01 10:00-18:00 100 543
For a given date, I plan to calculate day_of_week (e.g. day_of_week=01 for 'Monday') and generate a time_range based on the start_time and duration of the proposed visit e.g. visit_range=10:00-11:00. I want to be able to search using postgresql's range operators, e.g.
select fee where day_of_week = '01' and visit_range <@ (range is contained by) time_range and fee_schedule_id = 543 [reference to the specific supplier's fees]
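For reference, once the custom range type exists, that lookup could be written roughly like this (a sketch; it assumes the fee_rules table and timerange type proposed above):
select fee
from fee_rules
where day_of_week = 1
  and fee_schedule_id = 543
  -- <@ is "range is contained by"; timerange(...) builds the visit range
  and timerange(time '10:00', time '11:00') <@ time_range;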

Per @a_horse_with_no_name and @pozs
"I don't think you need the subtype_diff":
create type timerange as range (subtype = time);
create table schedule
(
id integer not null primary key,
time_range timerange
);
insert into schedule
values
(1, timerange(time '08:00', time '10:00', '[]')),
(2, timerange(time '10:00', time '12:00', '[]'));
select *
from schedule
where time_range @> time '09:00';

A subtype_diff function for a time range actually appears as an example in the Postgres documentation (since version 9.5):
CREATE FUNCTION time_subtype_diff(x time, y time) RETURNS float8 AS
'SELECT EXTRACT(EPOCH FROM (x - y))' LANGUAGE sql STRICT IMMUTABLE;
CREATE TYPE timerange AS RANGE (
subtype = time,
subtype_diff = time_subtype_diff
);
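One practical payoff of defining the type: it can back an exclusion constraint so that a supplier cannot create overlapping fee rules for the same weekday. A minimal sketch, assuming the fee_rules schema from the question (btree_gist is needed so the = comparisons can live in the GiST index):
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE fee_rules (
    id              serial PRIMARY KEY,
    fee_schedule_id integer NOT NULL,
    day_of_week     integer NOT NULL CHECK (day_of_week BETWEEN 0 AND 6),
    time_range      timerange NOT NULL,
    fee             integer NOT NULL,
    -- reject overlapping ranges for the same supplier and weekday
    EXCLUDE USING gist (fee_schedule_id WITH =,
                        day_of_week WITH =,
                        time_range WITH &&)
);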

Related

How to add unique constraint over a time range?

I have a table slot which has a start_time and end_time. I want no other slot to be created having the same start and end time. A unique constraint as shown in the schema below
CREATE TABLE slot(
id SERIAL PRIMARY KEY,
start_time TIMETZ NOT NULL,
end_time TIMETZ NOT NULL,
CONSTRAINT slot_start_end_unique UNIQUE (start_time,end_time)
);
can be easily bypassed by shifting the time one minute either way. I want to add a constraint so that no equivalent time slot can be created, and no subset time slot either.
I am thinking of using check to prevent any practically same slot from being created.
Can anyone please point towards the right direction?
Your idea of using a check constraint for uniqueness enforcement can probably be made to work, but there could be issues, and it should probably be avoided. Your requirement necessitates comparing with other rows in the table, but
PostgreSQL does not support CHECK constraints that reference table
data other than the new or updated row being checked. ...
The documentation goes on to indicate that a custom trigger is best employed, so that is the approach here. See Section 5.4.1. Check Constraints.
Beyond that, you have a couple of issues. First, the data type TIME WITH TIME ZONE (TIMETZ) is a poor choice, and is somewhat misleading in that the time zone is not actually used the way the name suggests. As Section 8.5.3. Time Zones puts it:
Although the date type cannot have an associated time zone, the time
type can. Time zones in the real world have little meaning unless
associated with a date as well as a time, ... PostgreSQL assumes
your local time zone for any type containing only date or time.
(emphases mine)
Secondly, by using time only, you may have problems specifying some ranges. How, for example, do you code the range from 22:00 to 06:00, or 23:45 to 00:15? But now back to the process.
The following trigger assumes data type TIME rather than TIMETZ and adjusts for the over midnight issue by assuming 'the next day' whenever start_time is greater than end_time.
create or replace
function is_valid_irange()
returns trigger
language plpgsql
strict
as $$
declare
    k_existing_message constant text =
        'Range Requested (%s,%s). Overlaps existing range (%s,%s).';
    l_existing_range tsrange;
    l_parm_range     tsrange;
begin
    -- Anchor the new row's times to a fixed date (1970-01-01), rolling the
    -- end over to the next day whenever start_time > end_time.
    with p_times(new_start_time, new_end_time) as
        ( values ('1970-01-01'::timestamp + new.start_time
                 ,'1970-01-01'::timestamp + new.end_time
                 )
        )
    select tsrange(new_start_time, end_time, '[)')
      into l_parm_range
      from (select new_start_time
                 , case when new_start_time > new_end_time
                        then new_end_time + interval '1 day'
                        else new_end_time
                   end end_time
              from p_times
           ) pr;

    -- Build the same fixed-date ranges for the existing rows and look for
    -- any overlap with the new range, ignoring the row being updated.
    with db_range(id, existing_range) as
        ( select id, tsrange(start_time, end_time, '[)')
            from ( select id
                        , '1970-01-01'::timestamp + start_time start_time
                        , case when start_time > end_time
                               then '1970-01-02'::timestamp + end_time
                               else '1970-01-01'::timestamp + end_time
                          end end_time
                     from irange
                 ) dr
        )
    select d.existing_range
      into l_existing_range
      from db_range d
     where l_parm_range && d.existing_range
       and d.id != new.id
     limit 1;

    if l_existing_range is not null
    then
        -- Cast the bounds back to time-of-day so the message shows clock times.
        raise exception 'Invalid Range Requested:'
              using detail = format( k_existing_message
                                   , lower(l_parm_range)::time
                                   , upper(l_parm_range)::time
                                   , lower(l_existing_range)::time
                                   , upper(l_existing_range)::time
                                   );
    end if;

    return new;
end;
$$;
How it works:
Postgres provides a set of built in data range types and a set of range operator functions.
The trigger coerces the start and end times, both from the new row and from the existing table rows, into timestamps with a fixed date (the beginning of time, 1970-01-01, according to Unix).
It then employs the overlaps (&&) operator. If any overlaps are found, the trigger raises an exception. Instead of an exception it could return null to suppress
the insert or update but otherwise continue processing; for that it needs to become a BEFORE trigger. It is currently an AFTER trigger.
For a full example see here. Do not worry about the date; pick any you want. It is just used as a generator for calculating times and to provide a common base for testing.
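The answer defines the function but not the trigger DDL itself; attaching it could look like this (a sketch, using the irange table referenced in the function; EXECUTE FUNCTION requires PostgreSQL 11+, older versions use EXECUTE PROCEDURE):
-- AFTER trigger, as described above; switch to BEFORE INSERT OR UPDATE
-- if the function should instead suppress the row by returning null.
CREATE TRIGGER irange_no_overlap
    AFTER INSERT OR UPDATE ON irange
    FOR EACH ROW
    EXECUTE FUNCTION is_valid_irange();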
Create the table as normal then before you INSERT data into the table perform a SELECT query to search whether or not the time you are looking to insert already exists. For example you want to enter start 1pm and end 2pm as such:
DECLARE @start_value INT = 1,
        @end_value INT = 2;
SELECT COUNT(id) AS UseCheck FROM slot WHERE start_time = @start_value OR end_time = @end_value;
Then apply logic to say: IF UseCheck > 0 THEN do stuff.

Calculate the difference between two hours on PostgreSQL 9.4

Given a table as the following:
create table meetings(
id integer primary key,
start_time varchar,
end_time varchar
)
Considering that the strings stored in this table follow the format 'HH:MM' in 24-hour format, is there a command in PostgreSQL 9.4 with which I can cast the fields to time, calculate the difference between them, and return a single result counting the full hours between them?
e.g. start_time: '08:00', end_time: '12:00'
The result must be 4.
In your particular case, assuming that you are working with clock values (both of them belonging to the same day), I would guess you can do this:
(clock_to::time - clock_from::time) as duration
Allow me to leave you a ready to run example:
with cte as (
select '4:00'::varchar as clock_from, '14:00'::varchar as clock_to
)
select (clock_to::time - clock_from::time) as duration
from cte
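Since the question asks for a count of full hours rather than an interval, one more step could be added (a sketch against the meetings table from the question):
-- time - time yields an interval; take its epoch and count whole hours
SELECT floor(extract(epoch FROM (end_time::time - start_time::time)) / 3600)::int AS full_hours
FROM meetings;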

On-demand Median Aggregation on a Large Dataset

TLDR: I need to make several median aggregations on a large dataset for a webapp, but the performance is poor. Can my query be improved/is there a better DB than AWS Redshift for this use-case?
I'm working on a team project which involves on-demand aggregations of a large dataset for visualization through our web-app. We're using Amazon Redshift loaded with almost 1,000,000,000 rows, dist-key by date (we have data from 2014 up to today's date, with 900,000 data points being ingested every day) and sort-key by a unique id. The unique id has a possibly one-to-many relationship with other unique ids, for which the 'many' relationship can be thought as the id's 'children'.
Due to confidentiality, think of the table structures like this
TABLE NAME: meal_nutrition
DISTKEY(date),
SORTKEY(patient_id),
patient_name varchar,
calories integer,
fat integer,
carbohydrates integer,
protein integer,
cholesterol integer,
sodium integer
TABLE NAME: patient_hierarchy
DISTKEY(date date),
SORTKEY(patient_id integer),
parent_id integer,
child_id integer,
distance integer
Think of this as a world for which there's a hierarchy of doctors. Patients are encapsulated as both actual patients and the doctors themselves, for which doctors can be the patient of other doctors. Doctors can transfer ownership of patients/doctors at any time, so the hierarchy is constantly changing.
                DOCTOR (id: 1)
               /              \
      PATIENT (id: 2)     DOCTOR (id: 3)
       /         \                \
  P (id: 4)   D (id: 8)       D (id: 20)
   /   \       /   \           /  |  \
  ................
One visualization that we're having trouble with (due to performance) is a time-series graph showing the day-to-day median of several metrics for which the default date-range must be 1 year. So in this example, we want the median of fats, carbohydrates, and proteins of all meals consumed by a patient/doctor and their 'children', given a patient_id. The query used would be:
SELECT patient_name,
date,
max(median_fats),
max(median_carbs),
max(median_proteins)
FROM (SELECT mn.date date,
ph.patient_name patient_name,
MEDIAN(fats) over (PARTITION BY date) AS median_fats,
MEDIAN(carbohydrates) over (PARTITION BY date) AS median_carbs,
MEDIAN(proteins) over (PARTITION BY date) AS median_proteins
FROM meal_nutrition mn
JOIN patient_hierarchy ph
ON (mn.patient_id = ph.child_id)
WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
AND ph.parent_id = ?
AND date >= '2016-12-17' and date <= '2017-12-17'
) sub
GROUP BY date, patient_name
The heaviest operations in this query are the sorts for each of the medians (each requiring a sort of ~200,000,000 rows), but we cannot avoid them. As a result, this query takes ~30s to complete, which translates to bad UX. Can the query I'm making be improved? Is there a better DB for this kind of use-case? Thanks!
As said in the comments, sorting/distribution of your data is very important. If you take just one date slice of patient_hierarchy, then with distribution by date all of the data you're using sits on one node. It's better to distribute by meal_nutrition.patient_id and patient_hierarchy.child_id, so that data being joined likely sits on the same node, and to sort the tables by date, patient_id and date, child_id respectively, so you can find the necessary date slices/ranges efficiently and then look up patients efficiently.
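In DDL terms, the suggested keys could look like this (a sketch; column types are guessed from the question):
CREATE TABLE meal_nutrition (
    date          date,
    patient_id    integer,
    patient_name  varchar(64),
    calories      integer,
    fat           integer,
    carbohydrates integer,
    protein       integer,
    cholesterol   integer,
    sodium        integer
)
DISTKEY (patient_id)
SORTKEY (date, patient_id);

CREATE TABLE patient_hierarchy (
    date      date,
    parent_id integer,
    child_id  integer,
    distance  integer
)
DISTKEY (child_id)
SORTKEY (date, child_id);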
As for the query itself, there are some options that you can try:
1) Approximate median like this:
SELECT mn.date date,
ph.patient_name patient_name,
APPROXIMATE PERCENTILE_DISC (0.5) WITHIN GROUP (ORDER BY fats) AS median_fats
FROM meal_nutrition mn
JOIN patient_hierarchy ph
ON (mn.patient_id = ph.child_id)
WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
AND ph.parent_id = ?
AND date >= '2016-12-17' and date <= '2017-12-17'
GROUP BY 1,2
Notes: this might not work if the memory stack is exceeded. Also, you can have only one such function per subquery, so you can't get fats, carbs, and proteins in the same subquery, but you can calculate them separately and then join (see the sketch below). If this works, you can then test the accuracy by running your 30s statement for a few IDs and comparing the results.
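A sketch of the calculate-separately-then-join shape, for two of the three metrics (column names follow the thread's queries; proteins would be a third CTE of the same form):
WITH fats_m AS (
    SELECT mn.date, ph.patient_name,
           APPROXIMATE PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY fats) AS median_fats
    FROM meal_nutrition mn
    JOIN patient_hierarchy ph ON mn.patient_id = ph.child_id
    WHERE ph.date = (SELECT MAX(date) FROM patient_hierarchy)
      AND ph.parent_id = ?
      AND mn.date BETWEEN '2016-12-17' AND '2017-12-17'
    GROUP BY 1, 2
),
carbs_m AS (
    SELECT mn.date, ph.patient_name,
           APPROXIMATE PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY carbohydrates) AS median_carbs
    FROM meal_nutrition mn
    JOIN patient_hierarchy ph ON mn.patient_id = ph.child_id
    WHERE ph.date = (SELECT MAX(date) FROM patient_hierarchy)
      AND ph.parent_id = ?
      AND mn.date BETWEEN '2016-12-17' AND '2017-12-17'
    GROUP BY 1, 2
)
SELECT f.date, f.patient_name, f.median_fats, c.median_carbs
FROM fats_m f
JOIN carbs_m c USING (date, patient_name);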
2) Binning. First group by each value, or set reasonable bins, then find the group/bin that is in the middle of the distribution. That will be your median. One variable example would be:
WITH
groups as (
    SELECT mn.date date,
           ph.patient_name patient_name,
           fats,
           count(1) as cnt
    FROM meal_nutrition mn
    JOIN patient_hierarchy ph
      ON (mn.patient_id = ph.child_id)
    WHERE ph.date = (SELECT max(date) FROM patient_hierarchy)
      AND ph.parent_id = ?
      AND date >= '2016-12-17' and date <= '2017-12-17'
    GROUP BY 1,2,3
)
,running_groups as (
    SELECT *
          ,sum(cnt) over (partition by date, patient_name order by fats rows between unbounded preceding and current row) as running_total
          ,sum(cnt) over (partition by date, patient_name) as total
    FROM groups
)
,distance_from_median as (
    SELECT *
          ,row_number() over (partition by date, patient_name order by abs(0.5-(1.0*running_total/total))) as distance_from_median
    FROM running_groups
)
SELECT date,
       patient_name,
       fats
FROM distance_from_median
WHERE distance_from_median=1
That would likely allow grouping values on each individual node, and the subsequent operations on bins will be more lightweight and avoid sorting the raw sets. Again, you have to benchmark. The fewer unique values you have, the higher your performance gain will be, because you'll have a small number of bins out of a big number of raw values and sorting will be much cheaper. The result is accurate except for the option with an even number of distinct values (for 1,2,3,4 it would return 2, not 2.5), but this is solvable by adding another layer if it's critical. The main question is whether the approach itself improves performance significantly.
3) Materialize the calculation for every date/patient id. If your only parameter is the patient and you always calculate medians for the last year, you can run the query overnight into a summary table and query that one. It's worth doing even if (1) or (2) helps to optimize performance. You can also copy the summary table to a Postgres instance after materializing and use it as the backend for your app; you'll get better ping (Redshift is good for materializing large amounts of data but not good as a web app backend). It comes with the cost of maintaining a data transfer job, so if materializing/optimization alone does a good enough job, you can leave it in Redshift.
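A sketch of what that overnight summary could look like (exact MEDIAN aggregates; table and column names follow the thread, and daily_medians is a hypothetical name):
-- One row per (parent, day); the app then filters by parent_id and date range.
CREATE TABLE daily_medians AS
SELECT ph.parent_id,
       mn.date,
       MEDIAN(fats)          AS median_fats,
       MEDIAN(carbohydrates) AS median_carbs,
       MEDIAN(proteins)      AS median_proteins
FROM meal_nutrition mn
JOIN patient_hierarchy ph ON mn.patient_id = ph.child_id
WHERE ph.date = (SELECT MAX(date) FROM patient_hierarchy)
GROUP BY ph.parent_id, mn.date;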
I'm really interested in getting feedback if you try any of the suggested options; this is a good use case for Redshift.

Calculate the sum of a time column in PostgreSQL

Can anyone suggest the easiest way to find the summation of a time field in PostgreSQL? I have found a solution for MySQL, but I need the PostgreSQL version.
MYSQL: https://stackoverflow.com/questions/3054943/calculate-sum-time-with-mysql
SELECT SEC_TO_TIME(SUM(TIME_TO_SEC(timespent))) FROM myTable;
Demo Data
id time
1 1:23:23
2 4:00:23
3 9:23:23
Desired Output
14:47:09
What you want is not possible directly, but you have probably misunderstood the time type: it represents a precise time-point in a day. It doesn't make much sense to add two (or more) times, e.g. '14:00' + '14:00' = '28:00' (but there is no 28th hour in a day).
What you probably want is interval (which represents time intervals: hours, minutes, or even years). sum() supports interval arguments.
If you use intervals, it's just that simple:
SELECT sum(interval_col) FROM my_table;
However, if you stick to the time type (though you have no reason to do that), you can cast it to interval to calculate with it:
SELECT sum(time_col::interval) FROM my_table;
But again, the result will be interval, because time values cannot exceed the 24th hour in a day.
Note: PostgreSQL will even do the cast for you, so sum(time_col) should work too, but the result is interval in this case too.
I tried this solution on SQL Fiddle:
link
Table creation:
CREATE TABLE time_table (
id integer, time time
);
Insert data:
INSERT INTO time_table (id,time) VALUES
(1,'1:23:23'),
(2,'4:00:23'),
(3,'9:23:23')
Query the data:
SELECT
sum(s.time)
FROM
time_table s;
If you need to calculate the sum of some field grouped by another field, you can do this:
select
keyfield,
sum(time_col::interval) totaltime
FROM myTable
GROUP BY keyfield
Output example:
keyfield; totaltime
"Gabriel"; "10:00:00"
"John"; "36:00:00"
"Joseph"; "180:00:00"
Data type of totaltime is interval.

How do I achieve exclusive OR in T-SQL?

I have data tables with the following columns (boiled down to the pertinent ones for this example):
users(
usr_pkey int identity(1, 1) primary key,
usr_name nvarchar(64),
...,
)
accounts(
acc_pkey int identity(1, 1) primary key,
usr_key int foreign key references users(usr_pkey),
acc_effective datetime,
acc_expires datetime,
acc_active bit,
...,
)
From this table I'm looking to grab all records where:
The account belongs to the specified user and
In the first instance:
the account is active and today's date falls between the account's effective and expiry date or
In the second instance:
if no records were identified by the first instance, the record with the most recent expiry date.
So - if an active record exists where today's date falls between the account's effective and expiry dates, I want that record. Only if no match was found do I want any account for this user having the most recent expiry date.
Unless something has radically changed in TSQL 2008, it's brute force.
select *
from table
where ( ( condition 1 OR condition 2)
AND NOT ( condition 1 AND condition 2) )
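For instance, plugged into the question's accounts table (a hypothetical instantiation, just to show the shape of the pattern):
-- XOR: exactly one of "account is active" / "today is in the window" holds
SELECT *
FROM accounts
WHERE (acc_active = 1 OR GETDATE() BETWEEN acc_effective AND acc_expires)
  AND NOT (acc_active = 1 AND GETDATE() BETWEEN acc_effective AND acc_expires);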
Here's one solution I've found:
select top 1 *
from accounts
where usr_key = @specified_user
order by
    acc_active desc,
    case
        when getdate() between acc_effective and acc_expires then 0
        else 1
    end,
    acc_expires desc
This would effectively order the records in the right priority sequence, allowing me to pick the top one off the list.
Strictly speaking, it doesn't achieve exclusive or, but it can be applied to this data set to achieve the same end.