How can I make this query more set-based? - tsql

This is my first post here, so please let me know if I've not given everything needed.
I have been struggling to rewrite a process that has recently been causing me and my server significant performance issues.
The overall task is to identify where a customer has had to contact us back within +2 hours to +28 days of their previous contact. Currently this is being completed via the use of a cursor for all the contacts we received yesterday. This equates to approximately 50k contacts per day.
I am aware that this can be done through a cursor or a recursive CTE, but I feel like both options are bad. I am looking for another method to do the same job.
Below is a sample extract and the outcome I am expecting to see.
INSERT INTO SourceData ([CUSTOMER_KEY], [CONTACT_REFERENCE], [CONTACT_DATETIME], [EXPECTED_RESULT])
VALUES ('1', '100', '01/04/2020 09:00', 'Original Contact'),
('2', '101', '01/04/2020 10:00', 'Original Contact'),
('3', '102', '01/04/2020 11:00', 'Original Contact'),
('1', '103', '01/04/2020 12:00', 'Repeat of Contact Reference 100'),
('1', '104', '01/04/2020 13:00', 'Not Repeat - within 2 hours of previous contact'),
('1', '50' , '01/04/2020 14:00', 'Repeat of Contact Reference 103'),
('2', '105', '01/04/2020 14:00', 'Repeat of Contact Reference 101'),
('1', '106', '01/04/2020 15:00', 'Repeat of Contact Reference 104'),
('1', '200', '27/04/2020 12:00', 'Repeat of Contact Reference 106');
The process I currently follow is below; a purely illustrative skeleton of the loop is sketched after the list. I am happy to update my post to provide code, but I don't think this will be too useful given that I am looking for other solutions.
1. Identify the current latest repeat of every customer. This step exists to reduce the load on the full data table: if there was already a repeat contact within the time frame, I can just assign the new contact straight to it. This data is loaded into a new temp table: TempTable_Repeats_By_Customer.
2. Add all the contacts from yesterday to a temp table: TempTable_Yesterdays_Contacts.
3. Open the cursor to start processing each contact (from step 2) in order of Contact_DateTime (ascending). At the same time I use TempTable_Repeats_By_Customer to identify whether the customer has already had a repeat, and whether it was within the eligible time frame.
4. If an existing repeat exists, retrieve the details from my existing reporting table and load a new row in.
5. If no existing repeat exists, check the full data table for other contacts received during the eligible period.
6. If there are more contacts from the same customer on a single day, go back and update TempTable_Repeats_By_Customer with the new details.
7. Either go to the next item in the cursor, or close and deallocate it.
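A purely illustrative skeleton of that loop; every object name here is hypothetical, taken from the step descriptions above rather than from real code:
-- Illustrative only: the shape of the cursor loop described above.
DECLARE @Customer_Key VARCHAR(10), @Contact_Reference VARCHAR(10), @Contact_DateTime DATETIME;

DECLARE contact_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT CUSTOMER_KEY, CONTACT_REFERENCE, CONTACT_DATETIME
    FROM TempTable_Yesterdays_Contacts
    ORDER BY CONTACT_DATETIME ASC;

OPEN contact_cursor;
FETCH NEXT FROM contact_cursor INTO @Customer_Key, @Contact_Reference, @Contact_DateTime;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Steps 3/4: check TempTable_Repeats_By_Customer for an eligible existing repeat
    -- and, if found, copy its details into the reporting table.
    -- Step 5: otherwise search the full contact table for contacts in the
    -- +2 hour to +28 day window.
    -- Step 6: update TempTable_Repeats_By_Customer when the same customer
    -- contacts again on the same day.
    FETCH NEXT FROM contact_cursor INTO @Customer_Key, @Contact_Reference, @Contact_DateTime;
END;

CLOSE contact_cursor;
DEALLOCATE contact_cursor;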
Any help you all can give is much appreciated.

Perhaps I am overlooking something, but I think you should be able to do this using the LAG() function.
IF OBJECT_ID('tempdb.dbo.#SourceData', 'U') IS NOT NULL
DROP TABLE #SourceData;
CREATE TABLE #SourceData
(
[CUSTOMER_KEY] VARCHAR(10)
, [CONTACT_REFERENCE] VARCHAR(10)
, [CONTACT_DATETIME] DATETIME
, [EXPECTED_RESULT] VARCHAR(50)
);
INSERT INTO #SourceData
(
[CUSTOMER_KEY]
, [CONTACT_REFERENCE]
, [CONTACT_DATETIME]
, [EXPECTED_RESULT]
)
VALUES
('1', '100', '2020-04-01 09:00', 'Original Contact')
, ('2', '101', '2020-04-01 10:00', 'Original Contact')
, ('3', '102', '2020-04-01 11:00', 'Original Contact')
, ('1', '103', '2020-04-01 12:00', 'Repeat of Contact Reference 100')
, ('1', '104', '2020-04-01 13:00', 'Not Repeat - within 2 hours of previous contact')
, ('2', '105', '2020-04-01 14:00', 'Repeat of Contact Reference 101')
, ('1', '106', '2020-04-01 15:00', 'Repeat of Contact Reference 103')
, ('1', '200', '2020-04-27 12:00', 'Repeat of Contact Reference 106');
SELECT x.CUSTOMER_KEY
, x.CONTACT_REFERENCE
, x.CONTACT_DATETIME
, x.EXPECTED_RESULT
, x.[Minutes Difference]
FROM (
SELECT
CUSTOMER_KEY
, CONTACT_REFERENCE
, CONTACT_DATETIME
, EXPECTED_RESULT
, DATEDIFF(
MINUTE
, LAG(CONTACT_DATETIME) OVER
(PARTITION BY CUSTOMER_KEY ORDER BY CONTACT_DATETIME)
, CONTACT_DATETIME
) AS [Minutes Difference]
FROM #SourceData
) x
WHERE x.[Minutes Difference] >= 120 -- at least 2 hours (120 minutes) after the previous contact
AND x.[Minutes Difference] <= 40320 -- and no more than 28 days (40320 minutes) after it
Here is the demo.
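For reference, with the 2-hour lower bound and 28-day upper bound, the subquery should keep exactly the repeats the question expects; contact 104 (only 60 minutes after 103) and the first contact of each customer fall out. Values computed by hand as a sanity check, not captured output:
CUSTOMER_KEY  CONTACT_REFERENCE  CONTACT_DATETIME  EXPECTED_RESULT                  Minutes Difference
1             103                2020-04-01 12:00  Repeat of Contact Reference 100  180
2             105                2020-04-01 14:00  Repeat of Contact Reference 101  240
1             106                2020-04-01 15:00  Repeat of Contact Reference 103  120
1             200                2020-04-27 12:00  Repeat of Contact Reference 106  37260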

The following code uses a recursive CTE to process the contacts in date/time order for each customer. Like Isaac's answer, it calculates a delta time in minutes, which may or may not be adequate resolution for your purposes.
NB: DateDiff "returns the count (as a signed integer value) of the specified datepart boundaries crossed". If you specify a datepart of day you'll get the number of midnights crossed, not the number of 24-hour periods. For example, Monday @ 23:00 to Wednesday @ 01:00 is 26 hours or two midnights, while Tuesday @ 01:00 to Wednesday @ 03:00 is still 26 hours, but only one midnight.
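A quick illustration of that boundary behaviour (2020-04-06 was a Monday):
-- Both spans are 26 hours long, but DATEDIFF(DAY, ...) counts midnights crossed.
SELECT DATEDIFF(DAY,    '2020-04-06 23:00', '2020-04-08 01:00') AS TwoMidnights  -- 2
     , DATEDIFF(DAY,    '2020-04-07 01:00', '2020-04-08 03:00') AS OneMidnight   -- 1
     , DATEDIFF(MINUTE, '2020-04-06 23:00', '2020-04-08 01:00') AS Minutes26h;   -- 1560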
declare @SourceData as Table ( Customer_Key Int, Contact_Reference Int, Contact_DateTime DateTime, Expected_Result VarChar(50) );
INSERT INTO @SourceData ([CUSTOMER_KEY], [CONTACT_REFERENCE], [CONTACT_DATETIME], [EXPECTED_RESULT])
VALUES ('1', '100', '2020-04-01 09:00', 'Original Contact'),
('2', '101', '2020-04-01 10:00', 'Original Contact'),
('3', '102', '2020-04-01 11:00', 'Original Contact'),
('1', '103', '2020-04-01 12:00', 'Repeat of Contact Reference 100'),
('1', '104', '2020-04-01 13:00', 'Not Repeat - within 2 hours of previous contact'),
('2', '105', '2020-04-01 14:00', 'Repeat of Contact Reference 101'),
('1', '106', '2020-04-01 15:00', 'Repeat of Contact Reference 103'),
('1', '200', '2020-04-27 12:00', 'Repeat of Contact Reference 106');
with
ContactsByCustomer as (
-- Add a row number to simplify processing the contacts for each customer in Contact_DateTime order.
select Customer_Key, Contact_Reference, Contact_DateTime, Expected_Result,
Row_Number() over ( partition by Customer_Key order by Contact_DateTime ) as RN
from @SourceData ),
ProcessedContacts as (
-- Process the contacts in date/time order for each customer.
-- Start with the first contact for each customer ...
select Customer_Key, Contact_Reference, Contact_DateTime, Expected_Result, RN,
Cast( 'Original Contact' as VarChar(100) ) as Computed_Result,
0 as Delta_Minutes
from ContactsByCustomer
where RN = 1
union all
-- ... and add each subsequent contact in date/time order.
select CBC.Customer_Key, CBC.Contact_Reference, CBC.Contact_DateTime, CBC.Expected_Result, CBC.RN,
Cast(
case
when PH.Delta_Minutes < 120 then
'No Repeat - within 2 hours of previous contact'
when 120 <= PH.Delta_Minutes and PH.Delta_Minutes <= 40320 then
'Repeat of Contact Reference ' + Cast( PC.Contact_Reference as VarChar(10) )
else
'Original Contact'
end
as VarChar(100) ),
PH.Delta_Minutes
from ProcessedContacts as PC inner join
ContactsByCustomer as CBC on CBC.Customer_Key = PC.Customer_Key and CBC.RN = PC.RN + 1 cross apply
-- Using cross apply makes it easy to use the calculated value as needed.
( select DateDiff( minute, PC.Contact_DateTime, CBC.Contact_DateTime ) as Delta_Minutes ) as PH
)
-- You can uncomment the select to see the intermediate results.
-- select * from ContactsByCustomer;
select *
from ProcessedContacts
order by Customer_Key, Contact_DateTime;
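One practical note: SQL Server caps recursive CTEs at 100 recursion levels by default, and here the recursion depth equals the maximum number of contacts per customer. If a customer can have more than 100 contacts in the data, add a MAXRECURSION hint to the final select:
select *
from ProcessedContacts
order by Customer_Key, Contact_DateTime
option (maxrecursion 0); -- 0 = unlimited; a suitable finite cap is safer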

Related

Need help in Postgres Conversion

Hello guys, I am trying to convert the following script from MS SQL to PostgreSQL, but I am unable to convert the starred lines in the script below.
WITH procedurerange_cte AS (
select distinct
accountsize as HospitalSize
,min(catprocsannualintegratedallpayer) over (partition by accountsize) as MinProcsPerAcct
,max(catprocsannualintegratedallpayer) over (partition by accountsize) as MaxProcsPerAcct
from sandbox.vw_hopd_universe_1_ms
group by accountsize,catprocsannualintegratedallpayer
), accts_cte AS (
select
accountsize as HospitalSize
,count(master_id) as Count
,sum(catprocsannualintegratedallpayer) as catprocsannualintegratedallpayer
from sandbox.vw_hopd_universe_1_ms
group by accountsize
), allcatprocs_cte AS (
select
sum(catprocsannualintegratedallpayer) as AllAnnCatProcs
from sandbox.accts_universeaccts
), totals_cte AS (
select
case when HospitalSize is null then 'Total' else HospitalSize end as HospitalSize
,sum(Count) as Count
,sum(catprocsannualintegratedallpayer) as catprocsannualintegratedallpayer
from accts_cte
group by grouping sets ((HospitalSize,Count,catprocsannualintegratedallpayer),())
)
select
a.HospitalSize
,a.Count
***--,convert(float,a.Count)/convert(float,(select Count from totals_cte where HospitalSize='Total')) as %OfHospitals***
,a.catprocsannualintegratedallpayer as HospitalAnnCatProcs
***--,a.catprocsannualintegratedallpayer/(select catprocsannualintegratedallpayer from totals_cte where HospitalSize='Total') as %OfHospProcs***
***--,a.catprocsannualintegratedallpayer/(select AllAnnCatProcs from allCatProcs_cte) as %OfAllProcs***
,MinProcsPerAcct
,MaxProcsPerAcct
,***CASE
when a.HospitalSize='Large' then '8 to 10'
when a.HospitalSize='Medium' then '5 to 7'
when a.HospitalSize='Small' then '0 to 4'
end as DecilesIncluded***
from totals_cte as a
left join procedurerange_cte as b
on a.HospitalSize=b.HospitalSize
Please help in converting the above script to PostgreSQL, as I am new to this field.
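A minimal sketch of how the starred lines might look in PostgreSQL, assuming the surrounding CTEs stay as they are: convert(float, x) becomes a cast (x::float or cast(x as double precision)), and column aliases containing % must be double-quoted. The starred CASE expression is already valid PostgreSQL as written.
,a.Count::float / (select Count from totals_cte where HospitalSize='Total') as "%OfHospitals"
,a.catprocsannualintegratedallpayer::float / (select catprocsannualintegratedallpayer from totals_cte where HospitalSize='Total') as "%OfHospProcs"
,a.catprocsannualintegratedallpayer::float / (select AllAnnCatProcs from allcatprocs_cte) as "%OfAllProcs"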

Subtracting values by id and a previous counter

Here is a snippet of my data:
customers  order_id  order_date  order_counter
1          a         1/1/2018    1
1          b         1/4/2018    2
1          c         3/8/2018    3
1          d         4/9/2018    4
I'm trying to get the average number of days between orders for each customer. For the snippet above, the average should be about 32.67 days: there were 3, 63, and 32 days between consecutive orders, so sum those and divide by 3.
My data has customers that may have more than 100 orders.
You could use the LAG function:
WITH cte AS (
SELECT customers,order_date-LAG(order_date) OVER(PARTITION BY customers ORDER BY order_counter) AS d
FROM t
)
SELECT customers, AVG(d)
FROM cte
WHERE d IS NOT NULL
GROUP BY customers;
db<>fiddle demo
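If you want the same two-decimal figure the other answers show, round the aggregate in the final select (same cte as above):
SELECT customers, ROUND(AVG(d), 2) AS averagedays
FROM cte
WHERE d IS NOT NULL
GROUP BY customers;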
With a self join, group by customer and get the average difference:
select
t.customers,
round(avg(tt.order_date - t.order_date), 2) averagedays
from tablename t inner join tablename tt
on tt.customers = t.customers and tt.order_counter = t.order_counter + 1
group by t.customers
See the demo.
Results:
| customers | averagedays |
| --------- | ----------- |
| 1 | 32.67 |
Please check the query below.
I inserted data for two customers so that we can check that the average for every customer comes out correct.
DB Fiddle Example: https://www.db-fiddle.com/
CREATE TABLE test (
customers INTEGER,
order_id VARCHAR(1),
order_date DATE,
order_counter INTEGER
);
INSERT INTO test
(customers, order_id, order_date, order_counter)
VALUES
('1', 'a', '2018-01-01', '1'),
('1', 'b', '2018-01-04', '2'),
('1', 'c', '2018-03-08', '3'),
('1', 'd', '2018-04-09', '4'),
('2', 'a', '2018-01-01', '1'),
('2', 'b', '2018-01-06', '2'),
('2', 'c', '2018-03-12', '3'),
('2', 'd', '2018-04-15', '4');
commit;
select customers , round(avg(next_order_diff),2) as average
from
(
select customers , order_date , next_order_date - order_date as next_order_diff
from
(
select customers ,
lead(order_date) over (partition by customers order by order_date) as next_order_date , order_date
from test
) a
where next_order_date is not null
) a
group by customers
order by customers
;
Another option. I myself like the answer from @forpas, except that it depends on the monotonically increasing value of order_counter (what happens when an order is deleted?). The following accounts for that by actually counting the number of order pairs. It also accounts for customers that have placed only one order, returning NULL as the average.
select customers, round(sum(nd)::numeric/n, 2) avg_days_to_order
from (
select customers
, order_date - lag(order_date) over(partition by customers order by order_counter) nd
, count(*) over (partition by customers) - 1 n
from test
)d
group by customers, n
order by customers;

How to get the data of nearest date when comparing two tables

I have two tables that need to be joined on the nearest date (nearest before date).
Screenshot of My Requirement
For example: if the date in Table1 is 6/19/2018 (M/DD/YYYY), then I would like to get the data for the nearest earlier date from Table2 (if that table has 07/19/2018, 06/20/2018 and 06/16/2018, I would like to get the 06/16/2018 record's information).
I have multiple records in table1 and want to get the nearest date record in form from the table2. Please see the image for more info about my requirement. Thank you in advance for your help.
This assumes that you must do it for every customer distinctly (the customer column is a key in the example). If you have another key (say customer plus item_name; the item_name column is not shown, so add it manually in that case), then change the corresponding predicates (to a2.customer=x.customer and a2.item_name=x.item_name in the example). If you don't want to do it per customer, just remove the a.customer=x.customer and a2.customer=x.customer predicates.
You can run the statement below AS IS to check.
with
xyz (customer, req_del_date) as (values
('ABC', date('2018-06-19'))
, ('ABC', date('2018-09-04'))
, ('ABC', date('2018-04-24'))
, ('ABC', date('2018-03-17'))
)
, abc (customer, actual_del_date) as (values
('ABC', date('2018-11-20'))
, ('ABC', date('2018-06-12'))
, ('ABC', date('2018-05-09'))
, ('ABC', date('2018-04-27'))
, ('ABC', date('2018-04-14'))
, ('ABC', date('2017-12-31'))
, ('ABC', date('2017-12-30'))
)
select x.customer, x.req_del_date, a.actual_del_date, a.diff_days
from xyz x, table (
select a.customer, a.actual_del_date
, days(x.req_del_date) - days(actual_del_date) diff_days -- just for test
-- other columns from abc if needed
from abc a
where a.customer=x.customer and x.req_del_date>=a.actual_del_date
and (days(x.req_del_date) - days(a.actual_del_date)) =
(
select min(days(x.req_del_date) - days(a2.actual_del_date))
from abc a2
where a2.customer=x.customer and x.req_del_date>=a2.actual_del_date
)
) a;
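As a side note: on platforms with lateral joins and FETCH FIRST (PostgreSQL 9.3+, for example; Db2 has similar constructs), the same "nearest earlier row" lookup can be written more directly. A sketch using the same sample names:
select x.customer, x.req_del_date, a.actual_del_date
from xyz x
left join lateral (
    select a2.actual_del_date
    from abc a2
    where a2.customer = x.customer
      and a2.actual_del_date <= x.req_del_date
    order by a2.actual_del_date desc
    fetch first 1 row only
) a on true;  -- the left join keeps requests with no earlier delivery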

query linear timeline result based on itemevents date ranges and highest priority, help improve existing query

PostgreSQL 9.1
I need help improving this query, or new ideas on how to get the results I need.
Below is a simple example query which generates a linear timeline of event ranges, or call them tasks.
For example, I have tasks to wear certain colour t-shirts on certain dates or weeks. There are many tasks and many overlap; since this summer is really hot, I can't wear multiple shirts at the same time, can I? :P So the shirts are prioritized for some reason, and what I need is to generate a simple linear list of the tasks I must perform.
drop table if exists temp_box;
create temp table temp_box(id integer ,event_id integer,event_description text, priority integer , date_from date , date_to date);
insert into temp_box values(333,1, 'white shirt', 10, '2015-01-01' , '3000-01-01');
insert into temp_box values(333,22, 'green shirt', 8, '2015-01-05' , '2015-01-20');
insert into temp_box values(333,13, 'red shirt', 7, '2015-02-03' , '2015-05-10');
insert into temp_box values(333,2, 'grey shirt', 6, '2015-02-11' , '2015-04-01');
insert into temp_box values(333,104, 'blue blouse', 4, '2015-03-01' , '2015-03-11');
insert into temp_box values(333,6, 'nothing', 2, '2015-04-10' , '2015-04-12');
WITH days AS (
SELECT '2015-01-01'::date + aa.aa AS cday
FROM generate_series(0, 365) aa(aa)
), m1 AS (
SELECT q1.event_id,
min(q1.cday) AS min,
max(q1.cday) AS max,
q1.zz,
q1.id
FROM ( SELECT q1_1.event_id,
q1_1.cday,
q1_1.id,
q1_1.row_number,
q1_1.lg,
sum(
CASE
WHEN q1_1.lg IS NULL OR q1_1.lg <> q1_1.event_id THEN 1
ELSE 0
END) OVER (PARTITION BY q1_1.id ORDER BY q1_1.cday) AS zz
FROM ( SELECT q1_2.event_id,
q1_2.cday,
q1_2.id,
q1_2.row_number,
lag(q1_2.event_id) OVER (PARTITION BY q1_2.id ORDER BY q1_2.cday) AS lg
FROM ( SELECT temp_box_1.event_id,
days.cday,
temp_box_1.id,
row_number() OVER (PARTITION BY days.cday, temp_box_1.id ORDER BY temp_box_1.priority) AS row_number
FROM days,
temp_box temp_box_1
WHERE days.cday between temp_box_1.date_from AND temp_box_1.date_to
ORDER BY days.cday, temp_box_1.id, temp_box_1.priority) q1_2
WHERE q1_2.row_number = 1) q1_1) q1
GROUP BY q1.zz, q1.id, q1.event_id
)
SELECT m1.id,
m1.event_id,
((((temp_box.event_description || ' (from '::text) || to_char(m1.min::date, 'yyyy.MM.dd'::text)) || ' to '::text) || to_char(m1.max::date, 'yyyy.MM.dd'::text)) || ')'::text AS info
FROM m1,
temp_box
WHERE m1.event_id = temp_box.event_id
ORDER BY m1.zz
Results:
id event_id info
333 1 "white shirt (from 2015.01.01 to 2015.01.04)"
333 22 "green shirt (from 2015.01.05 to 2015.01.20)"
333 1 "white shirt (from 2015.01.21 to 2015.02.02)"
333 13 "red shirt (from 2015.02.03 to 2015.02.10)"
333 2 "grey shirt (from 2015.02.11 to 2015.02.28)"
333 104 "blue blouse (from 2015.03.01 to 2015.03.11)"
333 2 "grey shirt (from 2015.03.12 to 2015.04.01)"
333 13 "red shirt (from 2015.04.02 to 2015.04.09)"
333 6 "nothing (from 2015.04.10 to 2015.04.12)"
333 13 "red shirt (from 2015.04.13 to 2015.05.10)"
333 1 "white shirt (from 2015.05.11 to 2016.01.01)"
Here is the query EXPLAIN (execution plan).
In the example the query performs fast, but it generates data for only one person.
The query must calculate data for 50,000 persons in a single call, and then it takes too long: joining all the dates and discarding most rows, then running sorts and aggregates, is way too expensive.
Terrible amateur that I am, I can't figure out how to do this differently, and because of my poor English, mother Google does not know what I really want.
There must be a simpler and more efficient way to accomplish this.
I am open to any suggestions, thanks.
Edit 2015-07-09:
The execution plan is from the sample data; the real query is quite large and too complex to post and reproduce here, and EXPLAIN on it won't be any more useful than this example's, where you can already see why it is so heavy.
The example is for one person: the id field with value 333 represents one person's id. In the real case it sometimes needs to be calculated for 50,000 people with uncertain counts of "shirts" per person, so the base dataset would average around 50000 * 5 = 250,000 rows, and the secondary dataset joining the dates would be 250k * 366 days = 91.5M rows, and that's just for one year! Sorting and aggregating on that large a dataset is quite slow; the memory required for sorting is not the main problem, it still fits in RAM.
I guess I could do "bucketing", or whatever it's called: group-aggregate the initial dataset by person into a custom type of array per person, and pass that to a function which joins with the dates and does the necessary calculations. That would eliminate the large memory consumption and the overhead of sorting and aggregating this large dataset. But it involves creating custom types and functions, which I would rather not do if I can avoid it.
I wish there were some other way to calculate those task-event timeline slices ... without joining all the dates and generating a needlessly large number of rows to process.
This is a bit of a late answer that I came up with on my own a while ago, but better late than never ;)
After upgrading PostgreSQL to version 9.3, I had the opportunity and the need to try and test range types. I came up with some ideas and revisited this old problem with its (later improved, but still) slow query.
I wrote a function where you pass two arrays with equal element counts as input. The first is a bigint[] of event_id values; the second is a daterange[] array which contains each task's active period accordingly. Event priority is implemented by the order of the aggregated data: the first event_id in the array "takes", i.e. reserves, its active daterange from the timeline; the second takes and reserves the range or two ranges not already occupied by the first; the third takes what is left available after the previous two; and so on...
The function returns a bigint event_id and the array of ranges each task could acquire.
In the end, using this approach the query performed more than 11 times faster on large datasets, reducing execution time from a couple of minutes to several seconds.
Example Function:
CREATE OR REPLACE FUNCTION a2_daterange_timeline(IN i_id bigint[], IN i_ranges daterange[])
RETURNS TABLE(id bigint, ranges daterange[]) AS
$BODY$
declare
r record;
base_range daterange;
unocupied_ranges daterange[];
unocupied_ranges_temp daterange[];
target_range daterange;
overlap boolean;
to_right boolean;
to_left boolean;
uno daterange[];
ocu daterange[];
iii integer;
begin
overlap := false;
to_right := false;
to_left := false;
base_range := '[2000-01-01,3000-01-01]'::daterange;
unocupied_ranges := array[ base_range ];
FOR r IN SELECT unnest(i_id) id, unnest(i_ranges) zz
LOOP
unocupied_ranges_temp := array[] ::daterange[] ;
FOR iii IN 1..array_upper( unocupied_ranges,1) LOOP
target_range := r.zz ::daterange;
overlap := target_range && unocupied_ranges[iii];
to_right := not target_range &< unocupied_ranges[iii];
to_left := not target_range &> unocupied_ranges[iii];
uno :=case
when not overlap then array[unocupied_ranges[iii]]
when overlap and not to_right and not to_left then array[ daterange (lower(unocupied_ranges[iii]),lower(target_range),'[)') , daterange (upper(target_range),upper(unocupied_ranges[iii]),'[)') ]
when overlap and to_right and not to_left then array[ daterange (lower(unocupied_ranges[iii]),lower(target_range),'[)') ]
when overlap and not to_right and to_left then array[ daterange (upper(target_range),upper(unocupied_ranges[iii]),'[)') ]
when overlap and to_right and to_left then array[ ]::daterange[] end ;
unocupied_ranges_temp:= array_cat( unocupied_ranges_temp ,uno);
ocu :=case
when not overlap then array[ ] ::daterange[]
when overlap and not to_right and not to_left then array[ target_range ]
when overlap and to_right and not to_left then array[ daterange (lower(target_range),upper(unocupied_ranges[iii]),'[)') ]
when overlap and not to_right and to_left then array[ daterange (lower(unocupied_ranges[iii]),upper(target_range),'[)') ]
when overlap and to_right and to_left then array[ unocupied_ranges[iii] ] end ;
ranges := ocu;
if not ranges = array[]::daterange[] then
id := r.id;
return next;
end if;
END LOOP;
unocupied_ranges :=unocupied_ranges_temp;
END LOOP;
RETURN;
end;
$BODY$
LANGUAGE plpgsql IMMUTABLE
COST 100
ROWS 20;
Resulting query:
drop table if exists temp_box;
create temp table temp_box(person_id integer ,event_id integer,event_description text, priority integer , date_from date , date_to date);
insert into temp_box values(333,1, 'white shirt', 10, '2015-01-01' , '3000-01-01');
insert into temp_box values(333,22, 'green shirt', 8, '2015-01-05' , '2015-01-20');
insert into temp_box values(333,13, 'red shirt', 7, '2015-02-03' , '2015-05-10');
insert into temp_box values(333,2, 'grey shirt', 6, '2015-02-11' , '2015-04-01');
insert into temp_box values(333,104, 'blue blouse', 4, '2015-03-01' , '2015-03-11');
insert into temp_box values(333,6, 'nothing', 2, '2015-04-10' , '2015-04-12');
with
a as (select * from temp_box order by person_id,priority)
-- ordering by person and event priority
, b as (select person_id, array_agg(event_id) a, array_agg(daterange(coalesce(date_from,'2000-01-01') , coalesce(date_to,'3000-01-01'),'[]')) b from a temp_box group by person_id )
--aggregate events into arrays to pass into function and calculate available dateranges for each event
, c as (select (a2_daterange_timeline(a,b)).* from b )
--calculating data
, d as (select id as r_id, unnest (ranges) as ranges from c)
--unnesting function results into individual daterange slices
, e as (select *,row_number() over (partition by temp_box.person_id order by ranges) zz from temp_box left join d on temp_box.event_id = d.r_id where upper(ranges)-1 >= '2015-01-01' and lower(ranges) < '2015-01-01'::date +365 order by ranges)
-- joining calculated data to an initial dataset and filtering desired period
select
person_id,
event_id,
(((( event_description || ' (from '::text) || to_char( lower(ranges) , 'yyyy.MM.dd'::text)) || ' to '::text) || to_char(upper(ranges)-1 , 'yyyy.MM.dd'::text)) || ')'::text AS info
from e
This approach with some modifications might be used with other range types. Hope this answer will be helpful to someone else as well.
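For anyone unfamiliar with the range operators the function leans on: && is "overlaps", &< is "does not extend to the right of", and &> is "does not extend to the left of". A quick illustration with daterange:
SELECT daterange('2015-01-05','2015-01-20') && daterange('2015-01-10','2015-02-01') AS overlaps,        -- true
       daterange('2015-01-05','2015-01-20') &< daterange('2015-01-10','2015-02-01') AS no_right_spill,  -- true
       daterange('2015-01-05','2015-01-20') &> daterange('2015-01-10','2015-02-01') AS no_left_spill;   -- false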

PostgreSQL, Custom aggregate

Is there a way to get a function, like a custom aggregate, when MAX and SUM are not enough to get the result?
Here is my table:
DROP TABLE IF EXISTS temp1;
CREATE TABLE temp1(mydate text, code int, price decimal);
INSERT INTO temp1 (mydate, code, price) VALUES
('01.01.2014 14:32:11', 1, 9.75),
( '', 1, 9.99),
( '', 2, 40.13),
('01.01.2014 09:12:04', 2, 40.59),
( '', 3, 18.10),
('01.01.2014 04:13:59', 3, 18.20),
( '', 4, 10.59),
('01.01.2014 15:44:32', 4, 10.48),
( '', 5, 8.19),
( '', 5, 8.24),
( '', 6, 11.11),
('04.01.2014 10:22:35', 6, 11.09),
('01.01.2014 11:48:15', 6, 11.07),
('01.01.2014 22:18:33', 7, 22.58),
('03.01.2014 13:15:40', 7, 21.99),
( '', 7, 22.60);
Here is the query for getting the result:
SELECT code,
ROUND(AVG(price), 2),
MAX(price)
FROM temp1
GROUP BY code
ORDER BY code;
In short:
I have to get the LAST price by date (stored as text) for every grouped code if a date exists; otherwise (if no date is written) the price should be 0.
The LAST column shows the wanted result, with AVG and MAX included for illustration:
CODE   LAST    AVG     MAX
------------------------------
1       9.75    9.87    9.99
2      40.59   40.36   40.59
3      18.20   18.15   18.20
4      10.48   10.54   10.59
5       0.00    8.22    8.24
6      11.09   11.09   11.11
7      21.99   22.39   22.60
How would I get wanted result?
How that query would look like?
EDITED
I simply had to try IMSoP's advice and use the custom aggregate functions first/last.
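For reference, first()/last() are not built in; the PostgreSQL wiki defines them as ordinary custom aggregates, roughly like this for last():
CREATE OR REPLACE FUNCTION last_agg(anyelement, anyelement)
RETURNS anyelement
LANGUAGE sql IMMUTABLE STRICT AS
'SELECT $2';

CREATE AGGREGATE last (anyelement) (
    SFUNC = last_agg,
    STYPE = anyelement
);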
SELECT code,
CASE WHEN MAX(mydate)<>'' THEN
(SELECT last(price ORDER BY TO_TIMESTAMP(mydate, 'DD.MM.YYYY HH24:MI:SS')))
ELSE
0
END AS "LAST",
ROUND(AVG(price), 2) AS "AVG",
MAX(price) AS "MAX"
FROM temp1
GROUP BY code
ORDER BY code;
With this simple query I get the same results as with Mike's complex query. What's more, it handles duplicate (identical) entries in the mydate column better, and it is faster.
Is this possible? It looks similar to 'SELECT * FROM magic()' :)
You said in comments that one code can have two rows with the same date. So this is sane data.
01.01.2014 1 3.50
01.01.2014 1 17.25
01.01.2014 1 99.34
There's no deterministic way to tell which of those rows is the "last" one, even if you sort by code and "date". (In the relational model, a model based on mathematical sets, the order of columns is irrelevant, and the order of rows is irrelevant.) The query optimizer is free to return rows in the way it thinks best, so this query
select *
from temp1
order by mydate, code
might return this on one run,
01.01.2014 1 3.50
01.01.2014 1 17.25
01.01.2014 1 99.34
and this on another.
01.01.2014 1 3.50
01.01.2014 1 99.34
01.01.2014 1 17.25
Unless you store some value that makes the meaning of last obvious, what you're trying to do isn't possible. When people need to make last obvious, they usually use a timestamp.
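For example (hypothetical column name), a timestamp column with a default makes insertion order recoverable:
ALTER TABLE temp1 ADD COLUMN inserted_at timestamptz NOT NULL DEFAULT now();
-- "last" per code is then simply the row with max(inserted_at) for that code.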
After your changes, this query seems to return what you're looking for.
with distinct_codes as (
select distinct code
from temp1
),
corrected_table as (
select
case when mydate <> '' then TO_TIMESTAMP(mydate, 'DD.MM.YYYY HH24:MI:SS')
else null
end as mydate,
code,
price
from temp1
),
max_dates as (
select code, max(mydate) max_date
from corrected_table
group by code
)
select c1.mydate, d1.code, coalesce(c1.price, 0)
from corrected_table c1
inner join max_dates m1
on m1.code = c1.code
and m1.max_date = c1.mydate
right join distinct_codes d1
on d1.code = c1.code
order by code;