PostgreSQL, Custom aggregate

Is there a way to create something like a custom aggregate function when MAX and SUM are not enough to get the result?
Here is my table:
DROP TABLE IF EXISTS temp1;
CREATE TABLE temp1(mydate text, code int, price decimal);
INSERT INTO temp1 (mydate, code, price) VALUES
('01.01.2014 14:32:11', 1, 9.75),
( '', 1, 9.99),
( '', 2, 40.13),
('01.01.2014 09:12:04', 2, 40.59),
( '', 3, 18.10),
('01.01.2014 04:13:59', 3, 18.20),
( '', 4, 10.59),
('01.01.2014 15:44:32', 4, 10.48),
( '', 5, 8.19),
( '', 5, 8.24),
( '', 6, 11.11),
('04.01.2014 10:22:35', 6, 11.09),
('01.01.2014 11:48:15', 6, 11.07),
('01.01.2014 22:18:33', 7, 22.58),
('03.01.2014 13:15:40', 7, 21.99),
( '', 7, 22.60);
Here is the query for getting a result:
SELECT code,
ROUND(AVG(price), 2),
MAX(price)
FROM temp1
GROUP BY code
ORDER BY code;
In short:
I have to get the LAST price by date (stored as text) for every grouped code if a date exists; otherwise (if the date isn't written) the price should be 0.
The LAST column shows the wanted result; AVG and MAX are included for illustration:
CODE LAST AVG MAX
------------------------------
1 9.75 9.87 9.99
2 40.59 40.36 40.59
3 18.20 18.15 18.20
4 10.48 10.54 10.59
5 0.00 8.22 8.24
6 11.09 11.09 11.11
7 21.99 22.39 22.60
How would I get the wanted result?
What would that query look like?
EDITED
I simply had to follow IMSoP's advice to update and use the custom aggregate functions first/last.
SELECT code,
       CASE WHEN MAX(mydate) <> '' THEN
                (SELECT last(price ORDER BY TO_TIMESTAMP(mydate, 'DD.MM.YYYY HH24:MI:SS')))
            ELSE
                0
       END AS "LAST",
       ROUND(AVG(price), 2) AS "AVG",
       MAX(price) AS "MAX"
FROM temp1
GROUP BY code
ORDER BY code;
With this simple query I get the same results as with Mike's more complex query.
What's more, it handles duplicate entries in the mydate column better, and it is faster.
Is this possible? It looks like 'SELECT * FROM magic()' :)
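For reference, first/last are not built-in PostgreSQL aggregates; they have to be created once per database. A minimal sketch of last(), along the lines of the definition published on the PostgreSQL wiki (first() is the same with SELECT $1), looks like this:

-- Transition function: always keep the latest non-NULL value seen so far.
CREATE OR REPLACE FUNCTION last_agg(anyelement, anyelement)
RETURNS anyelement
LANGUAGE sql IMMUTABLE STRICT AS
$$ SELECT $2; $$;

-- Wrap it in an aggregate named last().
CREATE AGGREGATE last (anyelement) (
    sfunc = last_agg,
    stype = anyelement
);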

You said in the comments that one code can have two rows with the same date. So this is sane data:
01.01.2014 1 3.50
01.01.2014 1 17.25
01.01.2014 1 99.34
There's no deterministic way to tell which of those rows is the "last" one, even if you sort by code and "date". (In the relational model--a model based on mathematical sets--the order of columns is irrelevant, and the order of rows is irrelevant.) The query optimizer is free to return rows in whatever way it thinks best, so this query
select *
from temp1
order by mydate, code
might return this on one run,
01.01.2014 1 3.50
01.01.2014 1 17.25
01.01.2014 1 99.34
and this on another.
01.01.2014 1 3.50
01.01.2014 1 99.34
01.01.2014 1 17.25
Unless you store some value that makes the meaning of last obvious, what you're trying to do isn't possible. When people need to make last obvious, they usually use a timestamp.
After your changes, this query seems to return what you're looking for.
with distinct_codes as (
    select distinct code
    from temp1
),
corrected_table as (
    select
        case when mydate <> '' then TO_TIMESTAMP(mydate, 'DD.MM.YYYY HH24:MI:SS')
             else null
        end as mydate,
        code,
        price
    from temp1
),
max_dates as (
    select code, max(mydate) max_date
    from corrected_table
    group by code
)
select c1.mydate, d1.code, coalesce(c1.price, 0)
from corrected_table c1
inner join max_dates m1
    on m1.code = c1.code
    and m1.max_date = c1.mydate
right join distinct_codes d1
    on d1.code = c1.code
order by code;
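For comparison, the same "latest dated price per code, otherwise 0" logic can also be sketched with PostgreSQL's DISTINCT ON. This is an alternative sketch against the temp1 table above, not part of the original answer:

-- DISTINCT ON keeps one row per code: the one with the latest parsed timestamp.
SELECT d.code,
       COALESCE(l.price, 0) AS "LAST"
FROM (SELECT DISTINCT code FROM temp1) d
LEFT JOIN (
    SELECT DISTINCT ON (code) code, price
    FROM temp1
    WHERE mydate <> ''
    ORDER BY code, TO_TIMESTAMP(mydate, 'DD.MM.YYYY HH24:MI:SS') DESC
) l ON l.code = d.code
ORDER BY d.code;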

Related

Get the next (or previous) non-null value in multiple partitioned columns

Sample data below.
I want to clean up data based on the next non-null value for the same id, ordered by row (actually a timestamp).
I can't use LAG(), because in some cases there are consecutive nulls.
I can't do coalesce(a.col_a, (select min(b.col_a) from table b where a.id=b.id)) because it will return an "outdated" value (e.g. NYC instead of SF in col_a row 4). (I can do this, once I've accounted for everything else, for the cases where I have no next non-null value, like col_b rows 9/10, to just fill in the last value.)
The only thing I can think of is to do
table_x as (select id, col_x from table where col_a is not null)
for each column, and then join taking the minimum where id = id and table_x.row > table.row. But I have a handful of columns and that feels cumbersome and inefficient.
Appreciate any help!
| row | id | col_a | col_a_desired | col_b | col_b_desired |
|-----|----|-------|---------------|-------|---------------|
| 0   | 1  | -     | NYC           | red   | red           |
| 1   | 1  | NYC   | NYC           | red   | red           |
| 2   | 1  | SF    | SF            | -     | blue          |
| 3   | 1  | -     | SF            | -     | blue          |
| 4   | 1  | SF    | SF            | blue  | blue          |
| 5   | 2  | PAR   | PAR           | red   | red           |
| 6   | 2  | LON   | LON           | -     | blue          |
| 7   | 2  | LON   | LON           | -     | blue          |
| 8   | 2  | -     | LON           | blue  | blue          |
| 9   | 2  | LON   | LON           | -     | blue          |
| 10  | 2  | -     | LON           | -     | blue          |
Can you try this query?
WITH samp AS (
    SELECT 0 row_id, 1 id, NULL col_a, 'red' col_b UNION ALL
    SELECT 1, 1, 'NYC', 'red' UNION ALL
    SELECT 2, 1, 'SF', NULL UNION ALL
    SELECT 3, 1, NULL, NULL UNION ALL
    SELECT 4, 1, 'SF', 'blue' UNION ALL
    SELECT 5, 2, 'PAR', 'red' UNION ALL
    SELECT 6, 2, 'LON', NULL UNION ALL
    SELECT 7, 2, 'LON', NULL UNION ALL
    SELECT 8, 2, NULL, 'blue' UNION ALL
    SELECT 9, 2, 'LON', NULL UNION ALL
    SELECT 10, 2, NULL, NULL
)
SELECT
    row_id,
    id,
    IFNULL(FIRST_VALUE(col_a IGNORE NULLS)
               OVER (PARTITION BY id ORDER BY row_id
                     ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
           FIRST_VALUE(col_a IGNORE NULLS)
               OVER (PARTITION BY id ORDER BY row_id DESC
                     ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) AS col_a,
    IFNULL(FIRST_VALUE(col_b IGNORE NULLS)
               OVER (PARTITION BY id ORDER BY row_id
                     ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
           FIRST_VALUE(col_b IGNORE NULLS)
               OVER (PARTITION BY id ORDER BY row_id DESC
                     ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) AS col_b
FROM samp
ORDER BY id, row_id
References:
https://cloud.google.com/bigquery/docs/reference/standard-sql/navigation_functions#first_value
https://cloud.google.com/bigquery/docs/reference/standard-sql/window-function-calls
"I want to clean up data based on the next non-null value."
So if you reverse the order, that's the last non-null value.
If you have multiple columns and the logic is too cumbersome to write in SQL, you can write it in plpgsql instead, or even use the scripting language of your choice (but that will be slower).
The idea is to open a cursor for update, with an ORDER BY in the reverse order mentioned in the question. Then the plpgsql code stores the last non-null values in variables, and if needed issues an UPDATE WHERE CURRENT OF cursor to replace the nulls in the table with desired values.
This may take a while, and the numerous updates will take a lot of locks. It looks like your data can be processed in independent chunks using the "id" column as chunk identifier, so it would be a good idea to use that.
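A minimal plpgsql sketch of that cursor approach follows. It assumes the sample rows live in a real table samp(row_id int, id int, col_a text, col_b text); the table and column names are taken from the other answer's sample data, not from the original question:

DO $$
DECLARE
    -- Reverse order: walk each id from its last row back to its first,
    -- so the "next" non-null value becomes the last one we have seen.
    cur CURSOR FOR
        SELECT row_id, id, col_a, col_b
        FROM samp
        ORDER BY id, row_id DESC
        FOR UPDATE;
    rec        RECORD;
    last_id    INT;
    last_col_a TEXT;
    last_col_b TEXT;
BEGIN
    FOR rec IN cur LOOP
        -- Reset the remembered values when a new id starts.
        IF last_id IS DISTINCT FROM rec.id THEN
            last_col_a := NULL;
            last_col_b := NULL;
            last_id    := rec.id;
        END IF;

        -- Fill NULLs from the values remembered so far (trailing NULLs
        -- with no following non-null value stay NULL here).
        IF rec.col_a IS NULL OR rec.col_b IS NULL THEN
            UPDATE samp
            SET col_a = COALESCE(rec.col_a, last_col_a),
                col_b = COALESCE(rec.col_b, last_col_b)
            WHERE CURRENT OF cur;
        END IF;

        -- Remember the latest non-null values seen so far.
        last_col_a := COALESCE(rec.col_a, last_col_a);
        last_col_b := COALESCE(rec.col_b, last_col_b);
    END LOOP;
END;
$$;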

Divide table raw into chunks in Postgres with st_dwithin limit

I have a table with linestrings that I want to divide into chunks, each holding a list of ids no longer than a provided number, keeping only lines that are within a certain distance.
For example, I have a table with 14 rows
create table lines ( id integer primary key, geom geometry(linestring) );
insert into lines (id, geom) values ( 1, 'LINESTRING(0 0, 0 1)');
insert into lines (id, geom) values ( 2, 'LINESTRING(0 1, 1 1)');
insert into lines (id, geom) values ( 3, 'LINESTRING(1 1, 1 2)');
insert into lines (id, geom) values ( 4, 'LINESTRING(1 2, 2 2)');
insert into lines (id, geom) values ( 11, 'LINESTRING(2 2, 2 3)');
insert into lines (id, geom) values ( 12, 'LINESTRING(2 3, 3 3)');
insert into lines (id, geom) values ( 13, 'LINESTRING(3 3, 3 4)');
insert into lines (id, geom) values ( 14, 'LINESTRING(3 4, 4 4)');
create index lines_gix on lines using gist(geom);
I want to split it into chunks of 3 ids each, where the lines are within 2 meters of each other or of the first one.
The result I am trying to get from this example is:
| Chunk No.| Id chunk list |
|----------|----------------|
| 1 | 1, 2, 3 |
| 2 | 4, 5, 6 |
| 3 | 7, 8, 9 |
| 4 | 10, 11, 12 |
| 5 | 13, 14 |
I tried to use ST_ClusterWithin, but when lines are close to each other it returns all of them, not split into chunks.
I also tried to use some WITH RECURSIVE magic like the one from the answer provided by Paul Ramsey here. But I don't know how to modify the query to return a limited, grouped id list.
I am not sure if it is the best possible answer, so if anyone has a better method or knows how to improve the provided answer, feel free to update it. With a little modification of Paul's answer, I've managed to create the following queries that do what I asked for.
-- Create function for easier interaction
CREATE OR REPLACE FUNCTION find_connected(integer, double precision, integer, integer[])
returns integer[] AS
$$
WITH RECURSIVE lines_r AS -- recursion lets the query reuse its own output: results keep getting appended and fed back into the query
(SELECT ARRAY[id] AS idlist,
geom, id
FROM lines
WHERE id = $1
UNION ALL
SELECT array_append(lines_r.idlist, lines.id) AS idlist, -- append id list to array
lines.geom AS geom, -- keep geometry
lines.id AS id -- keep source table id
FROM (SELECT * FROM lines WHERE NOT $4 #> array[id]) lines, lines_r -- from source table and recursive table
WHERE ST_DWITHIN(lines.geom, lines_r.geom, $2) -- where lines are within 2 meters
AND NOT lines_r.idlist #> ARRAY[lines.id] -- recursive id list array not contain lines array
AND array_length(idlist, 1) <= $3
)
SELECT idlist
FROM lines_r WHERE array_length(idlist, 1) <= $3 ORDER BY array_length(idlist, 1) DESC LIMIT 1;
$$
LANGUAGE 'sql';
-- Create id chunks
WITH RECURSIVE groups_r AS (
(SELECT find_connected(id, 2, 3, ARRAY[id]) AS idlist, find_connected(id, 2, 3, ARRAY[id]) AS grouplist, id
FROM lines WHERE id = 1)
UNION ALL
(SELECT array_cat(groups_r.idlist, find_connected(lines.id, 2, 3, groups_r.idlist)) AS idlist,
find_connected(lines.id, 2, 3, groups_r.idlist) AS grouplist,
lines.id
FROM lines,
groups_r
WHERE NOT groups_r.idlist #> ARRAY[lines.id]
LIMIT 1))
SELECT
-- (SELECT array_agg(DISTINCT x) FROM unnest(idlist) t (x)) idlist, -- left for better understanding what is happening
row_number() OVER () chunk_id,
(SELECT array_agg(DISTINCT x) FROM unnest(grouplist) t (x)) grouplist,
id input_line_id
FROM groups_r;
The only problem is that performance gets quite poor when the number of ids in a chunk increases. For a table with 300 rows and 20 ids per chunk, execution time is around 15 minutes, even with indexes on the geometry and id columns.
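For reference, the four arguments of find_connected are the starting line id, the search distance, the maximum number of ids per chunk, and the array of ids that are already taken; a standalone usage sketch against the lines table above looks like this:

-- Collect up to 3 ids of lines within 2 units, starting from line id 1,
-- excluding nothing but the seed itself.
SELECT find_connected(1, 2.0, 3, ARRAY[1]) AS idlist;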

Limit query by count distinct column values

I have a table with people, something like this:
ID PersonId SomeAttribute
1 1 yellow
2 1 red
3 2 yellow
4 3 green
5 3 black
6 3 purple
7 4 white
Previously I was returning all Persons to the API as separate objects. So if the user set the limit to 3, I was just setting the query maxResults in Hibernate to 3 and returning:
{"PersonID": 1, "attr":"yellow"}
{"PersonID": 1, "attr":"red"}
{"PersonID": 2, "attr":"yellow"}
and if someone specifies a limit of 3 and page 2 (setMaxResult(3), setFirstResult(6)) it would be:
{"PersonID": 3, "attr":"green"}
{"PersonID": 3, "attr":"black"}
{"PersonID": 3, "attr":"purple"}
But now I want to select people and combine them into one JSON object that looks like this:
{
"PersonID":3,
"attrs": [
{"attr":"green"},
{"attr":"black"},
{"attr":"purple"}
]
}
And here is the problem. Is there any possibility in PostgreSQL or Hibernate to set the limit not by the number of rows but by the number of distinct person ids? If the user specifies a limit of 4, I should return persons 1, 2, 3 and 4, but with my current limiting mechanism I will return person 1 with 2 attributes, person 2, and person 3 with only one attribute. The same problem applies to pagination: right now I can return half of person 3's attrs array on one page and the other half on the next page.
You can use row_number to simulate LIMIT:
-- Test data
CREATE TABLE person AS
WITH tmp ("ID", "PersonId", "SomeAttribute") AS (
VALUES
(1, 1, 'yellow'::TEXT),
(2, 1, 'red'),
(3, 2, 'yellow'),
(4, 3, 'green'),
(5, 3, 'black'),
(6, 3, 'purple'),
(7, 4, 'white')
)
SELECT * FROM tmp;
-- Returning as a normal column (limit by someAttribute size)
SELECT * FROM (
select
"PersonId",
"SomeAttribute",
row_number() OVER(PARTITION BY "PersonId" ORDER BY "PersonId") AS rownum
from
person) as tmp
WHERE rownum <= 3;
-- Returning as a normal column (overall limit)
SELECT * FROM (
select
"PersonId",
"SomeAttribute",
row_number() OVER(ORDER BY "PersonId") AS rownum
from
person) as tmp
WHERE rownum <= 4;
-- Returning as a JSON column (limit by someAttribute size)
SELECT "PersonId", json_object_agg('color', "SomeAttribute") AS attributes FROM (
select
"PersonId",
"SomeAttribute",
row_number() OVER(PARTITION BY "PersonId" ORDER BY "PersonId") AS rownum
from
person) as tmp
WHERE rownum <= 3 GROUP BY "PersonId";
-- Returning as a JSON column (limit by person)
SELECT "PersonId", json_object_agg('color', "SomeAttribute") AS attributes FROM (
select
"PersonId",
"SomeAttribute"
from
person) as tmp
GROUP BY "PersonId"
LIMIT 4;
In this case, of course, you must use a native query, but this is a small trade-off IMHO.
I'm assuming you have another Person table. With JPA, you should do the query on the Person table (the one side), not on PersonColor (the many side). Then the limit will be applied to the number of Person rows.
If you don't have the Person table and can't modify the DB, what you can do is use SQL with GROUP BY PersonId and aggregate the colors:
select PersonId, array_agg(Color) FROM my_table group by PersonId limit 2
SQL Fiddle
Thank you, guys. After I realized that it could not be done with one query, I just did something like:
temp_query = select distinct x.person_id from (my_original_query) x
with user-specific page/per_page, and then:
my_original_query += " AND person_id IN (temp_query_results)"
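For completeness, a single-query sketch that pages by distinct PersonId is also possible with dense_rank(). This is against the sample person table created above; the JSON shape and aliases are only illustrative:

-- dense_rank() numbers distinct PersonId values, so filtering/paging on it
-- limits by person rather than by row.
SELECT "PersonId",
       json_agg(json_build_object('attr', "SomeAttribute")) AS attrs
FROM (
    SELECT "PersonId",
           "SomeAttribute",
           dense_rank() OVER (ORDER BY "PersonId") AS person_rank
    FROM person
) ranked
WHERE person_rank BETWEEN 4 AND 6   -- e.g. page 2 with 3 persons per page
GROUP BY "PersonId"
ORDER BY "PersonId";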

How to rewrite SQL joins into window functions?

Database is HP Vertica 7 or PostgreSQL 9.
create table test (
id int,
card_id int,
tran_dt date,
amount int
);
insert into test values (1, 1, '2017-07-06', 10);
insert into test values (2, 1, '2017-06-01', 20);
insert into test values (3, 1, '2017-05-01', 30);
insert into test values (4, 1, '2017-04-01', 40);
insert into test values (5, 2, '2017-07-04', 10);
Of the payment cards used in the last 1 day, what is the maximum amount charged on each of those cards in the last 90 days?
select t.card_id, max(t2.amount) max
from test t
join test t2 on t2.card_id=t.card_id and t2.tran_dt>='2017-04-06'
where t.tran_dt>='2017-07-06'
group by t.card_id
order by t.card_id;
The results are correct:
card_id max
------- ---
1 30
I want to rewrite the query using SQL window functions:
select card_id, max(amount) over(partition by card_id order by tran_dt range between '60 days' preceding and current row) max
from test
where card_id in (select card_id from test where tran_dt>='2017-07-06')
order by card_id;
But the result set does not match; how can this be done?
Test data here:
http://sqlfiddle.com/#!17/db317/1
I can't try PostgreSQL, but in Vertica, you can apply the ANSI standard OLAP window function.
But you'll need to nest two queries: The window function only returns sensible results if it has all rows that need to be evaluated in the result set.
But you only want the row from '2017-07-06' to be displayed.
So you'll have to filter for that date in an outer query:
WITH olap_output AS (
SELECT
card_id
, tran_dt
, MAX(amount) OVER (
PARTITION BY card_id
ORDER BY tran_dt
RANGE BETWEEN '90 DAYS' PRECEDING AND CURRENT ROW
) AS the_max
FROM test
)
SELECT
card_id
, the_max
FROM olap_output
WHERE tran_dt='2017-07-06'
;
card_id|the_max
1| 30
As far as I know, PostgreSQL window functions don't support a bounded RANGE PRECEDING, so range between '90 days' preceding won't work. They do support bounded ROWS PRECEDING such as rows between 90 preceding, but then you would need to assemble a time-series query similar to the following so the window function can operate on time-based rows:
SELECT c.card_id, t.amount, g.d as d_series
FROM generate_series(
'2017-04-06'::timestamp, '2017-07-06'::timestamp, '1 day'::interval
) g(d)
CROSS JOIN ( SELECT distinct card_id from test ) c
LEFT JOIN test t ON t.card_id = c.card_id and t.tran_dt = g.d
ORDER BY c.card_id, d_series
For what you need (based on your question description), I would stick to using group by.
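If a pure window-function version is still wanted on an older PostgreSQL, a sketch that builds on the time-series query above could look like this; it assumes at most one transaction per card per day, so that 90 ROWS correspond to roughly 90 days:

WITH series AS (
    -- One row per card per day, amount NULL on days without a transaction.
    SELECT c.card_id, g.d::date AS d_series, t.amount
    FROM generate_series('2017-04-06'::timestamp,
                         '2017-07-06'::timestamp,
                         '1 day'::interval) g(d)
    CROSS JOIN (SELECT DISTINCT card_id FROM test) c
    LEFT JOIN test t ON t.card_id = c.card_id AND t.tran_dt = g.d
),
windowed AS (
    SELECT card_id, d_series, amount,
           MAX(amount) OVER (PARTITION BY card_id
                             ORDER BY d_series
                             ROWS BETWEEN 90 PRECEDING AND CURRENT ROW) AS the_max
    FROM series
)
SELECT card_id, the_max
FROM windowed
WHERE d_series = DATE '2017-07-06'
  AND amount IS NOT NULL   -- only cards actually used on that day
ORDER BY card_id;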

SQL Time Series Completion Script

Version: SQL Server 2014
Objective: Create a complete time series with existing date range records.
Initial Data Setup:
IF OBJECT_ID('tempdb..#DataSet') IS NOT NULL
DROP TABLE #DataSet;
CREATE TABLE #DataSet (
RowID INT
,StartDt DATETIME
,EndDt DATETIME
,Col1 FLOAT);
INSERT INTO #DataSet (
RowID
,StartDt
,EndDt
,Col1)
VALUES
(1234,'1/1/2016','12/31/2999',100)
,(1234,'7/23/2016','7/27/2016',90)
,(1234,'7/26/2016','7/31/2016',80)
,(1234,'10/1/2016','12/31/2999',75);
Desired Results:
RowID, StartDt, EndDt, Col1
1234, '01/01/2016', '07/22/2016', 100
1234, '07/23/2016', '07/26/2016', 90
1234, '07/26/2016', '07/31/2016', 80
1234, '08/01/2016', '09/30/2016', 100
1234, '10/01/2016', '12/31/2999', 75
Not an easy task, I will admit. If anyone has a suggestion on how to tackle this utilizing SQL alone (Microsoft SQL Server 2014 T-SQL), I would greatly appreciate it. Please keep in mind it is SQL and we want to avoid cursors at all costs because of their performance on large data sets.
Thanks in Advance.
Also, as an FYI, I was able to achieve half of this by utilizing a LEAD window function to set the End Date of the current record to the StartDt - 1 of the next (see the sketch below). The other half, filling the gaps back in from previous records, still eludes me.
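(For reference, that LEAD() step might be sketched as follows against #DataSet; it only trims each EndDt to the day before the next row's StartDt and does not yet fill the gaps back in.)

-- Sketch of the LEAD() half: EndDt becomes the day before the next StartDt;
-- the last row per RowID keeps its own EndDt.
SELECT RowID,
       StartDt,
       COALESCE(DATEADD(day, -1,
                        LEAD(StartDt) OVER (PARTITION BY RowID
                                            ORDER BY StartDt, EndDt)),
                EndDt) AS EndDt,
       Col1
FROM #DataSet
ORDER BY StartDt, EndDt;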
Updated for the 9/31 to 9/30 date.
The following query does essentially what you are asking. You can tweak it to fit your requirements. Note that when checking the results of my query, your desired results contain 09/31/2016 which is not a valid date.
WITH
RankedData AS
(
SELECT RowID, StartDt, EndDt, Col1,
DATEADD(day, -1, StartDt) AS PrevEndDt,
RANK() OVER(ORDER BY StartDt, EndDt, RowID) AS rank_no
FROM #DataSet
),
HasGapsData AS
(
SELECT a.RowID, a.StartDt,
CASE WHEN b.PrevEndDt <= a.EndDt THEN b.PrevEndDt ELSE a.EndDt END AS EndDt,
a.Col1, a.rank_no
FROM RankedData a
LEFT JOIN RankedData b ON a.rank_no = b.rank_no - 1
)
SELECT RowID, StartDt, EndDt, Col1
FROM HasGapsData
UNION ALL
SELECT a.RowID,
DATEADD(day, 1, a.EndDt) AS StartDt,
DATEADD(day, -1, b.StartDt) AS EndDt,
a.Col1
FROM HasGapsData a
INNER JOIN HasGapsData b ON a.rank_no = b.rank_no - 1
WHERE DATEDIFF(day, a.EndDt, b.StartDt) > 1
ORDER BY StartDt, EndDt;