Aggregates in PostgreSQL

Write an aggregate to count the number of times the number 40 is seen in a column.
Use your aggregate to count the number of 40 year olds in the directory table.
This is what I was doing:
Create function aggstep(curr int) returns int as $$
begin
return curr.count where age = 40;
end;
$$ language plpgsql;
Create aggregate aggs(integer) (
stype = int,
initcond = '',
sfunc = aggstep);
Select cas(age) from directory;

You could do it for example like this:
First, create a transition function:
CREATE FUNCTION count40func(bigint, integer) RETURNS bigint
LANGUAGE sql IMMUTABLE CALLED ON NULL INPUT AS
'SELECT $1 + ($2 IS NOT DISTINCT FROM 40)::integer::bigint';
That works because FALSE::integer is 0 and TRUE::integer is 1.
I use IS NOT DISTINCT FROM rather than = so that it does the correct thing for NULLs.
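Both points can be checked directly in psql. A small sketch that uses only literal values, so no table is assumed:

```sql
-- Booleans cast to 0 and 1, which is what lets the transition function add them.
SELECT TRUE::integer  AS true_as_int,    -- 1
       FALSE::integer AS false_as_int;   -- 0

-- With "=", a NULL input compares to NULL; adding that to the running
-- count would turn the whole state into NULL.
SELECT (NULL::integer = 40)                    AS with_equals,        -- NULL
       (NULL::integer IS NOT DISTINCT FROM 40) AS with_not_distinct;  -- false
```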
The aggregate can then be defined as
CREATE AGGREGATE count40(integer) (
SFUNC = count40func,
STYPE = bigint,
INITCOND = 0
);
You can then query like
SELECT count40(age) FROM directory;
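As a cross-check for the custom aggregate, the same count can be obtained with the built-in count and a FILTER clause (available since PostgreSQL 9.4). This is only a sketch: the inline rows below are made up and stand in for the directory table.

```sql
SELECT count(*) FILTER (WHERE age = 40) AS forty_year_olds
FROM (VALUES (40), (25), (NULL), (40)) AS directory(age);
-- forty_year_olds = 2
```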

Related

Sum all ascii values for every character of varchar in PostgreSQL

I have a table that I want to partition by HASH. The key I want to partition on is a varchar column.
Of course, I can't partition by HASH on a varchar directly, so I plan to SUM the ASCII values of every character in the varchar instead.
I am hoping for some help putting together a function that takes a varchar parameter and returns that SUM as an INTEGER.
I have tried several variations (some of them are commented out); this is what I have so far:
CREATE OR REPLACE FUNCTION sum_string_ascii_values(theString varchar)
RETURNS INTEGER
LANGUAGE plpgsql
AS
$$
DECLARE
theSum INTEGER;
BEGIN
-- Sum on all ascii values coming from the every single char from the input varchar.
SELECT SUM( val )
FROM LATERAL ( SELECT ASCII( UNNEST( STRING_TO_ARRAY( LOWER(theString), null) ) ) ) AS val
INTO theSum;
--SELECT SUM(val) FROM ASCII( UNNEST( STRING_TO_ARRAY( LOWER(theString), null) ) ) AS val INTO theSUM;
--RETURN SUM( ASCII( UNNEST( STRING_TO_ARRAY( LOWER(theString), null) ) ) );
RETURN theSUM;
END;
$$;
I hope someone will be able to write and explain a solution to this problem.

Instead of using SELECT to sum the characters, you can loop through the string:
CREATE OR REPLACE FUNCTION sum_string_ascii_values(input text) RETURNS int LANGUAGE plpgsql AS $$
DECLARE
hash int := 0;
pos int := 1; -- substr() positions are 1-based
BEGIN
WHILE pos <= length(input) LOOP
-- upper() folds case so that 'a' and 'A' add the same value
hash := hash + ascii(upper(substr(input, pos, 1)));
pos := pos + 1;
END LOOP;
RETURN hash;
END;
$$;
Here is a link to a dbfiddle to demonstrate https://dbfiddle.uk/yfhpHyT1
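The set-based approach attempted in the question can also be made to work. A minimal sketch, assuming a new function name (sum_string_ascii_values_sql is made up here) and keeping the LOWER() from the question, so its sums differ from the upper() loop above for letters:

```sql
CREATE OR REPLACE FUNCTION sum_string_ascii_values_sql(the_string text)
RETURNS integer
LANGUAGE sql IMMUTABLE AS
$$
    -- string_to_array(..., NULL) splits the string into single characters;
    -- coalesce covers the empty string and NULL, where sum() yields NULL.
    SELECT coalesce(sum(ascii(ch)), 0)::integer
    FROM unnest(string_to_array(lower(the_string), NULL)) AS t(ch);
$$;

SELECT sum_string_ascii_values_sql('abc');  -- 97 + 98 + 99 = 294
```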

How to turn a greater strict function into an aggregate

I have been looking for a max function that treats NULL as the largest value (so the result is NULL if any input is NULL) and found the following at (https://www.postgresql.org/message-id/r2y162867791004201002x50843917y3d1f1293db7451e0#mail.gmail.com):
create or replace function greatest_strict(variadic anyarray)
returns anyelement as $$
select null from unnest($1) g(v) where v is null
union all
select max(v) from unnest($1) g(v)
limit 1
$$ language sql;
The problem is that this is not an aggregate function, so it cannot be used with GROUP BY. How can I change that, so that I can run the following query?
SELECT greatest_strict(performed_on) as start_date
from task
group by contract_id;

I've created this before: https://wiki.postgresql.org/wiki/Aggregate_strict_min_and_max
I call it strict_max, not strict_greatest, because "max" is already an aggregate so that seems like a better name.
This has the advantage (over the other answer) of not storing all the values in memory while it is aggregating over them, so that it can work on very large data sets.
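A constant-memory strict maximum can be sketched along the following lines. This is not a copy of the wiki page's code; the names strict_max_step, strict_max_final and strict_max_sketch are made up for this illustration. The state is a one-element array, so that "no rows seen yet" (a NULL state) can be told apart from "a NULL input was seen" (an array holding NULL):

```sql
-- Transition function: keeps only the current result in a one-element array.
CREATE FUNCTION strict_max_step(anyarray, anyelement)
RETURNS anyarray LANGUAGE sql IMMUTABLE AS $$
    SELECT CASE
        WHEN $1 IS NULL    THEN ARRAY[$2]  -- first row: start the state
        WHEN $1[1] IS NULL THEN $1         -- a NULL was seen earlier: stay NULL
        WHEN $2 IS NULL    THEN ARRAY[$2]  -- NULL input: the result becomes NULL
        WHEN $1[1] >= $2   THEN $1         -- current maximum still wins
        ELSE ARRAY[$2]                     -- new maximum
    END;
$$;

-- Final function: unwrap the single element (a NULL state means no rows).
CREATE FUNCTION strict_max_final(anyarray)
RETURNS anyelement LANGUAGE sql IMMUTABLE AS $$
    SELECT $1[1];
$$;

CREATE AGGREGATE strict_max_sketch(anyelement) (
    sfunc     = strict_max_step,
    stype     = anyarray,
    finalfunc = strict_max_final
);

SELECT strict_max_sketch(v) FROM (VALUES (1), (3), (2)) AS t(v);     -- 3
SELECT strict_max_sketch(v) FROM (VALUES (1), (NULL), (2)) AS t(v);  -- NULL
```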

You can create your own aggregation functions.
create aggregate agg_greatest_strict(anyelement) (
sfunc = create_array,
stype = anyarray,
finalfunc = greatest_strict,
initcond = '{}'
);
sfunc is a function which will be executed for every row and returns an intermediate result.
finalfunc will be executed afterwards with the result of the last sfunc execution.
In your case you could create the arrays for every row (your sfunc):
create or replace function create_array(anyarray, anyelement)
returns anyarray as $$
SELECT
$1 || $2
$$ language sql;
This simply appends each row's value to one array. The first parameter is the result of the previous call; on the first call, the initcond value is used instead.
Afterwards you can take your function as finalfunc:
create or replace function greatest_strict(anyarray)
returns anyelement as $$
select null from unnest($1) g(v) where v is null
union all
select max(v) from unnest($1) g(v)
limit 1
$$ language sql;
Edit: Former solutions, without any finalfunc, used the greatest() function on every row. There were two db<>fiddle demos: one with a single sfunc for anyelement, and one with an sfunc overloaded for text and numeric types because of some problem with special characters and ASCII order.

Function min(uuid) does not exist in postgresql

I have imported tables from Postgres to HDFS using sqoop. My table has a uuid field as its primary key, and my sqoop command is as follows:
sqoop import --connect 'jdbc:postgresql://localhost:5432/mydb' --username postgreuser --password 123456abcA --driver org.postgresql.Driver --table users --map-column-java id=String --target-dir /hdfs/postgre/users --as-avrodatafile --compress -m 2
But I got the error:
Import failed: java.io.IOException: org.postgresql.util.PSQLException: ERROR: function min(uuid) does not exist
I tried executing the SQL command SELECT min(id) from users and got the same error. How can I fix it? I use Postgres 9.4, hadoop 2.9.0 and sqoop 1.4.7.

I'd like to credit @robin-salih's answer. I used it, together with the implementation of min for int, to build the following code:
CREATE OR REPLACE FUNCTION min(uuid, uuid)
RETURNS uuid AS $$
BEGIN
IF $2 IS NULL OR $1 > $2 THEN
RETURN $2;
END IF;
RETURN $1;
END;
$$ LANGUAGE plpgsql;
create aggregate min(uuid) (
sfunc = min,
stype = uuid,
combinefunc = min,
parallel = safe,
sortop = operator (<)
);
It is almost the same, but it takes advantage of a B-tree index, so select min(id) from tbl runs in a few milliseconds.
P.S. I'm not a PostgreSQL expert and my code may be wrong somewhere, so double-check it before using it in production. I hope it uses indexes and parallel execution correctly; I wrote it from sample code without digging into the theory behind aggregates in PG.

Postgres doesn't have a built-in function for min/max on uuid, but you can create your own using the following code:
CREATE OR REPLACE FUNCTION min(uuid, uuid)
RETURNS uuid AS $$
BEGIN
IF $2 IS NULL OR $1 > $2 THEN
RETURN $2;
END IF;
RETURN $1;
END;
$$ LANGUAGE plpgsql;
CREATE AGGREGATE min(uuid)
(
sfunc = min,
stype = uuid
);

I found the answers provided by @robin-salih and @bodgan-mart to be a great starting point but ultimately incorrect. Here's a solution that worked better for me:
CREATE FUNCTION min_uuid(uuid, uuid)
RETURNS uuid AS $$
BEGIN
-- if they're both null, return null
IF $2 IS NULL AND $1 IS NULL THEN
RETURN NULL ;
END IF;
-- if just 1 is null, return the other
IF $2 IS NULL THEN
RETURN $1;
END IF ;
IF $1 IS NULL THEN
RETURN $2;
END IF;
-- neither are null, return the smaller one
IF $1 > $2 THEN
RETURN $2;
END IF;
RETURN $1;
END;
$$ LANGUAGE plpgsql;
create aggregate min(uuid) (
sfunc = min_uuid,
stype = uuid,
combinefunc = min_uuid,
parallel = safe,
sortop = operator (<)
);
For more details, see my post "How to select minimum UUID with left outer join?"
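The explicit NULL checks can alternatively be delegated to PostgreSQL by declaring the transition function STRICT: with a strict transition function and no initcond, the first non-null input becomes the initial state and later NULL inputs are skipped. A sketch under that assumption, with made-up names (uuid_smaller, min_uuid_sketch); note that combinefunc and parallel need PostgreSQL 9.6 or later:

```sql
CREATE FUNCTION uuid_smaller(uuid, uuid)
RETURNS uuid LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
'SELECT CASE WHEN $1 < $2 THEN $1 ELSE $2 END';

CREATE AGGREGATE min_uuid_sketch(uuid) (
    sfunc       = uuid_smaller,
    stype       = uuid,
    combinefunc = uuid_smaller,
    parallel    = safe,
    sortop      = operator (<)
);

-- The NULL row is skipped; the two uuid literals are invented test values.
SELECT min_uuid_sketch(id)
FROM (VALUES (NULL::uuid),
             ('bbbbbbbb-0000-0000-0000-000000000000'),
             ('aaaaaaaa-0000-0000-0000-000000000000')) AS t(id);
-- aaaaaaaa-0000-0000-0000-000000000000
```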

I am defining min/max aggregates for uuids using least/greatest which I believe should give the best performance as those are native to postgres (but I haven't benchmarked it).
Since least/greatest are special forms (to my understanding) I have to proxy them using a function which I am marking as immutable and parallel safe.
least/greatest already have proper null-handling behavior.
I am using these in production on Postgres 13.
create or replace function min(uuid, uuid)
returns uuid
immutable parallel safe
language plpgsql as
$$
begin
return least($1, $2);
end
$$;
create aggregate min(uuid) (
sfunc = min,
stype = uuid,
combinefunc = min,
parallel = safe,
sortop = operator (<)
);
create or replace function max(uuid, uuid)
returns uuid
immutable parallel safe
language plpgsql as
$$
begin
return greatest($1, $2);
end
$$;
create aggregate max(uuid) (
sfunc = max,
stype = uuid,
combinefunc = max,
parallel = safe,
sortop = operator (>)
);
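The NULL handling that this answer relies on can be seen in isolation: least and greatest ignore NULL arguments and return NULL only when every argument is NULL. A small check with a made-up uuid literal:

```sql
SELECT least(NULL::uuid, 'aaaaaaaa-0000-0000-0000-000000000000'::uuid) AS one_null,  -- the non-null uuid
       greatest(NULL::uuid, NULL::uuid)                               AS all_null;  -- NULL
```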

This is not an issue with sqoop. Postgres doesn't allow min/max on uuid out of the box: a uuid is meant as a unique identifier rather than a quantity, although the type does support comparison operators such as <.
To fix this in sqoop you might need to use some other field as the split-by key. I used the created_At timestamp as my split-by key instead.

Why Postgres doesn't return a result?

My first function consumes an array of UUID and returns a set of rows from the table:
CREATE OR REPLACE FUNCTION fun1 (
"UUID_" uuid []
)
RETURNS SETOF service AS
$body$
with recursive tree as (
SELECT * FROM service
WHERE id = ANY($1)
UNION ALL
SELECT service.* FROM service
JOIN tree ON service.id = tree.parent_id)
select distinct * from tree;
$body$
LANGUAGE 'sql'
VOLATILE
CALLED ON NULL INPUT
SECURITY INVOKER
COST 100 ROWS 1000;
Now I want to write another function that takes a list of UUIDs as a varchar and returns the same data as the first function.
My failed attempt:
CREATE OR REPLACE FUNCTION fun2 (
"UUID_" varchar
)
RETURNS TABLE (
"ID" uuid,
"NAME" varchar,
"PARENT_ID" uuid
) AS
$body$
BEGIN
RETURN QUERY
with recursive tree as (
SELECT * FROM service
WHERE id = ANY(string_to_array($1, ',')::UUID[])
UNION ALL
SELECT service.* FROM service
JOIN tree ON service.id = tree.parent_id)
select distinct(ID), name, parent_id from tree;
END;
$body$
LANGUAGE 'plpgsql'
VOLATILE
CALLED ON NULL INPUT
SECURITY INVOKER
COST 100 ROWS 1000;

I'll answer this myself:
fun1 can take a varchar if I cast it to an array of UUID when I call it,
so I can write:
select * from fun1('{some UUID array as varchar}'::UUID[]);
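Both ways of turning text into uuid[] can be tried without the service table. In this sketch the two UUID values are placeholders invented for the example:

```sql
-- A comma-separated list, split and cast (the form used inside fun2).
SELECT string_to_array(
           'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11,b1ffcd00-1d2c-4ef8-bb6d-6bb9bd380a22',
           ','
       )::uuid[] AS from_csv;

-- An array literal, cast directly (the form used in the call to fun1).
SELECT '{a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11,b1ffcd00-1d2c-4ef8-bb6d-6bb9bd380a22}'::uuid[] AS from_literal;
```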

PostgreSQL aggregate function over range

I am trying to create an aggregate function that finds the intersection of tsrange values, but I can't get it to work:
CREATE AGGREGATE intersection ( tsrange ) (
SFUNC = *,
STYPE = tsrange
)

There are two modifications to your attempt. First, I don't think you can use an operator as the SFUNC, so you need to define a named function to do the intersection, and use that.
CREATE or REPLACE FUNCTION int_tsrange(a tsrange, b tsrange)
returns tsrange language plpgsql as
'begin return a * b; end';
Secondly, without an INITCOND the aggregate's state starts out as NULL, and NULL * b is NULL, so the result will always be NULL. You need to initialize the state to an infinite range '[,]' to begin the aggregate. The aggregate definition then looks like:
CREATE AGGREGATE intersection ( tsrange ) (
SFUNC = int_tsrange,
STYPE = tsrange,
INITCOND = '[,]'
);

If you are interested, it is also possible to define a single function for all range types:
CREATE OR REPLACE FUNCTION range_intersection(a ANYRANGE, b ANYRANGE) RETURNS ANYRANGE
AS $$
SELECT a * b;
$$ LANGUAGE sql IMMUTABLE STRICT;
CREATE AGGREGATE range_intersection_agg(ANYRANGE) (
SFUNC = range_intersection,
STYPE = ANYRANGE,
INITCOND = '(,)'
);
SELECT range_intersection_agg(rng)
FROM (VALUES (int4range(1, 10)), (int4range(2, 20)), (int4range(4, NULL))) t (rng);
-- outputs [4,10)
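On PostgreSQL 14 or later, a built-in aggregate, range_intersect_agg, covers this case without any custom definitions. A sketch using the same sample rows as the query above:

```sql
SELECT range_intersect_agg(rng)
FROM (VALUES (int4range(1, 10)), (int4range(2, 20)), (int4range(4, NULL))) AS t(rng);
-- [4,10)
```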

We Keep Coding
