Consider a custom aggregate intended to take the set union of a bunch of arrays:
CREATE FUNCTION array_union_step (s ANYARRAY, n ANYARRAY) RETURNS ANYARRAY
AS $$ SELECT s || n; $$
LANGUAGE SQL IMMUTABLE LEAKPROOF PARALLEL SAFE;

CREATE FUNCTION array_union_final (s ANYARRAY) RETURNS ANYARRAY
AS $$
  SELECT array_agg(i ORDER BY i) FROM (
    SELECT DISTINCT UNNEST(x) AS i FROM (VALUES(s)) AS v(x)
  ) AS w WHERE i IS NOT NULL;
$$
LANGUAGE SQL IMMUTABLE LEAKPROOF PARALLEL SAFE;

CREATE AGGREGATE array_union (ANYARRAY) (
  SFUNC = array_union_step,
  STYPE = ANYARRAY,
  FINALFUNC = array_union_final,
  INITCOND = '{}',
  PARALLEL = SAFE
);
As I understand it, array concatenation in PostgreSQL copies all the elements of both inputs into a new array, so this is quadratic in the total number of elements (before deduplication). Is there a more efficient alternative without writing extension code in C? (Specifically, using either LANGUAGE SQL or LANGUAGE plpgsql.) For instance, maybe it's possible for the step function to take and return a set of rows somehow?
An example of the kind of data this needs to be able to process:
create temp table demo (tag int, "values" text[]); -- "values" must be quoted: VALUES is a reserved word
insert into demo values
(1, '{"a", "b"}'),
(2, '{"c", "d"}'),
(1, '{"a"}'),
(2, '{"c", "e", "f"}');
select tag, array_union("values") from demo group by tag;
tag | array_union
-----+-------------
2 | {c,d,e,f}
1 | {a,b}
Note in particular that the built-in array_agg cannot be used with this data, because the arrays are not all the same length:
select tag, array_agg("values") from demo group by tag;
ERROR: cannot accumulate arrays of different dimensionality
Array concatenation is expensive. That's why the built-in array_agg() uses an internal array-builder structure. Unfortunately, you cannot use this API at the SQL level.
I don't think using temp tables for custom aggregation is the right approach. Creating and dropping tables (temp tables included) is expensive, and very expensive at high frequency; INSERT and SELECT are expensive too. (Tables have a much more complex format than arrays.) If you need really fast aggregation, then write a C function and use the array builder.
If you cannot use your own C extension, then use the built-in array_agg() function with deduplication and sort on already aggregated data.
CREATE OR REPLACE FUNCTION array_distinct(anyarray)
RETURNS anyarray AS $$
BEGIN
RETURN ARRAY(SELECT DISTINCT v FROM unnest($1) u(v) WHERE v IS NOT NULL ORDER BY v);
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;
Call:
SELECT ..., array_distinct(array_agg(somecolumn)) FROM tab;
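A hedged sketch of how this applies to the demo table from the question: unnest the arrays before aggregating, so no per-row array concatenation happens at all (assumes PostgreSQL 9.3+ for LATERAL and the array_distinct() function defined above):
-- unnest first, then aggregate plain elements; array_distinct() dedupes and sorts
SELECT tag, array_distinct(array_agg(v)) AS array_union
FROM demo, LATERAL unnest("values") AS u(v)
GROUP BY tag;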
Imagine I have a table with this definition:
CREATE TABLE test (
"values" HSTORE NOT NULL
);
Imagine I insert a few records and end up with the following:
values
-----------------------------
"a"=>"string1","b"=>"string2"
"b"=>"string2","c"=>"string3"
Is there any way I can make an aggregate query that will give me a new hstore with the merged keys (and values) of all rows?
Pseudo-query:
SELECT hstore_sum("values") AS value_sum FROM test;
Desired result:
value_sum
--------------------------------------------
"a"=>"string1","b"=>"string2","c"=>"string3"
I am aware of potential conflicts with different values for each key, but in my case the order / priority of which value is picked is not important (it does not even have to be deterministic, as they will be the same for the same key).
Is this possible out of the box or do you have to use some specific homemade SQL functions or other to do it?
You can do a lot of things, for example:
My first thought was to use the each() function, and aggregate keys and values separately, like:
SELECT hstore(array_agg(key), array_agg(value))
FROM test,
LATERAL each(hs);
But this performs the worst.
You can use the hstore_to_array() function too, to build a key-value alternating array (credit to @JakubKania):
SELECT hstore(array_agg(altering_pairs))
FROM test,
LATERAL unnest(hstore_to_array(hs)) altering_pairs;
But this isn't perfect yet.
You can rely on the hstore values' text representation and build up a string containing all your pairs:
SELECT string_agg(nullif(hs::text, ''), ',')::hstore
FROM test;
This is quite fast. However, if you want, you can use a custom aggregate function (which can use the built-in hstore concatenation):
CREATE AGGREGATE hstore_sum (hstore) (
  SFUNC = hs_concat, -- SFUNC takes the function name only; argument types come from the aggregate's signature
  STYPE = hstore
);
-- I used the internal function (hs_concat) behind the concat (||) operator;
-- if you do not want to rely on this function,
-- you could easily write an equivalent in a custom SQL function
SELECT hstore_sum(hs)
FROM test;
There is no built-in function for that, but hstore offers a few functions that allow transforming it into something else, for example an array. So we cast it to an array, merge the arrays, and create an hstore from the final array:
SELECT hstore(array_agg(x))
FROM (
  SELECT unnest(hstore_to_array(hs)) AS x
  FROM test
) AS q;
http://sqlfiddle.com/#!15/cb11a/1
P.S. Some other combination (like going with JSON) might be more efficient.
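For what it's worth, a sketch of that JSON route (my assumption of what it could look like; needs PostgreSQL 9.5+ for jsonb_object_agg and uses the hs column name from the fiddle):
-- each() explodes every hstore into (key, value) rows;
-- jsonb_object_agg() merges them, later keys overwriting earlier ones
SELECT jsonb_object_agg(key, value) AS value_sum
FROM test, LATERAL each(hs);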
This is what I wrote, and it works in production now. I avoid excessive conversion between types, e.g. hstore and array. I also don't use hs_concat as the SFUNC directly, as it will produce NULL if any of the hstores it is aggregating is NULL.
CREATE OR REPLACE FUNCTION public.agg_hstore_sum_sfunc(state hstore, val hstore)
RETURNS hstore AS $$
BEGIN
  IF val IS NOT NULL THEN
    IF state IS NULL THEN
      state := val;
    ELSE
      state := state || val;
    END IF;
  END IF;
  RETURN state;
END;
$$ LANGUAGE 'plpgsql';
CREATE AGGREGATE public.sum(hstore) (
SFUNC = public.agg_hstore_sum_sfunc,
STYPE = hstore
);
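Usage, mirroring the pseudo-query from the question (hs being the hstore column, as in the answers above):
SELECT public.sum(hs) AS value_sum FROM test;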
I have a custom aggregate sum function which accepts the boolean data type:
create or replace function badd (bigint, boolean)
returns bigint as
$body$
select $1 + case when $2 then 1 else 0 end;
$body$ language sql;
create aggregate sum(boolean) (
sfunc=badd,
stype=int8,
initcond='0'
);
This aggregate should calculate the number of rows with TRUE. For example, the following should return 2 (and it does):
with t (x) as
(values
(true::boolean),
(false::boolean),
(true::boolean),
(null::boolean)
)
select sum(x) from t;
However, its performance is quite bad; it is 5.5 times slower than casting to integer:
with t as (select (gs > 0.5) as test_vector from generate_series(1,1000000,1) gs)
select sum(test_vector) from t; -- 52012ms
with t as (select (gs > 0.5) as test_vector from generate_series(1,1000000,1) gs)
select sum(test_vector::int) from t; -- 9484ms
Is the only way to improve this aggregate to write a new C function - e.g. some alternative of the int2_sum function in src/backend/utils/adt/numeric.c?
Your test case is misleading: you only count TRUE. You should have both TRUE and FALSE - or even NULL, if applicable.
Like @foibs already explained, one wouldn't use a custom aggregate function for this. The built-in C functions are much faster and do the job. Use instead (also demonstrating a simpler and more sensible test):
SELECT count(NULLIF(g%2 = 1, FALSE)) AS ct
FROM generate_series(1,100000,1) g;
How does this work? NULLIF(g%2 = 1, FALSE) turns FALSE into NULL, and count() only counts non-NULL values, so only TRUE results are counted. See also: Compute percents from SUM() in the same SELECT sql query
Several fast & simple ways (plus a benchmark) under this related answer on dba.SE:
For absolute performance, is SUM faster or COUNT?
Or faster yet, test for TRUE in the WHERE clause, where possible:
SELECT count(*) AS ct
FROM generate_series(1,100000,1) g
WHERE g%2 = 1; -- excludes FALSE and NULL !
If you'd have to write a custom aggregate for some reason, this form would be superior:
CREATE OR REPLACE FUNCTION test_sum_int8 (int8, boolean)
RETURNS bigint as
'SELECT CASE WHEN $2 THEN $1 + 1 ELSE $1 END' LANGUAGE sql;
The addition is only executed when necessary. Your original would add 0 for the FALSE case.
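For completeness, the matching aggregate definition would be (a sketch, analogous to the plpgsql variant below):
CREATE AGGREGATE test_sum_int8(boolean) (
sfunc = test_sum_int8
,stype = int8
,initcond = '0'
);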
Better yet, use a plpgsql function. It saves a bit of overhead per call, since it works like a prepared statement (the query is not re-planned). Makes a difference for a tiny aggregate function that is called many times:
CREATE OR REPLACE FUNCTION test_sum_plpgsql (int8, boolean)
RETURNS bigint AS
$func$
BEGIN
RETURN CASE WHEN $2 THEN $1 + 1 ELSE $1 END;
END
$func$ LANGUAGE plpgsql;
CREATE AGGREGATE test_sum_plpgsql(boolean) (
sfunc = test_sum_plpgsql
,stype = int8
,initcond = '0'
);
Faster than what you had, but much slower than the presented alternative with a standard count(). And slower than any other C-function, too.
I created a custom C function and aggregate for boolean:
C function:
#include "postgres.h"
#include <fmgr.h>
#ifdef PG_MODULE_MAGIC
PG_MODULE_MAGIC;
#endif
int
bool_sum(int arg, bool tmp)
{
    if (tmp)
    {
        arg++;
    }
    return arg;
}
Transition and aggregate functions:
-- transition function
create or replace function bool_sum(bigint, boolean)
returns bigint
AS '/usr/lib/postgresql/9.1/lib/bool_agg', 'bool_sum'
language C strict
cost 1;
alter function bool_sum(bigint, boolean) owner to postgres;
-- aggregate
create aggregate sum(boolean) (
sfunc=bool_sum,
stype=int8,
initcond='0'
);
alter aggregate sum(boolean) owner to postgres;
Performance test:
-- Performance test - 10m rows
create table tmp_test as (
  select (case when random() < .3 then null
               when random() < .6 then true
               else false end) as test_vector
  from generate_series(1,10000000,1) gs
);
-- Casting to integer
select sum(test_vector::int) from tmp_test;
-- Boolean sum
select sum(test_vector) from tmp_test;
Now sum(boolean) is as fast as sum(boolean::int).
Update:
It turns out that I can call existing C transition functions directly, even with boolean data type. It gets somehow magically converted to 0/1 on the way. So my current solution for boolean sum and average is:
create or replace function bool_sum(bigint, boolean)
returns bigint as
'int2_sum'
language internal immutable
cost 1;
create aggregate sum(boolean) (
sfunc=bool_sum,
stype=int8
);
-- Average for boolean values (percentage of rows with TRUE)
create or replace function bool_avg_accum(bigint[], boolean)
returns bigint[] as
'int2_avg_accum'
language internal immutable strict
cost 1;
create aggregate avg(boolean) (
sfunc=bool_avg_accum,
stype=int8[],
finalfunc=int8_avg,
initcond='{0,0}'
);
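Usage on the test table from above; avg() yields the fraction of TRUE among non-NULL values, since its transition function is declared strict:
SELECT sum(test_vector) AS true_count,
       avg(test_vector) AS true_fraction
FROM tmp_test;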
I don't see the real issue here. First of all, using sum as your custom aggregate's name is wrong. When you call sum with your test_vector cast to int, the built-in Postgres sum is used instead of yours; that's why it is so much faster. A C function will always be faster, but I'm not sure you need one in this case.
You could easily drop the badd function and your custom sum, and use the built-in sum with a WHERE clause:
with t as (select 1 as test_vector from generate_series(1,1000000,1) gs where gs > 0.5)
select sum(test_vector) from t;
EDIT:
To sum it up, the best way to optimize your custom aggregate is to remove it if it is not needed. The second best way would be to write a C function.
I have an array of type bigint. How can I remove the duplicate values in that array?
Ex: array[1234, 5343, 6353, 1234, 1234]
I should get array[1234, 5343, 6353, ...]
I tried the example SELECT uniq(sort('{1,2,3,2,1}'::int[])) from the Postgres manual, but it does not work.
I faced the same problem, but in my case the array is created via the array_agg function. Fortunately, it allows aggregating DISTINCT values, like:
array_agg(DISTINCT value)
This works for me.
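A minimal sketch with the asker's array (the output order falls out of the sort that DISTINCT performs internally):
SELECT array_agg(DISTINCT e) FROM unnest('{1234,5343,6353,1234,1234}'::bigint[]) AS e;
-- {1234,5343,6353}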
The sort(int[]) and uniq(int[]) functions are provided by the intarray contrib module.
To enable its use, you must install the module.
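On PostgreSQL 9.1 and later, installing it is a one-liner per database:
CREATE EXTENSION intarray;
SELECT uniq(sort('{1,2,3,2,1}'::int[])); -- {1,2,3}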
If you don't want to use the intarray contrib module, or if you have to remove duplicates from arrays of different type, you have two other ways.
If you have at least PostgreSQL 8.4, you can take advantage of the unnest(anyarray) function:
SELECT ARRAY(SELECT DISTINCT UNNEST('{1,2,3,2,1}'::int[]) ORDER BY 1);
?column?
----------
{1,2,3}
(1 row)
Alternatively you could create your own function to do this
CREATE OR REPLACE FUNCTION array_sort_unique (ANYARRAY) RETURNS ANYARRAY
LANGUAGE SQL
AS $body$
SELECT ARRAY(
SELECT DISTINCT $1[s.i]
FROM generate_series(array_lower($1,1), array_upper($1,1)) AS s(i)
ORDER BY 1
);
$body$;
Here is a sample invocation:
SELECT array_sort_unique('{1,2,3,2,1}'::int[]);
array_sort_unique
-------------------
{1,2,3}
(1 row)
... Where are the standard libraries (?) for this kind of array_X utility?
Try to search... You'll see some, but no standard:
postgres.cz/wiki/Array_based_functions: good reference!
JDBurnZ/postgresql-anyarray: good initiative, but needs some collaboration to enhance.
wiki.postgresql.org/Snippets: frustrated initiative, but the "official wiki"; needs some collaboration to enhance.
MADlib: good! ... but it is an elephant, not a "pure SQL snippets lib".
Simplest and fastest array_distinct() snippet-lib function
Here is the simplest and perhaps fastest implementation of array_unique() or array_distinct():
CREATE FUNCTION array_distinct(anyarray) RETURNS anyarray AS $f$
SELECT array_agg(DISTINCT x) FROM unnest($1) t(x);
$f$ LANGUAGE SQL IMMUTABLE;
NOTE: it works as expected with any datatype, except arrays of arrays;
SELECT array_distinct( array[3,3,8,2,6,6,2,3,4,1,1,6,2,2,3,99] ),
array_distinct( array['3','3','hello','hello','bye'] ),
array_distinct( array[array[3,3],array[3,3],array[3,3],array[5,6]] );
-- "{1,2,3,4,6,8,99}", "{3,bye,hello}", "{3,5,6}"
the "side effect" is to explode all arrays in a set of elements.
PS: with JSONB arrays works fine,
SELECT array_distinct( array['[3,3]'::JSONB, '[3,3]'::JSONB, '[5,6]'::JSONB] );
-- "{"[3, 3]","[5, 6]"}"
Edit: a more complex but useful variant, with a "drop nulls" parameter
CREATE FUNCTION array_distinct(
anyarray, -- input array
boolean DEFAULT false -- flag to ignore nulls
) RETURNS anyarray AS $f$
SELECT array_agg(DISTINCT x)
FROM unnest($1) t(x)
WHERE CASE WHEN $2 THEN x IS NOT NULL ELSE true END;
$f$ LANGUAGE SQL IMMUTABLE;
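A usage sketch; note this two-parameter version should replace the one-parameter version above, because keeping both would make calls without the second argument ambiguous due to the DEFAULT:
SELECT array_distinct(ARRAY[1, 1, NULL, 2], false) AS keeps_nulls -- {1,2,NULL}
     , array_distinct(ARRAY[1, 1, NULL, 2], true)  AS drops_nulls; -- {1,2}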
Using DISTINCT implicitly sorts the array. If the relative order of the array elements needs to be preserved while removing duplicates, the function can be designed like the following: (should work from 9.4 onwards)
CREATE OR REPLACE FUNCTION array_uniq_stable(anyarray) RETURNS anyarray AS
$body$
SELECT
array_agg(distinct_value ORDER BY first_index)
FROM
(SELECT
value AS distinct_value,
min(index) AS first_index
FROM
unnest($1) WITH ORDINALITY AS input(value, index)
GROUP BY
value
) AS unique_input
;
$body$
LANGUAGE 'sql' IMMUTABLE STRICT;
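Example call, preserving first-occurrence order:
SELECT array_uniq_stable(ARRAY[3,3,8,2,6,6,2,3]);
-- {3,8,2,6}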
I have assembled a set of stored procedures (functions), coined anyarray, to combat PostgreSQL's lack of array handling. These functions are designed to work across any array data type, not just integers as intarray does: https://www.github.com/JDBurnZ/anyarray
In your case, all you'd really need is anyarray_uniq.sql. Copy & paste the contents of that file into a PostgreSQL query and execute it to add the function. If you need array sorting as well, also add anyarray_sort.sql.
From there, you can perform a simple query as follows:
SELECT ANYARRAY_UNIQ(ARRAY[1234,5343,6353,1234,1234])
Returns something similar to: ARRAY[1234, 6353, 5343]
Or if you require sorting:
SELECT ANYARRAY_SORT(ANYARRAY_UNIQ(ARRAY[1234,5343,6353,1234,1234]))
Returns exactly: ARRAY[1234, 5343, 6353]
Here's the "inline" way:
SELECT 1 AS anycolumn, (
  SELECT array_agg(c1)
  FROM (
    SELECT DISTINCT c1
    FROM (
      SELECT unnest(ARRAY[1234,5343,6353,1234,1234]) AS c1
    ) AS t1
  ) AS t2
) AS the_array;
First we create a set from the array, then we select only distinct entries, and then we aggregate them back into an array.
In a single query I did this:
SELECT (select array_agg(distinct val) from ( select unnest(:array_column) as val ) as u ) FROM :your_table;
For people like me who still have to deal with Postgres 8.2, this recursive function can eliminate duplicates without altering the order of the array:
CREATE OR REPLACE FUNCTION my_array_uniq(bigint[])
  RETURNS bigint[] AS
$BODY$
DECLARE
  n integer;
BEGIN
  -- number of elements in the array
  n := replace(split_part(array_dims($1), ':', 2), ']', '')::int;
  IF n > 1 THEN
    -- test if the last item belongs to the rest of the array
    IF ($1)[1:n-1] @> ($1)[n:n] THEN
      -- returns the result of the same function on the rest of the array
      RETURN my_array_uniq($1[1:n-1]);
    ELSE
      -- returns the result of the same function on the rest of the array plus the last element
      RETURN my_array_uniq($1[1:n-1]) || $1[n:n];
    END IF;
  ELSE
    -- if the array has only one item, return the array
    RETURN $1;
  END IF;
END;
$BODY$
LANGUAGE 'plpgsql' VOLATILE;
for example:
select my_array_uniq(array[3,3,8,2,6,6,2,3,4,1,1,6,2,2,3,99]);
will give
{3,8,2,6,4,1,99}
Assuming that my subquery yields a number of rows with the columns (x,y), I would like to calculate the value avg(abs(x-mean)/y), where mean effectively is avg(x).
select avg(abs(x-avg(x))/y) as incline from subquery fails because I cannot nest aggregate functions, nor can I think of a way to calculate the mean in a subquery while keeping the original result set. An avgdev function as it exists in other dialects would not exactly help me, so here I am stuck. Probably just due to a lack of SQL knowledge - calculating the value from the result set in postprocessing is easy.
Which SQL construct could help me?
Edit: Server version is 8.3.4. No window functions with WITH or OVER available here.
Not sure I understand you correctly, but you might be looking for something like this:
SELECT avg(abs(x - mean)/y) AS incline
FROM (
  SELECT x,
         y,
         avg(x) OVER (PARTITION BY your_grouping_column) AS mean
  FROM your_table
) t
If you do not need to group your results to get the correct avg(x), then simply leave out the "partition by" and use an empty OVER clause: over()
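Applied to the asker's formula without grouping, that would look like this sketch (it needs 8.4+ for window functions, so it would not run on the asker's 8.3 server):
SELECT avg(abs(x - mean)/y) AS incline
FROM (
  SELECT x, y, avg(x) OVER () AS mean
  FROM subquery
) t;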
If your data sets are not too large, you could accumulate them into an array and then return the incline from a function:
create type typ as (x numeric, y numeric);
create aggregate array_accum( sfunc = array_append,
basetype = anyelement,
stype = anyarray,
initcond = '{}' );
create or replace function unnest(anyarray) returns setof anyelement
language sql immutable strict as $$
select $1[i] from generate_series(array_lower($1,1), array_upper($1,1)) i;$$;
create function get_incline(typ[]) returns numeric
language sql immutable strict as $$
select avg(abs(x-(select avg(x) from unnest($1)))/y) from unnest($1);$$;
select get_incline((select array_accum((x,y)::typ) from subquery));
sample view for testing:
create view subquery as
select generate_series(1,5) as x, generate_series(1,6) as y;
One option I found is to use a temporary table:
begin;
create temporary table sub on commit drop as (...subquery code...);
select avg(abs(x-mean)/y) as incline from (SELECT x, y, (SELECT avg(x) FROM sub) AS mean FROM sub) as sub2;
commit;
But is that overkill?
I want to write a stored procedure that takes an array as an input parameter, sorts that array, and returns the sorted array.
The best way to sort an array of integers is without a doubt to use the intarray extension, which will do it much, much, much faster than any SQL formulation:
CREATE EXTENSION intarray;
SELECT sort( ARRAY[4,3,2,1] );
A function that works for any array type is:
CREATE OR REPLACE FUNCTION array_sort (ANYARRAY)
RETURNS ANYARRAY LANGUAGE SQL
AS $$
SELECT ARRAY(SELECT unnest($1) ORDER BY 1)
$$;
(I've replaced my version with Pavel's slightly faster one after discussion elsewhere).
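Example call:
SELECT array_sort(ARRAY['b','c','a']); -- {a,b,c}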
In PostgreSQL 8.4 and up you can use:
select array_agg(x) from (select unnest(ARRAY[1,5,3,7,2]) AS x order by x) as _;
But it will not be very fast.
In older Postgres versions you can implement unnest like this:
CREATE OR REPLACE FUNCTION unnest(anyarray)
  RETURNS SETOF anyelement AS
$BODY$
  SELECT $1[i]
  FROM generate_series(array_lower($1,1),
                       array_upper($1,1)) i;
$BODY$
LANGUAGE 'sql' IMMUTABLE;
And array_agg like this:
CREATE AGGREGATE array_agg (
sfunc = array_append,
basetype = anyelement,
stype = anyarray,
initcond = '{}'
);
But it will be even slower.
You can also implement any sorting algorithm in pl/pgsql or any other language you can plug in to postgres.
Just use the function unnest():
SELECT
unnest(ARRAY[1,2]) AS x
ORDER BY
x DESC;
See array functions in the Pg docs.
This worked for me, from http://www.pgsql.cz/index.php/PostgreSQL_SQL_Tricks_I#General_array_sort:
CREATE OR REPLACE FUNCTION array_sort (ANYARRAY)
RETURNS ANYARRAY LANGUAGE SQL
AS $$
SELECT ARRAY(
SELECT $1[s.i] AS "foo"
FROM
generate_series(array_lower($1,1), array_upper($1,1)) AS s(i)
ORDER BY foo
);
$$;
Please see Craig's answer, since he is far more knowledgeable on Postgres and has a better answer. Also, if possible, vote to delete my answer.
A very nice exhibition of PostgreSQL's features is the general procedure for sorting by David Fetter:
CREATE OR REPLACE FUNCTION array_sort (ANYARRAY)
RETURNS ANYARRAY LANGUAGE SQL
AS $$
SELECT ARRAY(
SELECT $1[s.i] AS "foo"
FROM
generate_series(array_lower($1,1), array_upper($1,1)) AS s(i)
ORDER BY foo
);
$$;
If you're looking for a solution which will work across any data-type, I'd recommend taking the approach laid out at YouLikeProgramming.com.
Essentially, you can create a stored procedure (code below) which performs the sorting for you, and all you need to do is pass your array to that procedure for it to be sorted appropriately.
I have also included an implementation which does not require the use of a stored procedure, if you're looking for your query to be a little more transportable.
Creating the stored procedure
DROP FUNCTION IF EXISTS array_sort(anyarray);
CREATE FUNCTION
array_sort(
array_vals_to_sort anyarray
)
RETURNS TABLE (
sorted_array anyarray
)
AS $BODY$
BEGIN
RETURN QUERY SELECT
ARRAY_AGG(val) AS sorted_array
FROM
(
SELECT
UNNEST(array_vals_to_sort) AS val
ORDER BY
val
) AS sorted_vals
;
END;
$BODY$
LANGUAGE plpgsql;
Sorting array values (works with any array data-type)
-- The following will return: {1,2,3,4}
SELECT ARRAY_SORT(ARRAY[4,3,2,1]);
-- The following will return: {in,is,it,on,up}
SELECT ARRAY_SORT(ARRAY['up','on','it','is','in']);
Sorting array values without a stored procedure
In the following query, simply replace ARRAY[4,3,2,1] with your array or query which returns an array:
WITH
sorted_vals AS (
SELECT
UNNEST(ARRAY[4,3,2,1]) AS val
ORDER BY
val
)
SELECT
ARRAY_AGG(val) AS sorted_array
FROM
sorted_vals
... or ...
SELECT
ARRAY_AGG(vals.val) AS sorted_arr
FROM (
SELECT
UNNEST(ARRAY[4,3,2,1]) AS val
ORDER BY
val
) AS vals
I'm surprised no-one has mentioned the containment operators:
select array[1,2,3] <@ array[2,1,3] and array[1,2,3] @> array[2,1,3];
?column?
══════════
t
(1 row)
Notice that this requires that all elements of the arrays must be unique.
(If a contains b and b contains a, they must be the same if all elements are unique)
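A quick illustration of that caveat: with duplicates, mutual containment holds even though the arrays differ:
select array[1,1,2] <@ array[1,2,2] and array[1,1,2] @> array[1,2,2];
-- t, although one array is not a rearrangement of the other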