How to add a custom aggregate function (eg. MAX/MIN) to PostgreSQL? - postgresql

I would like to add an extension function to PostgreSQL similar to max/min, but I could not find their source. Could anyone suggest which part of the source code I should look at? Thanks.
Here is an example. I have a relation model(id int), where each row is a CAD model with an ID. I want to find all models with id>5 and area>5, but I do not want to calculate the area of every face, so I use a HAVING clause to calculate it only for a subset. Here is the query:
select model.id, model.face_number
from
model
where
id>5
group by model.id
having
area(model.id)>5;
I want to define a function area(oid), like max/min, as an FDW, but I do not know how to pass the input parameters, so I want to compare it with min/max.

This doesn't make much sense.
min and max are aggregate functions. They reduce a set of rows into a single value.
Your problem description doesn't seem to have much to do with aggregation. So it's not at all clear that aggregate functions have anything to do with it.
If you really do need to write an aggregate function, start with the PostgreSQL manual:
User-defined aggregates
C-language extension functions
Writing extensions
Extension-building infrastructure
I strongly recommend that you prototype your aggregate function in PL/PgSQL or another procedural language. Write it in C only if you've demonstrated that it can work using a quicker-to-work-with language, and determined that you need it faster than you can do with PL/PgSQL or PL/Python or whatever.
Anyway, if you want to find the implementation of min/max, start here:
select a.*, so.oprname as aggsortopname, tt.typname as aggtranstypename
from pg_aggregate a
inner join pg_proc p on (a.aggfnoid = p.oid)
inner join pg_type tt on (a.aggtranstype = tt.oid)
inner join pg_operator so on (a.aggsortop = so.oid)
where p.proname = 'max';
There you'll see that the aggregate is composed of multiple parts: a transition function, a sort operator, a transition state type, an optional final function, etc. The documentation on user-defined aggregates explains that in detail.
So there's no single "max function". The definition of max in pg_proc.h actually just refers to a dummy function.
So for max(int4), it's defined as the transition function int4larger (src/backend/utils/adt/int.c) over transition type int4, with the sort operator >, with no final function.
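As a minimal sketch of how those parts fit together, the following builds a user-defined aggregate from the same pieces listed above for max(int4). The name my_max is hypothetical; int4larger, the int4 state type and the > sort operator are the ones named above.

```sql
-- my_max is a made-up name; it reuses the built-in pieces of max(int4).
CREATE AGGREGATE my_max(int4) (
    -- transition function: keeps the larger of the current state and the input
    sfunc  = int4larger,
    -- transition state type
    stype  = int4,
    -- sort operator: declares the result equal to the first row of ORDER BY x DESC
    sortop = >
);

-- Usage: returns 10
SELECT my_max(x) FROM generate_series(1, 10) AS g(x);
```

Because int4larger is strict and no initial condition is given, the first non-null input becomes the initial state, so no final function is needed.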

You do not want an aggregate function for what you have described. You also should not worry about performance until you have a working version; PostgreSQL's query optimizer will likely do exactly what you want if you write this:
select model.id, model.face_number
from
model
where
id>5 and area(model.id)>5;
Here is an example function:
CREATE FUNCTION area(in_id int)
RETURNS double precision AS $$
SELECT length*width FROM model WHERE id=in_id;
$$ LANGUAGE SQL STABLE;
Of course you can replace length*width with some more appropriate calculation.

Related

PostgreSQL Base Types (Scalar Type)

I have a use case where a custom base type in a PostgreSQL database would be very beneficial for dealing with non-linear data. The examples I have found all define the input and output functions as C functions. In my case I would rather define the input and output functions in SQL and then use LIKE to inherit everything else from double precision. Has anyone done this? Is it even possible?
Possible example:
-- sample linear to logarithmic functions
CREATE FUNCTION to_linear(anyelement) RETURNS double precision
LANGUAGE SQL
AS
$$
SELECT CASE WHEN $1 > 0 THEN 30 / log($1::double precision) ELSE 0 END
$$;
create function to_log(anyelement) returns double precision
language sql
as $$
select 10^($1::double precision/30.0);
$$;
-- create the base type
create type mylogdata
(
INPUT = to_linear,
OUTPUT = to_log,
LIKE = double precision
) ;
-- sample use in a table definition
CREATE TABLE test_table(
mydata mylogdata
);
What I'm really after is a "pseudo" or "partial" base type that allows simple in/out conversions while letting the existing functions (sum, average, etc.) work on the inherited type (in this case, double precision); basically avoiding writing or rewriting functions in C.
Thoughts? Ideas? Comments? Not possible? :)
Much Thanks!
On a side note, if we had to go down the 'C' route, I think there could be an opportunity to create a more generic logarithmic scalar/base type, like Char, Varchar, or Arbitrary Precision Number, which could allow for the dynamic declaration of the log base and scale of the non-linear data.
Something like this could be a big win for the science community and those of us dealing with "wave" based data like sound, vibration, earthquakes, light, radiation, etc. Here is a sample definition of the base:
Logarithmic(base, scale)
-- Below my idea for use in a table definition
-- Obviously the IN/OUT functions would have to be modified to use the base and scaling
-- as defined ( most likely in C ?? )
CREATE TABLE test_table
(
mydata logarithmic(10, 30)
);
If someone is interested in partnering in creating something like this, let me know.
If you want to write a data type with new type input and output functions, you have to write that in C. No doubt you can reuse a lot of functions from double precision.
But I would go the other way. Rather than having a type that looks like a number, but the known arithmetic operators behave weirdly, define a new set of operators on an existing data type. That can be done in SQL, and it feels more natural to me.
If you create an operator class for your new operators, you can also use them with indexes.
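A rough sketch of that approach follows, under loud assumptions: the scale of 30 and base 10 come from the question's to_log function, while the function name log_add and the operator name +~ are invented here for illustration. It adds two logarithmic values by converting both to linear, summing, and converting back.

```sql
-- Hypothetical helper: "addition" of two values stored on a log10 scale of 30.
CREATE FUNCTION log_add(double precision, double precision)
RETURNS double precision
LANGUAGE sql IMMUTABLE STRICT
AS 'SELECT 30 * log(10 ^ ($1 / 30) + 10 ^ ($2 / 30))';

-- Hypothetical operator name; the column type stays plain double precision.
-- FUNCTION = needs PostgreSQL 11 or later; older releases spell it PROCEDURE =.
CREATE OPERATOR +~ (
    LEFTARG  = double precision,
    RIGHTARG = double precision,
    FUNCTION = log_add
);

-- Usage: two equal inputs of 30 give roughly 39.03
SELECT 30::double precision +~ 30::double precision AS combined;
```

The built-in sum, avg and ordinary arithmetic keep working on the column because it is still double precision; only the new operator applies the logarithmic semantics.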

Replacement of DBA_Source in Postgres to find DB objects

How can I find out which UDFs use a given table (table_1)?
Below query gives me table's details:
SELECT * FROM information_schema.tables;
Below Query gives UDF details:
select * FROM pg_proc;
But how can I find out which UDFs use table_1?
A string search in the prosrc column of pg_proc is the only way if you want to find dependencies between functions and tables.
Of course that is not very satisfying, because it would be rather difficult to say if – say – an occurrence of table_1 is a reference to the table or a variable name. Also, you cannot find the source of C functions in the catalog.
To get a reliable answer, you would need insight into the language in which the function is written, and here is the core of the problem: PostgreSQL does not have any insight into the language! PostgreSQL's fabled extensibility allows you to define new languages for functions, and only the language handler knows how to interpret the string that is the function body. That also holds for PL/pgSQL which is shipped with PostgreSQL.
That is also the reason why there are no pg_depend entries for objects used in functions.
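The string search described above might look like the following sketch. It assumes the table is called table_1, as in the question, and uses PostgreSQL's regular expression word-boundary escapes so that names such as table_10 do not match. It still cannot tell a table reference from a variable or a comment, so treat the result as a list of candidates.

```sql
-- Candidate functions whose source text mentions table_1 as a whole word.
SELECT n.nspname AS schema_name,
       p.proname AS function_name,
       l.lanname AS language
FROM   pg_proc      p
JOIN   pg_namespace n ON n.oid = p.pronamespace
JOIN   pg_language  l ON l.oid = p.prolang
WHERE  p.prosrc ~* '\mtable_1\M'               -- \m and \M match word start and end
AND    l.lanname NOT IN ('internal', 'c')      -- prosrc holds no source code for these
ORDER  BY schema_name, function_name;
```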

Syntax error in create aggregate

Trying to create an aggregate function:
create aggregate min (my_type) (
sfunc = least,
stype = my_type
);
ERROR: syntax error at or near "least"
LINE 2: sfunc = least,
^
What am I missing?
Although the manual calls least a function:
The GREATEST and LEAST functions select the largest or smallest value from a list of any number of expressions.
I can not find it:
\dfS least
List of functions
Schema | Name | Result data type | Argument data types | Type
--------+------+------------------+---------------------+------
(0 rows)
Like CASE, COALESCE and NULLIF, GREATEST and LEAST are listed in the chapter Conditional Expressions. These SQL constructs are not implemented as functions, as @Laurenz pointed out in the meantime.
The manual advises:
Tip: If your needs go beyond the capabilities of these conditional
expressions, you might want to consider writing a stored procedure in
a more expressive programming language.
The terminology is a bit off here as well, since at the time of writing Postgres did not support true "stored procedures", just functions. (Which is why there was an open TODO item "Implement stored procedures"; Postgres 11 later added CREATE PROCEDURE.)
This manual page might be sharpened to avoid confusion ...
@Laurenz also provided an example. I would just use LEAST in the function to get identical functionality:
CREATE FUNCTION f_least(anyelement, anyelement)
RETURNS anyelement LANGUAGE sql IMMUTABLE AS
'SELECT LEAST($1, $2)';
Do not make it STRICT, that would be incorrect. LEAST(1, NULL) returns 1 and not NULL.
Even if STRICT was correct, I would not use it, because it can prevent function inlining.
Note that this function is limited to exactly two parameters while LEAST takes any number of parameters. You might overload the function to cover 3, 4 etc. input parameters. Or you could write a VARIADIC function for up to 100 parameters.
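A minimal sketch of the VARIADIC variant mentioned above, assuming a hypothetical name f_least_v. It swaps in the min() aggregate over the unnested argument array in place of LEAST; like LEAST, min() skips NULL elements.

```sql
-- f_least_v is a made-up name; it accepts any number of arguments of one type.
CREATE FUNCTION f_least_v(VARIADIC anyarray)
  RETURNS anyelement LANGUAGE sql IMMUTABLE AS
'SELECT min(v) FROM unnest($1) AS t(v)';

-- Usage: returns 1
SELECT f_least_v(3, 1, 2);
```

One caveat with this design: min() may return a related but different type for some inputs (text for a varchar argument, for instance), so an explicit cast in the function body may be needed for such types.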
LEAST and GREATEST are not real functions; internally they are parsed as MinMaxExpr (see src/include/nodes/primnodes.h).
You could achieve what you want with a generic function like this:
CREATE FUNCTION my_least(anyelement, anyelement) RETURNS anyelement
LANGUAGE sql IMMUTABLE CALLED ON NULL INPUT
AS 'SELECT LEAST($1, $2)';
(thanks to Erwin Brandstetter for the CALLED ON NULL INPUT and the idea to use LEAST.)
Then you can create your aggregate as
CREATE AGGREGATE min(my_type) (sfunc = my_least, stype = my_type);
This will only work if there are comparison functions for my_type, otherwise you have to come up with a different my_least function.
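Put together, an end-to-end sketch could look as follows. The composite definition of my_type is an assumption made only so that the example is self-contained (composite types compare field by field when every field type is comparable), and the aggregate is named my_min here to keep it visibly separate from the built-in min.

```sql
-- Assumed stand-in for the question's my_type.
CREATE TYPE my_type AS (major int, minor int);

-- Transition function as given above; not STRICT, so NULL state is handled by LEAST.
CREATE FUNCTION my_least(anyelement, anyelement) RETURNS anyelement
   LANGUAGE sql IMMUTABLE CALLED ON NULL INPUT
   AS 'SELECT LEAST($1, $2)';

-- Hypothetical aggregate name.
CREATE AGGREGATE my_min(my_type) (sfunc = my_least, stype = my_type);

-- Usage: picks the row that sorts first under the type's comparison.
SELECT my_min(v)
FROM  (VALUES (ROW(2, 1)::my_type), (ROW(1, 9)::my_type)) AS t(v);
```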

Executing queries dynamically in PL/pgSQL

I have found solutions (I think) to the problem I'm about to ask for on Oracle and SQL Server, but can't seem to translate this into a Postgres solution. I am using Postgres 9.3.6.
The idea is to be able to generate "metadata" about the table content for profiling purposes. This can only be done (AFAIK) by having queries run for each column so as to find out, say... min/max/count values and such. In order to automate the procedure, it is preferable to have the queries generated by the DB, then executed.
With an example salesdata table, I'm able to generate a select query for each column, returning the min() value, using the following snippet:
SELECT 'SELECT min('||column_name||') as minval_'||column_name||' from salesdata '
FROM information_schema.columns
WHERE table_name = 'salesdata'
The advantage being that the db will generate the code regardless of the number of columns.
Now there's a myriad places I had in mind for storing these queries, either a variable of some sort, or a table column, the idea being to then have these queries execute.
I thought of storing the generated queries in a variable and then executing them with the EXECUTE (or EXECUTE IMMEDIATE) statement, which is the approach employed here (see right pane). But Postgres won't let me declare a variable outside a function, and I've been scratching my head over how this would fit together, whether that's even the direction to follow, or whether there's something simpler.
Would you have any pointers? I'm currently trying something like this, inspired by this other question, but have no idea whether I'm headed in the right direction:
CREATE OR REPLACE FUNCTION foo()
RETURNS void AS
$$
DECLARE
dyn_sql text;
BEGIN
dyn_sql := SELECT 'SELECT min('||column_name||') from salesdata'
FROM information_schema.columns
WHERE table_name = 'salesdata';
execute dyn_sql
END
$$ LANGUAGE PLPGSQL;
System statistics
Before you roll your own, have a look at the system table pg_statistic or the view pg_stats:
This view allows access only to rows of pg_statistic that correspond
to tables the user has permission to read, and therefore it is safe to
allow public read access to this view.
It might already have some of the statistics you are about to compute. It's populated by ANALYZE, so you might run that for new (or any) tables before checking.
-- ANALYZE tbl; -- optionally, to init / refresh
SELECT * FROM pg_stats
WHERE tablename = 'tbl'
AND schemaname = 'public';
Generic dynamic plpgsql function
You want to return the minimum value for every column in a given table. This is not a trivial task, because a function (like SQL in general) demands to know the return type at creation time - or at least at call time with the help of polymorphic data types.
This function does everything automatically and safely. Works for any table, as long as the aggregate function min() is allowed for every column. But you need to know your way around PL/pgSQL.
CREATE OR REPLACE FUNCTION f_min_of(_tbl anyelement)
RETURNS SETOF anyelement
LANGUAGE plpgsql AS
$func$
BEGIN
RETURN QUERY EXECUTE (
SELECT format('SELECT (t::%2$s).* FROM (SELECT min(%1$s) FROM %2$s) t'
, string_agg(quote_ident(attname), '), min(' ORDER BY attnum)
, pg_typeof(_tbl)::text)
FROM pg_attribute
WHERE attrelid = pg_typeof(_tbl)::text::regclass
AND NOT attisdropped -- no dropped (dead) columns
AND attnum > 0 -- no system columns
);
END
$func$;
Call (important!):
SELECT * FROM f_min_of(NULL::tbl); -- tbl being the table name
You need to understand these concepts:
Dynamic SQL in plpgsql with EXECUTE
Polymorphic types
Row types and table types in Postgres
How to defend against SQL injection
Aggregate functions
System catalogs
Related answer with detailed explanation:
Table name as a PostgreSQL function parameter
Refactor a PL/pgSQL function to return the output of various SELECT queries
Postgres data type cast
How to set value of composite variable field using dynamic SQL
How to check if a table exists in a given schema
Select columns with particular column names in PostgreSQL
Generate series of dates - using date type as input
Special difficulty with type mismatch
I am taking advantage of Postgres defining a row type for every existing table. Using the concept of polymorphic types I am able to create one function that works for any table.
However, some aggregate functions return related but different data types as compared to the underlying column. For instance, min(varchar_column) returns text, which is bit-compatible, but not exactly the same data type. PL/pgSQL functions have a weak spot here and insist on data types exactly as declared in the RETURNS clause. No attempt to cast, not even implicit casts, not to speak of assignment casts.
That should be improved. Tested with Postgres 9.3. Did not retest with 9.4, but I am pretty sure, nothing has changed in this area.
That's where this construct comes in as workaround:
SELECT (t::tbl).* FROM (SELECT ... FROM tbl) t;
By casting the whole row to the row type of the underlying table explicitly we force assignment casts to get original data types for every column.
This might fail for some aggregate function. sum() returns numeric for a sum(bigint_column) to accommodate for a sum overflowing the base data type. Casting back to bigint might fail ...
@Erwin Brandstetter, many thanks for the extensive answer. pg_stats does indeed provide a few things, but to draw a complete profile I need a variety of things: min and max values, counts, count of nulls, mean, etc. So a bunch of queries have to be run for each column, some with GROUP BY and such.
Also, thanks for highlighting the importance of data types. I was sort of expecting this to throw a spanner in the works at some point, but my main concern was how to automate the query generation and its execution.
I have tried the function you provide (I will probably need to start learning some plpgsql) but get an error at the SELECT (t::tbl):
ERROR: type "tbl" does not exist
By the way, what is the (t::abc) notation called? In Python this would be a list slice, but that's probably not the case in PL/pgSQL.

How to cut seconds from an interval column?

In my table results from column work_time (interval type) display as 200:00:00. Is it possible to cut the seconds part, so it will be displayed as 200:00? Or, even better: 200h00min (I've seen it accepts h unit in insert so why not load it like this?).
Preferably, by altering work_time column, not by changing the select query.
This is not something you should do by altering a column but by changing the select query in some way. If you change the column you are changing storage and functional uses, and that's not good. To change it on output, you need to modify how it is retrieved.
You have two basic options. The first is to modify your select queries directly, using to_char(myintervalcol, 'HH24:MI').
However, if your issue is that you have a common format you want universally available in your select queries, PostgreSQL has a neat trick I call "table methods." You can attach a function to a table in such a way that you can call it with a syntax similar (but not quite identical) to that of a column. In this case you would do something like:
CREATE OR REPLACE FUNCTION myinterval_nosecs(mytable) RETURNS text LANGUAGE SQL
IMMUTABLE AS
$$
SELECT to_char($1.myintervalcol, 'HH24:MI');
$$;
This works on the row input, not on the underlying table. As it always returns the same information for the same input, you can mark it immutable and even index the output (meaning it can be run at plan time and the index used).
To call this, you'd do something like:
SELECT myinterval_nosecs(m) FROM mytable m;
But you can then use the special syntax above to rewrite that as:
SELECT m.myinterval_nosecs FROM mytable m;
Note that since myinterval_nosecs is a function you cannot omit the m. at the beginning. This is because the query planner will rewrite the query in the former syntax and will not guess as to which relation you mean to run it against.
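For the 200h00min form asked about in the question, to_char templates accept literal text in double quotes, so the same approach should extend to it. A small sketch on a literal interval (the column aliases are made up for illustration):

```sql
-- Both formats applied to a 200-hour interval; for intervals HH24 is not capped at 23.
SELECT to_char(interval '200 hours', 'HH24:MI')         AS hhmm,
       to_char(interval '200 hours', 'HH24"h"MI"min"')  AS labelled;
```

The second template can be dropped into myinterval_nosecs in place of 'HH24:MI' if that is the format wanted everywhere.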