How to get cosine distance between two vectors in postgres?

I am wondering if there is a way to get cosine distance of two vectors in postgres.
For storing vectors I am using CUBE data type.
Below is my table definition:
test=# \d vectors
Table "public.vectors"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+-------------------------------------
id | integer | | not null | nextval('vectors_id_seq'::regclass)
vector | cube | | |
Also, sample data is given below:
test=# select * from vectors order by id desc limit 2;
id | vector
---------+------------------------------------------
2000000 | (109, 568, 787, 938, 948, 126, 271, 499)
1999999 | (139, 365, 222, 653, 313, 103, 215, 796)
I could write my own PL/pgSQL function for this, but I wanted to avoid that, as it might not be efficient.

About your table
First of all, I believe you should change your data type to a plain array.
CREATE TABLE public.vector (
id serial NOT NULL,
vctor double precision [3] --for three dimensional vectors; of course you can change the dimension or leave it unbounded if you need it.
);
INSERT INTO public.vector (vctor) VALUES (ARRAY[2,3,4]);
INSERT INTO public.vector (vctor) VALUES (ARRAY[3,4,5]);
So
SELECT * FROM public.vector;
Will result in the following data
id | vctor
------|---------
1 | {2,3,4}
2 | {3,4,5}
Maybe not the answer you expected but consider this
As you may know already, calculating the cosine between two vectors involves calculating their magnitudes. I don't think the problem is the algorithm but the implementation; it requires calculating squares and square roots, which is expensive for an RDBMS.
Now, talking about efficiency: calling mathematical functions is not, by itself, what loads the server process. In PostgreSQL, the mathematical functions (look here) run from the C library, so they are pretty efficient. Still, in the end, the host has to assign some resources to make these calculations.
I would indeed think carefully before implementing these rather costly operations inside the server. But there is no single right answer; it depends on how you are using the database. For example, if it is a production database with thousands of concurrent users, I would move this kind of calculation elsewhere (a middle layer or a user application). But if there are few users and your database is for a small research operation, then it is fine to implement it as a stored procedure or a process running inside your server, but keep in mind this will affect scalability or portability. Of course, there are more considerations, like how many rows will be processed, or whether or not you intend to fire triggers, etc.
Consider other alternatives
Make a client app
You can write a fast and decent program in VB or the language of your choice, let the client app do the heavy calculation, and use the database for what it does best: storing and retrieving data.
Store the data differently
For this particular example, you could store the unit vectors plus the magnitude. In this way, finding the cosine between any two vectors reduces simply to the dot product of the unit vectors (only multiplications and additions; no squares nor square roots).
CREATE TABLE public.vector (
id serial NOT NULL,
uvctor double precision [3], --unit vector; of course you can change the dimension or make it decimal if you need it
magnitude double precision
);
INSERT INTO public.vector (uvctor, magnitude) VALUES (ARRAY[0.3714, 0.5571, 0.7428], 5.385); -- {Ux, Uy, Uz}, ||V|| where V = [2, 3, 4];
INSERT INTO public.vector (uvctor, magnitude) VALUES (ARRAY[0.4243, 0.5657, 0.7071], 7.071); -- {Ux, Uy, Uz}, ||V|| where V = [3, 4, 5];
SELECT a.uvctor AS a, b.uvctor AS b, 1 - (a.uvctor[1] * b.uvctor[1] + a.uvctor[2] * b.uvctor[2] + a.uvctor[3] * b.uvctor[3]) AS cosine_distance
FROM public.vector a
JOIN public.vector b ON a.id != b.id;
Resulting in
a | b | cosine_distance
------------------------|------------------------|-----------------
{0.3714,0.5571,0.7428} | {0.4243,0.5657,0.7071} | 0.00202963
{0.4243,0.5657,0.7071} | {0.3714,0.5571,0.7428} | 0.00202963
Even if you have to calculate the magnitude of the vector inside the server, you will do it once per vector and not every time you need the distance between two of them. This becomes more important as the number of rows increases. For 1000 vectors, for example, you would have to calculate the magnitude 999000 times if you computed the cosine distance between any two vectors from the original vector components.
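For instance, a minimal sketch of computing the magnitude once, at insert time (the vector [2, 3, 4] from above; ||V|| = sqrt(29) ≈ 5.385):
INSERT INTO public.vector (uvctor, magnitude)
SELECT ARRAY[2/m, 3/m, 4/m], m  -- store the normalized components plus the magnitude
FROM (SELECT sqrt(2.0^2 + 3.0^2 + 4.0^2) AS m) AS s;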
Any combination of the above
Conclusion
When we pursue efficiency, most of the time there is no canonical answer; instead, we have trade-offs to consider and evaluate. It always depends on the ultimate goal we need to achieve. Databases are excellent at storing and retrieving data; they can certainly do other things, but that comes with an added cost. If we can live with the added overhead, fine; otherwise, we have to consider alternatives.

You can take the code below as a reference.
--for calculation of norm vector --
CREATE or REPLACE FUNCTION public.vector_norm(IN vector double precision[])
RETURNS double precision AS
$BODY$
BEGIN
RETURN(SELECT SQRT(SUM(pow)) FROM (SELECT POWER(e,2) as pow from unnest(vector) as e) as norm);
END;
$BODY$ LANGUAGE plpgsql;
ALTER FUNCTION public.vector_norm(double precision[]) OWNER TO postgres;
COMMENT ON FUNCTION public.vector_norm(double precision[]) IS 'This function finds the norm of a vector.';
--call function--
select public.vector_norm('{ 0.039968978613615,0.357211461290717,0.753132887650281,0.760665621142834,0.20826127845794}')
--for calculation of dot_product--
CREATE OR REPLACE FUNCTION public.dot_product(IN vector1 double precision[], IN vector2 double precision[])
RETURNS double precision
AS $BODY$
BEGIN
RETURN(SELECT sum(mul) FROM (SELECT v1e*v2e as mul FROM unnest(vector1, vector2) AS t(v1e,v2e)) AS denominator);
END;
$BODY$ LANGUAGE plpgsql;
ALTER FUNCTION public.dot_product(double precision[], double precision[]) OWNER TO postgres;
COMMENT ON FUNCTION public.dot_product(double precision[], double precision[])
IS 'This function finds the dot product of two multi-dimensional vectors.';
--call function--
SELECT public.dot_product(ARRAY[ 0.039968978613615,0.357211461290717,0.753132887650281,0.760665621142834,0.20826127845794],ARRAY[ 0.039968978613615,0.357211461290717,0.753132887650281,0.760665621142834,0.20826127845794])
--for calculation of cosine similarity--
CREATE OR REPLACE FUNCTION public.cosine_similarity(IN vector1 double precision[], IN vector2 double precision[])
RETURNS double precision
LANGUAGE plpgsql
AS $BODY$
BEGIN
-- dot product of the two vectors divided by the product of their norms
RETURN(SELECT public.dot_product(vector1, vector2) / (public.vector_norm(vector1) * public.vector_norm(vector2)));
END;
$BODY$;
ALTER FUNCTION public.cosine_similarity(double precision[], double precision[])
OWNER TO postgres;
COMMENT ON FUNCTION public.cosine_similarity(double precision[], double precision[])
IS 'This function finds the cosine similarity between two vectors.';
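With the parameters actually used in the body, a quick sanity check, reusing the arrays from the original code (identical vectors, so the similarity is 1 and the cosine distance 1 - similarity is 0):
SELECT public.cosine_similarity(ARRAY[0.63434, 0.23487, 0.324323], ARRAY[0.63434, 0.23487, 0.324323]); -- 1 (up to float rounding)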

Related

Odd behavior with overloaded function in Postgres 13.3

I've added a pair of overloaded functions to handle safe division, optionally with rounding, in PG 13.3. I've run some simple example cases through the routines and, in one case, the output varies unexpectedly. I'm hoping that someone can shed some light on what might be causing this inconsistency. First off, here is the code for the div_safe (anycompatible, anycompatible) : real and div_safe (anycompatible, anycompatible, integer) : real functions. (I tried replacing integer with anycompatible in that third parameter; it made no difference.)
------------------------------
-- No rounding
------------------------------
CREATE OR REPLACE FUNCTION tools.div_safe(
numerator anycompatible,
denominator anycompatible)
RETURNS real
AS $BODY$
SELECT numerator/NULLIF(denominator,0)::real
$BODY$
LANGUAGE SQL;
COMMENT ON FUNCTION tools.div_safe (anycompatible, anycompatible) IS
'Pass in any two values that are, or can be coerced into, numbers, and get a safe division real result.';
------------------------------
-- Rounding
------------------------------
CREATE OR REPLACE FUNCTION tools.div_safe(
numerator anycompatible,
denominator anycompatible,
rounding_in integer)
RETURNS real
AS $BODY$
SELECT ROUND(numerator/NULLIF(denominator,0)::numeric, rounding_in)::real
$BODY$
LANGUAGE sql;
COMMENT ON FUNCTION tools.div_safe (anycompatible, anycompatible, integer) IS
'Pass in any two values that are, or can be coerced into, numbers, the number of rounding digits, and get back a rounded, safe division real result.';
I threw together these checks, as I was working out the code:
-- (real, int)
select '5.1/nullif(null,0)', 5.1/nullif(null,0) as result union all
select 'div_safe(5.1,0)', div_safe(5.1, 0) as result union all
-- (0, 0)
select '0/nullif(0,0)', 0/nullif(0,0) as result union all
select 'div_safe(0, 0)', div_safe(0, 0) as result union all
-- (int, int)
select '5/nullif(8,0)::real', 5/nullif(8,0)::real as result union all
select 'div_safe(5,8)', div_safe(5, 8) as result union all
-- (string, int)
select 'div_safe(''5'',8)', div_safe('5', 8) as result union all
select 'div_safe(''8'',5)', div_safe('8', 5) as result union all
-- Rounding: Have to convert real result to numeric to pass it into ROUND (numeric, integer)
select 'round(div_safe(10,3)::numeric, 2)',
round(div_safe(10,3)::numeric, 2) as result union all
-- Pass a third parameter to specify rounding:
select 'div_safe(20,13,2)', div_safe(20, 13, 2) as result
+-----------------------------------+--------------------+
| ?column? | result |
+-----------------------------------+--------------------+
| 5.1/nullif(null,0) | NULL |
| div_safe(5.1,0) | NULL |
| 0/nullif(0,0) | NULL |
| div_safe(0, 0) | NULL |
| 5/nullif(8,0)::real | 0.625 |
| div_safe(5,8) | 0.625 |
| div_safe('5',8) | 0.625 |
| div_safe('8',5) | 1.600000023841858 |
| round(div_safe(10,3)::numeric, 2) | 3.33 |
| div_safe(20,13,2) | 1.5399999618530273 |
+-----------------------------------+--------------------+
The last line looks wrong to me, it should be rounded to 1.54. I've discovered that I get this behavior in the presence of one of the other tests. Specifically:
select '5/nullif(8,0)::real', 5/nullif(8,0)::real as result union all
Without that, the final line returns 1.54, as expected.
Can anyone shed some light on what's going on? Is it something to do with the combination of anycompatible with UNION ALL? Something incredibly simple that I'm missing?
And, if anyone knows, is there a chance that anynum might be added as a pseudo-type in the future?
Follow-up regarding inconsistent output
I've already gotten a helpful answer to my original question (thanks!), and am following up on a follow-on point. Namely: why does my function round the value before returning it, yet the value comes back changed in the final result? I think there's something fundamental I'm missing here, and it's not obvious. I figured that I needed to confirm that the right version of the function is being called, and use RAISE NOTICE to get at the values as seen inside the function. This new version is div_safe_p (anycompatible, anycompatible, integer) : real, and is written in PL/pgSQL:
------------------------------
-- Rounding
------------------------------
drop function if exists tools.div_safe_p(anycompatible,anycompatible,integer);
CREATE OR REPLACE FUNCTION tools.div_safe_p(
numerator anycompatible,
denominator anycompatible,
rounding_in integer)
RETURNS real
AS $BODY$
DECLARE
result_r real := 0;
BEGIN
SELECT ROUND(numerator/NULLIF(denominator,0)::numeric, rounding_in)::real INTO result_r;
RAISE NOTICE 'Calling div_safe_p(%, %, %) : %', numerator, denominator, rounding_in, result_r;
RETURN result_r;
END
$BODY$
LANGUAGE plpgsql;
COMMENT ON FUNCTION tools.div_safe_p (anycompatible, anycompatible, integer) IS
'Pass in any two values that are, or can be coerced into, numbers, the number of rounding digits, and get back a rounded, safe division real result.';
Here's a sample call, and output:
select 5/nullif(8,0)::real union all
select div_safe_p(10,3, 2)::real
+--------------------+
| ?column? |
+--------------------+
| 0.625 |
| 3.3299999237060547 |
+--------------------+
The result of div_safe_p appears to be converted to a double, not a real. Check the RAISE NOTICE console output; the function returned 3.33:
NOTICE: Calling div_safe_p(10, 3, 2) : 3.33
Yes, this 3.33 is shown as 3.3299999237060547. I'm not clear why the value is modified from how it's returned from the function. I also can't reproduce the transformation by converting the value by hand. Both select 3.33::real and select 3.33::double precision return 3.33.
Another variant, the same as the original except without the ::real castings:
select 5/nullif(8,0) union all
select div_safe_p(10,3, 2)
+----------+
| ?column? |
+----------+
| 0 |
| 3.33 |
+----------+
It certainly looks like the first value encountered is guiding the column typing, as answered already. However, I'm stumped as to why this changes the behavior of the function itself. Or, at least changes how the output is interpreted.
If this sounds like a fine point...maybe it is. When I run into peculiarities that I can't explain, I hope to figure out what's going on so that I can predict and troubleshoot more complex examples in the future.
Thanks for any illumination!
This is as expected on account of the type resolution rules for UNION:
Select the first non-unknown input type as the candidate type, then consider each other non-unknown input type, left to right.
Now the first non-NULL data type is double precision (see the type resolution rules for operators), so all results get cast to double precision resulting in the imprecision being visible. Without that test, the result is of type real, so PostgreSQL shows few enough digits to hide the imprecision.
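The two-step cast shows this directly; the question's by-hand test cast the literal 3.33, whereas the actual chain widens a real value to double precision:
SELECT 3.33::real::double precision; -- 3.3299999237060547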
It is useful to use the pg_typeof function to show the data type; that will clear things up:
SELECT pg_typeof(v)
FROM (SELECT NULL
UNION ALL
SELECT 2::real / 3::real
UNION ALL
SELECT pi()) AS t(v);
pg_typeof
══════════════════
double precision
double precision
double precision
(3 rows)
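A minimal fix sketch along these lines: keep every branch real, so the UNION column resolves to real and the rounding stays visible (real / real yields real, unlike the integer / real mix above):
SELECT 5::real / nullif(8,0)::real AS result
UNION ALL
SELECT div_safe_p(10, 3, 2); -- now displays 3.33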

PostgreSQL - rounding floating point numbers

I have a newbie question about floating point numbers in PostgreSQL 9.2.
Is there a function to round a floating point number directly, i.e. without having to convert the number to a numeric type first?
Also, I would like to know whether there is a function to round by an arbitrary unit of measure, such as to nearest 0.05?
When casting the number into a decimal form first, the following query works perfectly:
SELECT round(1/3.::numeric,4);
round
--------
0.3333
(1 row)
Time: 0.917 ms
However, what really I'd like to achieve is something like the following:
SELECT round(1/3.::float,4);
which currently gives me the following error:
ERROR: function round(double precision, integer) does not exist at character 8
Time: 0.949 ms
Thanks
Your workaround solution works with any version of PostgreSQL:
SELECT round(1/3.::numeric,4);
But the answer to "Is there a function to round a floating point number directly?" is no.
The cast problem
You are reporting a well-known limitation: there is a lack of overloads in some PostgreSQL functions. I consider it a gap, but @CraigRinger, @Catcall (see the comments at Craig's answer) and the PostgreSQL team agree about "PostgreSQL's historic rationale".
The solution is to develop a centralized and reusable "library of snippets", like pg_pubLib. It implements the strategy described below.
Overloading as casting strategy
You can overload the built-in ROUND function with:
CREATE FUNCTION ROUND(float,int) RETURNS NUMERIC AS $f$
SELECT ROUND($1::numeric,$2);
$f$ language SQL IMMUTABLE;
Now your dream will be reality, try
SELECT round(1/3.,4); -- 0.3333 numeric
It returns a (decimal) NUMERIC datatype, which is fine for some applications... An alternative is to use round(1/3.,4)::float or to create a round_tofloat() function.
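To check which overload is hit, pg_typeof helps; a small check, assuming the ROUND(float,int) overload above is installed (note that in 1/3.::float the cast binds to 3. only, making the whole expression double precision):
SELECT round(1/3.::float, 4); -- 0.3333
SELECT pg_typeof(round(1/3.::float, 4)); -- numeric, via the float,int overload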
Another alternative, which preserves the input datatype and uses the full accuracy-precision range of a floating point number (see IanKenney's answer), is to return a float when the accuracy is given:
CREATE or replace FUNCTION ROUND(
input float, -- the input number
accuracy float -- accuracy
) RETURNS float AS $f$
SELECT ROUND($1/accuracy)*accuracy
$f$ language SQL IMMUTABLE;
COMMENT ON FUNCTION ROUND(float,float) IS 'ROUND by accuracy.';
Try
SELECT round(21.04, 0.05); -- 21.05
SELECT round(21.04, 5::float); -- 20
SELECT round(pi(), 0.0001); -- 3.1416
SELECT round(1/3., 0.0001); -- 0.33330000000000004 (oops!)
To avoid floating-point truncation (internal information loss), you can "clean" the result, for example by truncating to 9 digits:
CREATE or replace FUNCTION ROUND9(
input float, -- the input number
accuracy float -- accuracy
) RETURNS float AS $f$
SELECT (ROUND($1/accuracy)*accuracy)::numeric(99,9)::float
$f$ language SQL IMMUTABLE;
Try
SELECT round9(1/3., 0.00001); -- 0.33333 float, solved!
SELECT round9(1/3., 0.005); -- 0.335 float, ok!
PS: after the overloads are created, the command \df round in psql will show something like this table:
Schema | Name | Result | Argument
------------+-------+---------+------------------
myschema | round | numeric | float, int
myschema | round | float | float, float
pg_catalog | round | float | float
pg_catalog | round | numeric | numeric
pg_catalog | round | numeric | numeric, int
where float is a synonym for double precision, and myschema is public when you do not use a schema. The pg_catalog functions are the default ones; see the guide to the built-in math functions.
More details
See a complete Wiki answer here.
You can accomplish this by doing something along the lines of
select round(21.04 / 0.05, 0) * 0.05
where 21.04 is the number to round and 0.05 is the accuracy.

How to create a custom windowing function for PostgreSQL? (Running Average Example)

I would really like to better understand what is involved in creating a UDF that operates over windows in PostgreSQL. I did some searching about how to create UDFs in general, but haven't found an example of how to do one that operates over a window.
To that end I am hoping that someone would be willing to share code for how to write a UDF (can be in C, pl/SQL or any of the procedural languages supported by PostgreSQL) that calculates the running average of numbers in a window. I realize there are ways to do this by applying the standard average aggregate function with the windowing syntax (rows between syntax I believe), I am simply asking for this functionality because I think it makes a good simple example. Also, I think if there was a windowing version of average function then the database could keep a running sum and observation count and wouldn't sum up almost identical sets of rows at each iteration.
You have to look at the PostgreSQL source code: postgresql/src/backend/utils/adt/windowfuncs.c and postgresql/src/backend/executor/nodeWindowAgg.c
There is no good documentation :( -- a fully functional window function can only be implemented in C or PL/v8; there is no API for other languages.
http://www.pgcon.org/2009/schedule/track/Version%208.4/128.en.html is a presentation from the author of the implementation in PostgreSQL.
I found only one non-core implementation - http://api.pgxn.org/src/kmeans/kmeans-1.1.0/
http://pgxn.org/dist/plv8/1.3.0/doc/plv8.html
According to the documentation "Other window functions can be added by the user. Also, any built-in or user-defined normal aggregate function can be used as a window function." (section 4.2.8). That worked for me for computing stock split adjustments:
CREATE OR REPLACE FUNCTION prod(float8, float8) RETURNS float8
AS 'SELECT $1 * $2;'
LANGUAGE SQL IMMUTABLE STRICT;
CREATE AGGREGATE prods ( float8 ) (
SFUNC = prod,
STYPE = float8,
INITCOND = 1.0
);
create or replace view demo.price_adjusted as
select id, vd,
prods(sdiv) OVER (PARTITION by id ORDER BY vd DESC ROWS UNBOUNDED PRECEDING) as adjf,
rawprice * prods(sdiv) OVER (PARTITION by id ORDER BY vd DESC ROWS UNBOUNDED PRECEDING) as price
from demo.prices_raw left outer join demo.adjustments using (id,vd);
Here are the schemas of the two tables:
CREATE TABLE demo.prices_raw (
id VARCHAR(30),
vd DATE,
rawprice float8 );
CREATE TABLE demo.adjustments (
id VARCHAR(30),
vd DATE,
sdiv float);
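As an aside, the running average from the original question needs no custom window function at all; the built-in AVG with the ROWS BETWEEN frame the question mentions keeps a running sum and count internally. A minimal sketch against the demo.prices_raw table above:
SELECT id, vd,
AVG(rawprice) OVER (PARTITION BY id ORDER BY vd
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_avg
FROM demo.prices_raw;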
Starting with table
payments
+------------------------------+
| customer_id | amount | item |
| 5 | 10 | book |
| 5 | 71 | mouse |
| 7 | 13 | cover |
| 7 | 22 | cable |
| 7 | 19 | book |
+------------------------------+
SELECT customer_id,
AVG(amount) OVER (PARTITION BY customer_id) AS avg_amount,
item
FROM payments;
we get
+----------------------------------+
| customer_id | avg_amount | item |
| 5 | 40.5 | book |
| 5 | 40.5 | mouse |
| 7 | 18 | cover |
| 7 | 18 | cable |
| 7 | 18 | book |
+----------------------------------+
AVG being an aggregate function, it can act as a window function. However, not all window functions are aggregate functions; the aggregate functions are the non-sophisticated window functions.
In the query above, let's not use the built-in AVG function but our own implementation, which does the same, just implemented by the user. The query above becomes:
SELECT customer_id,
my_avg(amount) OVER (PARTITION BY customer_id) AS avg_amount,
item
FROM payments;
The only difference from the former query is that AVG has been replaced with my_avg. We now need to implement our custom function.
On how to compute the average
Sum up all the elements, then divide by the number of elements. For customer_id of 7, that would be (13 + 22 + 19) / 3 = 18.
We can divide it into:
a step-by-step accumulation -- the sum.
a final operation -- division.
On how the aggregate function gets to the result
The average is computed in steps. Only the last value is necessary.
Start with an initial value of 0.
Feed 13. Compute the intermediate/accumulated sum, which is 13.
Feed 22. Compute the accumulated sum, which needs the previous sum plus this element: 13 + 22 = 35
Feed 19. Compute the accumulated sum, which needs the previous sum plus this element: 35 + 19 = 54. This is the total that needs to be divided by the number of elements (3).
The result of step 3 is fed to another function that knows how to divide the accumulated sum by the number of elements.
What happened here is that the state started with the initial value of 0 and was changed with every step, then passed to the next step.
State travels between steps for as long as there is data. When all data is consumed state goes to a final function (terminal operation). We want the state to contain all the information needed for the accumulator as well as by the terminal operation.
In the specific case of computing the average, the terminal operation needs to know how many elements the accumulator worked with because it needs to divide by that. For that reason, the state needs to include both the accumulated sum and the number of elements.
We need a tuple that will contain both. The predefined PostgreSQL POINT type to the rescue: POINT(5, 89) means 5 elements accumulated into a sum of 89. The initial state will be POINT(0,0).
The accumulator is implemented in what's called a state function. The terminal operation is implemented in what's called a final function.
When defining a custom aggregate function we need to specify:
the aggregate function name and return type
the initial state
the type of the state that the infrastructure will pass between steps and to the final function
a state function -- knows how to perform the accumulation steps
a final function -- knows how to perform the terminal operation. Not always needed (e.g. in a custom implementation of SUM the final value of the accumulated sum is the result.)
Here's the definition for the custom aggregate function.
CREATE AGGREGATE my_avg (NUMERIC) ( -- NUMERIC is the type of the values being aggregated
initcond = '(0,0)', -- this is the initial state, of type POINT
stype = POINT, -- this is the type of the state that will be passed between steps
sfunc = my_acc, -- this is the function that knows how to build a new state from the existing state and a new element. Takes in the state (type POINT) and an element for the step (type NUMERIC)
finalfunc = my_final_func -- returns the result for the aggregate function. Takes in the state of type POINT (like all other steps) and returns the result as what the aggregate function returns - NUMERIC
);
The only thing left is to define two functions my_acc and my_final_func.
CREATE FUNCTION my_acc (state POINT, elem_for_step NUMERIC) -- performs accumulated sum
RETURNS POINT
LANGUAGE SQL
AS $$
-- state[0] is the number of elements, state[1] is the accumulated sum
SELECT POINT(state[0]+1, state[1] + elem_for_step);
$$;
CREATE FUNCTION my_final_func (POINT) -- performs the division and returns the final value
RETURNS NUMERIC
LANGUAGE SQL
AS $$
-- $1[1] is the sum, $1[0] is the number of elements
SELECT ($1[1]/$1[0])::NUMERIC;
$$;
Now that the functions are available, the CREATE AGGREGATE defined above will run successfully. With the aggregate defined, the query based on my_avg instead of the built-in AVG can be run:
SELECT customer_id,
my_avg(amount) OVER (PARTITION BY customer_id) AS avg_amount,
item
FROM payments;
The results are identical with what you get when using the built-in AVG.
The PostgreSQL documentation suggests that users are limited to implementing user-defined aggregate functions (as opposed to general window functions):
In addition to these [pre-defined window] functions, any built-in or user-defined general-purpose or statistical aggregate (i.e., not ordered-set or hypothetical-set aggregates) can be used as a window function;
What I suspect the exclusion of ordered-set and hypothetical-set aggregates means is that the usable aggregates are those where:
the value returned is identical for all rows (e.g. AVG and SUM; in contrast, RANK returns a different value for each row in the group depending on more sophisticated criteria)
it makes no sense to ORDER BY when PARTITIONing because the values are the same for all rows anyway (in contrast, we do want to ORDER BY when using RANK())
Query:
SELECT customer_id, item, rank() OVER (PARTITION BY customer_id ORDER BY amount desc) FROM payments;
Geometric mean
The following is a user-defined aggregate function that I found no built-in aggregate for and may be useful to some.
The state function computes the average of the natural logarithms of the terms.
The final function raises constant e to whatever the accumulator provides.
CREATE OR REPLACE FUNCTION sum_of_log(state POINT, curr_val NUMERIC)
RETURNS POINT
LANGUAGE SQL
AS $$
SELECT POINT(state[0] + 1,
(state[1] * state[0]+ LN(curr_val))/(state[0] + 1));
$$;
CREATE OR REPLACE FUNCTION e_to_avg_of_log(POINT)
RETURNS NUMERIC
LANGUAGE SQL
AS $$
select exp($1[1])::NUMERIC;
$$;
CREATE AGGREGATE geo_mean (NUMERIC)
(
stype = POINT, -- the state holds (count, running average of logs)
initcond = '(0,0)', -- represents a POINT value
sfunc = sum_of_log,
finalfunc = e_to_avg_of_log
);
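A quick sanity check, assuming the corrected definition above: the geometric mean of 2 and 8 is 4 (exp of the average of ln 2 and ln 8).
SELECT geo_mean(x) FROM (VALUES (2.0), (8.0)) AS t(x); -- 4 (up to float rounding)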
PL/R provides such functionality. See here for some examples. That said, I'm not sure that it (currently) meets your requirement of "keep[ing] a running sum and observation count and [not] sum[ming] up almost identical sets of rows at each iteration" (see here).

Trying to create aggregate function in PostgreSQL

I'm trying to create new aggregate function in PostgreSQL to use instead of the sum() function
I started my journey in the manual here.
Since I wanted to create a function that takes an array of double precision values, sums them and then does some additional calculations, I first created that final function, which takes double precision as input and gives double precision as output:
CREATE OR REPLACE FUNCTION eeincometax(tax double precision)
RETURNS double precision AS
$$
DECLARE
v double precision;
BEGIN
IF tax > 256 THEN
v := 256;
ELSE
v := tax;
END IF;
RETURN v*0.21/0.79;
END;
$$ LANGUAGE plpgsql;
Then I wanted to create the aggregate function that takes an array of double precision values and puts out a single double precision value for my previous function to handle.
CREATE AGGREGATE aggregate_ee_income_tax (float8[]) (
sfunc = array_agg
,stype = float8
,initcond = '{}'
,finalfunc = eeincometax);
What I get when I run that command is:
ERROR: function array_agg(double precision, double precision[]) does
not exist
I'm somewhat stuck here, because the manual lists array_agg() as an existing function. What am I doing wrong?
Also, when I run:
\da
List of aggregate functions
Schema | Name | Result data type | Argument data types | Description
--------+------+------------------+---------------------+-------------
(0 rows)
My installation has no aggregate functions at all? Or does \da only list user-defined functions?
Basically what I'm trying to understand:
1) Can I use an existing function to sum up my array values?
2) How can I find out about the input and output data types of functions? The docs claim that array_agg() takes any kind of input.
3) What is wrong with my own aggregate function?
Edit 1
To give more information and clearer picture of what I'm trying to achieve:
I have one huge query over several tables which goes something like this:
SELECT sum(tax) ... from (SUBQUERY) as foo group by id
I want to replace that sum function with my own aggregate function so I don't have to do additional calculations on backend - since they can all be done on database level.
Edit 2
Accepted Ants's answer. Since final solution comes from comments I post it here for reference:
CREATE AGGREGATE aggregate_ee_income_tax (float8)
(
sfunc = float8pl
,stype = float8
,initcond = '0.0'
,finalfunc = eeincometax
);
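Plugged into the query shape from Edit 1, the aggregate then replaces sum() directly ((SUBQUERY) stands in for the elided subquery above):
SELECT aggregate_ee_income_tax(tax) FROM (SUBQUERY) AS foo GROUP BY id;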
array_agg is an aggregate function, not a regular function, so it can't be used as the state transition function for a new aggregate. What you want to do is create an aggregate function whose state transition function is identical to that of array_agg, with a custom final func.
Unfortunately the state transition function of array_agg is defined in terms of an internal datatype so it can't be reused. Fortunately there is an existing function in core that already does what you want.
CREATE AGGREGATE aggregate_ee_income_tax (float8)(
sfunc = array_append,
stype = float8[],
initcond = '{}',
finalfunc = eeincometax);
Also note that you had your types mixed up, you probably want aggregate a set of floats to an array, not a set of arrays to a float.
In addition to @Ants's excellent advice:
1.) Your final function could be simplified to:
CREATE FUNCTION eeincometax(float8)
RETURNS float8 LANGUAGE SQL AS
$func$
SELECT (least($1, 256) * 21) / 79
$func$;
2.) It seems like you are dealing with money? In this case I would strongly advise using the type numeric (preferred) or money for the purpose. Floating point operations are often not precise enough.
3.) The initial condition of the aggregate can simply be just 0:
CREATE AGGREGATE aggregate_ee_income_tax(float8)
(
sfunc = float8pl
,stype = float8
,initcond = 0
,finalfunc = eeincometax
);
4.) In your case (least(sum(tax), 256) * 21) / 79 is probably faster than your custom aggregate. Aggregate functions provided by PostgreSQL are written in C and optimized for performance. I would use that instead.
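For reference, point 4 spelled out against the same query shape ((SUBQUERY) again standing in for the elided subquery):
SELECT (least(sum(tax), 256) * 21) / 79 AS income_tax FROM (SUBQUERY) AS foo GROUP BY id;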

Generate a random number of non-duplicated random numbers in [0, 1001] through a loop

I need to generate a random number of non-duplicated random numbers in plpgsql. The non-duplicated numbers shall fall in the range [1, 1001]. However, the code below generates numbers exceeding 1001.
directed2number := trunc(Random()*7+1);
counter := directed2number
while counter > 0
loop
to_point := trunc((random() * 1/directed2number - counter/directed2number + 1) * 1001 +1);
...
...
counter := counter - 1;
end loop;
If I understand right
You need a random number (1 to 8) of random numbers.
The random numbers span 1 to 1001.
The random numbers need to be unique. None shall appear more than once.
CREATE OR REPLACE FUNCTION x.unique_rand_1001()
RETURNS SETOF integer AS
$body$
DECLARE
nrnr int := trunc(random()*7+1); -- number of numbers
BEGIN
RETURN QUERY
SELECT (1000 * random())::integer + 1
FROM generate_series(1, nrnr*2)
GROUP BY 1
LIMIT nrnr;
END;
$body$ LANGUAGE plpgsql VOLATILE;
Call:
SELECT x.unique_rand_1001();
Numbers are made unique by the GROUP BY. I generate twice as many numbers as needed to provide enough numbers in case duplicates are removed. With the given dimensions of the task (max. 8 of 1001 numbers) it is astronomically unlikely that not enough numbers remain. Worst case scenario: fewer numbers are returned.
I wouldn't approach the problem that way in PostgreSQL.
From a software engineering point of view, I think I'd separate generating a random integer between x and y, generating 'n' of those integers, and guaranteeing the result is a set.
-- Returns a random integer in the interval [n, m].
-- Not rigorously tested. For rigorous testing, see Knuth, TAOCP vol 2.
CREATE OR REPLACE FUNCTION random_integer(integer, integer)
RETURNS integer AS
$BODY$
select cast(floor(random()*($2 - $1 +1)) + $1 as integer);
$BODY$
LANGUAGE sql VOLATILE;
Then to select a single random integer between 1 and 1000,
select random_integer(1, 1000);
To select 100 random integers between 1 and 1000,
select random_integer(1, 1000)
from generate_series(1,100);
You can guarantee uniqueness in either application code or in the database. Ruby implements a Set class. Other languages have similar capabilities under various names.
One way to do this in the database uses a local temporary table. Erwin's right about the need to generate more integers than you need, to compensate for the removal of duplicates. This code generates 20, and selects the first 8 rows in the order they were inserted.
create local temp table unique_integers (
id serial primary key,
n integer unique
);
insert into unique_integers (n)
select random_integer(1, 1001) n
from generate_series(1, 20)
on conflict (n) do nothing;
select n
from unique_integers
order by id
fetch first 8 rows only;
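For completeness, one way to do the whole task in a single statement, as a sketch: shuffle the candidate range and take a random-sized slice. Uniqueness is guaranteed by construction, at the cost of sorting all 1001 candidates (cheap at this size):
SELECT n
FROM generate_series(1, 1001) AS t(n)
ORDER BY random()                   -- shuffle the candidates
LIMIT trunc(random()*7 + 1)::int;   -- random count, same expression as the question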