I want to find the median value of some data in PostgreSQL. A quick Google search told me that PostgreSQL 8.2 does not come with a median function. After some more searching I found this link:
https://wiki.postgresql.org/wiki/Aggregate_Median
which provides some information on how to write a custom median function. Here is the code I have so far
CREATE OR REPLACE FUNCTION my_schema.final_median(anyarray) RETURNS float8 STRICT AS
$$
DECLARE
cnt INTEGER;
BEGIN
cnt := (SELECT count(*) FROM unnest($1) val WHERE val IS NOT NULL);
RETURN (SELECT avg(tmp.val)::float8
FROM (SELECT val FROM unnest($1) val
WHERE val IS NOT NULL
ORDER BY 1
LIMIT 2 - MOD(cnt, 2)
OFFSET CEIL(cnt/ 2.0) - 1
) AS tmp
);
END
$$ LANGUAGE plpgsql;
CREATE AGGREGATE my_schema.mymedian(anyelement) (
SFUNC=array_append,
STYPE=anyarray,
FINALFUNC=my_schema.final_median,
INITCOND='{}'
);
-- I need this filter here. This is a place holder for a larger query
select my_schema.mymedian(id) filter (where id < 5)
from my_schema.golf_data
However I am getting an error when I run the code
ERROR: function my_schema.mymedian(numeric) is not defined as STRICT
LINE 27: select my_schema.mymedian(id) filter (where id < 5)
^
HINT: The filter clause is only supported over functions defined as STRICT.
********** Error **********
ERROR: function my_schema.mymedian(numeric) is not defined as STRICT
SQL state: 0AM00
Hint: The filter clause is only supported over functions defined as STRICT.
Character: 661
I am guessing the interpreter wants me to add the keyword strict somewhere. But I am not sure where I need to make this change.
Any help would be appreciated
This page indicates how to use the STRICT keyword and what it does:
http://www.postgresql.org/docs/8.2/static/sql-createfunction.html
So try this:
CREATE OR REPLACE FUNCTION my_schema.final_median(anyarray) RETURNS float8 STRICT AS
IMPORTANT:
The effect of STRICT is that if an argument is NULL, the function is not called at all and the result is automatically assumed to be NULL. In your case that can happen if there are no records with id < 5, so make sure the place you call it from can cope with a NULL result.
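To make the STRICT behaviour concrete, here is a minimal sketch. The two function names are hypothetical and exist only for this illustration. Both bodies would turn a NULL argument into -1, but the STRICT variant never runs its body for NULL input:

```sql
-- Hypothetical demo functions, not part of the question's schema.
CREATE FUNCTION demo_strict(integer) RETURNS integer
    AS 'SELECT coalesce($1, -1)' LANGUAGE sql STRICT;

CREATE FUNCTION demo_not_strict(integer) RETURNS integer
    AS 'SELECT coalesce($1, -1)' LANGUAGE sql;

SELECT demo_strict(NULL::integer)     AS strict_result,      -- NULL: the body is skipped
       demo_not_strict(NULL::integer) AS not_strict_result;  -- -1: the body runs
```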
I am completely new to Supabase and PostgreSQL. I want to get the top 3 sneakers bought from the store_sales_items table.
So I wrote this query:
SELECT sneaker_product, SUM(quantity) as total_sales
FROM store_sales_items
WHERE is_sneaker is not null
GROUP by sneaker_product
ORDER BY total_sales DESC
LIMIT 3;
that will return
sneaker_product | total_sales
----------------+------------
1               | "10"
3               | "6"
4               | "5"
Then I tried to create a function so I can use it easily from a React app. Here it is:
CREATE OR REPLACE FUNCTION top_three()
RETURNS TABLE (sneaker_product INT, total_sales INT) AS $$
BEGIN
RETURN QUERY
SELECT sneaker_products.id, SUM(quantity) as total_sales
FROM store_sales_items
JOIN sneaker_products on store_sales_items.sneaker_product = sneaker_products.id
WHERE is_sneaker is not null
GROUP BY sneaker_products.id
ORDER BY total_sales DESC
LIMIT 3;
END;
$$ LANGUAGE plpgsql;
After fixing a lot of errors I came up with this, but I still get: Failed to run sql query: structure of query does not match function result type
I am doing all of this in the Supabase web app SQL editor.
One more thing: sneaker_product has a relation to the sneaker_products table as well. Is that causing the problem? I don't know, but can someone help me? Thanks 😊
I tried matching the column types exactly, but no luck. I also tried setting the type of sneaker_product to the sneaker_products table type, which I thought might work, but that doesn't seem like the right way to do it.
Applied to an integer column, the aggregate function sum() returns bigint, not integer. Either change the function's result type to match or cast the sum to integer.
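A sketch of the second option (casting the sum), assuming that sneaker_products.id is an integer column, which the question does not confirm. The ORDER BY uses the column position so that it cannot be confused with the total_sales output parameter:

```sql
CREATE OR REPLACE FUNCTION top_three()
RETURNS TABLE (sneaker_product INT, total_sales INT) AS $$
BEGIN
  RETURN QUERY
  SELECT sneaker_products.id,
         SUM(quantity)::int          -- cast so the type matches total_sales INT
  FROM store_sales_items
  JOIN sneaker_products ON store_sales_items.sneaker_product = sneaker_products.id
  WHERE is_sneaker IS NOT NULL
  GROUP BY sneaker_products.id
  ORDER BY 2 DESC                    -- second output column, i.e. the sum
  LIMIT 3;
END;
$$ LANGUAGE plpgsql;
```

The other option is to leave the query alone and declare total_sales as BIGINT in the RETURNS TABLE clause.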
I am trying to create the following function in PostgreSQL but get the following error. This is from a MySQL procedure that I need to convert to PostgreSQL. I am failing to convert the syntax to PostgreSQL. I am a beginner in PostgreSQL. Please assist me.
CREATE OR REPLACE FUNCTION public.usp_failed_analyze4()
RETURNS TABLE(status varchar) as
$BODY$
SET #maxdate = (SELECT MAX(analyzetime) FROM wp_analyze_history);
SET #maxdateint = (SELECT DATEDIFF(NOW() ,MAX(analyzetime)) FROM wp_analyze_history);
SET #STATUS = SELECT Status from wp_analyze_history WHERE Status NOT IN ('OK','Table is already up to date','The Analyze task DID NOT run!') AND analyzetime = #maxdate);
SET #STATUSNOTRUN = 'The Analyze task DID NOT run!';
IF #maxdateint > 7
THEN SELECT #STATUSNOTRUN;
ELSE SELECT #STATUS as "";
$BODY$
LANGUAGE sql;
error: ERROR: syntax error at or near "#"
Position: 109
It's hard to tell what you want as you tried to copy the MySQL 1:1.
However, there are several problems in your code:
language sql does not have variables or IF statements. You need to use PL/pgSQL (language plpgsql)
PL/pgSQL requires a declare block to declare all variables and the actual code needs a begin ... end; block as well.
You can't use SET for assignment. PL/pgSQL uses := for that
To store the result of a single row query in a variable use select ... into ... from
The character # is invalid in an SQL identifier and can't be used for variable names (which follow the rules of SQL identifiers). In Postgres it's a common habit to prefix variables with something to avoid ambiguity with column names. I use l_ for local variables (but that's completely a personal preference)
You don't seem to want to return multiple rows, but a single value. So you don't need returns table
To return something from a function, use return not select
Putting that all together it should look something like this:
CREATE OR REPLACE FUNCTION usp_failed_analyze4()
RETURNS varchar -- return a single value
AS
$BODY$
declare
l_maxdate timestamp;
l_maxdatediff interval;
l_status text;
l_statusnotrun text;
begin
select MAX(analyzetime), current_timestamp - MAX(analyzetime)
into l_maxdate, l_maxdatediff
FROM wp_analyze_history;
SELECT Status
into l_status
from wp_analyze_history
WHERE Status NOT IN ('OK','Table is already up to date','The Analyze task DID NOT run!')
AND analyzetime = l_maxdate;
l_statusnotrun := 'The Analyze task DID NOT run!';
IF l_maxdatediff > interval '7 days'
THEN
return l_statusnotrun;
ELSE
return l_status; -- the original MySQL code returned the status here
end if;
end;
$BODY$
LANGUAGE plpgsql;
There is still room for a lot of optimization, but this matches your initial code as much as possible.
I have a plpgsql function that takes a jsonb input, and uses it to first check something, and then again in a query to get results. Something like:
CREATE OR REPLACE FUNCTION public.my_func(
a jsonb,
OUT inserted integer)
RETURNS integer
LANGUAGE 'plpgsql'
COST 100.0
VOLATILE NOT LEAKPROOF
AS $function$
BEGIN
-- fail if there's something already there
IF EXISTS(
select t.x from jsonb_populate_recordset(null::my_type, a) f inner join some_table t
on f.x = t.x and
f.y = t.y
) THEN
RAISE EXCEPTION 'concurrency violation... already present.';
END IF;
-- straight insert, and collect number of inserted
WITH inserted_rows AS (
INSERT INTO some_table (x, y, z)
SELECT f.x, f.y, f.z
FROM jsonb_populate_recordset(null::my_type, a) f
RETURNING 1
)
SELECT count(*) from inserted_rows INTO inserted
;
END
$function$;
Here, I'm using jsonb_populate_recordset(null::my_type, a) both in the IF check, and also in the actual insert. Is there a way to do the parsing once - perhaps via a variable of some sort? Or would the query optimiser kick in and ensure the parse operation happens only once?
If I understand correctly, you are looking for something like this:
CREATE OR REPLACE FUNCTION public.my_func(
a jsonb,
OUT inserted integer)
RETURNS integer
LANGUAGE 'plpgsql'
COST 100.0
VOLATILE NOT LEAKPROOF
AS $function$
BEGIN
WITH checked_rows AS (
SELECT f.x, f.y, f.z, t.x IS NOT NULL as present
FROM jsonb_populate_recordset(null::my_type, a) f
LEFT join some_table t
on f.x = t.x and f.y = t.y
), violated_rows AS (
SELECT count(*) AS violated FROM checked_rows AS c WHERE c.present
), inserted_rows AS (
INSERT INTO some_table (x, y, z)
SELECT c.x, c.y, c.z
FROM checked_rows AS c
WHERE (SELECT violated FROM violated_rows) = 0
RETURNING 1
)
SELECT count(*) from inserted_rows INTO inserted
;
IF inserted = 0 THEN
RAISE EXCEPTION 'concurrency violation... already present.';
END IF;
END;
$function$;
A jsonb value does not need to be parsed more than once; parsing happens at assignment:
while jsonb data is stored in a decomposed binary format that makes it slightly slower to input due to added conversion overhead, but significantly faster to process, since no reparsing is needed.
The jsonb_populate_recordset function is declared STABLE:
STABLE indicates that the function cannot modify the database, and that within a single table scan it will consistently return the same result for the same argument values, but that its result could change across SQL statements.
I am not sure how that applies here. On the one hand, a UDF call is treated as a single statement; on the other hand, a UDF can contain multiple statements. Clarification needed.
Finally, if you want to cache such things, you could use an array:
CREATE OR REPLACE FUNCTION public.my_func(
a jsonb,
OUT inserted integer)
RETURNS integer
LANGUAGE 'plpgsql'
COST 100.0
VOLATILE NOT LEAKPROOF
AS $function$
DECLARE
d my_type[]; -- variable that caches the parsed rows
BEGIN
select array_agg(f) into d from jsonb_populate_recordset(null::my_type, a) as f;
-- fail if there's something already there
IF EXISTS(
select *
from some_table t
where (t.x, t.y) in (select x, y from unnest(d)))
THEN
RAISE EXCEPTION 'concurrency violation... already present.';
END IF;
-- straight insert, and collect number of inserted
WITH inserted_rows AS (
INSERT INTO some_table (x, y, z)
SELECT f.x, f.y, f.z
FROM unnest(d) f
RETURNING 1
)
SELECT count(*) from inserted_rows INTO inserted;
END $function$;
If you actually want to reuse a result set repeatedly, the general solution would be a temporary table. Example:
Using temp table in PL/pgSQL procedure for cleaning tables
However, that's rather expensive. Looks like all you need is a UNIQUE constraint or index:
Simple and safe with UNIQUE constraint
ALTER TABLE some_table ADD CONSTRAINT some_table_x_y_uni UNIQUE (x,y);
As opposed to your procedural attempt, this is also concurrency-safe (no race conditions). Much faster, too.
Then the function can be dead simple:
CREATE OR REPLACE FUNCTION public.my_func(a jsonb, OUT inserted integer) AS
$func$
BEGIN
INSERT INTO some_table (x, y, z)
SELECT f.x, f.y, f.z
FROM jsonb_populate_recordset(null::my_type, a) f;
GET DIAGNOSTICS inserted = ROW_COUNT; -- OUT param, we're done here
END
$func$ LANGUAGE plpgsql;
If any (x,y) is already present in some_table you get your exception. Choose an instructive name for the constraint, which is reported in the error message.
And we can just read the command tag with GET DIAGNOSTICS, which is substantially cheaper than running another count query.
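If the original error text should be kept instead of the default unique-violation message, one possible variation (a sketch, not part of the answer above) is to trap the error and re-raise it. Note that a block with an EXCEPTION clause is more expensive to enter and exit than a plain block:

```sql
CREATE OR REPLACE FUNCTION public.my_func(a jsonb, OUT inserted integer) AS
$func$
BEGIN
   INSERT INTO some_table (x, y, z)
   SELECT f.x, f.y, f.z
   FROM   jsonb_populate_recordset(null::my_type, a) f;

   GET DIAGNOSTICS inserted = ROW_COUNT;
EXCEPTION
   WHEN unique_violation THEN   -- raised by the UNIQUE constraint on (x,y)
      RAISE EXCEPTION 'concurrency violation... already present.';
END
$func$ LANGUAGE plpgsql;
```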
Related:
How does PostgreSQL enforce the UNIQUE constraint / what type of index does it use?
UNIQUE constraint not possible?
For the unlikely case that a UNIQUE constraint should not be feasible, you can still have it rather simple:
CREATE OR REPLACE FUNCTION public.my_func(a jsonb, OUT inserted integer) AS
$func$
BEGIN
INSERT INTO some_table (x, y, z)
SELECT f.x, f.y, f.z -- empty result set if there are any violations
FROM (
SELECT f.x, f.y, f.z, count(t.x) OVER () AS conflicts
FROM jsonb_populate_recordset(null::my_type, a) f
LEFT JOIN some_table t USING (x,y)
) f
WHERE f.conflicts = 0;
GET DIAGNOSTICS inserted = ROW_COUNT;
IF inserted = 0 THEN
RAISE EXCEPTION 'concurrency violation... already present.';
END IF;
END
$func$ LANGUAGE plpgsql;
Count the number of violations in the same query. (count() only counts non-null values). Related:
Best way to get result count before LIMIT was applied
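The remark that count() only counts non-null values can be checked with a small standalone query. The inline VALUES list is made up for the illustration:

```sql
SELECT count(*) AS all_rows,       -- counts every row: 3
       count(x) AS non_null_rows   -- skips the NULL: 2
FROM  (VALUES (1), (NULL), (3)) AS t(x);
```

In the function above, t.x is NULL for every input row that found no partner in the LEFT JOIN, so count(t.x) OVER () is the number of conflicting rows.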
You should have at least a simple index on some_table (x,y) anyway.
It's important to know that plpgsql does not return results before control exits the function. The exception cancels the return, the user never gets results, only the error message. We added a code example to the manual.
Note, however, that there are race conditions here under concurrent write load. Related:
Is SELECT or INSERT in a function prone to race conditions?
Would the query planner avoid repeated evaluation?
Certainly not between multiple SQL statements.
Even if the function itself is defined STABLE or IMMUTABLE (jsonb_populate_recordset() in the example is STABLE), the query planner does not know that values of input parameters are unchanged between calls. It would be expensive to keep track and make sure of it.
Actually, since plpgsql treats SQL statements like prepared statements, that's plain impossible, since the query is planned before parameter values are fed to the planned query.
I have custom aggregate sum function which accepts boolean data type:
create or replace function badd (bigint, boolean)
returns bigint as
$body$
select $1 + case when $2 then 1 else 0 end;
$body$ language sql;
create aggregate sum(boolean) (
sfunc=badd,
stype=int8,
initcond='0'
);
This aggregate should calculate number of rows with TRUE. For example the following should return 2 (and it does):
with t (x) as
(values
(true::boolean),
(false::boolean),
(true::boolean),
(null::boolean)
)
select sum(x) from t;
However, its performance is quite bad: it is 5.5 times slower than casting to integer:
with t as (select (gs > 0.5) as test_vector from generate_series(1,1000000,1) gs)
select sum(test_vector) from t; -- 52012ms
with t as (select (gs > 0.5) as test_vector from generate_series(1,1000000,1) gs)
select sum(test_vector::int) from t; -- 9484ms
Is writing a new C function the only way to improve this aggregate - e.g. some alternative to the int2_sum function in src/backend/utils/adt/numeric.c?
Your test case is misleading, you only count TRUE. You should have both TRUE and FALSE - or even NULL, if applicable.
Like @foibs already explained, one wouldn't use a custom aggregate function for this. The built-in C-functions are much faster and do the job. Use instead (also demonstrating a simpler and more sensible test):
SELECT count(NULLIF(g%2 = 1, FALSE)) AS ct
FROM generate_series(1,100000,1) g;
How does this work?
Compute percents from SUM() in the same SELECT sql query
Several fast & simple ways (plus a benchmark) under this related answer on dba.SE:
For absolute performance, is SUM faster or COUNT?
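As a short illustration of why the NULLIF trick counts only TRUE (the three-row VALUES list is made up for the example): NULLIF(b, FALSE) turns FALSE into NULL, and count() ignores NULL.

```sql
SELECT count(NULLIF(b, FALSE)) AS ct_nullif,   -- 1: only the TRUE row is counted
       count(*) FILTER (WHERE b) AS ct_filter  -- 1: same result, needs PostgreSQL 9.4 or later
FROM  (VALUES (TRUE), (FALSE), (NULL::boolean)) AS t(b);
```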
Or faster yet, test for TRUE in the WHERE clause, where possible:
SELECT count(*) AS ct
FROM generate_series(1,100000,1) g
WHERE g%2 = 1; -- excludes FALSE and NULL !
If you'd have to write a custom aggregate for some reason, this form would be superior:
CREATE OR REPLACE FUNCTION test_sum_int8 (int8, boolean)
RETURNS bigint as
'SELECT CASE WHEN $2 THEN $1 + 1 ELSE $1 END' LANGUAGE sql;
The addition is only executed when necessary. Your original would add 0 for the FALSE case.
Better yet, use a plpgsql function. It saves a bit of overhead per call, since it works like a prepared statement (the query is not re-planned). Makes a difference for a tiny aggregate function that is called many times:
CREATE OR REPLACE FUNCTION test_sum_plpgsql (int8, boolean)
RETURNS bigint AS
$func$
BEGIN
RETURN CASE WHEN $2 THEN $1 + 1 ELSE $1 END;
END
$func$ LANGUAGE plpgsql;
CREATE AGGREGATE test_sum_plpgsql(boolean) (
sfunc = test_sum_plpgsql
,stype = int8
,initcond = '0'
);
Faster than what you had, but much slower than the presented alternative with a standard count(). And slower than any other C-function, too.
I created custom C function and aggregate for boolean:
C function:
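#include "postgres.h"
#include "fmgr.h"
#ifdef PG_MODULE_MAGIC
PG_MODULE_MAGIC;
#endif
/* Version-1 calling convention (version 0 was removed in PostgreSQL 10). */
PG_FUNCTION_INFO_V1(bool_sum);
Datum
bool_sum(PG_FUNCTION_ARGS)
{
/* Declared STRICT in SQL, so neither argument is NULL here. */
int64 arg = PG_GETARG_INT64(0); /* running total (bigint) */
bool tmp = PG_GETARG_BOOL(1); /* current input value */
if (tmp)
{
arg++;
}
PG_RETURN_INT64(arg);
}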
Transition and aggregate functions:
-- transition function
create or replace function bool_sum(bigint, boolean)
returns bigint
AS '/usr/lib/postgresql/9.1/lib/bool_agg', 'bool_sum'
language C strict
cost 1;
alter function bool_sum(bigint, boolean) owner to postgres;
-- aggregate
create aggregate sum(boolean) (
sfunc=bool_sum,
stype=int8,
initcond='0'
);
alter aggregate sum(boolean) owner to postgres;
Performance test:
-- Performance test - 10m rows
create table tmp_test as (select (case when random() <.3 then null when random() < .6 then true else false end) as test_vector from generate_series(1,10000000,1) gs);
-- Casting to integer
select sum(test_vector::int) from tmp_test;
-- Boolean sum
select sum(test_vector) from tmp_test;
Now sum(boolean) is as fast as sum(boolean::int).
Update:
It turns out that I can call existing C transition functions directly, even with boolean data type. It gets somehow magically converted to 0/1 on the way. So my current solution for boolean sum and average is:
create or replace function bool_sum(bigint, boolean)
returns bigint as
'int2_sum'
language internal immutable
cost 1;
create aggregate sum(boolean) (
sfunc=bool_sum,
stype=int8
);
-- Average for boolean values (percentage of rows with TRUE)
create or replace function bool_avg_accum(bigint[], boolean)
returns bigint[] as
'int2_avg_accum'
language internal immutable strict
cost 1;
create aggregate avg(boolean) (
sfunc=bool_avg_accum,
stype=int8[],
finalfunc=int8_avg,
initcond='{0,0}'
);
I don't see the real issue here. First of all, using sum as the name of your custom aggregate is wrong. When you call sum with your test_vector cast to int, the built-in Postgres sum is used and not yours, which is why it is so much faster. A C function will always be faster, but I'm not sure you need one in this case.
You could easily drop the badd function and your custom sum, and use the built-in sum with a WHERE clause instead:
with t as (select 1 as test_vector from generate_series(1,1000000,1) gs where gs > 0.5)
select sum(test_vector) from t;
EDIT:
To sum it up, the best way to optimize your custom aggregate is to remove it if it is not needed. The second best way would be to write a C function.
I have a postgresql function / stored proc that does the following:
1. calls another function and saves the value into a variable.
2. executes another sql statement using the value I got from step one as an argument.
My problem is that the query is not returning any data. No errors are returned either.
I'm just new to postgresql so I don't know the best way to debug... but I added a RAISE NOTICE command right after step 1, like so:
SELECT INTO active_id get_widget_id(widget_desc);
RAISE NOTICE 'Active ID is:(%)', active_id;
In the "Messages" section of the pgadmin3 screen, I see the debug message with the data:
NOTICE: Active ID is:(2)
I'm wondering whether or not the brackets are causing the problem for me.
Here's the sql I'm trying to run in step 2:
SELECT d.id, d.contact_id, d.priority, cp.contact
FROM widget_details d, contact_profile cp, contact_type ct
WHERE d.rule_id=active_id
AND d.active_yn = 't'
AND cp.id=d.contact_id
AND cp.contact_type_id=ct.id
AND ct.name = 'email'
Order by d.priority ASC
You'll notice that in my where clause I am referencing the variable "active_id".
I know that this query should return at least one row because when i run a straight sql select (vs using this function) and substitute the value 2 for the variable "active_id", I get back the data I'm looking for.
Any suggestions would be appreciated.
Thanks.
EDIT 1:
Here's the full function definition:
CREATE TYPE custom_return_type AS (
widgetnum integer,
contactid integer,
priority integer,
contactdetails character varying
);
CREATE OR REPLACE FUNCTION test(widget_desc integer)
RETURNS SETOF custom_return_type AS
$BODY$
DECLARE
active_id integer;
rec custom_return_type ;
BEGIN
SELECT INTO active_id get_widget_id(widget_desc);
RAISE NOTICE 'Active ID is:(%)', active_id;
FOR rec IN
SELECT d.id, d.contact_id, d.priority, cp.contact
FROM widget_details d, contact_profile cp, contact_type ct
WHERE d.rule_id=active_id
AND d.active_yn = 't'
AND cp.id=d.contact_id
AND cp.contact_type_id=ct.id
AND ct.name = 'email'
Order by d.priority ASC
LOOP
RETURN NEXT rec;
END LOOP;
END
$BODY$ LANGUAGE plpgsql;
That's several levels of too-complicated (edit: as it turns out that Erwin already explained to you last time you posted the same thing). Start by using RETURNS TABLE and RETURN QUERY:
CREATE OR REPLACE FUNCTION test(fmfm_number integer)
RETURNS TABLE (
widgetnum integer,
contactid integer,
priority integer,
contactdetails character varying
) AS
$BODY$
BEGIN
RETURN QUERY SELECT d.id, d.contact_id, d.priority, cp.contact
FROM widget_details d, contact_profile cp, contact_type ct
WHERE d.rule_id = get_widget_id(widget_desc)
AND d.active_yn = 't'
AND cp.id=d.contact_id
AND cp.contact_type_id=ct.id
AND ct.name = 'email'
Order by d.priority ASC;
END
$BODY$ LANGUAGE plpgsql;
at which point it's probably simple enough to be turned into a trivial SQL function or even a view. Hard to be sure, since the function doesn't make tons of sense as written:
You never use the parameter fmfm_number anywhere; and
widget_desc is never defined
so this function could never run. Clearly you haven't shown us the real source code, but some kind of "simplified" code that doesn't match the code you're really having issues with.
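For reference, a plain SQL function along those lines might look like the sketch below. It assumes that the parameter is the widget_desc integer from the question, that get_widget_id() accepts it, and that the selected columns have exactly the declared types. None of this is confirmed by the code shown:

```sql
CREATE OR REPLACE FUNCTION test(widget_desc integer)
RETURNS TABLE (
    widgetnum      integer,
    contactid      integer,
    priority       integer,
    contactdetails character varying
) AS
$BODY$
    SELECT d.id, d.contact_id, d.priority, cp.contact
    FROM   widget_details d
    JOIN   contact_profile cp ON cp.id = d.contact_id
    JOIN   contact_type    ct ON ct.id = cp.contact_type_id
    WHERE  d.rule_id   = get_widget_id($1)   -- $1 is widget_desc
    AND    d.active_yn = 't'
    AND    ct.name     = 'email'
    ORDER  BY d.priority ASC;
$BODY$ LANGUAGE sql;
```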
There is a difference between:
SELECT INTO ...
[http://www.postgresql.org/docs/current/interactive/sql-selectinto.html]
and
SELECT select_expressions INTO [STRICT] target FROM ...;
[http://www.postgresql.org/docs/current/interactive/plpgsql-statements.html#PLPGSQL-STATEMENTS-SQL-ONEROW]
I think you want:
SELECT get_widget_id(widget_desc) INTO active_id;
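A minimal way to try the PL/pgSQL form in isolation is an anonymous DO block (PostgreSQL 9.0 or later). The constant 2 is a stand-in for the call to get_widget_id(), which is not available here:

```sql
DO $$
DECLARE
    active_id integer;
BEGIN
    SELECT 2 INTO active_id;   -- stand-in for get_widget_id(widget_desc)
    -- The parentheses in the output come from the format string, not from the value.
    RAISE NOTICE 'Active ID is:(%)', active_id;
END
$$;
```

This also shows that the brackets in the NOTICE output are harmless: they are literal characters in the message template.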