Postgres (Amazon RDS): how to calculate weighted average

I am using an Amazon RDS Postgres database (9.4.4), and I would like to calculate the weighted mean of some data.
I have found the following extension which looks perfect for the job.
https://github.com/Kozea/weighted_mean
However I am now unsure of how to install the extension, as my initial research shows that only 'supported' extensions are allowed.
http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_PostgreSQL.html#PostgreSQL.Concepts.General.FeatureSupport
What options do I have for using this extension? I don't want to re-invent the wheel, and I am not familiar with installing any kind of function/extension within Postgres.
Thanks

Why don't you just issue a simple query like this:
select
    case when sum(quantity) = 0 then
        0
    else
        sum(unitprice * quantity) / sum(quantity)
    end
from sales;
Of course, you need to substitute your own column names (unitprice, quantity).
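For instance, with a couple of rows inlined via a CTE (the numbers are made up, just to show the arithmetic):

with sales (unitprice, quantity) as (
    values (10.0, 2),
           (40.0, 1)
)
select
    case when sum(quantity) = 0 then 0
         else sum(unitprice * quantity) / sum(quantity)
    end as weighted_avg  -- (10*2 + 40*1) / (2+1) = 20
from sales;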

The SUM(value*weight)/SUM(weight) answer is fine, but if you need to reuse it constantly it is more convenient to create an aggregate function.
CREATE FUNCTION avg_weighted_step(state numeric[], value numeric, weight numeric)
RETURNS numeric[]
LANGUAGE plpgsql
AS $$
BEGIN
    RETURN array[state[1] + value*weight, state[2] + weight];
END;
$$ IMMUTABLE;

CREATE FUNCTION avg_weighted_finalizer(state numeric[])
RETURNS numeric
LANGUAGE plpgsql
AS $$
BEGIN
    IF state[2] = 0 THEN
        RETURN null;
    END IF;
    RETURN state[1]/state[2];
END;
$$ IMMUTABLE;

CREATE AGGREGATE avg_weighted(value numeric, weight numeric) (
    sfunc     = avg_weighted_step,
    stype     = numeric[],
    finalfunc = avg_weighted_finalizer,
    initcond  = '{0,0}'
);
Example Usage:

$ table tmp;
  v | w
----+---
  1 | 2
 10 | 1
(2 rows)

$ select avg_weighted(v, w) from tmp;
    avg_weighted
--------------------
 4.0000000000000000
(1 row)

$ select avg_weighted(v, w) from tmp where v is null;
 avg_weighted
--------------
 ∅
(1 row)
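One caveat worth noting: since avg_weighted_step is not declared STRICT, a single NULL value or weight nulls out the running state for good. A minimal workaround (assuming Postgres 9.4, which the question uses, so the FILTER clause is available) is to exclude NULLs at call time:

select avg_weighted(v, w) filter (where v is not null and w is not null)
from tmp;

Alternatively, declaring the step function STRICT makes the aggregate skip rows with NULL inputs automatically.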

In order to register a new custom extension you need to add the appropriate scripts to the contrib directory of the Postgres installation. AWS does NOT allow you such granular control.
Long story short, there is no way to add any custom extension (besides the ones specified in the link you provided, or listed in the pg_available_extensions view) to the Amazon RDS service.
This is one of the drawbacks of using DBaaS (database as a service) solutions.
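You can at least check what is installable on your particular instance straight from SQL; pg_available_extensions is a standard catalog view, so it works on RDS too:

SELECT name, default_version, comment
FROM pg_available_extensions
ORDER BY name;

-- anything listed there can be enabled with, e.g.:
CREATE EXTENSION IF NOT EXISTS hstore;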

Related

How to count how many entries in a column are numeric using PostgreSQL

I am trying to count how many entries are in a column that are both numeric and fulfill other conditions. I understand how that script is meant to look in SQL:
SELECT COUNT(ingredients)
FROM data.pie
WHERE description LIKE 'cherry'
AND is.numeric(price) = true
But I'm not sure how to translate that into a PostgreSQL script. Any help would be appreciated.
Thank you.
Another alternative to the one shown by Tim is to create a function ...
CREATE OR REPLACE FUNCTION is_numeric(val VARCHAR) RETURNS BOOLEAN AS $$
DECLARE
    x NUMERIC;
BEGIN
    x = val::NUMERIC;
    RETURN TRUE;
EXCEPTION WHEN OTHERS THEN
    RETURN FALSE;
END;
$$
STRICT
LANGUAGE plpgsql IMMUTABLE;
... that can be used like this:
db=# SELECT is_numeric('foo'), is_numeric('1'), is_numeric('1.39');
 is_numeric | is_numeric | is_numeric
------------+------------+------------
 f          | t          | t
(1 row)
Your current query, slightly modified, should work:
SELECT COUNT(ingredients)
FROM data.pie
WHERE description LIKE 'cherry'
  AND price ~ '^[0-9]+([.][0-9]+)?$';
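Note that the pattern is deliberately narrow. A quick check of what it accepts and rejects (the sample strings are made up):

SELECT s, s ~ '^[0-9]+([.][0-9]+)?$' AS looks_numeric
FROM (VALUES ('1.39'), ('cherry'), ('-3'), ('1e6')) AS v(s);
-- '1.39' -> t; 'cherry' -> f; '-3' -> f (no sign); '1e6' -> f (no exponent)

If you need signs or scientific notation, the is_numeric() function above is the more robust choice, since it relies on the actual numeric cast.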

PostgreSQL modifying fields dynamically in NEW record in a trigger function

I have a user table with IDs and usernames (and other details) and several other tables referring to this table with various column names (CONSTRAINT some_name FOREIGN KEY (columnname) REFERENCES "user" (userid)). What I need to do is add the usernames to the referring tables (in preparation for dropping the whole user table). This is of course easily accomplished with a single ALTER TABLE and UPDATE, and keeping these up-to-date with triggers is also (fairly) easy. But it's the trigger function that is causing me some annoyance. I could have used individual functions for each table, but this seemed redundant, so I created one common function for this purpose:
CREATE OR REPLACE FUNCTION public.add_username() RETURNS trigger AS
$BODY$
DECLARE
    sourcefield text;
    targetfield text;
    username text;
    existing text;
BEGIN
    IF (TG_NARGS != 2) THEN
        RAISE EXCEPTION 'Need source field and target field parameters';
    END IF;
    sourcefield = TG_ARGV[0];
    targetfield = TG_ARGV[1];
    EXECUTE 'SELECT username FROM "user" WHERE userid = ($1).' || sourcefield INTO username USING NEW;
    EXECUTE format('SELECT ($1).%I', targetfield) INTO existing USING NEW;
    IF ((TG_OP = 'INSERT' AND existing IS NULL) OR (TG_OP = 'UPDATE' AND (existing IS NULL OR username != existing))) THEN
        CASE targetfield
            WHEN 'username' THEN
                NEW.username := username;
            WHEN 'modifiername' THEN
                NEW.modifiername := username;
            WHEN 'creatorname' THEN
                NEW.creatorname := username;
            .....
        END CASE;
    END IF;
    RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql VOLATILE;
And using the trigger function:
CREATE TRIGGER some_trigger_name BEFORE UPDATE OR INSERT ON my_schema.my_table FOR EACH ROW EXECUTE PROCEDURE public.add_username('userid', 'username');
The way this works is the trigger function receives the original source field name (for example userid) and the target field name (username) via TG_ARGV. These are then used to fill in the (possibly) missing information. All this works nice enough, but how can I get rid of that CASE-mess? Is there a way to dynamically modify the values in the NEW record when I don't know the name of the field in advance (or rather it can be a lot of things)? It is in the targetfield parameter, but obviously NEW.targetfield does not work, nor something like NEW[targetfield] (like Javascript for example).
Any ideas how this could be accomplished? Besides using, for instance, PL/Python...
There are no simple plpgsql-based solutions. Some possible alternatives:
Using the hstore extension.
CREATE TYPE footype AS (a int, b int, c int);

postgres=# select row(10,20,30);
    row
------------
 (10,20,30)
(1 row)

postgres=# select row(10,20,30)::footype #= 'b=>100';
  ?column?
-------------
 (10,100,30)
(1 row)
An hstore-based function can be very simple:
create or replace function update_fields(r anyelement, variadic changes text[])
returns anyelement as $$
    select $1 #= hstore($2);
$$ language sql;
postgres=# select * from update_fields(row(10,20,30)::footype, 'b', '1000', 'c', '800');
 a  |  b   |  c
----+------+-----
 10 | 1000 | 800
(1 row)
Some years ago I wrote an extension, PL Toolbox. It has a function record_set_fields:
pavel=# select * from pst.record_expand(pst.record_set_fields(row(10,20),'f1',33));
 name | value |   typ
------+-------+---------
 f1   | 33    | integer
 f2   | 20    | integer
(2 rows)
Probably you can find some plpgsql-only solutions based on tricks with system tables and arrays like this, but I cannot recommend it. It is much less readable, and for non-advanced users just black magic. hstore is simple and available almost everywhere, so it should be the preferred way.
On PostgreSQL 9.4 (maybe 9.3) you can try some black magic with JSON manipulation:
postgres=# select json_populate_record(NULL::footype, jo)
             from (select json_object(array_agg(key),
                                      array_agg(case key when 'b'
                                                         then 1000::text
                                                         else value
                                                end)) jo
                     from json_each_text(row_to_json(row(10,20,30)::footype))) x;
 json_populate_record
----------------------
 (10,1000,30)
(1 row)
So I am able to write a function:
CREATE OR REPLACE FUNCTION public.update_field(r anyelement,
                                               fn text, val text,
                                               OUT result anyelement)
RETURNS anyelement
LANGUAGE plpgsql
AS $function$
declare jo json;
begin
    -- swap in the new value for the field named in "fn"
    jo := (select json_object(array_agg(key),
                              array_agg(case key when fn then val
                                                 else value end))
             from json_each_text(row_to_json(r)));
    result := json_populate_record(r, jo);
end;
$function$;
postgres=# select * from update_field(row(10,20,30)::footype, 'b', '1000');
 a  |  b   | c
----+------+----
 10 | 1000 | 30
(1 row)
A JSON-based function will not be terribly fast; hstore should be faster.
UPDATE/caution: Erwin points out that this is currently undocumented, and the docs indicate it should not be possible to alter records this way.
Use Pavel's solution or hstore.
The JSON-based solution is almost as fast as hstore when simplified. json_populate_record() modifies existing records for us, so we only have to create a json object from the keys we want to change.
See my similar answer, where you'll find benchmarks that compare the solutions.
The simplest solution requires Postgres 9.4:
SELECT json_populate_record(
    record,
    json_build_object('key', 'new-value')
);
But if you only have Postgres 9.3, you can use casting instead of json_object:
SELECT json_populate_record(
    record,
    ('{"' || 'key' || '":"' || 'new-value' || '"}')::json
);
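Tying this back to the original trigger question: with the hstore operator from Pavel's answer, the whole CASE block can collapse into a single assignment. A minimal sketch (it assumes the hstore extension is installed, and reuses the targetfield and username variables from the question's function):

-- inside public.add_username(), replacing the CASE ... END CASE block:
IF ((TG_OP = 'INSERT' AND existing IS NULL) OR
    (TG_OP = 'UPDATE' AND (existing IS NULL OR username != existing))) THEN
    NEW := NEW #= hstore(targetfield, username);
END IF;
RETURN NEW;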

Calculate daily sums in PostgreSQL

I am fairly new to Postgres, and what I am trying to do is calculate sum values for each day of every month (i.e. daily sums). Based on scattered information I came up with something like this:
CREATE OR REPLACE FUNCTION sumvalues() RETURNS double precision AS
$BODY$
BEGIN
    FOR i IN 0..31 LOOP
        SELECT SUM("Energy")
        FROM "public"."EnergyWh" e
        WHERE e."DateTime" = day('01-01-2005 00:00:00'+ INTERVAL 'i' DAY);
    END LOOP;
END
$BODY$
LANGUAGE plpgsql VOLATILE NOT LEAKPROOF;
ALTER FUNCTION public.sumvalues()
    OWNER TO postgres;
The query returned successfully, so I thought I had made it. However, when I try to insert the values of the function into a table (which may be wrong):
INSERT INTO "SumValues"
("EnergyDC")
(
SELECT sumvalues()
);
I get this:
ERROR: invalid input syntax for type interval: "01-01-2005 00:00:00"
LINE 3: WHERE e."DateTime" = day('01-01-2005 00:00:00'+ INTERVAL...
I tried to debug it myself, but I am not yet sure which of the two I am doing wrong (or both), and why.
Here is an example of EnergyWh
(I am using systemid and datetime as a composite PK, but that should not matter)
See the GROUP BY clause: http://www.postgresql.org/docs/9.2/static/tutorial-agg.html
SELECT EXTRACT(day FROM e."DateTime"), EXTRACT(month FROM e."DateTime"),
       EXTRACT(year FROM e."DateTime"), sum("Energy")
FROM "public"."EnergyWh" e
GROUP BY 1, 2, 3;
But the following query should work too:
SELECT e."DateTime"::date, sum("Energy")
FROM "public"."EnergyWh" e
GROUP BY 1
I am using the short syntax for GROUP BY: GROUP BY 1 means group by the first column.
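If the goal is still to load the daily sums into the SumValues table from the question, no loop or function is needed; a minimal sketch (column names taken from the question, the target column list is an assumption):

INSERT INTO "SumValues" ("EnergyDC")
SELECT sum("Energy")
FROM "public"."EnergyWh" e
GROUP BY e."DateTime"::date;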
Here is a simple example that can help you:
Table:
create table demo (value double precision);
Function
CREATE OR REPLACE FUNCTION sumvalues() RETURNS void AS
$BODY$
DECLARE
    inte text;
BEGIN
    FOR i IN 0..31 LOOP
        inte := 'INSERT INTO demo SELECT EXTRACT (DAY FROM TIMESTAMP ''01-01-2005 00:00:00''+ INTERVAL '''||i||' Days'')';
        EXECUTE inte;
    END LOOP;
END
$BODY$
LANGUAGE plpgsql VOLATILE NOT LEAKPROOF;
ALTER FUNCTION public.sumvalues()
    OWNER TO postgres;
Function Call
SELECT sumvalues();
Output
SELECT * FROM demo;
If you want to use a variable value inside an SQL query, you must use a dynamic query for that.
Reference: Dynamic query in pgsql
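A side note on building such strings: format() (available since 9.1) with %L and %s placeholders is usually safer and more readable than manual quoting and concatenation. A minimal sketch of the same INSERT:

EXECUTE format(
    'INSERT INTO demo SELECT EXTRACT(DAY FROM TIMESTAMP %L + %s * INTERVAL ''1 day'')',
    '2005-01-01 00:00:00', i);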

Birt Report - Not getting data from Postgres RefCursor

I have this:
POSTGRES
/* THE PARAMETER in_test_id IS ONLY A TEST!! */
CREATE OR REPLACE FUNCTION public.test_birt(in_test_id bigint DEFAULT NULL::bigint)
RETURNS refcursor AS
$BODY$
DECLARE
    query text;
    tcursor refcursor = 'tcursor';
BEGIN
    query := 'SELECT * FROM MY_TABLE';
    OPEN tcursor FOR EXECUTE query;
    RETURN tcursor;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
BIRT
DATASET --> MyDataset --> select * from test_birt(?::bigint)
Here are the screenshots:
Report Design
Report Preview
I need BIRT to show the values of MY_TABLE. In this case, the table has one varchar field, with the values TEST1, TEST2, TEST3.
The BIRT version is 3.2 and Postgres is 9.2.
NOTE: The only solution I found was to create a data type and change the return type of my function, something like this:
RETURNS SETOF my_type AS
But I need BIRT to be able to read this RefCursor.
You are missing a FETCH statement.
When you call the function, the cursor "tcursor" is created (and opened), but nobody tries to read from it. Without explicit support in BIRT, it is impossible to call the function and fetch the data from the cursor. You can try a hack that may or may not work (it depends on the implementation in BIRT): use the following commands as the source for the dataset. Note that both statements have to run in the same transaction, because the refcursor only lives until the transaction ends:
SELECT test_birt(?::bigint); FETCH ALL FROM tcursor;
I found a link showing that BIRT didn't support this 5 years ago.
On the other hand, in 9.2 you don't need to define your own types for returning tables. You can use table types (every table automatically has a row type, so you can return all of its columns), or you can define the output columns with the TABLE keyword:
CREATE TABLE foo(a int, b int); -- automatically defines a composite type "foo"

CREATE OR REPLACE FUNCTION read_from_foo_1(_a int)
RETURNS SETOF foo AS $$
    SELECT * FROM foo WHERE foo.a = _a;
$$ LANGUAGE SQL;

or

CREATE OR REPLACE FUNCTION read_from_foo_2()
RETURNS TABLE(a int, b int, c int) AS $$
    SELECT a, b, a + b FROM foo;
$$ LANGUAGE SQL;
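Either variant can then be consumed as a plain query, which BIRT handles without any cursor support:

SELECT * FROM read_from_foo_1(1);
SELECT * FROM read_from_foo_2();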

Improve performance of custom aggregate function in PostgreSQL

I have a custom aggregate sum function which accepts the boolean data type:
create or replace function badd (bigint, boolean)
returns bigint as
$body$
    select $1 + case when $2 then 1 else 0 end;
$body$ language sql;

create aggregate sum(boolean) (
    sfunc=badd,
    stype=int8,
    initcond='0'
);
This aggregate should calculate number of rows with TRUE. For example the following should return 2 (and it does):
with t (x) as (
    values (true::boolean),
           (false::boolean),
           (true::boolean),
           (null::boolean)
)
select sum(x) from t;
However, its performance is quite bad; it is 5.5 times slower than casting to integer:
with t as (select (gs > 0.5) as test_vector from generate_series(1,1000000,1) gs)
select sum(test_vector) from t;      -- 52012 ms

with t as (select (gs > 0.5) as test_vector from generate_series(1,1000000,1) gs)
select sum(test_vector::int) from t; -- 9484 ms
Is the only way to improve this aggregate to write a new C function - e.g. some alternative to the int2_sum function in src/backend/utils/adt/numeric.c?
Your test case is misleading: you only count TRUE. You should have both TRUE and FALSE - or even NULL, if applicable.
Like @foibs already explained, one wouldn't use a custom aggregate function for this. The built-in C functions are much faster and do the job. Use this instead (also demonstrating a simpler and more sensible test):
SELECT count(NULLIF(g%2 = 1, FALSE)) AS ct
FROM generate_series(1,100000,1) g;
How does this work? NULLIF(g%2 = 1, FALSE) turns FALSE into NULL (and passes NULL through), and count() ignores NULLs, so only TRUE is counted. Explained in more detail here:
Compute percents from SUM() in the same SELECT sql query
Several fast & simple ways (plus a benchmark) under this related answer on dba.SE:
For absolute performance, is SUM faster or COUNT?
Or faster yet, test for TRUE in the WHERE clause, where possible:
SELECT count(*) AS ct
FROM   generate_series(1,100000,1) g
WHERE  g%2 = 1;  -- excludes FALSE and NULL !
If you'd have to write a custom aggregate for some reason, this form would be superior:
CREATE OR REPLACE FUNCTION test_sum_int8 (int8, boolean)
RETURNS bigint as
'SELECT CASE WHEN $2 THEN $1 + 1 ELSE $1 END' LANGUAGE sql;
The addition is only executed when necessary. Your original would add 0 for the FALSE case.
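For completeness, the matching aggregate definition for this variant would look like this (a sketch mirroring the plpgsql version below):

CREATE AGGREGATE test_sum_int8(boolean) (
    sfunc = test_sum_int8,
    stype = int8,
    initcond = '0'
);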
Better yet, use a plpgsql function. It saves a bit of overhead per call, since it works like a prepared statement (the query is not re-planned). Makes a difference for a tiny aggregate function that is called many times:
CREATE OR REPLACE FUNCTION test_sum_plpgsql (int8, boolean)
RETURNS bigint AS
$func$
BEGIN
    RETURN CASE WHEN $2 THEN $1 + 1 ELSE $1 END;
END
$func$ LANGUAGE plpgsql;

CREATE AGGREGATE test_sum_plpgsql(boolean) (
    sfunc = test_sum_plpgsql
   ,stype = int8
   ,initcond = '0'
);
Faster than what you had, but much slower than the presented alternative with a standard count(). And slower than any C function, too.
->SQLfiddle
I created a custom C function and aggregate for boolean:
C function:
#include "postgres.h"
#include <fmgr.h>

#ifdef PG_MODULE_MAGIC
PG_MODULE_MAGIC;
#endif

int
bool_sum(int arg, bool tmp)
{
    if (tmp)
    {
        arg++;
    }
    return arg;
}
Transition and aggregate functions:
-- transition function
create or replace function bool_sum(bigint, boolean)
returns bigint
AS '/usr/lib/postgresql/9.1/lib/bool_agg', 'bool_sum'
language C strict
cost 1;
alter function bool_sum(bigint, boolean) owner to postgres;

-- aggregate
create aggregate sum(boolean) (
    sfunc=bool_sum,
    stype=int8,
    initcond='0'
);
alter aggregate sum(boolean) owner to postgres;
Performance test:
-- Performance test - 10m rows
create table tmp_test as (
    select (case when random() < .3 then null
                 when random() < .6 then true
                 else false end) as test_vector
    from generate_series(1,10000000,1) gs
);

-- Casting to integer
select sum(test_vector::int) from tmp_test;

-- Boolean sum
select sum(test_vector) from tmp_test;
Now sum(boolean) is as fast as sum(boolean::int).
Update:
It turns out that I can call existing C transition functions directly, even with the boolean data type. It somehow gets magically converted to 0/1 on the way. So my current solution for boolean sum and average is:
create or replace function bool_sum(bigint, boolean)
returns bigint as
'int2_sum'
language internal immutable
cost 1;

create aggregate sum(boolean) (
    sfunc=bool_sum,
    stype=int8
);

-- Average for boolean values (percentage of rows with TRUE)
create or replace function bool_avg_accum(bigint[], boolean)
returns bigint[] as
'int2_avg_accum'
language internal immutable strict
cost 1;

create aggregate avg(boolean) (
    sfunc=bool_avg_accum,
    stype=int8[],
    finalfunc=int8_avg,
    initcond='{0,0}'
);
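Usage then mirrors the built-in aggregates. A quick sanity check on made-up data (assuming the boolean-to-integer conversion works as described above):

with t (b) as (values (true), (false), (true), (null::boolean))
select sum(b) as trues,      -- 2
       avg(b) as share_true  -- 0.666..., NULL rows are skipped
from t;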
I don't see the real issue here. First of all, using sum as your custom aggregate name is wrong. When you call sum with your test_vector cast to int, the built-in Postgres sum is used, not yours; that's why it is so much faster. A C function will always be faster, but I'm not sure you need one in this case.
You could easily drop the badd function and your custom sum, and use the built-in sum with a WHERE clause:
with t as (select 1 as test_vector from generate_series(1,1000000,1) gs where gs > 0.5)
select sum(test_vector) from t;
EDIT:
To sum it up, the best way to optimize your custom aggregate is to remove it if it is not needed. The second best way would be to write a C function.