I have written an AGGREGATE function that approximates a SELECT COUNT(DISTINCT ...) over a UUID column, a kind of poor man's HyperLogLog (with different performance characteristics).
However, it is very slow, because I am using set_bit on a BIT value and that has copy-on-write semantics.
So my question is:
Is there a way to update a BIT or bytea in place, i.e. mutably?
Failing that, are there any binary data structures that allow mutable/in-place set_bit edits?
A constraint is that I can't push C code or extensions to implement this, but I can use extensions that are available in AWS RDS PostgreSQL. If it's not faster than HLL then I'll just use HLL. Note that HLL is optimised for pre-aggregated counts; it isn't terribly fast at doing ad-hoc count estimates over millions of rows (although still faster than a raw COUNT DISTINCT).
Below is the code for context, probably buggy too:
CREATE OR REPLACE FUNCTION uuid_approx_count_distinct_sfunc (BIT(83886080), UUID)
RETURNS BIT(83886080) AS $$
DECLARE
    s BIT(83886080) := $1;
BEGIN
    -- initialise the 83,886,080-bit (10 MB) bitmap on the first row
    IF s IS NULL THEN
        s := '0' :: BIT(83886080);
    END IF;
    -- hash the UUID to a bit position and set that bit
    RETURN set_bit(s, abs(mod(uuid_hash($2), 83886080)), 1);
END
$$ LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION uuid_approx_count_distinct_ffunc (BIT(83886080))
RETURNS INTEGER AS $$
DECLARE
    i INTEGER := 83886079;
    s INTEGER := 0;
BEGIN
    -- count the set bits in the state bitmap
    LOOP
        EXIT WHEN i < 0;
        IF get_bit($1, i) = 1 THEN
            s := s + 1;
        END IF;
        i := i - 1;
    END LOOP;
    RETURN s;
END
$$ LANGUAGE plpgsql;
CREATE OR REPLACE AGGREGATE approx_count_distinct (UUID) (
    SFUNC = uuid_approx_count_distinct_sfunc,
    FINALFUNC = uuid_approx_count_distinct_ffunc,
    STYPE = BIT(83886080),
    COMBINEFUNC = bitor,
    PARALLEL = SAFE
);
Yeah, SQL isn't actually that fast for raw computation. I might try a UDF, perhaps PL/Java or plv8 (JavaScript), which are JIT-compiled to native code and available on most major hosting providers. Of course, for maximum performance at maximum pain, use C (perhaps via LLVM). plv8 should take minutes to prototype: just pass in an array constructed with array_agg(), as sketched below. Obviously keep the array size to millions of items, or find a way to roll up your sketches (bitwise OR?).
https://plv8.github.io/#function-calls
https://www.postgresqltutorial.com/postgresql-aggregate-functions/postgresql-array_agg-function/
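As a rough sketch, assuming the plv8 extension is available (the function name is made up here, and the 83886080-bit width is just carried over from the question):
CREATE OR REPLACE FUNCTION uuid_approx_count_distinct_js(hashes integer[]) RETURNS integer AS $$
  var nbits = 83886080;                     // same sketch width as the plpgsql version
  var bytes = new Uint8Array(nbits / 8);    // mutable, in-place bitmap
  for (var i = 0; i < hashes.length; i++) {
    var pos = Math.abs(hashes[i]) % nbits;
    bytes[pos >> 3] |= 1 << (pos & 7);      // set the bit for this hash
  }
  var count = 0;                            // popcount over the whole bitmap
  for (var j = 0; j < bytes.length; j++) {
    var b = bytes[j];
    while (b) { b &= b - 1; count++; }
  }
  return count;
$$ LANGUAGE plv8;
Called e.g. as SELECT uuid_approx_count_distinct_js(array_agg(uuid_hash(id))) FROM my_table; (my_table and id are placeholders).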
FYI, HyperLogLog is available as an open-source extension for PostgreSQL from Citus/Microsoft, and of course available on Azure. https://www.google.com/search?q=hyperloglog+postgres
(You could crib from their code and just change the core algorithm, then test side by side.) Citus is pretty easy to install, so this isn't a bad option.
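For comparison, the ad-hoc estimate with the hll extension looks roughly like this (a sketch; my_table and id are placeholders, and hll_hash_any is the extension's generic hash helper):
SELECT hll_cardinality(hll_add_agg(hll_hash_any(id))) AS approx_distinct
FROM my_table;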
As I am a newbie to PL/pgSQL, I am stuck while migrating an Oracle query to PostgreSQL.
Oracle query:
create or replace FUNCTION employee_all_case(
    p_ugr_id IN integer,
    p_case_type_id IN integer
)
RETURN number_tab_t PIPELINED
-- LANGUAGE 'plpgsql'
-- COST 100
-- VOLATILE
-- AS $$
-- DECLARE
is
    l_user_id NUMBER;
    l_account_id NUMBER;
BEGIN
    l_user_id := p_ugr_id;
    l_account_id := p_case_type_id;
    FOR cases IN
        (SELECT ccase.case_id, ccase.employee_id
         FROM ct_case ccase
         INNER JOIN ct_case_type ctype
             ON (ccase.case_type_id = ctype.case_type_id)
         WHERE ccase.employee_id = l_user_id)
    LOOP
        IF cases.employee_id IS NOT NULL THEN
            PIPE ROW (cases.case_id);
        END IF;
    END LOOP;
    RETURN;
END;
--$$
When I execute this function, I get the result with
select * from table(select employee_all_case(14533,1190) from dual)
My question here is: I do not really understand pipelined functions, so how can I obtain the same result in PostgreSQL as with the Oracle query?
Please help.
Thank you guys, your solution was very helpful.
I found the desired result:
-- select * from employee_all_case(14533,1190);
-- drop function employee_all_case
create or replace FUNCTION employee_all_case(p_ugr_id IN integer, p_case_type_id IN integer)
returns table (case_id double precision)
-- PIPELINED
LANGUAGE 'plpgsql'
COST 100
VOLATILE
AS $$
DECLARE
    -- is
    l_user_id integer;
    l_account_id integer;
BEGIN
    l_user_id := cp_lookup$get_user_id_from_ugr_id(p_ugr_id);
    l_account_id := cp_lookup$acctid_from_ugr(p_ugr_id);
    RETURN QUERY SELECT ccase.case_id
        FROM ct_case ccase
        INNER JOIN ct_case_type ctype ON ccase.case_type_id = ctype.case_type_id
        WHERE ccase.employee_id = p_ugr_id
          AND ccase.employee_id IS NOT NULL;
    -- return NEXT;
END;
$$
You would rewrite that to a set-returning function:
Change the return type to
RETURNS SETOF integer
and do away with the PIPELINED.
Change the PIPE ROW statement to
RETURN NEXT cases.case_id;
Of course, you will have to make the obvious syntactic changes, like using integer instead of NUMBER and putting the IN before the parameter name.
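A sketch of what that looks like, keeping the original loop structure (the element type may need to be double precision rather than integer, depending on how ct_case.case_id is declared):
CREATE OR REPLACE FUNCTION employee_all_case(p_ugr_id integer, p_case_type_id integer)
RETURNS SETOF integer AS $$
DECLARE
    cases record;
BEGIN
    FOR cases IN
        SELECT ccase.case_id, ccase.employee_id
        FROM ct_case ccase
        INNER JOIN ct_case_type ctype ON ccase.case_type_id = ctype.case_type_id
        WHERE ccase.employee_id = p_ugr_id
    LOOP
        IF cases.employee_id IS NOT NULL THEN
            RETURN NEXT cases.case_id;   -- replaces Oracle's PIPE ROW
        END IF;
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;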
But actually, it is quite unnecessary to write a function for that. Doing it in a single SELECT statement would be both simpler and faster.
Pipelined functions are best translated to a simple SQL function returning a table.
Something like this:
create or replace function employee_all_case(p_ugr_id integer, p_case_type_id integer)
returns table (case_id integer)
as
$$
    SELECT ccase.case_id
    FROM ct_case ccase
    INNER JOIN ct_case_type ctype ON ccase.case_type_id = ctype.case_type_id
    WHERE ccase.employee_id = p_ugr_id
      AND ccase.employee_id IS NOT NULL;
$$
language sql;
Note that your sample code did not use the second parameter p_case_type_id.
Usage is also more straightforward:
select *
from employee_all_case(14533,1190);
Before diving into the solution, I will provide some background that will help you understand it better.
So basically, PIPELINED came into the picture to improve memory allocation at run time.
As you all know, collections occupy space whenever they are created, so the more you use, the more memory gets allocated.
Pipelining negates the need to build huge collections by piping rows out of the function,
saving memory and allowing subsequent processing to start before all the rows are generated.
Pipelined table functions include the PIPELINED clause and use the PIPE ROW call to push rows out of the function as soon as they are created, rather than building up a table collection.
How does using PIPELINED optimise memory usage?
Well, it's very simple: instead of storing the data in an array, you just process each row with PIPE ROW (desired type). This returns the current row and moves on to the next one.
Coming to the solution in PL/pgSQL, there are two options.
The first is simple, but not preferred when handling large data:
remove PIPELINED from the return declaration and return an array of the desired type, something like RETURNS typrec2[].
Wherever you were using PIPE ROW(), add that entry to the array and finally return that array.
The second is to create a temp table like
CREATE TEMPORARY TABLE temp_table (required fields) ON COMMIT DROP;
and insert data into it. Replace each PIPE ROW with an INSERT statement, and finish with a return statement like
return query select * from temp_table
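A minimal sketch of that temp-table variant (the table name, the bigint column, and the generate_series stand-in for the real query are all placeholders):
CREATE OR REPLACE FUNCTION pipelined_example() RETURNS SETOF bigint AS $$
BEGIN
    -- collect the rows that Oracle's PIPE ROW would have emitted
    CREATE TEMPORARY TABLE IF NOT EXISTS tmp_rows (val bigint) ON COMMIT DROP;
    TRUNCATE tmp_rows;   -- start clean if the function runs twice in one transaction
    -- each former PIPE ROW becomes an INSERT into the temp table
    INSERT INTO tmp_rows (val)
    SELECT g FROM generate_series(1, 10) AS g;
    RETURN QUERY SELECT val FROM tmp_rows;
END;
$$ LANGUAGE plpgsql;
Call it like any other set-returning function: SELECT * FROM pipelined_example();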
The best link for understanding PIPELINED in Oracle: https://oracle-base.com/articles/misc/pipelined-table-functions
A fairly basic reference for the PostgreSQL side: http://manojadinesh.blogspot.com/2011/11/pipelined-in-oracle-as-well-in.html
Hope this helps someone conceptually.
I have to count the number of bits in PostgreSQL on very large integer columns, for which I wrote a PostgreSQL function to count the number of bits in an integer.
CREATE OR REPLACE FUNCTION bitcount(i integer) RETURNS integer AS $$
DECLARE n integer;
DECLARE bitCount integer;
BEGIN
    bitCount := 0;
    -- clear the lowest set bit on each iteration (Kernighan's method)
    LOOP
        IF i = 0 THEN
            EXIT;
        END IF;
        i := i & (i-1);
        bitCount := bitCount + 1;
    END LOOP;
    RETURN bitCount;
END
$$ LANGUAGE plpgsql;
but I found one more way to do this using PostgreSQL's built-in functions as well,
like
SELECT char_length( replace(100::bit(31)::TEXT, '0', ''));
so I decided to compare the performance of both approaches.
I used the queries below to compare them.
First:
SELECT a.n, bitcount(a.n)
from generate_series(1, 100000) as a(n);
Second:
SELECT a.n, char_length( replace(a.n::bit(31)::TEXT, '0', ''))
FROM generate_series(1, 100000) as a(n);
I was expecting the first method to outperform the second one,
but to my surprise both of them performed almost the same. In fact, on my machine the second one always completed a few seconds faster, even with a large number of integers.
Can anyone explain why the second is almost as fast as the first, despite using a cast and a string operation?
I'd say it is because PL/pgSQL is known to be slow.
Write the function in PL/Perl or PL/Python for better performance.
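For example, a rough PL/Python equivalent of bitcount (a sketch that assumes the plpython3u extension is installed, which needs superuser rights, and that the input is non-negative):
CREATE OR REPLACE FUNCTION bitcount_py(i integer) RETURNS integer AS $$
    # count the set bits via Python's binary string representation
    return bin(i).count("1")
$$ LANGUAGE plpython3u;
-- usage: SELECT a.n, bitcount_py(a.n) FROM generate_series(1, 100000) AS a(n);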
I would like to use my Postgres server also to serve documents and images that I don't want to store in the database, for several reasons.
There is an extension for that purpose, https://github.com/darold/external_file, and I changed its code a bit to serve my needs without changing the core (see below). I am using 9.5 as I expect this version to be final before I finish development ;-)
I encounter the following problems:
Writing works quickly and seems to be reliable, but big files lead to out of memory (1 GB and above).
Reading often hangs for a very long time (select readEFile('aPath');) and is not reliable.
Both the WAL and the database quickly grow in size, although no database tables are involved.
My questions:
What is wrong with the following code? How can I exclude all those operations from the WAL? Has anyone already written something like this and would share their work?
CREATE OR REPLACE FUNCTION public.writeefile(
    buffer bytea,
    filename character varying)
  RETURNS void AS
$BODY$
DECLARE
    l_oid oid;
    lfd integer;
    lsize integer;
BEGIN
    l_oid := lo_create(0);
    lfd := lo_open(l_oid, 131072); -- 0x00020000 write mode
    lsize := lowrite(lfd, buffer);
    PERFORM lo_close(lfd);
    PERFORM lo_export(l_oid, filename);  -- write the large object out to the given file
    PERFORM lo_unlink(l_oid);            -- drop the temporary large object
END;
$BODY$
  LANGUAGE plpgsql VOLATILE
  COST 100;
ALTER FUNCTION public.writeefile(bytea, character varying)
  OWNER TO itcms;
CREATE OR REPLACE FUNCTION public.readefile(filename character varying)
  RETURNS bytea AS
$BODY$
DECLARE
    l_oid oid;
    r record;
    buffer bytea;
BEGIN
    buffer := '';
    SELECT lo_import(filename) INTO l_oid;  -- import the file into a temporary large object
    FOR r IN ( SELECT data
               FROM pg_largeobject
               WHERE loid = l_oid
               ORDER BY pageno ) LOOP
        buffer = buffer || r.data;           -- concatenate the object page by page
    END LOOP;
    PERFORM lo_unlink(l_oid);
    return buffer;
END;
$BODY$
  LANGUAGE plpgsql VOLATILE
  COST 100;
ALTER FUNCTION public.readefile(character varying)
  OWNER TO itcms;
To explain my need for the above: this will be part of a medical system that also serves and stores huge documents and images over insecure connections. Storing hundreds of GB in the database doesn't seem to be a good idea to me. Since the documents don't change and only new ones are added, backing up plain files is much easier. As the database already handles SSL connections, it would be great not to have to deploy an additional SFTP server just for serving those files!
Your concept is doomed to failure. You are using the database server as a cache for disk operations on large files. This is an obvious waste of time and resources, because for each request the server has to store the entire contents of the file only to remove it again a moment later.
In my opinion, using an FTP server would be a simpler, more natural and far more efficient solution.
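If the PL/pgSQL approach is kept anyway, one possible simplification of readefile is to let lo_get() (available since PostgreSQL 9.4) fetch the object in one call instead of concatenating pg_largeobject pages; a sketch (readefile2 is just an illustrative name, and it still returns the whole file as a single bytea, which is capped at 1 GB, so it will not fix the out-of-memory problem for very large files):
CREATE OR REPLACE FUNCTION public.readefile2(filename character varying)
  RETURNS bytea AS
$BODY$
DECLARE
    l_oid oid;
    buffer bytea;
BEGIN
    l_oid := lo_import(filename);   -- server-side import into a temporary large object
    buffer := lo_get(l_oid);        -- fetch the whole object in a single call
    PERFORM lo_unlink(l_oid);       -- clean up the temporary large object
    RETURN buffer;
END;
$BODY$
  LANGUAGE plpgsql VOLATILE;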
Assume I have two tables in my DB (PostgreSQL 9.x):
CREATE TABLE FOLDER (
    KEY BIGSERIAL PRIMARY KEY,
    PATH TEXT,
    NAME TEXT
);
CREATE TABLE FOLDERFILE (
    FILEID BIGINT,
    PATH TEXT,
    PATHKEY BIGINT
);
I automatically update FOLDERFILE.PATHKEY from FOLDER.KEY whenever I insert into or update FOLDERFILE:
CREATE OR REPLACE FUNCTION folderfile_fill_pathkey() RETURNS trigger AS $$
DECLARE
    pathkey bigint;
    changed boolean;
BEGIN
    IF tg_op = 'INSERT' THEN
        changed := TRUE;
    ELSIF old.FILEID != new.FILEID THEN
        changed := TRUE;
    END IF;
    IF changed THEN
        SELECT INTO pathkey key FROM FOLDER WHERE PATH = new.path;
        IF FOUND THEN
            new.pathkey = pathkey;
        ELSE
            new.pathkey = NULL;
        END IF;
    END IF;
    RETURN new;
END
$$ LANGUAGE plpgsql VOLATILE;
-- must be a BEFORE trigger so that the changes to NEW take effect
CREATE TRIGGER folderfile_fill_pathkey_trigger BEFORE INSERT OR UPDATE
ON FOLDERFILE FOR EACH ROW EXECUTE PROCEDURE folderfile_fill_pathkey();
So the question is about the volatility of the function folderfile_fill_pathkey(). The documentation says:
Any function with side-effects must be labeled VOLATILE
But as far as I understand, this function does not change any data in the tables it relies on, so I could mark it as IMMUTABLE. Is that correct?
Would there be any problem with an IMMUTABLE trigger function if I bulk-insert many rows into FOLDERFILE within the same transaction, like:
BEGIN;
INSERT INTO FOLDERFILE ( ... );
...
INSERT INTO FOLDERFILE ( ... );
COMMIT;
Firstly, as #pozs already pointed out, the function definition you have provided is most definitely STABLE rather than IMMUTABLE since it performs database look-ups. This means that the result is not simply derived from the input parameters (as IMMUTABLE would suggest), but also from the data stored in your FOLDER table (which is bound to change). As per the documentation:
STABLE indicates that the function cannot modify the database, and
that within a single table scan it will consistently return the same
result for the same argument values, but that its result could change
across SQL statements. This is the appropriate selection for functions
whose results depend on database lookups, parameter variables (such as
the current time zone), etc.
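For illustration, a plain (non-trigger) lookup helper over the same FOLDER table, a hypothetical function not taken from the question, is exactly the kind of function the STABLE label is meant for:
CREATE OR REPLACE FUNCTION folder_key_for_path(p_path text) RETURNS bigint AS $$
    -- result depends on table contents, so STABLE (not IMMUTABLE) is the accurate label
    SELECT key FROM folder WHERE path = p_path;
$$ LANGUAGE sql STABLE;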
Secondly, adding stability modifiers (IMMUTABLE/STABLE/VOLATILE) to your trigger functions serves an illustrative purpose at best, since AFAIK PostgreSQL doesn't actually perform any planning that would warrant their use. The following post from the pgsql-hackers mailing list seems to support my claim:
Volatility is a complete no-op for a trigger function anyway, as are
other planner parameters such as cost/rows, because there is no
planning involved in trigger calls.
To sum up: you're probably better off avoiding the stability keywords in your trigger(!) procedures for now, since including them seems to add little to no benefit but entails several unexpected caveats/pitfalls (see the end of #pozs's first comment).
I have created a "merge" function which is supposed to execute either an UPDATE or an INSERT query, depending on existing data. Instead of writing an upsert-wrapper for each table (as in most of the available examples), this function takes entire SQL strings. Both of the SQL strings are automatically generated by our application.
The plan is to call the function like this:
-- hypothetical "settings" table, with a primary key of (user_id, setting):
SELECT merge(
$$UPDATE settings SET value = 'x' WHERE user_id = 42 AND setting = 'foo'$$,
$$INSERT INTO settings (user_id, setting, value) VALUES (42, 'foo', 'x')$$
);
Here's the full code of the merge() function:
CREATE OR REPLACE FUNCTION merge (update_sql TEXT, insert_sql TEXT) RETURNS TEXT AS
$func$
DECLARE
    max_iterations INTEGER := 10;
    i INTEGER := 0;
    num_updated INTEGER;
BEGIN
    -- usually returns before re-entering the loop
    LOOP
        -- first try the update
        EXECUTE update_sql;
        GET DIAGNOSTICS num_updated = ROW_COUNT;
        IF num_updated > 0 THEN
            RETURN 'UPDATE';
        END IF;
        -- nothing was updated: try the insert, watching out for concurrent inserts
        BEGIN
            EXECUTE insert_sql;
            RETURN 'INSERT';
        EXCEPTION WHEN unique_violation THEN
            -- nop; just loop and try again from the top
        END;
        -- emergency brake
        i := i + 1;
        IF i >= max_iterations THEN
            RAISE EXCEPTION 'merge(): tried looping % times, giving up now.', i;
            EXIT;
        END IF;
    END LOOP;
END;
$func$
LANGUAGE plpgsql;
It appears to work well enough in my tests, but I'm not certain that I haven't missed anything crucial, especially regarding concurrent UPDATE/INSERT/DELETE queries, which may be issued without using this function. Did I overlook anything important?
Among the resources I consulted for this function are:
UPDATE/INSERT example 40.2 in the PostgreSQL manual
Why is UPSERT so complicated?
SO: Insert, on duplicate update (postgresql)
(Edit: one of the goals was to avoid locking the target table.)
The answer to your question depends on the context of how your application(s) will access the database. There are many ways to solve this, as nicely discussed in the depesz post you cited yourself. In addition, you might also want to consider using writable CTEs, see here. Also, the question "Insert, on duplicate update in PostgreSQL?" has some interesting discussion for your decision-making process.
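For reference, a writable-CTE version of the same upsert against the hypothetical settings table could look like the sketch below; note that, unlike the retry loop above, it does not protect against a concurrent insert of the same key slipping in between the UPDATE and the INSERT:
WITH upsert AS (
    UPDATE settings
    SET value = 'x'
    WHERE user_id = 42 AND setting = 'foo'
    RETURNING *
)
INSERT INTO settings (user_id, setting, value)
SELECT 42, 'foo', 'x'
WHERE NOT EXISTS (SELECT 1 FROM upsert);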