I have to count the number of set bits in very large integer columns in PostgreSQL, so I wrote a PL/pgSQL function that counts the bits in an integer.
CREATE OR REPLACE FUNCTION bitcount(i integer) RETURNS integer AS $$
DECLARE
    bitCount integer := 0;
BEGIN
    LOOP
        IF i = 0 THEN
            EXIT;
        END IF;
        i := i & (i - 1);          -- clears the lowest set bit (Kernighan's trick)
        bitCount := bitCount + 1;
    END LOOP;
    RETURN bitCount;
END
$$ LANGUAGE plpgsql;
But I found another way to do this using Postgres's built-in functions, like:
SELECT char_length( replace(100::bit(31)::TEXT, '0', ''));
So I decided to compare the performance of both approaches using the queries below.
First
SELECT a.n, bitcount(a.n)
from generate_series(1, 100000) as a(n);
Second
SELECT a.n, char_length( replace(a.n::bit(31)::TEXT, '0', ''))
FROM generate_series(1, 100000) as a(n);
I was expecting the first method to outperform the second one, but to my surprise both performed almost the same. In fact, on my machine the second one always completed a few seconds faster, even with a large number of integers.
Can anyone explain why the second is almost as fast as the first, despite the cast and the string operation?
I'd say it is because PL/pgSQL is known to be slow.
Write the function in PL/Perl or PL/Python for better performance.
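If staying in SQL is preferable, a plain LANGUAGE sql wrapper around the same bit-string trick from the question avoids the PL/pgSQL interpreter entirely; a minimal sketch (the function name is made up):
CREATE OR REPLACE FUNCTION bitcount_sql(i integer) RETURNS integer AS $$
    -- count set bits by casting to a bit string and stripping the zeros
    SELECT char_length(replace(i::bit(32)::text, '0', ''));
$$ LANGUAGE sql IMMUTABLE;
SELECT bitcount_sql(100);  -- 3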
I have a Postgres function that needs to iterate over an array of table names and save the value returned from the query on each table into an array.
Maybe this is not the correct way, so if there are better ways to do it I'll be glad to know :)
I've tried using the format function to generate a different query each time.
CREATE OR REPLACE FUNCTION array_iter(tables_name text[],idd integer)
RETURNS void
LANGUAGE 'plpgsql'
AS $BODY$
declare
current_table text;
current_height integer :=0;
quer text;
heights integer[];
begin
FOREACH current_table IN ARRAY $1
LOOP
quer:=format('SELECT height FROM %s WHERE %s.idd=$2', current_table);
current_height:=EXECUTE quer;
SELECT array_append(heights, current_height);
END LOOP;
RAISE NOTICE '%', heights;
end;
$BODY$;
First off, you desperately need to update your Postgres, as version 9.1 is no longer supported and has not been for 5 years (Oct 2016). I would suggest going to v13 as it is the latest, but at an absolute minimum to 10.12. That will still have slightly over a year (Nov 2022) before it loses support. So with that in mind:
The statement quer:=format('SELECT height FROM %s WHERE %s.idd=$2', current_table); is invalid: it contains 2 format specifiers but only 1 argument. You could reuse the single argument by including the argument position in each specifier, as in quer:=format('SELECT height FROM %1$s WHERE %1$s.idd=$2', current_table);. But that is not necessary here: the 2nd reference is a table alias, which need not be the table name, and since you only have 1 table it is not needed at all. I would, however, move the parameter ($2) out of the select and use a format specifier/argument for it.
The statement current_height:=EXECUTE quer; is likewise invalid: you cannot make the right-hand side of an assignment a query. For this you use the INTO option, which follows the statement: EXECUTE quer INTO ....
SELECT array_append(heights, current_height); will also fail in PL/pgSQL, since a SELECT needs an INTO destination there; a simple assignment heights := heights || current_height; is easier anyway (at least imho).
Finally, there are a couple of omissions. Prior to running a dynamic SQL statement, it is good practice to 'print' or log the statement before executing it, so you can see what actually ran when the statement has an error. And why build a function to do all this work just to throw away the results? So instead of void, return an integer array (integer[]).
So we arrive at:
create or replace function array_iter(tables_name text[],idd integer)
returns integer[]
language plpgsql
as $$
declare
current_table text;
current_height integer :=0;
quer text;
heights integer[];
begin
foreach current_table in array tables_name
loop
quer:=format('select height from %I where idd=%s', current_table,idd);
raise notice 'Running Query:: %',quer;
execute quer into current_height;
heights = heights || current_height;
end loop;
raise notice '%', heights;
return heights;
exception
when others then
raise notice 'Query failed:: SqlState:%, ErrorMessage:%', sqlstate,sqlerrm;
raise;
end;
$$;
This does run on versions as old as 9.5 (see fiddle), although I cannot say whether it runs on the even older 9.1.
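For reference, a call could then look something like this (the table names here are made up):
select array_iter(array['room_a','room_b'], 42);
The collected heights are raised as a notice and returned as an integer[].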
I have written an AGGREGATE function that approximates a SELECT COUNT(DISTINCT ...) over a UUID column, a kind of poor man's HyperLogLog (with different performance characteristics).
However, it is very slow because I am using set_bit on a BIT and that has copy-on-write semantics.
So my question is:
Is there a way to in-place / mutably update a BIT or bytea?
Failing that, are there any binary data structures that allow mutable/in-place set_bit edits?
A constraint is that I can't push C code or extensions to implement this, but I can use extensions that are available in AWS RDS Postgres. If it's not faster than HLL, then I'll just use HLL. Note that HLL is optimised for pre-aggregated counts; it isn't terribly fast at doing ad-hoc count estimates over millions of rows (although still faster than a raw COUNT DISTINCT).
Below is the code for context, probably buggy too:
CREATE OR REPLACE FUNCTION uuid_approx_count_distinct_sfunc (BIT(83886080), UUID)
RETURNS BIT(83886080) AS $$
DECLARE
    s BIT(83886080) := $1;
BEGIN
    IF s IS NULL THEN s := '0'::BIT(83886080); END IF;
    RETURN set_bit(s, abs(mod(uuid_hash($2), 83886080)), 1);
END
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION uuid_approx_count_distinct_ffunc (BIT(83886080))
RETURNS INTEGER AS $$
DECLARE
    i INTEGER := 83886079;
    s INTEGER := 0;
BEGIN
    LOOP
        EXIT WHEN i < 0;
        IF get_bit($1, i) = 1 THEN s := s + 1; END IF;
        i := i - 1;
    END LOOP;
    RETURN s;
END
$$ LANGUAGE plpgsql;

CREATE OR REPLACE AGGREGATE approx_count_distinct (UUID) (
    SFUNC = uuid_approx_count_distinct_sfunc,
    FINALFUNC = uuid_approx_count_distinct_ffunc,
    STYPE = BIT(83886080),
    COMBINEFUNC = bitor,
    PARALLEL = SAFE
);
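For context, the aggregate is meant to be used like any built-in aggregate; table and column names here are hypothetical:
SELECT approx_count_distinct(user_id) FROM events;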
Yeah, SQL isn't actually that fast for raw computation. I might try a UDF, perhaps PL/Java or PLV8 (JavaScript), which compile just-in-time to native and are available on most major hosting providers. Of course, use C (perhaps via LLVM) for maximum performance at maximum pain. PLV8 should take minutes to prototype: just pass in an array constructed with array_agg(). Obviously keep the array size to millions of items at most, or find a way to roll up your sketches (bitwise OR?).
https://plv8.github.io/#function-calls
https://www.postgresqltutorial.com/postgresql-aggregate-functions/postgresql-array_agg-function/
FYI HyperLogLog is available as an open source extension for PostgreSQL from Citus/Microsoft and of course available on Azure. https://www.google.com/search?q=hyperloglog+postgres
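For instance, a quick sketch using that extension (hll_hash_any, hll_add_agg and hll_cardinality are its documented functions; the table and column are hypothetical):
CREATE EXTENSION hll;
SELECT hll_cardinality(hll_add_agg(hll_hash_any(user_id))) FROM events;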
(You could crib from their code and just change the core algorithm, then test side by side.) Citus is pretty easy to install, so this isn't a bad option.
As I am a newbie to PL/pgSQL, I am stuck while migrating an Oracle function to PostgreSQL.
Oracle query:
create or replace FUNCTION employee_all_case(
p_ugr_id IN integer,
p_case_type_id IN integer
)
RETURN number_tab_t PIPELINED
-- LANGUAGE 'plpgsql'
-- COST 100
-- VOLATILE
-- AS $$
-- DECLARE
is
l_user_id NUMBER;
l_account_id NUMBER;
BEGIN
l_user_id := p_ugr_id;
l_account_id := p_case_type_id;
FOR cases IN
(SELECT ccase.case_id, ccase.employee_id
FROM ct_case ccase
INNER JOIN ct_case_type ctype
ON (ccase.case_type_id=ctype.case_type_id)
WHERE ccase.employee_id = l_user_id)
LOOP
IF cases.employee_id IS NOT NULL THEN
PIPE ROW (cases.case_id);
END IF;
END LOOP;
RETURN;
END;
--$$
In Oracle, I execute this function like this and get the expected result:
select * from table(select employee_all_case(14533,1190) from dual)
My question here is: I really do not understand pipelined functions, so how can I obtain the same result in PostgreSQL as with the Oracle query? Please help.
Thank you guys, your solution was very helpful. I found the desired result:
-- select * from employee_all_case(14533,1190);
-- drop function employee_all_case
create or replace FUNCTION employee_all_case(p_ugr_id IN integer ,p_case_type_id IN integer)
returns table (case_id double precision)
-- PIPELINED
LANGUAGE 'plpgsql'
COST 100
VOLATILE
AS $$
DECLARE
-- is
l_user_id integer;
l_account_id integer;
BEGIN
l_user_id := cp_lookup$get_user_id_from_ugr_id(p_ugr_id);
l_account_id := cp_lookup$acctid_from_ugr(p_ugr_id);
RETURN QUERY SELECT ccase.case_id
FROM ct_case ccase
INNER JOIN ct_case_type ctype ON ccase.case_type_id = ctype.case_type_id
WHERE ccase.employee_id = p_ugr_id
and ccase.employee_id IS NOT NULL;
--return NEXT;
END;
$$
You would rewrite that to a set returning function:
Change the return type to
RETURNS SETOF integer
and do away with the PIPELINED.
Change the PIPE ROW statement to
RETURN NEXT cases.case_id;
Of course, you will have to do the obvious syntactic changes, like using integer instead of NUMBER and putting the IN before the parameter name.
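Putting those steps together, a minimal sketch of the converted function could look like this (assuming the same tables as in the question, and integer IDs):
CREATE OR REPLACE FUNCTION employee_all_case(IN p_ugr_id integer, IN p_case_type_id integer)
RETURNS SETOF integer AS $$
DECLARE
    cases record;
BEGIN
    FOR cases IN
        SELECT ccase.case_id, ccase.employee_id
        FROM ct_case ccase
        INNER JOIN ct_case_type ctype ON ccase.case_type_id = ctype.case_type_id
        WHERE ccase.employee_id = p_ugr_id
    LOOP
        IF cases.employee_id IS NOT NULL THEN
            RETURN NEXT cases.case_id;  -- replaces Oracle's PIPE ROW
        END IF;
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;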
But actually, it is quite unnecessary to write a function for that. Doing it in a single SELECT statement would be both simpler and faster.
Pipelined functions are best translated to a simple SQL function returning a table.
Something like this:
create or replace function employee_all_case(p_ugr_id integer, p_case_type_id integer)
returns table (case_id integer)
as
$$
  SELECT ccase.case_id
  FROM ct_case ccase
  INNER JOIN ct_case_type ctype ON ccase.case_type_id = ctype.case_type_id
  WHERE ccase.employee_id = p_ugr_id
    and ccase.employee_id IS NOT NULL;
$$
language sql;
Note that your sample code did not use the second parameter p_case_type_id.
Usage is also more straightforward:
select *
from employee_all_case(14533,1190);
Before diving into the solution, I will provide some information which will help you understand better.
So basically, PIPELINED came into the picture to improve memory allocation at run time.
As you all know, collections occupy space whenever they get created. So the more you use, the more memory gets allocated.
Pipelining negates the need to build huge collections by piping rows out of the function, saving memory and allowing subsequent processing to start before all the rows are generated.
Pipelined table functions include the PIPELINED clause and use the PIPE ROW call to push rows out of the function as soon as they are created, rather than building up a table collection.
How does PIPELINED optimize memory usage?
Well, it's very simple: instead of storing data in a collection, you process each row with PIPE ROW(desired type). This returns the row immediately and moves on to the next one.
Coming to the solution in PL/pgSQL, there are two options.
Option 1 (simple, but not preferred when storing large data): remove PIPELINED from the return declaration and return an array of the desired type, something like RETURNS typrec2[]. Wherever you are using PIPE ROW(), append that entry to the array, and finally return that array.
Option 2: create a temp table like
CREATE TEMPORARY TABLE temp_table (required fields) ON COMMIT DROP;
and insert data into it. Replace PIPE ROW with an INSERT statement, and finish with a return statement like
return query select * from temp_table;
A sketch of this temp-table variant is shown below.
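A minimal sketch of the temp-table approach (the function name and the single integer column are hypothetical):
CREATE OR REPLACE FUNCTION pipelined_example() RETURNS SETOF integer AS $$
BEGIN
    CREATE TEMPORARY TABLE temp_table (case_id integer) ON COMMIT DROP;
    INSERT INTO temp_table VALUES (1), (2), (3);  -- each former PIPE ROW(...) becomes an INSERT
    RETURN QUERY SELECT case_id FROM temp_table;
END;
$$ LANGUAGE plpgsql;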
The best link for understanding PIPELINED in Oracle: https://oracle-base.com/articles/misc/pipelined-table-functions
A pretty ordinary reference for Postgres: http://manojadinesh.blogspot.com/2011/11/pipelined-in-oracle-as-well-in.html
Hope this helps someone conceptually.
Postgres PL/pgSQL docs say:
For any SQL command that does not return rows, for example INSERT without a RETURNING clause, you can execute the command within a PL/pgSQL function just by writing the command. Any PL/pgSQL variable name appearing in the command text is treated as a parameter, and then the current value of the variable is provided as the parameter value at run time.
But when I use variable names in my queries I get an error:
ERROR: syntax error at or near "email"
LINE 16: ...d,email,password) values(identity_id,current_ts,''email'',''...
This is my function:
CREATE OR REPLACE FUNCTION app.create_identity(email varchar,passwd varchar)
RETURNS integer as $$
DECLARE
current_ts integer;
new_identity_id integer;
int_max integer;
int_min integer;
BEGIN
SELECT extract(epoch FROM now())::integer INTO current_ts;
int_min:=-2147483648;
int_max:= 2147483647;
LOOP
BEGIN
SELECT floor(int_min + (int_max - int_min + 1) * random()) INTO new_identity_id;
IF new_identity_id != 0 THEN
INSERT into app.identity(identity_id,date_inserted,email,password) values(identity_id,current_ts,''email'',''passwd'');
RETURN new_identity_id;
END IF;
EXCEPTION
WHEN unique_violation THEN
END;
END LOOP;
END;
$$ LANGUAGE plpgsql;
Why does Postgres throw an error when I use variables in the query? How is this supposed to be written?
You can not put the parameter names in single quotes (''email''), and you can't use the parameter email "as is" because it has the same name as a column in the table. This name clash is one of the reasons it is highly recommended not to use variables or parameters that have the same name as a column in one of the tables. You have three choices to deal with this:
Rename the variable. A common naming convention is to prefix parameters with p_, e.g. p_email, then use the unambiguous names in the insert:
INSERT into app.identity(identity_id,date_inserted,email,password)
values(new_identity_id,current_ts,p_email,p_password);
Use $1 for the first parameter and $2 for the second:
INSERT into app.identity(identity_id,date_inserted,email,password)
values(new_identity_id,current_ts,$1,$2);
Prefix the parameter name with the function name:
INSERT into app.identity(identity_id,date_inserted,email,password)
values(new_identity_id,current_ts,create_identity.email,create_identity.password);
I would highly recommend going with option 1.
Unrelated, but: you don't need SELECT statements to assign variable values if you don't retrieve those values from a table.
SELECT extract(epoch FROM now())::integer INTO current_ts;
can be simplified to:
current_ts := extract(epoch FROM now())::integer;
and
SELECT floor(int_min + (int_max - int_min + 1) * random()) INTO new_identity_id;
to
new_identity_id := floor(int_min + (int_max - int_min + 1) * random());
#a_horse answers your actual question and clarifies quoting issues and naming conflicts.
About quoting:
Insert text with single quotes in PostgreSQL
About naming conflicts (behavior of plpgsql changed slightly over time):
Set a default return value for a Postgres function
Postgres function returning a row as JSON value
Better solution
I suggest a completely different approach to begin with:
CREATE OR REPLACE FUNCTION app.create_identity(_email text, _passwd text
, OUT new_identity_id int) AS
$func$
DECLARE
_current_ts int := extract(epoch FROM now());
BEGIN
LOOP
--+ Generate completely random int4 numbers +-----------------------------
-- integer (= int4) in Postgres is a signed integer occupying 4 bytes --
-- int4 ranges from -2147483648 to +2147483647, i.e. -2^31 to 2^31 - 1 --
-- Multiply bigint 4294967296 (= 2^32) with random() (0.0 <= x < 1.0) --
-- trunc() the resulting (positive!) float8 - cheaper than floor() --
-- add result to -2147483648 and cast the next result back to int4 --
-- The result fits the int4 range *exactly* --
--------------------------------------------------------------------------
INSERT INTO app.identity
(identity_id, date_inserted, email , password)
SELECT _random_int, _current_ts , _email , _passwd
FROM (SELECT (bigint '-2147483648' -- could be int, but sum is bigint anyway
+ bigint '4294967296' * random())::int) AS t(_random_int) -- random int
WHERE _random_int <> 0 -- exclude 0 (no insert)
ON CONFLICT (identity_id) DO NOTHING -- no exception raised!
RETURNING identity_id -- return *actually* inserted identity_id
INTO new_identity_id; -- OUT parameter, returned at end
EXIT WHEN FOUND; -- exit after success
-- maybe add counter and raise exception when exceeding n (100?) iterations
END LOOP;
END
$func$ LANGUAGE plpgsql;
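A quick usage sketch (the email and password values are hypothetical):
SELECT app.create_identity('jane@example.com', 'secret');  -- returns the generated identity_id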
Major points
Your random integer calculation would result in an integer out of range error, because the intermediate term int_max - int_min + 1 operates with integer, but the result won't fit.
I suggest the cheaper, correct algorithm above.
Entering a block with exception clause is considerably more expensive than without. Luckily, you do not actually need to raise an exception to begin with. Use an UPSERT (INSERT ... ON CONFLICT ... DO NOTHING), to solve this cheaply and elegantly (Postgres 9.5+).
The manual:
Tip: A block containing an EXCEPTION clause is significantly more expensive to enter and exit than a block without one. Therefore, don't use EXCEPTION without need.
You don't need the extra IF construct either. Use SELECT with WHERE.
Make new_identity_id an OUT parameter to simplify.
Use the RETURNING clause and insert the resulting identity_id into the OUT parameter directly. Simpler code and faster execution aside, there is an additional, subtle benefit: you get the value that was actually inserted. If there are triggers or rules on the table, this might differ from what you sent with the INSERT.
Assignments are relatively expensive in PL/pgSQL. Reduce those to a minimum for efficient code.
You might remove the last remaining variable _current_ts as well, and do the calculation in the subquery, then you don't need a DECLARE at all. I left that one, since it might make sense to calculate it once, should the function loop multiple times ...
All that remains is one SQL command, wrapped into a LOOP to retry until success.
If there is a chance that your table might overflow (using all or most int4 numbers) - and strictly speaking, there is always the chance - I would add a counter and raise an exception after maybe 100 iterations to avoid endless loops.
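Such a guard might look like this, placed right after EXIT WHEN FOUND; (a sketch; _tries is a made-up variable, declared as integer := 0 in the DECLARE section):
_tries := _tries + 1;
IF _tries >= 100 THEN
    RAISE EXCEPTION 'gave up after % attempts to find a free identity_id', _tries;
END IF;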
Below is a function which I am running on two different tables that contain the same column names.
-- Function: test(character varying)
-- DROP FUNCTION test(character varying);
CREATE OR REPLACE FUNCTION test(table_name character varying)
RETURNS SETOF void AS
$BODY$
DECLARE
recordcount integer;
j integer;
hstoredata hstore;
BEGIN
recordcount:=getTableName(table_name);
FOR j IN 1..recordcount LOOP
RAISE NOTICE 'RECORD NUMBER IS: %',j;
EXECUTE format('SELECT hstore(t) FROM datas.%I t WHERE id = $1', table_name) USING j INTO hstoredata;
RAISE NOTICE 'hstoredata: %', hstoredata;
END LOOP;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100
ROWS 1000;
When the above function is run on a table containing 1000 rows, the time taken is around 536 ms.
When the above function is run on a table containing 10000 rows, the time taken is around 27994 ms.
Logically, the time taken for 10000 rows should be around 5360 ms, extrapolating from the 1000-row case, but the execution time is much higher.
In order to reduce the execution time, please suggest what changes should be made.
Logically, the time taken for 10000 rows should be around 5360 ms, extrapolating from the 1000-row case, but the execution time is much higher.
It assumes that reading any particular row takes the same time as reading any other row, but this is not true.
For instance, if there's a text column in the table and it sometimes contains large contents, it's going to be fetched from TOAST storage (out of page) and dynamically uncompressed.
In order to reduce the execution time, please suggest what changes should be made.
To read all the table rows without necessarily fetching them all into memory at once, you can use a cursor. That would avoid a new query at every loop iteration. Cursors accept dynamic queries through EXECUTE.
See Cursors in plpgsql documentation.
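A minimal sketch of that cursor variant, keeping the original function's shape (the function name is made up, and the ORDER BY column is an assumption):
CREATE OR REPLACE FUNCTION test_cursor(table_name varchar) RETURNS void AS $$
DECLARE
    cur refcursor;
    hstoredata hstore;
BEGIN
    -- one dynamic query, fetched row by row instead of re-executed per id
    OPEN cur FOR EXECUTE format('SELECT hstore(t) FROM datas.%I t ORDER BY id', table_name);
    LOOP
        FETCH cur INTO hstoredata;
        EXIT WHEN NOT FOUND;
        RAISE NOTICE 'hstoredata: %', hstoredata;
    END LOOP;
    CLOSE cur;
END;
$$ LANGUAGE plpgsql;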
As far as I can tell, you are overcomplicating things. As "recordcount" is only used to step through the ID values, I think you can do everything with a single statement instead of querying for each and every ID separately.
CREATE OR REPLACE FUNCTION test(table_name varchar)
RETURNS void AS
$BODY$
DECLARE
rec record;
begin
for rec in execute format ('select id, hstore(t) as hs from datas.%I', table_name) loop
RAISE NOTICE 'RECORD NUMBER IS: %',rec.id;
RAISE NOTICE 'hstoredata: %', rec.hs;
end loop;
end;
$BODY$
language plpgsql;
The only difference from your solution is that if an ID smaller than the row count of the table does not exist, you won't see a RECORD NUMBER message for it; but you will see IDs that are bigger than the row count of the table.
Any time you execute the same statement again and again in a loop, very, very loud alarm bells should ring in your head. SQL is optimized to deal with sets of data, not to do row-by-row processing (which is what your loop is doing).
You didn't tell us what the real problem is you are trying to solve (and I fear that you have over-simplified your example) but given the code from the question, the above should be a much better solution (definitely much faster).