bit_count function in PostgreSQL

We are in the process of migrating a MySQL 5.7 database to PostgreSQL 9.6.
A real issue is the lack of a bit_count function in PostgreSQL. This function is also not available in the upcoming version 10.
Current MySQL code snippet (simplified):
-- mysql specific, tested with 5.7.19
select code,phash,bit_count(phash ^ -9187530158960050433) as hd
from documents
where phash is not null and bit_count(phash ^ -9187530158960050433) < 7
order by hd;
We tried a naive solution (converting the BIGINT to a string and counting the "1"s), but it performs terribly compared to MySQL.
Java uses a trick from Hacker's Delight, but AFAIK this is not possible with PostgreSQL, because the >>> operator is (also) not available.
Question: Is there a solution/workaround available that is comparable with MySQL performance-wise?
UPDATE 1
The best solution I could find is based on this SO answer:
First create bit_count function:
CREATE OR REPLACE FUNCTION bit_count(value bigint)
RETURNS numeric
AS $$ SELECT SUM((value >> bit) & 1) FROM generate_series(0, 63) bit $$
LANGUAGE SQL IMMUTABLE STRICT;
Now we can use almost the same SQL as with MySQL:
-- postgresql specific, tested with 9.6.5
select code,phash,bit_count(phash # -9187530158960050433) as hd
from documents
where phash is not null and bit_count(phash # -9187530158960050433) < 7
order by hd;
UPDATE 2
Based on a_horse_with_no_name's comment, I tried this function:
-- fastest implementation so far. 10 - 11 x faster than the naive solution (see UPDATE 1)
CREATE OR REPLACE FUNCTION bit_count(value bigint)
RETURNS integer
AS $$ SELECT length(replace(value::bit(64)::text,'0','')); $$
LANGUAGE SQL IMMUTABLE STRICT;
However, this is still 5-6 times slower than MySQL (tested with exactly the same data set of 200k phash values on the same hardware).

The bit_count function is available since PostgreSQL version 14; see
Bit String Functions and Operators.
Example:
select bit_count(B'1101');
Result is 3.
Note that the function is defined for types bit and bit varying. So if you want to use it with integer values, you need to cast.
Example:
select cast (cast (1101 as text) as bit varying);
Result is B'1101'.
Combining both examples:
select bit_count(cast (cast (1101 as text) as bit varying));
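For a bigint column such as phash from the original question, a direct cast to bit(64) is usually more convenient than going through text. A minimal sketch, assuming PostgreSQL 14+ and the documents table from the question:
-- sketch: popcount of the XOR distance, using the built-in bit_count
select code, phash, bit_count((phash # -9187530158960050433)::bit(64)) as hd
from documents
where phash is not null
  and bit_count((phash # -9187530158960050433)::bit(64)) < 7
order by hd;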

Question: Is there a solution/workaround available that is comparable with MySQL performance-wise?
To get a comparable speed, a compiled C function should be used.
If you can compile C code, see for instance
https://github.com/dverite/postgresql-functions/tree/master/hamming_weight
The code itself is very simple.
The result seems about 10 times faster than the bit_count function based on counting the '1' characters in the bit(64) value rendered as text.
Example:
plpgsql function:
test=> select sum(bit_count(x)) from generate_series(1,1000000) x;
sum
---------
9884999
(1 row)
Time: 2442,340 ms
C function:
test=> select sum(hamming_weight(x::int8)) from generate_series(1,1000000) x;
sum
---------
9884999
(1 row)
Time: 239,749 ms

If you are trying to compute the Hamming distance of perceptual hashes or similar LSH bit strings, then this question may be closely related to this answer.
If you are looking specifically for a pre-built way to do Hamming distance queries on a PostgreSQL database, then this may be the cure: an extension for hamming distance search

Alphanumeric sorting without any pattern on the strings

I've got a Postgres ORDER BY issue with the following table:
em_code name
EM001 AAA
EM999 BBB
EM1000 CCC
To insert a new record into the table, I:
select the last record with SELECT * FROM employees ORDER BY em_code DESC
strip the alphabetic prefix from em_code using a regexp and store it in ec_alpha
cast the remaining part to an integer ec_num
increment ec_num by one
pad with sufficient zeros and prefix ec_alpha again
When em_code reaches EM1000, the above algorithm fails.
The first step will return EM999 instead of EM1000, so it will generate EM1000 again as the new em_code, breaking the unique key constraint.
Any idea how to select EM1000?
Since Postgres 10, it is possible to specify a collation which will sort columns with numbers naturally.
https://www.postgresql.org/docs/10/collation.html
-- First create a collation with numeric sorting
CREATE COLLATION numeric (provider = icu, locale = 'en#colNumeric=yes');
-- Alter table to use the collation
ALTER TABLE "employees" ALTER COLUMN "em_code" type TEXT COLLATE numeric;
Now just query as you would otherwise.
SELECT * FROM employees ORDER BY em_code
On my data, I get results in this order (note that it also sorts foreign numerals):
Value
0
0001
001
1
06
6
13
۱۳
14
One approach you can take is to create a naturalsort function for this. Here's an example, written by Postgres legend RhodiumToad.
create or replace function naturalsort(text)
returns bytea language sql immutable strict as $f$
select string_agg(convert_to(coalesce(r[2], length(length(r[1])::text) || length(r[1])::text || r[1]), 'SQL_ASCII'),'\x00')
from regexp_matches($1, '0*([0-9]+)|([^0-9]+)', 'g') r;
$f$;
Source: http://www.rhodiumtoad.org.uk/junk/naturalsort.sql
To use it simply call the function in your order by:
SELECT * FROM employees ORDER BY naturalsort(em_code) DESC
The reason is that the string sorts alphabetically (instead of numerically like you would want it) and 1 sorts before 9.
You could solve it like this:
SELECT * FROM employees
ORDER BY substring(em_code, 3)::int DESC;
It would be more efficient to drop the redundant 'EM' from your em_code - if you can - and save an integer number to begin with.
Answer to question in comment
To strip any and all non-digits from a string:
SELECT regexp_replace(em_code, E'\\D','','g')
FROM employees;
\D is the regular expression class-shorthand for "non-digits".
'g' as 4th parameter is the "globally" switch to apply the replacement to every occurrence in the string, not just the first.
After replacing every non-digit with the empty string, only digits remain.
This comes up again and again, both in questions and in my own development, and I finally tired of the tricky ways of doing it, so I broke down and implemented it as a PostgreSQL extension:
https://github.com/Bjond/pg_natural_sort_order
It's free to use, MIT license.
Basically it normalizes the numerics within strings (by zero-prepending them) so that you can create an index column for full-speed natural sorting. The README explains.
The advantage is you can have a trigger do the work and not your application code. It will be calculated at machine-speed on the PostgreSQL server and migrations adding columns become simple and fast.
You can use just this line:
ORDER BY length(substring(em_code FROM '[0-9]+')), em_code
I wrote about this in detail in this related question:
Humanized or natural number sorting of mixed word-and-number strings
(I'm posting this answer as a useful cross-reference only, so it's community wiki).
I came up with something slightly different.
The basic idea is to create an array of tuples (integer, string) and then order by these. The magic number 2147483647 is int32_max, used so that strings are sorted after numbers.
ORDER BY ARRAY(
    SELECT ROW(
        CAST(COALESCE(NULLIF(match[1], ''), '2147483647') AS INTEGER),
        match[2]
    )
    FROM REGEXP_MATCHES(col_to_sort_by, '(\d*)|(\D*)', 'g') AS match
)
I thought about another way of doing this that uses less DB storage than padding and is faster than calculating on the fly.
https://stackoverflow.com/a/47522040/935122
I've also put it on GitHub
https://github.com/ccsalway/dbNaturalSort
The following solution is a combination of various ideas presented in another question, as well as some ideas from the classic solution:
create function natsort(s text) returns text immutable language sql as $$
select string_agg(r[1] || E'\x01' || lpad(r[2], 20, '0'), '')
from regexp_matches(s, '(\D*)(\d*)', 'g') r;
$$;
The design goals of this function were simplicity and pure string operations (no custom types and no arrays), so it can easily be used as a drop-in solution and is trivial to index over.
Note: If you expect numbers with more than 20 digits, you'll have to replace the hard-coded maximum length 20 in the function with a suitable larger length. Note that this will directly affect the length of the resulting strings, so don't make that value larger than needed.
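Since the function is declared IMMUTABLE, it can also back an expression index. A minimal sketch, assuming the employees table from the question (the index name is made up):
-- sketch: index the normalized key, then sort on it
CREATE INDEX employees_em_code_natsort_idx ON employees (natsort(em_code));
SELECT * FROM employees ORDER BY natsort(em_code);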

Get MAX and MIN ip from subnet/mask in postgres

I have the following issue:
Right now I have a table with the subnet/mask information (for example 192.168.1.0 / 255.255.255.0), but I need to obtain the MAX and MIN IP from this subnet:
192.168.1.0 / 192.168.1.255
I've found this answer:
how to query for min or max inet/cidr with postgres
But it seems that network_smaller(inet, inet) and network_larger(inet, inet) don't exist. Even googling, I can't find any reference to those functions.
Thanks!
Edit:
Version info:
PostgreSQL 9.2.15 on x86_64-redhat-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4), 64-bit
I don't think that question is relevant to your needs anyway. The min and max defined there are similar to the SQL min() and max() functions for finding the smallest / largest in a table, not the smallest / largest in a subnet.
I'm not generally a fan of relying on undocumented features. They may be safe but may isn't a word I generally like.
There's a page of documented network functions here:
https://www.postgresql.org/docs/current/static/functions-net.html
The two you would need would be:
Min would be network(inet)
Max would be broadcast(inet)
That's because the network address is always the "first" IP in the range and the broadcast address is always the "last" IP in the range.
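For example, a minimal sketch using the subnet from the question (host() just strips the prefix length from the output):
-- sketch: first and last address of 192.168.1.0/24
select host(network('192.168.1.0/24'::inet))   as min_ip,   -- 192.168.1.0
       host(broadcast('192.168.1.0/24'::inet)) as max_ip;   -- 192.168.1.255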
Don't google, just try:
select network_smaller('192.168.0.9'::inet, '192.168.0.11'::inet);
network_smaller
-----------------
192.168.0.9
(1 row)
Postgres has more than 2,600 internal functions. Most of them are useful for creating operator classes of various types. Not all of them are described in the documentation, but they are all generally available.
You can find them using pgAdmin III in pg_catalog. You only need to set the option: File -> Options -> UI Miscellaneous -> Show System Objects in treeview.
The aggregate functions min(inet) and max(inet) have been introduced in Postgres 9.5:
with test(ip) as (
    values
        ('192.168.0.123'::inet),
        ('192.168.0.12'),
        ('192.168.0.1'),
        ('192.168.0.125')
)
select max(ip), min(ip)
from test;
max | min
---------------+-------------
192.168.0.125 | 192.168.0.1
(1 row)
See how the aggregate min(inet) is defined (it can be found in pg_catalog):
CREATE AGGREGATE min(inet) (
SFUNC=network_smaller,
STYPE=inet,
SORTOP="<"
);
The question How to query for min or max inet/cidr with postgres concerned Postgres 9.4. In my answer I suggested using the functions network_smaller(inet, inet) and network_larger(inet, inet). I'm sure they were added for creating the aggregate functions min(inet) and max(inet), but for some reason (maybe an oversight) the aggregates appeared only in Postgres 9.5.
In Postgres 9.2 you can create your own functions as substitutes, e.g.
create or replace function inet_larger(inet, inet)
returns inet language sql as $$
select case when network_gt($1, $2) then $1 else $2 end
$$;
create or replace function inet_smaller(inet, inet)
returns inet language sql as $$
select case when network_lt($1, $2) then $1 else $2 end
$$;
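If you also want min(inet)/max(inet) aggregates on 9.2, they can be rolled by hand on top of these helpers, mirroring the 9.5 definition shown above. A sketch, not tested on 9.2:
-- sketch: hand-rolled aggregates for Postgres versions before 9.5
CREATE AGGREGATE min(inet) (SFUNC = inet_smaller, STYPE = inet, SORTOP = "<");
CREATE AGGREGATE max(inet) (SFUNC = inet_larger,  STYPE = inet, SORTOP = ">");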

Is there a CRC32 or other hash function in DB2 for zOS?

I'm looking for a DB2 function to calculate hashes on large CLOB values in order to quickly track changes. Other engines have functions such as CHECKSUM, CRC32 or MD5. The function in LUW is GET_HASH_VALUE, but it is not available on z/OS.
Constraints: No access to UDFs or Stored Procedures.
Here is a quick and dirty code fragment that computes a CRC32; it only works for inputs up to about 100 characters.
WITH crc(t,c,j) AS (
    SELECT 'Hello World!', 4294967295, 0 FROM SYSIBM.SYSDUMMY1
    UNION ALL
    SELECT SUBSTR(t,2), BITXOR(c, ASCII(t)), 8 FROM crc WHERE t > '' AND j = 0
    UNION ALL
    SELECT t, BITXOR(c/2, BITAND(3988292384, -BITAND(c,1))), j-1 FROM crc WHERE j > 0
)
SELECT RIGHT(HEX(BITNOT(c)), 8) FROM crc WHERE t = '' AND j = 0
Result checked against http://www.lammertbies.nl/comm/info/crc-calculation.html :
1
--------
1C291CA3
Source: http://www.hackersdelight.org/hdcodetxt/crc.c.txt
The answer depends on which version of DB2 you have. If you are on DB2 9.7 or higher, have a look here: https://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.sql.rtn.doc/doc/r0055167.html

PostgreSQL - Function to calculate percentages

In postgresql if I want percentages I just write:
select x / sum(x) over() ...
Inside a function it doesn't work since aggregate functions don't behave well.
I tried to find a solution but with no success.
This is a simple version of what I really need, but I believe the solution to this problem would surely point me in the right direction.
Some more details...
If I create this simple table:
create table ttt(v1 numeric, v2 numeric);
insert into ttt values (2,1),(5,2),(10,4);
If I run:
select v1/sum(v1) over() from ttt; --returns relative frequencies
I get:
?column?
------------------------
0.11764705882352941176
0.29411764705882352941
0.58823529411764705882
(3 rows)
Now, if I want to create a function which does the same thing, I would write:
create or replace function rfreq (double precision)
returns double precision
AS
'
select
$1 / sum($1) over()
'
LANGUAGE 'sql';
I get:
select rfreq(v1) from bruto;
rfreq
-------
1
1
1
(3 rows)
Postgresql is not summing up inside a function.
Any suggestions?
Thank you,
Ali.
To debug your function, write the query with arbitrary parameters in a text file, and then use psql to run it:
\i ./myfunc.sql
Content of myfunc.sql would be:
select x / sum(y) over (...) ...
This will allow you to debug the function before wrapping it in a function.
When you're done and happy with the results for a few samples, copy/paste it into your function, and replace the hard-coded test values with parameters where applicable.
As to optimizing it once it has parameters: I'm not aware of any way to run EXPLAIN ANALYZE inside a Postgres function, but you can get a plan that -- as best I'm aware -- is the same as the one the function will use, by preparing a statement with the same parameters. So you can EXPLAIN ANALYZE the latter instead.
Seeing the new details, note that if you prepare the query that you're running in the function, you should always get 1 (barring a zero input).
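For example, preparing just the body of the function shows both the plan and why the result collapses to 1. A sketch (rfreq_body is a made-up name):
-- sketch: this is effectively what runs for each call inside the function
PREPARE rfreq_body(numeric) AS SELECT $1 / sum($1) OVER ();
EXPLAIN ANALYZE EXECUTE rfreq_body(2);
EXECUTE rfreq_body(2);   -- one row, so sum($1) over () = $1 and the result is 1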
You have an error in there, in the sense that you'd need to keep state from one call to the next to return the expected result. Per Pavel's suggestion, you actually need a custom aggregate or a custom window function here. See the link he suggested in a comment, as well as:
http://www.postgresql.org/docs/current/static/xaggr.html
I found the solution browsing through the pl/r mailing list.
Percentages (or relative frequencies) can be calculated in postgres using the following code:
CREATE OR REPLACE
FUNCTION rel_freq(float8)
RETURNS float8 AS
$BODY$
var <- as.vector(farg1)
return((var/sum(var))[prownum])
$BODY$
LANGUAGE plr WINDOW;

Conversion of int values to numerals in Postgresql?

I would like to know if there is a built-in way to convert integer values into numerals in PostgreSQL?
As an example, is it possible to convert the integer 10 into the string TEN?
Thank You.
There's nothing built-in. For this sort of thing your best bet will be to make use of PostgreSQL's pluggable procedural languages. Use PL/Perl or PL/Python with a suitable Perl or Python library to do the job.
In this case I'd probably use PL/Perl with Lingua::EN::Numbers.
CREATE OR REPLACE FUNCTION num2en(numeric) RETURNS text AS $$
use Lingua::EN::Numbers qw(num2en);
return num2en($_[0]);
$$ LANGUAGE plperlu;
You'll need to install Lingua::EN::Numbers into the Perl being used by PostgreSQL using CPAN or system packages first. In my case (Fedora 19) this was a simple yum install perl-Lingua-EN-Numbers.noarch, then I could:
regress=> SELECT num2en(10);
num2en
--------
ten
(1 row)
regress=# SELECT num2en(NUMERIC '142.5');
num2en
--------------------------------------
one hundred and forty-two point five
(1 row)
By default the function is accessible by normal users so you don't have to issue any extra GRANTs.
Try this query:
SELECT split_part (cash_words (10::VARCHAR::MONEY), 'dollar', 1);
It's an internal function of PostgreSQL.