Similar UTF-8 strings for autocomplete field - postgresql

Background
Users can type a name and the system should match the text, even if either the user input or the database field contains accented (UTF-8) characters. This uses the pg_trgm module.
Problem
The code resembles the following:
SELECT
t.label
FROM
the_table t
WHERE
label % 'fil'
ORDER BY
similarity( t.label, 'fil' ) DESC
When the user types fil, the query matches filbert but not filé powder. (Because of the accented character?)
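The trigrams being compared can be inspected with pg_trgm's show_trgm() and similarity() functions (output depends on pg_trgm version and locale):
SELECT show_trgm('fil');
SELECT show_trgm('filé powder');
SELECT similarity('filé powder', 'fil');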
Failed Solution #1
I tried to implement an unaccent function and rewrite the query as:
SELECT
t.label
FROM
the_table t
WHERE
unaccent( label ) % unaccent( 'fil' )
ORDER BY
similarity( unaccent( t.label ), unaccent( 'fil' ) ) DESC
This returns only filbert.
Failed Solution #2
As suggested:
CREATE EXTENSION pg_trgm;
CREATE EXTENSION unaccent;
CREATE OR REPLACE FUNCTION unaccent_text(text)
RETURNS text AS
$BODY$
SELECT unaccent($1);
$BODY$
LANGUAGE sql IMMUTABLE
COST 1;
All other indexes on the table have been dropped. Then:
CREATE INDEX label_unaccent_idx
ON the_table( lower( unaccent_text( label ) ) );
This returns only one result:
SELECT
t.label
FROM
the_table t
WHERE
label % 'fil'
ORDER BY
similarity( t.label, 'fil' ) DESC
Question
What is the best way to rewrite the query to ensure that both results are returned?
Thank you!
Related
http://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.0#Unaccent_filtering_dictionary
http://postgresql.1045698.n5.nabble.com/index-refuses-to-build-td5108810.html

You are not using the operator class provided by the pg_trgm module. Create an index like this:
CREATE INDEX label_lower_unaccent_trgm_idx
ON the_table USING gist (lower(unaccent_text(label)) gist_trgm_ops);
Originally, I had a GIN index here, but a GiST is typically better suited for this kind of query because it can return values sorted by similarity. See:
Matching patterns between multiple columns
Finding similar strings with PostgreSQL quickly
Your query has to match the index expression to be able to use it.
SELECT label
FROM the_table
WHERE lower(unaccent_text(label)) % 'fil'
ORDER BY similarity(label, 'fil') DESC; -- ok to use original string here
However, "filbert" and "filé powder" are not actually very similar to "fil" according to the % operator. I suspect you really want:
SELECT label
FROM the_table
WHERE lower(unaccent_text(label)) LIKE 'fil%' -- !
ORDER BY similarity(label, 'fil') DESC; -- ok to use original string here
This finds all strings starting with the search string, and sorts the best matches according to the % operator first.
The expression can use a GIN or GiST index since PostgreSQL 9.1! The manual:
Beginning in PostgreSQL 9.1, these index types also support index
searches for LIKE and ILIKE, for example
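To confirm that the planner actually uses the trigram index for the LIKE variant, check the plan (a sketch; the exact plan shape depends on table size and statistics):
EXPLAIN
SELECT label
FROM the_table
WHERE lower(unaccent_text(label)) LIKE 'fil%'
ORDER BY similarity(label, 'fil') DESC;
-- On a non-trivial table, expect a Bitmap Index Scan on the trigram index.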
If you actually meant to use the % operator:
Try adapting the threshold for the similarity operator %:
SET pg_trgm.similarity_threshold = 0.1; -- Postgres 9.6 or later
SELECT set_limit(0.1); -- Postgres 9.5 or older
Or even lower? The default is 0.3. Lowering it shows whether the threshold is what filters out the additional matches.
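To pick a sensible threshold, inspect the actual scores (values are illustrative; they depend on the exact strings):
SELECT similarity('filbert', 'fil') AS s1,
       similarity(lower(unaccent_text('filé powder')), 'fil') AS s2;
-- Any row whose score falls below the threshold is filtered out by %.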

A solution for PostgreSQL 9.1:
-- Install the requisite extensions.
CREATE EXTENSION pg_trgm;
CREATE EXTENSION unaccent;
-- Function fixes STABLE vs. IMMUTABLE problem of the unaccent function.
CREATE OR REPLACE FUNCTION unaccent_text(text)
RETURNS text AS
$BODY$
-- unaccent is STABLE, but indexes must use IMMUTABLE functions.
SELECT unaccent($1);
$BODY$
LANGUAGE sql IMMUTABLE
COST 1;
-- Create an unaccented index.
CREATE INDEX the_table_label_unaccent_idx
ON the_table USING gin (lower(unaccent_text(label)) gin_trgm_ops);
-- Define the matching threshold.
SELECT set_limit(0.175);
-- Test the query (matching against the index expression).
SELECT
label
FROM
the_table
WHERE
lower(unaccent_text(label)) % 'fil'
ORDER BY
similarity(label, 'fil') DESC
Returns "filbert", "fish fillet", and "filé powder".
Without calling SELECT set_limit(0.175);, you can instead use the ~~ operator (the operator form of LIKE) with a prefix pattern:
-- Test the query (matching against the index expression).
SELECT
label
FROM
the_table
WHERE
lower(unaccent_text(label)) ~~ 'fil%'
ORDER BY
similarity(label, 'fil') DESC
Also returns "filbert", "fish fillet", and "filé powder".

Related

ILIKE query with indexing for jsonb array data in postgres

I have a table with a jsonb column named city that holds a JSON array like the one below:
[{"name":"manchester",..},{"name":"liverpool",....}]
Now I want to query the table on the "name" key with an ILIKE query.
I tried the following, but it does not work for me:
select * from data where city->>'name' ILIKE '%man%'
I know I can search for an exact match with a containment query like this:
select * from data where city @> '[{"name":"manchester"}]'
I also know I can use jsonb functions to flatten the data and search it, but then the query will not use an index.
Is there any way to search the data with ILIKE such that it also uses an index?
Index support will be difficult; for that, a schema that adheres to the first normal form would be beneficial.
Other than that, you can use the JSONPATH language from v12 on:
WITH t(c) AS (
SELECT '[{"name":"manchester"},{"name":"liverpool"}]'::jsonb
)
SELECT jsonb_path_exists(
c,
'$.**.name ? (@ like_regex "man" flag "i")'::jsonpath
)
FROM t;
jsonb_path_exists
═══════════════════
t
(1 row)
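Applied to the table from the question, the same check can serve as a WHERE condition (a sketch using the question's table and column names; not indexable):
SELECT *
FROM data
WHERE jsonb_path_exists(
  city,
  '$.**.name ? (@ like_regex "man" flag "i")'::jsonpath
);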
You should really store your data differently.
You can do the ilike query "naturally" but without index support, like this:
select * from data where exists (select 1 from jsonb_array_elements(city) f(x) where x->>'name' ILIKE '%man%');
You can get index support like this:
create index on data using gin ((city::text) gin_trgm_ops);
select * from data where city::text ilike '%man%';
But it will find matches within the text of the keys as well as the values, and within irrelevant keys/values if any are present. You could get around this by creating a function that returns just the values, all banged together into one string, and then use a functional index. But the index will get less effective as the strings get longer, since there will be more false positives that need to be tracked down and weeded out.
create or replace function concat_val(jsonb, text) returns text immutable language sql as $$
select string_agg(x->>$2,' ') from jsonb_array_elements($1) f(x)
$$ parallel safe;
create index on data using gin (concat_val(city,'name') gin_trgm_ops);
select * from data where concat_val(city,'name') ilike '%man%';
You should really store your data differently.

PostgreSQL index for comparison of JSONB values

We're experimenting with JSONB on PostgreSQL 12/13 to see whether it's a better alternative for customizable extension attributes than a bunch of extension tables (EAV, I guess), and so far I'm impressed by the results, although using GIN indexes is trickier than it seems at first.
Experimental table is simple enough:
create TABLE jtest (
id SERIAL PRIMARY KEY,
text text,
ext jsonb
);
CREATE INDEX jtest_ext_gin_idx ON jtest USING gin (ext);
I'm inserting varied data with (a bigger version of) this monstrous block (quoted only for db-fiddle):
DO 'BEGIN
FOR r IN 1..100000 LOOP
IF r % 10 <= 3 THEN
-- some entries have no extension
INSERT INTO jtest (text, ext) VALUES (''json-'' || LPAD(r::text, 10, ''0''), NULL);
ELSEIF r % 10 = 7 THEN
-- let''s add some numbers and wannabe "dates"
INSERT INTO jtest (text, ext)
VALUES (''json-'' || LPAD(r::text, 10, ''0''), (''{'' ||
''"hired": "'' || current_date - width_bucket(random(), 0, 1, 1000) || ''",'' ||
''"rating": '' || width_bucket(random(), 0, 1, 10) || ''}'')::jsonb);
ELSE
INSERT INTO jtest (text, ext)
VALUES (''json-'' || LPAD(r::text, 10, ''0''), (''{"email": "user'' || r || ''#mycompany.com", "other-key-'' || r || ''": "other-value-'' || r || ''"}'')::jsonb);
END IF;
END LOOP;
END';
Various exact-value match operations are easy and GIN works very well for these. But we also need < and LIKE; let's just focus on comparison for now.
The example query is:
select * from jtest
where ext->>'hired' >= '2020-06-01' -- not using function index on its own
But if I add a semantically useless AND condition, the index kicks in:
select * from jtest
where ext->>'hired' >= '2020-06-01'
and ext?'hired';
Here is a fiddle example.
Question #1: I have no problem to implement a query interpreter in our application to make it work, but is it expected behavior? Can't PG figure out that when >= is used the left side is indeed not null?
I also experimented with functional index on (ext->>'hired') - fiddle here:
CREATE INDEX jtest_ext_hired1_idx ON jtest ((ext->>'hired'));
CREATE INDEX jtest_ext_hired2_idx ON jtest ((ext->>'hired')) WHERE ext ? 'hired';
The second index is MUCH smaller than the first, and I'm not sure what the first one is good for.
Question #2: When I execute the query with ext->>'hired' >= '2020-06-01', it uses the first index in the fiddle - but not in my tests with 15M rows (only 18k of them returned). So that's the first confusion: my internal tests, which I don't want to recreate on the fiddle (it would run far too long), should be more selective, yet they use a sequential scan for whatever reason. Why does it use a sequential scan on the much bigger table?
Answer #2: After running ANALYZE the query used the index and became fast. As this is not the most important question, I answer it directly here.
Finally (not a question): with the additional AND ext ? 'hired' it uses the jtest_ext_hired2_idx index just fine (both in the fiddle and in my much bigger table).
Question #3: Rather generic, is this even the right approach? If I expect comparison and LIKE operations on values from JSONB, can I just cover them with additional functional indexes? It still seems more flexible for our case than adding custom columns or joining extension tables, but can't it bite us in the future?
As documented in the manual, the default GIN operator class for jsonb only supports the operators @>, @?, @@, ?, ?& and ?|. So by adding the (seemingly useless) ext?'hired' condition you enable the optimizer to use the GIN index (not the functional index).
To index the hire date, I would create a function that extracts the value as a proper date. You can't do that with a cast in the index expression, as the cast is not immutable. But since we know that a cast from a yyyy-mm-dd string is in fact immutable, there is nothing wrong with creating a function that is marked immutable.
create function hire_date(p_input jsonb)
returns date
as
$$
select (p_input ->> 'hired')::date;
$$
language sql
strict
immutable;
Then you can use:
CREATE INDEX jtest_ext_hired1_idx ON jtest ( (hire_date(ext)) );
And that index is used directly when the function is used in the where clause:
select *
from jtest
where hire_date(ext) >= '2020-06-01';
Of course that will fail if the 'hired' key doesn't actually contain a proper DATE value (but it will already fail during insert, as the index can't be updated).
Indexing LIKE expressions is in general tricky, but if you only have left anchored search strings (like 'foo%') a regular b-tree index can be used:
create index jtest_email on jtest ( (ext ->> 'email') varchar_pattern_ops);
To index a LIKE expression with a search string that is not left-anchored (like '%foo%') you would need a trigram index.
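A minimal sketch of that, assuming the pg_trgm extension is available (the index name is illustrative):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX jtest_email_trgm_idx ON jtest USING gin ((ext ->> 'email') gin_trgm_ops);
-- Now patterns that are not left-anchored can use the index:
SELECT * FROM jtest WHERE ext ->> 'email' LIKE '%foo%';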

Postgresql: query on jsonb column - index doesn't make it quicker

There is a table in PostgreSQL 9.6 where a query on a jsonb column is slow compared to a relational table, and adding a GIN index on it doesn't make it quicker.
Table:
-- create table
create table dummy_jsonb (
id serial8,
data jsonb,
primary key (id)
);
-- create index
CREATE INDEX dummy_jsonb_data_index ON dummy_jsonb USING gin (data);
-- CREATE INDEX dummy_jsonb_data_index ON dummy_jsonb USING gin (data jsonb_path_ops);
Generate data:
-- generate data,
CREATE OR REPLACE FUNCTION dummy_jsonb_gen_data(n integer) RETURNS integer AS $$
DECLARE
i integer:=1;
name varchar;
create_at varchar;
json_str varchar;
BEGIN
WHILE i<=n LOOP
name:='dummy_' || i::text;
create_at:=EXTRACT(EPOCH FROM date_trunc('milliseconds', now())) * 1000;
json_str:='{
"name": "' || name || '",
"size": ' || i || ',
"create_at": ' || create_at || '
}';
insert into dummy_jsonb(data) values
(json_str::jsonb
);
i:= i + 1;
END LOOP;
return n;
END;
$$ LANGUAGE plpgsql;
-- call function,
select dummy_jsonb_gen_data(1000000);
-- drop function,
DROP FUNCTION IF EXISTS dummy_jsonb_gen_data(integer);
Query:
select * from dummy_jsonb
where data->>'name' like 'dummy_%' and data->>'size' >= '500000'
order by data->>'size' desc
offset 50000 limit 10;
Test result:
The query takes 1.8 seconds on a slow VM.
Adding or removing the index doesn't make a difference.
Changing to a GIN index with jsonb_path_ops also doesn't make a difference.
Questions:
Is it possible to make the query quicker, either by improving the index or the SQL?
If not, does that mean that within PG a relational table is more appropriate for this case?
Also, in my test MongoDB performs better; does that mean MongoDB is more suitable for this kind of storage and querying?
Quote from the manual
The default GIN operator class for jsonb supports queries with top-level key-exists operators ?, ?& and ?| operators and path/value-exists operator @> [...] The non-default GIN operator class jsonb_path_ops supports indexing the @> operator only.
Your query uses LIKE and a string comparison with >= (which is probably not correct to begin with); neither of those is supported by a GIN index.
But even an index on (data ->> 'name') wouldn't be used for the condition data->>'name' like 'dummy_%' as that is true for all rows because every name starts with dummy.
You can create a regular btree index on the name:
CREATE INDEX ON dummy_jsonb ( (data ->> 'name') varchar_pattern_ops);
Which will be used if the condition is restrictive enough, e.g.:
where data->>'name' like 'dummy_9549%'
If you need to query for the size, you can create an index on ((data ->> 'size')::int) and then use something like this:
where (data->>'size')::int >= 500000
However, your use of limit and offset will always force the database to read all matching rows, sort them, and only then limit the result. This is never going to be very fast. You might want to read this article for more information on why limit/offset is not very efficient.
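A common alternative is keyset pagination: remember the sort-key values of the last row of the previous page and continue from there (a sketch; the literals stand in for the values remembered from the previous page, it assumes an index on ((data->>'size')::int), and id serves as a tie-breaker):
SELECT * FROM dummy_jsonb
WHERE (data->>'size')::int >= 500000
AND ((data->>'size')::int, id) < (499950, 42) -- last (size, id) seen
ORDER BY (data->>'size')::int DESC, id DESC
LIMIT 10;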
JSON is a nice addition to the relational world, but only if you use it appropriately. If you don't need dynamic attributes for a row, then use standard columns and data types. Even though JSON support in Postgres is extremely good, this doesn't mean one should use it for everything just because it's the current hype. Postgres is still a relational database and should be used as such.
Unrelated, but: your function to generate the test data can be simplified to a single SQL statement. You might not have been aware of the generate_series() function for things like that:
insert into dummy_jsonb(data)
select jsonb_build_object('name', 'dummy_'||i,
'size', i::text,
'create_at', (EXTRACT(EPOCH FROM date_trunc('milliseconds', clock_timestamp())) * 1000)::text)
from generate_series(1,1000000) as t(i);
While a btree index (the standard PostgreSQL index based on balanced trees) can optimize ordering-based conditions like >= '500000', a GIN index uses an inverted index structure meant to quickly find data containing specific elements (it is commonly used e.g. for text search to find rows containing given words), so (AFAIK) it can't be used for the query you provide.
The PostgreSQL docs on jsonb indexing indicate which WHERE conditions an index may be applied to. As pointed out there, you can create a btree index on specific elements in a jsonb column: indexes on the specific elements referenced in the WHERE clause should work for the query you indicate.
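Concretely, that could look like this (a sketch mirroring the suggestion in the first answer; the text-to-integer cast is immutable and therefore allowed in an index expression):
CREATE INDEX dummy_jsonb_size_idx ON dummy_jsonb (((data ->> 'size')::int));
SELECT * FROM dummy_jsonb WHERE (data ->> 'size')::int >= 500000;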
Also, as commented above, think whether you actually need JSON for your use case.

Postgres: Can I create an index to use in the SELECT clause?

I have defined a function that determines the timezone from table tz_world for a set of lon, lat values:
create function get_timezone(numeric, numeric)
returns character varying(30) as $$
select tzid from tz_world where ST_Contains(geom, ST_MakePoint($1, $2));
$$ language SQL immutable;
Now I would like to use this function in the SELECT clause of a query on a different table:
select get_timezone(lon, lat) from event where...;
The function is rather slow, so I tried using an index to speed things up:
create index event_timezone_idx on event (get_timezone(event.lon, event.lat));
While this speeds up queries where the function is used in the WHERE clause, it has no effect on the variant above where get_timezone(lon, lat) is used in the SELECT clause.
Is it possible to rephrase the query and/or index to speed up the timezone determination?
Update
Thank you for the answers!! I decided to include an extra column for the timezone in the end and populate it when creating/updating the events.
I would recommend creating a local temporary table from the part of the SELECT you want to index, and then creating an index on that temporary table:
CREATE LOCAL TEMPORARY TABLE temp_table AS (
select
.
.
.
);
CREATE INDEX temp_table_idx
ON temp_table
USING btree
(col1,col2,....);
Otherwise, write out the WHERE condition you need: indexes are only used for filtering in WHERE clauses, and the indexed expression must match exactly what you are filtering on.
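For reference, the approach the question's update settled on (an extra column populated on create/update) could be implemented with a trigger like this (a sketch; the column and trigger names are illustrative):
ALTER TABLE event ADD COLUMN timezone character varying(30);
CREATE OR REPLACE FUNCTION event_set_timezone()
RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
  -- Pay the expensive lookup once per write instead of on every read.
  NEW.timezone := get_timezone(NEW.lon, NEW.lat);
  RETURN NEW;
END
$$;
CREATE TRIGGER event_set_timezone_trg
BEFORE INSERT OR UPDATE OF lon, lat ON event
FOR EACH ROW
EXECUTE PROCEDURE event_set_timezone();
-- Reads become plain column access:
SELECT timezone FROM event WHERE ...;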

Exclusion constraint on a bitstring column with bitwise AND operator

I read about Exclusion Constraints in PostgreSQL but can't seem to find a way to use bitwise operators on bitstrings.
I have two columns (name text, value bit(8)). And I want to create a constraint that basically says this:
ADD CONSTRAINT route_method_overlap
EXCLUDE USING gist(name WITH =, value WITH &)
But this doesn't work since:
operator &(bit,bit) is not a member of operator family "gist_bit_ops"
I assume this is because the bit_ops operator & doesn't return boolean. But is there a way to do what I'm trying to do? Is there a way to coerce operator & to cast its return value as a boolean?
Currently using Postgres 9.1.4 with the "btree_gist" extension installed, all from the Ubuntu 12.04 repos. But the version doesn't matter. If there's fixes/updates upstream, I can install from the repos. I'm still in the design phase.
You installed the extension btree_gist. Without it, the example would already fail at name WITH =.
CREATE EXTENSION btree_gist;
The operator classes installed by btree_gist cover many operators. Unfortunately, the & operator is not among them. That is because it does not return a boolean, which an operator must do to qualify.
Alternative solution
I would use a combination of a b-tree multi-column index (for speed) and a trigger instead. Consider this demo (originally written for PostgreSQL 9.1; the syntax shown targets modern versions, with a note for older releases below):
CREATE TABLE t (
name text
, value bit(8)
);
INSERT INTO t VALUES ('a', B'10101010');
CREATE INDEX t_name_value_idx ON t (name, value);
CREATE OR REPLACE FUNCTION trg_t_name_value_inversion_prohibited()
RETURNS trigger
LANGUAGE plpgsql AS
$func$
BEGIN
IF EXISTS (
SELECT FROM t
WHERE (name, value) = (NEW.name, ~ NEW.value) -- example: exclude inversion
) THEN
RAISE EXCEPTION 'Your text here!';
END IF;
RETURN NEW;
END
$func$;
CREATE TRIGGER insup_bef_t_name_value_inversion_prohibited
BEFORE INSERT OR UPDATE OF name, value -- only involved columns relevant!
ON t
FOR EACH ROW
EXECUTE FUNCTION trg_t_name_value_inversion_prohibited();
INSERT INTO t VALUES ('a', ~ B'10101010'); -- fails with your error msg.
In Postgres 10 or older use instead:
...
EXECUTE PROCEDURE trg_t_name_value_inversion_prohibited();
See:
Trigger function does not exist, but I am pretty sure it does
~ is the inversion operator.
The extension btree_gist is not required in this scenario.
I restricted the trigger to INSERT OR UPDATE OF relevant columns for efficiency.
A check constraint wouldn't work. I quote the manual on CREATE TABLE:
Currently, CHECK expressions cannot contain subqueries nor refer to
variables other than columns of the current row.
Bold emphasis mine.
Should perform very well, actually better than the exclusion constraint, because maintenance of a b-tree index is cheaper than a GiST index. And the look-up with basic = operators should be faster than hypothetical look-ups with the & operator.
This solution is not as bullet-proof as an exclusion constraint, because triggers can more easily be circumvented - in a subsequent trigger on the same event for instance, or if the trigger is disabled temporarily. Be prepared to run extra checks on the whole table if such conditions apply.
More complex condition
The example trigger only catches the inversion of value. As you clarified in your comment, you actually need a condition like this instead:
IF EXISTS (
SELECT FROM t
WHERE name = NEW.name
AND value & NEW.value <> B'00000000'::bit(8)
) THEN
This condition is slightly more expensive, but can still use an index. The multi-column index from above would work - if you have use for it anyway. Or, more efficiently, a simple index on name:
CREATE INDEX t_name_idx ON t (name);
You commented that there can only be a maximum of 8 distinct rows per name, fewer in practice. So this should still be fast.
Ultimate INSERT performance
If INSERT performance is paramount, especially if many attempted INSERTs fail the condition, you could do more: create a materialized view that pre-aggregates value per name:
CREATE TABLE mv_t AS
SELECT name, bit_or(value) AS value
FROM t
GROUP BY 1
ORDER BY 1;
name is guaranteed to be unique here. I'd use a PRIMARY KEY on name to provide the index we're after:
ALTER TABLE mv_t SET (FILLFACTOR=90);
ALTER TABLE mv_t
ADD CONSTRAINT mv_t_pkey PRIMARY KEY(name);
Then your INSERT could look like this:
WITH i(n,v) AS (SELECT 'a'::text, B'10101010'::bit(8))
INSERT INTO t (name, value)
SELECT n, v
FROM i
LEFT JOIN mv_t m ON m.name = i.n
AND m.value & i.v <> B'00000000'::bit(8)
WHERE m.name IS NULL; -- anti-join; alternative syntax for NOT EXISTS (...)
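Spelled out with NOT EXISTS, the same anti-join reads:
WITH i(n,v) AS (SELECT 'a'::text, B'10101010'::bit(8))
INSERT INTO t (name, value)
SELECT n, v
FROM i
WHERE NOT EXISTS (
   SELECT 1
   FROM mv_t m
   WHERE m.name = i.n
   AND m.value & i.v <> B'00000000'::bit(8)
   );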
The fillfactor is only useful if your table gets a lot of updates.
Update rows in the materialized view in a TRIGGER AFTER INSERT OR UPDATE OF name, value OR DELETE to keep it current. The cost and gain of the additional objects have to be weighed carefully; it largely depends on your typical load.
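A sketch of such a maintenance trigger (assumes Postgres 9.5+ for ON CONFLICT; on older versions use an UPDATE-then-INSERT pattern instead):
CREATE OR REPLACE FUNCTION trg_t_maintain_mv()
RETURNS trigger
LANGUAGE plpgsql AS
$func$
BEGIN
   IF TG_OP IN ('UPDATE', 'DELETE') THEN
      -- Bits cannot be removed from a bit_or() aggregate incrementally;
      -- recompute the affected name from the base table.
      DELETE FROM mv_t WHERE name = OLD.name;
      INSERT INTO mv_t (name, value)
      SELECT name, bit_or(value) FROM t WHERE name = OLD.name GROUP BY name;
   END IF;
   IF TG_OP IN ('INSERT', 'UPDATE') THEN
      INSERT INTO mv_t (name, value)
      VALUES (NEW.name, NEW.value)
      ON CONFLICT (name) DO UPDATE
      SET value = mv_t.value | EXCLUDED.value;
   END IF;
   RETURN NULL;  -- AFTER trigger: return value is ignored
END
$func$;
CREATE TRIGGER aft_t_maintain_mv
AFTER INSERT OR UPDATE OF name, value OR DELETE ON t
FOR EACH ROW EXECUTE FUNCTION trg_t_maintain_mv();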