Tsquery return exact matched keyword - postgresql

I have a query like
select * from mytable where posttext ## to_tsquery('Intelence');
I just want to return results with exact match of the keyword 'Intelence' rather than 'intel', how can I do this in postgresql?
Thanks.

This is not possible with full-text search unless you want to tell PostgreSQL not to stem Intelence at all by changing the text search dictionary. Pg doesn't include the word in the index, only the stems:
regress=> SELECT to_tsvector('english', 'Intelence');
to_tsvector
-------------
'intel':1
(1 row)
You can suppress stemming entirely with the simple dictionary:
regress=> SELECT to_tsvector('simple','Intelence');
to_tsvector
---------------
'intelence':1
(1 row)
but that must be done on the index, you can't do it per-query let alone per search term. So the text cats are bothering me would not match a search for cat in the simple dictionary because of the plural, or bother because the unstemmed words are not the same.
If you want to make individual exceptions you can edit the english dictionary used by tsearch2 and define a custom dictionary with the desired changes, then use that dictionary instead of english in queries where you want the exceptions. Again, you must use the same dictionary for the index creation and the queries, though.
This might land up with you needing multiple fulltext indexes, which is very undesirable from the point of view of slowing down updates/inserts/deletes and from a memory use efficiency perspective.

Related

Alphanumeric sorting without any pattern on the strings [duplicate]

I've got a Postgres ORDER BY issue with the following table:
em_code name
EM001 AAA
EM999 BBB
EM1000 CCC
To insert a new record to the table,
I select the last record with SELECT * FROM employees ORDER BY em_code DESC
Strip alphabets from em_code usiging reg exp and store in ec_alpha
Cast the remating part to integer ec_num
Increment by one ec_num++
Pad with sufficient zeors and prefix ec_alpha again
When em_code reaches EM1000, the above algorithm fails.
First step will return EM999 instead EM1000 and it will again generate EM1000 as new em_code, breaking the unique key constraint.
Any idea how to select EM1000?
Since Postgres 9.6, it is possible to specify a collation which will sort columns with numbers naturally.
https://www.postgresql.org/docs/10/collation.html
-- First create a collation with numeric sorting
CREATE COLLATION numeric (provider = icu, locale = 'en#colNumeric=yes');
-- Alter table to use the collation
ALTER TABLE "employees" ALTER COLUMN "em_code" type TEXT COLLATE numeric;
Now just query as you would otherwise.
SELECT * FROM employees ORDER BY em_code
On my data, I get results in this order (note that it also sorts foreign numerals):
Value
0
0001
001
1
06
6
13
۱۳
14
One approach you can take is to create a naturalsort function for this. Here's an example, written by Postgres legend RhodiumToad.
create or replace function naturalsort(text)
returns bytea language sql immutable strict as $f$
select string_agg(convert_to(coalesce(r[2], length(length(r[1])::text) || length(r[1])::text || r[1]), 'SQL_ASCII'),'\x00')
from regexp_matches($1, '0*([0-9]+)|([^0-9]+)', 'g') r;
$f$;
Source: http://www.rhodiumtoad.org.uk/junk/naturalsort.sql
To use it simply call the function in your order by:
SELECT * FROM employees ORDER BY naturalsort(em_code) DESC
The reason is that the string sorts alphabetically (instead of numerically like you would want it) and 1 sorts before 9.
You could solve it like this:
SELECT * FROM employees
ORDER BY substring(em_code, 3)::int DESC;
It would be more efficient to drop the redundant 'EM' from your em_code - if you can - and save an integer number to begin with.
Answer to question in comment
To strip any and all non-digits from a string:
SELECT regexp_replace(em_code, E'\\D','','g')
FROM employees;
\D is the regular expression class-shorthand for "non-digits".
'g' as 4th parameter is the "globally" switch to apply the replacement to every occurrence in the string, not just the first.
After replacing every non-digit with the empty string, only digits remain.
This always comes up in questions and in my own development and I finally tired of tricky ways of doing this. I finally broke down and implemented it as a PostgreSQL extension:
https://github.com/Bjond/pg_natural_sort_order
It's free to use, MIT license.
Basically it just normalizes the numerics (zero pre-pending numerics) within strings such that you can create an index column for full-speed sorting au naturel. The readme explains.
The advantage is you can have a trigger do the work and not your application code. It will be calculated at machine-speed on the PostgreSQL server and migrations adding columns become simple and fast.
you can use just this line
"ORDER BY length(substring(em_code FROM '[0-9]+')), em_code"
I wrote about this in detail in this related question:
Humanized or natural number sorting of mixed word-and-number strings
(I'm posting this answer as a useful cross-reference only, so it's community wiki).
I came up with something slightly different.
The basic idea is to create an array of tuples (integer, string) and then order by these. The magic number 2147483647 is int32_max, used so that strings are sorted after numbers.
ORDER BY ARRAY(
SELECT ROW(
CAST(COALESCE(NULLIF(match[1], ''), '2147483647') AS INTEGER),
match[2]
)
FROM REGEXP_MATCHES(col_to_sort_by, '(\d*)|(\D*)', 'g')
AS match
)
I thought about another way of doing this that uses less db storage than padding and saves time than calculating on the fly.
https://stackoverflow.com/a/47522040/935122
I've also put it on GitHub
https://github.com/ccsalway/dbNaturalSort
The following solution is a combination of various ideas presented in another question, as well as some ideas from the classic solution:
create function natsort(s text) returns text immutable language sql as $$
select string_agg(r[1] || E'\x01' || lpad(r[2], 20, '0'), '')
from regexp_matches(s, '(\D*)(\d*)', 'g') r;
$$;
The design goals of this function were simplicity and pure string operations (no custom types and no arrays), so it can easily be used as a drop-in solution, and is trivial to be indexed over.
Note: If you expect numbers with more than 20 digits, you'll have to replace the hard-coded maximum length 20 in the function with a suitable larger length. Note that this will directly affect the length of the resulting strings, so don't make that value larger than needed.

Eliminate accents of a string in postgresql [duplicate]

In Microsoft SQL Server, it's possible to specify an "accent insensitive" collation (for a database, table or column), which means that it's possible for a query like
SELECT * FROM users WHERE name LIKE 'João'
to find a row with a Joao name.
I know that it's possible to strip accents from strings in PostgreSQL using the unaccent_string contrib function, but I'm wondering if PostgreSQL supports these "accent insensitive" collations so the SELECT above would work.
Update for Postgres 12 or later
Postgres 12 adds nondeterministic ICU collations, enabling case-insensitive and accent-insensitive grouping and ordering. The manual:
ICU locales can only be used if support for ICU was configured when PostgreSQL was built.
If so, this works for you:
CREATE COLLATION ignore_accent (provider = icu, locale = 'und-u-ks-level1-kc-true', deterministic = false);
CREATE INDEX users_name_ignore_accent_idx ON users(name COLLATE ignore_accent);
SELECT * FROM users WHERE name = 'João' COLLATE ignore_accent;
fiddle
Read the manual for details.
This blog post by Laurenz Albe may help to understand.
But ICU collations also have drawbacks. The manual:
[...] they also have some drawbacks. Foremost, their use leads to a
performance penalty. Note, in particular, that B-tree cannot use
deduplication with indexes that use a nondeterministic collation.
Also, certain operations are not possible with nondeterministic
collations, such as pattern matching operations. Therefore, they
should be used only in cases where they are specifically wanted.
My "legacy" solution may still be superior:
For all versions
Use the unaccent module for that - which is completely different from what you are linking to.
unaccent is a text search dictionary that removes accents (diacritic
signs) from lexemes.
Install once per database with:
CREATE EXTENSION unaccent;
If you get an error like:
ERROR: could not open extension control file
"/usr/share/postgresql/<version>/extension/unaccent.control": No such file or directory
Install the contrib package on your database server like instructed in this related answer:
Error when creating unaccent extension on PostgreSQL
Among other things, it provides the function unaccent() you can use with your example (where LIKE seems not needed).
SELECT *
FROM users
WHERE unaccent(name) = unaccent('João');
Index
To use an index for that kind of query, create an index on the expression. However, Postgres only accepts IMMUTABLE functions for indexes. If a function can return a different result for the same input, the index could silently break.
unaccent() only STABLE not IMMUTABLE
Unfortunately, unaccent() is only STABLE, not IMMUTABLE. According to this thread on pgsql-bugs, this is due to three reasons:
It depends on the behavior of a dictionary.
There is no hard-wired connection to this dictionary.
It therefore also depends on the current search_path, which can change easily.
Some tutorials on the web instruct to just alter the function volatility to IMMUTABLE. This brute-force method can break under certain conditions.
Others suggest a simple IMMUTABLE wrapper function (like I did myself in the past).
There is an ongoing debate whether to make the variant with two parameters IMMUTABLE which declares the used dictionary explicitly. Read here or here.
Another alternative would be this module with an IMMUTABLE unaccent() function by Musicbrainz, provided on Github. Haven't tested it myself. I think I have come up with a better idea:
Best for now
This approach is more efficient than other solutions floating around, and safer.
Create an IMMUTABLE SQL wrapper function executing the two-parameter form with hard-wired, schema-qualified function and dictionary.
Since nesting a non-immutable function would disable function inlining, base it on a copy of the C-function, (fake) declared IMMUTABLE as well. Its only purpose is to be used in the SQL function wrapper. Not meant to be used on its own.
The sophistication is needed as there is no way to hard-wire the dictionary in the declaration of the C function. (Would require to hack the C code itself.) The SQL wrapper function does that and allows both function inlining and expression indexes.
CREATE OR REPLACE FUNCTION public.immutable_unaccent(regdictionary, text)
RETURNS text
LANGUAGE c IMMUTABLE PARALLEL SAFE STRICT AS
'$libdir/unaccent', 'unaccent_dict';
Then:
CREATE OR REPLACE FUNCTION public.f_unaccent(text)
RETURNS text
LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT AS
$func$
SELECT public.immutable_unaccent(regdictionary 'public.unaccent', $1)
$func$;
In Postgres 14 or later, an SQL-standard function is slightly cheaper, yet:
CREATE OR REPLACE FUNCTION public.f_unaccent(text)
RETURNS text
LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT
BEGIN ATOMIC
SELECT public.immutable_unaccent(regdictionary 'public.unaccent', $1);
END;
See:
What does BEGIN ATOMIC mean in a PostgreSQL SQL function / procedure?
Drop PARALLEL SAFE from both functions for Postgres 9.5 or older.
public being the schema where you installed the extension (public is the default).
The explicit type declaration (regdictionary) defends against hypothetical attacks with overloaded variants of the function by malicious users.
Previously, I advocated a wrapper function based on the STABLE function unaccent() shipped with the unaccent module. That disabled function inlining. This version executes ten times faster than the simple wrapper function I had here earlier.
And that was already twice as fast as the first version which added SET search_path = public, pg_temp to the function - until I discovered that the dictionary can be schema-qualified, too. Still (Postgres 12) not too obvious from documentation.
If you lack the necessary privileges to create C functions, you are back to the second best implementation: An IMMUTABLE function wrapper around the STABLE unaccent() function provided by the module:
CREATE OR REPLACE FUNCTION public.f_unaccent(text)
RETURNS text
LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT AS
$func$
SELECT public.unaccent('public.unaccent', $1) -- schema-qualify function and dictionary
$func$;
Finally, the expression index to make queries fast:
CREATE INDEX users_unaccent_name_idx ON users(public.f_unaccent(name));
Remember to recreate indexes involving this function after any change to function or dictionary, like an in-place major release upgrade that would not recreate indexes. Recent major releases all had updates for the unaccent module.
Adapt queries to match the index (so the query planner will use it):
SELECT * FROM users
WHERE f_unaccent(name) = f_unaccent('João');
We don't need the function in the expression to the right of the operator. There we can also supply unaccented strings like 'Joao' directly.
The faster function does not translate to much faster queries using the expression index. Index look-ups operate on pre-computed values and are very fast either way. But index maintenance and queries not using the index benefit. And access methods like bitmap index scans may have to recheck values in the heap (the main relation), which involves executing the underlying function. See:
"Recheck Cond:" line in query plans with a bitmap index scan
Security for client programs has been tightened with Postgres 10.3 / 9.6.8 etc. You need to schema-qualify function and dictionary name as demonstrated when used in any indexes. See:
'text search dictionary “unaccent” does not exist' entries in postgres log, supposedly during automatic analyze
Ligatures
In Postgres 9.5 or older ligatures like 'Œ' or 'ß' have to be expanded manually (if you need that), since unaccent() always substitutes a single letter:
SELECT unaccent('Œ Æ œ æ ß');
unaccent
----------
E A e a S
You will love this update to unaccent in Postgres 9.6:
Extend contrib/unaccent's standard unaccent.rules file to handle all
diacritics known to Unicode, and expand ligatures correctly (Thomas
Munro, Léonard Benedetti)
Bold emphasis mine. Now we get:
SELECT unaccent('Œ Æ œ æ ß');
unaccent
----------
OE AE oe ae ss
Pattern matching
For LIKE or ILIKE with arbitrary patterns, combine this with the module pg_trgm in PostgreSQL 9.1 or later. Create a trigram GIN (typically preferable) or GIST expression index. Example for GIN:
CREATE INDEX users_unaccent_name_trgm_idx ON users
USING gin (f_unaccent(name) gin_trgm_ops);
Can be used for queries like:
SELECT * FROM users
WHERE f_unaccent(name) LIKE ('%' || f_unaccent('João') || '%');
GIN and GIST indexes are more expensive (to maintain) than plain B-tree:
Difference between GiST and GIN index
There are simpler solutions for just left-anchored patterns. More about pattern matching and performance:
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
pg_trgm also provides useful operators for "similarity" (%) and "distance" (<->).
Trigram indexes also support simple regular expressions with ~ et al. and case insensitive pattern matching with ILIKE:
PostgreSQL accent + case insensitive search
No, PostgreSQL does not support collations in that sense
PostgreSQL does not support collations like that (accent insensitive or not) because no comparison can return equal unless things are binary-equal. This is because internally it would introduce a lot of complexities for things like a hash index. For this reason collations in their strictest sense only affect ordering and not equality.
Workarounds
Full-Text-Search Dictionary that Unaccents lexemes.
For FTS, you can define your own dictionary using unaccent,
CREATE EXTENSION unaccent;
CREATE TEXT SEARCH CONFIGURATION mydict ( COPY = simple );
ALTER TEXT SEARCH CONFIGURATION mydict
ALTER MAPPING FOR hword, hword_part, word
WITH unaccent, simple;
Which you can then index with a functional index,
-- Just some sample data...
CREATE TABLE myTable ( myCol )
AS VALUES ('fóó bar baz'),('qux quz');
-- No index required, but feel free to create one
CREATE INDEX ON myTable
USING GIST (to_tsvector('mydict', myCol));
You can now query it very simply
SELECT *
FROM myTable
WHERE to_tsvector('mydict', myCol) ## 'foo & bar'
mycol
-------------
fóó bar baz
(1 row)
See also
Creating a case-insensitive and accent/diacritics insensitive search on a field
Unaccent by itself.
The unaccent module can also be used by itself without FTS-integration, for that check out Erwin's answer
I'm pretty sure PostgreSQL relies on the underlying operating system for collation. It does support creating new collations, and customizing collations. I'm not sure how much work that might be for you, though. (Could be quite a lot.)

postgres store with composite value type, or a better way of attributing an inverted index

can't seem to figure out the syntax for populating a hstore with a value of composite type -- note: I do not want to convert a record to a hstore.
select hstore('hello => ROW(1,2)');
I know it's something simple; However, google is not my friend today.
use case : custom inverted index.
The data is modelling an inverted index of lexemes, the composite data types are various probabilities related to the lexemes which I will use to implement document clustering. Does anyone know a better way of doing this ? I'm open to using an external system if it allows attaching attributes to key->posting pairs in the inverted index.
I'd use something external if it had solid support for what I am trying to do, I suspect that sticking 3-10k lexemes per tuple and then doing batch processing on them is gonna be nasty as the whole hstore will have to be parsed and converted.
At the moment my lexemes are in the range of 50-1k per tuple, mostly in the low hundreds, and i'm just doing it for my research algorithms. But there has to be a better way of doing this.
Strings in hstore are double-quoted. hstore only supports text values, nothing else, so you must store other types as their text representations:
SELECT hstore('k => "(1,2)"');
eg:
regress=> SELECT (hstore('k => "(1,2)"')) -> 'k';
?column?
----------
(1,2)
(1 row)
This means you have to cast the values to use them, eg:
regress=> CREATE TYPE pair AS (a integer, b integer);
CREATE TYPE
regress=> SELECT ((hstore('k => "(1,2)"')) -> 'k')::pair;
pair
-------
(1,2)
(1 row)
or using arrays instead to avoid the composite type:
regress=> SELECT ((hstore('k => "{1,2}"')) -> 'k')::integer[];
int4
-------
{1,2}
(1 row)
Arrays are indexed from 1 with the [] operator, eg [1].
This is generally horrendously inefficient because the values must be parsed and converted every single time. It's not really practical to suggest alternatives when you haven't said much about the nature of your problem and why you want hstore in the first place.

How to make fulltext search in PostgreSQL useful?

I have a russian dictionary in postgresql 8.4.
I try to use full search but have some troubles.
I dont get any results becouse try to find words by 4-5symbols. For example:
select * from parcels_temp where name_dispatcher ## to_tsquery('Нику');
Get result: 0 rows.
select * from parcels_temp where name_dispatcher ## to_tsquery('Никуд');
Get result: 2 rows. its correct.
I try to do search by words not contained in dictionary. What i gonna do in this case? How can i update dictionary in PostgreSQL?
Its must create column to tsvector or i can use to_tsvector function i queries? Or its more slowly?
Postgresql indexer does not process strings shorter than 3 characters. It is done to reduce index size and generation time.
Obviously, you get nothing. There is way to update dictionary, refer to documentation.
Use GIN index on tsvector field.
1: By default, postgreSQL only matches the whole word. You can opt in to do a prefix matching:
select * from parcels_temp where name_dispatcher ## to_tsquery('Нику:*');
See: https://www.postgresql.org/docs/9.5/static/textsearch-controls.html
3: You can create a column and GIN index on it, or you can just create the index without creating a column. I think they are the same in terms of performance. For details, see:
https://www.postgresql.org/docs/9.5/static/textsearch-tables.html

How to make "case-insensitive" query in Postgresql?

Is there any way to write case-insensitive queries in PostgreSQL, E.g. I want that following 3 queries return same result.
SELECT id FROM groups where name='administrator'
SELECT id FROM groups where name='ADMINISTRATOR'
SELECT id FROM groups where name='Administrator'
Use LOWER function to convert the strings to lower case before comparing.
Try this:
SELECT id
FROM groups
WHERE LOWER(name)=LOWER('Administrator')
using ILIKE instead of LIKE
SELECT id FROM groups WHERE name ILIKE 'Administrator'
The most common approach is to either lowercase or uppercase the search string and the data. But there are two problems with that.
It works in English, but not in all languages. (Maybe not even in
most languages.) Not every lowercase letter has a corresponding
uppercase letter; not every uppercase letter has a corresponding
lowercase letter.
Using functions like lower() and upper() will give you a sequential
scan. It can't use indexes. On my test system, using lower() takes
about 2000 times longer than a query that can use an index. (Test data has a little over 100k rows.)
There are at least three less frequently used solutions that might be more effective.
Use the citext module, which mostly mimics the behavior of a case-insensitive data type. Having loaded that module, you can create a case-insensitive index by CREATE INDEX ON groups (name::citext);. (But see below.)
Use a case-insensitive collation. This is set when you initialize a
database. Using a case-insensitive collation means you can accept
just about any format from client code, and you'll still return
useful results. (It also means you can't do case-sensitive queries. Duh.)
Create a functional index. Create a lowercase index by using CREATE
INDEX ON groups (LOWER(name));. Having done that, you can take advantage
of the index with queries like SELECT id FROM groups WHERE LOWER(name) = LOWER('ADMINISTRATOR');, or SELECT id FROM groups WHERE LOWER(name) = 'administrator'; You have to remember to use LOWER(), though.
The citext module doesn't provide a true case-insensitive data type. Instead, it behaves as if each string were lowercased. That is, it behaves as if you had called lower() on each string, as in number 3 above. The advantage is that programmers don't have to remember to lowercase strings. But you need to read the sections "String Comparison Behavior" and "Limitations" in the docs before you decide to use citext.
You can use ILIKE. i.e.
SELECT id FROM groups where name ILIKE 'administrator'
You can also read up on the ILIKE keyword. It can be quite useful at times, albeit it does not conform to the SQL standard. See here for more information: http://www.postgresql.org/docs/9.2/static/functions-matching.html
You could also use POSIX regular expressions, like
SELECT id FROM groups where name ~* 'administrator'
SELECT 'asd' ~* 'AsD' returns t
use ILIKE
select id from groups where name ILIKE 'adminstration';
If your coming the expressjs background and name is a variable
use
select id from groups where name ILIKE $1;
Using ~* can improve greatly on performance, with functionality of INSTR.
SELECT id FROM groups WHERE name ~* 'adm'
return rows with name that contains OR equals to 'adm'.
ILIKE work in this case:
SELECT id
FROM groups
WHERE name ILIKE 'Administrator'
For a case-insensitive parameterized query, you can use the following syntax:
"select * from article where upper(content) LIKE upper('%' || $1 || '%')"
-- Install 'Case Ignore Test Extension'
create extension citext;
-- Make a request
select 'Thomas'::citext in ('thomas', 'tiago');
select name from users where name::citext in ('thomas', 'tiago');
If you want not only upper/lower case but also diacritics, you can implement your own func:
CREATE EXTENSION unaccent;
CREATE OR REPLACE FUNCTION lower_unaccent(input text)
RETURNS text
LANGUAGE plpgsql
AS $function$
BEGIN
return lower(unaccent(input));
END;
$function$;
Call is then
select lower_unaccent('Hôtel')
>> 'hotel'
A tested approach is using ~*
As in the example below
SELECT id FROM groups WHERE name ~* 'administrator'
select id from groups where name in ('administrator', 'ADMINISTRATOR', 'Administrator')