How to index a postgres table by name, when the name can be in any language? - postgresql

I have a large postgres table of locations (shops, landmarks, etc.) which the user can search in various ways. When the user wants to do a search for the name of a place, the system currently does (assuming the search is on cafe):
lower(location_name) LIKE '%cafe%'
as part of the query. This is hugely inefficient. Prohibitively so. It is essential I make this faster. I've tried indexing the table on
gin(to_tsvector('simple', location_name))
and searching with
(to_tsvector('simple',location_name) @@ to_tsquery('simple','cafe'))
which works beautifully, and cuts down the search time by a couple of orders of magnitude.
However, the location names can be in any language, including languages like Chinese, which aren't whitespace delimited. This new system is unable to find any Chinese locations, unless I search for the exact name, whereas the old system could find matches to partial names just fine.
So, my question is: Can I get this to work for all languages at once, or am I on the wrong track?

If you want to optimize arbitrary substring matches, one option is to use the pg_trgm module. Add an index:
CREATE INDEX table_location_name_trigrams_key ON table
USING gin (location_name gin_trgm_ops);
This will break "Simple Cafe" into "sim", "imp", "mpl", etc., and add an entry to the index for each trigram in each row. The query planner can then automatically use this index for substring pattern matches, including:
SELECT * FROM table WHERE location_name ILIKE '%cafe%';
This query will look up "caf" and "afe" in the index, find the intersection, fetch those rows, then check each row against your pattern. (That last check is necessary since the intersection of "caf" and "afe" matches both "simple cafe" and "unsafe scaffolding", while "%cafe%" should only match one). The index becomes more effective as the input pattern gets longer since it can exclude more rows, but it's still not as efficient as indexing whole words, so don't expect a performance improvement over to_tsvector.
The catch is that trigrams don't work at all for patterns that are under three characters. That may or may not be a deal-breaker for your application.
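To make the lookup, intersect and recheck steps concrete, here is a minimal Python sketch of the idea. It is a toy model under stated assumptions, not how pg_trgm is implemented: the names trigrams, build_index and search are made up for illustration, and it skips the word padding the real module applies.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of lowercase 3-character substrings of text."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def build_index(rows):
    """Map each trigram to the set of row ids that contain it."""
    index = defaultdict(set)
    for row_id, name in enumerate(rows):
        for gram in trigrams(name):
            index[gram].add(row_id)
    return index


def search(rows, index, needle):
    """Emulate ILIKE '%needle%': index lookup first, then recheck each row."""
    grams = trigrams(needle)
    if grams:
        # Rows must contain every trigram of the pattern.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # Pattern under three characters: the index cannot narrow anything.
        candidates = set(range(len(rows)))
    # The recheck removes false positives such as "unsafe scaffolding".
    hits = [rows[i] for i in sorted(candidates) if needle.lower() in rows[i].lower()]
    return candidates, hits


rows = ["Simple Cafe", "Unsafe Scaffolding", "Bookshop"]
index = build_index(rows)
candidates, hits = search(rows, index, "cafe")
print(candidates, hits)
```

With these three rows, both "Simple Cafe" and "Unsafe Scaffolding" survive the index step, and only the recheck tells them apart. A two-character pattern such as "fe" falls into the else branch and every row becomes a candidate.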
Edit: I initially added this as a comment.
I had another thought last night when I was mostly asleep. Make a cjk_chars function that takes an input string, regexp_matches the entire CJK Unicode ranges, and returns an array of any such characters or NULL if none. Add a GIN index on cjk_chars(location_name). Then query for:
WHERE CASE
WHEN cjk_chars('query') IS NOT NULL THEN
cjk_chars(location_name) @> cjk_chars('query')
AND location_name LIKE '%query%'
ELSE
<tsvector/trigrams>
END
Ta-da, unigrams!
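For what it's worth, the logic of that CASE expression can be sketched in Python as below. This is only an illustration of the idea: the character ranges (Hiragana, Katakana and the main CJK Unified Ideographs block) are my assumption and are not a complete list of CJK ranges, and the function matches is a hypothetical helper, not something Postgres provides.

```python
import re

# Assumption: Hiragana/Katakana plus the main CJK Unified Ideographs block.
# A real cjk_chars function would likely need more ranges than this.
CJK = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")


def cjk_chars(text):
    """Return the set of CJK characters in text, or None if there are none."""
    found = set(CJK.findall(text))
    return found or None


def matches(location_name, query):
    """Return True/False for CJK queries, None when another strategy applies."""
    wanted = cjk_chars(query)
    if wanted is None:
        return None  # fall back to the tsvector / trigram path
    have = cjk_chars(location_name) or set()
    # Containment is the cheap, indexable filter (the @> test);
    # the substring test is the recheck (the LIKE '%query%' part).
    return wanted <= have and query in location_name


print(matches("北京咖啡馆", "咖啡"))
print(matches("啡咖", "咖啡"))
print(matches("Simple Cafe", "cafe"))
```

The second call shows why the LIKE recheck is still needed: the characters are all present, but not in the right order.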

For full text search in a multi-language environment you need to store the language each datum is in alongside the text itself. You can then use the language-specific flavours of the tsearch functions to get proper stemming, etc.
eg given:
CREATE TABLE location(
location_name text,
location_name_language text
);
... plus any appropriate constraints, you might write:
CREATE INDEX location_name_ts_idx ON location
USING gin(to_tsvector(location_name_language, location_name));
and for search:
SELECT * FROM location WHERE to_tsvector(location_name_language, location_name) @@ to_tsquery('english', 'cafe');
Cross-language searches will be problematic no matter what you do. In practice I'd use multiple matching strategies: I'd compare the search term to the tsvector of location_name in the simple configuration and the stored language of the text. I'd possibly also use a trigram based approach like willglynn suggests, then I'd unify the results for display, looking for common terms.
It's possible you may find Pg's fulltext search too limited, in which case you might want to check out something like Lucene / Solr.
See:
* controlling full text search.
* tsearch dictionaries

Similar to what @willglynn already posted, I would consider the pg_trgm module. But preferably with a GiST index:
CREATE INDEX tbl_location_name_trgm_idx ON tbl
USING gist(location_name COLLATE "C" gist_trgm_ops);
The gist_trgm_ops operator class ignores case generally, and ILIKE is just as fast as LIKE. Quoting the source code:
Caution: IGNORECASE macro means that trigrams are case-insensitive.
I use COLLATE "C" here, which is effectively no special collation (byte order instead), because you obviously have a mix of various collations in your column. Collation is relevant for ordering or ranges; for a basic similarity search, you can do without it. I would consider setting COLLATE "C" for your column to begin with.
This index would lend support to your first, simple form of the query:
SELECT * FROM tbl WHERE location_name ILIKE '%cafe%';
Very fast.
Retains capability to find partial matches.
Adds capability for fuzzy search.
Check out the % operator and set_limit().
GiST index is also very fast for queries with LIMIT n to select n "best" matches. You could add to the above query:
ORDER BY location_name <-> 'cafe'
LIMIT 20
Read more about the "distance" operator <-> in the manual here.
Or even:
SELECT *
FROM tbl
WHERE location_name ILIKE '%cafe%' -- exact partial match
OR location_name % 'cafe' -- fuzzy match
ORDER BY
(location_name ILIKE 'cafe%') DESC -- exact beginning first
,(location_name ILIKE '%cafe%') DESC -- exact partial match next
,(location_name <-> 'cafe') -- then "best" matches
,location_name -- break remaining ties (collation!)
LIMIT 20;
I use something like that in several applications for (to me) satisfactory results. Of course, it gets a bit slower with multiple features applied in combination. Find your sweet spot ...
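To show what that ORDER BY is doing, here is a rough Python emulation of the same ranking. It is a sketch under assumptions: the trigram extraction only approximates pg_trgm (lowercase, split into words, pad each word with two leading spaces and one trailing space), the 0.3 threshold is the module's default, and the function names are made up.

```python
import re


def trigrams(text):
    """Approximate pg_trgm: per-word trigrams with '  word ' padding."""
    grams = set()
    for word in re.findall(r"\w+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def search(names, query, limit=20, threshold=0.3):
    q = query.lower()
    # WHERE: exact partial match OR fuzzy match
    hits = [n for n in names if q in n.lower() or similarity(n, query) > threshold]
    # ORDER BY: prefix matches, then partial matches, then distance, then name.
    # False sorts before True, so the "not" mirrors the DESC in the SQL.
    hits.sort(key=lambda n: (
        not n.lower().startswith(q),
        q not in n.lower(),
        1.0 - similarity(n, query),  # the <-> "distance"
        n,
    ))
    return hits[:limit]


names = ["Bookshop", "Caffe Nero", "Simple Cafe", "Cafe Roma"]
print(search(names, "cafe"))
```

Here "Cafe Roma" comes first as a prefix match, "Simple Cafe" second as a partial match, and "Caffe Nero" only gets in through the fuzzy branch.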
You could go one step further and create a separate partial index for every language and use a matching collation for each:
CREATE INDEX location_name_trgm_idx ON tbl
USING gist(location_name COLLATE "de_DE" gist_trgm_ops)
WHERE location_name_language = 'German';
-- repeat for each language
That would only be useful if you want results of one specific language per query, and it would be very fast in that case.

Related

Efficient way to find ordered string's exact, prefix and postfix match in PostgreSQL

Given a table name table and a string column named column, I want to search for the word word in that column in the following way: exact matches be on top, followed by prefix matches and finally postfix matches.
Currently I got the following solutions:
Solution 1:
select column
from (select column,
case
when column like 'word' then 1
when column like 'word%' then 2
when column like '%word' then 3
end as rank
from table) as ranked
where rank is not null
order by rank;
Solution 2:
select column
from table
where column like 'word'
or column like 'word%'
or column like '%word'
order by case
when column like 'word' then 1
when column like 'word%' then 2
when column like '%word' then 3
end;
Now my question is: which of the two solutions is more efficient? Or better yet, is there a solution better than both of them?
Your 2nd solution looks simpler for the planner to optimize, but it is possible that the first one gets the same plan as well.
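In the WHERE clause, column like 'word' is not needed, as it is already covered by column like 'word%'; keeping it might make the database do two checks instead of one.
But the biggest problem is the third condition, column like '%word', as a plain index has no way to optimize a pattern with a leading wildcard.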
So either way, PostgreSQL is going to scan your full table and manually extract the matches. This is going to be slow for 20,000 rows or more.
I recommend you to explore fuzzy string matching and full text search; looks like that is what you're trying to emulate.
Even if you don't want the full power of FTS or fuzzy string matching, you should definitely add the extension "pg_trgm", as it will enable you to add a GIN index on the column that will speed up LIKE '%word' searches.
https://www.postgresql.org/docs/current/pgtrgm.html
And seriously, have a look at FTS. It does provide ranking. If your requirements are strictly what you described, you can still perform the FTS query to "prefilter" and then apply this logic afterwards.
There are tons of introduction articles to PostgreSQL FTS, here's one:
https://www.compose.com/articles/mastering-postgresql-tools-full-text-search-and-phrase-search/
And even I wrote a post recently when I added FTS search to my site:
https://deavid.wordpress.com/2019/05/28/sedice-adding-fts-with-postgresql-was-really-easy/

Convert to SARGable query

I want to write a query that searches the table for rows containing a given string.
Table:
Create table tbl_sarg
(
colname varchar(100),
coladdres varchar(500)
);
Note: I just want to use an Index Seek for searching 300 million records.
Index:
create nonclustered index ncidx_colname on tbl_sarg(colname);
Sample Records:
insert into tbl_sarg values('John A Mak','HNo 102 Street Road Uk');
insert into tbl_sarg values('Shawn A Meben','Church road USA');
insert into tbl_sarg values('Lee Decose','ShopNo 22 K Mark UK');
insert into tbl_sarg values('James Don','A Mall, 90 feet road UAE');
Query 1:
select * from tbl_sarg
where colname like '%ee%'
Actual Execution Plan:
Query 2:
select * from tbl_sarg
where charindex('ee',colname)>0
Actual Execution Plan:
Query 3:
select * from tbl_sarg
where patindex('%ee%',colname)>0
Actual Execution Plan:
How can I force the query processor to use an index seek instead of a table/index scan on a large data set?
All the queries that you have posted are, by definition, not SARGable. For instance, a pattern of the form '%...%' automatically forces the query engine to do a scan; the other case is the use of functions (such as charindex or patindex) on your column inside a predicate.
Here is a post on the topic: https://bertwagner.com/2017/08/22/how-to-search-and-destroy-non-sargable-queries-on-your-server/
Kimberly Tripp has written very interesting articles about it. If it is mandatory for you to execute this kind of query with wildcards, it may be worth checking whether you can use the Full Text Search feature. My point is: either limit yourself to precise predicates in your queries, or change strategy. And before I forget: don't try to force a seek with a hint. I can't see that medicine being better than the illness.
A search argument, or SARG in short, is a filter predicate that enables the optimizer to rely on
index order. The filter predicate uses the following form (or a variant with two delimiters of a
range, or with the operand positions flipped):
WHERE <column> <operator> <expression>
Such a filter is sargable if:
You don’t apply manipulation to the filtered column.
The operator identifies a consecutive range of qualifying rows in the index. That’s the
case with operators like =, >, >=, <, <=, BETWEEN, LIKE with a known prefix, and so on.
That’s not the case with operators like <>, LIKE with a wildcard as a prefix.
In most cases, when you apply manipulation to the filtered column, the optimizer doesn’t
try to be too smart and understand the meaning of the calculation, and if index ordering
can still be relied on. It simply assumes that the result values might sort differently than the
source values, and therefore index ordering can’t be trusted.
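The difference between a seek and a scan can be modelled with a sorted Python list standing in for the index. This is just a sketch of the principle, not SQL Server's implementation; the function names and the extra sample names are made up for illustration.

```python
from bisect import bisect_left

# A sorted list plays the role of the index on colname.
names = sorted([
    "Aleen", "Anne Lee", "James Don", "John A Mak",
    "Lee Decose", "Lee Yung", "Shawn A Meben",
])


def prefix_seek(sorted_names, prefix):
    """LIKE 'prefix%': two binary searches bound one consecutive range."""
    lo = bisect_left(sorted_names, prefix)
    hi = bisect_left(sorted_names, prefix + "\U0010ffff")
    return sorted_names[lo:hi]


def substring_scan(sorted_names, needle):
    """LIKE '%needle%': the sort order does not help, every entry is checked."""
    return [n for n in sorted_names if needle in n]


print(prefix_seek(names, "Lee"))
print(substring_scan(names, "ee"))
```

The prefix search only touches the two adjacent "Lee ..." entries, while the matches for "ee" are scattered across the list, which is why the whole thing has to be read.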
So why doesn’t SQL Server use the index for the %ee% query? Pretend for a moment that you held a phone book in your hand, and I asked you to find everyone whose last name contains the letters %ee%. You would have to scan every single page in the phone book, because the results would include things like:
Anne Lee
Lee Yung
Kathlee
Aleen
When I asked you for all last names containing %ee% anywhere in the name, my query was not sargable – meaning, you couldn’t leverage the indexes to do an index seek.
That’s where SQL Server’s Full Text Search comes in.

fuzzy match in postgresql

I have two tables in my database, agridata and geoname. I am trying to find the geonameid column for the names in agridata, like below:
select geonameid , name from geoname where name in (select distinct district_name from agridata );
I want to do a fuzzy match of the names, as the exact names are not in the database. How do I go about it?
You can use a variety of matching algorithms (see here), but I'm not 100% sure they will work with an in clause. I'd imagine you really want to use a soundex join e.g.
select distinct g.geonameid, g.name from geoname g join agridata a on soundex(a.district_name) = soundex(g.name)
or similar.
If you've got a huge match set to deal with, you may want to consider using some kind of search index such as ElasticSearch/Solr.
Use extension for PostgreSQL called pg_trgm, implementation of trigram matching.
"We can measure the similarity of two strings by counting the number of trigrams they share. This simple idea turns out to be very effective for measuring the similarity of words in many natural languages"
I used it, it's very fast and gives great results.

Create index on first 3 characters (area code) of phone field?

I have a Postgres table with a phone field stored as varchar(10), but we search on the area code frequently, e.g.:
select * from bus_t where bus_phone like '555%'
I wanted to create an index to facilitate with these searches, but I got an error when trying:
CREATE INDEX bus_ph_3 ON bus_t USING btree (bus_phone::varchar(3));
ERROR: 42601: syntax error at or near "::"
My first question is, how do I accomplish this, but also I am wondering if it makes sense to index on the first X characters of a field or if indexing on the entire field is just as effective.
Actually, a plain B-tree index is normally useless for pattern matching with LIKE (~~) or regexp (~), even with left-anchored patterns, if your installation runs on any other locale than "C", which is the typical case. Here is an overview over pattern matching and indices in a related answer on dba.SE
Create an index with the varchar_pattern_ops operator class (matching your varchar column) and be sure to read the chapter on operator classes in the manual.
CREATE INDEX bus_ph_pattern_ops_idx ON bus_t (bus_phone varchar_pattern_ops);
Your original query can use this index:
... WHERE bus_phone LIKE '555%'
Performance of a functional index on the first 3 characters as described in the answer by @a_horse is pretty much the same in this case.
Generally a functional index on relevant leading characters would be a good idea, but your column has only 10 characters. Consider that the overhead per tuple is already 28 bytes. Saving 7 bytes is just not substantial enough to make a big difference. Add the cost for the function call and the fact that xxx_pattern_ops are generally a bit faster.
In Postgres 9.2 or later the index on the full column can also serve as covering index in index-only scans.
However, the more characters in the columns, the bigger the benefit from a functional index.
You may even have to resort to a prefix index (or some other kind of hash) if the strings get too long. There is a maximum length for indices.
If you decide to go with the functional index, consider using the xxx_pattern_ops variant for a small additional performance benefit. Be sure to read about the pros and cons in the manual and in Peter Eisentraut's blog entry:
CREATE INDEX bus_ph_3 ON bus_t (left(bus_phone, 3) varchar_pattern_ops);
Explain error message
You'd have to use the standard SQL cast syntax for functional indices. This would work - pretty much like the one with left(), but like @a_horse I'd prefer left().
CREATE INDEX bus_ph_3 ON bus_t USING btree (cast(bus_phone AS varchar(3)));
When using like '555%' an index on the complete column will be used just as well. There is no need to only index the first three characters.
If you do want to index only the first 3 characters (e.g. to save space), then you could use the left() function:
CREATE INDEX bus_ph_3 ON bus_t USING btree (left(bus_phone,3));
But in order for that index to be used, you would need to use that expression in your where clause:
where left(bus_phone,3) = '555';
But again: that is most probably overkill and the index on the complete column will be good enough and can be used for other queries as well e.g. bus_phone = '555-1234' which the index on just the first three characters would not.

PostgreSQL Full Text Search and Trigram Confusion

I'm a little bit confused with the whole concept of PostgreSQL, full text search and Trigram. In my full text search queries, I'm using tsvectors, like so:
SELECT * FROM articles
WHERE search_vector @@ plainto_tsquery('english', 'cat, bat, rat');
The problem is, this method doesn't account for misspelling. Then I started to read about Trigram and pg_trgm:
Looking through other examples, it seems like trigram is used or vectors are used, but never both. So my questions are: Are they ever used together? If so, how? Does trigram replace full text? Are trigrams more accurate? And how are trigrams on performance?
They serve very different purposes.
Full Text Search is used to return documents that match a search query of stemmed words.
Trigrams give you a method for comparing two strings and determining how similar they look.
Consider the following examples:
SELECT 'cat' % 'cats'; --true
The above returns true because 'cat' is quite similar to 'cats' (as dictated by the pg_trgm limit).
SELECT 'there is a cat with a dog' % 'cats'; --false
The above returns false because % is looking for similarity between the two entire strings, not looking for the word cats within the string.
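If you want to see where those two results come from, the calculation can be approximated in a few lines of Python. Treat this as a sketch under assumptions rather than pg_trgm's real code: it lowercases, splits on non-word characters, pads each word with two leading spaces and one trailing space, and compares against the default limit of 0.3. The function names are made up.

```python
import re

THRESHOLD = 0.3  # pg_trgm's default similarity limit


def trigrams(text):
    """Per-word trigrams, each word padded as '  word '."""
    grams = set()
    for word in re.findall(r"\w+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Number of shared trigrams divided by number of distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_similar(a, b):
    """Rough stand-in for the % operator."""
    return similarity(a, b) > THRESHOLD


print(similarity("cat", "cats"))
print(is_similar("cat", "cats"))
print(is_similar("there is a cat with a dog", "cats"))
```

'cat' and 'cats' share three of their six distinct trigrams, so the similarity is 0.5. The long sentence shares the same three trigrams with 'cats', but they are diluted by all the trigrams of the other words, so the score falls well below the limit.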
SELECT to_tsvector('there is a cat with a dog') @@ to_tsquery('cats'); --true
This returns true because tsvector transformed the string into a list of stemmed words and ignored a bunch of common words (stop words - like 'is' & 'a')... then searched for the stemmed version of cats.
It sounds like you want to use trigrams to auto-correct your ts_query but that is not really possible (not in any efficient way anyway). They do not really know a word is misspelt, just how similar it might be to another word. They could be used to search a table of words to try and find similar words, allowing you to implement a "did you mean..." type feature, but this would require maintaining a separate table containing all the words used in your search field.
If you have some commonly misspelt words/phrases that you want the text-index to match you might want to look at synonym dictionaries.