Fuzzy search on a large table - PostgreSQL

I have a very large PostgreSQL table with 12M names and I would like to show an autocomplete. Previously I used an ILIKE 'someth%' clause, but I'm not really satisfied with it: it doesn't sort by similarity, and any spelling error produces wrong or no results. The field is a string, usually one or two words (in any language). I need a fast response because the suggestions are shown live while the user is typing (i.e. autocomplete). I cannot restrict the fuzzy match to a subset because all names are equally important. I can also say that most names are distinct.
I have tried pg_trgm, but even with a GIN index it is very slow. Searching for a name similar to 'html' takes a few milliseconds, but - don't ask me why - other searches like 'htm' take many seconds, e.g. 25 seconds. Other people have also reported performance issues with pg_trgm on large tables.
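For reference, the attempted setup was roughly the following ("names" and "name" are simplified placeholders for the real table and column):

-- rough sketch of the pg_trgm attempt; "names" and "name" are placeholder identifiers
create extension if not exists pg_trgm;
create index names_name_trgm_idx on names using gin (name gin_trgm_ops);

-- similarity search, ordered by closeness to the typed text
select name, similarity(name, 'htm') as sml
from names
where name % 'htm'
order by sml desc
limit 10;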
Is there anything I can do to efficiently show an autocomplete on that field?
Would a full text search engine (e.g. Lucene, Solr) be an appropriate solution, or would I encounter the same inefficiency?

Related

PostgreSQL - performance when searching a phone number column using the LIKE operator

I have a phone number column in my database that could potentially have somewhere close to 50 million records.
As the phone numbers are stored with the country code, I am a bit confused about how to implement the search functionality.
Options I have in mind:
When the user puts in a phone number to search, use the LIKE operator to find the right phone number. [Does using the LIKE operator slow down the search?]
Split the phone number column into two: one with just the area code and the other with the phone number. [The reason I am looking into this implementation is that I don't have to use the LIKE operator here.]
Please suggest any other ideas! People here who have really good experience with Postgres, please chime in with the best practices.
Since they are stored with a country code, you can just include the country code when you search for them. That should be by far the most performant approach. If you know what country each person is in, or if your user base is predominantly from one country, you could just add the code to "short" numbers in order to complete them.
If LIKE is too slow (at 50 million rows it probably would be), you can put a pg_trgm index on the column. You will probably need to remove, or at least standardize, the punctuation in both the data and the query, or it could cause problems with LIKE (as well as with every other method).
The problem I see with making two columns - country code (plus area code? I would expect that to go in the other column) and one column for the main body of the number - is that it probably wouldn't do what people want. People are either going to expect partial matching on however many digits they feel like typing, meaning you would still need LIKE, or people who type in the full number (minus country code) are going to expect it to find only numbers in "their" country. On the other hand, splitting off the country code from the main body of the number might avoid having an extremely common country code pollute any pg_trgm indexes you build with low-selectivity trigrams.
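If you do go the pg_trgm route, a minimal sketch could look like the following, with the punctuation standardized on both sides as suggested above ("contacts" and "phone" are made-up names):

create extension if not exists pg_trgm;

-- expression index over the normalized number; "contacts" and "phone" are hypothetical names
create index contacts_phone_trgm_idx
  on contacts using gin (regexp_replace(phone, '[^0-9+]', '', 'g') gin_trgm_ops);

-- partial match on whatever digits the user typed, normalized the same way
select *
from contacts
where regexp_replace(phone, '[^0-9+]', '', 'g') like '%' || '5551234' || '%';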

PostgreSQL Text Search Performance

I have been looking into text search (without tsvector) on a varchar field (roughly 10 to 400 characters) that has the following format:
field,field_a,field_b,field_c,...,field_n
The query I am planning to run is probably similar to:
select * from information_table where fields like '%field_x%'
As there are no spaces in fields, I wonder if there are some performance issues if I run the search across 500k+ rows.
Any insights into this?
Is there any documentation around the performance of varchar and maybe varchar indexes?
I am not sure if tsvector will work on a full string without spaces. What do you think about this solution? Do you see other solutions that could help improve the performance?
Thanks and I look forward to hearing from you.
R
In general the text search parser treats commas and spaces the same, so if you want to use FTS, the structure with commas does not pose a problem. pg_trgm also treats commas and spaces the same, so if you use that method instead it will not have a problem with the commas either.
The performance is going to depend on how popular or rare the tokens in the query are in the body of text. It is hard to generalize from one example row and one example query, neither of which looks very realistic. The best way to figure it out is to run some real queries with real (or at least realistic) data, with EXPLAIN (ANALYZE, BUFFERS) and with track_io_timing turned on.
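For example (using the table and query from the question as stand-ins for the real ones):

-- track_io_timing requires the appropriate privileges; it adds I/O timings to the plan output
set track_io_timing = on;

explain (analyze, buffers)
select * from information_table where fields like '%field_x%';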

Pattern matching performance issue Postgres

I found that the query below takes a long time; this pattern matching is causing the performance problem in my batch job.
Query:
select a.id, b.code
from table a
left join table b
on a.desc_01 like '%'||b.desc_02||'%';
I have tried the LEFT and STRPOS functions to improve the performance, but in the end I lose some data if I apply those functions.
Any other suggestions, please?
It's not that clear what your data (or structure) really looks like, but your search is performing a contains comparison. That's not the simplest thing to optimize because a standard index, and many matching algorithms, are biased towards the start of the string. When you lead with %, a B-tree can't be used efficiently, since it splits/branches based on the front of the string.
Depending on how you really want to search, have you considered trigram indexes? They're pretty great. Your string gets split into three-letter chunks, which overcomes a lot of the problems with left-anchored text comparison. The reason is simple: now every character is the start of a short, left-anchored chunk. There are traditionally two methods of generating trigrams (n-grams), one with leading padding and one without. Postgres uses padding, which is the better default. I got help with a related question recently that may be relevant to you:
Searching on expression indexes
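To see those chunks for yourself, pg_trgm's show_trgm() lists them; note the blank padding at the front of the word:

select show_trgm('cat');
-- returns something like {"  c"," ca","at ",cat}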
If you want something more like a keyword match, then full text search might be of help. I had not been using it much because I've got a data set where converting words to "lexemes" doesn't make sense. It turns out that you can tell the parser to use the "simple" dictionary instead, and that gets you a unique word list without any stemming transformations. Here's a recent question on that:
https://dba.stackexchange.com/questions/251177/postgres-full-text-search-on-words-not-lexemes/251185#251185
If that sounds more like what you need, you might also want to get rid of stop/skip/noise words. Here's a thread that I think clarifies the docs on how to set this up (it's not hard):
https://dba.stackexchange.com/questions/145016/finding-the-most-commonly-used-non-stop-words-in-a-column/186754#186754
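A minimal sketch of that "simple" configuration approach, with "documents" and "body" as made-up names, looks like this:

-- 'simple' indexes words as-is, with no stemming; "documents" and "body" are hypothetical names
create index documents_body_fts_idx
  on documents using gin (to_tsvector('simple', body));

select *
from documents
where to_tsvector('simple', body) @@ plainto_tsquery('simple', 'keyword');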
The long term answer is to clean up and re-organize your data so you don't need to do this.
Using a pg_trgm index might be the short term answer.
create extension pg_trgm;
create index on a using gin (desc_01 gin_trgm_ops);
How fast this will be is going to depend on what is in b.desc_02.

Which column compression type should I choose in Amazon Redshift?

I have a table with over 120 million rows.
The command analyze compression tbl; suggests LZO encoding for almost every VARCHAR field, but I think that runlength encoding may be better for fields with a finite number of options (traffic source, category, etc.).
So should I move certain fields to another encoding or stay with LZO?
Thoughts on runlength
The point about runlength is not so much a finite number of options as field values being repeated over many consecutive rows. This is usually the case when the table is sorted by that column. You are right, though, that the fewer distinct values you have, the more likely it is for any particular value to occur in a sequence.
Documentation
Redshift states in their documentation:
We do not recommend applying runlength encoding on any column that is designated as a sort key. Range-restricted scans perform better when blocks contain similar numbers of rows. If sort key columns are compressed much more highly than other columns in the same query, range-restricted scans might perform poorly.
And also:
LZO encoding provides a very high compression ratio with good performance. LZO encoding works especially well for CHAR and VARCHAR columns that store very long character strings, especially free form text, such as product descriptions, user comments, or JSON strings.
Benchmark
So, ultimately, you'll have to take a close look at your data, the way it is sorted, and the way you are going to join other tables on it, and, if in doubt, benchmark the encodings. Create the same table twice and apply runlength encoding to the column in one table and LZO in the other. Ideally, you already have a query that you know will be used often. Run it several times against each table and compare the results.
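A sketch of such a benchmark, with made-up table and column names, might be:

-- two copies of the same table, differing only in the encoding under test
create table tbl_lzo (
    traffic_source varchar(255) encode lzo,
    category       varchar(255) encode lzo
);

create table tbl_runlength (
    traffic_source varchar(255) encode runlength,
    category       varchar(255) encode runlength
);

-- load both with the same data, then run a typical query against each and compare timings
insert into tbl_lzo       select traffic_source, category from tbl;
insert into tbl_runlength select traffic_source, category from tbl;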
My recommendation
Do your queries perform OK? Then don't worry about the encoding and take Redshift's suggestion. If you want to take it on as a learning project, then make sure that you are aware of how performance improves or degrades when you double (quadruple, ...) the rows in the table. 120 million rows are not that many, and it might well be that one encoding looks great now but causes queries to perform poorly once a certain threshold is passed.

PostgreSQL Misspelling in Full Text Search

I'm using PostgreSQL to perform full text search and I am finding that users will not receive results if there are misspellings.
What is the best way to handle misspelt words in Postgres full text search?
Take a look at the pg_similarity extension, which stuffs PostgreSQL with a lot of similarity operators and functions. This will allow you to add (easily enough) some forgiveness into queries.
Typing "spelling correction postgresql fts" into Google, the top result is a page that links to just such a topic.
It suggests using a separate table of all the valid words in your database and running search terms against that to suggest corrections. The trigram matching allows you to measure how "similar" the real words in your table are to the search terms supplied.
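A sketch of that word-table approach ("search_words", "documents" and "body" are illustrative names) could be:

-- collect the distinct words that actually occur in the indexed text
create table search_words as
  select word from ts_stat('select to_tsvector(''simple'', body) from documents');

create extension if not exists pg_trgm;
create index search_words_trgm_idx on search_words using gin (word gin_trgm_ops);

-- suggest the closest real words for a misspelt search term
select word, similarity(word, 'mispeled') as sml
from search_words
where word % 'mispeled'
order by sml desc
limit 5;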