Scalable way to search for (similar) strings in a database - postgresql

Let me describe my problem. There is an input string, and a table containing many thousands of strings. I am looking for best way to search for the most similar* strings to the input string. The search should return a list of ~10 suggested strings, sorted by degree of similarity. Strings also have numerical weights (popularity) associated with them in database, in another column, so the ones with higher weights should have higher chance of appearing in results, if possible.
What is the best library to achieve this? I am looking for something similar to Elasticsearch, I guess. I don't have much experience with these kinds of libraries, so I would need something easy to include in my project and preferably open-source. I am using Python (Flask and SQLAlchemy) and Postgresql, but could also use e.g. Node.js, if needed.
*I also want to clarify what kind of similarity I am looking for. Ideally, it would be semantic similarity, but lexical similarity is fine as well. I would be happy with anything that works okay, is easy to implement, and is as scalable and performant as possible.
Example input sentence:
I don't like cangaroos.
Example suggestions from the database:
Cangaroos are not my favorite.
Cangaroos are evil.
I once had a cangaroo. Never again.
These suggestions should appear first because 'cangaroo' is not a frequent word in my database, so any string with the word 'cangaroo' should have a high chance appearing in results. It is probably much harder to detect 'don't like', so that part is completely optional for me.
P.s. Could PostgreSQL's full text search do something like this?
Thank you.

PostgreSQL Full-text search cannot do what you're looking for. However, PostgreSQL trigram similarity can do it.
You first need to install the packages with 'trigram similarity' and 'btree_gist', by executing (once) in your database:
CREATE EXTENSION pg_trgm;
CREATE EXTENSION btree_gist;
I assume you have one table that looks like this one:
CREATE TABLE sentences
(
sentence_id integer PRIMARY KEY,
sentence text
) ;
INSERT INTO sentences (sentence_id, sentence)
VALUES
(1, 'Cangaroos are not my favorite.'),
(2, 'A vegetable sentence.'),
(3, 'Cangaroos are evil.'),
(4, 'Again, some plants in my garden.'),
(5, 'I once had a cangaroo. Never again.') ;
This table needs a 'trigram index', to allow the PostgreSQL database to 'index by similarity'. This is accomplished by executing:
CREATE INDEX ON sentences USING GIST (sentence gist_trgm_ops, sentence_id) ;
To find the answers you're looking for, you execute:
-- Set the minimum similarity you want to be able to search
SELECT set_limit(0.2) ;
-- And now, select the sentences 'similar' to the input one
SELECT
similarity(sentence, 'I don''t like cangaroos') AS similarity,
sentence_id,
sentence
FROM
sentences
WHERE
/* That's how you choose your sentences:
% means 'similar to', in the trigram sense */
sentence % 'I don''t like cangaroos'
ORDER BY
similarity DESC ;
The result that you get is:
similarity | sentence_id | sentence
-----------+-------------+-------------------------------------
0.3125 | 3 | Cangaroos are evil.
0.2325 | 1 | Cangaroos are not my favorite.
0.2173 | 5 | I once had a cangaroo. Never again.
Hope this gives you what you want...

Related

Better Postgres trigram ranking

I'm searching several million names and addresses in a Postgres table. I'd like to use pg_trgm to do fast fuzzy search.
My app is actually very similar to the one in Optimizing a postgres similarity query (pg_trgm + gin index), and the answer there is pretty good.
My problem is that the relevance ranking isn't very good. There are two issues:
I want names to get a heavier weight in the ranking than addresses, and it's not clear how to do that and still get good performance. For example, if a user searches for 'smith', I want 'Bob Smith' to appear higher in the results than '123 Smith Street'.
The current results are biased toward columns that contain fewer characters. For example, a search for 'bob' will rank 'Bobby Smith' (without an address) above 'Bob Smith, 123 Bob Street, Smithville Illinois, 12345 with some other info here'. The reason for this is that the similarity score penalizes for parts of the string that do not match the search terms.
I'm thinking that I'll get a much better result if I could get a score that simply returns the number of matched trigrams in a record, not the number of trigrams scaled by the length of the target string. That's the way most search engines (like Elastic) work -- they rank by the weighted number of hits and do not penalize long documents.
Is it possible to do this with pg_trgm AND get good (sub-second) performance? I could do an arbitrary ranking of results, but if the ORDER BY clause does not match the index, then performance will be poor.
I know that this is an old question but this might be useful for others.
If the text you want to search falls in ascii table (characters in the range of [a-zA-Z0-9] and some other symbols), then you probably want to use Full Text Search feature (read official document Full Text Search).
because not only it gives you ability to sort by relevancy, but also ability to customize things like text steming (using snowball algorithm), which maps words like connection, connections, connective, connected, and connecting to connect (Read more about Snowball Stem). This makes your application performs better on search.
But if your requirement is to search text out of range of ascii table, like unicode, which is common if you try to support Asian languages like Japanese, Thai, Korean, etc. then using pg_trgm is perfectly fine.
To do the search that is not biased to the shorter text as mentioned in the question, you could use word_similarity(), instead of similarity().
As per the official documentation:
word_similarity( text, text )
Returns a number that indicates the greatest similarity between the set of trigrams in the first string and any continuous extent of an ordered set of trigrams in the second string. For details, see the explanation below.
So for example:
postgres=# SELECT word_similarity('white cat', 'white dog and black cat') as "similarity 1", word_similarity('white cat', 'I have a white dog and a black cat') as "similarity 2", word_similarity('white cat', 'I have a lovely white dog and a cute big black cat in a house') as "similarity 3";
similarity 1 | similarity 2 | similarity 3
--------------+--------------+--------------
0.6 | 0.6 | 0.6
(1 row)
As shown above, they all have equal scores.
And when you want to use it in a query:
SELECT col, word_similarity('some query', col) from my_table where col <% 'some query';
According to the document:
text <% text → boolean
Returns true if the similarity between the trigram set in the first argument and a continuous extent of an
ordered trigram set in the second argument is greater than the current
word similarity threshold set by pg_trgm.word_similarity_threshold
parameter.
For somehting more complicated like calculating hit scores, relevance weight/boost, and faster response time on larger dataset, you should use Elastic instead, but keep in mind that Elastic instance needs at least 2GB of ram and more, so you need dedicated EC2 instance(s) for that purpose. But for small-medium app, pg_trgm works just fine while saving your server cost.
Hope you find this helpful.

Search engine like full text search in PostgreSQL

I have a list of titles and descriptions in a table which are indexed in a tsvector column. How can I implement Google Search like full text search functionality in Postgres for these fields. I tried various functions offered by standard Postgres like
to_tsquery('apple | orange') -- apple | orange
This function returns rows as long as it has one of these terms so it doesn't produce highly relevant results at top which should have both of the terms.
plainto_tsquery('apple orange') -- apple & orange
This function requires all of the terms in the query. But I want results including both apple and orange first but can still have results including even one of these terms just later in the results.
phraseto_tsquery('apple orange') -- apple <> orange
This function only matches orange followed by apple but not vice versa. But for me orange <> apple is also still relevant.
I also tried websearch_to_tsquery() but it behaves very similar to above functions.
How can I ask Postgres to list highly relevant rows first which contains most of the terms in the search query no matter the order of the terms and then followed by rows with less number of terms?
to_tsquery('apple | orange') -- apple | orange
This function returns rows as long as it has one of these terms so it doesn't produce highly relevant results at top which should have both of the terms.
Unless you tell it how to order the rows, rows of a single query are returned in arbitrary order. There is no "top" without an ORDER BY, there is just something which happens to be seen first.
How can I ask Postgres to list highly relevant rows first which contains most of the terms in the search query no matter the order of the terms and then followed by rows with less number of terms?
Use the | operator, then rank those rows using ts_rank, ts_rank_cd, or a custom ranking function you write yourself. For performance, you might want to use the & operator first, then revert to | if you don't get enough rows.
The built in ranking functions don't care about order, but also don't care about proximity. So they might not do what you want. But writing your own won't be particularly easy, so I'd at least try them out first.
It would be nice if the introduction of websearch_to_tsquery or phraseto_tsquery had also introduced some corresponding ranking functions. But since they invented only ordered proximity, not proximity without order, it is unlikely they would do you want if they did exist.

How to make a tsquery perform a partial match?

I have the following situation. In our database, our user has the ability to search part numbers as 'keywords'. Part numbers are attached as 'footnotes' which get attached to certain items. An example of a footnote of this nature would have a description of:
Part Number: 09C888
Our keyword search searches multiple tables through an incredibly fun set of LEFT JOINs eventually forming a ts_vector which then is used against a tsquery. Our current issue is that this methodology seems to only accept exact matches. Example:
select to_tsvector('Part Number: 09C888') ## to_tsquery('09C888:*');
?column?
---------
t
Using the full version of the part number as the search criteria works fine. However...
select to_tsvector('Part Number: 09C888') ## to_tsquery('9C888:*');
?column?
----------
f
Is there a way to modify the above tsquery item to match against 09C888 with values of 09C888 AND 9C888? Normally, I could do something similar with the LIKE construct, but we're currently using full text search for efficiency on large amounts of data. From perusing the postgresql documentation, I cannot figure out an easy way to do this. I am also hesitant to change the overall query since it's doing... well, its doing a lot of stuff of which the text matching is only one part of. (Obviously a potential place for improvement.)
EDIT:
I've actually figured out how to do this using a modified query
select to_tsvector('Part Number: 09C888') ## to_tsquery('09C888|9C888:*');
Is there a better way to determine match than what I've listed above? Mostly because the solution in incredibly specific, but essentially these part numbers may or may not have leading 0s.
Have you considered storing the part number with leading zeroes removed in a separate column and search against that?
+---------------------+-------+
| Part Number: 09C888 | 9C888 |
+---------------------+-------+
CREATE INDEX footnote_part_number_txt_idx
ON footnotes (stripped_part_number text_pattern_ops);
then you can query (using the index)
SELECT footnote_str
FROM footnotes
WHERE stripped_part_number LIKE '9C88%'
See: http://petereisentraut.blogspot.se/2009/10/rethink-your-text-column-indexing-with.html

PostgreSQL Full Text Search and Trigram Confusion

I'm a little bit confused with the whole concept of PostgreSQL, full text search and Trigram. In my full text search queries, I'm using tsvectors, like so:
SELECT * FROM articles
WHERE search_vector ## plainto_tsquery('english', 'cat, bat, rat');
The problem is, this method doesn't account for misspelling. Then I started to read about Trigram and pg_trgm:
Looking through other examples, it seems like trigram is used or vectors are used, but never both. So my questions are: Are they ever used together? If so, how? Does trigram replace full text? Are trigrams more accurate? And how are trigrams on performance?
They serve very different purposes.
Full Text Search is used to return documents that match a search query of stemmed words.
Trigrams give you a method for comparing two strings and determining how similar they look.
Consider the following examples:
SELECT 'cat' % 'cats'; --true
The above returns true because 'cat' is quite similar to 'cats' (as dictated by the pg_trgm limit).
SELECT 'there is a cat with a dog' % 'cats'; --false
The above returns false because % is looking for similarily between the two entire strings, not looking for the word cats within the string.
SELECT to_tsvector('there is a cat with a dog') ## to_tsquery('cats'); --true
This returns true becauase tsvector transformed the string into a list of stemmed words and ignored a bunch of common words (stop words - like 'is' & 'a')... then searched for the stemmed version of cats.
It sounds like you want to use trigrams to auto-correct your ts_query but that is not really possible (not in any efficient way anyway). They do not really know a word is misspelt, just how similar it might be to another word. They could be used to search a table of words to try and find similar words, allowing you to implement a "did you mean..." type feature, but this word require maintaining a separate table containing all the words used in your search field.
If you have some commonly misspelt words/phrases that you want the text-index to match you might want to look at Synonym Dictorionaries

postgresql phrase extraction & ranking

From selected rows in a table, how can one extract and rank phrases based on how often they occur?
example 1: http://developer.yahoo.com/search/content/V1/termExtraction.html
example 2: http://mirror.me/i/love
INPUT:
CREATE TABLE phrases (
id BIGSERIAL,
phrase VARCHAR(10000)
);
INSERT INTO phrases (phrase) VALUES (‘Italian sculptors and painters of the renaissance favored the Virgin Mary for inspiration.’)
INSERT INTO phrases (phrase) VALUES (‘Andrea Bolgi was an italian sculptor’)
DESIRED OUTPUT:
phrase | weight
italian sculptor | 5
virgin mary | 2
painters | 1
renaissance | 1
inspiration | 1
Andrea Bolgi | 1
To find just words, not phrases, one could use
SELECT * FROM ts_stat('SELECT to_tsvector(''simple'', phrase) FROM phrases')
ORDER BY nentry DESC, ndoc DESC, word;
Some notes:
phrases could contain “stop words”, e.g. “easy to answer”
ideally, english language variations and synonyms would be automatically grouped.
Could pg_trgm help? (it’s ok if only 2 and 3 word phrases are found). How exactly?
Related questions:
What techniques/tools are there for discovering common phrases in chunks of text?
How to find common phrases in a large body of text
How to extract common / significant phrases from a series of text entries
I agree with Craig that this is certainly way beyond the scope of what tsearch2 was intended to do as well as any other existing PostgreSQL tools. However, I do think that this might not be too bad to do in the db engine. One of the strengths of PostgreSQL is programmability and this strength gives you some very underutilized options.
As Craig notes, this is the domain of natural language processing, not of SQL per se, so the first thing you want to do is settle on a natural language processing toolkit with support for a stored procedure language that PostgreSQL supports. In other words, you want something that supports Perl, Python, C, etc. Whatever PostgreSQL supports and you feel comfortable working in.
The second step is to create functional interfaces for this toolkit in stored procedure languages. This should take text in, and output the phrase breakdown in some sort of type PostgreSQL can handle reasonably well. You need to pay attention to the type carefully because that affects things like GIN indexing.
From there you can incorporate it into your database interfaces and queries.