PostgreSQL: Find sentences closest to a given sentence - postgresql

I have a table of images with sentence captions. Given a new sentence I want to find the images that best match it based on how close the new sentence is to the stored old sentences.
I know that I can use the @@ operator with to_tsquery(), but a tsquery only accepts specific words as its query.
One problem is I don't know how to convert the given sentence into a meaningful query. The sentence may have punctuation and numbers.
However, I also feel that some kind of cosine similarity is what I need, but I don't know how to get that out of PostgreSQL. I am using the latest GA version and am happy to use the development version if that would solve my problem.

Full Text Search (FTS)
You could use plainto_tsquery() to (per documentation) ...
produce tsquery ignoring punctuation
SELECT plainto_tsquery('english', 'Sentence: with irrelevant words (and punctuation) in it.')
plainto_tsquery
------------------
'sentenc' & 'irrelev' & 'word' & 'punctuat'
Use it like:
SELECT *
FROM tbl
WHERE to_tsvector('english', sentence) @@ plainto_tsquery('english', 'My new sentence');
But that is still rather strict and only provides very limited tolerance for similarity.
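To make the normalisation step more concrete, here is a rough Python sketch of what plainto_tsquery() does to a sentence: lowercase it, drop punctuation and stop words, and AND the remaining words together. The function name rough_plain_query and the tiny stop-word list are assumptions made for illustration. The real function also stems each word with the configured dictionary (hence 'sentenc' rather than 'sentence'), which this sketch does not attempt.

```python
import re

# Tiny illustrative stop-word list (an assumption); the real 'english'
# configuration uses a much longer list and also stems every word.
STOP_WORDS = {"a", "an", "and", "in", "is", "it", "of", "the", "to", "with"}


def rough_plain_query(sentence):
    """Lowercase, drop punctuation and stop words, AND the remaining words."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    kept = [w for w in words if w not in STOP_WORDS]
    return " & ".join(f"'{w}'" for w in kept)


print(rough_plain_query("Sentence: with irrelevant words (and punctuation) in it."))
```

Because every remaining word is combined with AND, a stored sentence has to contain all of them to match, which is why this approach is strict.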
Trigram similarity
Might be better suited to search for similarity, even overcome typos to some degree.
Install the additional module pg_trgm, create a GiST index and use the similarity operator % in a nearest neighbour search:
Basically, with a trigram GiST index on sentence:
-- SELECT set_limit(0.3); -- adjust tolerance if needed
SELECT *
FROM tbl
WHERE sentence % 'My new sentence'
ORDER BY sentence <-> 'My new sentence'
LIMIT 10;
More:
Finding similar strings with PostgreSQL quickly
Finding similar posts with PostgreSQL
Slow fulltext search for terms with high occurence
Combine both
You can even combine FTS and trigram similarity:
PostgreSQL FTS and Trigram-similarity Query Optimization

This is a pretty late answer, but I'm adding it in case anyone runs into the same problem. If you append ":*" to a word in the query, it matches every word that starts with that prefix.
Sample:
JS autocomplete -> CodeIgniter:
$barcode = $this->input->get("term") . ":*";
Query:
$query = 'select * from tablename where xx @@ ? LIMIT 15';
$barcodequery = $this->db->query($query, array($barcode))->result_array();

Related

Scalable way to search for (similar) strings in a database

Let me describe my problem. There is an input string, and a table containing many thousands of strings. I am looking for best way to search for the most similar* strings to the input string. The search should return a list of ~10 suggested strings, sorted by degree of similarity. Strings also have numerical weights (popularity) associated with them in database, in another column, so the ones with higher weights should have higher chance of appearing in results, if possible.
What is the best library to achieve this? I am looking for something similar to Elasticsearch, I guess. I don't have much experience with these kinds of libraries, so I would need something easy to include in my project and preferably open-source. I am using Python (Flask and SQLAlchemy) and PostgreSQL, but could also use e.g. Node.js, if needed.
*I also want to clarify what kind of similarity I am looking for. Ideally, it would be semantic similarity, but lexical similarity is fine as well. I would be happy with anything that works okay, is easy to implement, and is as scalable and performant as possible.
Example input sentence:
I don't like cangaroos.
Example suggestions from the database:
Cangaroos are not my favorite.
Cangaroos are evil.
I once had a cangaroo. Never again.
These suggestions should appear first because 'cangaroo' is not a frequent word in my database, so any string with the word 'cangaroo' should have a high chance appearing in results. It is probably much harder to detect 'don't like', so that part is completely optional for me.
P.s. Could PostgreSQL's full text search do something like this?
Thank you.
PostgreSQL Full-text search cannot do what you're looking for. However, PostgreSQL trigram similarity can do it.
You first need to install the extensions pg_trgm (trigram similarity) and btree_gist, by executing (once) in your database:
CREATE EXTENSION pg_trgm;
CREATE EXTENSION btree_gist;
I assume you have one table that looks like this one:
CREATE TABLE sentences
(
sentence_id integer PRIMARY KEY,
sentence text
) ;
INSERT INTO sentences (sentence_id, sentence)
VALUES
(1, 'Cangaroos are not my favorite.'),
(2, 'A vegetable sentence.'),
(3, 'Cangaroos are evil.'),
(4, 'Again, some plants in my garden.'),
(5, 'I once had a cangaroo. Never again.') ;
This table needs a 'trigram index', to allow the PostgreSQL database to 'index by similarity'. This is accomplished by executing:
CREATE INDEX ON sentences USING GIST (sentence gist_trgm_ops, sentence_id) ;
To find the answers you're looking for, you execute:
-- Set the minimum similarity you want to be able to search
SELECT set_limit(0.2) ;
-- And now, select the sentences 'similar' to the input one
SELECT
    similarity(sentence, 'I don''t like cangaroos') AS similarity,
    sentence_id,
    sentence
FROM
    sentences
WHERE
    /* That's how you choose your sentences:
       % means 'similar to', in the trigram sense */
    sentence % 'I don''t like cangaroos'
ORDER BY
    similarity DESC ;
The result that you get is:
similarity | sentence_id | sentence
-----------+-------------+-------------------------------------
0.3125 | 3 | Cangaroos are evil.
0.2325 | 1 | Cangaroos are not my favorite.
0.2173 | 5 | I once had a cangaroo. Never again.
Hope this gives you what you want...
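The question also mentions a popularity weight per string, which the query above does not use. One possible way to take it into account is to re-rank the rows in the application after fetching them. The sketch below is only an illustration: the scoring formula, the alpha parameter and the weights in the sample rows are all made up, and the similarity values are copied from the result shown above.

```python
import math


def rerank(rows, alpha=0.1):
    """rows: (similarity, weight, sentence) tuples; returns them best-first.

    Hypothetical scoring: similarity boosted by the logarithm of the
    popularity weight, so that popular strings move up without
    completely overriding how well they match.
    """
    def score(row):
        similarity, weight, _ = row
        return similarity * (1.0 + alpha * math.log1p(weight))

    return sorted(rows, key=score, reverse=True)


rows = [
    (0.3125, 1, "Cangaroos are evil."),
    (0.2325, 500, "Cangaroos are not my favorite."),
    (0.2173, 3, "I once had a cangaroo. Never again."),
]
for similarity, weight, sentence in rerank(rows):
    print(sentence)
```

With alpha set to 0 the order is the plain similarity order. Larger values of alpha let popularity dominate, so the value would need tuning against real data.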

Longest matching substring

How would you search for the longest match within a varchar variable? For example, table GOB has entries as follows:
magic_word | prize
===========+=======
sh         | $0.20
sha        | $0.40
shaz       | $0.60
shaza      | $1.50
I would like to write a plpgsql function that takes amongst other arguments a string as input (e.g. shazam), and returns the 'prize' column on the row of GOB with the longest matching substring. In the example shown, that would be $1.50 on the row with magic_word shaza.
All the function format I can handle, it's just the matching bit. I can't think of an elegant solution. I'm guessing it's probably really easy, but I am scratching my head. I don't know the input string at the start, as it will be derived from the result of a query on another table.
Any ideas?
Simple solution
SELECT magic_word
FROM gob
WHERE 'shazam' LIKE (magic_word || '%')
ORDER BY magic_word DESC
LIMIT 1;
This works because the longest match sorts last - so I sort DESC and pick the first match.
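The same matching logic, written as a small Python sketch in case it helps to see it outside SQL. The function name longest_prefix_prize and the dictionary standing in for the gob table are assumptions for illustration. Instead of relying on sort order it picks the longest match explicitly.

```python
def longest_prefix_prize(text, gob):
    """gob maps magic_word -> prize.

    Returns the prize of the longest magic_word that text starts with,
    or None when no magic_word is a prefix of text.
    """
    matches = [word for word in gob if text.startswith(word)]
    if not matches:
        return None
    return gob[max(matches, key=len)]


gob = {"sh": "$0.20", "sha": "$0.40", "shaz": "$0.60", "shaza": "$1.50"}
print(longest_prefix_prize("shazam", gob))  # prints $1.50
```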
I am assuming from your example that you want to match left-anchored, from the beginning of the string. If you want to match anywhere in the string (which is more expensive and even harder to back up with an index), use:
...
WHERE 'shazam' LIKE ('%' || magic_word || '%')
...
SQL Fiddle.
Performance
The query is not sargable. It might help quite a bit if you had additional information, like a minimum length that you could base an index on, to reduce the number of rows to consider. To be effective, the criterion needs to narrow the search to less than ~5% of the table. So, initials (a natural minimum pick) may or may not be useful. But two or three letters at the start might help quite a bit.
In fact you could optimize this iteratively. Something along the line of:
Try a partial index of words with 15 letters+
If not found, try 12 letters+
If not found, try 9 letters+
...
A simple case of what I outlined in this related answer on dba.SE:
Can spatial index help a “range - order by - limit” query
Another approach would be to use a trigram index. You'd need the additional module pg_trgm for that. Normally you would search with a short pattern in a table with longer strings. But trigrams work for your reverse approach, too, with some limitations. Obviously you couldn't match a string with just two characters in the middle of a longer string using trigrams ... Test for corner cases.
There are a number of answers here on SO with more information. Example:
Effectively query on column that includes a substring
Advanced solution
Consider the solution under this closely related question for a whole table of search strings. Implemented with a recursive CTE:
Longest Prefix Match
How about:
1.
select max(FOO.matchingValue)
from
(
    select magic_word as matchingValue
    from T
    -- string literals take single quotes; double quotes would denote an identifier
    where substr('abracadabra', 1, length(magic_word)) = magic_word
)
as FOO
2.
select prize
from T
join
(
    select max(FOO.matchingValue) as MaxValue
    from
    (
        select magic_word as matchingValue
        from T
        where substr('abracadabra', 1, length(magic_word)) = magic_word
    )
    as FOO
) as BAR
on BAR.MaxValue = T.magic_word

PostgreSQL Full Text Search and Trigram Confusion

I'm a little bit confused with the whole concept of PostgreSQL, full text search and Trigram. In my full text search queries, I'm using tsvectors, like so:
SELECT * FROM articles
WHERE search_vector @@ plainto_tsquery('english', 'cat, bat, rat');
The problem is, this method doesn't account for misspelling. Then I started to read about Trigram and pg_trgm:
Looking through other examples, it seems like trigram is used or vectors are used, but never both. So my questions are: Are they ever used together? If so, how? Does trigram replace full text? Are trigrams more accurate? And how are trigrams on performance?
They serve very different purposes.
Full Text Search is used to return documents that match a search query of stemmed words.
Trigrams give you a method for comparing two strings and determining how similar they look.
Consider the following examples:
SELECT 'cat' % 'cats'; --true
The above returns true because 'cat' is quite similar to 'cats' (as dictated by the pg_trgm limit).
SELECT 'there is a cat with a dog' % 'cats'; --false
The above returns false because % measures similarity between the two entire strings; it does not look for the word cats within the string.
SELECT to_tsvector('there is a cat with a dog') @@ to_tsquery('cats'); --true
This returns true because to_tsvector transformed the string into a list of stemmed words and ignored a bunch of common words (stop words, like 'is' & 'a'), then searched for the stemmed version of cats.
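To see why the two % examples above behave so differently, here is a Python sketch of how trigram similarity is computed, to the best of my understanding of pg_trgm: each word is lowercased, padded with two leading spaces and one trailing space, split into 3-character windows, and the similarity is the number of shared trigrams divided by the number of distinct trigrams in both strings. The function names are mine, and the sketch only handles ASCII letters and digits, so treat it as an approximation rather than the module's actual implementation.

```python
import re


def trigrams(text):
    """Set of trigrams; each word gets two leading spaces and one trailing space."""
    result = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(similarity("cat", "cats"))                        # 3 shared of 6 distinct
print(similarity("there is a cat with a dog", "cats"))  # 3 shared of 26 distinct
```

The first pair scores 0.5, which is above the default pg_trgm threshold of 0.3. The second pair shares the same three trigrams, but the long string contributes so many others that the ratio falls well below the threshold.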
It sounds like you want to use trigrams to auto-correct your tsquery, but that is not really possible (not in any efficient way, anyway). Trigrams do not know that a word is misspelt, only how similar it is to another word. They could be used to search a table of words for similar words, allowing you to implement a "did you mean..." type of feature, but this would require maintaining a separate table containing all the words used in your search field.
If you have some commonly misspelt words/phrases that you want the text index to match, you might want to look at Synonym Dictionaries.

How to index a postgres table by name, when the name can be in any language?

I have a large postgres table of locations (shops, landmarks, etc.) which the user can search in various ways. When the user wants to do a search for the name of a place, the system currently does (assuming the search is on cafe):
lower(location_name) LIKE '%cafe%'
as part of the query. This is hugely inefficient. Prohibitively so. It is essential I make this faster. I've tried indexing the table on
gin(to_tsvector('simple', location_name))
and searching with
(to_tsvector('simple',location_name) @@ to_tsquery('simple','cafe'))
which works beautifully, and cuts down the search time by a couple of orders of magnitude.
However, the location names can be in any language, including languages like Chinese, which aren't whitespace delimited. This new system is unable to find any Chinese locations, unless I search for the exact name, whereas the old system could find matches to partial names just fine.
So, my question is: Can I get this to work for all languages at once, or am I on the wrong track?
If you want to optimize arbitrary substring matches, one option is to use the pg_trgm module. Add an index:
CREATE INDEX table_location_name_trigrams_key ON table
USING gin (location_name gin_trgm_ops);
This will break "Simple Cafe" into "sim", "imp", "mpl", etc., and add an entry to the index for each trigram in each row. The query planner can then automatically use this index for substring pattern matches, including:
SELECT * FROM table WHERE location_name ILIKE '%cafe%';
This query will look up "caf" and "afe" in the index, find the intersection, fetch those rows, then check each row against your pattern. (That last check is necessary since the intersection of "caf" and "afe" matches both "simple cafe" and "unsafe scaffolding", while "%cafe%" should only match one). The index becomes more effective as the input pattern gets longer since it can exclude more rows, but it's still not as efficient as indexing whole words, so don't expect a performance improvement over to_tsvector.
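The lookup, intersect and recheck steps described above can be sketched in Python with a plain dictionary as the index. This is a simplified model and not how the GIN index is actually stored: it slides a 3-character window over the whole lowercased string instead of padding individual words, and all function names are assumptions.

```python
from collections import defaultdict


def windows(text):
    """All 3-character windows of the lowercased text (no word padding)."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(rows):
    """Map each trigram to the set of row ids whose name contains it."""
    index = defaultdict(set)
    for row_id, name in rows.items():
        for gram in windows(name):
            index[gram].add(row_id)
    return index


def search_substring(rows, index, needle):
    """Emulate WHERE name ILIKE '%needle%' using the trigram index."""
    grams = windows(needle)
    if not grams:
        # Fewer than three characters: the index cannot help, check every row.
        candidates = set(rows)
    else:
        # Rows that contain every trigram of the needle.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Recheck: containing all the trigrams does not guarantee a real match.
    return sorted(i for i in candidates if needle.lower() in rows[i].lower())


rows = {1: "Simple Cafe", 2: "unsafe scaffolding", 3: "Tea House"}
index = build_index(rows)
print(search_substring(rows, index, "cafe"))  # prints [1]
```

Both row 1 and row 2 survive the intersection of "caf" and "afe", and only the recheck removes "unsafe scaffolding".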
The catch is that trigrams don't work at all for patterns that are under three characters. That may or may not be a deal-breaker for your application.
Edit: I initially added this as a comment.
I had another thought last night when I was mostly asleep. Make a cjk_chars function that takes an input string, regexp_matches the entire CJK Unicode ranges, and returns an array of any such characters or NULL if none. Add a GIN index on cjk_chars(location_name). Then query for:
WHERE CASE
WHEN cjk_chars('query') IS NOT NULL THEN
cjk_chars(location_name) @> cjk_chars('query')
AND location_name LIKE '%query%'
ELSE
<tsvector/trigrams>
END
Ta-da, unigrams!
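A Python sketch of the same idea, to show what cjk_chars would return and how the containment test narrows candidates before the exact substring check. It assumes that only the CJK Unified Ideographs block (U+4E00 to U+9FFF) matters; a real implementation would probably need more ranges (kana, hangul, extension blocks). The helper name matches_cjk is made up for this illustration.

```python
import re

# Assumption: only the CJK Unified Ideographs block is covered here.
CJK = re.compile(r"[\u4e00-\u9fff]")


def cjk_chars(text):
    """List of CJK characters in text, or None if there are none."""
    chars = CJK.findall(text)
    return chars or None


def matches_cjk(location_name, query):
    """True/False for queries containing CJK characters, None otherwise.

    None signals that the caller should fall back to the tsvector or
    trigram search instead.
    """
    wanted = cjk_chars(query)
    if wanted is None:
        return None
    have = cjk_chars(location_name) or []
    # Containment (like @> on arrays) is cheap and indexable;
    # the substring test then confirms order and adjacency.
    return set(wanted) <= set(have) and query in location_name


print(cjk_chars("cafe"))
print(matches_cjk("北京星巴克咖啡", "咖啡"))
```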
For full text search in a multi-language environment you need to store the language each datum is in alongside the text itself. You can then use the language-specific flavours of the tsearch functions to get proper stemming, etc.
eg given:
CREATE TABLE location(
location_name text,
location_name_language text
);
... plus any appropriate constraints, you might write:
CREATE INDEX location_name_ts_idx ON location
USING gin(to_tsvector(location_name_language, location_name));
and for search:
SELECT to_tsvector(location_name_language,location_name) @@ to_tsquery('english','cafe');
Cross-language searches will be problematic no matter what you do. In practice I'd use multiple matching strategies: I'd compare the search term to the tsvector of location_name in the simple configuration and the stored language of the text. I'd possibly also use a trigram based approach like willglynn suggests, then I'd unify the results for display, looking for common terms.
It's possible you may find Pg's fulltext search too limited, in which case you might want to check out something like Lucene / Solr.
See:
* controlling full text search.
* tsearch dictionaries
Similar to what @willglynn already posted, I would consider the pg_trgm module. But preferably with a GiST index:
CREATE INDEX tbl_location_name_trgm_idx ON tbl
USING gist(location_name COLLATE "C" gist_trgm_ops);
The gist_trgm_ops operator class ignores case generally, and ILIKE is just as fast as LIKE. Quoting the source code:
Caution: IGNORECASE macro means that trigrams are case-insensitive.
I use COLLATE "C" here - which is effectively no special collation (byte order instead), because you obviously have a mix of various collations in your column. Collation is relevant for ordering or ranges, for a basic similarity search, you can do without it. I would consider setting COLLATE "C" for your column to begin with.
This index would lend support to your first, simple form of the query:
SELECT * FROM tbl WHERE location_name ILIKE '%cafe%';
Very fast.
Retains capability to find partial matches.
Adds capability for fuzzy search.
Check out the % operator and set_limit().
GiST index is also very fast for queries with LIMIT n to select n "best" matches. You could add to the above query:
ORDER BY location_name <-> 'cafe'
LIMIT 20
Read more about the "distance" operator <-> in the manual here.
Or even:
SELECT *
FROM tbl
WHERE location_name ILIKE '%cafe%' -- exact partial match
OR location_name % 'cafe' -- fuzzy match
ORDER BY
(location_name ILIKE 'cafe%') DESC -- exact beginning first
,(location_name ILIKE '%cafe%') DESC -- exact partial match next
,(location_name <-> 'cafe') -- then "best" matches
,location_name -- break remaining ties (collation!)
LIMIT 20;
I use something like that in several applications for (to me) satisfactory results. Of course, it gets a bit slower with multiple features applied in combination. Find your sweet spot ...
You could go one step further and create a separate partial index for every language and use a matching collation for each:
CREATE INDEX location_name_trgm_idx ON tbl
USING gist(location_name COLLATE "de_DE" gist_trgm_ops)
WHERE location_name_language = 'German';
-- repeat for each language
That would only be useful, if you only want results of a specific language per query and would be very fast in this case.

When will Postgres's full text search support phrase match and proximity match?

As of Postgres 8.4, the database's FTS does not support exact phrase matching, nor does it support proximity matching between two terms. For example, there is no way to tell Postgres to match content in which word #1 is within a specified proximity of word #2. Does anyone know Postgres's plans, and in which version phrase and proximity matching might be supported?
PostgreSQL 9.6 text search supports phrases now
select
*
from (values
('i heart new york'),
('i hate york new')
) docs(body)
where
to_tsvector(body) @@ phraseto_tsquery('new york')
(1 row retrieved)
or by distance between words:
-- a distance of exactly 2 "hops" between "quick" and "fox"
select
*
from (values
('the quick brown fox'),
('quick brown cute fox')
) docs(body)
where
to_tsvector(body) @@ to_tsquery('quick <2> fox')
(1 row retrieved)
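Both queries come down to comparing word positions: a phrase is "second word exactly one position after the first", and <2> is "exactly two positions after". The Python sketch below models just that comparison. It splits on whitespace and does no stemming or dictionary lookup, so it is only an approximation of what to_tsvector records; the function names are assumptions.

```python
def positions(text):
    """Map each lowercased word to its 1-based positions in the text."""
    result = {}
    for pos, word in enumerate(text.lower().split(), start=1):
        result.setdefault(word, []).append(pos)
    return result


def followed_by(text, first, second, distance=1):
    """True if `second` occurs exactly `distance` positions after `first`."""
    pos = positions(text)
    return any(p + distance in pos.get(second, []) for p in pos.get(first, []))


# Phrase search: 'new' immediately followed by 'york'
print(followed_by("i heart new york", "new", "york"))
print(followed_by("i hate york new", "new", "york"))

# quick <2> fox: 'fox' exactly two positions after 'quick'
print(followed_by("the quick brown fox", "quick", "fox", 2))
print(followed_by("quick brown cute fox", "quick", "fox", 2))
```

In each pair only the first document matches, which agrees with the single row returned by each of the queries above.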
http://linuxgazette.net/164/sephton.html
<snip>
Search Vectors
How does one turn document content into an array of lexemes using the parser and dictionaries? How does one match a search criterion to body text? PostgreSQL provides a number of functions to do this. The first one we will look at is to_tsvector().
A tsvector is an internal data type containing an array of lexemes with position information. The lexeme positions are used when searching, to rank the search result based on proximity and other information. One may control the ranking by labelling the different portions which make up the search document content, for example the title, body and abstract may be weighted differently during search by labelling these sections differently. The section labels, quite simply A,B,C & D, are associated with the tsvector at the time it is created, but the weight modifiers associated with those labels may be controlled after the fact.
</snip>
For full-phrase searching, see here.
The Postgresql website does not have a roadmap. Instead, you are referred to the Open Issues page. At the moment, this page makes no mention of full-phrase searching.