PostgreSQL phrase extraction & ranking

From selected rows in a table, how can one extract and rank phrases based on how often they occur?
example 1: http://developer.yahoo.com/search/content/V1/termExtraction.html
example 2: http://mirror.me/i/love
INPUT:
CREATE TABLE phrases (
id BIGSERIAL,
phrase VARCHAR(10000)
);
INSERT INTO phrases (phrase) VALUES ('Italian sculptors and painters of the renaissance favored the Virgin Mary for inspiration.');
INSERT INTO phrases (phrase) VALUES ('Andrea Bolgi was an italian sculptor');
DESIRED OUTPUT:
phrase | weight
italian sculptor | 5
virgin mary | 2
painters | 1
renaissance | 1
inspiration | 1
Andrea Bolgi | 1
To find just words, not phrases, one could use
SELECT * FROM ts_stat('SELECT to_tsvector(''simple'', phrase) FROM phrases')
ORDER BY nentry DESC, ndoc DESC, word;
Some notes:
phrases could contain “stop words”, e.g. “easy to answer”
ideally, English language variations and synonyms would be automatically grouped.
Could pg_trgm help? (it’s ok if only 2 and 3 word phrases are found). How exactly?
Related questions:
What techniques/tools are there for discovering common phrases in chunks of text?
How to find common phrases in a large body of text
How to extract common / significant phrases from a series of text entries

I agree with Craig that this is certainly way beyond the scope of what tsearch2, or any other existing PostgreSQL tool, was intended to do. However, I do think that this might not be too bad to do in the db engine. One of the strengths of PostgreSQL is programmability, and this strength gives you some very underutilized options.
As Craig notes, this is the domain of natural language processing, not of SQL per se, so the first thing you want to do is settle on a natural language processing toolkit with support for a stored procedure language that PostgreSQL supports. In other words, you want something that supports Perl, Python, C, etc. Whatever PostgreSQL supports and you feel comfortable working in.
The second step is to create functional interfaces for this toolkit in stored procedure languages. This should take text in, and output the phrase breakdown in some sort of type PostgreSQL can handle reasonably well. You need to pay attention to the type carefully because that affects things like GIN indexing.
From there you can incorporate it into your database interfaces and queries.
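As a rough illustration of the kind of function such an interface might wrap, here is a minimal Python sketch that counts one- to three-word phrases across rows. It is not an NLP toolkit: the stop-word list, the function names, and the choice to split phrases at stop words are all assumptions made for the example, and it does no stemming, so "sculptor" and "sculptors" are counted separately.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "for", "of", "the", "was"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extract_phrases(text, max_len=3):
    """Yield every run of 1..max_len consecutive non-stop-words."""
    run = []
    for word in tokenize(text) + [None]:  # None flushes the final run
        if word is None or word in STOP_WORDS:
            for n in range(1, max_len + 1):
                for i in range(len(run) - n + 1):
                    yield " ".join(run[i:i + n])
            run = []
        else:
            run.append(word)


def rank_phrases(rows, max_len=3):
    """Count phrases over all rows and return (phrase, count) pairs, most frequent first."""
    counts = Counter()
    for row in rows:
        counts.update(extract_phrases(row, max_len))
    return counts.most_common()
```

Inside the database, a function like rank_phrases could be exposed through PL/Python and fed the selected rows; grouping language variations would need a stemmer or a real toolkit on top of this.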

Related

Better Postgres trigram ranking

I'm searching several million names and addresses in a Postgres table. I'd like to use pg_trgm to do fast fuzzy search.
My app is actually very similar to the one in Optimizing a postgres similarity query (pg_trgm + gin index), and the answer there is pretty good.
My problem is that the relevance ranking isn't very good. There are two issues:
I want names to get a heavier weight in the ranking than addresses, and it's not clear how to do that and still get good performance. For example, if a user searches for 'smith', I want 'Bob Smith' to appear higher in the results than '123 Smith Street'.
The current results are biased toward columns that contain fewer characters. For example, a search for 'bob' will rank 'Bobby Smith' (without an address) above 'Bob Smith, 123 Bob Street, Smithville Illinois, 12345 with some other info here'. The reason for this is that the similarity score penalizes for parts of the string that do not match the search terms.
I'm thinking that I'll get a much better result if I could get a score that simply returns the number of matched trigrams in a record, not the number of trigrams scaled by the length of the target string. That's the way most search engines (like Elastic) work -- they rank by the weighted number of hits and do not penalize long documents.
Is it possible to do this with pg_trgm AND get good (sub-second) performance? I could do an arbitrary ranking of results, but if the ORDER BY clause does not match the index, then performance will be poor.
I know that this is an old question but this might be useful for others.
If the text you want to search falls within the ASCII range (characters such as [a-zA-Z0-9] and some other symbols), then you probably want to use the Full Text Search feature (see the official Full Text Search documentation).
Not only does it let you sort by relevance, it also lets you customize things like stemming (using the Snowball algorithm), which maps words like connection, connections, connective, connected, and connecting to connect (read more about Snowball stemming). This makes your application perform better on search.
But if you need to search text outside the ASCII range, such as Unicode text, which is common when supporting Asian languages like Japanese, Thai, or Korean, then using pg_trgm is perfectly fine.
To do the search that is not biased to the shorter text as mentioned in the question, you could use word_similarity(), instead of similarity().
As per the official documentation:
word_similarity( text, text )
Returns a number that indicates the greatest similarity between the set of trigrams in the first string and any continuous extent of an ordered set of trigrams in the second string. For details, see the explanation below.
So for example:
postgres=# SELECT
    word_similarity('white cat', 'white dog and black cat') AS "similarity 1",
    word_similarity('white cat', 'I have a white dog and a black cat') AS "similarity 2",
    word_similarity('white cat', 'I have a lovely white dog and a cute big black cat in a house') AS "similarity 3";
similarity 1 | similarity 2 | similarity 3
--------------+--------------+--------------
0.6 | 0.6 | 0.6
(1 row)
As shown above, they all have equal scores.
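For contrast, the length bias of plain similarity() described in the question can be reproduced with a simplified Python model of trigram matching. This is only a sketch: it assumes ASCII input and approximates how pg_trgm pads and splits words, and the helper names are made up for the example. It does not implement word_similarity().

```python
import re


def trigrams(text):
    """Approximate pg_trgm: lowercase, split on non-alphanumerics,
    pad each word with two leading spaces and one trailing space."""
    result = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (penalizes long strings)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def shared_trigrams(query, target):
    """Raw number of query trigrams found in the target (no length penalty)."""
    return len(trigrams(query) & trigrams(target))
```

In this model 'bob' shares 3 of its 4 trigrams with 'Bobby Smith' and all 4 with the long name-plus-address string, yet similarity() still ranks the short string higher because the long string contributes many unmatched trigrams to the denominator.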
And when you want to use it in a query:
SELECT col, word_similarity('some query', col) FROM my_table WHERE 'some query' <% col;
According to the documentation:
text <% text → boolean
Returns true if the similarity between the trigram set in the first argument and a continuous extent of an ordered trigram set in the second argument is greater than the current word similarity threshold set by pg_trgm.word_similarity_threshold parameter.
For something more complicated, like calculating hit scores, relevance weights/boosts, and faster response times on larger datasets, you should use Elastic instead. Keep in mind, though, that an Elastic instance needs at least 2GB of RAM, so you will need dedicated EC2 instance(s) for that purpose. For a small to medium app, pg_trgm works just fine while saving on server costs.
Hope you find this helpful.

Search engine like full text search in PostgreSQL

I have a list of titles and descriptions in a table which are indexed in a tsvector column. How can I implement Google Search like full text search functionality in Postgres for these fields. I tried various functions offered by standard Postgres like
to_tsquery('apple | orange') -- apple | orange
This function returns rows as long as it has one of these terms so it doesn't produce highly relevant results at top which should have both of the terms.
plainto_tsquery('apple orange') -- apple & orange
This function requires all of the terms in the query. But I want results including both apple and orange first but can still have results including even one of these terms just later in the results.
phraseto_tsquery('apple orange') -- apple <-> orange
This function only matches apple immediately followed by orange, but not vice versa. But for me orange <-> apple is also still relevant.
I also tried websearch_to_tsquery() but it behaves very similar to above functions.
How can I ask Postgres to list highly relevant rows first which contains most of the terms in the search query no matter the order of the terms and then followed by rows with less number of terms?
to_tsquery('apple | orange') -- apple | orange
This function returns rows as long as it has one of these terms so it doesn't produce highly relevant results at top which should have both of the terms.
Unless you tell it how to order the rows, rows of a single query are returned in arbitrary order. There is no "top" without an ORDER BY, there is just something which happens to be seen first.
How can I ask Postgres to list highly relevant rows first which contains most of the terms in the search query no matter the order of the terms and then followed by rows with less number of terms?
Use the | operator, then rank those rows using ts_rank, ts_rank_cd, or a custom ranking function you write yourself. For performance, you might want to use the & operator first, then revert to | if you don't get enough rows.
The built in ranking functions don't care about order, but also don't care about proximity. So they might not do what you want. But writing your own won't be particularly easy, so I'd at least try them out first.
It would be nice if the introduction of websearch_to_tsquery or phraseto_tsquery had also introduced some corresponding ranking functions. But since they invented only ordered proximity, not proximity without order, it is unlikely they would do what you want even if they did exist.
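The "match with OR, then rank" idea can be sketched outside the database as follows. This Python model is an assumption-laden illustration, not a reimplementation of ts_rank: it simply counts how many distinct query terms each document contains, ignoring order, proximity, and stemming, and the function names are hypothetical.

```python
import re


def terms(text):
    """Return the set of lowercase alphanumeric terms in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_by_matched_terms(query, documents):
    """Keep documents matching at least one query term (the OR step),
    then order them by how many distinct query terms they contain."""
    wanted = terms(query)
    scored = [(len(wanted & terms(doc)), doc) for doc in documents]
    matched = [pair for pair in scored if pair[0] > 0]
    # sorted() is stable, so ties keep their original relative order.
    return sorted(matched, key=lambda pair: pair[0], reverse=True)
```

Documents containing both apple and orange come first regardless of word order, followed by documents containing only one of them, which is the ordering the question asks for.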

Scalable way to search for (similar) strings in a database

Let me describe my problem. There is an input string, and a table containing many thousands of strings. I am looking for best way to search for the most similar* strings to the input string. The search should return a list of ~10 suggested strings, sorted by degree of similarity. Strings also have numerical weights (popularity) associated with them in database, in another column, so the ones with higher weights should have higher chance of appearing in results, if possible.
What is the best library to achieve this? I am looking for something similar to Elasticsearch, I guess. I don't have much experience with these kinds of libraries, so I would need something easy to include in my project and preferably open-source. I am using Python (Flask and SQLAlchemy) and Postgresql, but could also use e.g. Node.js, if needed.
*I also want to clarify what kind of similarity I am looking for. Ideally, it would be semantic similarity, but lexical similarity is fine as well. I would be happy with anything that works okay, is easy to implement, and is as scalable and performant as possible.
Example input sentence:
I don't like cangaroos.
Example suggestions from the database:
Cangaroos are not my favorite.
Cangaroos are evil.
I once had a cangaroo. Never again.
These suggestions should appear first because 'cangaroo' is not a frequent word in my database, so any string with the word 'cangaroo' should have a high chance appearing in results. It is probably much harder to detect 'don't like', so that part is completely optional for me.
P.s. Could PostgreSQL's full text search do something like this?
Thank you.
PostgreSQL Full-text search cannot do what you're looking for. However, PostgreSQL trigram similarity can do it.
You first need to install the packages with 'trigram similarity' and 'btree_gist', by executing (once) in your database:
CREATE EXTENSION pg_trgm;
CREATE EXTENSION btree_gist;
I assume you have one table that looks like this one:
CREATE TABLE sentences
(
sentence_id integer PRIMARY KEY,
sentence text
) ;
INSERT INTO sentences (sentence_id, sentence)
VALUES
(1, 'Cangaroos are not my favorite.'),
(2, 'A vegetable sentence.'),
(3, 'Cangaroos are evil.'),
(4, 'Again, some plants in my garden.'),
(5, 'I once had a cangaroo. Never again.') ;
This table needs a 'trigram index', to allow the PostgreSQL database to 'index by similarity'. This is accomplished by executing:
CREATE INDEX ON sentences USING GIST (sentence gist_trgm_ops, sentence_id) ;
To find the answers you're looking for, you execute:
-- Set the minimum similarity you want to be able to search
SELECT set_limit(0.2) ;
-- And now, select the sentences 'similar' to the input one
SELECT
similarity(sentence, 'I don''t like cangaroos') AS similarity,
sentence_id,
sentence
FROM
sentences
WHERE
/* That's how you choose your sentences:
% means 'similar to', in the trigram sense */
sentence % 'I don''t like cangaroos'
ORDER BY
similarity DESC ;
The result that you get is:
similarity | sentence_id | sentence
-----------+-------------+-------------------------------------
0.3125 | 3 | Cangaroos are evil.
0.2325 | 1 | Cangaroos are not my favorite.
0.2173 | 5 | I once had a cangaroo. Never again.
Hope this gives you what you want...

How to index a postgres table by name, when the name can be in any language?

I have a large postgres table of locations (shops, landmarks, etc.) which the user can search in various ways. When the user wants to do a search for the name of a place, the system currently does (assuming the search is on cafe):
lower(location_name) LIKE '%cafe%'
as part of the query. This is hugely inefficient. Prohibitively so. It is essential I make this faster. I've tried indexing the table on
gin(to_tsvector('simple', location_name))
and searching with
(to_tsvector('simple',location_name) @@ to_tsquery('simple','cafe'))
which works beautifully, and cuts down the search time by a couple of orders of magnitude.
However, the location names can be in any language, including languages like Chinese, which aren't whitespace delimited. This new system is unable to find any Chinese locations, unless I search for the exact name, whereas the old system could find matches to partial names just fine.
So, my question is: Can I get this to work for all languages at once, or am I on the wrong track?
If you want to optimize arbitrary substring matches, one option is to use the pg_trgm module. Add an index:
CREATE INDEX table_location_name_trigrams_key ON table
USING gin (location_name gin_trgm_ops);
This will break "Simple Cafe" into "sim", "imp", "mpl", etc., and add an entry to the index for each trigram in each row. The query planner can then automatically use this index for substring pattern matches, including:
SELECT * FROM table WHERE location_name ILIKE '%cafe%';
This query will look up "caf" and "afe" in the index, find the intersection, fetch those rows, then check each row against your pattern. (That last check is necessary since the intersection of "caf" and "afe" matches both "simple cafe" and "unsafe scaffolding", while "%cafe%" should only match one). The index becomes more effective as the input pattern gets longer since it can exclude more rows, but it's still not as efficient as indexing whole words, so don't expect a performance improvement over to_tsvector.
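The lookup-then-recheck behaviour described above can be modelled in a few lines of Python. This is a simplified sketch, not how pg_trgm is implemented: it scans a list instead of using an index, ignores word padding, and the function names are invented for the example.

```python
def pattern_trigrams(fragment):
    """Trigrams of a literal fragment, with no padding (as between two wildcards)."""
    return {fragment[i:i + 3] for i in range(len(fragment) - 2)}


def substring_search(fragment, rows):
    """Return (candidates, matches) for a '%fragment%' style search.

    candidates: rows containing every trigram of the fragment (the index step)
    matches:    candidates that really contain the fragment (the recheck step)
    """
    needle = fragment.lower()
    needed = pattern_trigrams(needle)
    candidates = [row for row in rows if needed <= pattern_trigrams(row.lower())]
    matches = [row for row in candidates if needle in row.lower()]
    return candidates, matches
```

Searching for 'cafe' over 'simple cafe', 'unsafe scaffolding', and 'tea house' yields the first two as candidates and only 'simple cafe' after the recheck. A fragment shorter than three characters produces no trigrams at all, so every row becomes a candidate.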
The catch is that trigrams don't work at all for patterns that are under three characters. That may or may not be a deal-breaker for your application.
Edit: I initially added this as a comment.
I had another thought last night when I was mostly asleep. Make a cjk_chars function that takes an input string, regexp_matches the entire CJK Unicode ranges, and returns an array of any such characters or NULL if none. Add a GIN index on cjk_chars(location_name). Then query for:
WHERE CASE
WHEN cjk_chars('query') IS NOT NULL THEN
cjk_chars(location_name) @> cjk_chars('query')
AND location_name LIKE '%query%'
ELSE
<tsvector/trigrams>
END
Ta-da, unigrams!
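A Python sketch of what such a cjk_chars helper and the containment check might look like is shown here. The Unicode ranges chosen (CJK Unified Ideographs, Hiragana/Katakana, Hangul syllables) are an assumption for illustration and are not exhaustive; the matches helper is hypothetical and mirrors the CASE expression above.

```python
import re

# Assumed ranges: CJK Unified Ideographs, Hiragana + Katakana, Hangul syllables.
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")


def cjk_chars(text):
    """Return the list of CJK characters in text, or None if there are none."""
    found = CJK_PATTERN.findall(text)
    return found or None


def matches(query, name):
    """Mirror the CASE expression: unigram containment plus a substring recheck.

    Returns None when the query has no CJK characters, meaning the caller
    should fall back to the tsvector or trigram strategy."""
    query_chars = cjk_chars(query)
    if query_chars is None:
        return None
    name_chars = cjk_chars(name) or []
    return set(query_chars) <= set(name_chars) and query in name
```

The containment test plays the role of the GIN lookup, and the substring test is the recheck that rejects names containing the right characters in the wrong order.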
For full text search in a multi-language environment you need to store the language each datum is in alongside the text itself. You can then use the language-specific flavours of the tsearch functions to get proper stemming, etc.
eg given:
CREATE TABLE location(
location_name text,
location_name_language regconfig -- regconfig (not text) so it can be passed to to_tsvector()
);
... plus any appropriate constraints, you might write:
CREATE INDEX location_name_ts_idx ON location
USING gin(to_tsvector(location_name_language, location_name));
and for search:
SELECT * FROM location WHERE to_tsvector(location_name_language, location_name) @@ to_tsquery('english', 'cafe');
Cross-language searches will be problematic no matter what you do. In practice I'd use multiple matching strategies: I'd compare the search term to the tsvector of location_name in the simple configuration and the stored language of the text. I'd possibly also use a trigram based approach like willglynn suggests, then I'd unify the results for display, looking for common terms.
It's possible you may find Pg's fulltext search too limited, in which case you might want to check out something like Lucene / Solr.
See:
* controlling full text search.
* tsearch dictionaries
Similar to what @willglynn already posted, I would consider the pg_trgm module. But preferably with a GiST index:
CREATE INDEX tbl_location_name_trgm_idx ON tbl
USING gist(location_name COLLATE "C" gist_trgm_ops);
The gist_trgm_ops operator class ignores case generally, and ILIKE is just as fast as LIKE. Quoting the source code:
Caution: IGNORECASE macro means that trigrams are case-insensitive.
I use COLLATE "C" here - which is effectively no special collation (byte order instead), because you obviously have a mix of various collations in your column. Collation is relevant for ordering or ranges, for a basic similarity search, you can do without it. I would consider setting COLLATE "C" for your column to begin with.
This index would lend support to your first, simple form of the query:
SELECT * FROM tbl WHERE location_name ILIKE '%cafe%';
Very fast.
Retains capability to find partial matches.
Adds capability for fuzzy search.
Check out the % operator and set_limit().
GiST index is also very fast for queries with LIMIT n to select n "best" matches. You could add to the above query:
ORDER BY location_name <-> 'cafe'
LIMIT 20
Read more about the "distance" operator <-> in the manual here.
Or even:
SELECT *
FROM tbl
WHERE location_name ILIKE '%cafe%' -- exact partial match
OR location_name % 'cafe' -- fuzzy match
ORDER BY
(location_name ILIKE 'cafe%') DESC -- exact beginning first
,(location_name ILIKE '%cafe%') DESC -- exact partial match next
,(location_name <-> 'cafe') -- then "best" matches
,location_name -- break remaining ties (collation!)
LIMIT 20;
I use something like that in several applications for (to me) satisfactory results. Of course, it gets a bit slower with multiple features applied in combination. Find your sweet spot ...
You could go one step further and create a separate partial index for every language and use a matching collation for each:
CREATE INDEX location_name_trgm_idx ON tbl
USING gist(location_name COLLATE "de_DE" gist_trgm_ops)
WHERE location_name_language = 'German';
-- repeat for each language
That would only be useful, if you only want results of a specific language per query and would be very fast in this case.

When will Postgres's full text search support phrase match and proximity match?

As of Postgres 8.4, the database's FTS does not support exact phrase matching, nor does it support proximity matching between two terms. For example, there is no way to tell Postgres to match content in which word #1 is within a specified proximity of word #2. Does anyone know Postgres's plans, and in which version phrase and proximity matching might be supported?
PostgreSQL 9.6 text search supports phrases now
select
*
from (values
('i heart new york'),
('i hate york new')
) docs(body)
where
to_tsvector(body) @@ phraseto_tsquery('new york')
(1 row retrieved)
or by distance between words:
-- a distance of exactly 2 "hops" between "quick" and "fox"
select
*
from (values
('the quick brown fox'),
('quick brown cute fox')
) docs(body)
where
to_tsvector(body) @@ to_tsquery('quick <2> fox')
(1 row retrieved)
http://linuxgazette.net/164/sephton.html
<snip>
Search Vectors
How does one turn document content into an array of lexemes using the parser and dictionaries? How does one match a search criterion to body text? PostgreSQL provides a number of functions to do this. The first one we will look at is to_tsvector().
A tsvector is an internal data type containing an array of lexemes with position information. The lexeme positions are used when searching, to rank the search result based on proximity and other information. One may control the ranking by labelling the different portions which make up the search document content, for example the title, body and abstract may be weighted differently during search by labelling these sections differently. The section labels, quite simply A,B,C & D, are associated with the tsvector at the time it is created, but the weight modifiers associated with those labels may be controlled after the fact.
</snip>
For full-phrase searching, see here.
The Postgresql website does not have a roadmap. Instead, you are referred to the Open Issues page. At the moment, this page makes no mention of full-phrase searching.