Is there a way to prevent words shorter than a specified value end up in tsvector? MySQL has the ft_min_word_len option, is there something similar for PostgreSQL?
The short answer would be no.
PostgreSQL's full text search (formerly the tsearch2 module) uses dictionaries to normalize the text:
12.6. Dictionaries
Dictionaries are used to eliminate words that should not be considered
in a search (stop words), and to normalize words so that different
derived forms of the same word will match. A successfully normalized
word is called a lexeme.
See also how the dictionaries are used during parsing and lexing.
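There is no ft_min_word_len equivalent, but since PostgreSQL 9.6 you can post-process a tsvector with unnest() and rebuild it. An untested sketch (note that the original lexeme positions are lost when the vector is reassembled):

```sql
-- Rebuild a tsvector keeping only lexemes of at least 4 characters.
SELECT to_tsvector('simple',
         array_to_string(ARRAY(
           SELECT lexeme
           FROM unnest(to_tsvector('simple', 'an ox ate the elephants'))
           WHERE length(lexeme) >= 4
         ), ' '));
```

Whether this is worth the extra cost over just letting short lexemes into the vector depends on your data.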
I'm using PostgreSQL 13 and my problem was easily solved with the @> operator, like this:
select id from documents where keywords @> '{"winter", "report", "2020"}';
meaning that keywords array should contain all these elements. Also I've created a GIN index on this column.
Is it possible to achieve similar behavior even if I provide my request like '{"re", "202", "w"}' ? I heard that ngrams have semantics like this, but "intersection" capabilities of arrays are crucial for me.
In your example, the matches are all prefixes. Is that the general rule here? If so, you would probably want to use the prefix-match feature of full text search, not trigrams. It would require you to reformat your data, or at least your query.
select * from
(values (to_tsvector('simple','winter report 2020'))) f(x)
where x @@ 're:* & 202:* & w:*'::tsquery;
If the strings can contain punctuation which you want preserved, you would need to take pains to properly format them into a quoted tsvector yourself rather than just letting to_tsvector deal with it. Using 'simple' config gets rid of the stemming and stop word removal features, which would interfere with what you want to do.
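If this needs to be fast over many rows, the tsvector expression can be indexed. One wrinkle: array_to_string is only STABLE, not IMMUTABLE, so it must be wrapped before it can appear in an index expression. A sketch, assuming the documents/keywords names from your question:

```sql
-- Immutable wrapper so the expression is allowed in an index.
CREATE FUNCTION immutable_array_to_string(text[]) RETURNS text
  LANGUAGE sql IMMUTABLE AS
  $$ SELECT array_to_string($1, ' ') $$;

CREATE INDEX documents_keywords_fts_idx ON documents
  USING gin (to_tsvector('simple', immutable_array_to_string(keywords)));

-- The query must use the same expression for the index to be considered:
SELECT id FROM documents
WHERE to_tsvector('simple', immutable_array_to_string(keywords))
      @@ 're:* & 202:* & w:*'::tsquery;
```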
Is it possible to make postgres to_tsvector consider only words which occur more than N times in the table?
The only option I can see is to calculate the word frequencies myself beforehand and then construct a dictionary from that list which replaces each rare word with an empty string. Is there any more elegant solution in the configurations?
There is no dynamic solution. You have to write a stopword file.
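The stopword file can at least be generated from the data itself rather than by counting frequencies by hand. ts_stat reports per-word document counts, so something like this (table and column names are placeholders) lists the words occurring in fewer than N documents, ready to be saved as a .stop file for a custom dictionary:

```sql
-- Words appearing in fewer than 5 documents, one per line.
SELECT word
FROM ts_stat('SELECT to_tsvector(''english'', body) FROM documents')
WHERE ndoc < 5
ORDER BY word;
```

You would still have to regenerate the file and reindex as the data changes; there is no way to make this dynamic.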
I have a character varying field in postgres containing a 1-white-space-separated set of strings. E.g.:
--> one two three <--
--> apples bananas pears <--
I put --> and <-- to show where the strings start and end (they are not part of the stored string itself)
I need to query this field to find out if the whole string contains a certain word (apple for instance). A possible query would be
SELECT * FROM table WHERE thefield LIKE '%apple%'
But that performs poorly and won't scale, as B-tree indexes only help when the pattern is anchored to the beginning of the string, while in my case the searched word could be positioned anywhere in the field.
How would you recommend approaching the problem?
Consider database-normalization first.
While working with your current design, support the query with a trigram index, that will be pretty fast.
More details and links in this closely related answer:
PostgreSQL LIKE query performance variations
Even more about pattern matching and indexes in this related answer on dba.SE:
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
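For reference, the trigram index mentioned above boils down to this (the table name tbl and index name are illustrative):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX thefield_trgm_idx ON tbl USING gin (thefield gin_trgm_ops);

-- Unlike a B-tree index, this can serve unanchored patterns:
SELECT * FROM tbl WHERE thefield LIKE '%apple%';
```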
I'm a little bit confused with the whole concept of PostgreSQL, full text search and Trigram. In my full text search queries, I'm using tsvectors, like so:
SELECT * FROM articles
WHERE search_vector @@ plainto_tsquery('english', 'cat, bat, rat');
The problem is, this method doesn't account for misspelling. Then I started to read about Trigram and pg_trgm:
Looking through other examples, it seems like either trigrams or tsvectors are used, but never both. So my questions are: Are they ever used together? If so, how? Does trigram replace full text search? Are trigrams more accurate? And how do trigrams perform?
They serve very different purposes.
Full Text Search is used to return documents that match a search query of stemmed words.
Trigrams give you a method for comparing two strings and determining how similar they look.
Consider the following examples:
SELECT 'cat' % 'cats'; --true
The above returns true because 'cat' is quite similar to 'cats' (as dictated by the pg_trgm limit).
SELECT 'there is a cat with a dog' % 'cats'; --false
The above returns false because % looks for similarity between the two entire strings, not for the word cats within the string.
SELECT to_tsvector('there is a cat with a dog') @@ to_tsquery('cats'); --true
This returns true because to_tsvector transformed the string into a list of stemmed words and ignored a bunch of common words (stop words, like 'is' and 'a')... then searched for the stemmed version of cats.
It sounds like you want to use trigrams to auto-correct your tsquery, but that is not really possible (not in any efficient way, anyway). They do not really know a word is misspelt, just how similar it might be to another word. They could be used to search a table of words to try and find similar words, allowing you to implement a "did you mean..." type feature, but this would require maintaining a separate table containing all the words used in your search field.
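A sketch of that separate-table approach, combining ts_stat (to harvest the words) with pg_trgm's similarity operators. All names here are illustrative, reusing articles/search_vector from the question:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Harvest every lexeme that actually occurs in the search vectors.
CREATE TABLE search_words AS
  SELECT word FROM ts_stat('SELECT search_vector FROM articles');

CREATE INDEX search_words_trgm_idx ON search_words
  USING gin (word gin_trgm_ops);

-- "Did you mean ...?" candidates for a misspelled term:
SELECT word
FROM search_words
WHERE word % 'katz'
ORDER BY word <-> 'katz'
LIMIT 3;
```

The table would need periodic refreshing as articles change, which is part of the maintenance cost mentioned above.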
If you have some commonly misspelt words/phrases that you want the text index to match, you might want to look at synonym dictionaries.
Is there a built-in method (I can't find it by searching the documentation) to see the number of common letters in two strings? The order of the letters is not relevant, so comparing "abc" to "cad" would be a 66% match, for the characters 'a' and 'c'. The number of occurrences is also relevant: 'a' should match the first time around, but not the second, since there is only one common 'a' between the two strings. Is there a built-in way to do this, perhaps using some bitwise operation, or do I have to loop and compare manually?
You will have to build this yourself, but here is a shortcut for doing it. There is a built-in collection class called NSCountedSet. This object keeps each unique object and a count of how many of each were added.
You can take the two strings and load their characters into two different NSCountedSet collections. Then just check the items in the resulting collections. For example, grab an object from the first NSCountedSet and check whether it exists in the second NSCountedSet. The smaller of the 2 counts for that particular letter is how many of those letters the 2 strings have in common. To shorten the number of iterations, start with the collection with fewer objects and then enumerate through those objects.
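An untested sketch of that approach (variable and helper names are mine, not framework APIs):

```objc
#import <Foundation/Foundation.h>

// Load each string's characters into an NSCountedSet.
static NSCountedSet *countedCharacters(NSString *s) {
    NSCountedSet *set = [NSCountedSet set];
    for (NSUInteger i = 0; i < s.length; i++) {
        [set addObject:@([s characterAtIndex:i])];
    }
    return set;
}

int main(void) {
    @autoreleasepool {
        NSCountedSet *a = countedCharacters(@"abc");
        NSCountedSet *b = countedCharacters(@"cad");

        NSUInteger common = 0;
        for (NSNumber *ch in a) {
            // The smaller of the two counts is how many of that
            // letter the two strings have in common.
            common += MIN([a countForObject:ch], [b countForObject:ch]);
        }
        NSLog(@"%lu", (unsigned long)common); // 2: one 'a' and one 'c'
    }
    return 0;
}
```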
Here is Apple's Documentation for NSCountedSet.
https://developer.apple.com/library/ios/#documentation/Cocoa/Reference/Foundation/Classes/NSCountedSet_Class/Reference/Reference.html
I am hesitant to say but, there is probably no method out there that fills your requirements. I'd do this:
Create a category on NSString. Let's call it -(float)percentageOfSimilarCharactersForString:(NSString*)targetString
Here's a rough pseudocode that goes into this category:
Make a copy of self called selfCopy and trim selfCopy to contain only unique characters.
Similarly, trim targetString to unique characters. For trimming to unique characters, you could utilize NSSet or a subclass thereof. Looping over each character and adding it to a set would do the trick.
Now sort both sets by ASCII values.
Loop through each character of the targetString-related NSSet and check for its presence in the selfCopy-related NSSet. For this you could use another category called containsString. You can find that here. Every time containsString returns true, increment a pre-defined counter.
Your return value would be (counter_value/length_of_selfCopy)*100.
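Putting the pseudocode together, a rough, untested sketch of the category (note that the sorting step turns out to be unnecessary once set lookups are used):

```objc
#import <Foundation/Foundation.h>

@interface NSString (Similarity)
- (float)percentageOfSimilarCharactersForString:(NSString *)targetString;
@end

@implementation NSString (Similarity)

// Builds the set of unique characters in a string (steps 1 and 2).
static NSMutableSet *uniqueCharacters(NSString *s) {
    NSMutableSet *set = [NSMutableSet set];
    for (NSUInteger i = 0; i < s.length; i++) {
        [set addObject:@([s characterAtIndex:i])];
    }
    return set;
}

- (float)percentageOfSimilarCharactersForString:(NSString *)targetString {
    NSMutableSet *selfCopy = uniqueCharacters(self);
    NSMutableSet *targetChars = uniqueCharacters(targetString);

    // Set membership tests replace both the sorting step and the
    // containsString category from the pseudocode.
    NSUInteger counter = 0;
    for (NSNumber *ch in targetChars) {
        if ([selfCopy containsObject:ch]) {
            counter++;
        }
    }
    return ((float)counter / (float)selfCopy.count) * 100.0f;
}

@end
```

Note that because duplicates are trimmed, this measures unique common characters, unlike the NSCountedSet approach in the other answer, which also respects the number of occurrences.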