When implementing search functionality in PostgreSQL, we have the option of using PostgreSQL Full-Text Search (FTS) or pattern matching (LIKE), and both can be supported by indexes to optimize queries.
e.g.
gin (to_tsvector('language', text)) for tsvector
gin (text gin_trgm_ops) for pattern matching
I am wondering when we would want to use full-text search versus pattern matching in general.
Also, if we don't need language stemming, is there still value in using tsvector?
Full-text search is what you need if you are looking for whole words in natural language texts. It supports prefix search and searching for the parts of hyphenated words, but it does not support pattern matching or similarity search.
Trigram indexes are useful if you search for patterns or similarity. They have no support for stemming (e.g., finding wolves when you search for wolf) or stop words, and the indexes are normally larger.
To avoid stemming and stop word removal with full-text search, you can use the simple text search configuration.
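For illustration, a minimal sketch of that in Java via JDBC (the documents(id, body) table and the connection details are made up) might look like this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class SimpleFtsExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details and a hypothetical table documents(id bigint, body text)
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "myuser", "mypassword")) {

            // 'simple' configuration: lowercases and tokenizes, but does no stemming
            // and removes no stop words
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE INDEX IF NOT EXISTS documents_fts_idx ON documents "
                         + "USING gin (to_tsvector('simple', body))");
            }

            // Whole-word search; the expression must match the index expression
            // for the GIN index to be used
            String sql = "SELECT id FROM documents "
                       + "WHERE to_tsvector('simple', body) @@ plainto_tsquery('simple', ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "wolf");  // finds rows containing the word 'wolf', not 'wolves'
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getLong("id"));
                    }
                }
            }
        }
    }
}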
Related
I want to have Full-Text Search without the stemming - it feels like too naive an approach.
What I want is:
select * from content order by levenshtein(document, :query_string)
I want it to be blazing fast.
I still want to have my queries / documents split by spaces.
Would the GIN index on document still be relevant?
Is this possible? Would it be blazing fast?
There is no index support for that, so it would be slow.
Perhaps pg_trgm can help; it provides similarity matching that can be supported by a GIN index.
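A rough sketch of that (again Java/JDBC, with a hypothetical content(document) table and made-up connection details, assuming the pg_trgm extension can be installed) could be:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class TrigramSimilarityExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "myuser", "mypassword")) {

            try (Statement st = conn.createStatement()) {
                st.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm");
                // GIN trigram index; it supports LIKE/ILIKE patterns and the % similarity operator
                st.execute("CREATE INDEX IF NOT EXISTS content_trgm_idx ON content "
                         + "USING gin (document gin_trgm_ops)");
            }

            // The % operator filters by similarity using the index;
            // similarity() then ranks the remaining candidates.
            String sql = "SELECT document, similarity(document, ?) AS sml "
                       + "FROM content WHERE document % ? "
                       + "ORDER BY sml DESC LIMIT 10";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "wolf hunting");
                ps.setString(2, "wolf hunting");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.printf("%.2f  %s%n", rs.getFloat("sml"), rs.getString("document"));
                    }
                }
            }
        }
    }
}

Note that the % filter is what the GIN index can support; similarity() only ranks the rows that survive the filter. If you need true nearest-neighbour ordering, a GiST trigram index supports ordering by the <-> distance operator directly.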
I have divided multiple sentences into clauses (like A, B, C, ..., Z).
Now I want to search for computer and mouse in these clauses such that they lie within a range of 3 clauses. I know that this can be done in application code, but that would be slow, and mine is not a one-time process. I want to use it in a search engine, so I am trying to find out whether there is any existing database that has this as built-in functionality, or something close to it.
Since you've tagged this with Solr, the regular Lucene syntax for this would be:
"computer mouse"~2
(this means that there can be up to two tokens between the terms).
If you're using the dismax or edismax query syntax in Solr, you can use the phrase slop setting (ps) to say the same thing.
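For illustration, a sketch with SolrJ (the clauses core and the text field are made-up names, assuming each group of clauses is indexed as one Solr document):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ClauseProximityQuery {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/clauses").build()) {

            // Plain Lucene syntax: a phrase with a slop of 2 (up to two tokens between the terms)
            SolrQuery lucene = new SolrQuery("text:\"computer mouse\"~2");
            QueryResponse r1 = solr.query(lucene);
            System.out.println(r1.getResults().getNumFound());

            // Same idea with edismax: qf names the searched field,
            // pf/ps add a phrase-proximity component with slop 2 on that field
            SolrQuery edismax = new SolrQuery("computer mouse");
            edismax.set("defType", "edismax");
            edismax.set("qf", "text");
            edismax.set("pf", "text");
            edismax.set("ps", "2");
            QueryResponse r2 = solr.query(edismax);
            System.out.println(r2.getResults().getNumFound());
        }
    }
}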
I want to perform a LIKE search (e.g., all words containing 'abc', i.e., %abc%), but by using the Hibernate Search API.
Is there a way to do it using the existing analyzers?
If so, which is better in terms of performance for this case: SQL or Hibernate Search?
Maybe have a look at this:
http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/search/RegexpQuery.html?is-external=true
But note this:
"Note this query can be slow, as it needs to iterate over many terms. In order to prevent extremely slow RegexpQueries, a Regexp term should not start with the expression .*"
This should be included in Hibernate-Search
Correct, Hibernate Search is much more efficient for this than using a SQL LIKE criterion.
The StandardAnalyzer (org.apache.lucene.analysis.standard.StandardAnalyzer) is a good fit; other analyzers will do more advanced text splitting.
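A sketch against the Hibernate Search 5.x API might look like the following; the entity type and field name are placeholders for whatever you have mapped with @Indexed and @Field:

import java.util.List;

import org.apache.lucene.search.Query;
import org.hibernate.Session;
import org.hibernate.search.FullTextSession;
import org.hibernate.search.Search;
import org.hibernate.search.query.dsl.QueryBuilder;

public class ContainsSearch {

    // Runs the Hibernate Search counterpart of LIKE '%fragment%' on one indexed field.
    public static List<?> findContaining(Session session, Class<?> entityType,
                                         String field, String fragment) {
        FullTextSession fullTextSession = Search.getFullTextSession(session);

        QueryBuilder qb = fullTextSession.getSearchFactory()
                .buildQueryBuilder().forEntity(entityType).get();

        // Wildcard terms bypass the analyzer in this DSL, so lower-case manually
        // to line up with what StandardAnalyzer put into the index.
        // A leading wildcard is expensive (the same caveat as for RegexpQuery).
        Query luceneQuery = qb.keyword()
                .wildcard()
                .onField(field)
                .matching("*" + fragment.toLowerCase() + "*")
                .createQuery();

        return fullTextSession.createFullTextQuery(luceneQuery, entityType).list();
    }
}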
I need to segment words from a text. Sometimes hyphenated words are written without hyphens, and apostrophe words are written without the apostrophe. There are also similar issues, like different spellings of the same word (e.g., color, colour), or single words written either with or without spaces between their parts (e.g., up to, upto, blank space, blankspace). I need to group these variants under one single representation and insert it into a set/hashmap or some other place. There can also be problems with accented words written without the accented characters (although I haven't faced them yet). Currently I am cutting the words at every blank-space character and every non-alphanumeric character, then stemming them and omitting stop words.
These indexes would later be used for document similarity checking, searching, etc. Any suggestions on how I can combat these problems? I have thought of matching each scanned word against a word list, but the problem is that proper nouns and non-dictionary words would be omitted.
Info: My code is in Java
I think you should apply a combination of techniques.
1) For common spelling variants I would go with a dictionary-based method. Since they are common, I wouldn't worry about missing non-dictionary words. That should solve the color/colour problem.
2) For typos and other non-standard spelling variants you can apply the Metaphone (http://en.wikipedia.org/wiki/Metaphone) algorithm to convert the tokens to a representation of their English pronunciations. Similar variants sound similar, thus you can match them to each other (e.g., Jon to John). You can also use edit-distance-based matching algorithms during the query to match very similar tokens with only a pair of characters transposed or a single character dropped (e.g., Huseyin versus Housein).
3) For apostrophe and compound words with hyphen in between, you can store both variants. For example, "John's" would be indexed both as "John s" and "Johns". "blank-space" can be converted to (or stored along with) "blank space" and "blankspace".
4) For compound words without any hyphen in between, you could use an external library such as HyphenationCompoundWordTokenFilterFactory class of Solr (http://lucene.apache.org/solr/api/org/apache/solr/analysis/HyphenationCompoundWordTokenFilterFactory.html). Although it can use a dictionary, it doesn't have to. It is targeted to deal with compound words that are frequently encountered in German and similar languages. I see no reason why you can't apply it to English (you'll need to supply an English dictionary and hyphenation rule files).
Actually, the last point raises an important question. I assume you do not want to build your own search library from scratch. If that's true, why don't you use Lucene (or Solr, which is based on Lucene), a Java-based search library that already has the methods and means to deal with these problems? For example, the injection technique allows you to index both color and colour in the same place in a document; thus it doesn't matter whether you search for "colored cars" or "coloured cars" (assuming you take care of stemming). There are filters which do phonetic indexing (http://lucene.apache.org/solr/api/org/apache/solr/analysis/PhoneticFilterFactory.html). There is even a FuzzyQuery component which lets you allow a certain amount of edit distance when matching similar terms (http://lucene.apache.org/core/old_versioned_docs/versions/3_2_0/api/all/org/apache/lucene/search/FuzzyQuery.html).
You will also need to decide at which point you want to deal with these problems: One extreme approach is to index all possible variants of these terms during the indexing and use the queries as they are. That will keep your query processing light, but will cost you a larger index (because of all the variants you need to store). The other extreme is to index the documents as they are and expand the queries during the searching. That will allow you to keep your index lean at the cost of heavier query processing. Phonetic indexing would require you to process both your documents during the indexing and the queries during the search. Fuzzy matching would be feasible only during the search time because, presumably, you wouldn't be able to store all edit variants of all terms in the index.
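To make points 1-3 concrete, here is a minimal sketch built on Apache Commons Codec's Metaphone; the class name and dictionary entries are just examples, and a real variant dictionary would be far larger:

import java.util.HashMap;
import java.util.Map;

import org.apache.commons.codec.language.Metaphone;

public class TokenNormalizer {

    // Point 1: a tiny dictionary of common spelling variants (entries are only examples)
    private static final Map<String, String> SPELLING_VARIANTS = new HashMap<>();
    static {
        SPELLING_VARIANTS.put("colour", "color");
        SPELLING_VARIANTS.put("grey", "gray");
    }

    private static final Metaphone METAPHONE = new Metaphone();

    // Produces one canonical key under which all variants of a token are grouped,
    // e.g. for use as the key of a HashMap or a member of a HashSet.
    public static String canonicalKey(String rawToken) {
        String token = rawToken.toLowerCase()
                .replace("'", "")   // point 3: John's -> johns
                .replace("-", "");  // point 3: blank-space -> blankspace

        token = SPELLING_VARIANTS.getOrDefault(token, token);  // point 1: colour -> color

        // Point 2: collapse pronunciation variants (Jon / John) onto one phonetic code
        return METAPHONE.encode(token);
    }

    public static void main(String[] args) {
        System.out.println(canonicalKey("Jon").equals(canonicalKey("John")));      // true
        System.out.println(canonicalKey("John's").equals(canonicalKey("Johns")));  // true
        System.out.println(canonicalKey("colour").equals(canonicalKey("color")));  // true
    }
}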
Does MongoDB support soundex or fuzzy matching? I want to spot dupes of basic contact name and address fields. I'm using the official C# driver. Thanks
MongoDB doesn't support Soundex matching, but it has full-text search.
Also:
"You can always just store the soundex-encoded string in a separate field in mongo and search against that. Soundex is a really trivial algorithm and should only take a handful of lines."
-- from mongodb-user
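You are on the C# driver, but the idea is driver-agnostic. As a sketch, here it is with the MongoDB Java driver and Commons Codec's Soundex (database, collection, and field names are made up):

import org.apache.commons.codec.language.Soundex;
import org.bson.Document;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;

public class SoundexDupeCheck {
    public static void main(String[] args) {
        Soundex soundex = new Soundex();

        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> contacts =
                    client.getDatabase("crm").getCollection("contacts");
            // An index on the code field keeps the dupe lookup fast
            contacts.createIndex(Indexes.ascending("nameSoundex"));

            // At insert time, store the Soundex code next to the original value
            String name = "Jon Smith";
            contacts.insertOne(new Document("name", name)
                    .append("nameSoundex", soundex.encode(name)));

            // A candidate duplicate is any document carrying the same code
            String candidate = "John Smith";
            for (Document d : contacts.find(Filters.eq("nameSoundex", soundex.encode(candidate)))) {
                System.out.println("possible dupe: " + d.getString("name"));
            }
        }
    }
}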
MongoDB does not support real full-text search, nor anything like Soundex (which is a poor approach for matching terms anyway - something like a Levenshtein distance calculation is much better).
In addition, look at my last comment here:
Full-text search in NoSQL databases
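To illustrate the difference, a small sketch with Commons Codec's Soundex and Commons Text's LevenshteinDistance (the library choice is just for illustration):

import org.apache.commons.codec.language.Soundex;
import org.apache.commons.text.similarity.LevenshteinDistance;

public class SoundexVsLevenshtein {
    public static void main(String[] args) {
        Soundex soundex = new Soundex();
        LevenshteinDistance levenshtein = LevenshteinDistance.getDefaultInstance();

        // Soundex collapses quite different names onto the same four-character code...
        System.out.println(soundex.encode("Robert"));  // R163
        System.out.println(soundex.encode("Rupert"));  // R163 as well

        // ...whereas an edit distance gives a graded measure of how close two strings are
        System.out.println(levenshtein.apply("Robert", "Rupert"));   // 2
        System.out.println(levenshtein.apply("Robert", "Roberts"));  // 1
    }
}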