Levenshtein search postgres - postgresql

I want to have Full-Text Search without the stemming - it feels like too naive an approach.
What I want is:
select * from content order by levenshtein_distance(document, :query_string)
I want it to be blazing fast.
I still want to have my queries / documents split by spaces.
Would the GIN index on document still be relevant?
Is this possible? Would it be blazing fast?

There is no index support for that, so it would be slow.
Perhaps pg_trgm can help: it provides similarity matching that can be supported by a GIN index.
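To make the trigram idea concrete, here is a rough Python sketch of how a trigram similarity score can be computed. It approximates what pg_trgm's similarity() function does (lowercasing, padding each word with spaces, then comparing sets of 3-grams); the padding and formula here are a simplification, not the extension's exact implementation.

```python
def trigrams(s):
    """Split a string into 3-grams, roughly the way pg_trgm does:
    lowercase each word and pad it with spaces before slicing."""
    grams = set()
    for word in s.lower().split():
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams

def similarity(a, b):
    """Shared trigrams divided by total distinct trigrams (Jaccard-style)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

print(similarity("levenshtein", "levenstein"))  # high: most trigrams shared
print(similarity("levenshtein", "postgres"))    # low: almost nothing shared
```

Because the score depends only on which trigrams a string contains, an inverted (GIN) index over trigrams can narrow candidates before any per-row scoring, which is what makes it fast where a per-row levenshtein() call is not.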

Related

Are MongoDB "starts with" or "contains" queries as fast as a findById query?

I want to query using part of an id to get all the matching documents. So I tried "starts with" and "contains", which work fine, but is there any performance issue for large collections?
The best way to make this search optimal:
Add a $text index on the fields you want to search in. This is really important because internally it tokenizes your string so that you can search for part of it.
Use a regex, which is also quick.
If you are using aggregate, read the official MongoDB doc about aggregation optimization, which might help you implement this efficiently: https://docs.mongodb.com/manual/core/aggregation-pipeline-optimization/
Last but not least, if you are not yet fully committed to MongoDB and the project is fresh, look into Elasticsearch, which is based on Lucene. It is extremely powerful for these kinds of searches.

Alternative to Boolean OR in Sphinx?

I have/had a MySQL query that was pretty fast using IN, e.g.
FieldA in (X,Y,Z)
I've moved over to Sphinx, which is clearly much faster EXCEPT when using pipes in cases like this, e.g.
@(FieldA) (X|Y|Z)
where X|Y|Z are actually about 40 different values. The MySQL IN takes 0.3 seconds; the Sphinx query takes over a minute. Given how much faster Sphinx has proven to be, I am wondering if there is some 'IN' version for Sphinx with multiple values, versus |, which is clearly slowing it down.
Really it depends on a lot of things. For certain queries, changing to use an MVA might be better than using keywords (then you do have an IN() function)
... particularly if you have other search keywords.
Sphinx's full-text indexing is optimized for answering short user-entered queries. To answer a long 'OR'-style query, it has to load and merge each word list, and rank all of that. It's all overhead.
Attribute-based filtering, on the other hand, is generally pretty quick, particularly if you have a highly selective keyword query, which gives a relatively short list of potential matches.
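The cost difference the answer describes can be sketched in a few lines of Python. This is a toy model, not Sphinx internals: a long OR query must union one posting list per keyword, while an attribute filter is just one membership test per candidate document (the keyword names and doc ids are made up).

```python
# Toy posting lists: keyword -> sorted list of matching doc ids.
postings = {
    "x": [1, 4, 7],
    "y": [2, 4, 9],
    "z": [4, 5],
}

def or_query(keywords):
    """OR-style full-text query: load and union-merge one posting
    list per keyword. Work grows with the number of keywords."""
    result = set()
    for kw in keywords:
        result |= set(postings.get(kw, []))
    return sorted(result)

# Toy attribute store: doc id -> attribute value (like an MVA/attr).
attrs = {1: "x", 2: "y", 4: "x", 5: "z", 7: "x", 9: "y"}

def attr_filter(candidates, allowed):
    """Attribute filtering: one cheap check per candidate doc,
    applied to an already-short candidate list."""
    return [d for d in candidates if attrs.get(d) in allowed]

print(or_query(["x", "y", "z"]))      # [1, 2, 4, 5, 7, 9]
print(attr_filter([1, 2, 4], {"x"}))  # [1, 4]
```

With ~40 values, the OR path merges and ranks ~40 lists, while the attribute path only filters the documents a selective keyword query already produced, which is why moving the value set into an attribute helps.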

Fast Substring Search in Mongo

We have a Mongo database with about 400,000 entries, each of which has a relatively short (< 20 characters) title. We want to be able to do fast substring searches on these titles (fast enough to be able to use the results in things like autocomplete bars). We are also only searching for prefixes (does the title start with the substring). What can we do?
If you only do prefix searches, then indexing that field should be enough. Rooted regex queries use the index and should be fast.
Sergio is correct, but to be more specific: an index on that field plus a left-rooted regex without the i (case-insensitivity) flag will make efficient use of the index. This is noted in the docs, in fact:
http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-RegularExpressions
Don't forget to use .explain() if you want to benchmark the queries too.
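The reason a left-rooted, case-sensitive regex can use the index is that a fixed prefix defines one contiguous range in sorted order. Here is a small Python sketch of the same idea, using a sorted list and binary search in place of a B-tree index (the titles are made up):

```python
import bisect

titles = sorted(["alpha", "alphabet", "beta", "betamax", "gamma"])

def prefix_search(sorted_titles, prefix):
    """All titles starting with `prefix`, via binary search:
    the matches form one contiguous slice of the sorted list."""
    lo = bisect.bisect_left(sorted_titles, prefix)
    # "\uffff" acts as a cheap upper bound for strings extending the prefix
    hi = bisect.bisect_right(sorted_titles, prefix + "\uffff")
    return sorted_titles[lo:hi]

print(prefix_search(titles, "alpha"))  # ['alpha', 'alphabet']
print(prefix_search(titles, "beta"))   # ['beta', 'betamax']
```

A case-insensitive or non-anchored match breaks this property: the matches are no longer adjacent in index order, so the database falls back to scanning, which is exactly why the docs say to drop the i flag for fast prefix queries.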

Is mongoDB efficient in doing multi-key lookups?

I'm evaluating MongoDB, coming from Membase/memcached, because I want more flexibility.
Of course Membase is excellent in doing fast (multi)-key lookups.
I like the additional options that MongoDB gives me, but is it also fast in doing multi-key lookups? I've seen the $or and $in operators and I'm sure I can model it with those. I just want to know if it's performant, in the same league as Membase.
Use case: e.g., Lucene/Solr returns 20 product-ids; look these product-ids up in CouchDB to return docs / the appropriate fields.
Thanks,
Geert-Jan
For your use case, I'd say it is, from my experience: I hacked some analytics into a database of mine that made a lot of $in queries with thousands of ids (it was a hack). To my surprise, it worked rather well, in the lower-millisecond range.
Of course, it's hard to compare this, and, as usual, theory is a bad companion when it comes to performance. I guess the best way to figure it out is to migrate some test data and send some queries to the system.
Use MongoDB's excellent built-in profiler, use $explain, keep the one-index-per-query rule in mind, take a look at the logs, keep an eye on mongostat, and do some benchmarks. This shouldn't take too long and will give you a definitive answer. If your queries turn out slow, people here and on the newsgroup probably have some ideas for improving the exact query, or the indexing.
One index per query. It's sometimes thought that queries on multiple keys can use multiple indexes; this is not the case with MongoDB. If you have a query that selects on multiple keys, and you want that query to use an index efficiently, then a compound-key index is necessary.
http://www.mongodb.org/display/DOCS/Indexing+Advice+and+FAQ#IndexingAdviceandFAQ-Oneindexperquery.
There's more information on that page as well with regard to Indexes.
The bottom line is that Mongo will be great if your indexes fit in memory and you are indexing the columns you want to query, using compound keys. If you have poor indexing, your performance will suffer as a result. This is pretty much in line with most systems.
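The one-index-per-query rule is easiest to see with a toy model: a compound index is a single lookup structure keyed on the tuple of fields, which is why two separate single-field indexes can't stand in for it. A hypothetical Python sketch (the field names and documents are invented):

```python
docs = [
    {"_id": 1, "city": "Ghent", "year": 2010},
    {"_id": 2, "city": "Ghent", "year": 2011},
    {"_id": 3, "city": "Leuven", "year": 2010},
]

# Compound index: one key made of both fields, analogous to a
# MongoDB compound-key index on {city: 1, year: 1}.
compound = {}
for d in docs:
    compound.setdefault((d["city"], d["year"]), []).append(d["_id"])

# A query on both fields becomes a single direct lookup...
print(compound.get(("Ghent", 2010)))  # [1]

# ...whereas with only a single-field index, that index narrows the
# candidates and the second condition is checked per document.
by_city = {}
for d in docs:
    by_city.setdefault(d["city"], []).append(d)
matches = [d["_id"] for d in by_city["Ghent"] if d["year"] == 2010]
print(matches)  # [1]
```

Both paths return the same documents; the difference is that the compound key answers the two-field query in one step, while the single-field path leaves residual per-document filtering, which is the cost the quoted advice warns about.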

PostgreSQL: GIN or GiST indexes?

From what information I could find, they both solve the same problems - more esoteric operations like array containment and intersection (&&, @>, <@, etc.). However, I would be interested in advice about when to use one or the other (or possibly neither).
The PostgreSQL documentation has some information about this:
GIN index lookups are about three times faster than GiST
GIN indexes take about three times longer to build than GiST
GIN indexes are about ten times slower to update than GiST
GIN indexes are two-to-three times larger than GiST
However, I would be particularly interested to know if there is a performance impact when the memory-to-index-size ratio starts getting small (i.e. the index size becomes much bigger than the available memory). I've been told on the #postgresql IRC channel that GIN needs to keep the whole index in memory, otherwise it won't be effective, because, unlike a B-tree, it doesn't know which part to read in from disk for a particular query. The question would be: is this true (I've also been told the opposite)? Does GiST have the same restrictions? Are there other restrictions I should be aware of while using one of these indexing algorithms?
First of all, do you need to use them for text-search indexing? GIN and GiST are indexes specialized for certain data types. If you need to index simple char or integer values, then a normal B-tree index is best.
Anyway, the PostgreSQL documentation has a chapter on GiST and one on GIN, where you can find more info.
And, last but not least, the best way to find out which is best is to generate sample data (as much as you need for a realistic scenario), create a GiST index, and measure how much time is needed to create the index, insert a new value, and execute a sample query. Then drop the index and do the same with a GIN index. Compare the values and you will have the answer you need, based on your data.