Can I find text that's "close" to some query with PostgreSQL? - postgresql

I have a table in my DB called text. It will contain something like "this is an example of lite coin". I want to query this for "litecoin" and things that are close (like "lite coin"). Is there some way to do this generically, as I will have multiple queries? Maybe something with a max Levenshtein distance?

There is a core extension to PostgreSQL (fuzzystrmatch) which implements the Levenshtein distance. For strings of very unequal length, as in your example, the distance will of necessity be large, so you would have to implement some normalization method unless all phrases being searched are the same length.
I don't think Levenshtein is indexable. You could instead look into trigram distance (the pg_trgm extension), which is indexable.
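To make the trigram idea concrete, here is a minimal Python sketch of a trigram similarity score. It only approximates what pg_trgm does (lowercase, split into words, pad each word, compare the sets of 3-character substrings); the function names and the exact padding are assumptions made for illustration, not the extension's API.

```python
import re

def trigrams(text: str) -> set:
    """Collect 3-character substrings of each lowercased, space-padded word."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # assumed padding: two spaces before, one after
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

if __name__ == "__main__":
    for candidate in ("lite coin", "bitcoin"):
        print(candidate, round(trigram_similarity("litecoin", candidate), 3))
```

Unlike raw Levenshtein distance, this score is already normalized by the size of the two trigram sets.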

+1 on the trigram suggestion. Trigrams in Postgres are excellent and, for sure, indexable. Depending on the index option you choose (GIN or GiST), you get access to different operators. If I remember correctly off the top of my head, GiST gives you distance tolerances for the words, and lets you search for them in order. You can specify the number of words expected between two search words, and more. (If I'm remembering correctly.) Both GIN and GiST are worth experimenting with.
Levenshtein compares two specific strings, so it doesn't lend itself to indexing. What would you index? The comparison string is unknown in advance. You could index every string by every string in a column and, apart from the O(aaaargh!) complexity, you still might not have anything like your search string in the index.
Tip: If you must use Levenshtein, and it is pretty great where it's useful, you can eliminate many rows from your comparison cheaply. The distance between two strings is never smaller than the difference in their lengths, so if you've got a 10-character search string and only want strings within a distance of 2, you can eliminate everything shorter than 8 or longer than 12 characters without fear of losing any matches.
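The length-based elimination in the tip can be sketched in Python as follows. The helper names are hypothetical, and the Levenshtein routine is a plain dynamic-programming version written for the example rather than PostgreSQL's implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic row-by-row dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def within_distance(query, candidates, max_distance):
    """Return candidates within max_distance, skipping hopeless lengths first."""
    matches = []
    for candidate in candidates:
        # Cheap check: the distance is at least the difference in length.
        if abs(len(candidate) - len(query)) > max_distance:
            continue
        if levenshtein(query, candidate) <= max_distance:
            matches.append(candidate)
    return matches

if __name__ == "__main__":
    rows = ["lite coin", "light coin", "bitcoin",
            "this is an example of lite coin"]
    print(within_distance("litecoin", rows, 2))
```

The long phrase is rejected by the length check alone, which is also why a whole-column Levenshtein comparison does not help for finding a short query inside longer text.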
You might find that you want to apply Levenshtein (or Jaccard, etc.) to possible matches found by the trigrams. But, honestly, Levenshtein is, by nature, biased towards strings in the same order. That's okay for lite coin/light coin/litecoin, but not helpful when the words can be in any order, like with first and last name, much address data, and many, many phrase-like searches.
The other thing to consider, depending on your range of queries, is full text search with tsvectors. These are also indexable, and also support a range of operators.

Related

number to word & word to number support with full text search

What kind of approaches are available or advisable for handling number to word matches or the reverse, word to number matches?
E.g. query: "600" should match "six hundred" and conversely a query of "six hundred" should match "600".
I can think of manual ways to create lexemes for normalized forms of each representation and store that on an indexed field, so I'm not so interested in hearing about that, rather I'm curious:
How are others solving this problem generally, perhaps manually like I just mentioned?
Are there default postgres search features to help/support this? If not, perhaps there are for elasticsearch?
Other relevant feedback my question doesn't encapsulate but is important to this topic for myself and other readers to consider.

Text equality operator performance in Postgresql

How does this query work in terms of string comparison performance (assume there is a standard B-tree index on last_name)?
select * from employee where last_name = 'Wolfeschlegelsteinhausenbergerdorff';
So as it walks the B-tree, I am assuming it doesn't do a linear search on each character in the last_name field. E.g., it doesn't start by checking that the first letter is a W... Assuming it doesn't do a linear comparison, what does it do?
I ask because I am considering writing my own duplicate prevention mechanism, but I want the performance to be sound. I was originally thinking of hashing each string (into some primitive datatype, probably a Long) that is coming in through an API, and storing the hash codes in a set/cache (each entry expires after 5 minutes). Any collisions would/could prompt a true duplicate check, where the already processed strings are stored in PostgreSQL. But I'm thinking, would it be better to simply query PostgreSQL instead of maintaining my own memory-based set of hashes that flushes old entries after 5-10 minutes? I would probably use Redis for scalability since multiple nodes will be reading different streams. Is my set of memory-cached hash codes going to be faster than just querying indexed PostgreSQL string columns (full text matching, not text searching)?
When strings are compared for equality, the function texteq is called.
If you look up the function in src/backend/utils/adt/varlena.c, you will find that the comparison is made using the C library function memcmp. I doubt that you can get faster than that.
When you look up the value in a B-tree index, it will be compared to the values stored in each index page on the path from the root page to the leaf page, which is at most 5 or 6 pages.
Frankly, I doubt that you can manage to be faster than that, but I wish you luck trying.
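As a rough back-of-the-envelope check of the "5 or 6 pages" figure, the sketch below estimates how many levels a B-tree needs for a given number of rows. The fan-out of 100 entries per page is purely an illustrative assumption; the real number depends on key length and page size, and long surnames like the one above would lower it.

```python
def btree_levels(row_count: int, entries_per_page: int) -> int:
    """Smallest number of levels whose capacity covers row_count entries."""
    levels = 1
    capacity = entries_per_page
    while capacity < row_count:
        capacity *= entries_per_page  # each extra level multiplies capacity
        levels += 1
    return levels

if __name__ == "__main__":
    # Assumed fan-out of 100 entries per page (hypothetical).
    for rows in (1_000_000, 1_000_000_000):
        print(rows, "rows ->", btree_levels(rows, 100), "levels")
```

Because the depth grows logarithmically, even very large tables need only a handful of page visits per equality lookup.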

Alternative to Boolean OR in Sphinx?

I have/had a MySQL query that was pretty fast using IN, e.g.
FieldA in (X,Y,Z)
I've moved over to Sphinx which is clearly much faster EXCEPT when using pipes in case like this e.g.
#(FieldA) (X|Y|Z)
Where X|Y|Z are actually about 40 different values. The MySQL IN takes 0.3 seconds; the Sphinx query takes over a minute. Given how much faster Sphinx has proven to be, I am wondering if there is some 'IN' version for Sphinx with multiple values, instead of |, which is clearly slowing it down.
Really it depends on a lot of things. For certain queries, switching to an MVA (multi-value attribute) might be better than using keywords (there you do have an 'IN' function)
... particularly if you have other search keywords.
Sphinx's full-text indexing is optimized for answering short user-entered queries. To answer a long 'OR' style query, it has to load and merge each wordlist, and rank all of that. It's all overhead.
Whereas attribute-based filtering is generally pretty quick, particularly if you have a highly selective keyword query, which gives a relatively short list of potential matches.

Segmenting words, and grouping hyphenated and apostrophe words from text

I need to segment words from a text. Sometimes hyphenated words are written without hyphens, and words with apostrophes are written without the apostrophe. There are also similar issues such as different spellings of the same word (e.g. color, colour), or single words that are written with spaces in them (e.g. up to, upto, blankspace, blank space). I need to group these variants under one single representation and insert it into a set/hashmap or some other place. There can also be problems with accented words written without the accents (although I haven't faced them yet). Currently I am cutting the words at every whitespace and non-alphanumeric character, then stemming them and omitting stop words.
These indexes would later be used for document similarity checking, searching, etc. Any suggestions on how I can combat these problems? I have thought of matching each scanned word against a wordlist, but the problem is that proper nouns and non-dictionary words would be omitted.
Info: My code is in Java
I think you should apply a combination of techniques.
1) For common spelling variants I would go with a dictionary-based method. Since they are common, I wouldn't worry about missing non-dictionary words. That should solve the color/colour problem.
2) For typos and other non-standard spelling variants you can apply the Metaphone (http://en.wikipedia.org/wiki/Metaphone) algorithm to convert the tokens to a representation of their English pronunciations. Similar variants sound similar, thus you can match them to each other (e.g., Jon to John). You can also use edit-distance-based matching algorithms at query time to match very similar tokens that differ only by a pair of transposed characters or a dropped character (e.g., Huseyin versus Housein).
3) For apostrophe and compound words with hyphen in between, you can store both variants. For example, "John's" would be indexed both as "John s" and "Johns". "blank-space" can be converted to (or stored along with) "blank space" and "blankspace".
4) For compound words without any hyphen in between, you could use an external library such as HyphenationCompoundWordTokenFilterFactory class of Solr (http://lucene.apache.org/solr/api/org/apache/solr/analysis/HyphenationCompoundWordTokenFilterFactory.html). Although it can use a dictionary, it doesn't have to. It is targeted to deal with compound words that are frequently encountered in German and similar languages. I see no reason why you can't apply it to English (you'll need to supply an English dictionary and hyphenation rule files).
Actually, the last point raises an important question. I don't think you are up to building your own search library from scratch. If that's true, why don't you use Lucene (or Solr, which is based on Lucene), a Java-based search library which already has the methods and ways to deal with these problems? For example, the injection technique allows you to index both color and colour in the same place in a document; thus it doesn't matter whether you search for "colored cars" or "coloured cars" (assuming you take care of stemming). There are filters which do the phonetic indexing (http://lucene.apache.org/solr/api/org/apache/solr/analysis/PhoneticFilterFactory.html). There is even a FuzzyQuery component which lets you allow a certain amount of edit distance to match similar terms (http://lucene.apache.org/core/old_versioned_docs/versions/3_2_0/api/all/org/apache/lucene/search/FuzzyQuery.html)
You will also need to decide at which point you want to deal with these problems: One extreme approach is to index all possible variants of these terms during the indexing and use the queries as they are. That will keep your query processing light, but will cost you a larger index (because of all the variants you need to store). The other extreme is to index the documents as they are and expand the queries during the searching. That will allow you to keep your index lean at the cost of heavier query processing. Phonetic indexing would require you to process both your documents during the indexing and the queries during the search. Fuzzy matching would be feasible only during the search time because, presumably, you wouldn't be able to store all edit variants of all terms in the index.
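As an illustration of point 3 (storing both variants of apostrophe and hyphen words at index time), here is a minimal Python sketch. The function name is hypothetical and the lowercasing is an added assumption; in Lucene or Solr an analyzer chain would do this work instead, and the question's own code is in Java.

```python
import re

def token_variants(token: str) -> set:
    """Return the token plus its 'split' and 'joined' forms, if it has any."""
    lowered = token.lower()  # assumption: the index is case-insensitive
    variants = {lowered}
    if re.search(r"['-]", lowered):
        variants.add(re.sub(r"['-]", " ", lowered))  # "john's" -> "john s"
        variants.add(re.sub(r"['-]", "", lowered))   # "john's" -> "johns"
    return variants

if __name__ == "__main__":
    for word in ("John's", "blank-space", "color"):
        print(word, "->", sorted(token_variants(word)))
```

Indexing every variant keeps queries simple at the cost of a larger index, which is the first of the two extremes described above.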

What's the best way to parse a lexicon and show a large number of matches using wildcards

My problem is, I have a lexicon of about 200,000 words or so. The file is 1.8 MB in size. I want input from a user, say **id, and I want to show all possible matches, where * can be any letter A-Z (said, maid, etc.).
I'm looking for some suggestions on the most efficient way to do this, because I want the user to be able to add more concrete letters and give a live update of the word matches.
My idea was to attempt to use RegexKitLite, but I have a feeling that would be incredibly slow.
Thanks for any input!
Edit: Do you think it's possible to use NSPredicates to achieve this?
What you can do to optimize search performance depends heavily on how you want to limit the use of those wildcards.
Precisely: what are the characteristics of your wildcards?
prefix-only wildcards (m/.+foobar/)
suffix-only wildcards (m/foobar.+/)
atomic wildcards (m/./)
dynamic wildcards (m/.+/)
Prefix-only Wildcards
Use a Prefix tree or DAWG
Suffix-only Wildcards
Use a Suffix tree or DAWG
Atomic Wildcards
One way to drastically reduce the number of matches you have to run would be:
Build a BKTree from your word collection.
As (and as long as) your wildcard has a fixed length (1 in your case) you could then simply query your BKTree for nodes with an exact edit distance of n, with n being the number of wildcards. Traditional BKTree queries have an upper limit of variance. In your case you'd want to introduce an additional lower limit, narrowing the range of accepted variance down to exactly [n,n] (vs. traditionally [0,n]).
You'll get an array of words differing from your query word by exactly n characters.
For the query **id some possible matches would be:
void (2x additions)
laid (2x additions)
bad (1x replacement, 1x addition)
to (2x replacements)
While those are not yet correct matches for your query, they represent a very small subset of your total collection of words.
So last but not least you run your regex matching against those results and return all remaining matches.
BKTrees use the Levenshtein distance as a spatial heuristic to drastically (depending on the entropy within your word collection) reduce the number of required comparisons/matchings.
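The steps above can be sketched in Python as follows (the question is about iOS, so this is only an illustration of the technique). The class and function names are hypothetical, and one reading is assumed: the BK-tree is queried with the pattern's fixed letters only ("id" for **id), which is what the example distances listed above correspond to.

```python
import re

def levenshtein(a, b):
    """Row-by-row dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]

class BKTree:
    """Each node is (word, {distance_to_word: child_node})."""

    def __init__(self, words):
        self.root = None
        for word in words:
            self.add(word)

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            distance = levenshtein(word, node[0])
            if distance == 0:
                return  # already stored
            child = node[1].get(distance)
            if child is None:
                node[1][distance] = (word, {})
                return
            node = child

    def query(self, word, low, high):
        """Return stored words w with low <= distance(word, w) <= high."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            node_word, children = stack.pop()
            distance = levenshtein(word, node_word)
            if low <= distance <= high:
                results.append(node_word)
            for edge, child in children.items():
                # Triangle inequality bounds the distances in each subtree.
                if abs(distance - edge) <= high and distance + edge >= low:
                    stack.append(child)
        return results

def wildcard_matches(tree, pattern):
    """Candidates at distance exactly n from the fixed letters, then regex."""
    fixed = pattern.replace("*", "")
    n = pattern.count("*")
    regex = re.compile("".join("." if ch == "*" else re.escape(ch)
                               for ch in pattern))
    return sorted(w for w in tree.query(fixed, n, n) if regex.fullmatch(w))

if __name__ == "__main__":
    tree = BKTree(["said", "maid", "void", "laid", "bad", "to",
                   "idea", "lid", "paid", "raided"])
    print(wildcard_matches(tree, "**id"))
```

Any word that truly matches the pattern contains the fixed letters as a subsequence and is exactly n characters longer, so its distance to the fixed letters is exactly n and the regex step never loses a match.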
To gain additional optimization you could use multiple BKTrees:
Divide your collection into sub-sets. One set for words of length 1, one for length 2, one for 3, and so on. From each subset you then build a BKTree. For a query **id you'd then query the BKTree for length 4 (wildcards are counted like chars).
This applies for wildcards getting interpreted as m/./. If your wildcard however shall get interpreted as m/.?/ you'd query the BKTrees for length 3 & 4.
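The partitioning by word length can be sketched in Python like this. To stay short, the example swaps the per-length BKTree for a plain regex scan of the matching bucket, so it shows only the length split, under the m/./ interpretation; all names are hypothetical.

```python
import re
from collections import defaultdict

def bucket_by_length(words):
    """Group words into lists keyed by their length."""
    buckets = defaultdict(list)
    for word in words:
        buckets[len(word)].append(word)
    return buckets

def match_in_bucket(buckets, pattern):
    """Match only against words as long as the pattern ('*' counts as one char)."""
    regex = re.compile("".join("." if ch == "*" else re.escape(ch)
                               for ch in pattern))
    return [w for w in buckets.get(len(pattern), []) if regex.fullmatch(w)]

if __name__ == "__main__":
    buckets = bucket_by_length(["said", "maid", "bad", "to", "raided", "lid"])
    print(match_in_bucket(buckets, "**id"))
```

For the m/.?/ interpretation you would check every bucket whose length lies between the number of fixed letters and the full pattern length.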
As an alternative to BKTrees you could also use a GADDAG, a data structure (a specialization of the trie) designed particularly for Scrabble-style lookups.
If I'm not mistaken your wildcards will need to get interpreted strictly as m/./ as well.
Dynamic Wildcards
Right now I cannot think of any significantly better solution than running your regex against your collection of words.