number to word & word to number support with full text search - postgresql

What kind of approaches are available or advisable for handling number to word matches or the reverse, word to number matches?
E.g. query: "600" should match "six hundred" and conversely a query of "six hundred" should match "600".
I can think of manual ways to create lexemes for normalized forms of each representation and store that on an indexed field, so I'm not so interested in hearing about that, rather I'm curious:
How are others solving this problem generally, perhaps manually like I just mentioned?
Are there default postgres search features to help/support this? If not, perhaps there are for elasticsearch?
Other relevant feedback my question doesn't encapsulate but is important to this topic for myself and other readers to consider.
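For reference, the manual normalization mentioned in the question could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names `number_to_words` and `normalize` are made up here, only English and integers below one million are handled, and compound numbers are written without hyphens ("twenty one"). The idea is to run both the stored text and the incoming query through the same function before building the search vector, so "600" and "six hundred" end up as the same tokens.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer 0 <= n < 1_000_000 in English, without hyphens."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0..999999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (" " + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = " " + number_to_words(rest) if rest else ""
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    tail = " " + number_to_words(rest) if rest else ""
    return number_to_words(thousands) + " thousand" + tail


def normalize(text: str) -> str:
    """Lower-case the text and spell out every standalone run of digits."""
    def spell(match):
        value = int(match.group())
        # Leave numbers outside the supported range untouched.
        return number_to_words(value) if value < 1_000_000 else match.group()
    return re.sub(r"\b\d+\b", spell, text.lower())
```

With this, `normalize("600 items")` and `normalize("Six Hundred items")` produce the same string, which can then be fed to whatever builds the indexed lexemes.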

Related

Can I find text that's "close" to some query with PostgreSQL?

I have a table in my DB called text. It will contain something like "this is an example of lite coin". I want to query this for "litecoin" and things that are close (like "lite coin"). Is there some way to do this generically, as I will have multiple queries? Maybe something with a max Levenshtein distance?
There is a core extension to PostgreSQL which implements the Levenshtein distance. For strings of very unequal length, as in your example, the distance will of necessity be large. So you would have to implement some normalization method, unless all phrases being searched within are the same length.
I don't think Levenshtein is indexable. You could instead look into trigram distance, which is indexable.
+1 on the trigram suggestion. Trigrams in Postgres are excellent and, for sure, indexable. Depending on the index option you choose (GIN or GiST), you get access to different operators. If I remember correctly off the top of my head, GiST gives you distance tolerances for the words, and lets you search for them in order. You can specify the number of words expected between two search words, and more. (If I'm remembering correctly.) Both GIN and GiST are worth experimenting with.
Levenshtein compares two specific strings, so it doesn't lend itself to indexing. What would you index? The comparison string is unknown in advance. You could index every string by every string in a column and, apart from the O(aaaargh!) complexity, you still might not have anything like your search string in the index.
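To get a feel for why trigrams handle "litecoin" versus "lite coin" well, here is a rough Python approximation of trigram similarity. It is a sketch, not the extension's code: it assumes each word is lower-cased and padded with two leading spaces and one trailing space, and that similarity is the number of shared trigrams divided by the number of distinct trigrams in either string. The function names are invented for this example.

```python
import re


def trigrams(text: str) -> set:
    """Collect the set of 3-character substrings of each padded word."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "   # assumed padding: 2 spaces before, 1 after
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

Under these assumptions "lite coin" scores noticeably higher against "litecoin" than "bitcoin" does, because most trigrams survive the inserted space.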
Tip: If you must use Levenshtein, and it is pretty great where it's useful, you can eliminate many rows from your comparison cheaply. If you've got a 10 character search string and want strings only with a distance of 2, you can eliminate shorter and longer strings from consideration without fear of losing any matches.
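The length filter from the tip above can be sketched like this. The Levenshtein function is a plain dynamic-programming version written for this example, and `within_distance` is a hypothetical helper name; the point is only that a length difference larger than the allowed distance rules a string out before any expensive comparison.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]


def within_distance(query: str, candidates, max_dist: int) -> list:
    """Return candidates within max_dist edits, skipping hopeless lengths."""
    results = []
    for s in candidates:
        # Each edit changes the length by at most one, so a larger length
        # gap can never be bridged within max_dist edits.
        if abs(len(s) - len(query)) > max_dist:
            continue
        if levenshtein(query, s) <= max_dist:
            results.append(s)
    return results
```

In SQL the same idea would be a cheap length condition placed ahead of the distance call.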
You might find that you want to apply Levenshtein (or Jaccard, etc.) to possible matches found by the trigrams. But, honestly, Levenshtein is, by nature, biased towards strings in the same order. That's okay for lite coin/light coin/litecoin, but not helpful when the words can be in any order, like with first and last name, much address data, and many, many phrase-like searches.
The other thing to consider, depending on your range of queries, are full text searches with tsvectors. These are also indexable, and also support a range of operators.

Nearby or within search for clauses

I have divided multiple sentences into clauses(like A,B,C.....Z).
Now I want to search computer and mouse in these clauses such that they lie within a range of 3 clauses. I know that this can be done using languages, but that would be slow and mine is not a one time process. I want to use it in a search engine so I am trying to find out if there is any existing database that has this as inbuilt functionality or something closer to this.
Since you've tagged this with Solr, the regular Lucene syntax for this would be:
"computer mouse"~2
(this means that there can be two tokens between each term).
If you're using the dismax or edismax query syntax in Solr, you can use the phrase slop setting (ps) to say the same thing.
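As a rough illustration of what "at most two tokens between the terms" means, here is a small Python check. It is a deliberate simplification and not how Lucene computes slop internally; the function name `near` is made up, and the same check works on clause identifiers instead of word tokens if each clause is treated as one unit.

```python
def near(tokens, a, b, max_between):
    """True if a and b occur with at most max_between tokens between them."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(i != j and abs(i - j) - 1 <= max_between
               for i in pos_a for j in pos_b)
```

For example, in "the computer has a wireless mouse" there are three tokens between the two terms, so a limit of 2 rejects it and a limit of 3 accepts it.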

Segmenting words, and grouping hyphenated and apostrophe words from text

I need to segment words from a text. Sometimes hyphenated words are written without hyphens, and words with apostrophes are written without the apostrophe. There are also similar issues such as different spellings of the same word (e.g. color, colour), or single words that are sometimes written with a space in them (e.g. up to, upto, blankspace, blank space). I need to group these variants under one single representation and insert it into a set/hashmap or some other place. There can also be problems with accented words written without the accents (although I haven't faced them yet). Currently I am cutting the words at any whitespace character and every non-alphanumeric character, then stemming them and omitting stop words.
These indexes would later be used for document similarity checking, searching, etc. Any suggestions on how I can combat these problems? I have thought of matching each scanned word against a wordlist, but the problem is that proper nouns and non-dictionary words would be omitted.
Info: My code is in Java
I think you should apply a combination of techniques.
1) For common spelling variants I would go with a dictionary-based method. Since they are common, I wouldn't worry about missing non-dictionary words. That should solve the color/colour problem.
2) For typos and other non-standard spelling variants you can apply the Metaphone (http://en.wikipedia.org/wiki/Metaphone) algorithm to convert the tokens to a representation of their English pronunciations. Similar variants sound similar, thus you can match them to each other (e.g., Jon to John). You can also use edit-distance-based matching algorithms during the query to match very similar tokens that differ only by a pair of transposed characters or a dropped character (e.g., Huseyin versus Housein).
3) For apostrophe and compound words with hyphen in between, you can store both variants. For example, "John's" would be indexed both as "John s" and "Johns". "blank-space" can be converted to (or stored along with) "blank space" and "blankspace".
4) For compound words without any hyphen in between, you could use an external library such as HyphenationCompoundWordTokenFilterFactory class of Solr (http://lucene.apache.org/solr/api/org/apache/solr/analysis/HyphenationCompoundWordTokenFilterFactory.html). Although it can use a dictionary, it doesn't have to. It is targeted to deal with compound words that are frequently encountered in German and similar languages. I see no reason why you can't apply it to English (you'll need to supply an English dictionary and hyphenation rule files).
Actually, the last point raises an important question. I don't think you are up to building your own search library from scratch. If that's true, why don't you use Lucene (or Solr, which is based on Lucene), a Java-based search library which already has the methods and ways to deal with these problems? For example, the injection technique allows you to index both color and colour in the same place in a document; thus it doesn't matter whether you search for "colored cars" or "coloured cars" (assuming you take care of stemming). There are filters which do the phonetic indexing (http://lucene.apache.org/solr/api/org/apache/solr/analysis/PhoneticFilterFactory.html). There is even a FuzzyQuery component which lets you allow a certain amount of edit distance to match similar terms (http://lucene.apache.org/core/old_versioned_docs/versions/3_2_0/api/all/org/apache/lucene/search/FuzzyQuery.html).
You will also need to decide at which point you want to deal with these problems: One extreme approach is to index all possible variants of these terms during the indexing and use the queries as they are. That will keep your query processing light, but will cost you a larger index (because of all the variants you need to store). The other extreme is to index the documents as they are and expand the queries during the searching. That will allow you to keep your index lean at the cost of heavier query processing. Phonetic indexing would require you to process both your documents during the indexing and the queries during the search. Fuzzy matching would be feasible only during the search time because, presumably, you wouldn't be able to store all edit variants of all terms in the index.
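As a small illustration of the index-time extreme, point 3 above (storing both variants of apostrophe and hyphen words) could be sketched as follows. This is shown in Python for brevity even though the asker works in Java, and `index_variants` is a hypothetical helper, not part of Lucene or Solr.

```python
import re


def index_variants(token: str) -> set:
    """Return every form of a token that should be written to the index."""
    token = token.lower()
    variants = {token}
    if re.search(r"['-]", token):
        variants.add(re.sub(r"['-]", " ", token))  # "john s", "blank space"
        variants.add(re.sub(r"['-]", "", token))   # "johns", "blankspace"
    return variants
```

Indexing all returned forms at the same position keeps queries untouched at the cost of a larger index; applying the same function to query terms instead would be the query-time alternative.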

What's the best way to Parse a Lexicon and show a large amount of matches using wild cards

My problem is, I have a lexicon of about 200,000 words or so. The file is 1.8 MB in size. I want input from a user, say **id, and I want to show all possible matches, where * can be any letter A-Z (said, maid, etc.).
I'm looking for some suggestions on the most efficient way to do this, because I want the user to be able to add more concrete letters and give a live update of the word matches.
My idea was to attempt to use RegexKitLite, but I have a feeling that would be incredibly slow.
Thanks for any input!
Edit: Do you think it's possible to use NSPredicates to achieve this?
The things you can do to optimize search performance depend heavily on how you want to limit the use of those wildcards.
Precisely: what are the characteristics of your wildcards?
prefix-only wildcards (m/.+foobar/)
suffix-only wildcards (m/foobar.+/)
atomic wildcards (m/./)
dynamic wildcards (m/.+/)
Prefix-only Wildcards
Use a Prefix tree or DAWG
Suffix-only Wildcards
Use a Suffix tree or DAWG
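A minimal prefix tree sketch in Python, for illustration only (the class and method names are invented here). It answers "all words starting with a known prefix"; building a second tree over the reversed words gives the same lookup for a known ending.

```python
class Trie:
    """Prefix tree built from nested dicts; the key None marks a word end."""

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True  # end-of-word marker

    def starting_with(self, prefix):
        """Return all stored words that begin with prefix, sorted."""
        node = self.root
        for ch in prefix:
            node = node.get(ch)
            if node is None:
                return []
        results, stack = [], [(node, prefix)]
        while stack:
            node, text = stack.pop()
            for key, child in node.items():
                if key is None:
                    results.append(text)
                else:
                    stack.append((child, text + key))
        return sorted(results)
```

Usage: `Trie(words).starting_with("sai")` for a known beginning, or `Trie(w[::-1] for w in words).starting_with("di")` followed by reversing each result for a known ending.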
Atomic Wildcards
One way to drastically reduce the number of matches you have to run would be:
Build a BKTree from your word collection.
As (and as long as) your wildcard has a fixed length (1 in your case) you could then simply query your BKTree for nodes with an exact edit distance of n, with n being the number of wildcards. Traditional BKTree queries have an upper limit of variance. In your case you'd want to introduce an additional lower limit, narrowing the range of accepted variance down to exactly [n,n] (vs. traditionally [0,n]).
You'll get an array of words differing from your query word by exactly n characters.
For the query **id some possible matches would be:
void (2x additions)
laid (2x additions)
bad (1x replacement, 1x addition)
to (2x replacements)
While those are not yet correct matches for your query, they represent a very small subset of your total collection of words.
So last but not least you run your regex matching against those results and return all remaining matches.
BKTrees use the Levenshtein distance as a spatial heuristic to drastically (depending on the entropy within your word collection) reduce the number of required comparisons/matchings.
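The approach above can be sketched in Python roughly as follows. This is one possible reading, not a definitive implementation: it assumes the pattern is sent to the tree with its wildcards kept as literal `*` characters (which never occur in the lexicon, so every true match sits at distance exactly n), and the class and function names are invented for the example. The range query takes both a lower and an upper bound, as described.

```python
import re


def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


class BKTree:
    """Each node is (word, {distance_to_child: child_node})."""

    def __init__(self, words):
        self.root = None
        for word in words:
            self.add(word)

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return  # already stored
            if d not in node[1]:
                node[1][d] = (word, {})
                return
            node = node[1][d]

    def query(self, word, lo, hi):
        """Return stored words whose distance to word lies in [lo, hi]."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            term, children = stack.pop()
            d = levenshtein(word, term)
            if lo <= d <= hi:
                results.append(term)
            for k, child in children.items():
                # Triangle inequality: everything below this child is
                # between |d - k| and d + k away from the query word.
                if abs(d - k) <= hi and d + k >= lo:
                    stack.append(child)
        return results


def wildcard_matches(tree, pattern):
    """Narrow down with the BK-tree, then confirm with a regex."""
    n = pattern.count("*")
    candidates = tree.query(pattern, n, n)
    regex = re.compile("".join("[a-z]" if c == "*" else re.escape(c)
                               for c in pattern))
    return sorted(w for w in candidates if regex.fullmatch(w))
```

Usage: build the tree once from the lexicon, then call `wildcard_matches(tree, "**id")` on each keystroke; only the candidates returned by the tree are run through the regex.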
To gain additional optimization you could use multiple BKTrees:
Divide your collection into sub-sets. One set for words of length 1, one for length 2, one for 3, and so on. From each subset you then build a BKTree. For a query **id you'd then query the BKTree for length 4 (wildcards are counted like chars).
This applies to wildcards interpreted as m/./. If your wildcard should instead be interpreted as m/.?/, each wildcard may also match nothing, so you'd query the BKTrees for every length from 2 to 4.
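Splitting the collection by word length could be sketched like this in Python. The helper names are hypothetical. Note that under the m/.?/ reading each wildcard may also match nothing, so this sketch considers every length from the pattern length minus the number of wildcards up to the full pattern length.

```python
from collections import defaultdict


def bucket_by_length(words):
    """Group words by length; one search structure would be built per group."""
    buckets = defaultdict(list)
    for word in words:
        buckets[len(word)].append(word)
    return buckets


def lengths_for(pattern, optional=False):
    """Word lengths a pattern can match ('*' is m/./, or m/.?/ if optional)."""
    longest = len(pattern)
    shortest = longest - pattern.count("*") if optional else longest
    return list(range(shortest, longest + 1))
```

A lookup then only touches the buckets whose lengths are returned by `lengths_for` for the current pattern.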
Alternatively to BKTrees you could also use a GADDAG, which is a data structure (specialization of Trie) specialized particularly for Scrabble-style lookups.
If I'm not mistaken your wildcards will need to get interpreted strictly as m/./ as well.
Dynamic Wildcards
Cannot right now think of any significantly better solution than running your regex against your collection of words.

Please advise an optimal solution to full text search in mongoDB

The documents in my database have names and descriptions among other fields. I would like to allow the users to search for those documents by providing some keywords. The keywords should be used to lookup in both the name and the description field. I've read the mongoDB documentation on full text search and it looks really nice and easy if I want to search for keywords in the name field of my documents. However, the description field contains free form text and can take up to 2000 characters, so potentially there are a few hundred words per document. I could treat them the same way as names and just split the whole description into separate words and store it as another tag-like array (as per the Mongo example), but it seems like a terrible idea - each document's size could be almost doubled, plus there are characters like dots, commas, etc.
I know there are specialized solutions for exactly this kind of problems and I was just looking at Lucene.Net, I also saw Solr mentioned here and there.
Should I be looking to implement this search feature in mongoDB or should I use a specialized solution? Currently I just have one instance of mongod and one instance of a web server. We might need to scale later, but for now that is all I use. I'd appreciate any suggestions on how to implement this feature.
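For what it's worth, the keyword-array idea described above does not have to double the document size if the words are de-duplicated and stripped of punctuation first. A rough Python sketch, where the stop-word list and the function name are purely illustrative and the result would be stored in some array field of your choosing:

```python
import re

# Illustrative stop words only; a real list would be longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is"}


def extract_keywords(*fields):
    """Return the sorted, de-duplicated lower-case words from all fields."""
    words = set()
    for text in fields:
        # Splitting on non-alphanumerics drops dots, commas and the like.
        words.update(re.findall(r"[a-z0-9]+", text.lower()))
    return sorted(words - STOP_WORDS)
```

Calling `extract_keywords(name, description)` gives one combined list covering both fields, which can be indexed and searched the same way as the name keywords.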
If storing the text split out into an array per the documented approach is not viable (I can understand your concerns), then I think you should look into a specialised solution.
Quote from the MongoDB documentation:
"MongoDB has interesting functionality that makes certain search functions easy. That said, it is not a dedicated full text search engine."
So, for more advanced full text search functionality I think a dedicated engine would be more suited. I have no experience in this area so I can't offer much in the way of suggestions from here, other than what my thoughts would be if I was in the same boat:
how much work is involved in using a dedicated full-text search engine instead of MongoDB's functionality?
does that add more complexity / is it worth it?
would it be quicker/simpler to use MongoDB and just take the hit on the extra disk space?
maybe MongoDB will support better full-text functionality in future (it is rapidly evolving after all)
Fulltext search support is planned for the future. However right now you have to go with Solr & friends. Using the built-in "fulltext" functionality is not really suitable for real world usage.