Segmenting words, and grouping hyphenated and apostrophe words from text - text-processing

I need to segment words from a text. Sometimes hyphenated words are written without hyphens, and apostrophe words are written without the apostrophe. There are similar issues such as different spellings of the same word (e.g., color, colour), or single words that are sometimes written with a space in them (e.g., up to, upto, blankspace, blank space). I need to group these variants under one single representation and insert it into a set/hashmap or some other place. There can also be problems with accented words written without their accent characters (although I haven't faced them yet). Currently I am cutting the words at every whitespace character and every non-alphanumeric character, then stemming them and omitting stop words.
These indexes will later be used for document similarity checking, searching, etc. Any suggestions on how I can combat these problems? I have thought of matching each scanned word against a wordlist, but the problem is that proper nouns and non-dictionary words would be omitted.
Info: My code is in Java

I think you should apply a combination of techniques.
1) For common spelling variants I would go with a dictionary-based method. Since they are common, I wouldn't worry about missing non-dictionary words. That should solve the color/colour problem.
2) For typos and other non-standard spelling variants you can apply the Metaphone (http://en.wikipedia.org/wiki/Metaphone) algorithm to convert the tokens to a representation of their English pronunciations. Similar variants sound similar, thus you can match them to each other (e.g., Jon to John). You can also use edit-distance-based matching algorithms at query time to match very similar tokens that differ only by a pair of transposed characters or a dropped character (e.g., Huseyin versus Housein).
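To make the edit-distance idea in point 2 concrete, here is a minimal sketch of the classic Levenshtein distance in plain Java. The class and method names (EditDistance, levenshtein) are made up for illustration and are not part of Lucene or any other library; a real system would more likely use an existing implementation.

```java
final class EditDistance {
    private EditDistance() {}

    /** Levenshtein distance: minimum number of single-character
     *  insertions, deletions and substitutions turning a into b. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

With this sketch, EditDistance.levenshtein("jon", "john") is 1 and EditDistance.levenshtein("huseyin", "housein") is 2, so a threshold of 2 would treat both pairs as matches. Note that plain Levenshtein counts a transposition as two edits; a Damerau variant would count it as one.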
3) For apostrophe and compound words with hyphen in between, you can store both variants. For example, "John's" would be indexed both as "John s" and "Johns". "blank-space" can be converted to (or stored along with) "blank space" and "blankspace".
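As a rough sketch of point 3, the following Java helper produces the split and joined forms of a token that contains a hyphen or an apostrophe. The class name TokenVariants, the lowercasing, and the decision to keep the original form as well are assumptions made for this example, not something prescribed above.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class TokenVariants {
    private TokenVariants() {}

    /** Returns the token itself plus its "split" and "joined" variants
     *  when it contains a hyphen or an apostrophe. */
    static Set<String> variants(String token) {
        // Lowercasing is an assumption of this sketch; skip it if case matters.
        String lower = token.toLowerCase(Locale.ROOT);
        Set<String> out = new LinkedHashSet<>();
        out.add(lower);
        if (lower.indexOf('-') >= 0 || lower.indexOf('\'') >= 0) {
            // "john's" -> "john s", "blank-space" -> "blank space"
            out.add(lower.replaceAll("[-']", " ").trim());
            // "john's" -> "johns", "blank-space" -> "blankspace"
            out.add(lower.replaceAll("[-']", ""));
        }
        return out;
    }
}
```

For example, TokenVariants.variants("blank-space") yields "blank-space", "blank space" and "blankspace", all of which could be indexed at the same position.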
4) For compound words without any hyphen in between, you could use an external library such as HyphenationCompoundWordTokenFilterFactory class of Solr (http://lucene.apache.org/solr/api/org/apache/solr/analysis/HyphenationCompoundWordTokenFilterFactory.html). Although it can use a dictionary, it doesn't have to. It is targeted to deal with compound words that are frequently encountered in German and similar languages. I see no reason why you can't apply it to English (you'll need to supply an English dictionary and hyphenation rule files).
Actually, the last point raises an important question. I don't think you want to build your own search library from scratch. If that's true, why don't you use Lucene (or Solr, which is based on Lucene), a Java-based search library which already has methods for dealing with these problems? For example, the injection technique allows you to index both color and colour at the same position in a document; thus it doesn't matter whether you search for "colored cars" or "coloured cars" (assuming you take care of stemming). There are filters which do the phonetic indexing (http://lucene.apache.org/solr/api/org/apache/solr/analysis/PhoneticFilterFactory.html). There is even a FuzzyQuery component which lets you allow a certain amount of edit distance when matching similar terms (http://lucene.apache.org/core/old_versioned_docs/versions/3_2_0/api/all/org/apache/lucene/search/FuzzyQuery.html).
You will also need to decide at which point you want to deal with these problems: One extreme approach is to index all possible variants of these terms during the indexing and use the queries as they are. That will keep your query processing light, but will cost you a larger index (because of all the variants you need to store). The other extreme is to index the documents as they are and expand the queries during the searching. That will allow you to keep your index lean at the cost of heavier query processing. Phonetic indexing would require you to process both your documents during the indexing and the queries during the search. Fuzzy matching would be feasible only during the search time because, presumably, you wouldn't be able to store all edit variants of all terms in the index.

Related

Can I find text that's "close" to some query with PostgreSQL?

I have a table in my DB called text. It will have something like this is an example of lite coin. I want to query this for litecoin and things that are close (like lite coin). Is there some way to do this generically as I will have multiple queries. Maybe something with a max Levenshtein distance?
There is a core extension to PostgreSQL which implements the Levenshtein distance. For strings of very unequal length, as in your example, the distance will of necessity be large. So you would have to implement some normalization method, unless all phrases being searched within are the same length.
I don't think Levenshtein is indexable. You could instead look into trigram distance, which is indexable.
+1 on the trigram suggestion. Trigrams in Postgres are excellent and, for sure, indexable. Depending on the index option you choose (GIN or GiST), you get access to different operators. If I remember correctly off the top of my head, GiST gives you distance tolerances for the words, and lets you search for them in order. You can specify the number of words expected between two searched words, and more. (If I'm remembering correctly.) Both GIN and GiST are worth experimenting with.
Levenshtein compares two specific strings, so it doesn't lend itself to indexing. What would you index? The comparison string is unknown in advance. You could index every string by every string in a column and, apart from the O(aaaargh!) complexity, you still might not have anything like your search string in the index.
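To illustrate why trigrams cope well with "lite coin" versus "litecoin", here is a small Java sketch of a trigram similarity score. It is loosely modeled on how trigram matching is usually described (lowercase, split into words, pad each word with two leading spaces and one trailing space, compare the sets of 3-character substrings); it is an assumption-laden approximation, handles ASCII letters and digits only, and should not be read as the exact algorithm Postgres uses. The class name TrigramSimilarity is invented for the example.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class TrigramSimilarity {
    private TrigramSimilarity() {}

    /** Collects the distinct trigrams of every word in the text. */
    static Set<String> trigrams(String text) {
        Set<String> grams = new HashSet<>();
        // Split on anything that is not an ASCII letter or digit.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (word.isEmpty()) {
                continue;
            }
            // Padding lets word beginnings and endings form their own trigrams.
            String padded = "  " + word + " ";
            for (int i = 0; i + 3 <= padded.length(); i++) {
                grams.add(padded.substring(i, i + 3));
            }
        }
        return grams;
    }

    /** Shared trigrams divided by the size of the union, in [0, 1]. */
    static double similarity(String a, String b) {
        Set<String> ga = trigrams(a);
        Set<String> gb = trigrams(b);
        Set<String> shared = new HashSet<>(ga);
        shared.retainAll(gb);
        int union = ga.size() + gb.size() - shared.size();
        return union == 0 ? 0.0 : (double) shared.size() / union;
    }
}
```

In this sketch, "lite coin" and "litecoin" share 7 of 12 distinct trigrams, giving a similarity of about 0.58, whereas unrelated strings score close to 0.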
Tip: If you must use Levenshtein, and it is pretty great where it's useful, you can eliminate many rows from your comparison cheaply. If you've got a 10-character search string and only want strings within a distance of 2, you can eliminate every string shorter than 8 or longer than 12 characters from consideration without fear of losing any matches, because each character of length difference costs at least one edit.
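The length trick in the tip above can be sketched in a few lines of Java. The class and method names (LengthPrefilter, lengthCompatible) are hypothetical; the surviving candidates would then be passed to whatever Levenshtein implementation you already use.

```java
import java.util.ArrayList;
import java.util.List;

final class LengthPrefilter {
    private LengthPrefilter() {}

    /** Keeps only candidates whose length differs from the query's length
     *  by at most maxDistance; all others cannot be within that distance. */
    static List<String> lengthCompatible(List<String> candidates, String query, int maxDistance) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (Math.abs(candidate.length() - query.length()) <= maxDistance) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

In a database the same check would typically be a cheap length condition in the WHERE clause, evaluated before the distance function.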
You might find that you want to apply Levenshtein (or Jaccard, etc.) to possible matches found by the trigrams. But, honestly, Levenshtein is, by nature, biased towards strings in the same order. That's okay for lite coin/light coin/litecoin, but not helpful when the words can be in any order, like with first and last name, much address data, and many, many phrase-like searches.
The other thing to consider, depending on your range of queries, are full text searches with tsvectors. These are also indexable, and also support a range of operators.

Nearby or within search for clauses

I have divided multiple sentences into clauses(like A,B,C.....Z).
Now I want to search computer and mouse in these clauses such that they lie within a range of 3 clauses. I know that this can be done using languages, but that would be slow and mine is not a one time process. I want to use it in a search engine so I am trying to find out if there is any existing database that has this as inbuilt functionality or something closer to this.
Since you've tagged this with Solr, the regular Lucene syntax for this would be:
"computer mouse"~2
(this means that there can be two tokens between each term).
If you're using the dismax or edismax query syntax in Solr, you can use the phrase slop setting (ps) to say the same thing.

number to word & word to number support with full text search

What kind of approaches are available or advisable for handling number to word matches or the reverse, word to number matches?
E.g. query: "600" should match "six hundred" and conversely a query of "six hundred" should match "600".
I can think of manual ways to create lexemes for normalized forms of each representation and store that on an indexed field, so I'm not so interested in hearing about that, rather I'm curious:
How are others solving this problem generally? Perhaps manually, like I just mentioned?
Are there default postgres search features to help/support this? If not, perhaps there are for elasticsearch?
Other relevant feedback my question doesn't encapsulate but is important to this topic for myself and other readers to consider.

Lucene Indexing strategy with MultiLingual Support

We are using Lucene.net for searching in our application, and it works well. Now we need to support multiple languages, so I would like to ask what indexing strategy we should use. Should we index different languages in different index folders with different analyzers? Or use the same index folder, with documents having fields for English and for the other languages (we would end up with too many fields because of the repetition of fields per language)? Or is there any other alternative?
Pravin Thokal
The ideal strategy would be to have an additional language field and other existing fields can take in content in many languages. The value of language field dynamically selects different language analyzers for the multilingual fields.
But in essence, one field will have contents in many languages which impacts the term statistics.
Since a term in Lucene is field:term, for languages having common words, term statistics will be a concern, especially if in one language the term is a frequently used word and in other it is an uncommon word. Worst case being a stop word in one language and important term in other language. If this is the case, it is a no go strategy. However, for your language set, it is possible that there is no impact on the term statistics and vocabularies in different languages are mutually exclusive. In this case you could expect the TFIDFSimilarity to work. In case you are using other Similarity classes, they should mostly work well if TFIDF works.
For other strategies:
It definitely depends on:
a) the number of languages to support (say m)
b) the number of fields which need to be multilingual (say n)
In case both m and n are small, you can go for a multi-field approach:
(en -english, jp - Japanese, fr - French)
field1_en, field1_jp , field1_fr,
field2_en, field2_jp , field2_fr.
Unless m*n exceeds 1000 or so fields, this is a safe strategy. Lucene's performance goes down when the number of fields is huge.
In case the number of languages is very small, a different index folder (different schema) per language can work. But note that if you need to return results from different languages together, that is a concern in many search engines. Elastic Search does well though.

Whats the best way to Parse a Lexicon and show a large amount of matches using wild cards

My problem is, I have a lexicon of about 200,000 words or so. The file is 1.8 MB in size. I want input from a user, say **id, and I want to show all possible matches, where * can be any letter A-Z (said, maid, etc.).
I'm looking for some suggestions on the most efficient way to do this, because I want the user to be able to add more concrete letters and give a live update of the word matches.
My idea was to attempt to use RegexKitLite, but i have a feeling that would be incredibly slow.
Thanks for any input!
Edit: Do you think it's possible to use NSPredicates to achieve this?
The things you can do to optimize search performance depend heavily on how you want to limit the use of those wildcards.
Precisely: what are the characteristics of your wildcards?
prefix-only wildcards (m/.+foobar/)
suffix-only wildcards (m/foobar.+/)
atomic wildcards (m/./)
dynamic wildcards (m/.+/)
Prefix-only Wildcards
Use a Prefix tree or DAWG
Suffix-only Wildcards
Use a Suffix tree or DAWG
Atomic Wildcards
One way to drastically reduce the number of matches you have to run would be:
Build a BKTree from your word collection.
As long as your wildcard has a fixed length (1 in your case), you could then simply query your BKTree for nodes with an exact edit distance of n, with n being the number of wildcards. Traditional BKTree queries only have an upper limit of variance. In your case you'd want to introduce an additional lower limit, narrowing the range of accepted variance down to exactly [n,n] (vs. traditionally [0,n]).
You'll get an array of words differing from your query word by exactly n characters.
For the query **id some possible matches would be:
void (2x additions)
laid (2x additions)
bad (1x replacement, 1x addition)
to (2x replacements)
While those are not yet correct matches for your query, they represent a very small subset of your total collection of words.
So last but not least you run your regex matching against those results and return all remaining matches.
BKTrees introduce the levenshtein distance as some spatial heuristic to drastically (depending on the entropy within your word collection) reduce the number of required comparisons/matchings.
To gain additional optimization you could use multiple BKTrees:
Divide your collection into sub-sets. One set for words of length 1, one for length 2, one for 3, and so on. From each subset you then build a BKTree. For a query **id you'd then query the BKTree for length 4 (wildcards are counted like chars).
This applies for wildcards getting interpreted as m/./. If your wildcard however shall get interpreted as m/.?/ you'd query the BKTrees for length 3 & 4.
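The length-bucketing idea can be sketched on its own, without the BKTree part. The Java class below (WildcardLexicon, a name made up for this example) only groups words by length and then checks each candidate of the right length position by position, treating * as exactly one character, i.e. the m/./ interpretation. It is a simplified stand-in to show the bucketing step, not an implementation of a BKTree or GADDAG.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class WildcardLexicon {
    // One bucket of words per word length.
    private final Map<Integer, List<String>> byLength = new HashMap<>();

    WildcardLexicon(List<String> words) {
        for (String word : words) {
            byLength.computeIfAbsent(word.length(), k -> new ArrayList<>()).add(word);
        }
    }

    /** Returns all words of the pattern's length that agree with it
     *  at every non-wildcard position. */
    List<String> match(String pattern) {
        List<String> hits = new ArrayList<>();
        List<String> bucket = byLength.getOrDefault(pattern.length(), Collections.<String>emptyList());
        for (String word : bucket) {
            if (matches(pattern, word)) {
                hits.add(word);
            }
        }
        return hits;
    }

    // Caller guarantees pattern and word have the same length.
    private static boolean matches(String pattern, String word) {
        for (int i = 0; i < pattern.length(); i++) {
            char p = pattern.charAt(i);
            if (p != '*' && p != word.charAt(i)) {
                return false;
            }
        }
        return true;
    }
}
```

For a lexicon containing said, maid, void, laid, bad and to, the pattern **id only inspects the four-letter bucket and returns said, maid, void and laid. For live updates as the user types more concrete letters, the previous result list could be filtered again instead of rescanning the bucket.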
Alternatively to BKTrees you could also use a GADDAG, which is a data structure (specialization of Trie) specialized particularly for Scrabble-style lookups.
If I'm not mistaken your wildcards will need to get interpreted strictly as m/./ as well.
Dynamic Wildcards
Cannot right now think of any significantly better solution than running your regex against your collection of words.