Lucene indexing strategy with multilingual support - lucene.net

We are using Lucene.net for searching in our application, and it works well. Now we need to support multiple languages, so I would like to ask what indexing strategy we should use: indexing different languages in different index folders with different analyzers; keeping documents of all languages in the same index folder, with separate fields for English and for the other languages (which leaves us with too many fields, since each field is repeated per language); or is there some other alternative?
Pravin Thokal

The ideal strategy would be to add a language field and let the other existing fields accept content in any of the languages. The value of the language field then dynamically selects the analyzer used for the multilingual fields.
In essence, though, one field will hold content in many languages, and that impacts the term statistics.
Since a term in Lucene is field:term, term statistics are a concern for languages that share common words, especially when a term is frequent in one language and uncommon in another. The worst case is a stop word in one language that is an important term in another; if that applies to you, this strategy is a no-go. However, for your particular set of languages it is possible that the vocabularies are mutually exclusive and the term statistics are unaffected. In that case you can expect TFIDFSimilarity to work, and if you are using other Similarity classes, they should mostly work well if TF-IDF does.
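If you go this route, a minimal sketch of the indexing side using the Lucene 3.x Java API (Lucene.NET 3.0.3 exposes the same classes with Pascal-cased names); the field names "language", "title" and "body" and the analyzer mapping are illustrative, not prescribed:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.fr.FrenchAnalyzer;
    import org.apache.lucene.analysis.ja.JapaneseAnalyzer;   // from the kuromoji module
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.util.Version;

    public class MultilingualIndexer {
        // Pick an analyzer from the document's language code (illustrative mapping).
        static Analyzer analyzerFor(String lang) {
            if ("fr".equals(lang)) return new FrenchAnalyzer(Version.LUCENE_36);
            if ("ja".equals(lang)) return new JapaneseAnalyzer(Version.LUCENE_36);
            return new StandardAnalyzer(Version.LUCENE_36); // fallback, also used for English
        }

        static void addDocument(IndexWriter writer, String lang, String title, String body)
                throws Exception {
            Document doc = new Document();
            // The language field is stored but not analyzed, so it can also be used for filtering.
            doc.add(new Field("language", lang, Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("body", body, Field.Store.NO, Field.Index.ANALYZED));
            // The 3.x API accepts a per-document analyzer; the language value drives the choice.
            writer.addDocument(doc, analyzerFor(lang));
        }
    }

At query time you would need to analyze the user's query with the analyzer that matches the language being searched.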
For other strategies:
It definitely depends on:
a) the number of languages to support (say m)
b) the number of fields that need to be multilingual (say n)
If both m and n are small, you can go for a multi-field approach:
(en - English, jp - Japanese, fr - French)
field1_en, field1_jp, field1_fr,
field2_en, field2_jp, field2_fr.
Unless m*n pushes you past roughly 1000+ fields, this is a safe strategy; Lucene's performance degrades when the number of fields becomes huge.
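With this approach the analyzer can be selected per field name rather than per document. A minimal sketch with the Lucene 3.x Java API (Lucene.NET has an equivalent PerFieldAnalyzerWrapper), using the field names from the example above:

    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.fr.FrenchAnalyzer;
    import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    // Fields without an explicit mapping fall back to the default analyzer.
    PerFieldAnalyzerWrapper analyzer =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_36));
    analyzer.addAnalyzer("field1_en", new EnglishAnalyzer(Version.LUCENE_36));
    analyzer.addAnalyzer("field1_jp", new JapaneseAnalyzer(Version.LUCENE_36));
    analyzer.addAnalyzer("field1_fr", new FrenchAnalyzer(Version.LUCENE_36));
    // ...repeat for field2_*, field3_*, and so on.

Pass the same wrapper to both the IndexWriter and the query parser so that indexing and searching use identical per-field analysis.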
If the number of languages is very small, separate index folders (different schemas) can also work, but note that if you need to return results from different languages together, that is a concern in many search engines. Elasticsearch handles it well, though.

Related

Nearby or within search for clauses

I have divided multiple sentences into clauses (like A, B, C, ..., Z).
Now I want to search for computer and mouse in these clauses such that they lie within a range of 3 clauses. I know this can be done in application code, but that would be slow, and mine is not a one-time process. I want to use it in a search engine, so I am trying to find out whether there is an existing database that has this as built-in functionality, or something close to it.
Since you've tagged this with Solr, the regular Lucene syntax for this would be:
"computer mouse"~2
(this means that up to two tokens may appear between the terms).
If you're using the dismax or edismax query parser in Solr, you can use the phrase slop parameter (ps) to say the same thing.
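If you are building the query programmatically against Lucene rather than through Solr's parsers, the equivalent is a PhraseQuery with slop; a minimal sketch in the Lucene 3.x Java API, with a hypothetical field name clause_text:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;

    PhraseQuery query = new PhraseQuery();
    query.add(new Term("clause_text", "computer"));
    query.add(new Term("clause_text", "mouse"));
    query.setSlop(2);  // same proximity constraint as "computer mouse"~2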

number to word & word to number support with full text search

What kind of approaches are available or advisable for handling number to word matches or the reverse, word to number matches?
E.g. query: "600" should match "six hundred" and conversely a query of "six hundred" should match "600".
I can think of manual ways to create lexemes for normalized forms of each representation and store them on an indexed field, so I'm not so interested in hearing about that. Rather, I'm curious about:
How others are solving this problem in general (perhaps manually, as I just described).
Whether there are built-in Postgres full-text search features that help with or support this; if not, perhaps there are in Elasticsearch?
Any other relevant considerations my question doesn't capture but that are important to this topic for me and other readers.

Alternative to Boolean OR in Sphinx?

I have (or had) a MySQL query that was pretty fast using IN, e.g.
FieldA in (X,Y,Z)
I've moved over to Sphinx, which is clearly much faster, EXCEPT when using pipes in a case like this, e.g.
@(FieldA) (X|Y|Z)
Where X|Y|Z are actually about 40 different values. The MySQL IN takes 0.3 seconds; the Sphinx query takes over a minute. Given how much faster Sphinx has proven to be otherwise, I am wondering if there is some 'IN' equivalent for Sphinx that takes multiple values, versus |, which is clearly slowing it down.
Really, it depends on a lot of things. For certain queries, changing to use an MVA (multi-value attribute) might be better than using keywords (then you do have an IN() function),
... particularly if you have other search keywords.
Sphinx's full-text indexing is optimized for answering short, user-entered queries. To answer a long 'OR'-style query, it has to load and merge each word's list and rank all of that; it is all overhead.
Attribute-based filtering, by contrast, is generally pretty quick, particularly if you also have a highly selective keyword query, which gives a relatively short list of potential matches.
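As a rough sketch of how the MVA route might look (the attribute and index names are illustrative, and this assumes FieldA's values can be mapped to integer IDs):

    # in sphinx.conf, declare the values as a multi-value attribute (illustrative):
    sql_attr_multi = uint fielda_ids from field

    # then in SphinxQL, filter on the attribute instead of a long OR in the match:
    SELECT id FROM my_index WHERE MATCH('other keywords') AND fielda_ids IN (1, 2, 3);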

Segmenting words, and grouping hyphenated and apostrophe words from text

I need to segment words from a text. Sometimes hyphenated words are written without hyphens, and apostrophe words are written without the apostrophe. There are also similar issues, like different spellings of the same word (e.g. color, colour), or a single word written with a space in it (e.g. up to, upto, blankspace, blank space). I need to group these variants into one single representation and insert it into a set/hashmap or some other place. There can also be problems with accented words written without the accented characters (although I haven't faced them yet). Currently I am cutting the words at every blank-space character and every non-alphanumeric character, then stemming them and omitting stop words.
These indexes would later be used for document similarity checking, searching, etc. Any suggestions on how I can combat these problems? I have thought of matching each scanned word against a wordlist, but the problem is that proper nouns and non-dictionary words would then be omitted.
Info: My code is in Java
I think you should apply a combination of techniques.
1) For common spelling variants I would go with a dictionary-based method. Since they are common, I wouldn't worry about missing non-dictionary words. That should solve the color/colour problem.
2) For typos and other non-standard spelling variants you can apply the Metaphone algorithm (http://en.wikipedia.org/wiki/Metaphone) to convert the tokens to a representation of their English pronunciation. Similar variants sound similar, so you can match them to each other (e.g., Jon to John). You can also use edit-distance-based matching at query time to match very similar tokens that differ only by a pair of transposed characters or a dropped character (e.g., Huseyin versus Housein). See the sketch after this list.
3) For apostrophes and compound words with a hyphen in between, you can store both variants. For example, "John's" would be indexed both as "John s" and "Johns"; "blank-space" can be converted to (or stored along with) "blank space" and "blankspace".
4) For compound words without any hyphen in between, you could use an external library such as Solr's HyphenationCompoundWordTokenFilterFactory class (http://lucene.apache.org/solr/api/org/apache/solr/analysis/HyphenationCompoundWordTokenFilterFactory.html). Although it can use a dictionary, it doesn't have to; it is targeted at the compound words frequently encountered in German and similar languages, but I see no reason why you can't apply it to English (you'll need to supply an English dictionary and hyphenation rule files).
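The sketch referenced in point 2: a minimal Java example using Apache Commons Codec's Metaphone and Apache Commons Text's Levenshtein distance (which libraries to use is an assumption here; Lucene and Solr ship equivalent filters):

    import org.apache.commons.codec.language.Metaphone;
    import org.apache.commons.text.similarity.LevenshteinDistance;

    public class VariantMatching {
        public static void main(String[] args) {
            Metaphone metaphone = new Metaphone();
            // "Jon" and "John" produce the same phonetic code, so both spellings
            // can be collapsed to one representation at indexing time.
            System.out.println(metaphone.encode("John"));                  // JN
            System.out.println(metaphone.isMetaphoneEqual("Jon", "John")); // true

            // Edit distance is better suited to query time, to catch transposed
            // or dropped characters between very similar tokens.
            int d = LevenshteinDistance.getDefaultInstance().apply("Huseyin", "Housein");
            System.out.println(d); // 2
        }
    }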
Actually, that last point raises an important question: do you really want to build your own search library from scratch? If not, why not use Lucene (or Solr, which is based on Lucene), a Java-based search library that already has methods and ways to deal with these problems? For example, the injection technique allows you to index both color and colour at the same position in a document, so it doesn't matter whether you search for "colored cars" or "coloured cars" (assuming you take care of stemming). There are filters which do the phonetic indexing (http://lucene.apache.org/solr/api/org/apache/solr/analysis/PhoneticFilterFactory.html), and there is even a FuzzyQuery component which lets you allow a certain amount of edit distance when matching similar terms (http://lucene.apache.org/core/old_versioned_docs/versions/3_2_0/api/all/org/apache/lucene/search/FuzzyQuery.html).
You will also need to decide at which point you want to deal with these problems. One extreme is to index all possible variants of these terms at indexing time and use the queries as they are; that keeps query processing light but costs you a larger index (because of all the variants you need to store). The other extreme is to index the documents as they are and expand the queries at search time; that keeps your index lean at the cost of heavier query processing. Phonetic indexing requires you to process both the documents at indexing time and the queries at search time. Fuzzy matching is feasible only at search time because, presumably, you would not be able to store all edit variants of all terms in the index.
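For the search-time fuzzy matching mentioned above, a minimal sketch with the Lucene 3.x Java API (the field name "content" is illustrative):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;

    // Matches indexed terms sufficiently similar to "housein" (e.g. "huseyin")
    // without storing any edit variants; 0.6f is the minimum similarity threshold.
    FuzzyQuery fuzzy = new FuzzyQuery(new Term("content", "housein"), 0.6f);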

sqlite Indexing Performance Advice

I have an sqlite database in my iPhone app that I access via the Core Data framework. I'm using NSPredicates to query the database.
I am building a search function that needs to search six different varchar fields that contain text. At the moment, it's very slow and I need to improve performance, probably in the sqlite database. Would it be best to create an index on all those columns? Or would it be better to build a custom index table that expands those six columns into multiple rows, each containing a word and the ID it matches? Any other suggestions?
There are things you can do to improve the performance of text searches in sqlite databases. Although Core Data abstracts you away from the underlying store, it can be good to have an appreciation of what is going on when your store is backed by sqlite.
If we assume you're doing a substring search of these fields, there are things you can do to improve search performance. Apple recommends using derived properties. This amounts to maintaining a normalised version of your property in your model that is used for searching; the derived property should be built in a way that allows it to be indexed. You then express your search in terms of this derived property using binary operators such as > and <=.
I found doing this reduced our search from around 1 second to under 100ms.
To make things clearer, I would suggest looking at the ADC example: http://developer.apple.com/mac/library/samplecode/DerivedProperty/
From the Core Data Programming Guide:
How you use predicates can significantly affect the performance of your application. If a fetch request requires a compound predicate, you can make the fetch more efficient by ensuring that the most restrictive predicate is the first, especially if the predicate involves text matching (contains, endsWith, like, and matches) since correct Unicode searching is slow. If the predicate combines textual and non-textual comparisons, then it is likely to be more efficient to specify the non-textual predicates first, for example (salary > 5000000) AND (lastName LIKE 'Quincey') is better than (lastName LIKE 'Quincey') AND (salary > 5000000).
If there is a way to reorder your query so that the simplest logic is on the left and the most complex on the right, that can help your search performance. As Lyon suggests, searching Unicode text is extremely expensive, so Apple recommends searching against derived values that strip Unicode characters and common words like "a", "and", and "the".
I assume these columns store text. The question is how much text, and how often this model is accessed. If it is a large amount of text, I would create additional properties that hold the text with common words and Unicode characters stripped. The only downside is that you end up with extra properties to maintain. You can then add whatever indexing you need to improve performance on those columns.
If what you want is essentially full-text indexing of your sqlite db, then you may want to use sqlite's fts3 module, since that's exactly what it provides:
http://www.sqlite.org/cvstrac/wiki?p=FtsUsage
http://dotnetperls.com/sqlite-fts3
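To make the FTS3 approach concrete, a minimal sketch in plain sqlite SQL (the table and column names are illustrative, not from your schema):

    -- Create a full-text-indexed virtual table and populate it.
    CREATE VIRTUAL TABLE notes_fts USING fts3(title, body);
    INSERT INTO notes_fts (title, body) VALUES ('Meeting notes', 'Discuss search performance');

    -- MATCH against the table name searches all of its columns.
    SELECT * FROM notes_fts WHERE notes_fts MATCH 'search';

Note that Core Data will not create or query such a table for you; you would need to maintain it alongside your Core Data store, for example via the sqlite C API.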