"Noise word" in indexing services - indexing-service

Can anyone please tell me what exactly a noise word means in indexing services? I am working with Windows Server Indexing Service and running into a lot of issues. Some questions on it: Does the indexing service not search for noise words? What are the name and location of the noise word file on Windows Server? Thanks.

They are the same as stop-words:
https://en.wikipedia.org/wiki/Stop-words
In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). There is not one definite list of stop words which all tools use and such a filter is not always used. Some tools specifically avoid removing them to support phrase search.
See also:
http://msdn.microsoft.com/en-us/library/ms693206%28v=vs.85%29.aspx
Noise words act as placeholders in phrase queries. A document that contains the text "wag the dog" is stored in the index with "wag" at occurrence 1 and "dog" at occurrence 3. The phrase query "wag dog" does not match, but the phrase query "wag a dog" does, because the occurrence information matches.
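To make the placeholder behaviour concrete, here is a small, self-contained Java sketch (plain Java, not the Indexing Service API; the class and noise-word list are made up for illustration). Noise words are dropped from the index but still consume a position, so a phrase query matches only when the position gaps line up:

    import java.util.*;

    // Toy model of the example above: "the" is indexed as a gap, not a term.
    public class NoiseWordPositions {
        static final Set<String> NOISE = Set.of("a", "an", "the");

        // term -> position; noise words are skipped but keep their position slot
        static Map<String, Integer> index(String text) {
            Map<String, Integer> positions = new HashMap<>();
            String[] tokens = text.toLowerCase().split("\\s+");
            for (int pos = 1; pos <= tokens.length; pos++) {
                if (!NOISE.contains(tokens[pos - 1])) {
                    positions.put(tokens[pos - 1], pos);
                }
            }
            return positions;
        }

        // a phrase of two content terms matches when the indexed gap equals the query gap
        static boolean phraseMatches(Map<String, Integer> idx, String phrase) {
            String[] q = phrase.toLowerCase().split("\\s+");
            String first = q[0], last = q[q.length - 1];
            if (!idx.containsKey(first) || !idx.containsKey(last)) return false;
            return idx.get(last) - idx.get(first) == q.length - 1;
        }

        public static void main(String[] args) {
            Map<String, Integer> idx = index("wag the dog");      // wag@1, dog@3
            System.out.println(phraseMatches(idx, "wag dog"));    // false: query expects a gap of 1
            System.out.println(phraseMatches(idx, "wag a dog"));  // true: "a" keeps the gap at 2
        }
    }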

Related

Nearby or within search for clauses

I have divided multiple sentences into clauses (like A, B, C, ..., Z).
Now I want to search for "computer" and "mouse" in these clauses such that they lie within a range of 3 clauses. I know this could be done in general-purpose code, but that would be slow, and mine is not a one-time process. I want to use it in a search engine, so I am trying to find out whether there is an existing database that has this as built-in functionality, or something close to it.
Since you've tagged this with Solr, the regular Lucene syntax for this would be:
"computer mouse"~2
(this means that up to two other tokens can appear between the terms).
If you're using the dismax or edismax query parser in Solr, you can use the phrase slop setting (ps) to express the same thing.
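If you build queries programmatically rather than through the query parser, the same proximity constraint can be expressed with Lucene's PhraseQuery. A rough sketch (Lucene 5+ Builder API; the field name "text" is hypothetical):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;

    public class ProximityQueryExample {
        public static PhraseQuery build() {
            PhraseQuery.Builder builder = new PhraseQuery.Builder();
            builder.add(new Term("text", "computer"));
            builder.add(new Term("text", "mouse"));
            builder.setSlop(2); // equivalent to "computer mouse"~2
            return builder.build();
        }
    }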

number to word & word to number support with full text search

What kind of approaches are available or advisable for handling number to word matches or the reverse, word to number matches?
E.g. query: "600" should match "six hundred" and conversely a query of "six hundred" should match "600".
I can think of manual ways to create lexemes for normalized forms of each representation and store them on an indexed field, so I'm not so interested in hearing about that (a rough sketch of that approach is included below for reference). Rather, I'm curious:
How are others solving this problem in general? Perhaps manually, like I just mentioned?
Are there default Postgres full-text search features to help/support this? If not, perhaps there are in Elasticsearch?
Any other relevant feedback my question doesn't encapsulate but that is important to this topic for me and other readers to consider.
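For the manual route mentioned above, a minimal Java sketch of index-time normalization: map number words to a digit form so "six hundred" and "600" produce the same indexed token. Everything here (the class name, the handled vocabulary) is illustrative; a real implementation would sit in an analysis filter and handle far more cases.

    import java.util.*;

    public class NumberNormalizer {
        private static final Map<String, Long> UNITS = new HashMap<>();
        private static final Map<String, Long> MULTIPLIERS = new HashMap<>();
        static {
            String[] ones = {"zero", "one", "two", "three", "four", "five", "six", "seven",
                             "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
                             "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"};
            for (int i = 0; i < ones.length; i++) UNITS.put(ones[i], (long) i);
            String[] tens = {"twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"};
            for (int i = 0; i < tens.length; i++) UNITS.put(tens[i], (i + 2) * 10L);
            MULTIPLIERS.put("hundred", 100L);
            MULTIPLIERS.put("thousand", 1000L);
            MULTIPLIERS.put("million", 1_000_000L);
        }

        // Returns the numeric value of a phrase like "six hundred", or null if it is not a number phrase.
        public static Long parse(String phrase) {
            long total = 0, current = 0;
            for (String w : phrase.toLowerCase().split("[\\s-]+")) {
                if (UNITS.containsKey(w)) {
                    current += UNITS.get(w);
                } else if (MULTIPLIERS.containsKey(w)) {
                    long m = MULTIPLIERS.get(w);
                    current = (current == 0 ? 1 : current) * m;
                    if (m >= 1000) { total += current; current = 0; }
                } else if (!w.equals("and")) {
                    return null; // not a pure number phrase
                }
            }
            return total + current;
        }

        public static void main(String[] args) {
            System.out.println(parse("six hundred"));            // 600
            System.out.println(parse("two thousand and five"));  // 2005
        }
    }

At index time you would emit the digit form alongside (or instead of) the original words, and apply the same normalization to queries so both directions match.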

Mongodb Text Search Processed Query

I'm using the text search feature and I couldn't find a way to get the stemmed terms used in the query. Is there a way to also return the list of words in their stemmed form together with the query results, as well as the parts of the document that matched? This would help me understand and identify which part of the document matches.
Cheers!
As of MongoDB 2.6, the only meta information about a text search that can be used is a score indicating the strength of the match. You can submit a ticket on the Core Server project to request this feature (I looked, and I don't think one exists at the moment).
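For reference, this is roughly what reading that score looks like with the current MongoDB Java driver (the database, collection, and search string below are made up; it assumes a text index already exists on the collection):

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.Projections;
    import com.mongodb.client.model.Sorts;
    import org.bson.Document;

    public class TextScoreExample {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> articles =
                        client.getDatabase("test").getCollection("articles");
                // Project and sort by the textScore metadata; the stemmed terms
                // themselves are not exposed, as noted above.
                for (Document doc : articles.find(Filters.text("coffee"))
                                            .projection(Projections.metaTextScore("score"))
                                            .sort(Sorts.metaTextScore("score"))) {
                    System.out.println(doc.get("title") + " -> " + doc.get("score"));
                }
            }
        }
    }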

Segmenting words, and grouping hyphenated and apostrophe words from text

I need to segment words from a text. Sometimes hyphenated words are written without hyphens, and apostrophe words are written without the apostrophe. There are also similar issues, like different spellings of the same word (e.g. color, colour), or single words written with or without a space between them (e.g. up to, upto; blank space, blankspace). I need to group these variants into one single representation and insert it into a set/hashmap or some other place. There can also be problems with accented words written without their accents (although I haven't faced them yet). Currently I am cutting the words at any blank-space character and at every non-alphanumeric character, then stemming them and omitting stop words.
These indexes would later be used for document similarity checking, searching, etc. Any suggestions on how I can combat these problems? I have thought of matching each scanned word against a wordlist, but the problem is that proper nouns and non-dictionary words would be omitted.
Info: My code is in Java
I think you should apply a combination of techniques.
1) For common spelling variants I would go with a dictionary-based method. Since they are common, I wouldn't worry about missing non-dictionary words. That should solve the color/colour problem.
2) For typos and other non-standard spelling variants you can apply the Metaphone (http://en.wikipedia.org/wiki/Metaphone) algorithm to convert the tokens to a representation of their English pronunciation. Similar variants sound similar, so you can match them to each other (e.g., Jon to John); a small sketch appears after this list. You can also use edit-distance-based matching at query time to match very similar tokens that differ only by a pair of transposed characters or a dropped character (e.g., Huseyin versus Housein).
3) For apostrophes and for compound words with a hyphen in between, you can store both variants. For example, "John's" would be indexed both as "John s" and "Johns". "blank-space" can be converted to (or stored along with) "blank space" and "blankspace".
4) For compound words without any hyphen in between, you could use an external library such as the HyphenationCompoundWordTokenFilterFactory class of Solr (http://lucene.apache.org/solr/api/org/apache/solr/analysis/HyphenationCompoundWordTokenFilterFactory.html). Although it can use a dictionary, it doesn't have to. It is targeted at the compound words that are frequently encountered in German and similar languages, but I see no reason why you couldn't apply it to English (you'll need to supply an English dictionary and hyphenation rule files).
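A small sketch of point 2, using the Metaphone implementation from Apache Commons Codec (the library choice is an assumption; any Metaphone implementation behaves the same way):

    import org.apache.commons.codec.language.Metaphone;

    public class PhoneticMatchExample {
        public static void main(String[] args) {
            Metaphone metaphone = new Metaphone();
            // Variants that sound alike encode to the same key, so they can be matched.
            System.out.println(metaphone.metaphone("Jon"));                // JN
            System.out.println(metaphone.metaphone("John"));               // JN
            System.out.println(metaphone.isMetaphoneEqual("Jon", "John")); // true
        }
    }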
Actually, the last point raises an important question: I don't think you are up to building your own search library from scratch. If that's true, why don't you use Lucene (or Solr, which is based on Lucene), a Java-based search library that already has methods and mechanisms to deal with these problems? For example, the injection technique allows you to index both color and colour at the same position in a document, so it doesn't matter whether you search for "colored cars" or "coloured cars" (assuming you take care of stemming). There are filters that do phonetic indexing (http://lucene.apache.org/solr/api/org/apache/solr/analysis/PhoneticFilterFactory.html). There is even a FuzzyQuery component that lets you allow a certain amount of edit distance when matching similar terms (http://lucene.apache.org/core/old_versioned_docs/versions/3_2_0/api/all/org/apache/lucene/search/FuzzyQuery.html).
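And a minimal sketch of that FuzzyQuery, assuming a hypothetical field named "body" and lower-cased indexed terms; two character edits are enough to bridge "huseyin" and "housein":

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;

    public class FuzzyMatchExample {
        public static FuzzyQuery build() {
            // Matches indexed terms within 2 character edits of "huseyin", e.g. "housein".
            return new FuzzyQuery(new Term("body", "huseyin"), 2);
        }
    }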
You will also need to decide at which point you want to deal with these problems. One extreme is to index all possible variants of these terms during indexing and use the queries as they are; that keeps your query processing light but costs you a larger index (because of all the variants you need to store). The other extreme is to index the documents as they are and expand the queries during the search; that keeps your index lean at the cost of heavier query processing. Phonetic indexing requires you to process both your documents during indexing and your queries during the search. Fuzzy matching is feasible only at search time because, presumably, you would not be able to store all edit variants of all terms in the index.

Need an efficient way to search for the following specific requirement

I have to search for a given file name (let's say a keyword) in a directory containing files. If there were only a few keywords to be searched, I could have used a regular search (like creating an array of the file names residing in the specified directory and then searching each file name for the given keyword). Since I need to search a very large number of keywords dynamically, that regular approach is not efficient. I had a couple of ideas:
1. Using hashing (but I'm not clear on how to design it).
2. Using Bloom filters for searching (please Google it if you don't know about it; the way it works is very interesting!). The problem with using Bloom filters is that "false positives are possible, but false negatives are not". I might miss some results....
Before searching, create a trie of all positive matches.
Creating the trie takes roughly O(nm), where n is the number of words and m is their average length.
To search, try to match each keyword against the trie. A look-up is done in O(m), where m is the length of the word being looked up.
Total runtime: O(nm + nm) => O(nm) to build the trie and look up all the words.
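A minimal Java sketch of that trie (the file names in main are hypothetical):

    import java.util.HashMap;
    import java.util.Map;

    public class Trie {
        private static final class Node {
            final Map<Character, Node> children = new HashMap<>();
            boolean isWord;
        }

        private final Node root = new Node();

        // Insert one file name; the cost is proportional to its length.
        public void insert(String word) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.isWord = true;
        }

        // Look up one keyword in O(m), where m is the keyword's length.
        public boolean contains(String word) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.get(c);
                if (node == null) return false;
            }
            return node.isWord;
        }

        public static void main(String[] args) {
            Trie fileNames = new Trie();
            fileNames.insert("report.txt");
            fileNames.insert("summary.doc");
            System.out.println(fileNames.contains("report.txt")); // true
            System.out.println(fileNames.contains("notes.txt"));  // false
        }
    }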