Needed an efficient way to search for the following specific requirement - hash

I have to search for a given file name (let's say a keyword) in a directory containing files. If there were only a few keywords to be searched, I could have used a regular search (like creating an array of the file names residing in the specified directory and then searching each file name for the given keyword). Since I need to search a very large number of keywords dynamically, it's not efficient to use a regular search. I had a couple of ideas:
1. Using hashing (but I'm not clear how to design it)
2. Using Bloom filters for searching (look them up if you don't know about them; how they work is very interesting!). The problem with Bloom filters is that "false positives are possible, but false negatives are not", so I might report some matches that aren't really there...

Before searching, create a trie of all positive matches.
Creating the trie will take O(n), where n is the number of words (more precisely, time proportional to the total number of characters).
To search, try to match each word against the trie. Look-ups are done in O(m), where m is the length of the word to look up.
Total runtime: O(n + nm) => O(nm) to find all the words.
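
A minimal sketch of that idea in Java, assuming plain lowercase names; the class and method names are illustrative, not a prescribed API:

import java.util.HashMap;
import java.util.Map;

// Minimal trie for exact-match lookups (illustrative sketch).
class Trie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    // Insertion costs O(m), m = length of the word.
    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    // Lookup also costs O(m).
    boolean contains(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return false;
        }
        return node.isWord;
    }
}

For example, insert every file name from the directory once, then call contains() for each keyword (or insert the keywords and probe with the file names, whichever set is built once and queried many times).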

Can I find text that's "close" to some query with PostgreSQL?

I have a table in my DB called text. It will have values like 'this is an example of lite coin'. I want to query this for 'litecoin' and things that are close (like 'lite coin'). Is there some way to do this generically, as I will have multiple queries? Maybe something with a max Levenshtein distance?
There is a core extension to PostgreSQL which implements the Levenshtein distance. For strings of very unequal length, as in your example, the distance will of necessity be large. So you would have to implement some normalization method, unless all phrases being searched within are the same length.
I don't think Levenshtein is indexable. You could instead look into trigram distance, which is indexable.
+1 on the trigram suggestion. Trigrams in Postgres are excellent and definitely indexable. Depending on the index option you choose (GIN or GiST), you get access to different operators. If I remember correctly off the top of my head, GiST gives you distance tolerances for the words and lets you search for them in order; you can specify the number of words expected between two search words, and more. Both GIN and GiST are worth experimenting with.
Levenshtein compares two specific strings, so it doesn't lend itself to indexing. What would you index? The comparison string is unknown in advance. You could index every string by every string in a column and, apart from the O(aaaargh!) complexity, you still might not have anything like your search string in the index.
Tip: If you must use Levenshtein, and it is pretty great where it's useful, you can eliminate many rows from the comparison cheaply. If you've got a 10-character search string and only want strings within a distance of 2, you can eliminate any string more than 2 characters shorter or longer than that from consideration without fear of losing any matches.
You might find that you want to apply Levenshtein (or Jaccard, etc.) to possible matches found by the trigrams. But, honestly, Levenshtein is, by nature, biased towards strings in the same order. That's okay for lite coin/light coin/litecoin, but not helpful when the words can be in any order, like with first and last name, much address data, and many, many phrase-like searches.
The other thing to consider, depending on your range of queries, is full-text search with tsvectors. That is also indexable, and also supports a range of operators.
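
For concreteness, a sketch of the trigram approach from Java via JDBC. It assumes the pg_trgm extension and a GIN trigram index are already in place; the table name docs, column name body, and the connection details are placeholders, not taken from the question:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class TrigramSearch {
    public static void main(String[] args) throws Exception {
        // One-time setup (in psql):
        //   CREATE EXTENSION pg_trgm;
        //   CREATE INDEX docs_body_trgm ON docs USING gin (body gin_trgm_ops);
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "secret")) {
            // % is the indexable trigram-similarity filter; similarity() gives a
            // score for ranking. Newer pg_trgm versions also offer word_similarity(),
            // which copes better when the stored text is much longer than the query.
            PreparedStatement ps = conn.prepareStatement(
                    "SELECT body, similarity(body, ?) AS sim " +
                    "FROM docs WHERE body % ? " +
                    "ORDER BY sim DESC LIMIT 10");
            ps.setString(1, "litecoin");
            ps.setString(2, "litecoin");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getDouble("sim") + "  " + rs.getString("body"));
                }
            }
        }
    }
}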

Indexing on only part of a field in MongoDB

Is there a way to create an index on only part of a field in MongoDB, for example on the first 10 characters? I couldn't find it documented (or asked about on here).
The MySQL equivalent would be CREATE INDEX part_of_name ON customer (name(10));.
Reason: I have a collection with a single field that varies in length from a few characters up to over 1000 characters, average 50 characters. As there are a hundred million or so documents it's going to be hard to fit the full index in memory (testing with 8% of the data the index is already 400MB, according to stats). Indexing just the first part of the field would reduce the index size by about 75%. In most cases the search term is quite short, it's not a full-text search.
A work-around would be to add a second field of 10 (lowercased) characters for each item, index that, then add logic to filter the results if the search term is over ten characters (and that extra field is probably needed anyway for case-insensitive searches, unless anybody has a better way). Seems like an ugly way to do it though.
[added later]
I tried adding the second field, containing the first 12 characters from the main field, lowercased. It wasn't a big success.
Previously, the average object size was 50 bytes, but I forgot that includes the _id and other overheads, so my main field length (there was only one) averaged nearer to 30 bytes than 50. Then, the second field index contains the _id and other overheads.
Net result (for my 8% sample) is the index on the main field is 415MB and on the 12 byte field is 330MB - only a 20% saving in space, not worthwhile. I could duplicate the entire field (to work around the case insensitive search problem) but realistically it looks like I should reconsider whether MongoDB is the right tool for the job (or just buy more memory and use twice as much disk space).
[added even later]
This is a typical document, with the source field, and the short lowercased field:
{ "_id" : ObjectId("505d0e89f56588f20f000041"), "q" : "Continental Airlines", "f" : "continental " }
Indexes:
db.test.ensureIndex({q:1});
db.test.ensureIndex({f:1});
The "f" index, working on a shorter field, is 80% of the size of the "q" index. I didn't mean to imply that I included the _id in the index, just that the index has to record somewhere which document each entry points to, so it's an overhead that probably helps explain why a shorter key makes so little difference.
Access to the index will be essentially random; no part of it is more likely to be accessed than any other. Total index size for the full file will likely be 5GB, so it's not extreme for that one index. Adding some other fields for other search cases, and their associated indexes, and copies of data for lower case, does start to add up and makes paging and swapping more likely (it's an 8GB server), which is why I started looking into a more concise index.
MongoDB has no way to create an index on a portion of a field's value. Your best approach is to create the second field, as you've suggested.
Since you'll need the second field for efficient case-insensitive searching anyway, there's really no reason to not create it.
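
A rough sketch of that second-field approach with the current MongoDB Java driver (at the time of the question you would use ensureIndex, as in the shell commands in the question above). The 12-character lowercased prefix and the field names q and f follow the question; the exact-match re-check and the connection details are assumptions:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class PrefixFieldSearch {
    private static String prefix12(String s) {
        String lower = s.toLowerCase();
        return lower.substring(0, Math.min(12, lower.length()));
    }

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost")) {
            MongoCollection<Document> coll =
                    client.getDatabase("test").getCollection("test");

            // Index only the short derived field, not the long source field.
            coll.createIndex(Indexes.ascending("f"));

            // On insert, derive the short lowercased field from the source field.
            String q = "Continental Airlines";
            coll.insertOne(new Document("q", q).append("f", prefix12(q)));

            // On search, hit the small index first, then re-check the full value
            // in case the search term is longer than the stored 12 characters.
            String term = "Continental Airlines";
            for (Document d : coll.find(Filters.eq("f", prefix12(term)))) {
                if (d.getString("q").equalsIgnoreCase(term)) {
                    System.out.println(d.toJson());
                }
            }
        }
    }
}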
The indexes don't store the '_id' field of the document; they store a DiskLoc structure, which is a much lower-level structure. See here for details:
http://www.10gen.com/presentations/MongoNYC-2012/storage-engine-internals
Also, note that the "ugly" is really an artifact of "relational thinking". (As a long-time SQL user myself, I often find that the hardest part about learning MongoDB is un-learning my relational thinking.) In a document-oriented database, denormalizing and duplicating data are actually Best Practices.

Lucene "contains" search on subset of index

I have an index with around 5 million documents that I am trying to do a "contains" search on. I know how to accomplish this and I have explained the performance cost to the customer, but that is what they want. As expected doing a "contains" search on the entire index is very slow, but sometimes I only want to search a very small subset of the index (say 100 documents or so). I've done this by adding a Filter to the search that should limit the results correctly. However I find that this search and the entire index search perform almost exactly the same. Is there something I'm missing here? It feels like this search is also searching the entire index.
Adding a Filter to the search restricts which documents can be returned, but it does not shrink the work of the query itself: a "contains" (leading-wildcard) query still has to enumerate the terms of the entire index, which is why the filtered search performs almost the same as searching the whole index.
You need to be more clear about what you need from your search, but I don't believe what you want is possible.
Is the subset of documents always the same? If so, maybe you can get clever with multiple indices. (e.g. search the smaller index and if there aren't enough hits, then search the larger index).
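To make the multiple-index idea concrete, here is a rough Lucene sketch: search a small, dedicated index for the subset first and only fall back to the full index if there aren't enough hits. The directory names idx-small and idx-full, the field name body, and the use of WildcardQuery for the "contains" search are assumptions for illustration:

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class SubsetThenFullSearch {
    public static void main(String[] args) throws Exception {
        // "contains" search: a double-wildcard term query (slow by nature).
        WildcardQuery contains = new WildcardQuery(new Term("body", "*searchterm*"));

        // Try the small, dedicated index for the subset first...
        try (DirectoryReader small = DirectoryReader.open(FSDirectory.open(Paths.get("idx-small")))) {
            TopDocs hits = new IndexSearcher(small).search(contains, 100);
            if (hits.scoreDocs.length > 0) {
                print(hits);
                return;
            }
        }

        // ...and only pay for the full-index scan when the subset has no hits.
        try (DirectoryReader full = DirectoryReader.open(FSDirectory.open(Paths.get("idx-full")))) {
            print(new IndexSearcher(full).search(contains, 100));
        }
    }

    private static void print(TopDocs hits) {
        for (ScoreDoc sd : hits.scoreDocs) {
            System.out.println("doc=" + sd.doc + " score=" + sd.score);
        }
    }
}

The gain comes from the fact that the expensive term enumeration then only runs over the small index's term dictionary rather than the full one.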
You can try SingleCharTokenAnalyzer

Segmenting words, and grouping hyphenated and apostrophe words from text

I need to segment words from a text. Sometimes hyphenated words are written without hyphens, and apostrophe words are written without the apostrophe. There are also similar issues, like different spellings of the same word (ex: color, colour), or single words written with spaces between them (ex: up to, upto, blank space, blankspace). I need to group these variants as one single representation and insert it into a set/hashmap or some other place. There can also be problems with accented words written without the accent characters (although I haven't faced them yet). Currently I am cutting the words at any blank-space character and at every non-alphanumeric character, then stemming them and omitting stop words.
These indexes would later be used for document similarity checking, searching, etc. Any suggestions on how I can combat these problems? I have thought of matching each scanned word against a wordlist, but the problem is that proper nouns and non-dictionary words would be omitted.
Info: My code is in Java
I think you should apply a combination of techniques.
1) For common spelling variants I would go with a dictionary-based method. Since they are common, I wouldn't worry about missing non-dictionary words. That should solve the color/colour problem.
2) For typos and other non-standard spelling variants, you can apply the Metaphone (http://en.wikipedia.org/wiki/Metaphone) algorithm to convert the tokens to a representation of their English pronunciation. Similar variants sound alike, so you can match them to each other (e.g., Jon to John). You can also use edit-distance-based matching at query time to match very similar tokens that differ only by a pair of transposed characters or a dropped character (e.g., Huseyin versus Housein). See the sketch after this list.
3) For apostrophe and compound words with hyphen in between, you can store both variants. For example, "John's" would be indexed both as "John s" and "Johns". "blank-space" can be converted to (or stored along with) "blank space" and "blankspace".
4) For compound words without any hyphen in between, you could use an external library such as HyphenationCompoundWordTokenFilterFactory class of Solr (http://lucene.apache.org/solr/api/org/apache/solr/analysis/HyphenationCompoundWordTokenFilterFactory.html). Although it can use a dictionary, it doesn't have to. It is targeted to deal with compound words that are frequently encountered in German and similar languages. I see no reason why you can't apply it to English (you'll need to supply an English dictionary and hyphenation rule files).
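As a small illustration of point 2, here is a sketch using Apache Commons Codec's Metaphone and Apache Commons Text's LevenshteinDistance; these particular libraries are my choice for the example, not something the answer above prescribes:

import org.apache.commons.codec.language.Metaphone;
import org.apache.commons.text.similarity.LevenshteinDistance;

public class VariantMatching {
    public static void main(String[] args) {
        // Phonetic key: spelling variants that sound alike collapse to the same
        // code, so they can share one entry in the set/hashmap.
        Metaphone metaphone = new Metaphone();
        System.out.println(metaphone.encode("Jon"));   // prints the same code...
        System.out.println(metaphone.encode("John"));  // ...as this one

        // Edit distance for near-misses with a swapped or dropped character.
        LevenshteinDistance lev = new LevenshteinDistance();
        System.out.println(lev.apply("Huseyin", "Housein")); // small distance => likely the same token
    }
}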
Actually, point 4 raises an important question. I don't think you are up to building your own search library from scratch. If that's true, why don't you use Lucene (or Solr, which is based on Lucene), a Java-based search library which already has methods to deal with these problems? For example, the injection technique allows you to index both color and colour in the same place in a document, so it doesn't matter whether you search for "colored cars" or "coloured cars" (assuming you take care of stemming). There are filters which do phonetic indexing (http://lucene.apache.org/solr/api/org/apache/solr/analysis/PhoneticFilterFactory.html). There is even a FuzzyQuery component which lets you allow a certain amount of edit distance when matching similar terms (http://lucene.apache.org/core/old_versioned_docs/versions/3_2_0/api/all/org/apache/lucene/search/FuzzyQuery.html)
You will also need to decide at which point you want to deal with these problems: One extreme approach is to index all possible variants of these terms during the indexing and use the queries as they are. That will keep your query processing light, but will cost you a larger index (because of all the variants you need to store). The other extreme is to index the documents as they are and expand the queries during the searching. That will allow you to keep your index lean at the cost of heavier query processing. Phonetic indexing would require you to process both your documents during the indexing and the queries during the search. Fuzzy matching would be feasible only during the search time because, presumably, you wouldn't be able to store all edit variants of all terms in the index.

What's the best way to parse a lexicon and show a large number of matches using wildcards

My problem is, I have a lexicon of about 200,000 words or so. The file is 1.8 MB in size. I want input from a user, say **id, and I want to show all possible matches, where * can be any letter A-Z (said, maid, etc.).
I'm looking for some suggestions on the most efficient way to do this, because I want the user to be able to add more concrete letters and give a live update of the word matches.
My idea was to attempt to use RegexKitLite, but I have a feeling that would be incredibly slow.
Thanks for any input!
Edit: Do you think it's possible to use NSPredicate to achieve this?
The things you can do to optimize search performance depend heavily on how you want to limit the use of those wildcards.
Precisely: what are the characteristics of your wildcards?
prefix-only wildcards (m/.+foobar/): the wildcard comes first, followed by fixed text
suffix-only wildcards (m/foobar.+/): fixed text followed by a trailing wildcard
atomic wildcards (m/./): a wildcard standing for exactly one character
dynamic wildcards (m/.+/): a wildcard standing for one or more characters
Prefix-only Wildcards
Use a Prefix tree or DAWG
Suffix-only Wildcards
Use a Suffix tree or DAWG
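For illustration, here is one way a trie can cover both cases: a forward trie answers patterns with a fixed leading part (m/foobar.+/) directly, and a second trie built over the reversed words answers patterns with a fixed trailing part (m/.+foobar/). The reversed-word trie is a common, simple stand-in for the suffix structure / DAWG suggested above; the class is purely illustrative:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: enumerate words by fixed prefix or fixed suffix using two tries.
class WildcardLexicon {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node forward = new Node();   // for fixed-prefix queries
    private final Node backward = new Node();  // for fixed-suffix queries

    void add(String word) {
        insert(forward, word);
        insert(backward, new StringBuilder(word).reverse().toString());
    }

    // All words starting with the given prefix (the m/foobar.+/ case).
    List<String> withPrefix(String prefix) {
        List<String> out = new ArrayList<>();
        collect(descend(forward, prefix), prefix, out, false);
        return out;
    }

    // All words ending with the given suffix (the m/.+foobar/ case).
    List<String> withSuffix(String suffix) {
        List<String> out = new ArrayList<>();
        String rev = new StringBuilder(suffix).reverse().toString();
        collect(descend(backward, rev), rev, out, true);
        return out;
    }

    private static void insert(Node root, String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    private static Node descend(Node root, String path) {
        Node node = root;
        for (char c : path.toCharArray()) {
            if (node == null) return null;
            node = node.children.get(c);
        }
        return node;
    }

    private static void collect(Node node, String soFar, List<String> out, boolean reversed) {
        if (node == null) return;
        if (node.isWord) {
            out.add(reversed ? new StringBuilder(soFar).reverse().toString() : soFar);
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            collect(e.getValue(), soFar + e.getKey(), out, reversed);
        }
    }
}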
Atomic Wildcards
One way to drastically reduce the number of matches you have to run would be:
Build a BKTree from your word collection.
As (and as long as) your wildcard has a fixed length (1 in your case), you could then simply query your BKTree for nodes with an exact edit distance of n, with n being the number of wildcards. Traditional BKTree queries have an upper limit on variance; in your case you'd want to introduce an additional lower limit, narrowing the range of accepted variance down to exactly [n,n] (vs. the traditional [0,n]).
You'll get an array of words differing from your query word by exactly n characters.
For the query **id some possible matches would be:
void (2x additions)
laid (2x additions)
bad (1x replacement, 1x addition)
to (2x replacements)
While those are not yet correct matches for your query, they represent a very small subset of your total collection of words.
So, last but not least, you run your regex matching against those results and return all remaining matches.
BKTrees use the Levenshtein distance as a spatial heuristic to drastically reduce the number of required comparisons/matchings (how much depends on the entropy within your word collection).
To gain additional optimization you could use multiple BKTrees:
Divide your collection into sub-sets. One set for words of length 1, one for length 2, one for 3, and so on. From each subset you then build a BKTree. For a query **id you'd then query the BKTree for length 4 (wildcards are counted like chars).
This applies for wildcards getting interpreted as m/./. If your wildcard however shall get interpreted as m/.?/ you'd query the BKTrees for length 3 & 4.
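A compact sketch of the BK-tree itself, written in Java purely for illustration (the question appears to target Objective-C, but the structure translates directly); the range query takes an explicit [min, max] so you can ask for an exact distance as described above:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal BK-tree over Levenshtein distance (illustrative sketch).
class BKTree {
    private final String word;                         // word stored at this node
    private final Map<Integer, BKTree> children = new HashMap<>();

    BKTree(String word) { this.word = word; }

    void add(String w) {
        int d = distance(word, w);
        if (d == 0) return;                            // already present
        BKTree child = children.get(d);
        if (child == null) children.put(d, new BKTree(w));
        else child.add(w);
    }

    // Collect words whose distance to the query lies in [min, max]
    // (for n fixed-length wildcards, ask for min == max == n).
    void query(String q, int min, int max, List<String> out) {
        int d = distance(word, q);
        if (d >= min && d <= max) out.add(word);
        // Triangle inequality: only subtrees labelled within [d - max, d + max] can qualify.
        for (Map.Entry<Integer, BKTree> e : children.entrySet()) {
            int k = e.getKey();
            if (k >= d - max && k <= d + max) e.getValue().query(q, min, max, out);
        }
    }

    // Plain dynamic-programming Levenshtein distance.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}

For **id you would, for example, build one tree per word length, query the length-4 tree for candidates at the appropriate exact distance, and then run the regex only over that small candidate set.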
Alternatively to BKTrees you could also use a GADDAG, which is a data structure (specialization of Trie) specialized particularly for Scrabble-style lookups.
If I'm not mistaken your wildcards will need to get interpreted strictly as m/./ as well.
Dynamic Wildcards
I cannot think of a significantly better solution right now than running your regex against your collection of words.