When is Rabin-Karp more effective than KMP or Boyer-Moore? - string-search

I'm learning about string-searching algorithms and understand how they work, but I haven't found a good answer about the cases in which the Rabin-Karp algorithm would be more effective than KMP or Boyer-Moore. I can see that it is easier to implement and doesn't need the same preprocessing overhead, but beyond that I have no clue.
So, when is Rabin-Karp better to use than the others?

Each of these algorithms has properties that might make it desirable or undesirable in different circumstances. Here's a quick rundown:
Space Usage favors Rabin-Karp
One major advantage of Rabin-Karp is that it uses O(1) auxiliary storage space, which is great if the pattern string you're looking for is very large. For example, if you're looking for all occurrences of a string of length 10^7 in a longer string of length 10^9, not having to allocate a table of 10^7 machine words for a failure function or shift table is a major win. Both Boyer-Moore and KMP use Ω(n) memory on a pattern string of length n, so Rabin-Karp would be a clear win here.
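To make the space claim concrete, here is a minimal Rabin-Karp sketch in Java; the base, modulus, and all names are my own choices for illustration, not anything prescribed by the algorithm. Besides the two input strings and the output list, it keeps only a handful of long values, regardless of the pattern length.

import java.util.ArrayList;
import java.util.List;

public class RabinKarpSketch {
    public static List<Integer> findAll(String text, String pattern) {
        List<Integer> matches = new ArrayList<>();
        int n = pattern.length(), m = text.length();
        if (n == 0 || n > m) return matches;

        final long MOD = 1_000_000_007L;   // a large prime modulus (arbitrary choice)
        final long BASE = 256;             // treat characters as base-256 digits
        long patHash = 0, winHash = 0, pow = 1;

        // Hash the pattern and the first window of the text.
        for (int i = 0; i < n; i++) {
            patHash = (patHash * BASE + pattern.charAt(i)) % MOD;
            winHash = (winHash * BASE + text.charAt(i)) % MOD;
            if (i < n - 1) pow = (pow * BASE) % MOD;   // BASE^(n-1) mod MOD
        }

        for (int i = 0; i + n <= m; i++) {
            // Only verify character-by-character when the hashes agree.
            if (patHash == winHash && text.regionMatches(i, pattern, 0, n)) {
                matches.add(i);
            }
            if (i + n < m) {
                // Roll the hash: drop text[i], append text[i + n].
                winHash = (winHash - text.charAt(i) * pow % MOD + MOD) % MOD;
                winHash = (winHash * BASE + text.charAt(i + n)) % MOD;
            }
        }
        return matches;
    }
}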
Worst-Case Single-Match Efficiency favors Boyer-Moore or KMP
Rabin-Karp suffers from two potential worst cases. First, if the particular prime numbers used by Rabin-Karp are known to a malicious adversary, that adversary could craft an input that causes the rolling hash to collide with the hash of the pattern string at every position, degrading the algorithm's performance to Ω((m - n + 1)n) on a text of length m and a pattern of length n. If you're taking untrusted strings as input, this could be an issue. Neither Boyer-Moore nor KMP has this weakness.
Worst-Case Multiple-Match Efficiency favors KMP
Second, Rabin-Karp is slow when you want to find all matches of a pattern string that appears a large number of times. For example, if you're looking for a string of 10^5 copies of the letter a in a text string consisting of 10^9 copies of the letter a, then there will be lots of positions where the pattern appears, and each one requires a linear scan to verify. This can also lead to a runtime of Ω((m - n + 1)n).
Many Boyer-Moore implementations suffer from this second case as well, but will not have bad runtimes in the first case. KMP has no pathological worst cases like these.
Best-Case Performance favors Boyer-Moore
One advantage of the Boyer-Moore algorithm is that it doesn't necessarily have to scan all the characters of the input string. Specifically, the Bad Character Rule can be used to skip over huge regions of the input string when a mismatch occurs. The best-case runtime for Boyer-Moore is O(m / n), which is much faster than what Rabin-Karp or KMP can provide.
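As an illustration of how those skips arise, here is a small sketch of Horspool's variant, which keeps only the bad-character rule (a common simplification of full Boyer-Moore); the table sizing and names are my own. When the text character aligned with the pattern's last position does not occur in the pattern at all, the whole pattern is shifted past it, which is what produces the O(m / n) best case.

public class HorspoolSketch {
    public static int indexOf(String text, String pattern) {
        int n = pattern.length(), m = text.length();
        if (n == 0) return 0;

        // shift[c] = how far we may slide the pattern when the text character
        // aligned with the pattern's last position is c.
        int[] shift = new int[Character.MAX_VALUE + 1];
        java.util.Arrays.fill(shift, n);   // characters absent from the pattern allow a full-length jump
        for (int i = 0; i < n - 1; i++) {
            shift[pattern.charAt(i)] = n - 1 - i;
        }

        int i = 0;                         // left end of the current alignment
        while (i + n <= m) {
            int j = n - 1;
            while (j >= 0 && text.charAt(i + j) == pattern.charAt(j)) j--;
            if (j < 0) return i;           // full match at position i
            i += shift[text.charAt(i + n - 1)];   // skip based on the character under the pattern's last cell
        }
        return -1;
    }
}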
Generalizations to Multiple Strings favor KMP
Suppose you have a fixed set of multiple text strings that you want to search for, rather than just one. You could, if you wanted to, run multiple passes of Rabin-Karp, KMP, or Boyer-Moore across the strings to find all the matches. However, the runtime of this approach isn't great, as it scales linearly with the number of strings to search for. On the other hand, KMP generalizes nicely to the Aho-Corasick string-matching algorithm, which runs in time O(m + n + z), where z is the number of matches found and n is the combined length of the pattern strings. Notice that there's no dependence here on the number of different pattern strings being searched for!
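For reference, here is a compact Aho-Corasick sketch. It assumes a lowercase ASCII alphabet, and the class and method names are made up for this example; the fail[] array plays the same role as the KMP failure function, just generalized over a trie of all the patterns.

import java.util.*;

public class AhoCorasickSketch {
    private static final int ALPHA = 26;
    private final int[][] go;                    // trie/goto transitions
    private final int[] fail;                    // failure links (generalized KMP failure function)
    private final List<List<Integer>> out;       // indices of patterns that end at each node
    private int nodeCount = 1;

    public AhoCorasickSketch(List<String> patterns) {
        int maxNodes = 1;
        for (String p : patterns) maxNodes += p.length();
        go = new int[maxNodes][ALPHA];
        for (int[] row : go) Arrays.fill(row, -1);
        fail = new int[maxNodes];
        out = new ArrayList<>();
        for (int i = 0; i < maxNodes; i++) out.add(new ArrayList<>());

        // 1. Build the trie of all patterns.
        for (int p = 0; p < patterns.size(); p++) {
            int node = 0;
            for (char c : patterns.get(p).toCharArray()) {
                int x = c - 'a';
                if (go[node][x] == -1) go[node][x] = nodeCount++;
                node = go[node][x];
            }
            out.get(node).add(p);
        }

        // 2. BFS to compute failure links and complete the goto function.
        Deque<Integer> queue = new ArrayDeque<>();
        for (int x = 0; x < ALPHA; x++) {
            if (go[0][x] == -1) go[0][x] = 0;
            else { fail[go[0][x]] = 0; queue.add(go[0][x]); }
        }
        while (!queue.isEmpty()) {
            int node = queue.poll();
            out.get(node).addAll(out.get(fail[node]));   // inherit matches reachable via failure links
            for (int x = 0; x < ALPHA; x++) {
                int child = go[node][x];
                if (child == -1) {
                    go[node][x] = go[fail[node]][x];
                } else {
                    fail[child] = go[fail[node]][x];
                    queue.add(child);
                }
            }
        }
    }

    // Reports (end position, pattern index) pairs; runs in O(text length + matches) after preprocessing.
    public List<int[]> search(String text) {
        List<int[]> matches = new ArrayList<>();
        int node = 0;
        for (int i = 0; i < text.length(); i++) {
            node = go[node][text.charAt(i) - 'a'];
            for (int p : out.get(node)) matches.add(new int[] { i, p });
        }
        return matches;
    }
}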

The Rabin-Karp algorithm is better when searching a large text for multiple pattern matches at once, as in plagiarism detection.
Boyer-Moore is better when the pattern is relatively long and the alphabet is moderately sized with a large vocabulary, as in natural-language text; it does not work well with binary strings or very short patterns.
Meanwhile, KMP is a good fit for small alphabets, such as in bioinformatics or searching binary strings; its advantage shrinks as the alphabet grows.

Space-Time complexities of all three (for reference)
(For finding ALL occurrences of the pattern)
m : length of the pattern
n : length of the string in which we search for the pattern
k : size of the alphabet
Rabin-Karp:
Average performance : Θ(m) preprocessing + Θ(n + m) matching
Worst-case performance : Θ(m) preprocessing + O(nm) matching
O(1) auxiliary space
Uses hashing to find an exact match of a pattern string in a text. It uses a rolling hash to quickly filter out positions of the text that cannot match the pattern, and then verifies the match at the remaining positions.
Boyer-Moore:
Worst-case performance : Θ(m) preprocessing + O(mn) matching
Best-case performance : Θ(m) preprocessing + Ω(n/m) matching
Worst-case space complexity : Θ(k).
Can be used for "grep" like searches.
https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm#Performance
Knuth-Morris-Pratt:
Worst-case performance : Θ(m) preprocessing + Θ(n) matching
Worst-case space complexity : Θ(m)
For more details, look up each algorithm on Wikipedia.

Related

Hyphenation algorithm using Bloom filter

A classical example of where Bloom filters shine is in hyphenation algorithms. It's even the example given in the original paper on Bloom filters.
I don't understand how a Bloom filter would be used in a hyphenation algorithm.
A hyphenation algorithm is defined as something that takes an input word and gives back the possible ways that that word can be hyphenated.
Would the Bloom filter contain both hyph-enation and hyphena-tion, and client code would query the filter for h-yphenation, hy-phenation, hyp-henation, ...?
Here's what the original paper says:
Analysis of Hyphenation Sample Application
[...] Let us assume that there are about 500,000 words to be hyphenated by the program and that 450,000 of these words can be hyphenated by application of a few simple rules. The other 50,000 words require reference to a dictionary. It is reasonable to estimate that at least 19 bits would, on the average, be required to represent each of these 50,000 words using a conventional hash-coding method. If we assume that a time factor of T = 4 is acceptable, we find from eq. (9) that the hash area would be 2,000,000 bits in size. This might very well be too large for a practical core contained hash area. By using method 2 with an allowable error frequency of, say, P = 1/16, and using the smallest possible hash area by having T = 2, we see from eq. (22) that the problem can be solved with a hash area of less than 300,000 bits, a size which would very likely be suitable for a core hash area. With a choice for P of 1/16, an access would be required to the disk resident dictionary for approximately 50,000 + 450,000/16 ~ 78,000 of the 500,000 words to be hyphenated, i.e. for approximately 16 percent of the cases. This constitutes a reduction of 84 percent in the number of disk accesses from those required in a typical conventional approach using a completely disk resident hash area and dictionary.
For this case,
the dictionary is stored on disk and contains all words with the correct hyphenation,
the Bloom filter contains just keys that require special hyphenation, e.g. maybe hyphenation itself,
the Bloom filter responds with "probably" or "no".
Then the algorithm to find the possible hyphenations of a word is:
word = "hyphenation"; (or some other word)
x = bloomFilter.probablyContains(word);
if (x == "probably") {
lookupInDictionary(word).getHypenation();
} else {
// x == "no" case
useSimpleRuleBasedHypenation(word);
}
If the Bloom filter responds with "probably", then the algorithm would have to do a disk read in the dictionary.
The Bloom filter would respond with "probably" sometimes if there are in fact no special rules, in which case a disk I/O is done unnecessarily. But that's OK as long as that doesn't happen too often (false positive rate is low, e.g. 1/16).
The Bloom filter, as it has no false negatives, would never respond with "no" for words that do have special hyphenation.
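A minimal Bloom filter sketch may help make the "probably" / "no" behaviour concrete. The sizing and the double-hashing scheme below are my own illustrative choices, not the construction from Bloom's paper.

import java.util.BitSet;

public class BloomFilterSketch {
    private final BitSet bits;
    private final int m;        // number of bits in the filter
    private final int k;        // number of hash functions

    public BloomFilterSketch(int numBits, int numHashes) {
        this.m = numBits;
        this.k = numHashes;
        this.bits = new BitSet(numBits);
    }

    // Derive k indices from two base hashes (double hashing, purely for illustration).
    private int index(String word, int i) {
        int h1 = word.hashCode();
        int h2 = h1 * 31 + word.length();
        return Math.floorMod(h1 + i * h2, m);
    }

    public void add(String word) {                 // e.g. every word that needs special hyphenation
        for (int i = 0; i < k; i++) bits.set(index(word, i));
    }

    public boolean probablyContains(String word) { // true = "probably" (maybe a false positive), false = "no"
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(word, i))) return false;   // a single unset bit means a guaranteed "no"
        }
        return true;
    }
}

With the paper's numbers, you would insert the roughly 50,000 special words into a filter of about 300,000 bits, accepting a false-positive ("probably" when the true answer is "no") rate around 1/16; the interface above stays the same, only the sizing changes.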

Finding a good hash function for languages accepted by finite state automata

I'm working on a project in Java (though I think it doesn't depend on the language) where I generate small (4 states max) nondeterministic finite automata over a binary alphabet, and I have to quickly check each generated automaton for equivalence with the previous ones. Therefore, I need a good hash function to avoid comparing against too many automata.
My first thought was to do a DFS over the transitions, find all accepted words of length at most 5, and map the set of accepted words to a 64-bit long (there are fewer than 64 binary words of length at most 5). But this seems to produce too many collisions on NFAs with 4 states, and increasing the length makes computing the hash code too slow for practical use.
Another approach was to fix a set of words and test which of them the automaton accepts, but choosing the right words doesn't seem trivial.
Do you have any idea how to improve the hash function to avoid too many collisions without a significant loss of speed?
Thanks in advance
I was thinking about it further (thanks @justhalf and @templatetypedef) and I have an idea: an injective function from any NFA (or, more precisely, from the language it accepts) to the integers. Take an NFA A and construct the minimal DFA A_min with a complete transition function that accepts the same language. By the Myhill-Nerode theorem, this automaton is unique up to isomorphism. Do a BFS from the initial state, giving priority to the edges (transitions) according to some fixed order of the alphabet characters (for example 0 first, then 1), and renumber the states in the order they are visited. Now we have a canonical minimal DFA, and we can map its transition matrix to an integer and append an encoding of the final states (or better, make a tuple, to avoid collisions). This integer can then be used to decide equivalence of two NFAs. Do you think this is OK, or do you have any other idea?
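A sketch of that canonicalisation step, assuming the minimal complete DFA has already been computed (e.g. by subset construction followed by Hopcroft's algorithm); the key encoding and all names are my own choices.

import java.util.*;

public class DfaCanonicalKey {
    // delta[state][symbol] = next state, over the binary alphabet {0, 1};
    // start is the initial state, accepting[s] is true for final states.
    public static String canonicalKey(int[][] delta, int start, boolean[] accepting) {
        int n = delta.length;
        int[] newId = new int[n];
        Arrays.fill(newId, -1);

        // BFS from the start state, always following symbol 0 before symbol 1,
        // renumbering states in the order they are first visited.
        Deque<Integer> queue = new ArrayDeque<>();
        int[] order = new int[n];
        int visited = 0, nextId = 1;
        newId[start] = 0;
        order[visited++] = start;
        queue.add(start);
        while (!queue.isEmpty()) {
            int s = queue.poll();
            for (int symbol = 0; symbol <= 1; symbol++) {
                int t = delta[s][symbol];
                if (newId[t] == -1) {
                    newId[t] = nextId++;
                    order[visited++] = t;
                    queue.add(t);
                }
            }
        }

        // Encode the renumbered transition table plus the set of final states.
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < visited; i++) {
            int s = order[i];
            key.append(newId[delta[s][0]]).append(',')
               .append(newId[delta[s][1]]).append(',')
               .append(accepting[s] ? '1' : '0').append(';');
        }
        return key.toString();
    }
}

Since the minimal complete DFA is unique up to isomorphism and the BFS order fixes the isomorphism, two NFAs accept the same language exactly when their keys are equal, so key.hashCode() can serve as the hash and the key itself resolves collisions.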

Efficient Function to Map (or Hash) Integers and Integer Ranges into Index

We are looking for the computationally simplest function that, given a value from a high-frequency input stream of widely distributed integers, returns an index used to look up a function, where the values of interest are specified as individual integers and ranges of integers.
It is OK if the hash/map function selection itself varies based on the specific integer and range requirements, and the performance associated with the part of the code that selects this algorithm is not critical. The number of integers/ranges of interest in most cases will be small (zero to a few thousand). The performance critical portion is in processing the incoming stream and selecting the appropriate function.
As a simple example, please consider the following pseudo-code:
switch (highFrequencyIntegerStream)
    case(2)      : func1();
    case(3)      : func2();
    case(8)      : func3();
    case(33-122) : func4();
    ...
    case(10,000) : func40();
In a typical example, there would be only a few thousand of the "cases" shown above, which could include a full range of 32-bit integer values and ranges. (In the pseudo code above 33-122 represents all integers from 33 to 122.) There will be a large number of objects containing these "switch statements."
(Note that the actual implementation will not include switch statements. It will instead be a jump table (which is an array of function pointers) or maybe a combination of the Command and Observer patterns, etc. The implementation details are tangential to the request, but provided to help with visualization.)
Many of the objects will contain "switch statements" with only a few entries. The values of interest are subject to real time change, but performance associated with managing these changes is not critical. Hash/map algorithms can be re-generated slowly with each update based on the specific integers and ranges of interest (for a given object at a given time).
We have searched around the internet, looking at Bloom filters, various hash functions listed on Wikipedia's "hash function" page and elsewhere, quite a few Stack Overflow questions, abstract algebra (mostly Galois theory which is attractive for its computationally simple operands), various ciphers, etc., but have not found a solution that appears to be targeted to this problem. (We could not even find a hash or map function that considered these types of ranges as inputs, much less a highly efficient one. Perhaps we are not looking in the right places or using the correct vernacular.)
The current plan is to create a custom algorithm that preprocesses the list of interesting integers and ranges (for a given object at a given time) looking for shifts and masks that can be applied to the input stream to help delineate the ranges. Note that most of the incoming integers will be uninteresting, and it is of critical importance to make a very quick decision for as large a percentage of that portion of the stream as possible (which is why Bloom filters looked interesting at first, before we started thinking that their implementation required more computational complexity than other solutions).
Because the first decision is so important, we are also considering having multiple tables, the first of which would be inverse masks (masks to select uninteresting numbers) for the easy to find large ranges of data not included in a given "switch statement", to be followed by subsequent tables that would expand the smaller ranges. We are thinking this will, for most cases of input streams, yield something quite a bit faster than a binary search on the bounds of the ranges.
Note that the input stream can be considered to be randomly distributed.
There is a pretty extensive theory of minimal perfect hash functions that I think will meet your requirement. The idea of a minimal perfect hash is that a set of distinct inputs is mapped to a dense set of integers in 1-1 fashion. In your case, a set of N 32-bit integers and ranges would each be mapped to a unique integer in a range whose size is a small multiple of N. GNU has a perfect hash function generator called gperf that is meant for strings but might possibly work on your data. I'd definitely give it a try. Just add a length byte so that integers are 5-byte strings and ranges are 9-byte strings. There are some formal references on the Wikipedia page. A literature search in the ACM and IEEE literature will certainly turn up more.
I just ran across this library I had not seen before.
Addition
I see now that you are trying to map all integers in the ranges to the same function value. As I said in the comment, this is not very compatible with hashing because hash functions deliberately try to "erase" the magnitude information in a bit's position so that values with similar magnitude are unlikely to map to the same hash value.
Consequently, I think that you will not do better than an optimal binary search tree, or equivalently a code generator that produces an optimal "tree" of "if else" statements.
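For comparison, here is what that baseline looks like as code: a binary search over the sorted lower bounds of disjoint ranges (single integers are just ranges with lo == hi). The table layout and names are only one way to write it.

import java.util.Arrays;

public class RangeDispatchSketch {
    private final int[] lo;       // sorted lower bounds of the ranges
    private final int[] hi;       // matching upper bounds (inclusive)
    private final int[] action;   // jump-table index for each range

    public RangeDispatchSketch(int[] lo, int[] hi, int[] action) {
        this.lo = lo; this.hi = hi; this.action = action;   // assumed sorted by lo, non-overlapping
    }

    // Returns the jump-table index for the value, or -1 if the value is "uninteresting".
    public int lookup(int value) {
        int i = Arrays.binarySearch(lo, value);
        if (i < 0) i = -i - 2;                    // index of the last lower bound <= value
        if (i >= 0 && value <= hi[i]) return action[i];
        return -1;
    }

    public static void main(String[] args) {
        // The example cases from the question: 2, 3, 8, 33-122, ..., 10000.
        RangeDispatchSketch d = new RangeDispatchSketch(
                new int[] { 2, 3, 8, 33, 10_000 },
                new int[] { 2, 3, 8, 122, 10_000 },
                new int[] { 0, 1, 2, 3, 39 });     // e.g. func1 .. func4, func40
        System.out.println(d.lookup(57));   // prints 3  -> func4 (the 33-122 range)
        System.out.println(d.lookup(4));    // prints -1 -> uninteresting
    }
}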
If we wanted to construct a function of the type you are asking for, we could try using real numbers where individual domain values map to consecutive integers in the co-domain and ranges map to unit intervals in the co-domain. So a simple floor operation will give you the jump table indices you're looking for.
In the example you provided you'd have the following mapping:
2 -> 0.0
3 -> 1.0
8 -> 2.0
33 -> 3.0
122 -> 3.99999
...
10000 -> 42.0 (for example)
The trick is to find a monotonically increasing polynomial that interpolates these points. This is certainly possible, but with thousands of points I'm certain you'd end up with something much slower to evaluate than the optimal search would be.
Perhaps our thoughts on hashing integers can help a little bit. You will also find there a hashing library (hashlib.zip) based on Bob Jenkins' work which deals with integer numbers in a smart way.
I would propose to deal with larger ranges after the single cases have been rejected by the hashing mechanism.

Linear hashing complexity

I was going through the Linear hashing article on Wikipedia. One line puzzled me, and here it is:
" The cost of hash table expansion is spread out across each hash table insertion operation, as opposed to being incurred all at once.[2]"
In the case of linear hashing, if the hash value of the item to be inserted is smaller than the split variable, then a new node (or bucket) is created and the value is inserted into it. According to the line above, the cost is spread across each insertion operation, which, if I compare it with a dynamic-array implementation where we do amortized analysis, suggests that an insertion in linear hashing must take O(n) time. Please correct me if I am wrong.
One more thing: the second line on the wiki page says "Linear hashing is therefore well suited for interactive applications."
Can I compare B+ trees with linear hashing in "interactive cases" (since both are extendible searching techniques)?
From what I know, O(n) is the worst-case time complexity, but in most cases a hash table returns results in constant time, which is O(1). As opposed to a B+ tree, where one must traverse the tree, hash tables rely on a hashing function whose result points to the address of a stored value. In the worst case, if all the keys have the same hash value, the time complexity might become O(n) because all the results will be stored in one bucket.
According to Wikipedia, a B+ tree has the following time complexities:
Inserting a record requires O(log_b n) operations
Finding a record requires O(log_b n) operations
An LH implementation can guarantee strictly bounded insertion time.
There's no reason for the split location and the key-hash location to be related, if collisions are handled by overflows. The trick is to link the creation of overflow slots to the split operation.
For example, if every Nth slot is always reserved to be an overflow slot, then you need to do at most N-1 splits to create a new overflow slot. In practice it's fewer than (N-1)/2 splits, because splitting one slot may free up an overflow slot.
http://goo.gl/6dbuH for a description, https://github.com/mischasan/hx for source code.
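To make the "cost spread out over insertions" idea concrete, here is a small linear-hashing sketch; the load-factor trigger and all names are my own choices, and it does not model the reserved-overflow-slot scheme from the answer above, just the basic one-bucket-per-split mechanism. Every insertion does O(1) expected work plus at most one bucket split.

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class LinearHashingSketch {
    private final List<LinkedList<Integer>> buckets = new ArrayList<>();
    private int level = 0;        // current round: base table size is initialSize * 2^level
    private int splitPointer = 0; // next bucket to split in this round
    private int size = 0;
    private final int initialSize = 4;
    private final double maxLoadFactor = 0.75;

    public LinearHashingSketch() {
        for (int i = 0; i < initialSize; i++) buckets.add(new LinkedList<>());
    }

    private int bucketFor(int key) {
        // Buckets before the split pointer have already been split this round,
        // so they use the next round's (larger) hash range.
        int h = Math.floorMod(key, initialSize << (level + 1));
        if (Math.floorMod(key, initialSize << level) >= splitPointer) {
            h = Math.floorMod(key, initialSize << level);
        }
        return h;
    }

    public void insert(int key) {
        buckets.get(bucketFor(key)).add(key);
        size++;
        if ((double) size / buckets.size() > maxLoadFactor) splitOneBucket();
    }

    public boolean contains(int key) {
        return buckets.get(bucketFor(key)).contains(key);
    }

    // Split exactly one bucket: this is the incremental step that replaces
    // the all-at-once rehash of an ordinary dynamic hash table.
    private void splitOneBucket() {
        buckets.add(new LinkedList<>());
        LinkedList<Integer> old = buckets.get(splitPointer);
        List<Integer> keys = new ArrayList<>(old);
        old.clear();
        splitPointer++;
        if (splitPointer == initialSize << level) {   // round finished: double the base size
            splitPointer = 0;
            level++;
        }
        for (int k : keys) buckets.get(bucketFor(k)).add(k);
    }
}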

Does the KMP algorithm perform fewer comparisons than the simplified Boyer-Moore algorithm?

Does the KMP (Knuth–Morris–Pratt) algorithm perform fewer comparisons than the simplified Boyer-Moore algorithm?
The Boyer-Moore algorithm should usually perform fewer comparisons; to quote from here:
It should be reasonably clear that, if it is normally the case that a given letter doesn't appear at all in the search string, then this algorithm only requires approx N/M character comparisons (N=length(s1), M=length(s2)) - a big improvement on the KMP algorithm, which still requires N. However, if this is not the case then we may need up to N+M comparisons again (with the full version of the algorithm). Fortunately, for many applications we get close to the N/M performance. If the search string is very large, then it is likely that a given character WILL appear in it, but we still get a good improvement compared with the other algorithms (approx N*2/alphabet_size if characters are randomly distributed in a string).
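One rough way to check this empirically is to instrument both algorithms with a character-comparison counter and run them on your own inputs. The sketch below counts only search-phase comparisons and uses Horspool's bad-character-only variant as the "simplified Boyer-Moore"; all class and method names are mine.

public class ComparisonCount {
    static long comparisons;

    static boolean eq(char a, char b) {   // every counted character comparison goes through here
        comparisons++;
        return a == b;
    }

    static long countKmp(String text, String pattern) {
        comparisons = 0;
        if (pattern.isEmpty()) return 0;
        int n = pattern.length();
        int[] fail = new int[n];                       // KMP failure function (preprocessing, not counted)
        for (int i = 1, k = 0; i < n; i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) k = fail[k - 1];
            if (pattern.charAt(i) == pattern.charAt(k)) k++;
            fail[i] = k;
        }
        for (int i = 0, k = 0; i < text.length(); i++) {
            while (true) {
                if (eq(text.charAt(i), pattern.charAt(k))) { k++; break; }
                if (k == 0) break;
                k = fail[k - 1];
            }
            if (k == n) k = fail[k - 1];               // a full match ends at position i
        }
        return comparisons;
    }

    static long countHorspool(String text, String pattern) {
        comparisons = 0;
        if (pattern.isEmpty()) return 0;
        int n = pattern.length(), m = text.length();
        int[] shift = new int[Character.MAX_VALUE + 1];
        java.util.Arrays.fill(shift, n);
        for (int i = 0; i < n - 1; i++) shift[pattern.charAt(i)] = n - 1 - i;
        for (int i = 0; i + n <= m; i += shift[text.charAt(i + n - 1)]) {
            int j = n - 1;
            while (j >= 0 && eq(text.charAt(i + j), pattern.charAt(j))) j--;
            // j < 0 here means the pattern matches at position i
        }
        return comparisons;
    }

    public static void main(String[] args) {
        String text = "abracadabra abracadabra abracadabra ".repeat(1000);
        String pattern = "cadabra";
        System.out.println("KMP comparisons:      " + countKmp(text, pattern));
        System.out.println("Horspool comparisons: " + countHorspool(text, pattern));
    }
}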