Does the KMP algorithm perform fewer comparisons than the simplified Boyer-Moore algorithm?

Does the KMP (Knuth–Morris–Pratt) algorithm perform fewer comparisons than the simplified Boyer-Moore algorithm?

The Boyer-Moore algorithm should usually perform fewer comparisons. To quote from here:
It should be reasonably clear that, if it is normally the case that a given letter doesn't appear at all in the search string, then this algorithm only requires approx N/M character comparisons (N=length(s1), M=length(s2)) - a big improvement on the KMP algorithm, which still requires N. However, if this is not the case then we may need up to N+M comparisons again (with the full version of the algorithm). Fortunately, for many applications we get close to the N/M performance. If the search string is very large, then it is likely that a given character WILL appear in it, but we still get a good improvement compared with the other algorithms (approx N*2/alphabet_size if characters are randomly distributed in a string).
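One way to check the quoted claim on your own inputs is to count text comparisons directly. The sketch below assumes that "simplified Boyer-Moore" means the Horspool variant (bad-character shifts only); that reading is my assumption, and the function names are made up for illustration. Preprocessing comparisons are not counted, and the pattern is assumed to be non-empty.

```python
def kmp_count(text, pat):
    """Return (match positions, text comparisons) for KMP. pat must be non-empty."""
    # fail[i] = length of the longest proper border of pat[:i + 1]
    fail = [0] * len(pat)
    k = 0
    for i in range(1, len(pat)):
        while k and pat[i] != pat[k]:
            k = fail[k - 1]
        if pat[i] == pat[k]:
            k += 1
        fail[i] = k

    hits, comps, k = [], 0, 0
    for i, ch in enumerate(text):
        while True:
            comps += 1                 # one comparison of a text char with pat[k]
            if ch == pat[k]:
                k += 1
                break
            if k == 0:
                break
            k = fail[k - 1]            # fall back and retry the same text char
        if k == len(pat):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits, comps


def horspool_count(text, pat):
    """Return (match positions, text comparisons) for Boyer-Moore-Horspool."""
    m, n = len(pat), len(text)
    # Shift for each character of pat[:-1]: distance from its last occurrence
    # to the end of the pattern. Characters not in the table shift by m.
    shift = {c: m - 1 - i for i, c in enumerate(pat[:-1])}
    hits, comps, pos = [], 0, 0
    while pos + m <= n:
        j = m - 1
        while j >= 0:                  # compare right to left
            comps += 1
            if text[pos + j] != pat[j]:
                break
            j -= 1
        if j < 0:
            hits.append(pos)
        pos += shift.get(text[pos + m - 1], m)
    return hits, comps
```

On a text where most characters do not occur in the pattern, `horspool_count` should report far fewer comparisons than `kmp_count`; on something like a run of a single repeated letter, the ordering flips.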

Related

When is Rabin Karp more effective than KMP or Boyer-Moore?

I'm learning about string searching algorithms and understand how they work but haven't found a good enough answer about in which cases Rabin-Karp algorithm would be more effective than KMP or Boyer-Moore. I see that it is easier to implement and doesn't need the same overhead but beyond that, I have no clue.
So, when is Rabin-Karp better to use than the others?
There are a couple of properties that each of these algorithms have that might make them desirable or undesirable in different circumstances. Here's a quick rundown:
Space Usage favors Rabin-Karp
One major advantage of Rabin-Karp is that it uses O(1) auxiliary storage space, which is great if the pattern string you're looking for is very large. For example, if you're looking for all occurrences of a string of length 10^7 in a longer string of length 10^9, not having to allocate a table of 10^7 machine words for a failure function or shift table is a major win. Both Boyer-Moore and KMP use Ω(n) memory on a pattern string of length n, so Rabin-Karp would be a clear win here.
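To make the O(1) auxiliary space point concrete, here is a minimal Rabin-Karp sketch. The base and modulus are arbitrary choices of mine, not values from any particular implementation; it follows this answer's convention of m for the text length and n for the pattern length.

```python
def rabin_karp(text, pat, base=256, mod=1_000_000_007):
    """Return all start positions of pat in text using a rolling hash."""
    m, n = len(text), len(pat)
    if n == 0 or n > m:
        return []
    high = pow(base, n - 1, mod)       # weight of the character leaving the window
    hp = ht = 0
    for i in range(n):
        hp = (hp * base + ord(pat[i])) % mod
        ht = (ht * base + ord(text[i])) % mod

    hits = []
    for s in range(m - n + 1):
        # A hash match only filters candidates; verify character by character.
        if hp == ht and all(text[s + j] == pat[j] for j in range(n)):
            hits.append(s)
        if s + n < m:
            # Slide the window: drop text[s], append text[s + n].
            ht = ((ht - ord(text[s]) * high) * base + ord(text[s + n])) % mod
    return hits
```

Apart from the output list, the only state is a handful of integers, regardless of how long the pattern is.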
Worst-Case Single-Match Efficiency favors Boyer-Moore or KMP
Rabin-Karp suffers from two potential worst cases. First, if the particular prime numbers used by Rabin-Karp are known to a malicious adversary, that adversary could craft an input that causes the rolling hash to match the hash of the pattern string at every position, degrading the algorithm's performance to Ω((m - n + 1)n) on a string of length m and a pattern of length n. If you're taking untrusted strings as input, this could be an issue. Neither Boyer-Moore nor KMP has this weakness.
Worst-Case Multiple-Match Efficiency favors KMP
Second, Rabin-Karp is slow when you want to find all matches of a pattern that appears a large number of times. For example, if you're looking for a string of 10^5 copies of the letter a in a text string consisting of 10^9 copies of the letter a, there will be lots of spots where the pattern appears, and each will require a linear scan. This can also lead to a runtime of Ω((m - n + 1)n).
Many Boyer-Moore implementations suffer from this second problem but will not have bad runtimes in the first case. KMP has no pathological worst cases like these.
Best-Case Performance favors Boyer-Moore
One advantage of the Boyer-Moore algorithm is that it doesn't necessarily have to scan all the characters of the input string. Specifically, the Bad Character Rule can be used to skip over huge regions of the input string in the event of a mismatch. More specifically, the best-case runtime for Boyer-Moore is O(m / n), which is much faster than what Rabin-Karp or KMP can provide.
Generalizations to Multiple Strings favor KMP
Suppose you have a fixed set of multiple text strings that you want to search for, rather than just one. You could, if you wanted to, run multiple passes of Rabin-Karp, KMP, or Boyer-Moore across the strings to find all the matches. However, the runtime of this approach isn't great, as it scales linearly with the number of strings to search for. On the other hand, KMP generalizes nicely to the Aho-Corasick string-matching algorithm, which runs in time O(m + n + z), where z is the number of matches found and n is the combined length of the pattern strings. Notice that there's no dependence here on the number of different pattern strings being searched for!
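For reference, a compact Aho-Corasick sketch looks roughly like this. It is an illustrative version built on plain dictionaries, not a tuned implementation, and the function names are my own.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links and output lists."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)

    # Breadth-first pass: the failure link of a node is the longest proper
    # suffix of its path that is also a path from the root (KMP's idea, on a trie).
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]   # inherit matches ending at the suffix
    return goto, fail, out


def search(text, patterns):
    """Return (start index, pattern) for every occurrence, in one pass over text."""
    goto, fail, out = build_automaton(patterns)
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits
```

The text is scanned once no matter how many patterns are in the set, which is where the independence from the number of patterns comes from.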
Rabin-Karp is the better choice when searching a large text for multiple pattern matches, as in plagiarism detection.
Boyer-Moore is better when the pattern is relatively long, the alphabet is moderately sized, and the vocabulary is large. It does not work well with binary strings or very short patterns.
KMP, meanwhile, is good for searching over a small alphabet, as in bioinformatics or binary strings. It does not run faster as the alphabet grows.
Space-Time complexities of all three (for reference)
(For finding ALL occurrences of the pattern)
m : length of the pattern
n : length of the String in which we search the pattern
k : size of the alphabet
Rabin Karp:
O(1) auxiliary space
Uses hashing to find an exact match of a pattern string in a text. It uses a rolling hash to quickly filter out positions of the text that cannot match the pattern, and then checks for a match at the remaining positions
Boyer Moore:
Worst-case performance : Θ(m) preprocessing + O(mn) matching
Best-case performance : Θ(m) preprocessing + Ω(n/m) matching
Worst-case space complexity : Θ(k).
Can be used for "grep" like searches.
https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm#Performance
Knuth Morris Pratt:
Worst-case performance : Θ(m) preprocessing + Θ(n) matching
Worst-case space complexity : Θ(m)
For more details, look up each algorithm on Wikipedia.

improve hashing using genetic programming/algorithm

I'm writing a program that should significantly reduce the number of collisions that occur with hash functions like 'key mod table_size'. For this I would like to use genetic programming or a genetic algorithm, but I don't know much about it. Even after reading many articles and examples, I can't work out what, in my case, the fitness function and the target would be (the target is usually the required result), or what would serve as the population/individuals, parents, etc.
Please help me identify these, with a few code or pseudo-code snippets if possible, as this is my project.
It's not necessary to use genetic programming/algorithms specifically; anything based on evolutionary programming/algorithms is fine.
thanks..
My advice would be: don't do it this way. The literature on hash functions is vast and we more or less understand what makes a good hash function. We know enough mathematics not to look for them blindly.
If you need a hash function to use, there is plenty to choose from.
However, if this is your uni project and you cannot possibly change the subject or steer it in a more manageable direction, then as you noticed there will be complex issues of getting fitness function and mutation operators right. As far as I can tell off the top of my head, there are no obvious candidates.
You may look up e.g. 'strict avalanche criterion' and try to see if you can reason about it in terms of fitness and mutations.
Another question is how do you want to represent your function? Just a boolean expression? Something built from word operations like AND, XOR, NOT, ROT ?
Depending on your constraints (or rather, assumptions) the question of fitness and mutation will be different.
Broadly, fitness clearly means minimizing the number of collisions in your 'hash modulo table-size' model.
The obvious part is to take a suitably large and (very important) representative distribution of keys and chuck them through your 'candidate' function.
Then you might pass them through 'hash modulo table-size' for one or more values of table-size and evaluate some measure of 'niceness' of the arising distribution(s).
So what that boils down to is what table-sizes to try and what niceness measure to apply.
Niceness is context dependent.
You might measure 'fullest bucket' as a measure of 'worst case' insert/find time.
You might measure the sum of squares of the bucket sizes as a measure of 'average' insert/find time, assuming look-ups are uniformly distributed amongst the keys.
Finally you would need to decide what table-size (or sizes) to test at.
Conventional wisdom often uses primes because hash modulo prime tends to be sensitive to all the bits in hash, whereas something like hash modulo 2^n only involves the lower n bits.
To keep computation down you might consider the series of next primes larger than each power of two: 5 (>2^2), 11 (>2^3), 17 (>2^4), etc., up to and including the first power of 2 greater than your 'sample' size.
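Put together, the fitness evaluation described above might be sketched as follows. Everything here is illustrative: `candidate` stands for whatever representation of a hash function your evolutionary search produces, the helper names are mine, and lower scores are taken to be better. The key sample is assumed to be non-empty.

```python
from collections import Counter


def next_prime_after(n):
    """Smallest prime strictly greater than n (trial division, fine for small n)."""
    c = n + 1
    while any(c % d == 0 for d in range(2, int(c ** 0.5) + 1)):
        c += 1
    return c


def table_sizes(sample_size):
    """Primes just above 2^2, 2^3, ... up to the first power of 2 > sample_size."""
    sizes, power = [], 4
    while True:
        sizes.append(next_prime_after(power))
        if power > sample_size:
            return sizes
        power *= 2


def fitness(candidate, keys, sizes):
    """Return (sum of squared bucket sizes, fullest bucket) over all table sizes."""
    total_sq, worst = 0, 0
    for size in sizes:
        buckets = Counter(candidate(k) % size for k in keys)
        total_sq += sum(c * c for c in buckets.values())   # 'average' cost proxy
        worst = max(worst, max(buckets.values()))          # 'worst case' cost proxy
    return total_sq, worst
```

For example, with keys 0, 5, 10, 15 and a table of size 5, the identity function puts everything in one bucket and scores (16, 4), while `k // 5` spreads the keys out and scores (4, 1).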
There are other ways of considering fitness but without a practical application the question is (of course) ill-defined.
If the functions in your 'space' of potential hash functions don't all have the same execution time, you should also factor in 'cost'.
It's fairly easy to define very good hash functions but execution time can be a significant factor.

finding good hash function for languages accepted by finite state automata

I'm working on a project in Java (but I think it doesn't depend on the language) where I'm generating small (4 states max) nondeterministic finite state automata over a binary alphabet, and I have to quickly check each generated automaton for equivalence with the previous ones. Therefore, I need a good hash function to avoid comparing with too many automata.
My first thought was to do a DFS on the transitions, find all the accepted words up to length 5, and map the set of accepted words to a 64-bit long (one bit per binary word of length at most 5). But it seems to produce too many collisions on NFAs with 4 states. Increasing the length makes computing the hash code too slow for practical use.
Another approach was having a set of words and testing which of them the automaton accepts but finding the right ones, I think, isn't that trivial.
Do you have any idea how to improve the hash function to avoid too many collisions without a significant loss of speed?
Thanks in advance
I was thinking further (thanks @justhalf and @templatetypedef) and I have an idea: an injective function from any NFA (or more precisely, from the language it accepts) to integers. Take an NFA A and construct the minimal DFA A_min accepting the same language, with a complete delta-function. As a consequence of the Myhill-Nerode theorem, this automaton is unique up to isomorphism. Do a BFS from the initial state, giving priority to the edges (transitions) based on some fixed order of the characters in the alphabet (for example 0 first, then 1), and renumber the states in the order they are visited. Now we have a canonical minimal DFA, and we can map the incidence matrix of states to an integer and append an enumeration of the final states (or better, make a tuple, to avoid collisions). This integer could then be used for deciding equivalence of two NFAs. Do you think this is OK, or do you have any other idea?
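A rough sketch of that idea, in Python for brevity rather than Java: subset construction, then Moore-style partition refinement for minimization, then BFS renumbering. The NFA encoding (a dict from `(state, symbol)` to a set of states, a single start state, no epsilon moves) is an assumption of this sketch, and it returns a tuple rather than packing everything into one integer.

```python
from collections import deque


def canonical_key(delta, start, finals, alphabet="01"):
    """Canonical form of the minimal complete DFA for the NFA's language."""
    finals = frozenset(finals)

    # 1. Subset construction; the empty frozenset acts as the dead state.
    init = frozenset([start])
    dfa, todo = {}, [init]
    while todo:
        S = todo.pop()
        if S in dfa:
            continue
        dfa[S] = tuple(
            frozenset(t for q in S for t in delta.get((q, a), ()))
            for a in alphabet
        )
        todo.extend(dfa[S])

    # 2. Moore refinement: split blocks until the number of blocks is stable.
    block = {S: int(bool(S & finals)) for S in dfa}
    while True:
        ids = {}
        refined = {
            S: ids.setdefault((block[S],) + tuple(block[T] for T in dfa[S]), len(ids))
            for S in dfa
        }
        stable = len(ids) == len(set(block.values()))
        block = refined
        if stable:
            break

    # 3. BFS over the blocks in alphabet order, renumbering by visit order.
    rep = {}
    for S in dfa:
        rep.setdefault(block[S], S)        # one representative per block
    order = {block[init]: 0}
    queue, rows = deque([block[init]]), []
    while queue:
        b = queue.popleft()
        row = []
        for T in dfa[rep[b]]:
            tb = block[T]
            if tb not in order:
                order[tb] = len(order)
                queue.append(tb)
            row.append(order[tb])
        rows.append(tuple(row))
    accepting = tuple(bool(rep[b] & finals) for b in order)
    return tuple(rows), accepting
```

Two NFAs accept the same language exactly when their keys are equal, so the key can be hashed directly or used as a map key; with at most 4 NFA states the subset construction yields at most 16 DFA states, so the cost stays small.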

Efficient Function to Map (or Hash) Integers and Integer Ranges into Index

We are looking for the computationally simplest function that will enable an indexed look-up of a function to be determined by a high frequency input stream of widely distributed integers and ranges of integers.
It is OK if the hash/map function selection itself varies based on the specific integer and range requirements, and the performance associated with the part of the code that selects this algorithm is not critical. The number of integers/ranges of interest in most cases will be small (zero to a few thousand). The performance critical portion is in processing the incoming stream and selecting the appropriate function.
As a simple example, please consider the following pseudo-code:
switch (highFrequencyIntegerStream)
case(2) : func1();
case(3) : func2();
case(8) : func3();
case(33-122) : func4();
...
case(10,000) : func40();
In a typical example, there would be only a few thousand of the "cases" shown above, which could include a full range of 32-bit integer values and ranges. (In the pseudo code above 33-122 represents all integers from 33 to 122.) There will be a large number of objects containing these "switch statements."
(Note that the actual implementation will not include switch statements. It will instead be a jump table (which is an array of function pointers) or maybe a combination of the Command and Observer patterns, etc. The implementation details are tangential to the request, but provided to help with visualization.)
Many of the objects will contain "switch statements" with only a few entries. The values of interest are subject to real time change, but performance associated with managing these changes is not critical. Hash/map algorithms can be re-generated slowly with each update based on the specific integers and ranges of interest (for a given object at a given time).
We have searched around the internet, looking at Bloom filters, various hash functions listed on Wikipedia's "hash function" page and elsewhere, quite a few Stack Overflow questions, abstract algebra (mostly Galois theory which is attractive for its computationally simple operands), various ciphers, etc., but have not found a solution that appears to be targeted to this problem. (We could not even find a hash or map function that considered these types of ranges as inputs, much less a highly efficient one. Perhaps we are not looking in the right places or using the correct vernacular.)
The current plan is to create a custom algorithm that preprocesses the list of interesting integers and ranges (for a given object at a given time) looking for shifts and masks that can be applied to the input stream to help delineate the ranges. Note that most of the incoming integers will be uninteresting, and it is of critical importance to make a very quick decision for as large a percentage of that portion of the stream as possible. This is why Bloom filters looked interesting at first, before we started thinking that their implementation required more computational complexity than other solutions.
Because the first decision is so important, we are also considering having multiple tables, the first of which would be inverse masks (masks to select uninteresting numbers) for the easy to find large ranges of data not included in a given "switch statement", to be followed by subsequent tables that would expand the smaller ranges. We are thinking this will, for most cases of input streams, yield something quite a bit faster than a binary search on the bounds of the ranges.
Note that the input stream can be considered to be randomly distributed.
There is a pretty extensive theory of minimal perfect hash functions that I think will meet your requirement. The idea of a minimal perfect hash is that a set of distinct inputs is mapped to a dense set of integers in 1-1 fashion. In your case a set of N 32-bit integers and ranges would each be mapped to a unique integer in a range whose size is a small multiple of N. GNU has a perfect hash function generator called gperf that is meant for strings but might possibly work on your data. I'd definitely give it a try. Just add a length byte so that integers are 5-byte strings and ranges are 9 bytes. There are some formal references on the Wikipedia page. A literature search in ACM and IEEE literature will certainly turn up more.
I just ran across this library I had not seen before.
Addition
I see now that you are trying to map all integers in the ranges to the same function value. As I said in the comment, this is not very compatible with hashing because hash functions deliberately try to "erase" the magnitude information in a bit's position so that values with similar magnitude are unlikely to map to the same hash value.
Consequently, I think that you will not do better than an optimal binary search tree, or equivalently a code generator that produces an optimal "tree" of "if else" statements.
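As a baseline, a plain (unweighted) binary search over the sorted lower bounds could be sketched like this. It is not the optimal search tree mentioned above, which would also take access probabilities into account, and the handler names are placeholders taken from the question's pseudo-code.

```python
import bisect


def build_table(cases):
    """cases: iterable of (low, high, handler), inclusive and non-overlapping."""
    cases = sorted(cases, key=lambda c: c[0])
    lows = [c[0] for c in cases]
    highs = [c[1] for c in cases]
    handlers = [c[2] for c in cases]
    return lows, highs, handlers


def dispatch(table, value):
    """Return the handler whose range contains value, or None if uninteresting."""
    lows, highs, handlers = table
    i = bisect.bisect_right(lows, value) - 1   # last range starting at or below value
    if i >= 0 and value <= highs[i]:
        return handlers[i]
    return None


table = build_table([
    (2, 2, "func1"),
    (3, 3, "func2"),
    (8, 8, "func3"),
    (33, 122, "func4"),
    (10000, 10000, "func40"),
])
```

A single integer is just a range whose low and high bounds are equal, so both kinds of case share one table.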
If we wanted to construct a function of the type you are asking for, we could try using real numbers where individual domain values map to consecutive integers in the co-domain and ranges map to unit intervals in the co-domain. So a simple floor operation will give you the jump table indices you're looking for.
In the example you provided you'd have the following mapping:
2 -> 0.0
3 -> 1.0
8 -> 2.0
33 -> 3.0
122 -> 3.99999
...
10000 -> 42.0 (for example)
The trick is to find a monotonically increasing polynomial that interpolates these points. This is certainly possible, but with thousands of points I'm certain you'd end up with something much slower to evaluate than the optimal search would be.
Perhaps our thoughts on hashing integers can help a little bit. You will also find there a hashing library (hashlib.zip) based on Bob Jenkins' work which deals with integer numbers in a smart way.
I would propose to deal with larger ranges after the single cases have been rejected by the hashing mechanism.

Hash function combining - is there a significant decrease in collision risk?

Does anyone know if there's a real benefit regarding decreasing collision probability by combining hash functions? I especially need to know this regarding 32 bit hashing, namely combining Adler32 and CRC32.
Basically, will adler32(crc32(data)) yield a smaller collision probability than crc32(data)?
The last comment here gives some test results in favor of combining, but no source is mentioned.
For my purpose, collision is not critical (i.e. the task does not involve security), but I'd rather minimize the probability anyway, if possible.
PS: I'm just starting in the wonderful world of hashing, doing a lot of reading about it. Sorry if I asked a silly question, I haven't even acquired the proper "hash dialect" yet, probably my Google searches regarding this were also poorly formed.
Thanks.
It doesn't make sense to combine them in series like that. You are hashing one 32-bit space to another 32-bit space.
In the case of a crc32 collision in the first step, the final result is still a collision. Then you add on any potential collisions in the adler32 step. So it cannot get any better; it can only be the same or worse.
To reduce collisions, you might try something like using the two hashes independently to create a 64-bit output space:
adler32(data) << 32 | crc32(data)
Whether there is significant benefit in doing that, I'm not sure.
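In Python, for instance, the two variants could be written with the standard `zlib` module as below. The function names are mine, and the byte order used when feeding the CRC into Adler-32 is an arbitrary choice for the sketch.

```python
import zlib


def combined64(data: bytes) -> int:
    """Adler-32 in the high 32 bits, CRC-32 in the low 32 bits."""
    return (zlib.adler32(data) << 32) | zlib.crc32(data)


def chained(data: bytes) -> int:
    """adler32(crc32(data)): still only a 32-bit result.

    Any two inputs with the same CRC-32 also collide here, because the
    Adler-32 step only ever sees the 4-byte CRC value.
    """
    return zlib.adler32(zlib.crc32(data).to_bytes(4, "big"))
```

With `combined64`, two inputs collide only if both checksums collide at the same time; with `chained`, a collision in either step is enough.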
Note that the original comment you referred to was storing the hashes independently:
Whichever algorithm you use there is going to be some chance of false positives. However, you can reduce these chances by a considerable margin by using two different hashing algorithms. If you were to calculate and store both the CRC32 and the Adler32 for each url, the odds of a simultaneous collision for both hashes for any given pair of urls is vastly reduced.
Of course that means storing twice as much information which is a part of your original problem. However, there is a way of storing both sets of hash data such that it requires minimal memory (10kb or so) whilst giving almost the same lookup performance (15 microsecs/lookup compared to 5 microsecs) as Perl's hashes.