Finding max common substrings in 2 strings using Hash and Binary search

Suppose I have 2 big strings of size 10^5. How can I find the max (longest) common substring of both strings with complexity O(N log N) using hashing and binary search?
Explanation with code will be of great help :)
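One standard way to get O(N log N) (a minimal sketch, not from the question): binary-search the answer length L; a common substring of length L exists iff the sets of rolling hashes of all length-L windows of the two strings intersect, and that predicate is monotone in L. The base and modulus below are arbitrary choices, and a robust version should double-hash or verify candidate matches, since a single hash can produce false positives.

```python
def window_hashes(s, length, base, mod):
    """Polynomial hashes of every substring of the given length, via a rolling hash."""
    if length == 0 or length > len(s):
        return set()
    power = pow(base, length, mod)          # weight of the character that leaves the window
    h, hashes = 0, set()
    for i, ch in enumerate(s):
        h = (h * base + ord(ch)) % mod      # extend the window to the right
        if i >= length:
            h = (h - ord(s[i - length]) * power) % mod   # drop the leftmost character
        if i >= length - 1:
            hashes.add(h)
    return hashes

def longest_common_substring_length(a, b, base=257, mod=(1 << 61) - 1):
    """Binary search on the length: 'a common substring of length L exists' is monotone in L."""
    lo, hi = 0, min(len(a), len(b))
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if window_hashes(a, mid, base, mod) & window_hashes(b, mid, base, mod):
            lo = mid            # a (probable) common substring of length mid exists
        else:
            hi = mid - 1
    return lo

print(longest_common_substring_length("xabcdey", "zabcdq"))   # 4 ("abcd")
```
Each check scans both strings once, and the binary search adds a log factor, giving O(N log N) overall.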

Related

When is Rabin Karp more effective than KMP or Boyer-Moore?

I'm learning about string searching algorithms and understand how they work, but I haven't found a good enough answer about the cases in which the Rabin-Karp algorithm would be more effective than KMP or Boyer-Moore. I see that it is easier to implement and doesn't need the same overhead, but beyond that, I have no clue.
So, when is Rabin-Karp better to use than the others?
There are a couple of properties that each of these algorithms have that might make them desirable or undesirable in different circumstances. Here's a quick rundown:
Space Usage favors Rabin-Karp
One major advantage of Rabin-Karp is that it uses O(1) auxiliary storage space, which is great if the pattern string you're looking for is very large. For example, if you're looking for all occurrences of a string of length 10^7 in a longer string of length 10^9, not having to allocate a table of 10^7 machine words for a failure function or shift table is a major win. Both Boyer-Moore and KMP use Ω(n) memory on a pattern string of length n, so Rabin-Karp would be a clear win here.
Worst-Case Single-Match Efficiency Favors Boyer-Moore or KMP
Rabin-Karp suffers from two potential worst cases. First, if the particular prime numbers used by Rabin-Karp are known to a malicious adversary, that adversary could potentially craft an input that causes the rolling hash to match the hash of the pattern string at every position, causing the algorithm's performance to degrade to Ω((m - n + 1)n) on a string of length m and a pattern of length n. If you're taking untrusted strings as input, this could potentially be an issue. Neither Boyer-Moore nor KMP has this weakness.
Worst-Case Multiple-Match Efficiency favors KMP
Similarly, Rabin-Karp is slow when you want to find all matches of a pattern string and that pattern appears a large number of times. For example, if you're looking for a string of 10^5 copies of the letter a in a text string consisting of 10^9 copies of the letter a with Rabin-Karp, then there will be lots of spots where the pattern string appears, and each will require a linear scan. This can also lead to a runtime of Ω((m - n + 1)n).
Many Boyer-Moore implementations suffer from this second case as well, but will not have bad runtimes in the first case. And KMP has no pathological worst cases like these.
Best-Case Performance favors Boyer-Moore
One advantage of the Boyer-Moore algorithm is that it doesn't necessarily have to scan all the characters of the input string. Specifically, the Bad Character Rule can be used to skip over huge regions of the input string in the event of a mismatch. More specifically, the best-case runtime for Boyer-Moore is O(m / n), which is much faster than what Rabin-Karp or KMP can provide.
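To illustrate the bad character rule, here is a minimal sketch of the Horspool simplification of Boyer-Moore, which uses only that rule (variable names follow this answer's convention of m for the text and n for the pattern):

```python
def horspool_search(text, pattern):
    """First occurrence of pattern in text, using only the bad-character shift."""
    m, n = len(text), len(pattern)
    if n == 0:
        return 0
    # For every pattern character except the last: its distance from the end of the
    # pattern. Characters not in the pattern allow a full shift of n.
    shift = {c: n - 1 - i for i, c in enumerate(pattern[:-1])}
    i = 0
    while i + n <= m:
        if text[i:i + n] == pattern:        # a real implementation compares right-to-left
            return i
        i += shift.get(text[i + n - 1], n)  # shift by the text char under the pattern's last slot
    return -1

print(horspool_search("here is a simple example", "example"))   # 17
```
When the text character under the last pattern position does not occur in the pattern at all, the whole pattern length is skipped, which is where the sublinear best case comes from.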
Generalizations to Multiple Strings favor KMP
Suppose you have a fixed set of multiple text strings that you want to search for, rather than just one. You could, if you wanted to, run multiple passes of Rabin-Karp, KMP, or Boyer-Moore across the strings to find all the matches. However, the runtime of this approach isn't great, as it scales linearly with the number of strings to search for. On the other hand, KMP generalizes nicely to the Aho-Corasick string-matching algorithm, which runs in time O(m + n + z), where z is the number of matches found and n is the combined length of the pattern strings. Notice that there's no dependence here on the number of different pattern strings being searched for!
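To make that concrete, here is a minimal Aho-Corasick sketch (a trie plus BFS-built failure links); this is an illustration written for this answer, not code from it:

```python
from collections import deque

def build_automaton(patterns):
    goto, fail, out = [{}], [0], [[]]       # per node: edges, failure link, patterns ending here
    for p in patterns:
        node = 0
        for ch in p:
            if ch not in goto[node]:
                goto.append({}); fail.append(0); out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(p)
    queue = deque(goto[0].values())         # depth-1 nodes fail to the root
    while queue:                            # BFS so failure targets are finished first
        node = queue.popleft()
        for ch, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in goto[f]:  # longest proper suffix that is also a trie path
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] += out[fail[child]]  # inherit matches ending at the failure target
    return goto, fail, out

def find_all(text, patterns):
    goto, fail, out = build_automaton(patterns)
    node, matches = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        matches += [(i - len(p) + 1, p) for p in out[node]]
    return matches

print(find_all("ushers", ["he", "she", "his", "hers"]))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```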
Rabin-Karp is a good fit when you are searching a large text for multiple pattern matches at once, as in plagiarism detection.
Boyer-Moore works best when the pattern is relatively long and the alphabet is moderately large, as with natural-language text; it does not work well with binary strings or very short patterns.
KMP is good for small alphabets, such as DNA sequences in bioinformatics or binary strings, and its advantage shrinks as the alphabet grows.
Space-Time complexities of all three (for reference)
(For finding ALL occurrences of the pattern)
m : length of the pattern
n : length of the String in which we search the pattern
k : size of the alphabet
Rabin Karp:
Average-case performance : Θ(m) preprocessing + O(n + m) matching
Worst-case performance : Θ(m) preprocessing + O(nm) matching
O(1) auxiliary space
Uses hashing to find an exact match of a pattern string in a text. It uses a rolling hash to quickly filter out positions of the text that cannot match the pattern, and then checks for a match at the remaining positions.
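A minimal sketch of that filter-then-verify loop (the base and modulus are arbitrary choices here):

```python
def rabin_karp_all(text, pattern, base=256, mod=(1 << 61) - 1):
    """All occurrences of pattern in text; only O(1) extra space beyond the output list."""
    n, m = len(text), len(pattern)          # n = text length, m = pattern length
    if m == 0 or m > n:
        return []
    power = pow(base, m, mod)               # weight of the character leaving the window
    target = window = 0
    for i in range(m):
        target = (target * base + ord(pattern[i])) % mod
        window = (window * base + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        # the hash only filters candidate positions; verify to rule out collisions
        if window == target and text[i:i + m] == pattern:
            hits.append(i)
        if i + m < n:                       # roll the window one character to the right
            window = (window * base + ord(text[i + m]) - ord(text[i]) * power) % mod
    return hits

print(rabin_karp_all("abracadabra", "abra"))   # [0, 7]
```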
Boyer Moore:
Worst-case performance : Θ(m) preprocessing + O(mn) matching
Best-case performance : Θ(m) preprocessing + Ω(n/m) matching
Worst-case space complexity : Θ(k).
Can be used for "grep" like searches.
https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm#Performance
Knuth Morris Pratt:
Worst-case performance : Θ(m) preprocessing + Θ(n) matching
Worst-case space complexity : Θ(m)
For more details, look up each algorithm on Wikipedia.

Is the uniqueness of CRC-32-hashes sufficient to uniquely identify strings containing filenames?

I have sorted lists of filenames concatenated to strings and want to identify each such string by a unique checksum.
The size of these strings is a minimum of 100 bytes, a maximum of 4000 bytes, and an average of 1000 bytes. The total number of strings could be anything, but will more likely be in the range of about 10000.
Is CRC-32 suited for this purpose?
E.g. I need each of the following strings to have a different fixed-length (preferably short) checksum:
"/some/path/to/something/some/other/path"
"/some/path/to/something/another/path"
"/some/path"
...
# these strings can get __very__ long (very long strings are the norm)
Is the uniqueness of CRC-32 hashes increased by input length?
Is there a better choice of checksum for this purpose?
No.
Unless your filenames were all four characters or less, there is no assurance that the CRCs will be unique. With 10,000 names, the probability of at least two of them having the same CRC is about 1%.
This would be true for any 32-bit hash value.
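That estimate comes from the birthday bound; a quick back-of-the-envelope check:

```python
import math

n = 10_000              # number of filename strings
space = 2 ** 32         # possible CRC-32 (or any 32-bit hash) values
p_collision = 1 - math.exp(-n * (n - 1) / (2 * space))
print(round(p_collision, 4))   # ~0.0116, i.e. about a 1% chance of at least one collision
```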
The best way to assign a unique code to each name is to simply start a counter at zero for the first name, and increment it for each name, assigning the counter as the code for that name. However, that won't help you compute the code given just the name.
You can use a hash, such as a CRC or some other hash, but you will need to deal with collisions. There are several common approaches in the literature. You would keep a list of hashes with names assigned, and if you have a collision you could just increment the hash until you find one not used and assign that one. Then when looking up a name, you start at the computed hash and do a linear search for the name until you find it or an unused slot.
As for the hash, I would recommend XXH64. It is a very fast 64-bit hash. You do not need a cryptographic hash for this application, which would be unnecessarily slow.
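A minimal sketch of that assignment scheme. CRC-32 from zlib is used here only because it ships with the standard library; the XXH64 recommendation above would be a drop-in swap (e.g. via the xxhash package), and the probing-by-increment is the collision handling just described:

```python
import zlib

class NameCodes:
    """Assign each name a unique code: its hash, bumped forward past any collision."""
    def __init__(self):
        self.by_code = {}                      # code -> name already assigned to it

    def code_for(self, name):
        code = zlib.crc32(name.encode("utf-8"))
        # Linear probing: walk forward until we find this name or an unused code.
        while code in self.by_code and self.by_code[code] != name:
            code = (code + 1) & 0xFFFFFFFF
        self.by_code[code] = name              # assign (or re-find) the slot
        return code

codes = NameCodes()
print(codes.code_for("/some/path/to/something/some/other/path"))
print(codes.code_for("/some/path"))
print(codes.code_for("/some/path"))            # same name -> same code
```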

Hash a Sequence of positive/negative integers

I have a file with millions of lines (actually it's an online stream of data, which means we receive it line by line). Each line consists of an unsorted array of integers (positive and negative); there is no limit on the values, the lengths differ from line to line, and a line may contain duplicate values.
I want to remove the duplicate lines (if 2 lines have the same values, regardless of how they are ordered, we consider them duplicates). Is there any good hashing function for this?
We want to do this in O(n), where n is the number of lines (we can assume that the maximum possible number of elements in each line is constant, e.g. we have a maximum of 100 elements in each line).
I've read some of the questions posted here on Stack Overflow and I also googled it; most of them cover cases where the arrays have the same length, or the integers are positive, or even, or sorted. Is there any way to solve this in the general case?
My solution:
First we sort each line using an O(n) sorting algorithm, e.g. counting sort, then we join the values into a string and use an MD5 hash to put them into a hash set. If the hash is not in the set, we add it; if it is already there, we compare the arrays that share that hash value.
Problem with the solution: sorting with counting sort takes a lot of space, as there is no limit on the numbers, and collisions are possible.
The problem with using a hashing algorithm on a data set this large is that you have a high probability of two different lines hashing to the same value. You want to stay in O(n), but I am not sure that is possible given the size of the data and the accuracy needed. If you use heapsort, which is space efficient, and then traverse the sorted data removing consecutive lines that are the same, you could accomplish this in O(n log n).
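Since each line has at most ~100 elements, sorting a single line is effectively constant work, so the approach from the question can be sketched without an explicit MD5 step by letting the sorted tuple itself serve as the hash-set key (parsing lines as whitespace-separated integers is an assumption here):

```python
def dedupe_stream(lines):
    """Yield each line whose multiset of integers has not been seen before."""
    seen = set()
    for line in lines:
        values = [int(tok) for tok in line.split()]
        key = tuple(sorted(values))     # canonical form: order-independent, keeps duplicates
        if key not in seen:
            seen.add(key)
            yield line

stream = ["3 -1 2 2", "2 3 2 -1", "5 5"]
print(list(dedupe_stream(stream)))      # the second line is just a reordering of the first
```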

Hash function for an integer sequence

Say there is a list of permutations. Each permutation is a long list of integers. Let's consider a sample permutation and call it samplePerm. My task is to find out whether the list contains samplePerm. I think it would be a good idea to use a hash function technique. Since the permutations are very large (more than 10000 items), the polynomial variant (like for strings) is useless. Does anybody know the best practice?
UPDATE:
THE ORDER OF INTEGERS IN A PERMUTATION IS A KEY CRITERION! All permutations consist of the same numbers
The solution is to divide the integers into groups and treat each group as a string by concatenating the integers. After that it is possible to apply a hash function (see java String.hashCode() for an algorithm) to each group. Finally, the resulting numbers can be added together. That last step may produce collisions, so it is the place where a better idea is needed :)
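A minimal sketch of that grouping idea (the group size is an arbitrary choice), with the collision-prone final addition called out in a comment:

```python
def group_hash(perm, group_size=8):
    """Hash a permutation by hashing fixed-size groups as strings and summing the results."""
    total = 0
    for start in range(0, len(perm), group_size):
        group = ",".join(str(x) for x in perm[start:start + group_size])
        h = 0
        for ch in group:                    # String.hashCode()-style 31-multiplier chain
            h = (h * 31 + ord(ch)) & 0xFFFFFFFF
        total = (total + h) & 0xFFFFFFFF    # summing ignores group order: the collision risk noted above
    return total

print(group_hash([5, 3, 1, 2, 4, 0, 7, 6, 9, 8]))
```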

Checking for string matches using hashes, without double-checking the entire string

I'm trying to check if two strings are identical as quickly as possible. Can I protect myself from hash collisions without also comparing the entire string?
I've got a cache of items that are keyed by a string. I store the hash of the string, the length of the string, and the string itself. (I'm currently using djb2 to generate the hash.)
To check if an input string is a match to an item in the cache, I compute the input's hash, and compare it to the stored hash. If that matches, I compare the length of the input (which I got as a side effect of computing the hash) to the stored length. Finally, if that matches, I do a full string comparison of the input and the stored string.
Is it necessary to do that full string comparison? For example, is there a string hashing algorithm that can mathematically guarantee that no two strings of the same length will generate the same hash? If not, can an algorithm guarantee that two different strings of the same length will generate different hash codes if any of the first N characters differ?
Basically, any string comparison scheme that offers O(1) performance when the strings differ but better than O(n) performance when they match would be an improvement over what I'm doing now.
For example, is there a string hashing algorithm that can mathematically guarantee that no two strings of the same length will generate the same hash?
No, and there can't be. Think about it: the hash has a finite length, but the strings do not. Say for argument's sake that the hash is 32 bits. Can you create more than four billion (2^32) unique strings with the same length? Of course you can - you can create an infinite number of unique strings, so comparing the hashes is not enough to guarantee uniqueness. This argument scales to longer hashes.
If not, can an algorithm guarantee that two different strings of the same length will generate different hash codes if any of the first N characters differ?
Well, yes, as long as the number of bits in the hash is as great as the number of bits in the string, but that's probably not the answer you were looking for.
Some of the algorithms used for cyclic redundancy checks offer guarantees such as: if exactly one bit differs within a certain run length of bits, the CRC is guaranteed to differ. But that only works for relatively short runs.
You should be safe from collisions if you use a modern hashing function such as one of the Secure Hash Algorithm (SHA) variants.
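For reference, here is a minimal sketch of the lookup scheme the question describes (hash, then length, then a full compare); djb2 is written out since the question mentions it, and the cache layout itself is an assumption:

```python
def djb2(s):
    h = 5381
    for ch in s:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return h

class StringCache:
    def __init__(self):
        self.buckets = {}                      # hash -> list of (length, key, value)

    def put(self, key, value):
        self.buckets.setdefault(djb2(key), []).append((len(key), key, value))

    def get(self, key):
        for length, stored, value in self.buckets.get(djb2(key), []):
            # hash already matched; check length next, and only then do the full compare
            if length == len(key) and stored == key:
                return value
        return None

cache = StringCache()
cache.put("hello world", 42)
print(cache.get("hello world"))   # 42
print(cache.get("hello worle"))   # None
```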