Space complexity for a simple streaming algorithm - streaming

I want to determine the space complexity of the go-to example of a simple streaming algorithm.
You receive a permutation of n-1 different numbers out of 1..n and have to detect the one missing number: you calculate the sum of all numbers 1 to n using the formula n(n + 1)/2 and then subtract each incoming number. The result is your missing number. I found a German Wikipedia article stating that the space complexity of this algorithm is O(log n). (https://de.wikipedia.org/wiki/Datenstromalgorithmus)
What I do not understand is: the number of bits needed to store a number n is log2(n), OK... but I do have to calculate the sum, though. So n(n + 1)/2 is larger than n and therefore needs more space than just log(n), right?
Can someone help me with this? Thanks in advance!

If integer A in binary coding requires N_a bits and integer B requires N_b bits, then A*B requires no more than N_a + N_b bits (not N_a * N_b). So the expression n(n+1)/2 requires no more than log2(n) + log2(n+1) = O(2 log2(n)) = O(log2(n)) bits.
Even more, you may raise n to any fixed power i and it still will use O(log2(n)) space: n itself, n^10, n^500, n^10000000 all require O(log n) bits of storage.
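A minimal sketch (my own, not from the article) may make this concrete: the only state the streaming algorithm keeps is the running difference, and its bit length stays around 2*log2(n) rather than growing with the number of items.

def find_missing(n, stream):
    remaining = n * (n + 1) // 2    # sum of 1..n, needs roughly 2*log2(n) bits
    for x in stream:                # one pass over the n-1 incoming numbers
        remaining -= x              # never grows beyond the initial sum
    return remaining

n = 1_000_000
stream = (k for k in range(1, n + 1) if k != 123_456)   # 123456 is missing
print(find_missing(n, stream))                           # -> 123456
print((n * (n + 1) // 2).bit_length())                   # 39 bits here, about 2*log2(n), not O(n)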

Related

How to "explain" that the following hash function is bad

We have a hash table of size 16, using the double hashing method:
h1(k) = k mod 16
h2(k) = 2*(k mod 8)
I know that the h2 hash function is bad, probably because of the mod 8 and the factor of 2, but I don't know how to explain it. Is there an explanation like "h2 should mod by a prime or it will cause ____ problem"?
It is bad because it increases the number of collisions.
The (mod 8) means that you are only ever using 8 pigeonholes in your 16-pigeonhole table.
Multiplying it by 2 just spreads those 8 pigeonholes out so that you don’t have to search too many slots past the hashed index to find an empty hole...
You should always compute modulo the size of your table.
h(x) ::= x (mod N) // where N is the table size
The purpose of making the table size a prime number just has to do with how powers of two are very common in computer science. If your data is random, then the size of the table doesn’t matter.
— As long as it is big enough for your expected load factor. A 16-element table is very small; you shouldn’t expect to store more than 6-12 random values in your table without a high probability of collisions.
A very good linked thread is What is a good Hash Function?, which is totally worth a read just for the links to further reading alone.
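To make the collision argument concrete, here is a quick check (my own sketch, not from the thread) of how many distinct slots the proposed probe sequence h1(k) + i*h2(k) mod 16 can actually reach:

M = 16
h1 = lambda k: k % M
h2 = lambda k: 2 * (k % 8)

def reachable(k):
    # slots visited by the double-hashing probe sequence h1(k) + i*h2(k) (mod M)
    return {(h1(k) + i * h2(k)) % M for i in range(M)}

print(sorted({h2(k) for k in range(100)}))   # only 8 possible step sizes: 0, 2, 4, ..., 14
print(len(reachable(3)))                     # 8  -> an even step reaches at most half the table
print(len(reachable(8)))                     # 1  -> h2(k) = 0, the probe never moves at all

Every step is even (and sometimes zero), so a probe sequence can visit at most half of the slots, and keys with k mod 8 == 0 are stuck in a single slot.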

can Wolfram factor 300 digit RSA Number?

Everybody knows it's hard to factor a public key of over 100 digits, but 250-digit RSA numbers have already been factored, and Wolfram was able to factor a 300-digit number.
I tried factoring the public key n=144965985551789672595298753480434206198100243703843869912732139769072770813192027176664780173715597401753521552885189279272665054124083640424760144394629798590902883058370807005114592169348977123961322905036962506399782515793487504942876237605818689905761084423626547637902556832944887103223814087385426838463 using a simple prime-factor program, but it runs into an error and just keeps repeating.
import math

n = 144965985551789672595298753480434206198100243703843869912732139769072770813192027176664780173715597401753521552885189279272665054124083640424760144394629798590902883058370807005114592169348977123961322905036962506399782515793487504942876237605818689905761084423626547637902556832944887103223814087385426838463
i = 0
p = math.isqrt(n)      # exact integer square root; float sqrt loses precision on a number this large
while n % p != 0:
    p -= 1
    i += 1
q = n // p             # integer division; int(n / p) would lose precision here
print(p, q)
and here are the results:
Next I tried the Sieve of Eratosthenes:
import time
import math

def sieve(b):
    global prime_list
    for a in prime_list:
        if (a % prime_list[b] == 0 and a != prime_list[b]):
            prime_list.remove(a)

inp = 144965985551789672595298753480434206198100243703843869912732139769072770813192027176664780173715597401753521552885189279272665054124083640424760144394629798590902883058370807005114592169348977123961322905036962506399782515793487504942876237605818689905761084423626547637902556832944887103223814087385426838463
prime_list = []
i = 2
b = 0
t = time.time()
while i <= int(inp):
    prime_list.append(i)
    i += 1
while b < len(prime_list):
    sieve(b)
    b += 1
print(prime_list)
print("length of list: " + str(len(prime_list)))
print("time took: " + str((time.time() - t)))
That doesn't work either. But I believe a 300-digit number can be factored; I just don't understand why so many programmers gave up that easily and said it's impossible.
It is easier to factor a number if the factors are small. Here is a nice big 300 digit number for you:
100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
It’s pretty obvious what the factors are, right? The prime factors are 2^299 and 5^299, and that should be obvious from looking at the number.
So, some numbers are easier to factor than others.
RSA keys are made from two large prime numbers multiplied together, with no small factors. For a 309-digit number, the factors might each be over 150 digits. So if you try to use the Sieve of Eratosthenes to factor such a number, your program will try to calculate all the primes up to 150 digits, and that is simply too many prime numbers to calculate.
The number of prime numbers with less than 150 digits is about 10^147, so your program would take at least that many processor cycles to finish. This number is so large that if we took all of the computers in the entire world (maybe 10^21 or 10^22 operations per second), your program would take more than 10^117 years to run (again, using all of the computers in the entire world).
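As a rough sanity check of those figures (my own back-of-the-envelope sketch, using the prime number theorem pi(x) ≈ x / ln x and the answer's assumed 10^22 operations per second):

from math import log

x = 10**150                        # "primes with up to 150 digits"
primes_below = x / log(x)          # prime number theorem: about 2.9e147, i.e. on the order of 10^147
ops_per_second = 1e22              # optimistic: every computer on Earth combined
seconds_per_year = 3.15e7

print(f"{primes_below:.1e} candidate primes")                           # ~2.9e+147
print(f"{primes_below / ops_per_second / seconds_per_year:.1e} years")  # ~9.2e+117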
I just don't understand why so many programmers gave up that easily and said it's impossible.
It’s because factoring numbers is known to be a very hard problem—the people who give up are giving up because they know what the state of the art algorithms are for factoring large numbers (the General Number Field Sieve) and know that 300 digits is just not feasible without some kind of radical new development in algorithms, computers, or engineering.

Hashing using division method

For the hash function h(k) = k mod m:
I understand that m = 2^n will always give the last n bits of k. I also understand that m = 2^p - 1, when k is a string interpreted as an integer in radix 2^p, will give the same hash value for every permutation of the characters of k. But why exactly is "a prime not too close to an exact power of 2" a good choice? What if I choose 2^p - 2 or 2^p - 3? Why are these choices considered bad?
Following is the text from CLRS:
"A prime not too close to an exact power of 2 is often a good choice for m. For
example, suppose we wish to allocate a hash table, with collisions resolved by
chaining, to hold roughly n D 2000 character strings, where a character has 8 bits.
We don’t mind examining an average of 3 elements in an unsuccessful search, and
so we allocate a hash table of size m D 701. We could choose m D 701 because
it is a prime near 2000=3 but not near any power of 2."
Suppose we work with radix 2^p.
The 2^p - 1 case:
Why is it a bad idea to use 2^p - 1? Let us see:
k = ∑ a_i 2^(ip)
and if we reduce this modulo 2^p - 1 (where 2^p ≡ 1), we just get
k = ∑ a_i 2^(ip) ≡ ∑ a_i (mod 2^p - 1)
so, as addition is commutative, we can permute the digits and get the same result.
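A tiny illustration (my own, not part of the original answer): with p = 8 and m = 2^8 - 1 = 255, anagrams of a string hash to the same value.

p = 8
m = 2**p - 1                      # 255

def radix_hash(s, m):
    k = 0
    for ch in s:                  # interpret the string as a number in radix 2^p
        k = k * 2**p + ord(ch)
    return k % m

print(radix_hash("stop", m), radix_hash("pots", m), radix_hash("tops", m))   # all 199, because 2^p = 1 (mod 2^p - 1)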
The 2^p - b case:
Quote from CLRS:
A prime not too close to an exact power of 2 is often a good choice for m.
Here 2^p ≡ b (mod 2^p - b), so
k = ∑ a_i 2^(ip) ≡ ∑ a_i b^i (mod 2^p - b)
So changing the least significant digit by one changes the hash by one, and changing the second least significant digit by one changes the hash by b (for example, by two when b = 2). To really change the hash we would need to change digits of higher significance. So, in the case of small b we face a problem similar to the one where m is a power of 2: we depend mostly on the distribution of the least significant digits.
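A companion check for the 2^p - b case (again my own example, here with b = 3, so m = 2^8 - 3 = 253): a change in a low-order digit moves the hash by a small, predictable amount.

p, b = 8, 3
m = 2**p - b                      # 253

def radix_hash(digits, m):
    k = 0
    for d in digits:              # digits in radix 2^p, most significant first
        k = k * 2**p + d
    return k % m

base        = [200, 17, 42]
bump_last   = [200, 17, 43]       # least significant digit + 1
bump_second = [200, 18, 42]       # second least significant digit + 1
print(radix_hash(bump_last, m)   - radix_hash(base, m))   # 1
print(radix_hash(bump_second, m) - radix_hash(base, m))   # 3 = b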

Expected chain length after rehashing - Linear Hashing

There is one point of confusion I have about the load factor. Some sources say that it is just the number of keys in the hash table divided by the total number of slots, which is the same as the expected chain length for each slot. But that holds only under simple uniform hashing, right?
Suppose hash table T has n elements and m slots, and we expand T into T1 by redistributing the elements in slot T[0], rehashing them with h'(k) = k mod 2m. The hash function of T1 is then k mod 2m if k mod m < 1 (i.e. the key would have landed in the already-split slot 0) and k mod m otherwise. Many sources say that we "expand and rehash to maintain the load factor" (does this imply the expected chain length stays the same?). Since this is not simple uniform hashing, I think the probability that a key k enters a given slot is (1/4 + 1/(2(m-1))).
For a randomly selected key k, h(k) is evaluated first (there is a 50-50 chance of it being less than 1 or greater than or equal to 1). If it is less than 1, the key has just two possible slots, slot 0 or slot m, hence probability 1/4 (1/2 * 1/2). But if it is greater than or equal to 1, there are m-1 possible slots it could enter, hence probability 1/2 * 1/(m-1). So the expected chain length would now be n/4 + n/(2(m-1)). Am I on the right track?
The calculation for linear hashing should be the same as for "non-linear" hashing. With a certain initial number of buckets, a uniform distribution of hash values results in uniform placement. With enough expansions to double the size of the table, each of those values is randomly split over the larger space by the incremental re-hashing, and new values are also distributed over the larger space. Incrementally, each key is equally likely to end up at (initial bucket position) or (initial bucket position + m) as the table expands to that length.
There is a paper here which goes into detail about the chain length calculation under different circumstances (not just the average), specifically for linear hashing.
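For intuition, a small simulation (my own sketch, not from the paper): splitting bucket 0 of an m-bucket table with k mod 2m sends its keys to buckets 0 and m in roughly equal halves, which is what keeps the average chain length in line with the load factor.

import random
from collections import Counter

m = 8
keys = [random.randrange(10**6) for _ in range(10_000)]

before = Counter(k % m for k in keys)                                  # chain lengths before the split
after = Counter(k % (2 * m) if k % m == 0 else k % m for k in keys)    # bucket 0 rehashed with k mod 2m

print("bucket 0 before the split:", before[0])                 # about 10000 / 8
print("buckets 0 and m after the split:", after[0], after[m])  # each about half of that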

find discretization steps

I have data files F_j, each containing a list of numbers with an unknown number of decimal places. Each file contains discretized measurements of some continuous variable and
I want to find the discretization step d_j for file F_j
A solution I could come up with: for each F_j,
find the number (n_j) of decimal places;
multiply each number in F_j with 10^{n_j} to obtain integers;
find the greatest common divisor of the entire list.
I'm looking for an elegant way to find n_j with Matlab.
Also, finding the gcd of a long list of integers seems hard — do you have any better idea?
Finding the gcd of a long list of numbers isn't too hard. You can do it in time linear in the size of the list. If you get lucky, you can do it in time a lot less than linear. Essentially this is because:
gcd(a,b,c) = gcd(gcd(a,b),c)
and if either a=1 or b=1 then gcd(a,b)=1 regardless of the size of the other number.
So if you have a list of numbers xs you can do
g = xs(1);
for i = 2:length(xs)
    g = gcd(xs(i), g);
    if g == 1
        break
    end
end
The variable g will now store the gcd of the list.
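For the n_j part of the question, here is a sketch of the whole pipeline (count decimal places, scale to integers, take the gcd, scale back). It is written in Python purely for illustration, since the steps are language-agnostic and map one-to-one to MATLAB, and it assumes the values can be read as strings so the number of decimal places is not lost to floating-point rounding.

from math import gcd

def discretization_step(tokens):
    # tokens: the numbers exactly as they appear in the file, as strings
    decimals = lambda s: len(s.split(".")[1]) if "." in s else 0
    n = max(decimals(s) for s in tokens)               # n_j: most decimal places seen in the file

    ints = [round(float(s) * 10**n) for s in tokens]   # scale by 10^n_j so every value is an integer

    g = 0
    for v in ints:                                     # incremental gcd with early exit, as above
        g = gcd(g, abs(v))
        if g == 1:
            break
    return g / 10**n                                   # undo the scaling: the step in the original units

print(discretization_step(["0.10", "0.25", "1.40"]))   # -> 0.05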
Here is some sample code that I believe will help you get the GCD once you have the numbers you want to look at.
A = [15 30 20];
A_min = min(A);
GCD = 1;
for n = A_min:-1:1
    temp = A / n;
    if (max(mod(temp,1)) == 0)
        % yay GCD found
        GCD = n;
        break;
    end
end
The basic concept here is that the default GCD will always be 1, since every number is divisible by itself and 1 of course =). The GCD also can't be greater than the smallest number in the list, thus I start with the smallest number and then decrement by 1. This is assuming that you have already converted the numbers to whole-number form at this point. Decimals will throw this off!
By using the modulus of 1 you are testing whether the number is a whole number; if it isn't, you will have a decimal remainder left which is greater than 0. If you anticipate having to deal with negative numbers you will have to tweak this test!
Other than that, the first time you find a number where the modulus of the list (mod 1) is all zeros, you've found the GCD.
Enjoy!