Time Complexity - Order of Growth - constants

I have just started learning the order of growth and found the below exercise in the text.
We have the below equation,
T(n) = 3n + 32n^3 + 767249999n^2 = O (?)
Usually we drop the lower-order terms and constants, which would give O(n^3). But in this case the constant 767249999 is so large that it seems to make the n^2 term bigger even for very large values of n. So does this relation result in O(n^2) or in O(n^3)?
Thanks!

It results in O(n^3). This is because, for large enough n (no matter whether you consider that "small", "large" or "very large"), the term 32n^3 dominates 767249999n^2.
By the definition, for T(n) to be in O(n^2) you would need some constant c such that 32n^3 never exceeds cn^2. But that is impossible: for any n > c/32 you have 32n^3 = 32n*n^2 > cn^2.
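For a concrete sanity check, here is a small Python sketch (the sample values of n are my own) showing where the cubic term overtakes the quadratic one; the crossover is at n = 767249999 / 32, roughly 23,976,563:

def terms(n):
    # the three terms of T(n) = 3n + 32n^3 + 767249999n^2
    return 3 * n, 32 * n ** 3, 767249999 * n ** 2

for n in (10 ** 6, 24 * 10 ** 6, 10 ** 8):
    linear, cubic, quadratic = terms(n)
    print(n, cubic > quadratic)   # False, True, True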

Related

Hashing using division method

For the hash function h(k) = k mod m:
I understand that m = 2^n will always give the n least significant bits. I also understand that m = 2^p - 1, when K is a string interpreted as an integer in radix 2^p, gives the same hash value for every permutation of the characters of K. But why exactly is "a prime not too close to an exact power of 2" a good choice? What if I choose 2^p - 2 or 2^p - 3? Why are those choices considered bad?
Following is the text from CLRS:
"A prime not too close to an exact power of 2 is often a good choice for m. For
example, suppose we wish to allocate a hash table, with collisions resolved by
chaining, to hold roughly n D 2000 character strings, where a character has 8 bits.
We don’t mind examining an average of 3 elements in an unsuccessful search, and
so we allocate a hash table of size m D 701. We could choose m D 701 because
it is a prime near 2000=3 but not near any power of 2."
Suppose we work with radix 2^p.
The 2^p - 1 case:
Why is it a bad idea to use m = 2^p - 1? Let us see:
k = ∑ a_i·2^(i·p)
and since 2^p ≡ 1 (mod 2^p - 1), reducing modulo 2^p - 1 we just get
k = ∑ a_i·2^(i·p) ≡ ∑ a_i (mod 2^p - 1),
so, as addition is commutative, we can permute the digits and get the same result.
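A tiny demonstration of this in Python (my own example, with p = 8, i.e. radix 256 for 8-bit characters):

def radix_value(s, p=8):
    # interpret the string as a number in radix 2^p, most significant character first
    k = 0
    for ch in s:
        k = (k << p) | ord(ch)
    return k

m = (1 << 8) - 1                 # m = 2^p - 1 = 255
print(radix_value("stop") % m,
      radix_value("pots") % m,
      radix_value("tops") % m)   # all three hashes are equal (199)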
The 2^p - b case:
Quote from CLRS:
A prime not too close to an exact power of 2 is often a good choice for m.
Here 2^p ≡ b (mod 2^p - b), so
k = ∑ a_i·2^(i·p) ≡ ∑ a_i·b^i (mod 2^p - b).
So changing the least significant digit by one changes the hash by one, and changing the second least significant digit by one changes the hash by b. To really change the hash we would need to change digits of higher significance. So, for small b, we face a problem similar to the case where m is a power of 2, namely that the hash depends mostly on the distribution of the least significant digits.
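And a matching demonstration for the 2^p - b case (again my own numbers: p = 8, b = 2, so m = 254); changing the second least significant character by one shifts the hash by exactly b = 2:

def radix_value(s, p=8):
    # interpret the string as a number in radix 2^p, most significant character first
    k = 0
    for ch in s:
        k = (k << p) | ord(ch)
    return k

m = (1 << 8) - 2                     # m = 2^p - b with b = 2
print(radix_value("hash") % m)       # 30
print(radix_value("hath") % m)       # 32, i.e. the hash moved by b = 2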

Space complexity for a simple streaming algorithm

I want to determine the space complexity of the go to example of a simple streaming algorithm.
If you get a stream of n-1 different numbers out of 1..n and have to detect the one missing number, you calculate the sum of all numbers from 1 to n using the formula n(n + 1)/2 and then subtract each incoming number; what is left is the missing number. I found a German Wikipedia article stating that the space complexity of this algorithm is O(log n). (https://de.wikipedia.org/wiki/Datenstromalgorithmus)
What I do not understand is this: the number of bits needed to store a number n is log2(n), OK. But I also have to calculate the sum, though, and n(n + 1)/2 is larger than n, so it needs more space than just log(n), right?
Can someone help me with this? Thanks in advance!
If integer A in binary coding requires Na bits and integer B requires Nb bits then A*B requires no more than Na+Nb bits (not Na * Nb). So, expression n(n+1)/2 requires no more than log2(n) + log2(n+1) = O(2log2(n)) = O(log2(n)) bits.
What is more, you may raise n to any fixed power i and it will still use O(log2(n)) space: n itself, n^10, n^500, n^10000000 all require O(log(n)) bits of storage.
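A short sketch of the whole algorithm (my own Python, with an arbitrary n and missing value) that also prints the bit lengths, so you can see the running sum never needs more than about twice log2(n) bits:

def find_missing(n, stream):
    # the largest intermediate value is n(n+1)/2, which fits in ~2*log2(n) bits
    remaining = n * (n + 1) // 2
    for x in stream:
        remaining -= x
    return remaining

n = 1_000_000
stream = (i for i in range(1, n + 1) if i != 123_456)
print(find_missing(n, stream))                          # 123456
print(n.bit_length(), (n * (n + 1) // 2).bit_length())  # 20 and 39: roughly double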

Subset sum where the size of the subset is `k` is NPC?

I have a variation of the Subset-Sum problem where the size of the subset is k and all the integers are strictly positive (no zeros).
As can be seen online, this variant can be solved fairly easily with dynamic programming in pseudo-polynomial time.
I need to decide whether this problem is NPC or in P (assuming P != NP).
I have tried to reduce from the Subset-Sum problem, but ran into the constraint that all integers must be greater than zero; otherwise I would simply have padded the input with k zero integers.
Formal definition of the problem:
L = { <S1, S2, ..., Sn, T, k> | there exists a subset I of {S1, ..., Sn} of size k which sums up to T }
The problem is NP-complete.
If you could solve your problem in polynomial time, then you could also solve the general Subset Sum problem in polynomial time by calling your solver once for every possible subset size k = 1, ..., n, giving the bound
Time of Subset Sum <= n * (Your Problem's Time)
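A sketch of that reduction in Python (my own illustration; k_subset_sum stands in for a hypothetical solver of the size-k variant, here just brute force):

from itertools import combinations

def k_subset_sum(nums, target, k):
    # placeholder oracle for the size-k variant (brute-force stand-in)
    return any(sum(c) == target for c in combinations(nums, k))

def subset_sum(nums, target):
    # ordinary Subset Sum: try every possible subset size, n calls in total
    return any(k_subset_sum(nums, target, k) for k in range(1, len(nums) + 1))

print(subset_sum([3, 7, 12, 5], 19))   # True: 7 + 12
print(subset_sum([3, 7, 12, 5], 6))    # False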
That problem is NP-complete. In fact, when k is in n^(1-Ω(1)), even the combination of the numbers all being in n^O(k) with the promise that there is at most one solution is not suspected to put it in coNTIME(n^o(k)) / 2^(o(k*n*log(n))) infinitely often, since this paper's Proof of Theorem 5.1 gives a reduction that works for such k and preserves the number of solutions.

Spearman's rank correlation significance

I'm calculating Spearman's rank correlation in matlab with the following code:
[RHO,PVAL] = corr(x,y,'Type','Spearman');
RHO =
0.7211
PVAL =
4.9473e-04
and then with different variables
[RHO,PVAL] = corr(x2,y2,'Type','Spearman');
RHO =
0.3277
PVAL =
0.0060
How do you categorize these as p < 0.05, p < 0.01, p < 0.001, etc.? Commonly in scientific journals these p-values are represented as in the examples I've shown and not as one number. Would both of these be p < 0.01? When deciding whether a correlation is significant at a specific level, do you always look for the smallest threshold, i.e. if PVAL = 0.0005, both p > 0.05 and p > 0.001 would be correct here; do we simply write the lowest, i.e. p > 0.001?
As Martin Dinov wrote, this is at least partially a matter of journal policy. But as long as there is no explicit journal convention against it, I would recommend always reporting the actual p-value, in this case in the form p = 4.9·10^-4 and p = 0.006, respectively. You can then proceed to say that the effect you found is statistically significant, usually based on a comparison with a previously chosen significance level, typically 0.05, unless you need to correct for multiple comparisons.
The reason is that the commonly used significance levels are purely a matter of convention. Saying only that p is below one conventional threshold withholds valuable information from the reader, which she might use to make up her own mind about the result, and this truncation is not even justified by any relevant saving of print space.
You should also, of course, report the value of the correlation coefficient itself (which in this case doubles as a test statistic and an effect size) as well as the sample size.
At least for the field of psychology, these are official recommendations:
Hypothesis tests. It is hard to imagine a situation in which a dichotomous accept-reject decision is better than reporting an actual p value or, better still, a confidence interval.
…
Effect sizes. Always present effect sizes for primary outcomes. If the units of measurement are meaningful on a practical level (e.g., number of cigarettes smoked per day), then we usually prefer an unstandardized measure (regression coefficient or mean difference) to a standardized measure (r or d).
L. Wilkinson and the Task Force on Statistical Inference, "Statistical Methods in Psychology Journals. Guidelines and Explanations"
You mean PVAL is < 0.05 and also < 0.001, not >. In general, you do want to show that the p-value is smaller than the smallest significance level (alpha) threshold that you can. So yes, for the second example it is best to say that the p-value is < 0.01. Depending on the journal convention, it may be preferable to give the actual p-value (so, for the first example, 4.9473e-04) or just to state that it is below some conventional alpha (< 0.001 for the first case).
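If you want to automate the wording, a small Python helper along the lines of both answers (my own sketch; the thresholds are just the usual conventions) could be:

def describe_p(p, thresholds=(0.001, 0.01, 0.05)):
    # report the exact p-value and the smallest conventional threshold it clears
    for alpha in thresholds:
        if p < alpha:
            return f"p = {p:.2g} (p < {alpha})"
    return f"p = {p:.2g} (not significant at alpha = 0.05)"

print(describe_p(4.9473e-04))   # p = 0.00049 (p < 0.001)
print(describe_p(0.0060))       # p = 0.006 (p < 0.01)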

Predicting an overflow when counting combinations

We have the following formula for determining how many combinations C we can pick of size k out of a set of n: C(n, k) = n! / (k! (n - k)!).
I have written an algorithm which will always give an answer, provided of course that the answer falls within the range of the datatype (ulong, in my case), by factorising and cancelling terms in the numerator and denominator during evaluation.
Even though it is quite fast to try to compute C and detect an overflow if the result is too large, it would be better if I could feed n and k into a preliminary function which estimates whether the answer will be larger than what a ulong can hold. It doesn't have to be exact: if it estimates that a given n and k will not overflow but they do, that's fine, but it should never say it will overflow if it won't. Ideally this function should be very fast, otherwise there is no point in having it; I may as well just try to compute C directly and let it overflow.
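For illustration, a minimal Python sketch of this cancel-as-you-go idea (not the original C# implementation): keep a running binomial coefficient, which is always an integer, and flag an overflow as soon as it exceeds the ulong range:

def comb_with_check(n, k, limit=(1 << 64) - 1):
    # running value is C(n-k+i, i): always an integer, and monotonically increasing
    k = min(k, n - k)
    c = 1
    for i in range(1, k + 1):
        c = c * (n - k + i) // i
        if c > limit:
            return None          # would overflow a 64-bit unsigned integer
    return c

print(comb_with_check(67, 33))   # fits: about 1.42e19
print(comb_with_check(68, 34))   # None: exceeds the ulong range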
I was plotting the curve of the nCk for various n's as a function of k to see if I can find a curve which grows at least as fast as C(n, k) but doesn't diverge too far in the range I'm interested in (0..2^64-1) and is computationally easy to evaluate.
I didn't have any luck. Any ideas?
Without seeing the actual code for your algorithm, I can't give you a 100% solution, but your best bet is to develop a heuristic function. By finding, for a variety of n values, the smallest value of r for which the final answer nCr overflows, you should be able to analyze the relationship between n and the ratio n/r, and use regression to find a quick-to-calculate function which tells you whether overflow will occur.
I found that for any n < 68 you can never overflow on the final answer, as 67C33 = 67C34 ~ 1.42x10^19 is the largest possible answer, and a ulong holds ~1.84x10^19. Similarly, when n > 5000, any r with 5 < r < n - 5 will certainly overflow. You can tune these cutoffs to your liking, and for all the n values in between, just calculate n/r and use the regression formula to decide whether it will overflow.
This might be too much work, but it should at least get you started on the right path.
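A different, very cheap pre-check worth considering (my own suggestion, using a log-gamma estimate rather than the regression idea above): estimate log2(C(n, k)) and only predict an overflow when the estimate clears 64 bits by a comfortable margin, so it errs on the allowed side (it may miss a borderline overflow, which the question permits, but it should not falsely predict one for realistic n):

from math import comb, lgamma, log

def probably_overflows(n, k, bits=64, margin=0.5):
    # log2 of C(n, k) via log-gamma; the half-bit margin absorbs floating-point
    # error, so borderline cases default to "fits" (the acceptable direction)
    if k < 0 or k > n:
        return False
    log2_c = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2)
    return log2_c > bits + margin

print(probably_overflows(67, 33), comb(67, 33) < 2 ** 64)    # False True
print(probably_overflows(68, 34), comb(68, 34) >= 2 ** 64)   # True True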