What is the time complexity of this hash function?

I have a hashing function and I want to know if it is constant. Since the length of the array word is constant, does that mean the function is constant in Big O notation?
public int hash(String s) {
    if (s.length() > 7)
        return -1;
    for (int i = 0; i < word.length; ++i) {
        if (word[i].compareTo(s) == 0)
            return i;
    }
    return -1;
}

Since the length of the array word is constant, does that mean the function is constant in Big O notation?
Big O is used to describe how the run time or memory consumption of a process grows as its input grows. If your array is of constant length, then it will not grow and have an effect. Therefore, you can in this context consider hash() to run in O(1), assuming that the string comparisons are done in relatively constant time.
One way to think about it would be to say that since the length of the array is not variable, it should always be possible to "unroll" that loop so as to have a fixed number of O(1) comparisons one after the other, which all-in-all will still be O(1). Again, this presumes that the time taken to compare the strings is also constant (which in reality may not be the case if you have very large strings of varying lengths). Of course, if you know that the contents of the array will also be constant in addition to its length, then you can say for certain that the function will be O(1).
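To make the "fixed number of comparisons" argument concrete, here is a minimal Python sketch. Python is used only for illustration. The table WORDS, the limit MAX_LEN, and the helper names are assumptions, not part of the original code. It mirrors the loop in hash() but counts character comparisons, so the worst case can be checked against a constant bound.

```python
WORDS = ("alpha", "beta", "gamma", "delta")  # hypothetical fixed table (the `word` array)
MAX_LEN = 7                                  # same cut-off as the length check in hash()

def compare(a, b):
    """Return (equal, steps), where steps counts the character comparisons made."""
    steps = 0
    for x, y in zip(a, b):
        steps += 1
        if x != y:
            return False, steps
    return len(a) == len(b), steps + 1  # +1 for the final length check

def lookup(s):
    """Return (index, total steps), mirroring the structure of hash()."""
    if len(s) > MAX_LEN:
        return -1, 0
    total = 0
    for i, w in enumerate(WORDS):
        equal, steps = compare(w, s)
        total += steps
        if equal:
            return i, total
    return -1, total

# Each comparison costs at most min(len(w), len(s)) + 1 <= MAX_LEN + 1 steps.
BOUND = len(WORDS) * (MAX_LEN + 1)

print(lookup("delta"), "bound:", BOUND)
```

Under these assumptions the step count never exceeds BOUND, whatever string is passed in, which is what O(1) means here.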

The time required to compare two strings of lengths m and n is O(min{m, n} + 1). Let's suppose that k is the length of the word array, that m is the length of the longest word in word, and that n is the length of the input string. In this case, the function does O(k) string comparisons, each of which takes time O(min{m, n} + 1). Therefore, the runtime is O(k (min{m, n} + 1)).
Now, since k is known to be a constant, we can simplify this and say that the runtime will be O(min{m, n} + 1). If all of the strings in word are fixed constants, then m is a constant and the runtime is O(min{1, n} + 1) = O(1) and your hash function runs in constant time. Otherwise, if they're unboundedly long, the only thing you can claim is that the runtime is O(min{m, n} + 1).
Hope this helps!

This function is O(1) if word is constant.
s.length() runs in constant time regardless of the length of s.
The time it takes to run word[i].compareTo(s) is bounded by the length of word[i]. As long as word doesn't change, this means there is an upper bound for the time it takes to run the entire for loop.
So there's an upper bound on the time this function takes to run, and the function is O(1).
If word can change, I believe this function would be O(n) where n is the size of word. However, if the elements of word have increasing lengths, word[i].compareTo(s) will be bounded by larger and larger numbers, so the length of s might begin to matter. Perhaps the complexity is actually O(n^2). I don't know, and now I'm curious myself.

Your function has complexity O(N2), as it has two inputs:
s - your string (length N1)
word - array (length N2)
So your complexity will be O(N1*N2), which can be simplified to O(N2).
If the length N2 is really constant, then the function will have complexity O(N1) in the worst case.
If the length N1 is also constant, then we have O(1) complexity.


Calculating the e number using Raku

I'm trying to calculate the e constant (AKA Euler's Number) by calculating the formula e = 1/0! + 1/1! + 1/2! + 1/3! + ...
In order to calculate the factorial and division in one shot, I wrote this:
my @e = 1, { state $a=1; 1 / ($_ * $a++) } ... *;
say reduce * + * , @e[^10];
But it didn't work out. How to do it correctly?
I analyze your code in the section Analyzing your code. Before that I present a couple fun sections of bonus material.
One liner... or rather, one letter [1]
say e; # 2.718281828459045
"A treatise on multiple ways" [2]
Click the above link to see Damian Conway's extraordinary article on computing e in Raku.
The article is a lot of fun (after all, it's Damian). It's a very understandable discussion of computing e. And it's a homage to Raku's bicarbonate reincarnation of the TIMTOWTDI philosophy espoused by Larry Wall. [3]
As an appetizer, here's a quote from about halfway through the article:
Given that these efficient methods all work the same way—by summing (an initial subset of) an infinite series of terms—maybe it would be better if we had a function to do that for us. And it would certainly be better if the function could work out by itself exactly how much of that initial subset of the series it actually needs to include in order to produce an accurate answer...rather than requiring us to manually comb through the results of multiple trials to discover that.
And, as so often in Raku, it’s surprisingly easy to build just what we need:
sub Σ (Unary $block --> Numeric) {
(0..∞).map($block).produce(&[+]).&converge
}
Analyzing your code
Here's the first line, generating the series:
my @e = 1, { state $a=1; 1 / ($_ * $a++) } ... *;
The closure ({ code goes here }) computes a term. A closure has a signature, either implicit or explicit, that determines how many arguments it will accept. In this case there's no explicit signature. The use of $_ (the "topic" variable) results in an implicit signature that requires one argument that's bound to $_.
The sequence operator (...) repeatedly calls the closure on its left, passing the previous term as the closure's argument, to lazily build a series of terms until the endpoint on its right, which in this case is *, shorthand for Inf aka infinity.
The topic in the first call to the closure is 1. So the closure computes and returns 1 / (1 * 1) yielding the first two terms in the series as 1, 1/1.
The topic in the second call is the value of the previous one, 1/1, i.e. 1 again. So the closure computes and returns 1 / (1 * 2), extending the series to 1, 1/1, 1/2. It all looks good.
The next closure computes 1 / (1/2 * 3) which is 0.666667. That term should be 1 / (1 * 2 * 3). Oops.
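The same arithmetic can be replayed outside Raku. The following Python sketch is an illustration only, and buggy_terms and expected_terms are hypothetical names. It feeds the previous term back into the computation, as the sequence operator does with $_, and compares the result with the reciprocals of the factorials.

```python
from fractions import Fraction
from math import factorial

def buggy_terms(n):
    """Mimic `1, { state $a=1; 1 / ($_ * $a++) } ... *` for the first n terms."""
    terms = [Fraction(1)]
    a = 1
    while len(terms) < n:
        terms.append(1 / (terms[-1] * a))  # terms[-1] plays the role of $_
        a += 1
    return terms

def expected_terms(n):
    """The terms the formula asks for: 1/0!, 1/1!, 1/2!, ..."""
    return [Fraction(1, factorial(k)) for k in range(n)]

print(buggy_terms(4))     # fourth term is 2/3
print(expected_terms(4))  # fourth term is 1/6
```

The two lists agree for the first three terms and diverge at the fourth, which matches the hand calculation above.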
Making your code match the formula
Your code is supposed to match the formula:
e = 1/0! + 1/1! + 1/2! + 1/3! + ...
In this formula, each term is computed based on its position in the series. The kth term in the series (where k=0 for the first 1) is just factorial k's reciprocal.
(So it's got nothing to do with the value of the prior term. Thus $_, which receives the value of the prior term, shouldn't be used in the closure.)
Let's create a factorial postfix operator:
sub postfix:<!> (\k) { [×] 1 .. k }
(× is an infix multiplication operator, a nicer looking Unicode alias of the usual ASCII infix *.)
That's shorthand for:
sub postfix:<!> (\k) { 1 × 2 × 3 × .... × k }
(I've used pseudo metasyntactic notation inside the braces to denote the idea of adding or subtracting as many terms as required.)
More generally, putting an infix operator op in square brackets at the start of an expression forms a composite prefix operator that is the equivalent of reduce with => &[op],. See Reduction metaoperator for more info.
Now we can rewrite the closure to use the new factorial postfix operator:
my @e = 1, { state $a=1; 1 / $a++! } ... *;
Bingo. This produces the right series.
... until it doesn't, for a different reason. The next problem is numeric accuracy. But let's deal with that in the next section.
A one liner derived from your code
Maybe compress the three lines down to one:
say [+] .[^10] given 1, { 1 / [×] 1 .. ++$ } ... Inf
.[^10] applies to the topic, which is set by the given. (^10 is shorthand for 0..9, so the above code computes the sum of the first ten terms in the series.)
I've eliminated the $a from the closure computing the next term. A lone $ is the same as (state $), an anonymous state scalar. I made it a pre-increment instead of post-increment to achieve the same effect as you did by initializing $a to 1.
We're now left with the final (big!) problem, pointed out by you in a comment below.
Provided neither of its operands is a Num (a float, and thus approximate), the / operator normally returns a 100% accurate Rat (a limited precision rational). But if the denominator of the result exceeds 64 bits then that result is converted to a Num -- which trades accuracy for performance, a tradeoff we don't want to make. We need to take that into account.
To specify unlimited precision as well as 100% accuracy, simply coerce the operation to use FatRats. To do this correctly, just make (at least) one of the operands a FatRat (and none of the others a Num):
say [+] .[^500] given 1, { 1.FatRat / [×] 1 .. ++$ } ... Inf
I've verified this to 500 decimal digits. I expect it to remain accurate until the program crashes due to exceeding some limit of the Raku language or Rakudo compiler. (See my answer to Cannot unbox 65536 bit wide bigint into native integer for some discussion of that.)
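The difference between exact rationals and floating point can also be seen in a Python sketch. Here fractions.Fraction plays a role loosely similar to FatRat (arbitrary-precision numerator and denominator) and float is comparable to Num. The function names and the choice of 40 terms are assumptions made for illustration.

```python
from fractions import Fraction
from math import factorial

def e_exact(terms):
    """Sum 1/k! with exact rational arithmetic."""
    return sum(Fraction(1, factorial(k)) for k in range(terms))

def e_float(terms):
    """Sum 1/k! with double-precision floats."""
    return sum(1.0 / factorial(k) for k in range(terms))

def decimal_digits(value, digits):
    """Truncated decimal expansion of a Fraction whose integer part has one digit."""
    scaled = value.numerator * 10**digits // value.denominator
    text = str(scaled)
    return text[0] + "." + text[1:]

print(decimal_digits(e_exact(40), 30))  # exact digits, limited only by the number of terms
print(repr(e_float(40)))                # a float holds roughly 16 significant digits
```

With exact rationals, adding more terms keeps adding correct digits. The float sum stops improving once the terms fall below the float's resolution.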
Footnotes
[1] Raku has a few important mathematical constants built in, including e, i, and pi (and its alias π). Thus one can write Euler's Identity in Raku somewhat like it looks in math books. With credit to RosettaCode's Raku entry for Euler's Identity:
# There's an invisible character between <> and i⁢π character pairs!
sub infix:<⁢> (\left, \right) is tighter(&infix:<**>) { left * right };
# Raku doesn't have built in symbolic math so use approximate equal
say e**i⁢π + 1 ≅ 0; # True
[2] Damian's article is a must read. But it's just one of several admirable treatments that are among the 100+ matches for a google for 'raku "euler's number"'.
[3] See TIMTOWTDI vs TSBO-APOO-OWTDI for one of the more balanced views of TIMTOWTDI written by a fan of Python. But there are downsides to taking TIMTOWTDI too far. To reflect this latter "danger", the Perl community coined the humorously long, unreadable, and understated TIMTOWTDIBSCINABTE -- There Is More Than One Way To Do It But Sometimes Consistency Is Not A Bad Thing Either, pronounced "Tim Toady Bicarbonate". Strangely enough, Larry applied bicarbonate to Raku's design and Damian applies it to computing e in Raku.
There are fractions in $_. Thus you need 1 / (1/$_ * $a++) or rather $_ / $a++.
In Raku you could do this calculation step by step:
1.FatRat,1,2,3 ... * #1 1 2 3 4 5 6 7 8 9 ...
andthen .produce: &[*] #1 1 2 6 24 120 720 5040 40320 362880
andthen .map: 1/* #1 1 1/2 1/6 1/24 1/120 1/720 1/5040 1/40320 1/362880 ...
andthen .produce: &[+] #1 2 2.5 2.666667 2.708333 2.716667 2.718056 2.718254 2.718279 2.718282 ...
andthen .[50].say #2.71828182845904523536028747135266249775724709369995957496696762772
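For readers who don't use Raku, the same step-by-step pipeline can be sketched in Python, with itertools.accumulate standing in for .produce and fractions.Fraction standing in for FatRat. The function name partial_sums_of_e is an assumption made for illustration.

```python
from fractions import Fraction
from itertools import accumulate, chain, count, islice
import operator

def partial_sums_of_e():
    """Lazily yield the partial sums 1, 2, 5/2, 8/3, ... of the series for e."""
    factors = chain([Fraction(1)], map(Fraction, count(1)))  # 1 1 2 3 4 ...
    factorials = accumulate(factors, operator.mul)            # 1 1 2 6 24 ...
    reciprocals = (1 / f for f in factorials)                 # 1 1 1/2 1/6 ...
    return accumulate(reciprocals)                            # running sums

# Element 50 of the lazy sequence, like `.[50]` in the Raku version.
approx = next(islice(partial_sums_of_e(), 50, None))
print(float(approx))
```

Every stage is lazy, so only the 51 elements actually requested are computed.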

Does Swift have quadratic string concatenation when using var?

In the Swift Language Reference, under String Mutability it says:
You indicate whether a particular String can be modified (or mutated) by assigning it to a variable (in which case it can be modified), or to a constant (in which case it cannot be modified)
It's unclear to me if the "it" that is mutable is the variable or the value.
For example, if I write:
var s = ""
for _ in 1...100 {
    s += "a"
}
Is it akin to creating an NSMutableString and calling appendString 100 times (i.e. linear cost)?
Or is it akin to creating a series of ever-larger NSString instances and combining them with stringByAppendingString (i.e. quadratic cost)?
Or perhaps it creates some kind of rope structure behind the scenes, so it's immutable and linear in aggregate?
Appending to a collection like this (while String is not itself a collection, you're essentially appending to its characters view with that code) is linear, not quadratic. A string in Swift has an internal buffer whose size is doubled whenever it fills up, which means you will see fewer and fewer reallocations as you repeatedly append. The documentation describes appending in this way as an "amortized" O(1) operation: most of the time appending is O(1), but occasionally it will need to reallocate the string's storage.
Arrays, sets, and dictionaries have the same behavior, although you can also reserve a specific capacity for an array (using reserveCapacity(_:)) if you know you'll be appending many times.
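The effect of doubling the buffer can be checked with a small simulation. This Python sketch is a model of the growth strategy described above, not Swift's actual implementation, and the function names are made up for illustration. It counts how many elements have to be copied during reallocations for n appends.

```python
def copies_with_doubling(n_appends):
    """Elements copied when the capacity doubles each time the buffer is full."""
    capacity, size, copied = 1, 0, 0
    for _ in range(n_appends):
        if size == capacity:
            copied += size   # reallocation moves every existing element
            capacity *= 2
        size += 1
    return copied

def copies_with_exact_fit(n_appends):
    """Elements copied when every append allocates a buffer exactly one larger."""
    return sum(range(n_appends))  # 0 + 1 + ... + (n - 1)

for n in (1_000, 10_000, 100_000):
    print(n, copies_with_doubling(n), copies_with_exact_fit(n))
```

With doubling, the total number of copied elements stays below 2n, so the cost per append is constant on average. Without spare capacity it grows quadratically, which is the immutable-string behaviour the question was worried about.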
All these collections use "copy-on-write" to guarantee value semantics. Here, x and y share a buffer:
var x = "a"
let y = x
If you mutate x, it gets a new, unique copy of the buffer:
x += "b"
// x == "ab"
// y == "a"
After that, x has its own buffer, so subsequent mutations won't require a copy.
x += "c" // no copy unless buffer is full
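The copy-on-write behaviour can be modelled in a few lines. The Python sketch below is a toy model with an explicit owner count, not how Swift implements it. The names CowString, _Buffer, and owners are hypothetical.

```python
class _Buffer:
    """Storage that may be shared by several CowString values."""
    def __init__(self, chars):
        self.chars = list(chars)
        self.owners = 1

class CowString:
    def __init__(self, text=""):
        self._buf = _Buffer(text)

    def copy(self):
        """Assignment in the model: share the buffer instead of copying it."""
        other = CowString.__new__(CowString)
        other._buf = self._buf
        self._buf.owners += 1
        return other

    def append(self, text):
        if self._buf.owners > 1:
            # Buffer is shared: detach and take a private copy before mutating.
            self._buf.owners -= 1
            self._buf = _Buffer(self._buf.chars)
        self._buf.chars.extend(text)

    def __str__(self):
        return "".join(self._buf.chars)

x = CowString("a")
y = x.copy()       # x and y share one buffer
x.append("b")      # x detaches and copies; y is untouched
x.append("c")      # x is the sole owner now, so no copy is made
print(x, y)
```

Only the first mutation after sharing pays for a copy. Later appends reuse the buffer that x now owns alone.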

kdb c++ interface: create byte list from std::string

The following is very slow for long strings:
std::string s = "long string";
K klist = DBVec::CreateList(KG, s.length());
for (size_t i = 0; i < s.length(); i++)
{
    kG(klist)[i] = s.c_str()[i];
}
It works acceptably fast (<100ms) for strings up to 100k, but slows to a crawl (tens of minutes, possibly hours) for strings of a few million characters. I don't see anything other than kG that can create nonlinearity. I don't see any reason for accessor function kG to be non-constant time, but there is just nothing else in this loop. Unfortunately I don't know how kG works due to lack of documentation.
Question: given a blob of binary data as std::string, what's the efficient way to construct a byte list?
kG is a macro defined in k.h which expands to ((x)->G0), i.e. follow the G0 pointer of the K object
http://kx.com/q/d/a/c.htm#Strings documents kp, which creates a K string object directly from a string, so presumably you could do K klist = kp(s.c_str()), which is probably faster
This works:
memcpy(kG(klist), s.c_str(), s.length());
Still wonder why that loop is not O(N).

Could max introduce round-off error?

In general, the == operator is not suited to test for "numeric" equality, but one should rather do something like abs(a - b) < eps. However, when I want to find the location of the largest element in an array, is it safe to assume that max will return the element unchanged? Is it ok to do
[row, col] = find(a == max(a(:)));
Yes.
max only compares two values, and does not do any operations on them that might change their values.
Here's a typical C++ implementation of a max:
template <class T>
T max(T a, T b) {
    return a > b ? a : b;
}
As you see, this function will return the exact same value as either a or b.
Matlab just adds matrix formalism, fancy formatting wrappers etc. to it, but its kernel will follow the same principles as the example above.
So yes, it is OK to use equality here.
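The same point can be illustrated in Python (used here only as a neutral language; the matrix a is made-up data). Comparing against the result of max is exact because max selects one of the stored values, whereas comparing against a freshly computed value is where round-off bites.

```python
a = [[0.1 + 0.2, 0.3],
     [1 / 3, 0.25]]

# max only selects an existing element; it does no arithmetic on it.
largest = max(max(row) for row in a)

# Exact equality is safe: `largest` is bit-for-bit one of the stored values.
hits = [(r, c) for r, row in enumerate(a) for c, v in enumerate(row) if v == largest]
print(hits)

# By contrast, equality against a recomputed value can fail due to round-off.
print(0.1 + 0.2 == 0.3)
```

Here the largest element is found at row 1, column 0 (zero-based), while the last comparison is False because 0.1 + 0.2 is not exactly 0.3 in binary floating point.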

Performance difference between functions and pattern matching in Mathematica

So Mathematica is different from other dialects of lisp because it blurs the lines between functions and macros. In Mathematica if a user wanted to write a mathematical function they would likely use pattern matching like f[x_]:= x*x instead of f=Function[{x},x*x] though both would return the same result when called with f[x]. My understanding is that the first approach is something equivalent to a lisp macro and in my experience is favored because of the more concise syntax.
So I have two questions, is there a performance difference between executing functions versus the pattern matching/macro approach? Though part of me wouldn't be surprised if functions were actually transformed into some version of macros to allow features like Listable to be implemented.
The reason I care about this question is because of the recent set of questions (1) (2) about trying to catch Mathematica errors in large programs. If most of the computations were defined in terms of Functions, it seems to me that keeping track of the order of evaluation and where the error originated would be easier than trying to catch the error after the input has been rewritten by the successive application of macros/patterns.
The way I understand Mathematica is that it is one giant search replace engine. All functions, variables, and other assignments are essentially stored as rules and during evaluation Mathematica goes through this global rule base and applies them until the resulting expression stops changing.
It follows that the fewer times you have to go through the list of rules the faster the evaluation. Looking at what happens using Trace (using gdelfino's function g and h)
In[1]:= Trace@(#*#)&@x
Out[1]= {x x,x^2}
In[2]:= Trace@g@x
Out[2]= {g[x],x x,x^2}
In[3]:= Trace@h@x
Out[3]= {{h,Function[{x},x x]},Function[{x},x x][x],x x,x^2}
it becomes clear why anonymous functions are fastest and why using Function introduces additional overhead over a simple SetDelayed. I recommend looking at the introduction of Leonid Shifrin's excellent book, where these concepts are explained in some detail.
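The "apply rules until the expression stops changing" model can be sketched as a toy rewriter. The Python code below is a simplified illustration under stated assumptions, not how Mathematica's evaluator is implemented. Expressions are tuples of the form (head, arg, ...), and the two rules are hand-written stand-ins for g[x_] := x x and the built-in simplification of x x to x^2.

```python
def rewrite_once(expr, rules):
    """Apply the first matching rule at the root, else descend into the parts."""
    for matches, transform in rules:
        if matches(expr):
            return transform(expr)
    if isinstance(expr, tuple):
        return tuple(rewrite_once(part, rules) for part in expr)
    return expr

def trace(expr, rules):
    """Rewrite until nothing changes; return every intermediate form."""
    steps = [expr]
    while True:
        rewritten = rewrite_once(expr, rules)
        if rewritten == expr:
            return steps
        steps.append(rewritten)
        expr = rewritten

def is_call(expr, head, arity):
    return isinstance(expr, tuple) and len(expr) == arity + 1 and expr[0] == head

RULES = [
    # g[x_] := x x
    (lambda e: is_call(e, "g", 1), lambda e: ("Times", e[1], e[1])),
    # Times[x, x] -> Power[x, 2]
    (lambda e: is_call(e, "Times", 2) and e[1] == e[2], lambda e: ("Power", e[1], 2)),
]

print(trace(("g", "x"), RULES))
```

The three forms it returns correspond to the g[x], x x, x^2 steps in the Trace output for g. Every extra rule that has to fire adds another pass, which is the overhead being discussed.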
I have on occasion constructed a Dispatch table of all the functions I need and manually applied it to my starting expression. This provides a significant speed increase over normal evaluation as none of Mathematica's inbuilt functions need to be matched against my expression.
My understanding is that the first approach is something equivalent to a lisp macro and in my experience is favored because of the more concise syntax.
Not really. Mathematica is a term rewriter, as are Lisp macros.
So I have two questions, is there a performance difference between executing functions versus the pattern matching/macro approach?
Yes. Note that you are never really "executing functions" in Mathematica. You are just applying rewrite rules to change one expression into another.
Consider mapping the Sqrt function over a packed array of floating point numbers. The fastest solution in Mathematica is to apply the Sqrt function directly to the packed array because it happens to implement exactly what we want and is optimized for this special case:
In[1] := xs = N@Range[100000];
In[2] := Sqrt[xs]; // AbsoluteTiming
Out[2] = {0.0060000, Null}
We might define a global rewrite rule that has terms of the form sqrt[x] rewritten to Sqrt[x] such that the square root will be calculated:
In[3] := Clear[sqrt];
sqrt[x_] := Sqrt[x];
Map[sqrt, xs]; // AbsoluteTiming
Out[3] = {0.4800007, Null}
Note that this is ~100× slower than the previous solution.
Alternatively, we might define a global rewrite rule that replaces the symbol sqrt with a lambda function that invokes Sqrt:
In[4] := Clear[sqrt];
sqrt = Function[{x}, Sqrt[x]];
Map[sqrt, xs]; // AbsoluteTiming
Out[4] = {0.0500000, Null}
Note that this is ~10× faster than the previous solution.
Why? Because the slow second solution is looking up the rewrite rule sqrt[x_] :> Sqrt[x] in the inner loop (for each element of the array) whereas the fast third solution looks up the value Function[...] of the symbol sqrt once and then applies that lambda function repeatedly. In contrast, the fastest first solution is a loop calling sqrt written in C. So searching the global rewrite rules is extremely expensive and term rewriting is expensive.
If so, why is Sqrt ever fast? You might expect a 2× slowdown instead of 10× because we've replaced one lookup for Sqrt with two lookups for sqrt and Sqrt in the inner loop but this is not so because Sqrt has the special status of being a built-in function that will be matched in the core of the Mathematica term rewriter itself rather than via the general-purpose global rewrite table.
Other people have described much smaller performance differences between similar functions. I believe the performance differences in those cases are just minor differences in the exact implementation of Mathematica's internals. The biggest issue with Mathematica is the global rewrite table. In particular, this is where Mathematica diverges from traditional term-level interpreters.
You can learn a lot about Mathematica's performance by writing mini Mathematica implementations. In this case, the above solutions might be compiled to (for example) F#. The array may be created like this:
> let xs = [|1.0..100000.0|];;
...
The built-in sqrt function can be converted into a closure and given to the map function like this:
> Array.map sqrt xs;;
Real: 00:00:00.006, CPU: 00:00:00.015, GC gen0: 0, gen1: 0, gen2: 0
...
This takes 6ms just like Sqrt[xs] in Mathematica. But that is to be expected because this code has been JIT compiled down to machine code by .NET for fast evaluation.
Looking up rewrite rules in Mathematica's global rewrite table is similar to looking up the closure in a dictionary keyed on its function name. Such a dictionary can be constructed like this in F#:
> open System.Collections.Generic;;
> let fns = Dictionary<string, (obj -> obj)>(dict["sqrt", unbox >> sqrt >> box]);;
This is similar to the DownValues data structure in Mathematica, except that we aren't searching multiple resulting rules for the first to match on the function arguments.
The program then becomes:
> Array.map (fun x -> fns.["sqrt"] (box x)) xs;;
Real: 00:00:00.044, CPU: 00:00:00.031, GC gen0: 0, gen1: 0, gen2: 0
...
Note that we get a similar 10× performance degradation due to the hash table lookup in the inner loop.
An alternative would be to store the DownValues associated with a symbol in the symbol itself in order to avoid the hash table lookup.
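The difference between looking a function up on every element and looking it up once can be sketched in Python as well. This is only an analogy: a Python dict lookup is not representative of the cost of Mathematica's rule matching, and the names table, lookup_every_time, and lookup_once are made up for illustration.

```python
import math
from timeit import timeit

xs = [float(i) for i in range(1, 100_001)]
table = {"sqrt": math.sqrt}  # stand-in for a global table of definitions

def lookup_every_time(values):
    """Resolve the name inside the loop, once per element."""
    return [table["sqrt"](x) for x in values]

def lookup_once(values):
    """Resolve the name a single time, then apply the function repeatedly."""
    f = table["sqrt"]
    return [f(x) for x in values]

print("per-element lookup:", timeit(lambda: lookup_every_time(xs), number=10))
print("single lookup:     ", timeit(lambda: lookup_once(xs), number=10))
```

Both functions return the same list. Only the amount of lookup work in the inner loop differs, and the measured gap depends on the interpreter and machine.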
We can even write a complete term rewriter in just a few lines of code. Terms may be expressed as values of the following type:
> type expr =
    | Float of float
    | Symbol of string
    | Packed of float []
    | Apply of expr * expr [];;
Note that Packed implements Mathematica's packed lists, i.e. unboxed arrays.
The following init function constructs a List with n elements using the function f, returning a Packed if every return value was a Float or a more general Apply(Symbol "List", ...) otherwise:
> let init n f =
    let rec packed ys i =
      if i=n then Packed ys else
        match f i with
        | Float y ->
            ys.[i] <- y
            packed ys (i+1)
        | y ->
            Apply(Symbol "List", Array.init n (fun j ->
              if j<i then Float ys.[j]
              elif j=i then y
              else f j))
    packed (Array.zeroCreate n) 0;;
val init : int -> (int -> expr) -> expr
The following rule function uses pattern matching to identify expressions that it can understand and replaces them with other expressions:
> let rec rule = function
    | Apply(Symbol "Sqrt", [|Float x|]) ->
        Float(sqrt x)
    | Apply(Symbol "Map", [|f; Packed xs|]) ->
        init xs.Length (fun i -> rule(Apply(f, [|Float xs.[i]|])))
    | f -> f;;
val rule : expr -> expr
Note that the type of this function expr -> expr is characteristic of term rewriting: rewriting replaces expressions with other expressions rather than reducing them to values.
Our program can now be defined and executed by our custom term rewriter:
> rule (Apply(Symbol "Map", [|Symbol "Sqrt"; Packed xs|]));;
Real: 00:00:00.049, CPU: 00:00:00.046, GC gen0: 24, gen1: 0, gen2: 0
We've recovered the performance of Map[Sqrt, xs] in Mathematica!
We can even recover the performance of Sqrt[xs] by adding an appropriate rule:
    | Apply(Symbol "Sqrt", [|Packed xs|]) ->
        Packed(Array.map sqrt xs)
I wrote an article on term rewriting in F#.
Some measurements
Based on @gdelfino's answer and comments by @rcollyer I made this small program:
j = # # + # # &;
g[x_] := x x + x x ;
h = Function[{x}, x x + x x ];
anon = Table[Timing[Do[ # # + # # &[i], {i, k}]][[1]], {k, 10^5, 10^6, 10^5}];
jj = Table[Timing[Do[ j[i], {i, k}]][[1]], {k, 10^5, 10^6, 10^5}];
gg = Table[Timing[Do[ g[i], {i, k}]][[1]], {k, 10^5, 10^6, 10^5}];
hh = Table[Timing[Do[ h[i], {i, k}]][[1]], {k, 10^5, 10^6, 10^5}];
ListLinePlot[ {anon, jj, gg, hh},
PlotStyle -> {Black, Red, Green, Blue},
PlotRange -> All]
The results are, at least for me, very surprising:
Any explanations? Please feel free to edit this answer (comments are a mess for long text)
Edit
Tested with the identity function f[x] = x to isolate the parsing from the actual evaluation. Results (same colors):
Note: results are very similar to this Plot for constant functions (f[x]:=1);
Pattern matching seems faster:
In[1]:= g[x_] := x*x
In[2]:= h = Function[{x}, x*x];
In[3]:= Do[h[RandomInteger[100]], {1000000}] // Timing
Out[3]= {1.53927, Null}
In[4]:= Do[g[RandomInteger[100]], {1000000}] // Timing
Out[4]= {1.15919, Null}
Pattern matching is also more flexible as it allows you to overload a definition:
In[5]:= g[x_] := x * x
In[6]:= g[x_,y_] := x * y
For simple functions you can compile to get the best performance:
In[7]:= k[x_] = Compile[{x}, x*x]
In[8]:= Do[k[RandomInteger[100]], {100000}] // Timing
Out[8]= {0.083517, Null}
You can use the function recordSteps from the previous answer to see what Mathematica actually does with Functions. It treats Function just like any other head. That is, suppose you have the following:
f = Function[{x}, x + 2];
f[2]
It first transforms f[2] into
Function[{x}, x + 2][2]
At the next step, x+2 is transformed into 2+2. Essentially, "Function" evaluation behaves like an application of pattern matching rules, so it shouldn't be surprising that it's not faster.
You can think of everything in Mathematica as an expression, where evaluation is the process of rewriting parts of the expression in a predefined sequence. This applies to Function just as it does to any other head.