In a Fibonacci heap, the analysis of all operations is amortized. Why can't we have an ordinary worst-case analysis, as in the case of a binomial heap?
In a binomial heap, each operation is guaranteed to run with a certain worst-case performance. An insertion will never take more than time O(log n), a merge will never take more than time O(log n + log m), etc. Therefore, when analyzing the efficiency of a binomial heap, it's common to use a more traditional algorithmic analysis.
Now, that said, there are several properties of binomial heaps that only become apparent when doing an amortized analysis. For example, what's the cost of doing n consecutive insertions into a binomial heap, assuming the heap is initially empty? You can show that, in this case, the amortized cost of an insertion is O(1), meaning that the total cost of doing n insertions is O(n). In that sense, using an amortized analysis on top of a traditional analysis reveals more insights about the data structure than might initially arise from a more conservative worst-case analysis.
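To make the counting argument concrete, here is a small sketch (the helper name is mine, and the binary-counter view is only an analogy for the tree merging): n insertions into an initially empty binomial heap behave like n increments of a binary counter, with each carry corresponding to linking two trees of equal order.

def total_link_operations(n):
    # trees[k] is True when a tree of order k is currently in the heap.
    trees = []
    links = 0
    for _ in range(n):
        k = 0
        # A new order-0 tree can cascade like a carry chain.
        while k < len(trees) and trees[k]:
            trees[k] = False   # two order-k trees are linked ...
            links += 1
            k += 1             # ... producing a single order-(k+1) tree
        if k == len(trees):
            trees.append(True)
        else:
            trees[k] = True
    return links

for n in (10, 1000, 100000):
    print(n, total_link_operations(n))  # always strictly less than n

The number of links is n minus the number of one bits in the binary representation of n, so the total work for n insertions is Θ(n), i.e. O(1) amortized per insertion, even though a single insertion can trigger about log2(n) links.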
In some sense, Fibonacci heaps are best analyzed in an amortized sense because even though the worst-case bounds on many of the operations really aren't that great (for example, a delete-min or decrease-key can take time Θ(n) in the worst case), across any series of operations the Fibonacci heap has excellent amortized performance. Even though an individual delete-min might take Θ(n) time, it's never possible for a series of m delete-mins to take more than Θ(m log n) time.
In another sense, though, Fibonacci heaps were specifically designed to be efficient in an amortized sense rather than a worst-case sense. They were initially invented to speed up Dijkstra's and Prim's algorithms, where all that mattered was the total cost of doing m decrease-keys and n delete-mins on an n-node heap, and since that was the design goal, the designers made no attempt to make the Fibonacci heap efficient in the worst case.
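As a worked example of why only the amortized bounds matter there: Dijkstra's algorithm on a graph with n nodes and m edges performs n inserts, n delete-mins, and at most m decrease-keys, so the total heap cost is n · O(1) for the inserts, m · O(1) for the decrease-keys, and n · O(log n) for the delete-mins, for a total of O(m + n log n), even though any individual delete-min along the way may be expensive.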
Related
How can I calculate the time complexity, using mathematics or Big O notation, of the operations used in the algebra of data?
I will use an example from the book to explain my question. Consider the following example given in the book.
[The book's example of transpose and compose operations is not reproduced here.]
In the above example, I would like to calculate the time complexity of the transpose and compose operations.
If possible, I would also like to find out the time complexity of the other algebra-of-data operations.
Please let me know if you need more explanation.
@wesholler I edited my question following your explanation. Below is a real-life example; suppose we want to calculate the time complexity of the operations used there.
Suppose I have the following algebra-of-data operations:
Could you describe how we would calculate the time complexity in the above example, preferably in Big O?
Thanks
This answer has three parts:
General Time Complexity Analysis
Generally, the time complexity (Big O) can be determined by considering the origin of an operation - that is, which operations were extended from more primitive algebras to derive this one.
The following rules describe the upper-bound on the time complexity for both unary and binary operations that are extended into their power set algebras.
Unary extension can be thought of as similar to a map operation and so has linear time complexity. Binary extension evaluates the operation on the cross product of its arguments and so has a worst-case time complexity of roughly O(n^2). However, it is important to remember that the real upper bound is the product of the cardinalities of the two arguments, i.e. O(|X| * |Y|); this comes up often in practice, for example when the right-hand argument to a composition or superstriction operation is a singleton.
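As a minimal sketch of those two rules, using plain Python sets rather than algebraixlib's own classes (the helper names extend_unary and extend_binary are mine):

def extend_unary(unary_op, X):
    # Like a map over X: one call per element, so O(|X|).
    return {unary_op(x) for x in X}

def extend_binary(binary_op, X, Y):
    # Evaluates the operation on the cross product of X and Y, so O(|X| * |Y|),
    # which is only "O(n^2)" when both arguments have comparable size n.
    results = set()
    for x in X:
        for y in Y:
            r = binary_op(x, y)
            if r is not None:  # partial operations may be undefined for a pair
                results.add(r)
    return results

print(extend_unary(lambda x: -x, {1, 2, 3}))                 # {-1, -2, -3}
print(extend_binary(lambda a, b: a + b, {1, 2}, {10, 20}))   # {11, 12, 21, 22}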
Time Complexity for algebraixlib Implementations
We can take a look at a few examples of how extension affects the time complexity while at the same time analyzing the complexity of the implementations in algebraixlib (the last part talks about other implementations.)
Because it is a reference implementation for data algebra, algebraixlib implements the extended operations very literally. For that reason, Big Theta is used below, because the formulas represent both the lower and the upper bound of the time complexity.
Here is the unary operation transpose being extended from couplets to relations and then to clans.
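In a simplified plain-Python sketch (not the actual algebraixlib source), modeling couplets as 2-tuples, relations as frozensets of couplets, and clans as frozensets of relations, the extension looks roughly like this:

def transpose_couplet(couplet):
    left, right = couplet
    return (right, left)                                  # Theta(1)

def transpose_relation(relation):
    # One couplet transpose per couplet: Theta(|relation|).
    return frozenset(transpose_couplet(c) for c in relation)

def transpose_clan(clan):
    # One relation transpose per relation: Theta(total number of couplets).
    return frozenset(transpose_relation(r) for r in clan)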
Likewise, here is the binary operation compose being extended from couplets to relations and then to clans.
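A corresponding sketch for compose, with the same caveats; couplet composition here follows the plain relational convention ((a, b) composed with (c, d) gives (a, d) when b == c and is undefined otherwise), which may not match the library's argument order exactly:

def compose_couplets(c1, c2):
    a, b = c1
    c, d = c2
    return (a, d) if b == c else None                     # Theta(1)

def compose_relations(r1, r2):
    # Cross product of the two relations: Theta(|r1| * |r2|) couplet compositions.
    out = set()
    for c1 in r1:
        for c2 in r2:
            composed = compose_couplets(c1, c2)
            if composed is not None:
                out.add(composed)
    return frozenset(out)

def compose_clans(clan1, clan2):
    # Cross product of the two clans, composing each pair of relations.
    return frozenset(compose_relations(r1, r2) for r1 in clan1 for r2 in clan2)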
It is clear that the complexity of both clan operations is influenced both by the number of relations in the clan and by the number of couplets in those relations.
Time Complexity for Other Implementations
It is important to note that the above section describes the time complexity that is specific to the algorithms implemented in algebraixlib.
One could imagine implementing e.g. clans.cross_union with a method similar to sort-merge-join or hash-join. In this case, the upper bound would remain the same, but the lower bound (and expected) time complexity would be reduced by one or more degrees.
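For instance, a hash-join-flavoured version of the relation composition sketched above (again only an illustration, not a drop-in replacement for any algebraixlib function) buckets the couplets of one argument by their join value, so each couplet on the other side is only paired with couplets it can actually combine with:

from collections import defaultdict

def compose_relations_hashjoin(r1, r2):
    # Bucket the couplets of r2 by their left component ...
    by_left = defaultdict(list)
    for c, d in r2:
        by_left[c].append((c, d))
    # ... so each couplet of r1 only meets matching couplets of r2.
    out = set()
    for a, b in r1:
        for _, d in by_left.get(b, ()):
            out.add((a, d))
    return frozenset(out)

The expected cost drops to roughly Theta(|r1| + |r2| + |output|), while the worst case (every couplet sharing one join value) remains quadratic, matching the point above.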
This is a similar question to Linear Probing Runtime but it regards quadratic probing.
It makes sense to me that "the theoretical worst case is O(n)" for linear probing, because in the worst case you may have to traverse every bucket (n buckets).
What would the runtime be for quadratic probing? I know that quadratic probing advances in a quadratic fashion: 1, 4, 9, 16, .... My initial thought was that it would be some variation of log n, since the offsets grow so quickly, but there isn't a consistent base.
If there are n - 1 occupied buckets in your hash table, then regardless of the sequence in which you check for an empty bucket, you cannot rule out the possibility that you will need to test n buckets before finding an empty one. The worst case for quadratic probing therefore cannot be any better than O(n).
It could be worse, however: it's not immediately clear to me that quadratic probing will do a good job of avoiding testing the same bucket more than once. (That's not an issue with linear probing if you choose a step size that is relatively prime to the number of buckets.) I would guess that quadratic probing doesn't revisit the same buckets enough times to make the worst case worse than O(n), but I cannot prove it.
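A quick, purely illustrative experiment makes the bucket-revisiting question concrete. With a prime table size m and probe offsets i^2, the first ceil(m/2) probes are known to hit distinct buckets, after which the sequence starts repeating (the helper name below is mine):

def probe_sequence(h, m, num_probes):
    # Quadratic probing: the bucket visited on the i-th probe is (h + i*i) % m.
    return [(h + i * i) % m for i in range(num_probes)]

m = 11                                    # prime table size
seq = probe_sequence(0, m, m)
print(seq)                                # [0, 1, 4, 9, 5, 3, 3, 5, 9, 4, 1]
print(len(set(seq)), "distinct buckets")  # 6, i.e. ceil(m/2)

With this probe sequence and a prime table size, the probes only ever reach about half of the buckets, so the worst case cannot get worse than O(n), but an insertion can fail to find a free bucket unless the load factor is kept at or below 1/2 (which is why implementations typically enforce that bound or use probe-sequence variants designed to cover the whole table).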
I have questions about real application performance running on a cluster vs cluster peak performance.
Let's say one HPC cluster report that it has peak performance of 1 Petaflops. How is this calculated?
To me, it seems that there are two measures. One is the performance calculated from the hardware specifications; the other comes from running HPL. Is my understanding correct?
When I read about one real application running on the system at full scale, the developers mention that it achieves about 10% of peak performance. How is this measured, and why can't it reach peak performance?
Thanks
Peak performance is what the system is theoretically able to deliver. It is the product of the total number of CPU cores, the core clock frequency, and the number of FLOPs one core makes per clock tick. That performance can never be reached in practice because no real application consists of 100% fully vectorised tight loops that only operate on data held in the L1 data cache. In many cases data doesn't even fit in the last-level cache and the memory interface is usually not fast enough to deliver data at the same rate at which the CPU is able to process it. One ubiquitous example from HPC is the multiplication of a sparse matrix with a vector. It is so memory intensive (i.e. many loads and stores per arithmetic operation) that on many platforms it only achieves a fraction of the peak performance.
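As a rough worked example of how that peak figure is computed, with entirely made-up numbers:

# Hypothetical cluster, purely to illustrate the peak-performance formula.
nodes           = 1000        # compute nodes
cores_per_node  = 32          # CPU cores per node
clock_hz        = 2.5e9       # 2.5 GHz core clock
flops_per_cycle = 16          # e.g. two 256-bit FMA units in double precision

peak = nodes * cores_per_node * clock_hz * flops_per_cycle
print(peak / 1e15, "PFLOP/s")  # 1.28 PFLOP/s for these numbers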
Things get even worse when multiple nodes are networked together on a massive scale, as data transfers can introduce huge additional delays. Performance in those cases is determined mainly by the ratio of local data processing to data transfer. HPL is particularly good in that respect - it does a lot of vectorised local processing and does not move much data across the CPUs/nodes. That's not the case with many real-world parallel programs, and it is also the reason why many people question the applicability of HPL for assessing cluster performance nowadays. Alternative benchmarks are already emerging, for example the HPCG benchmark (from the people who brought you HPL).
The theoretical (peak) value is based on the capability of each individual core in the cluster, which depends on clock frequency, number of floating point units, parallel instruction issuing capacity, vector register sizes, etc. which are design characteristics of the core. The flops/s count for each core in the cluster is then aggregated to get the cluster flops/s count.
For a car the equivalent theoretical performance would be the maximum speed it can reach given the specification of its engine.
For a program to reach the theoretical count, it has to perform specific operations in a specific order so that the instruction-level parallelism is maximum and all floating-point units are working constantly without delay due to synchronization or memory access, etc. (See this SO question for more insights)
For a car, it is equivalent to measuring top speed on a straight line with no wind.
But of course, chances that such a program computes something of interest are small. So benchmarks like HPL use actual problems in linear algebra, with a highly optimized and tuned implementation, but which is still imperfect due to IO operations and the fact that the order of operations is not optimal.
For a car, it could be compared to measuring the top average speed on a race track with straight lines, curves, etc.
If the program requires a lot of network or disk communication, which are operations that take many clock cycles, then the CPU often has to sit idle waiting for data before it can perform arithmetic operations, effectively wasting a lot of computing power. The actual performance is then estimated by dividing the number of floating-point operations (additions and multiplications) the program performs by the time it takes to perform them.
For a car, this would correspond to measuring the top average speed in town with red lights, etc. by calculating the length of the trip divided by the time needed to accomplish it.
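Putting hypothetical numbers on that estimate:

# All values are made up; this only shows how the "10% of peak" figure arises.
flop_count = 3.6e18      # floating-point operations performed by the run
elapsed_s  = 28800.0     # wall-clock time of the run in seconds (8 hours)
peak       = 1.25e15     # theoretical peak of the cluster in FLOP/s

sustained = flop_count / elapsed_s
print(sustained / 1e15, "PFLOP/s sustained")   # 0.125
print(100 * sustained / peak, "% of peak")     # 10.0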
Long time reader, first time inquisitor. So I am currently hitting a serious bottleneck from the following code:
% Tile psi along a new trailing dimension of length no_bands, then
% multiply umodes by it pointwise, slice by slice.
for kk = 1:KT
    parfor jj = 1:KT
        umodes(kk,jj,:,:,:) = repmat(squeeze(psi(kk,jj,:,:)), [1,1,no_bands]) .* squeeze(umodes(kk,jj,:,:,:));
    end
end
In plain language, I need to tile the multi-dimensional array 'psi' across another dimension of length 'no_bands' and then perform pointwise multiplication with the matrix 'umodes'. The issue is that each of the arrays I am working with is large, on the order of 20 GB or more.
What is happening then, I suspect, is that my processors grind to a halt due to cache limits or because data is being paged. I am reasonably convinced there is no practical way to reduce the size of my arrays, so at this point I am trying to reduce computational overhead to a bare minimum.
If that is not possible, it might be time to think of using a proper programming language where I can enforce pass by reference to avoid unnecessary replication of arrays.
Often bsxfun uses less memory than repmat. So you can try:
for kk = 1:KT
    parfor jj = 1:KT
        % bsxfun expands psi's singleton trailing dimension across no_bands
        % without materialising the tiled copy that repmat would create.
        umodes(kk,jj,:,:,:) = bsxfun(@times, squeeze(psi(kk,jj,:,:)), squeeze(umodes(kk,jj,:,:,:)));
    end
end
Or you can vectorize the two loops. Vectorizing is usually faster, although not necessarily more memory-efficient, so I'm not sure it helps in your case. In any case, bsxfun benefits from multithreading:
umodes = bsxfun(@times, psi, umodes);
Is an algorithm with running time Θ(n) always faster than an algorithm with running time Θ(n^2)? If so, can you provide explicit examples? I understand that an algorithm like Quicksort can have O(n log n) expected running time but O(n^2) in the worst case. I presume that if the same principle of expected/worst case applies to theta, then the above claim could be false. Understanding how theta works will help me understand the relationship between theta and big-O.
When $n$ is large enough, the algorithm with complexity $\Theta(n)$ will run faster than the algorithm with complexity $\Theta(n^2)$. In fact, $n / n^2 \to 0$ as $n \to \infty$. However, there may be small values of $n$ for which the $\Theta(n)$ algorithm is the slower of the two.
It's not always faster, only asymptotically faster (as n grows without bound). But beyond some n, yes, it is always faster.
For example, for small n a bubble sort may run faster than quicksort simply because it is simpler (the constants hidden in its θ are lower).
This has nothing to do with expected/worst cases: choosing which case to analyze is a separate problem that is unrelated to theta or big-O.
And about the relationship between theta and big-O: in computer science, big-O is often (mis)used in the sense of θ, but strictly speaking big-O is a wider class than θ: it bounds only the growth of a function from above, while theta bounds it from both above and below. E.g. when somebody says that Quicksort has a complexity of O(n log n), they usually actually mean θ(n log n).
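For reference, the formal definitions make that relationship explicit: f(n) = O(g(n)) means there exist constants c > 0 and n0 such that f(n) ≤ c·g(n) for all n ≥ n0 (an upper bound only), while f(n) = θ(g(n)) means there exist constants c1, c2 > 0 and n0 such that c1·g(n) ≤ f(n) ≤ c2·g(n) for all n ≥ n0 (both an upper and a lower bound). So every function that is θ(g(n)) is also O(g(n)), but not the other way around.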
You are on the right track of thought.
The actual runtime of a program can be quite different from its asymptotic bounds. This is a fundamental consequence of the way asymptotic notation is defined.
You can read my answer here to clarify.