Performance of immutable set implementations in Scala

I have recently been diving into Scala and (perhaps predictably) have spent quite a bit of time studying the immutable collections API in the Scala standard library.
I am writing an application that necessarily does many +/- operations on large sets. For this reason, I want to ensure that the implementation I choose is a so-called "persistent" data structure so that I avoid doing copy-on-write. I saw this answer by Martin Odersky, but it didn't really clear up the issue for me.
I wrote the following test code to compare the performance of ListSet and HashSet for add operations:
import scala.collection.immutable._

object TestListSet extends App {
  var set = new ListSet[Int]
  for (i <- 0 to 100000) {
    set += i
  }
}

object TestHashSet extends App {
  var set = new HashSet[Int]
  for (i <- 0 to 100000) {
    set += i
  }
}
Here is a rough runtime measurement of the HashSet:
$ time scala TestHashSet
real 0m0.955s
user 0m1.192s
sys 0m0.147s
And ListSet:
$ time scala TestListSet
real 0m30.516s
user 0m30.612s
sys 0m0.168s
Cons on a singly linked list is a constant-time operation, but this performance looks linear or worse. Is this performance hit related to the need to check each element of the set for object equality to conform to the no-duplicates invariant of Set? If this is the case, I realize it's not related to "persistence".
As for official documentation, all I could find was the following page, but it seems incomplete: Scala 2.8 Collections API -- Performance Characteristics. Since ListSet seems initially to be a good choice for its memory footprint, maybe there should be some information about its performance in the API docs.

An old question but also a good example of conclusions being drawn on the wrong foundation.
Connor, basically you're trying to do a microbenchmark. That is generally not recommended and damn hard to do properly.
Why? Because the JVM is doing many other things than executing the code in your examples. It's loading classes, doing garbage collection, compiling bytecode to native code, etc. All dynamically and based on different metrics sampled at runtime.
So you cannot conclude anything about the performance of the two collections from the test code above. For example, what you might actually be measuring is the compilation time of the += method for HashSet and the garbage collection time for ListSet. So it's a comparison of apples and oranges.
To do a microbenchmark properly, you should:
1. Warm up the JVM: load all classes, make sure every code path in the benchmark is exercised, and let the hot spots (e.g. the += method) be compiled.
2. Run the benchmark and make sure neither the GC nor the compiler runs during the measurement (use the JVM flags -XX:+PrintCompilation and -XX:+PrintGC to see when they do). If either runs during a measurement, discard that result.
3. Repeat step 2 until you have 10-15 good measurements. Calculate the mean, variance, and standard deviation.
4. Evaluate: if the intervals mean +/- 3 standard deviations of the two benchmarks do not overlap, you can draw a conclusion about which is faster. Otherwise the result is blurry (depending on the amount of overlap).
I can recommend reading Oracle's recommendations for doing micro benchmarks and a great article about benchmark pitfalls by Brian Goetz.
Also, if you want to use a good tool, which does all the above for you, try Caliper by Google.
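To make those steps concrete, here is a rough manual sketch in Scala (the iteration counts and the measured operation are illustrative, not tuned values; a harness such as Caliper does all of this far more rigorously):

import scala.collection.immutable.HashSet

object RoughBenchmark extends App {
  // The operation under test: build a set with repeated += (immutable update).
  def buildSet(n: Int): HashSet[Int] = {
    var set = HashSet.empty[Int]
    var i = 0
    while (i <= n) { set += i; i += 1 }
    set
  }

  // Step 1: warm up so the += path is loaded and JIT-compiled before measuring.
  (1 to 20).foreach(_ => buildSet(100000))

  // Steps 2-3: take several samples; run with -XX:+PrintCompilation and
  // -XX:+PrintGC and discard any sample during which either was reported.
  val samples = (1 to 15).map { _ =>
    val start = System.nanoTime()
    buildSet(100000)
    (System.nanoTime() - start) / 1e6 // milliseconds
  }

  // Step 4: report mean and standard deviation so the overlap can be judged.
  val mean = samples.sum / samples.size
  val std  = math.sqrt(samples.map(s => (s - mean) * (s - mean)).sum / samples.size)
  println(f"mean = $mean%.2f ms, std = $std%.2f ms")
}

The same driver can be pointed at ListSet to get a comparable pair of mean +/- std intervals.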

The key line from the ListSet source is (within subclass Node):
override def +(e: A): ListSet[A] = if (contains(e)) this else new Node(e)
where you can see that an item is added only if it is not already contained, so adding to the set is O(n). You can generally assume that XMap has performance characteristics similar to XSet, and ListMap is listed as linear time all the way along. This is why ListSet is slow for additions, and it is how a set has to behave if it is to maintain its no-duplicates invariant.
P.S. In the TestHashSet case you're largely measuring JVM startup time; the real difference is far more than 30x.

Since a set must contain no duplicates, it has to check whether it already contains an element before adding it. In a list that gives no guarantee about an element's position, this search takes O(N) linear time. The same general idea applies to the remove operation.
With a HashSet, the class defines a hash function that computes a location for any element in O(1), which makes the contains(element) check much quicker, at the expense of taking up more space to reduce the chance of element location collisions.

Related

Why is HashSet<int> much slower than HashSet<String> in dart?

I noticed a HashSet<int> performing very slowly when working on a Flutter project. I had about 20,000 integers in a Set, and checking set.contains() took a very long time. But when I used toString() to convert all items to strings, it performed 1000x faster.
I then tried to create a minimal reproducible code with 10 million random integers, but I couldn't reproduce the issue. Turns out, something special about these data caused the extreme slowness. I've attached a test code (and data) at the end of this question.
How to reproduce:
First, click "add int" button to add 14790 integers to a set. Then click "query int" (runs set.contains(123)) and "query string" (runs set.contains('123')). Observe that: 1. both operations are super slow; 2. the int version is slower than the string version. Picture:
Then click "clear items", then "add string" to add the toString() version of the data. Then click "query int" and "query string" again, notice how much faster it becomes. Picture:
Lastly, click both "add int" and "add string" to create a mixed set (with twice the entries). Observe that the querying times dropped in half for the int version, as if the faster strings helped "dilute" the problem. Picture:
I've had several friends running the same test code on various machines (intel i5, apple M1, snapdragon), timings are different but the conclusions are the same.
What's not the answer here:
Here are some things I considered, but further tests showed they couldn't explain what's happening.
Maybe int needs boxing, whereas string is already an object?
That's probably not the issue here. With 1 million randomly generated values, ints performed faster than strings.
string is immutable so their hash value could be cached?
I don't know if they are cached, but this doesn't explain the results observed with 1 million randomly generated values.
int hash resulted in a lot of collisions?
I tried to print out .hashCode for all ints and strings in the data set, and verified they are all unique.
Test code:
The full test code with data is too long for StackOverflow, I've put it here https://pastebin.com/raw/4fm2hKQB instead.
So yeah, I'm lost, if anyone could help me understand what's going on that'll be greatly appreciated!
I commented on the issue in the Dart repo. For completeness I will mention the 'answer' part of the comment here.
The implementations of HashSet and LinkedHashSet make the assumption that the key.hashCode values are 'good' hash codes that are reasonably distributed over a range of integers so that the lower N bits do not collide or nearly collide to 'bunch up' in the hash-table. Unfortunately int.hashCode does not have this property as it is effectively the identity function.
Things go wrong when the lower bits of all the keys are the same (or take only a few of the possible values), so taking the lower N bits gives the same effective hash code value. This is just the power-of-two version of the % 1000 example mentioned by @ch271828n.
@ch271828n mentions using a different hashCode. This is probably the best short-term solution. Use LinkedHashSet(hashCode: dispersedHashCode) with something like this:
int dispersedHashCode(e) { // untested!
  int hash = e.hashCode;
  // Odd number with 30%-50% of the bits set in an irregular pattern.
  hash *= 0x1736B4D29;
  hash += hash >>> 20;
  // Maybe do it again so that bits higher than 20 influence the low bits.
  return hash;
}
Something like this would ideally be built into the core library's hashed structures. That might take a long time since, realistically, a performance issue with a simple work-around will likely be prioritized behind security bugs, incorrect-behaviour bugs, performance issues with no work-around, and new features that enable customers to do things that are otherwise impossible or difficult to do.
A completely different approach would be to use an ordered Set like SplayTreeSet.
I also suspect a hash collision problem.
int hash resulted in a lot of collisions?
I tried to print out .hashCode for all ints and strings in the data set, and verified they are all unique.
Well, "all unique" does not mean "there is no collision". For a hash set, the number of bins are much less than the number of hashcode. For example, suppose you have a hash set with 1000 bins, and the mapping from hashCode to bin index is a simple bin index := hashCode % 1000, and suppose your data has hashCode like 0, 1000, 2000, 3000 etc. In this artificial case, your data has all unique hashCode, but they all fall into the first bin out of the 1000 bins. Huge collision!
A simple approach to debug whether it is the problem of hashcode: Re-run the program with LinkedHashSet(hashCode: (e) => some_other_hash_approach(e), equals: ...). By using such a new hash set, you can test on other hashCode generating functions. If some hashCode generating functions do not result in the same extremely slow speed, it is highly because of the original hashCode function which causes collision.
In addition, you can even use the same hashCode method for both the int and the String case. Then you guarantee that both cases have exactly the same collision behavior. Then it is easy to see whether collision is the cause, or is unrelated.
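To see how unique hash codes can still pile into one bucket, here is a tiny Scala sketch of the same arithmetic (the 1024-bucket, power-of-two table is an assumption for illustration; the principle is language-independent):

// Ten distinct hash codes whose low bits are all zero.
val hashes  = (0 until 10).map(_ * 1024)
// A power-of-two table takes the bucket index from the low 10 bits.
val buckets = hashes.map(h => h & 1023)
println(hashes)  // Vector(0, 1024, 2048, ..., 9216) -- all distinct
println(buckets) // Vector(0, 0, 0, ..., 0)          -- every key lands in bucket 0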
Another debug approach: Look at the C++ source code of LinkedHashSet, and see what algorithms it uses to assign data to bins. Then check whether collision as mentioned above happens or not.
A third debug method: Compile the pure-Dart program into an executable, and use profilers like perf to run it. Then you can see which code is hottest and consume most of the time. You may need debug symbols of Dart's native C++ code, which should be fetchable.

What is the benefit of an effect system (e.g. ZIO)?

I'm having a hard time understanding what value effect systems like ZIO or Cats Effect actually provide.
It does not make code readable, e.g.:
val wrappedB = for {
  a <- getA() // : ZIO[R, E, A]
  b <- getB(a) // : ZIO[R, E, B]
} yield b
is no more readable to me than:
val a = getA() // : A
val b = getB(a) // : B
I could even argue that the latter is more straightforward, because calling a function executes it, instead of just creating an effect or execution pipeline.
Delayed execution does not sound convincing, because all the examples I've encountered so far just execute the pipeline right away anyway. Being able to execute effects in parallel or multiple times can be achieved in simpler ways IMHO, e.g. C# has Parallel.ForEach.
Composability. Functions can be composed without using effects, e.g. by plain composition.
Pure functional methods. In the end the impure instructions will be executed anyway, so it seems like it's just pretending DB access is pure. It does not help with reasoning, because while construction of the instructions is pure, executing them is not.
I may be missing something or just downplaying the benefits above or maybe benefits are bigger in certain situations (e.g. complex domain).
What are the biggest selling points to use effect systems?
Because it makes it easy to deal with side effects. From your example:
a <- getA() // ZIO[R, E, A] (doesn't have to be ZIO btw)
val a = getA(): A
The first getA accounts for the effect and the possibility of returning an error, i.e. a side effect. This would be like getting an A from some DB where the said A may not exist or where you lack permission to access it. The second getA would be like a simple def getA = "A".
How do we put these methods together? What if one throws an error? Should we still proceed to the next method or just quit? What if one blocks your thread?
Hopefully that addresses your second point about composability. To quickly address the rest:
Delayed execution. There are probably two reasons for this. The first is that you don't want an execution to start accidentally, just because you wrote it down; that breaks what the cool kids call referential transparency. The second is that concurrent execution requires a thread pool or execution context. Normally we want a centralized place where we can fine-tune it for the whole app, and when building a library we can't provide it ourselves; it's the users who provide it. In fact, we can also defer the choice of effect: all you do is define how the effect should behave, and the users can run it with ZIO, Monix, etc.; it's totally up to them.
Purity. Technically speaking, wrapping a process in a pure effect doesn't necessarily make the underlying process pure. Only the implementation knows whether it really is. What we can do is lift it so that it becomes compatible with the rest of the composition.
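To illustrate the "describe now, run later" point, here is a minimal toy effect type in Scala (this is not ZIO or Cats Effect, just a sketch; the getA/getB bodies are made up for the example):

// Constructing an IO value does nothing; only run() executes the side effects,
// and failures travel in the Left channel instead of blowing up the program.
final case class IO[+A](run: () => Either[Throwable, A]) {
  def flatMap[B](f: A => IO[B]): IO[B] = IO(() => run().flatMap(a => f(a).run()))
  def map[B](f: A => B): IO[B] = flatMap(a => IO(() => Right(f(a))))
}

object IO {
  def attempt[A](a: => A): IO[A] =
    IO(() => try Right(a) catch { case e: Throwable => Left(e) })
}

object Demo extends App {
  // Hypothetical lookups: getA fails, and the failure short-circuits getB.
  def getA(): IO[String]       = IO.attempt(sys.error("no permission"))
  def getB(a: String): IO[Int] = IO.attempt(a.length)

  val program: IO[Int] = for {
    a <- getA()
    b <- getB(a)
  } yield b

  // Nothing has run yet; the error only surfaces when the description is executed.
  println(program.run()) // Left(java.lang.RuntimeException: no permission)
}

Real effect systems add typed errors, an environment, resource safety, and concurrency on top of this basic shape.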
What makes programming with ZIO or Cats great is concurrent programming. There are other reasons too, but this one is IMHO where I got the "Aha! Now I get it" moment.
Try to write a program that monitors the content of several folders and, for each file added to them, parses its content, but with no more than 4 files being parsed at the same time. (Like the example in the video "What Java developers could learn from ZIO" by Adam Fraser on YouTube: https://www.youtube.com/watch?v=wxpkMojvz24.)
I mean, this is really easy to write in ZIO :)
The whole idea of combining data structures (a ZIO is a data structure) to build bigger data structures is so easy to understand that I would not want to code without it for complex problems :)
The two examples are not comparable: in the first form, an error in the first statement marks the resulting value (the reified sequence) as failed, while in the second form it halts the whole program. To properly encapsulate the two statements, the second form would have to become a function definition, followed by an assignment of the result of its call.
But more than that, in order to completely mimic the first form, additional code has to be written to catch exceptions and build a proper failed result, while ZIO gives you all of this for free...
I think the ability to cleanly propagate the error state between successive statements is the real value of the ZIO approach. Any composite ZIO program fragment is then fully composable itself.
That's the main benefit of any workflow-based approach, anyway.
It is this modularity that gives effect handling its real value.
Since an effect is an action that structurally may produce errors, handling effects like this is an excellent way to handle errors in a composable way. In fact, handling effects largely consists of handling errors!

Incremental SAT Solving: save solving instance - change model between runs

From my understanding, incremental SAT solving helps to evaluate different models that are quite close to each other.
I want to use this to evaluate a model and, if I change it later, re-evaluate it using the previous solution for a faster result. However, after looking into various SAT solvers (Sat4J, MiniSat, MathSAT5), it seems they can only solve incrementally when all models are presented within one run.
I'm quite new to SAT solving so I might be overlooking something. Is there no way to save a solving instance for later use? Is all learning lost on closing the instance?
In incremental mode, you can feed the solver with new constraints.
Depending on the settings, the solver may or may not forget previous learned clauses and heuristics.
To fully take advantage of the incremental mode and discard previously entered constraints in the system, you need to use "assumptions", i.e. specific variables which will activate or disable the constraints in the solver.
See e.g. this discussion in minisat newsgroup: https://groups.google.com/forum/#!topic/minisat/ffXxBpqKh90
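As a sketch of the assumptions (selector-variable) idea, here is a tiny SAT4J example driven from Scala (SAT4J is a plain JVM library, so the calls are the same as in Java; the variable numbers and the two-clause formula are made up for illustration):

import org.sat4j.core.VecInt
import org.sat4j.minisat.SolverFactory

object AssumptionsDemo extends App {
  val solver = SolverFactory.newDefault()
  solver.newVar(2) // x1 is a problem variable, x2 is a selector

  // Permanent clause: (x1)
  solver.addClause(new VecInt(Array(1)))
  // "Removable" clause (not x1), guarded by selector x2: encoded as (not x1 or not x2)
  solver.addClause(new VecInt(Array(-1, -2)))

  // Assuming x2 activates the guarded clause: it contradicts (x1), so UNSAT.
  println(solver.isSatisfiable(new VecInt(Array(2))))  // false
  // Assuming (not x2) switches the clause off without deleting it; learned clauses survive.
  println(solver.isSatisfiable(new VecInt(Array(-2)))) // true
}

To delete the guarded clause permanently, one would add the unit clause (not x2) and never assume x2 again.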
SAT4J provides a mechanism which allows you to feed the solver and then remove some of the clauses and add new clauses before the next check for satisfiability. Clauses to be removed need to be added to a ConstrGroup. Unfortunately, it is slightly more complicated than that, as unit clauses need special handling. It works roughly like this:
ISolver solver = ...; // initialized with the clauses that are never removed
boolean satisfiable;
ConstrGroup group = new ConstrGroup();
IVecInt unit = new VecInt();
try {
    for (/* all clauses to be added and later removed */) {
        if (/* unit clause */) {
            unit.push(/* variable from the clause */);
        } else {
            group.add(solver.addClause(clause));
        }
    }
    satisfiable = solver.isSatisfiable(unit);
} catch (ContradictionException e) {
    satisfiable = false;
} finally {
    group.removeFrom(solver);
}
Unfortunately, the removal of clauses is implemented in a rather naive way and requires quadratic effort in the number of clauses to be removed.
While this solution works in FeatureIDE (see isSatisfiable(Node node) in https://github.com/FeatureIDE/FeatureIDE/blob/develop/plugins/de.ovgu.featureide.fm.core/src/org/prop4j/SatSolver.java), it is likely that there are way more performant solutions out there.
The other solution, with assumptions, does not work in our case, as we have millions of queries to a single SAT solver instance with up to 20,000 variables. Assumptions would increase the number of variables from 20 thousand to a million, which is unlikely to help.

Implementing a priority queue in MATLAB in order to solve optimization problems using branch and bound

I'm trying to code a priority queue in MATLAB. I know there is a Simulink block for a priority queue, but I want to code it in MATLAB itself. I have pseudocode that uses a priority queue for a method called best-first search with branch and bound. Branch and bound is an algorithm design strategy based on a state space tree, used to solve optimization problems (simple explanation of what branch and bound is).
I have read chapter 5, Branch and Bound, of the book 'Foundations of Algorithms' (4th edition) by Richard Neapolitan and Kumarss Naimipour. The text covers designing algorithms, complexity analysis of algorithms, and computational complexity (analysis of problems); it is a very interesting book, and I came across this pseudocode:
void BeFS(state_space_tree T, number& best)
{
    priority_queue_of_node PQ;
    node u, v;
    initialize(PQ);                       % initialize PQ to be empty
    v = root of T;
    best = value(v);
    insert(PQ, v);                        % insert(PQ, v) adds v to the priority queue PQ
    while (!empty(PQ)) {                  % remove node with best bound
        remove(PQ, v);                    % remove(PQ, v) removes the node with the best bound and assigns it to v
        if (bound(v) is better than best) % check if node is still promising
            for (each child u of v) {
                if (value(u) is better than best)
                    best = value(u);
                if (bound(u) is better than best)
                    insert(PQ, u);
            }
    }
}
I don't know how to code this in MATLAB. Branch and bound is an interesting general algorithm for finding optimal solutions to various optimization problems, especially in discrete and combinatorial optimization: instead of relying on heuristics, it prunes the search space, which reduces calculation time and finds the optimal solution faster.
EDIT:
I have checked everywhere for an existing implementation before posting a question here, and I came here to get ideas on how to get started implementing this code.
I have included this in your post so people know better what you expect of them. However, 'ideas on how to get started implementing' is still not much more specific than 'how to write code in MATLAB'.
However, I will still try to answer:
1. Lay out the structure of the code: write the basic loops and fill them with comments describing what you want to do.
2. Pick (the easiest or first) one of those comments and see whether you can make it happen in a few lines; you can test it by generating some dummy input for that piece of code.
3. Keep repeating step 2 until all comments have the required code.
If you get stuck in one of the blocks, and have searched but not found the answer to a specific question, then this is not a bad place to ask.

Does MATLAB perform tail call optimization?

I've recently learned Haskell, and am trying to carry the pure functional style over to my other code when possible. An important aspect of this is treating all variables as immutable, i.e. constants. In order to do so, many computations that would be implemented using loops in an imperative style have to be performed using recursion, which typically incurs a memory penalty due to the allocation of a new stack frame for each function call. In the special case of a tail call (where the value returned by the called function is immediately returned again by its caller), however, this penalty can be bypassed by a process called tail call optimization (in one method, this is done by essentially replacing the call with a jmp after setting up the stack properly). Does MATLAB perform TCO by default, or is there a way to tell it to?
If I define a simple tail-recursive function:
function tailtest(n)
  if n==0; feature memstats; return; end
  tailtest(n-1);
end
and call it so that it will recurse quite deeply:
set(0,'RecursionLimit',10000);
tailtest(1000);
then it doesn't look as if stack frames are eating a lot of memory. However, if I make it recurse much deeper:
set(0,'RecursionLimit',10000);
tailtest(5000);
then (on my machine, today) MATLAB simply crashes: the process unceremoniously dies.
I don't think this is consistent with MATLAB doing any TCO; the case where a function tail-calls itself, only in one place, with no local variables other than a single argument, is just about as simple as anyone could hope for.
So: No, it appears that MATLAB does not do TCO at all, at least by default. I haven't (so far) looked for options that might enable it. I'd be surprised if there were any.
In cases where we don't blow out the stack, how much does recursion cost? See my comment to Bill Cheatham's answer: it looks like the time overhead is nontrivial but not insane.
... Except that Bill Cheatham deleted his answer after I left that comment. OK. So, I took a simple iterative implementation of the Fibonacci function and a simple tail-recursive one, doing essentially the same computation in both, and timed them both on fib(60). The recursive implementation took about 2.5 times longer to run than the iterative one. Of course the relative overhead will be smaller for functions that do more work than one addition and one subtraction per iteration.
(I also agree with delnan's sentiment: highly-recursive code of the sort that feels natural in Haskell is typically likely to be unidiomatic in MATLAB.)
There is a simple way to check this. Create this function tail_recursion_check:
function r = tail_recursion_check(n)
if n > 1
    r = tail_recursion_check(n - 1);
else
    error('error');
end
end
and run tail_recursion_check(10), for example. You are going to see a very long stack trace with 10 items that says error at line 3. If there were tail call optimization, you would only see one.