Why is HashSet<int> much slower than HashSet<String> in dart? - flutter

I noticed a HashSet<int> performing very slowly while working on a Flutter project. I had about 20,000 integers in a Set, and checking set.contains() took a very long time. But when I used toString() to convert all the items to strings, it performed about 1000x faster.
I then tried to create a minimal reproducible example with 10 million random integers, but I couldn't reproduce the issue. It turns out something special about this particular data causes the extreme slowness. I've attached test code (and the data) at the end of this question.
How to reproduce:
First, click "add int" button to add 14790 integers to a set. Then click "query int" (runs set.contains(123)) and "query string" (runs set.contains('123')). Observe that: 1. both operations are super slow; 2. the int version is slower than the string version. Picture:
Then click "clear items", then "add string" to add the toString() version of the data. Then click "query int" and "query string" again, notice how much faster it becomes. Picture:
Lastly, click both "add int" and "add string" to create a mixed set (with twice the entries). Observe that the querying times dropped in half for the int version, as if the faster strings helped "dilute" the problem. Picture:
I've had several friends run the same test code on various machines (Intel i5, Apple M1, Snapdragon); the timings differ, but the conclusions are the same.
What's not the answer here:
Here are some explanations I considered, but further testing ruled them out.
Maybe int needs boxing, whereas string is already an object?
That's probably not the issue here. With 1 million randomly generated values, ints performed faster than strings.
Strings are immutable, so their hash values could be cached?
I don't know whether they are cached, but this doesn't explain the results observed with 1 million randomly generated values.
int hash resulted in a lot of collisions?
I tried to print out .hashCode for all ints and strings in the data set, and verified they are all unique.
Test code:
The full test code with data is too long for Stack Overflow, so I've put it at https://pastebin.com/raw/4fm2hKQB instead.
So yeah, I'm lost, if anyone could help me understand what's going on that'll be greatly appreciated!

I commented on the issue in the Dart repo. For completeness I will mention the 'answer' part of the comment here.
The implementations of HashSet and LinkedHashSet make the assumption that the key.hashCode values are 'good' hash codes that are reasonably distributed over a range of integers so that the lower N bits do not collide or nearly collide to 'bunch up' in the hash-table. Unfortunately int.hashCode does not have this property as it is effectively the identity function.
Things go wrong when the lower bits of all the keys are the same (or have only a few of the possible values), so taking the lower N bits gives the same effective hash code value. This is just the power-of-two version of the % 1000 example mentioned by @ch271828n.
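To make the mechanism concrete, here is a minimal sketch (my construction, not part of the original comment) with unique keys whose low 16 bits are all zero; on an SDK affected by this issue, the skewed set should be dramatically slower:

void main() {
  // 20,000 unique ints whose low 16 bits are all zero: if bucket
  // selection uses only the low bits of the hash, they all collide.
  final skewed = <int>{for (var i = 0; i < 20000; i++) i << 16};
  // Control set of the same size, with well-spread low bits.
  final spread = <int>{for (var i = 0; i < 20000; i++) i};

  Duration time(Set<int> s) {
    final sw = Stopwatch()..start();
    for (var i = 0; i < 20000; i++) {
      s.contains(i << 16); // hits in `skewed`, cheap misses in `spread`
    }
    return sw.elapsed;
  }

  print('skewed: ${time(skewed)}'); // dramatically slower on affected SDKs
  print('spread: ${time(spread)}');
}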
@ch271828n mentions using a different hashCode. This is probably the best short-term solution. Use LinkedHashSet(hashCode: dispersedHashCode) with something like this:
int dispersedHashCode(e) { // untested!
  int hash = e.hashCode;
  // Odd number with 30%-50% of the bits set in an irregular pattern.
  hash *= 0x1736B4D29;
  hash += hash >>> 20;
  // Maybe do it again to let bits higher than 20 influence the low bits.
  return hash;
}
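Combined with the function above, usage could look like this (my sketch; equals and hashCode are the parameters the dart:collection LinkedHashSet constructor takes):

import 'dart:collection';

void main() {
  // Plug the dispersing hash into the set so bucket selection also
  // sees the high bits of the key.
  final set = LinkedHashSet<int>(
    equals: (a, b) => a == b,
    hashCode: dispersedHashCode, // the function from the comment above
  );
  set.addAll([for (var i = 0; i < 20000; i++) i << 16]);
  print(set.contains(123)); // fast: buckets are evenly filled
}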
Something like this would ideally be built into the core library's hashed structures. This might take a long time since, realistically, a performance issue with a simple work-around will likely be prioritized behind security bugs, incorrect-behaviour bugs, performance issues with no work-around, and new features that enable customers to do things that are otherwise impossible or difficult to do.
A completely different approach would be to use an ordered Set like SplayTreeSet.
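For instance, a minimal sketch with the same kind of skewed keys: SplayTreeSet (from dart:collection) finds elements by comparison rather than by hashing, so lookups stay O(log n) no matter how degenerate the hash codes are.

import 'dart:collection';

void main() {
  // An ordered set: lookups compare keys, so hashCode never matters.
  final set = SplayTreeSet<int>()
    ..addAll([for (var i = 0; i < 20000; i++) i << 16]);
  print(set.contains(123));     // false, found in O(log n)
  print(set.contains(5 << 16)); // true
}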

I also suspect a hash collision problem.
int hash resulted in a lot of collisions?
I tried to print out .hashCode for all ints and strings in the data set, and verified they are all unique.
Well, "all unique" does not mean "there is no collision". For a hash set, the number of bins are much less than the number of hashcode. For example, suppose you have a hash set with 1000 bins, and the mapping from hashCode to bin index is a simple bin index := hashCode % 1000, and suppose your data has hashCode like 0, 1000, 2000, 3000 etc. In this artificial case, your data has all unique hashCode, but they all fall into the first bin out of the 1000 bins. Huge collision!
A simple way to check whether the hashCode is the problem: re-run the program with LinkedHashSet(hashCode: (e) => some_other_hash_approach(e), equals: ...). Such a set lets you test other hashCode-generating functions; if some of them do not show the same extreme slowness, the original hashCode function is very likely the one causing collisions.
In addition, you can use the same hashCode function for both the int and the String case. That guarantees both cases have exactly the same collision behavior, which makes it easy to see whether collisions are the cause or are unrelated; see the sketch below.
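Here is one way that debugging idea could look in code (a sketch; the helper names sharedHash and debugSet are mine):

import 'dart:collection';

// One hash function for both representations: the String branch parses
// the value first, so 123 and '123' always hash identically.
int sharedHash(Object e) => (e is int ? e : int.parse(e as String)).hashCode;

LinkedHashSet<T> debugSet<T extends Object>() =>
    LinkedHashSet<T>(equals: (a, b) => a == b, hashCode: sharedHash);

void main() {
  final ints = debugSet<int>()
    ..addAll([for (var i = 0; i < 20000; i++) i << 16]);
  final strings = debugSet<String>()
    ..addAll([for (var i = 0; i < 20000; i++) '${i << 16}']);
  // If both sets are now equally fast (or equally slow), the gap in the
  // original test came from the differing hashCode implementations.
  print(ints.contains(123));
  print(strings.contains('123'));
}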
Another debugging approach: look at the (C++) source code behind LinkedHashSet and see what algorithm it uses to assign data to bins, then check whether the collisions described above actually happen.
A third debugging method: compile the pure-Dart program into an executable and run it under a profiler such as perf. Then you can see which code is hottest and consumes most of the time. You may need debug symbols for Dart's native C++ code, which should be obtainable.

Related

What potential error can be caused by retrieving and changing of a value in the same line?

I was asked a question about what would happen if I tried to retrieve a reference value and then change it within the same line of code. My answer was that nothing would happen, because when I tried this before I did not encounter any compiler errors (at least in C# or Java).
What is the real answer to this?
Here is an example in pseudocode:
Module main()
    Call changeNumber(10)
End Module

Module changeNumber(Integer Ref number)
    Set number = number * number
    Display number
End Module
(PS. Sorry for not formatting/creating this post correctly. I'm having a bit of an issue here.)
There would be no unusual side effects, if that's what you're asking. The language specification dictates a specific order of execution (number * number is evaluated first, then assigned to number), which prevents any issues from occurring.
Nothing would happen in your particular pseudocode.
In reference to a question you asked after this question -
Actually, there could be issues in some rare instances, but it depends on how space is allocated for the number and on your coding language. Consider this: you explicitly declare the data type as an int, but the accepted input is a number larger than an int can handle (or a negative number). When you multiply it by itself, the allocated space can be too small for the result; in C such an overflow is undefined behaviour and can crash the program. Depending on which language you're using, and whether it's higher-level than C, it may have checks for this case, but not always.

Which is better in PHP: suppress warnings with '@' or run extra checks with isset()?

For example, if I implement some simple object caching, which method is faster?
1. return isset($cache[$cls]) ? $cache[$cls] : $cache[$cls] = new $cls;
2. return @$cache[$cls] ?: $cache[$cls] = new $cls;
I read somewhere that @ takes significant time to execute (and I wonder why), especially when warnings/notices are actually being issued and suppressed. isset(), on the other hand, means an extra hash lookup. So which is better, and why?
I do want to keep E_NOTICE on globally, both on dev and production servers.
I wouldn't worry about which method is FASTER. That is a micro-optimization. I would worry more about which is more readable code and better coding practice.
I would certainly prefer your first option over the second, as your intent is much clearer. Also, it's best to avoid edge-condition problems by always explicitly testing variables to make sure you are getting what you expect. For example, what if the class stored in $cache[$cls] is not of type $cls?
Personally, if I typically would not expect the index on $cache to be unset, then I would also put error handling in there rather than using ternary operations. If I could reasonably expect that that index would be unset on a regular basis, then I would make class $cls behave as a singleton and have your code be something like
return $cls::get_instance();
The isset() approach is better. It is code that explicitly states the index may be undefined. Suppressing the error is sloppy coding.
According to the article 10 Performance Tips to Speed Up PHP, warnings take additional execution time, and it also claims the @ operator is "expensive":
Cleaning up warnings and errors beforehand can also keep you from
using @ error suppression, which is expensive.
Additionally, @ will not suppress errors with respect to custom error handlers:
http://www.php.net/manual/en/language.operators.errorcontrol.php
If you have set a custom error handler function with
set_error_handler() then it will still get called, but this custom
error handler can (and should) call error_reporting() which will
return 0 when the call that triggered the error was preceded by an @.
If the track_errors feature is enabled, any error message generated by
the expression will be saved in the variable $php_errormsg. This
variable will be overwritten on each error, so check early if you want
to use it.
@ temporarily changes the error_reporting state; that's why it is said to take time.
If you expect a certain value, the first step in validating it is to check that it is defined. If you are getting notices, it's probably because you're missing something. Using isset() is, in my opinion, good practice.
I ran timing tests for both cases, using hash keys of various lengths and various hit/miss ratios for the hash table, both with and without E_NOTICE enabled.
The results: with error_reporting(E_ALL), the isset() variant was faster than @ by some 20-30%. Platform used: command-line PHP 5.4.7 on OS X 10.8.
However, with error_reporting(E_ALL & ~E_NOTICE), the difference was within 1-2% for short hash keys, and up to 10% for longer ones (16 chars).
Note that the isset() variant executes two hash table lookups, whereas the @ variant does only one lookup.
Thus, @ is inferior in all scenarios, and I wonder if there are any plans to optimize it.
I think you have your priorities a little mixed up here.
First of all, if you want a real-world test of which is faster, load-test them. As stated, though, suppressing will probably be slower.
The problem here is that if you have performance issues with regular code, you should be upgrading your hardware or optimizing the overall logic of your code, rather than bypassing proper execution and error checking.
Suppressing errors to steal the tiniest fraction of a speed gain won't do you any favours in the long run, especially since the error may keep happening time and time again and cause your app to run more slowly than if it were caught and fixed.

Implementing snapshot in FRP

I'm implementing an FRP framework in Scala, and I seem to have run into a problem. Motivated by some thinking and by this question, I decided to restrict the public interface of my framework so that Behaviours can only be evaluated in the 'present', i.e.:
behaviour.at(now)
This also falls in line with Conal's assumption in the Fran paper that Behaviours are only ever evaluated/sampled at increasing times. It does restrict the transformations available on Behaviours, but otherwise we run into huge problems with Behaviours that represent some input:
val slider = Stepper(0, sliderChangeEvent)
With this Behaviour, evaluating future values would be incorrect, and evaluating past values would require an unbounded amount of memory (all occurrences of the event used by 'slider' would have to be stored).
I am having trouble with the specification for the 'snapshot' operation on Behaviours given this restriction. My problem is best explained with an example (using the slider mentioned above):
val event = mouseB // an event that occurs when the mouse is pressed
val sampler = slider.snapshot(event)
val stepper = Stepper(0, sampler)
My problem here is that if the 'mouseB' Event has already occurred when this code is executed, then the current value of 'stepper' will be the last 'sample' of 'slider' (its value at the time of the last occurrence). If the time of the last occurrence is in the past, we consequently end up evaluating 'slider' at a past time, which breaks the rule set above (and your original assumption). I can see a couple of ways to solve this:
We 'record' the past (keep hold of all past occurrences in an Event) allowing evaluation of Behaviours with past times (using an unbounded amount of memory)
We modify 'snapshot' to take a time argument ("sample after this time") and enforce that that time >= now
In a more wacky move, we could restrict creation of FRP objects to the initial setup of a program somehow and only start processing events/input after this setup is complete
I could also simply not implement 'sample' or remove 'stepper'/'switcher' (but I don't really want to do either of these things). Has anyone any thoughts on this? Have I misunderstood anything here?
Oh I see what you mean now.
Your "you can only sample at 'now'" restriction isn't tight enough, I think. It needs to be a bit stronger to avoid looking into the past. Since you are using an environmental conception of now, I would define the behavior construction functions in terms of it (so long as now cannot advance by the mere execution of definitions, which, per my last answer, would get messy). For example:
Stepper(i,e) is a behavior with the value i in the interval [now,e1] (where e1 is the
time of first occurrence of e after now), and the value of the most recent occurrence of e afterward.
With this semantics, your prediction about the value of stepper that got you into this conundrum is dismantled, and the stepper will now have the value 0. I don't know whether this semantics is desirable to you, but it seems natural enough to me.
From what I can tell, you are worried about a race condition: what happens if an event occurs while the code is executing.
Purely functional code does not like to have to know that it gets executed. Functional techniques are at their finest in the pure setting, such that it does not matter in what order code is executed. A way out of this dilemma is to pretend that every change happened in one sensitive (internal, probably) piece of imperative code; pretend that any functional declarations in the FRP framework happen in 0 time so it is impossible for something to change during their declaration.
Nobody should ever sleep, or really do anything time sensitive, in a section of code that is declaring behaviors and things. Essentially, code that works with FRP objects ought to be pure, then you don't have any problems.
This does not necessarily preclude running it on multiple threads, but to support that you might need to reorganize your internal representations. Welcome to the world of FRP library implementation -- I suspect your internal representation will fluctuate many times during this process. :-)
I'm confused about your confusion. The way I see it, Stepper will "set" the behavior to a new value whenever the event occurs. So what happens is the following:
At the instant the event mouseB occurs, the value of the slider behavior will be read (the snapshot). This value will then be "set" into the stepper behavior.
So, it is true that the Stepper will "remember" values from the past; the point is that it only remembers the latest value from the past, not everything.
Semantically, it is best to model Stepper as a function like luqui proposes.

Performance of immutable set implementations in Scala

I have recently been diving into Scala and (perhaps predictably) have spent quite a bit of time studying the immutable collections API in the Scala standard library.
I am writing an application that necessarily does many +/- operations on large sets. For this reason, I want to ensure that the implementation I choose is a so-called "persistent" data structure so that I avoid doing copy-on-write. I saw this answer by Martin Odersky, but it didn't really clear up the issue for me.
I wrote the following test code to compare the performance of ListSet and HashSet for add operations:
import scala.collection.immutable._

object TestListSet extends App {
  var set = new ListSet[Int]
  for (i <- 0 to 100000) {
    set += i
  }
}

object TestHashSet extends App {
  var set = new HashSet[Int]
  for (i <- 0 to 100000) {
    set += i
  }
}
Here is a rough runtime measurement of the HashSet:
$ time scala TestHashSet
real 0m0.955s
user 0m1.192s
sys 0m0.147s
And ListSet:
$ time scala TestListSet
real 0m30.516s
user 0m30.612s
sys 0m0.168s
Cons on a singly linked list is a constant-time operation, but this performance looks linear or worse. Is this performance hit related to the need to check each element of the set for object equality to conform to the no-duplicates invariant of Set? If this is the case, I realize it's not related to "persistence".
As for official documentation, all I could find was the following page, but it seems incomplete: Scala 2.8 Collections API -- Performance Characteristics. Since ListSet seems initially to be a good choice for its memory footprint, maybe there should be some information about its performance in the API docs.
An old question but also a good example of conclusions being drawn on the wrong foundation.
Connor, basically you're trying to do a microbenchmark. That is generally not recommended and damn hard to do properly.
Why? Because the JVM is doing many other things than executing the code in your examples. It's loading classes, doing garbage collection, compiling bytecode to native code, etc. All dynamically and based on different metrics sampled at runtime.
So you cannot conclude anything about the performance of the two collections from the above test code. For example, what you are actually measuring could be the compilation time of the += method of HashSet and the garbage-collection time of ListSet. So it would be a comparison between apples and pears.
To do a microbenchmark properly, you should:
Warm up the JVM: load all classes, ensure all code paths in the benchmark are exercised, and let the hot spots in the code be compiled (e.g. the += method).
Run the benchmark and ensure neither the GC nor the compiler runs during the test (use the JVM flags -XX:+PrintCompilation and -XX:+PrintGC to detect this). If either runs during the test, discard the result.
Repeat step 2 and collect 10-15 good measurements. Calculate the variance and standard deviation.
Evaluate: if the means of the two benchmarks +/- 3 standard deviations do not overlap, then you can draw a conclusion about which is faster. Otherwise, the result is blurry (depending on the amount of overlap).
I can recommend reading Oracle's recommendations for doing micro benchmarks and a great article about benchmark pitfalls by Brian Goetz.
Also, if you want to use a good tool, which does all the above for you, try Caliper by Google.
The key line from the ListSet source is (within subclass Node):
override def +(e: A): ListSet[A] = if (contains(e)) this else new Node(e)
where you can see that an item is added only if the set does not already contain it. So adding to the set is O(n). You can generally assume that XMap has performance characteristics similar to XSet, and ListMap is listed as linear time all the way along. This is why adding to a ListSet is linear, and it is how a set is supposed to behave.
P.S. In the TestHashSet case you're measuring startup time. It's way more than 30x faster.
Since a set must contain no duplicates, before adding an element a Set has to check whether it already contains that element. In a list that gives no guarantee about an element's position, this search takes O(N) linear time. The same general idea applies to the remove operation.
With a HashSet, the class defines a function that picks a location for any element in O(1), which makes the contains(element) check much quicker, at the expense of extra space that reduces the chance of collisions between element locations.

How to reset Ada.Real_Time.Clock?

When reading Ada.Real_Time.Clock right after power-up, it shows a value that isn't close to zero and is sometimes even negative.
As far as I know, Ada.Real_Time.Clock is supposed to reset on power-up.
How can I reset Ada.Real_Time.Clock?
Thanks.
The Ada 2005 LRM declares that "real time is defined to be the physical time as observed in the external environment." [emphasis added--MC]
"It is not specified by the language whether the time values are synchronized with any standard time reference. For example, E can correspond to the time of system initialization or it can correspond to the epoch of some time standard." (D.8[18-19])
As it states, Ada does not require that "E", the start of the epoch serving as the "zero time" for real-time Time values, correspond to any particular starting point; it's left up to the compiler implementer.
Whatever specific numeric values you observe for the instances of Time you're seeing, whether near or far from zero, positive or negative, depend solely on the compiler implementer's choice of E, how it represents time values, and how it correspondingly implements the real-time capability.
Therefore you should avoid writing code that depends on specific, knowable values of Time, or code that requires Time values to be intimately manipulable.
Real_Time.Time values should be considered abstract quantities.
Agreeing with Marc. While I have seen some platforms that use time since boot (particularly on Intel platforms, where I think they like to use the processor's cycle counter), that is entirely up to the compiler vendor.
If you need something like "time since startup" and your platform isn't giving you that, then the thing to do would be to grab Real_Time.Clock when you start up, and subtract that value from all further reads from Real_Time.Clock.
You can look at exactly which facilities are defined for the Real_Time package, including all the LRM sections Marc quoted, at its LRM page here.
It was long ago, but in case it helps someone:
I reset the clock by writing 0 to the time-base registers of the MCU.
That's a lovely explanation, but what if someone is trying to write unit tests against code that uses the Real_Time clock? For instance, I know that my function foo internally compares against Ada.Real_Time.Clock to check for time spans. Before executing foo with the appropriate inputs, I want to reset the clock to force foo down a specific path internally and verify that the resulting out parameter has changed.
return_value := foo;
assert (return_value = path1, "tested foo path1");
Reset_Clock;
return_value := foo;
assert (return_value = path2, "tested foo path2");