What is the mathematical operation used by the Python hash() function?

print(hash('hello world'))
Result (the value differs between runs; see the note on seeding below):
6266945022561323786
What is the mathematical operation used by the Python hash() function?

The Python documentation makes no guarantee about the particular algorithm that is used by hash() (or more precisely, object.__hash__(self)).
The documentation only says this:
object.__hash__(self)
Called by built-in function hash() and for operations on members of hashed collections including set, frozenset, and dict. The __hash__() method should return an integer. The only required property is that objects which compare equal have the same hash value; it is advised to mix together the hash values of the components of the object that also play a part in comparison of objects by packing them into a tuple and hashing the tuple. Example:
def __hash__(self):
return hash((self.name, self.nick, self.color))
The only thing that is guaranteed is this: "The only required property is that objects which compare equal have the same hash value".
There are a couple of desirable properties related to security, safety, and performance, but they are not required.
Every object can implement its own __hash__() however it wants, as long as it satisfies the property that two equal objects have the same hash value. And, in fact, many objects do provide their own implementations.
Even for built-in core objects such as strings, different implementations (and even different versions of the same implementation) use different algorithms. CPython even uses a different seed value every time you run it (again, for security reasons: this hash randomization defends against collision-based denial-of-service attacks).
So, the answer is: you can't know what the algorithm is. All you know is that if a.__eq__(b) is True, then a.__hash__().__eq__(b.__hash__()) is also True.
Most importantly, the converse is not guaranteed: a.__hash__().__eq__(b.__hash__()) being True does not imply that a and b are equal; equivalently, a.__eq__(b) being False does not imply that a.__hash__().__eq__(b.__hash__()) is False.
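To make that guarantee concrete, here is a minimal sketch reusing the fields from the documentation's example (the class itself is illustrative). Equal objects must hash equally; unequal objects are free to collide, and in CPython some small integers really do:

class Person:
    def __init__(self, name, nick, color):
        self.name, self.nick, self.color = name, nick, color

    def __eq__(self, other):
        # Equality is defined over these three fields...
        return (self.name, self.nick, self.color) == (other.name, other.nick, other.color)

    def __hash__(self):
        # ...so the hash mixes exactly the same fields, as the docs advise.
        return hash((self.name, self.nick, self.color))

a = Person('Ada', 'ada', 'green')
b = Person('Ada', 'ada', 'green')
assert a == b and hash(a) == hash(b)  # the one required property

# The converse does not hold: in CPython, hash(-1) is -2 (because -1 is
# reserved as an error code in the C API), so -1 and -2 collide:
print(hash(-1) == hash(-2), -1 == -2)  # prints: True False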

What are the differences between a Dictionary and containers.Map?

Recently, in release R2022b, MathWorks announced the introduction of dictionaries.
I was under the impression that dictionaries were already available, provided by containers.Map. Are dictionaries just a different name mapped to containers.Map? Or are there other differences? I was unable to find anything comparing them online.
From what I can gather, after reading this blog post and the comments under it, and the documentation (I haven’t yet had a chance to experiment with them, so feel free to correct me if I’m wrong):
dictionary is an actual primitive type, like double, cell or struct. containers.Map is a “custom class”; even though its code is nowadays built in, the functionality can never be as integrated as that of a primitive type. Consequently, dictionary is significantly faster.
dictionary uses normal value semantics. If you make a copy, you have two independent dictionaries (note MATLAB’s lazy-copy mechanism); see the sketch after this list. containers.Map is a handle class, meaning that all copies point to the same data: modifying one copy modifies them all.
containers.Map can use char arrays (the old string format) or numbers as keys (string is implicitly converted to char when used as a key). dictionary can use any type as key, as long as it overloads keyhash. This means you can use your own custom class objects as keys.
dictionary is vectorized; you can look up multiple values at once. With a containers.Map you can look up multiple values only through the values function, not the normal lookup syntax.
dictionary has actual O(1) lookup. If I remember correctly, containers.Map doesn’t.*
containers.Map can store any array as value; dictionary stores only scalars. The scalar can be a cell, which can contain any array, but this leads to awkward semantics, since retrieving the value retrieves the cell, not its contents.
* No, it is also O(1), at least in R2022b.
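A minimal sketch of the value-vs-handle difference (requires R2022b or later for dictionary):

d1 = dictionary("a", 1);          % value semantics
d2 = d1;
d2("a") = 2;
d1("a")                           % still 1: the copy is independent

m1 = containers.Map({'a'}, {1});  % handle semantics
m2 = m1;
m2('a') = 2;
m1('a')                           % now 2: both variables share the same data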

How can I comprehend this sentence: "two instances with the same hash value don’t necessarily compare equally"?

While reading the book Advanced Swift, in the chapter 'Hashable Requirement', I got confused by this explanation:
two instances that are equal (as defined by your == implementation) must have the same hash value. The reverse isn’t true: two instances with the same hash value don’t necessarily compare equally.
How can I comprehend the 'reverse' situation? Why don't two instances with the same hash value necessarily compare equally?
Think of the hash value as a quick, compact, non-unique identifier for a given object instance. The only hard condition is this: if two objects compare equally, according to the == operator, then both instances must have the exact same hash value. That’s all there is to it ;)
In particular, given that hash values aren’t unique (and how could they be, given Int's limited range?), we can’t safely assume that two instances with the same hash value will compare equally. There are simply more possible values than there are hash values, so the pigeonhole principle guarantees collisions.
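For example, a deliberately coarse but perfectly legal Hashable implementation (the type is made up for illustration) makes such a collision easy to see:

struct Point: Hashable {
    var x: Int
    var y: Int

    // Only x feeds the hash, so every point on the same vertical line
    // collides. This still satisfies Hashable: equal points get equal hashes.
    func hash(into hasher: inout Hasher) {
        hasher.combine(x)
    }
}

let a = Point(x: 1, y: 1)
let b = Point(x: 1, y: 2)
print(a.hashValue == b.hashValue)  // true: same hash value...
print(a == b)                      // false: ...yet not equal

Sets and dictionaries handle this transparently: a collision only means the two values land in the same bucket, where == makes the final call.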

Is it possible to create a memory-efficient sequence of bits in the JVM?

I've got a piece of code that takes into account a given number of features, where each feature is Boolean. I'm looking for the most efficient way to store a set of such features. My initial thought was to try and store these as a BitSet. But then, I realized that this implementation is meant to be used to store numbers in bit format rather than manipulate each bit, which is something I'd like to do (to see the effect of switching any feature on and off). I then thought of using a Boolean array, but apparently the JVM uses much more memory for each Boolean element than the one bit it actually needs.
I'm therefore left with the question: What is the most efficient way to store a set of bits that I'd like to treat as independent bits rather than the building blocks of some number?
Please refer to this question: boolean[] vs. BitSet: Which is more efficient?
According to Peter Lawrey's answer there, boolean[] (not Boolean[]) is the way to go, since its values can be manipulated directly and it uses only one byte of memory per element. There is no way for a JVM application to store one bit in a single bit of memory and still manipulate it directly (array-like), because the smallest addressable unit is a byte.
The site you referenced already states that the mutable BitSet is the same as java.util.BitSet; there is nothing you can do in Java that you can't do in Scala. But since you are using Scala, you probably want a safe implementation, perhaps even a multithreaded one, and mutable datatypes are not suitable for that. Therefore, I would simply use an immutable BitSet and accept the memory cost.
However, BitSets have their limits (their indices are ints, so they cap out around 2^31 bits). If you need larger data sizes, you may use LongBitSets, which are basically a Map<Long, BitSet>. If you need even more space, you may nest them in another map, Map<Long, LongBitSet>, but then you need two or more identifiers (longs) to address a bit.
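As an aside on the premise of the question: java.util.BitSet does let you treat each bit as an independent on/off flag, packed roughly eight flags per byte. A small sketch:

import java.util.BitSet;

public class Features {
    public static void main(String[] args) {
        BitSet features = new BitSet(64); // size hint; it grows as needed

        features.set(3);                  // switch feature 3 on
        features.set(10, true);           // switch feature 10 on
        features.flip(3);                 // toggle feature 3 off again
        features.clear(10);               // switch feature 10 off

        System.out.println(features.get(3));        // false
        System.out.println(features.cardinality()); // 0, no bits set
    }
}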

When is my struct too large?

We're encouraged to use struct over class in Swift.
This is because
The compiler can do a lot of optimizations
Instances are created on the stack, which is a lot more performant than malloc/free calls
The downside of struct variables is that they are copied each time they are assigned, passed to, or returned from a function. Obviously, this can become a bottleneck too.
E.g. imagine a 4x4 matrix. Its 16 Float values (32 bits each) would have to be copied on every assign/return, i.e. 512 bits; with Double elements it would be 1,024 bits.
One way you can avoid this is using inout when passing variables to functions, which is basically Swift's way of creating a pointer. But then we're also discouraged from using inout.
So to my question:
How should I handle large, immutable data structures in Swift?
Do I have to worry creating a large struct with many members?
If yes, when am I crossing the line?
The accepted answer does not entirely answer the question you had: Swift always copies structs. The trick that Array/Dictionary/String etc. use is that they are just wrappers around classes (which contain the actual stored properties). That way sizeof(Array) is just the size of the pointer to that class (MemoryLayout<Array<String>>.stride == MemoryLayout<UnsafeRawPointer>.stride).
If you have a really big struct, you might want to consider wrapping its stored properties in a class for efficient passing around as arguments, and checking isKnownUniquelyReferenced (named isUniquelyReferenced in early Swift versions) before mutating, to give COW (copy-on-write) semantics.
Structs have other efficiency benefits: they don't need reference-counting and can be decomposed by the optimiser.
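A minimal sketch of that pattern (all names here are illustrative):

// The struct's storage lives in a private class, so copying the struct only
// copies one reference; the buffer itself is cloned lazily, on first mutation.
final class MatrixStorage {
    var values: [Float]
    init(values: [Float]) { self.values = values }
}

struct Matrix4x4 {
    private var storage = MatrixStorage(values: Array(repeating: 0, count: 16))

    subscript(row: Int, col: Int) -> Float {
        get { storage.values[row * 4 + col] }
        set {
            // Clone only if another copy still shares this storage.
            if !isKnownUniquelyReferenced(&storage) {
                storage = MatrixStorage(values: storage.values)
            }
            storage.values[row * 4 + col] = newValue
        }
    }
}

var a = Matrix4x4()
let b = a          // cheap: one reference is copied, not 16 floats
a[0, 0] = 1        // triggers the single real copy
print(b[0, 0])     // 0.0: b kept the original storage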
In Swift, values keep a unique copy of their data. There are several advantages to using value-types, like ensuring that values have independent state. When we copy values (the effect of assignment, initialization, and argument passing) the program will create a new copy of the value. For some large values these copies could be time consuming and hurt the performance of the program.
https://github.com/apple/swift/blob/master/docs/OptimizationTips.rst#the-cost-of-large-swift-values
Also the section on container types:
Keep in mind that there is a trade-off between using large value types and using reference types. In certain cases, the overhead of copying and moving around large value types will outweigh the cost of removing the bridging and retain/release overhead.
From the very bottom of this page from the Swift Reference:
NOTE
The description above refers to the “copying” of strings, arrays, and dictionaries. The behavior you see in your code will always be as if a copy took place. However, Swift only performs an actual copy behind the scenes when it is absolutely necessary to do so. Swift manages all value copying to ensure optimal performance, and you should not avoid assignment to try to preempt this optimization.
I hope this answers your question. Also, if you want to be sure that an array doesn't get copied, you can always declare the parameter as inout and pass it with &array into the function.
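For example (the function is hypothetical):

func scale(_ values: inout [Float], by factor: Float) {
    for i in values.indices { values[i] *= factor }
}

var samples: [Float] = [1, 2, 3]
scale(&samples, by: 2)   // samples is now [2.0, 4.0, 6.0]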
Classes, on the other hand, add a lot of overhead and should only be used if you really must have a reference to the same object.
Examples for structs:
Timezone
Latitude/Longitude
Size/Weight
Examples for classes:
Person
A View

What is the preferred way of using the parallel collections in Scala?

At first I assumed that every collection class would receive an additional par method that would convert the collection to a fitting parallel data structure (just as map returns the best collection for the element type in Scala 2.8).
Now it seems that some collection classes support a par method (e.g. Array) while others have toParSeq or toParIterable methods (e.g. List). This is a bit weird, since Array isn't used or recommended that often.
What is the reason for that? Wouldn't it be better to just have a par method available on all collection classes, doing the "right thing"?
If I have data which might be processed in parallel, what types should I use? The traits in scala.collection or the type of the implementation directly?
Or should I prefer Arrays now, because they seem to be cheaper to parallelize?
Lists aren't that well suited for parallel processing. The reason is that to get to the end of a list, you have to walk through every single element, so you may as well treat the list as an iterator, and thus use something more generic like toParIterable.
Any collection that has fast indexed access is a good candidate for parallel processing. This includes anything implementing IndexedSeqOptimized, plus trees and hash tables. Array has as fast an index as you can get, so it's a fairly natural choice. You can also use things like ArrayBuffer (which has a par method returning a ParArray).
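A small sketch of the difference (note that since Scala 2.13 the parallel collections live in the separate scala-parallel-collections module, which provides the import below; in the 2.9-2.12 standard library, par is available directly):

import scala.collection.parallel.CollectionConverters._

val arr = Array.tabulate(1000000)(_.toDouble)
val sumA = arr.par.map(x => math.sqrt(x)).sum  // cheap: a ParArray wraps the existing array

val xs = (1 to 1000000).toList
val sumL = xs.par.map(x => math.sqrt(x)).sum   // also works, but pays an O(n) conversion first

For what it's worth, the released design settled on exactly this: a universal par that is cheap for indexed collections and an O(n) copy for linear ones like List.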