load factor in separate chaining? - hash

Why is it recommended to have a load factor of 1.0 in separate chaining?
I've seen plenty of people saying that it is recommended, but not given a clear explanation of why.
With open addressing, I know the load factor should be between 0.5 and 0.7 because it should be a fast operation to find an unoccupied index when dealing with collisions. But I can't see why a load factor of 1 should be better in separate chaining. I mean, if I have a table of size 100, isn't there still a chance that all 100 elements hashes to the same index and get placed in the same list? So I really can't comprehend why this specific load factor for separate chaining should be 1.

tl;dr: To save memory space by not having slots unoccupied and speed up access by minimizing the number of list traversal operation.
If you understand the load factor as n_used_slots / n_total_slots:
Having a load factor of 1 just describes the ideal situation for a well-implemented hash table using Separate Chaining collision handling: no slots are left empty.
The other classical approach, Open Addressing, requires the table to always have a free slot available when adding a new item. Resizing the table is way too costly to do it for each item, but we are also restricted on memory and wouldn’t want to have too many unused slots lying around. One has to find a balance between speed (few table resizes, quick inserts and lookups) and memory (few empty slots) [as ever so often in programming]. The ideal load factor is based on this idea of balancing and can be estimated based on the actual hash function, the value domain and other factors.
With Separate Chaining, on the other hand, we usually expect from the start to have (way) more items than available hash table slots. If a collision occurs, we need to add the item to the linked list stored in a specific slot. Since searching in a linked list is costly, we would like to minimize list traversal operations. For that, the best case is to have all slots filled with lists of ideally the same length! Having all slots filled corresponds to a load factor of 1.
To put it another way: A load factor < 1 means that there are empty slots and items had to be added to a linked list in another slot, increasing the number of list traversal operations and wasting some memory.
Concerning your example of a table with size 100: yes, there is a chance that all items collide and occupy just one single slot. In that case, the effective load factor would be 0.01 and performance would be heavily impacted.
If you understand the load factor as n_items / n_total_slots:
In that case, the load factor can be larger than 1. A factor < 1 means you have empty slots, while factor > 1 means that there are slots holding more than one item and consequently, list traversals are required. In the first case, you are wasting space and in the second case list traversals lead to a (small) performance hit, depending on the size of the lists.
Example: A load factor of 10 means that on average each slot holds 10 items. Searching for an item therefore implies traversing 5 list nodes on average.
A load factor of 1 means you waste no space and have the fastest lookup, if you use a decent hash function that ensures a regular and evenly balanced usage of slots.

Related

What should I do if primary or secondary clustering occurs in hash table

I get what primary and secondary clustering are but how to get rid of what way to minimise them properly
how to get rid of what way to minimise them properly
You can use a higher quality hash function to distribute the keys in a less collision-prone fashion. For some scenarios, the best practical hash function achievable has a kind of pseudo-random-but-repeatable placement property. In other cases, you might know something about the keys that lets you create a less collision-prone hash function - for example, you might know that the keys tend to be incrementing numbers, possibly with a few small gaps: in that case, an identity hash function h(n) = n will tend to place values in adjacent buckets, with less chance of collision than if the placements were more random.
In some cases, using prime numbers of buckets helps distribute elements better across the buckets than using a power-of-two bucket count. Basically, bucket counts that are powers of two are effectively masking out the high-order bits of the hash value when mapping onto the buckets: any randomness in the high order bits is discarded instead of helping to create a more uniform distribution across buckets. Still, bitwise masking is faster than a mod calculation on most hardware/CPUs.
You can also reduce the load factor: the ratio of elements to buckets. Clustering effects for hash tables using closed hashing get exponentially worse as the load factor approaches 1 (i.e. every bucket being full).
You could also stop using closed hashing and use separate chaining (maintaining containers of elements colliding at each bucket) instead, which doesn't suffer from primary clustering, but the indirection can lead to more memory usage overheads, indirection, and less optimal use of CPU cache, with consequently lower runtime performance - especially when the elements are small (a few bytes each).
You can also use multiple hash functions to identify successive buckets at which an element may be stored, rather than simple offers as in linear or quadratic probing, which reduces clustering. When you have alternative buckets, you can use techniques to move elements around to reduce the worse areas of clustering - search for robin hood hashing for example.

Why LRU implementation is expensive in full associative TLB?

I have a book statement:
Implementation of LRU in full associative TLB is very expensive, so the general way is to use random substitution.
I don't understand why it's expensive under full associative cache. Isn't that just adding an additional reference bit...?
LRU requires maintaining a total order relation between all valid cache lines in a cache set. For example, consider a 3-way cache set with the following lines A, B, and C ordered from the most recently accessed to the least recently accessed (represented as ABC). If C is accessed next, then the order becomes CAB. If a new line, D, needs to be filled in the same cache set, since there are no invalid lines, the LRU replacement policy will choose B to be evicted and replaced by the new line. Then the order becomes DCA.
For a 3-way cache, there are up to 3*2 = 6 possible orders for the lines in each set. In general, for an N-way cache, there are up to N! (N factorial) possible orders. Theoretically, you need at least log2(N!) bits (rounded up to the nearest integer) per cache set to maintain the LRU property accurately. Note that log2(N!) is Θ(Nlog(N)), so it grows superlinearly with respect to the number of ways. No normal person likes anything whose cost grows superlinearly.
A particularly cheap case is a 2-way cache, where the LRU state requires only log2(2!) = 1 bits, i.e., a single bit. It is much more expensive for any other number of ways though.
In practice, though, there is no easy way to maintain a single number that represents the LRU state of a set. If the current LRU state is X and then some access to a line occurs, how can the next LRU state be determined? There is no simple mathematical relation that can be implemented in hardware. So instead of using a single number, a realistic implementation would use multiple numbers, one per cache line. In this case, these numbers are called ages. Such design would even require (many) more bits than the theoretical minimum log2(N!) to maintain the LRU state.
Aside from the hardware overhead, the LRU replacement policy is not necessarily optimal for performance. It depends on the memory access patterns of the applications in the target market domain and the rest of the cache hierarchy.
LRU has been used in many real processors. Caches that are 2-way associative typically use LRU. For example, AMD SledgeHammer uses LRU for both L1I and L1D caches. The Itanium 2 processor's L1 instruction cache uses LRU and it is 4-way associative. Usually, when the number of ways is larger than two, caches don't use LRU.

How to calculate the best numberOfPartitions for coalesce?

So, I understand that in general one should use coalesce() when:
the number of partitions decreases due to a filter or some other operation that may result in reducing the original dataset (RDD, DF). coalesce() is useful for running operations more efficiently after filtering down a large dataset.
I also understand that it is less expensive than repartition as it reduces shuffling by moving data only if necessary. My problem is how to define the parameter that coalesce takes (idealPartionionNo). I am working on a project which was passed to me from another engineer and he was using the below calculation to compute the value of that parameter.
// DEFINE OPTIMAL PARTITION NUMBER
implicit val NO_OF_EXECUTOR_INSTANCES = sc.getConf.getInt("spark.executor.instances", 5)
implicit val NO_OF_EXECUTOR_CORES = sc.getConf.getInt("spark.executor.cores", 2)
val idealPartionionNo = NO_OF_EXECUTOR_INSTANCES * NO_OF_EXECUTOR_CORES * REPARTITION_FACTOR
This is then used with a partitioner object:
val partitioner = new HashPartitioner(idealPartionionNo)
but also used with:
RDD.filter(x=>x._3<30).coalesce(idealPartionionNo)
Is this the right approach? What is the main idea behind the idealPartionionNo value computation? What is the REPARTITION_FACTOR? How do I generally work to define that?
Also, since YARN is responsible for identifying the available executors on the fly is there a way of getting that number (AVAILABLE_EXECUTOR_INSTANCES) on the fly and use that for computing idealPartionionNo (i.e. replace NO_OF_EXECUTOR_INSTANCES with AVAILABLE_EXECUTOR_INSTANCES)?
Ideally, some actual examples of the form:
Here 's a dataset (size);
Here's a number of transformations and possible reuses of an RDD/DF.
Here is where you should repartition/coalesce.
Assume you have n executors with m cores and a partition factor equal to k
then:
The ideal number of partitions would be ==> ???
Also, if you can refer me to a nice blog that explains these I would really appreciate it.
In practice optimal number of partitions depends more on the data you have, transformations you use and overall configuration than the available resources.
If the number of partitions is too low you'll experience long GC pauses, different types of memory issues, and lastly suboptimal resource utilization.
If the number of partitions is too high then maintenance cost can easily exceed processing cost. Moreover, if you use non-distributed reducing operations (like reduce in contrast to treeReduce), a large number of partitions results in a higher load on the driver.
You can find a number of rules which suggest oversubscribing partitions compared to the number of cores (factor 2 or 3 seems to be common) or keeping partitions at a certain size but this doesn't take into account your own code:
If you allocate a lot you can expect long GC pauses and it is probably better to go with smaller partitions.
If a certain piece of code is expensive then your shuffle cost can be amortized by a higher concurrency.
If you have a filter you can adjust the number of partitions based on a discriminative power of the predicate (you make different decisions if you expect to retain 5% of the data and 99% of the data).
In my opinion:
With one-off jobs keep higher number partitions to stay on the safe side (slower is better than failing).
With reusable jobs start with conservative configuration then execute - monitor - adjust configuration - repeat.
Don't try to use fixed number of partitions based on the number of executors or cores. First understand your data and code, then adjust configuration to reflect your understanding.
Usually, it is relatively easy to determine the amount of raw data per partition for which your cluster exhibits stable behavior (in my experience it is somewhere in the range of few hundred megabytes, depending on the format, data structure you use to load data, and configuration). This is the "magic number" you're looking for.
Some things you have to remember in general:
Number of partitions doesn't necessarily reflect
data distribution. Any operation that requires shuffle (*byKey, join, RDD.partitionBy, Dataset.repartition) can result in non-uniform data distribution. Always monitor your jobs for symptoms of a significant data skew.
Number of partitions in general is not constant. Any operation with multiple dependencies (union, coGroup, join) can affect the number of partitions.
Your question is a valid one, but Spark partitioning optimization depends entirely on the computation you're running. You need to have a good reason to repartition/coalesce; if you're just counting an RDD (even if it has a huge number of sparsely populated partitions), then any repartition/coalesce step is just going to slow you down.
Repartition vs coalesce
The difference between repartition(n) (which is the same as coalesce(n, shuffle = true) and coalesce(n, shuffle = false) has to do with execution model. The shuffle model takes each partition in the original RDD, randomly sends its data around to all executors, and results in an RDD with the new (smaller or greater) number of partitions. The no-shuffle model creates a new RDD which loads multiple partitions as one task.
Let's consider this computation:
sc.textFile("massive_file.txt")
.filter(sparseFilterFunction) // leaves only 0.1% of the lines
.coalesce(numPartitions, shuffle = shuffle)
If shuffle is true, then the text file / filter computations happen in a number of tasks given by the defaults in textFile, and the tiny filtered results are shuffled. If shuffle is false, then the number of total tasks is at most numPartitions.
If numPartitions is 1, then the difference is quite stark. The shuffle model will process and filter the data in parallel, then send the 0.1% of filtered results to one executor for downstream DAG operations. The no-shuffle model will process and filter the data all on one core from the beginning.
Steps to take
Consider your downstream operations. If you're just using this dataset once, then you probably don't need to repartition at all. If you are saving the filtered RDD for later use (to disk, for example), then consider the tradeoffs above. It takes experience to become familiar with these models and when one performs better, so try both out and see how they perform!
As others have answered, there is no formula which calculates what you ask for. That said, You can make an educated guess on the first part and then fine tune it over time.
The first step is to make sure you have enough partitions. If you have NO_OF_EXECUTOR_INSTANCES executors and NO_OF_EXECUTOR_CORES cores per executor then you can process NO_OF_EXECUTOR_INSTANCES*NO_OF_EXECUTOR_CORES partitions at the same time (each would go to a specific core of a specific instance).
That said this assumes everything is divided equally between the cores and everything takes exactly the same time to process. This is rarely the case. There is a good chance that some of them would be finished before others either because of locallity (e.g. the data needs to come from a different node) or simply because they are not balanced (e.g. if you have data partitioned by root domain then partitions including google would probably be quite big). This is where the REPARTITION_FACTOR comes into play. The idea is that we "overbook" each core and therefore if one finishes very quickly and one finishes slowly we have the option of dividing the tasks between them. A factor of 2-3 is generally a good idea.
Now lets take a look at the size of a single partition. Lets say your entire data is X MB in size and you have N partitions. Each partition would be on average X/N MBs. If N is large relative to X then you might have very small average partition size (e.g. a few KB). In this case it is usually a good idea to lower N because the overhead of managing each partition becomes too high. On the other hand if the size is very large (e.g. a few GB) then you need to hold a lot of data at the same time which would cause issues such as garbage collection, high memory usage etc.
The optimal size is a good question but generally people seem to prefer partitions of 100-1000MB but in truth tens of MB probably would also be good.
Another thing you should note is when you do the calculation how your partitions change. For example, lets say you start with 1000 partitions of 100MB each but then filter the data so each partition becomes 1K then you should probably coalesce. Similar issues can happen when you do a groupby or join. In such cases both the size of the partition and the number of partitions change and might reach an undesirable size.

How does Scala's Vector work?

I read this page about the time complexity of Scala collections. As it says, Vector's complexity is eC for all operations.
It made me wonder what Vector is. I read the document and it says:
Because vectors strike a good balance between fast random selections and fast random functional updates, they are currently the
default implementation of immutable indexed sequences. It is backed by
a little endian bit-mapped vector trie with a branching factor of 32.
Locality is very good, but not contiguous, which is good for very
large sequences.
As with everything else about Scala, it's pretty vague. How actually does Vector work?
The keyword here is Trie.
Vector is implemented as a Trie datastructure.
See http://en.wikipedia.org/wiki/Trie.
More precisely, it is a "bit-mapped vector trie". I've just found a consise enough description of the structure (along with an implementation - apparently in Rust) here:
https://bitbucket.org/astrieanna/bitmapped-vector-trie
The most relevant excerpt is:
A Bitmapped Vector Trie is basically a 32-tree. Level 1 is an array of size 32, of whatever data type. Level 2 is an array of 32 Level 1's. and so on, until: Level 7 is an array of 2 Level 6's.
UPDATE: In reply to Lai Yu-Hsuan's comment about complexity:
I will have to assume you meant "depth" here :-D. The legend for "eC" says "The operation takes effectively constant time, but this might depend on some assumptions such as maximum length of a vector or distribution of hash keys.".
If you are willing to consider the worst case, and given that there is an upper bound to the maximum size of the vector, then yes indeed we can say that the complexity is constant.
Say that we consider the maximum size to be 2^32, then this means that the worst case is 7 operations at most, in any case.
Then again, we can always consider the worst case for any type of collection, find an upper bound and say this is constant complexity, but for a list by example, this would mean a constant of 4 billions, which is not quite practical.
But Vector is the opposite, 7 operations being more than practical, and this is how we can afford to consider its complexity constant in practice.
Another way to look at this: we are not talking about log(2,N), but log(32,N). If you try to plot that you'll see it is practically an horizontal line. So pragmatically speaking you'll never be able to see much increase in processing time as the collection grows.
Yes, that's still not really constant (which is why it is marked as "eC" and not just "C"), and you'll be able to see a difference around short vectors (but again, a very small difference because the number of operations grows so much slowly).
The other answers re 'Trie' are good. But as a close approximation, just for quick understanding:
Vector internally uses a tree structure - not a binary tree, but a 32-ary tree
Each '32-way node' uses Array[32] and can store either 0-32 references to child nodes or 0-32 pieces of data
The tree is structured to be balanced in a certain way - it is "n" levels deep, but levels 1 to n-1 are "index-only levels" (100% child references; no data) and level n contains all the data (100% data; no child references). So if the number of elements of data is "d" then n = log-base-32(d) rounded upwards
Why this? Simple: for performance.
Instead of doing thousands/millions/gazillions of memory allocations for each individual data element, memory is allocated in 32 element chunks. Instead of walking miles deep to find your data, the structure is quite shallow - it's a very wide, short tree. E.g. 5 levels deep can contain 32^5 data elements (for 4 byte elements = 132GB i.e. pretty big) and each data access would lookup & walk through 5 nodes from the root (whereas a big array would use a single data access). The vector does not proactively allocat memory for all of Level n (data), - it allocates in 32 element chunks as needed. It gives read performance somewhat similar to a huge array, whilst having functional characteristics (power & flexibility & memory-efficiency) somewhat similar to a binary tree.
:)
These may be interesting for you:
Ideal Hash Trees by Phil Bagwell.
Implementing Persistent Vectors in Scala - Daniel Spiewak
More Persistent Vectors: Performance Analysis - Daniel Spiewak
Persistent data structures in Scala

Sieve of Eratosthenes (reducing space complexity)

I wanted to generate prime numbers between two given numbers ‘a’ and ‘b’ (b > a). What I did was store Boolean values in an array of size b-1 (that is for numbers 2 to b) and then I applied the sieve method.
Is there a better way, that reduces space complexity, if I don't need all prime numbers from 2 to b?
You need to store all primes which are smaller of equal than the square root of b, then for each number between a and b check whether they are divisible by any of these numbers and they don't equal these numbers. So in our case the magic number is sqrt(b)
You can use segmented sieve of Eratosthenes. The basic idea is pretty simple.
In a typical sieve, we start with a large array of Booleans, all set to the same value. These represent odd numbers, starting from 3. We look at the first and see that it's true, so we add it to the list of prime numbers. Then we mark off every multiple of that number as not prime.
Now, the problem with this is that it's not very cache friendly. As we mark off the multiples of each number, we go through the entire array. Then when we reach the end, we start over from the beginning (which is no longer in the cache) and walk through the entire array again. Each time through the array, we read the entire array from main memory again.
For a segmented sieve, we do things a bit differently. We start by by finding only the primes up to the square root of the limit we care about. Then we use those to mark off primes in the main array. The difference here is the order in which we mark off primes. Instead of marking off all the multiples of three, then all the multiples of 5, and so on, we start by marking off the multiples of three for data that will fit in the cache. Then, instead of continuing on to more data in the array, we go back and mark off the multiples of five for the data that fits in the cache. Then the multiples of 7, and so on.
Then, when we've marked off all the multiples in that cache-sized chunk of data, we move on to the next cache-sized chunk of data. We start over with marking off multiples of 3 in this chunk, then multiples of 5, and so on until we've marked off all the multiples in this chunk. We continue that pattern until we've marked off all the non-prime numbers in all the chunks, and we're done.
So, given N primes below the square root of the limit we care about, a naive sieve will read the entire array of Booleans N times. By contrast, a segmented sieve will only read each chunk of the data once. Once a chunk of data is read from main memory, all the processing on that chunk is done before any more data is read from main memory.
The exact speed-up this gives will depend on the ratio of the speed of cache to the speed of main memory, the size of the array you're using vs. the size of the cache, and so on. Nonetheless, it is generally pretty substantial--for example, on my particular machine, looking for the primes up to 100 million, the segmented sieve has a speed advantage of about 10:1.
One thing you must remember, if you're using C++. A well-known issue with std::vector<bool> is Under C++98/03, vector<bool> was required to be a specialization that stored each Boolean as a single bit with some proxy trickery to get bool-like behavior. That requirement has since been lifted, but many libraries still include it.
With a non-segmented sieve, it's generally a useful trade-off. Although it requires a little extra CPU time to compute masks and such to modify only a single bit at a time, it saves enough bandwidth to main memory to more than compensate.
With a segmented sieve, bandwidth to main memory isn't nearly as large a factor, so using a vector<char> generally seems to give better results (at least with the compilers and processors I have handy).
Getting optimal performance from a segmented sieve does require knowledge of the size of your processor's cache, but getting it precisely correct isn't usually critical--if you assume the size is smaller than it really is, you won't necessarily get optimal use of your cache, but you usually won't lose a lot either.