Is there an efficient way to calculate the smallest N numbers from a set of numbers in hardware (HDL)? - system-verilog

I am trying to calculate the smallest N numbers from a set and I've found software algorithms to do this. I'm wondering if there is an efficient way to do this in hardware (i.e. HDL - in System Verilog or Verilog)? I am specifically trying to calculate the smallest 2 numbers from a set.
I am trying to do this combinationally, optimizing for area and speed (for a large set of signals), but the only approach I can think of is a comparator tree. Is there a more efficient way of doing this?
Thank you, any help is appreciated.

I don't think you can work around using comparator trees if you want to find the two smallest elements combinationally. However, if your goal isn't low latency, then a (possibly pipelined) sequential circuit could also be an option.
One approach that I can come up with on the spot would be to break down the operation into a kind of incomplete bubble sort in hardware, using small sorting networks. Depending on the amount of area you are willing to spend, you can use a smaller or larger p-sorting network that combinationally sorts p elements at a time, where p >= 3. You then apply this network to your input set of size N, sorting p elements at a time. The two smallest elements of each group are stored in some sort of memory (e.g. an SRAM, if you want to process larger numbers of elements).
Here is an example for p=3 (the brackets indicate the grouping of elements the p-sorter is applied to):
(4 0 9) (8 6 7) (4 2 1) --> (0 4 9) (6 7 8) (1 2 4) --> 0 4 6 7 1 2
Now you start the next round:
You apply the p-sorter on the results of the first round.
Again you store the two smallest outputs of your p-sorter into the same memory overwriting values from the previous round.
Here the continuation of the example:
(0 4 6) (7 1 2) --> (0 4 6) (1 2 7) --> 0 4 1 2
In each round, the number of elements still in play shrinks to 2/p of its previous size. E.g. with p == 4 you discard half the elements in each round, until the smallest two elements end up at the first two memory locations. The algorithm therefore needs O(log n) rounds, and since the rounds shrink geometrically, the total number of p-sorter invocations (i.e. cycles) is O(n). For an actual hardware implementation, you probably want to stick to powers of two for the size p of the sorting network.
Although the control logic of such a circuit is not trivial to implement, the area should be dominated mainly by the size of the sorting network and the memory needed to hold the first (2/p)*N intermediate results (assuming your input signals are not already stored in a memory you can reuse for that purpose). If you want to tune the circuit for throughput, you can increase p and pipeline the sorting network at the expense of additional area. Additional speedup could be gained by replacing the single memory with up to p two-port memories (1 read and 1 write port each), which would allow you to fetch and write back the data for the sorting network in a single cycle, increasing the utilization of the comparators in the sorting network.
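Before committing to RTL, the round-by-round reduction above can be modeled in software. Here is a minimal Python sketch; `smallest_two` is a made-up name, and `sorted()` stands in for the combinational p-sorting network:

```python
def smallest_two(values, p=4):
    """Software model of the iterative p-sorter reduction: sort p
    elements at a time, keep only the two smallest of each group,
    and repeat until just two elements remain."""
    assert p >= 3 and len(values) >= 2
    current = list(values)
    while len(current) > 2:
        survivors = []
        for i in range(0, len(current), p):
            group = sorted(current[i:i + p])  # stands in for the p-sorting network
            survivors.extend(group[:2])       # keep the two smallest of the group
        current = survivors                   # next round works on the survivors
    return sorted(current)
```

Running this on the example input `(4 0 9 8 6 7 4 2 1)` with p=3 returns `[0, 1]`, i.e. the two smallest elements of the set.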

Related

Genetic algorithm techniques for allocation of electric vehicles

The problem I'm trying to solve is about the best allocation for electric vehicles (EVs) in the electrical power grid. My grid has 20 possible positions (busbars), each allowed to receive one EV. Each chromosome has length 20 and its genes can be 0 or 1, where 0 means no EV and 1 means there's an EV at that position (busbar).
I start my population (100 individuals) with a fixed number of EVs (5, for instance) allocated randomly, and let them evolve through my GA. The GA uses tournament selection, 2-point crossover and flip-bit mutation. Each chromosome/individual is evaluated through a fitness function that calculates the power losses between bars (sum of R*I^2). The best chromosome is the one with the lowest power losses.
The problem is that 2-point crossover and flip-bit mutation change the fixed number of EVs that must be in the grid. I would like to know which techniques are best suited for my GA operators. Besides this, I get a weird-looking graph of the fittest chromosome over the generations.
I would appreciate any help/suggestions. Thanks.
You want to define your state space in such a way where the mutations you've chosen can't create an illegal configuration.
This is probably not a great fit for a genetic algorithm. If you want to pick 5 from 20, there are ~15k possibilities. Testing a population of 100 over 50 generations already gives you enough computations to have done 1/3 of the brute force work.
If you have N EVs to place on your grid, you can use chromosomes of size N, each gene being an integer representing the position of one EV. For crossover, first separate the values that are the same in both parents from the rest, then apply a classic (1- or 2-point) crossover only on the parts that differ. For mutation, reassign a gene to a randomly picked valid, unoccupied position.
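Assuming valid chromosomes (N distinct busbar positions each), that scheme could be sketched like this in Python; the function names and the `rng` parameter are illustrative, not from the question:

```python
import random

def crossover(parent_a, parent_b, rng=random):
    """Crossover for position-encoded chromosomes (one gene per EV,
    all positions distinct). Positions present in both parents are
    inherited directly; the differing positions are split at a random
    cut point, so each child keeps exactly N distinct positions."""
    shared = sorted(set(parent_a) & set(parent_b))
    diff_a = [g for g in parent_a if g not in shared]
    diff_b = [g for g in parent_b if g not in shared]
    cut = rng.randrange(len(diff_a) + 1) if diff_a else 0
    child_1 = shared + diff_a[:cut] + diff_b[cut:]
    child_2 = shared + diff_b[:cut] + diff_a[cut:]
    return child_1, child_2

def mutate(chromosome, n_positions, rng=random):
    """Move one EV to a randomly picked free busbar, keeping the
    number of EVs fixed."""
    free = [p for p in range(n_positions) if p not in chromosome]
    if free:
        chromosome = chromosome[:]  # don't modify the parent in place
        chromosome[rng.randrange(len(chromosome))] = rng.choice(free)
    return chromosome
```

Because both operators only ever produce N distinct positions, no repair step is needed and every individual in the population stays a legal configuration.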

load factor in separate chaining?

Why is it recommended to have a load factor of 1.0 in separate chaining?
I've seen plenty of people saying that it is recommended, but not given a clear explanation of why.
With open addressing, I know the load factor should be between 0.5 and 0.7 because finding an unoccupied index when dealing with collisions should be a fast operation. But I can't see why a load factor of 1 should be better in separate chaining. I mean, if I have a table of size 100, isn't there still a chance that all 100 elements hash to the same index and get placed in the same list? So I really can't comprehend why the recommended load factor for separate chaining should be 1.
tl;dr: to save memory by leaving no slots unoccupied, and to speed up access by minimizing the number of list traversal operations.
If you understand the load factor as n_used_slots / n_total_slots:
Having a load factor of 1 just describes the ideal situation for a well-implemented hash table using Separate Chaining collision handling: no slots are left empty.
The other classical approach, Open Addressing, requires the table to always have a free slot available when adding a new item. Resizing the table is far too costly to do for each item, but we are also restricted on memory and wouldn't want too many unused slots lying around. One has to find a balance between speed (few table resizes, quick inserts and lookups) and memory (few empty slots), as so often in programming. The ideal load factor follows from this balancing act and can be estimated from the actual hash function, the value domain and other factors.
With Separate Chaining, on the other hand, we usually expect from the start to have (way) more items than available hash table slots. If a collision occurs, we need to add the item to the linked list stored in a specific slot. Since searching in a linked list is costly, we would like to minimize list traversal operations. For that, the best case is to have all slots filled with lists of ideally the same length! Having all slots filled corresponds to a load factor of 1.
To put it another way: A load factor < 1 means that there are empty slots and items had to be added to a linked list in another slot, increasing the number of list traversal operations and wasting some memory.
Concerning your example of a table with size 100: yes, there is a chance that all items collide and occupy just one single slot. In that case, the effective load factor would be 0.01 and performance would be heavily impacted.
If you understand the load factor as n_items / n_total_slots:
In that case, the load factor can be larger than 1. A factor < 1 means you have empty slots, while factor > 1 means that there are slots holding more than one item and consequently, list traversals are required. In the first case, you are wasting space and in the second case list traversals lead to a (small) performance hit, depending on the size of the lists.
Example: A load factor of 10 means that on average each slot holds 10 items. Searching for an item therefore implies traversing 5 list nodes on average.
A load factor of 1 means you waste no space and have the fastest lookup, if you use a decent hash function that ensures a regular and evenly balanced usage of slots.
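As a small illustration of the n_items / n_total_slots reading, here is a toy separate-chaining table in Python (`chain_lengths` is just a name for this sketch). Note that the average chain length equals the load factor by definition, however unevenly the individual chains turn out:

```python
def chain_lengths(items, n_slots):
    """Distribute items into a separate-chaining hash table (using
    Python's built-in hash) and return each slot's chain length."""
    table = [[] for _ in range(n_slots)]
    for item in items:
        table[hash(item) % n_slots].append(item)
    return [len(chain) for chain in table]

# 100 items in 100 slots: load factor n_items / n_total_slots == 1
lengths = chain_lengths([f"key{i}" for i in range(100)], 100)

# the *average* chain length equals the load factor by definition,
# regardless of how evenly the hash spreads the keys
assert sum(lengths) / len(lengths) == 1.0
```

With a decent hash function most chains will have length 0, 1 or 2 here, which is why an average successful lookup touches only about one node.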

work stealing with Parallel Computing Toolbox

I have implemented a combinatorial search algorithm (for comparison to a more efficient optimization technique) and tried to improve its runtime with parfor.
Unfortunately, the work assignments appear to be very badly unbalanced.
Each subitem i has complexity of approximately nCr(N - i, 3). As you can see, the tasks i < N/4 involve significantly more work than i > 3*N/4, yet it seems MATLAB is assigning all of i < N/4 to a single worker.
Is it true that MATLAB divides the work based on equally sized subsets of the loop range?
No, this question cites the documentation saying it does not.
Is there a convenient way to rebalance this without hardcoding the number of workers (e.g. if I require exactly 4 workers in the pool, then I could swap the two lowest bits of i with two higher bits in order to ensure each worker received some mix of easy and hard tasks)?
I don't think a full "work-stealing" implementation is necessary, perhaps just assigning 1, 2, 3, 4 to the workers, then when 4 completes first, its worker begins on item 5, and so on. The size of each item is sufficiently larger than the number of iterations that I'm not too worried about the increased communication overhead.
If the loop iterations are indeed distributed ahead of time (which would mean that, in the end, a single worker has to complete several iterations while the other workers are idle; is this really the case?), the easiest way to ensure a mix is to randomly permute the loop iterations:
permutedIterations = randperm(nIterations);
permutedResults = cell(nIterations,1); %# or whatever else you use for storing results
%# run the parfor loop, completing iterations in permuted order
parfor iIter = 1:nIterations
    permutedResults{iIter} = f(permutedIterations(iIter));
end
%# undo the permutation so results are in the original iteration order
results = cell(nIterations,1);
results(permutedIterations) = permutedResults;

Sieve of Eratosthenes (reducing space complexity)

I wanted to generate prime numbers between two given numbers ‘a’ and ‘b’ (b > a). What I did was store Boolean values in an array of size b-1 (that is for numbers 2 to b) and then I applied the sieve method.
Is there a better way, that reduces space complexity, if I don't need all prime numbers from 2 to b?
You need to store all primes which are smaller than or equal to the square root of b; then, for each number between a and b, check whether it is divisible by any of these primes (and is not itself one of them). So in our case the magic number is sqrt(b).
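A possible Python sketch of that approach (`primes_between` is an illustrative name): sieve only the range up to sqrt(b), then trial-divide each candidate in [a, b] against those small primes.

```python
from math import isqrt

def primes_between(a, b):
    """Primes in [a, b], storing only the primes up to sqrt(b)."""
    limit = isqrt(b)
    is_prime = [True] * (limit + 1)
    small_primes = []
    for n in range(2, limit + 1):      # classic sieve up to sqrt(b)
        if is_prime[n]:
            small_primes.append(n)
            for m in range(n * n, limit + 1, n):
                is_prime[m] = False
    result = []
    for n in range(max(a, 2), b + 1):
        # n is prime iff no small prime (other than n itself) divides it
        if all(n % p != 0 for p in small_primes if p != n):
            result.append(n)
    return result
```

The space cost drops from O(b) Booleans to O(sqrt(b)) stored primes, at the price of trial division per candidate; the segmented sieve below keeps the better time complexity as well.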
You can use segmented sieve of Eratosthenes. The basic idea is pretty simple.
In a typical sieve, we start with a large array of Booleans, all set to the same value. These represent odd numbers, starting from 3. We look at the first and see that it's true, so we add it to the list of prime numbers. Then we mark off every multiple of that number as not prime.
Now, the problem with this is that it's not very cache friendly. As we mark off the multiples of each number, we go through the entire array. Then when we reach the end, we start over from the beginning (which is no longer in the cache) and walk through the entire array again. Each time through the array, we read the entire array from main memory again.
For a segmented sieve, we do things a bit differently. We start by finding only the primes up to the square root of the limit we care about. Then we use those to mark off composites in the main array. The difference here is the order in which we mark them off. Instead of marking off all the multiples of 3, then all the multiples of 5, and so on, we start by marking off the multiples of 3 only for the data that will fit in the cache. Then, instead of continuing on to more data in the array, we go back and mark off the multiples of 5 for the data that fits in the cache. Then the multiples of 7, and so on.
Then, when we've marked off all the multiples in that cache-sized chunk of data, we move on to the next cache-sized chunk of data. We start over with marking off multiples of 3 in this chunk, then multiples of 5, and so on until we've marked off all the multiples in this chunk. We continue that pattern until we've marked off all the non-prime numbers in all the chunks, and we're done.
So, given N primes below the square root of the limit we care about, a naive sieve will read the entire array of Booleans N times. By contrast, a segmented sieve will only read each chunk of the data once. Once a chunk of data is read from main memory, all the processing on that chunk is done before any more data is read from main memory.
The exact speed-up this gives will depend on the ratio of the speed of cache to the speed of main memory, the size of the array you're using vs. the size of the cache, and so on. Nonetheless, it is generally pretty substantial. For example, on my particular machine, looking for the primes up to 100 million, the segmented sieve has a speed advantage of about 10:1.
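Here is a compact Python sketch of a segmented sieve; `segment_size` stands in for the cache-sized chunk, and in real code you would tune it to your processor's cache:

```python
from math import isqrt

def segmented_sieve(limit, segment_size=32768):
    """Segmented Sieve of Eratosthenes: find the primes up to
    sqrt(limit) first, then mark off their multiples one
    cache-sized segment at a time."""
    root = isqrt(limit)
    base = [True] * (root + 1)
    base_primes = []
    for n in range(2, root + 1):       # ordinary sieve up to sqrt(limit)
        if base[n]:
            base_primes.append(n)
            for m in range(n * n, root + 1, n):
                base[m] = False
    primes = list(base_primes)
    low = root + 1
    while low <= limit:                # process one segment at a time
        high = min(low + segment_size - 1, limit)
        segment = [True] * (high - low + 1)
        for p in base_primes:
            # first multiple of p inside [low, high]
            start = max(p * p, ((low + p - 1) // p) * p)
            for m in range(start, high + 1, p):
                segment[m - low] = False
        primes.extend(n for n in range(low, high + 1) if segment[n - low])
        low = high + 1
    return primes
```

Each segment is touched once by every base prime and then never revisited, which is exactly the cache-friendly access pattern described above.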
One thing you must remember if you're using C++: std::vector<bool> is a well-known special case. Under C++98/03, vector<bool> was required to be a specialization that stored each Boolean as a single bit, with some proxy trickery to get bool-like behavior. That requirement has since been lifted, but many libraries still implement it that way.
With a non-segmented sieve, it's generally a useful trade-off. Although it requires a little extra CPU time to compute masks and such to modify only a single bit at a time, it saves enough bandwidth to main memory to more than compensate.
With a segmented sieve, bandwidth to main memory isn't nearly as large a factor, so using a vector<char> generally seems to give better results (at least with the compilers and processors I have handy).
Getting optimal performance from a segmented sieve does require knowledge of the size of your processor's cache, but getting it precisely correct isn't usually critical: if you assume the size is smaller than it really is, you won't necessarily get optimal use of your cache, but you usually won't lose a lot either.

An instance of online data clustering

I need to derive clusters of integers from an input array of integers in such a way that the variation within the clusters is minimized. (The integers, or data values, in the array correspond to the gas usage of 16 cars running between cities. In the end I will derive 4 clusters of the 16 cars based on the clusters of the data values.)
Constraints: the number of elements is always 16, the number of clusters is 4, and the size of each cluster is 4.
One simple way I am planning to do this is to sort the input array and then divide it into 4 groups, as shown below. I think I could also use k-means clustering.
However, here is where I got stuck: the data in the array change over time. Basically, I need to monitor the array every second and regroup/recluster it so that the variation within each cluster is minimized, while still satisfying the above constraints. One idea I have is to select two groups based on their means and variations and move data values between them to minimize the within-group variation. However, I have no idea how to select the data values to move between the groups, nor how to select those groups. I cannot apply sorting to the array every second, because I cannot afford N log N each time. It would be great if you could guide me to a simple solution.
sorted input array: `(12 14 16 16 18 19 20 21 24 26 27 29 29 30 31 32)`
cluster-1: (12 14 16 16)
cluster-2: (18 19 20 21)
cluster-3: (24 26 27 29)
cluster-4: (29 30 31 32)
Let me first point out that sorting a small number of objects is very fast. In particular, when they have already been sorted before, an "evil" bubble sort or insertion sort is usually linear. Consider in how few places the order may have changed! All the classic complexity discussion doesn't really apply when the data fits into the CPU's first-level cache.
Did you know that most QuickSort implementations fall back to insertion sort for small arrays? Because it does a fairly good job for small arrays and has little overhead.
All the complexity discussions only apply to really large data sets. They are, in fact, asymptotic statements, proven only in the limit of infinitely large data. Before you reach infinity, a simple algorithm of higher complexity order may still perform better. And for n < 10, quadratic insertion sort often outperforms O(n log n) sorting.
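For illustration, a plain insertion sort in Python; its running time is O(n + d), where d is the number of inversions, so on an almost-sorted array (few values changed since the last second) it is effectively linear:

```python
def insertion_sort(a):
    """In-place insertion sort. Each element is shifted left past the
    (few) larger elements in front of it; if the array is already
    nearly sorted, almost no shifting happens."""
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0 and a[j] > key:  # shift larger elements right
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a
```

Re-sorting 16 values that moved only slightly is a handful of comparisons per second, which is negligible.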
k-means however won't help you much.
Your data is one-dimensional. Do not bother to even look at multidimensional methods; they will perform worse than proper one-dimensional methods (which can exploit the fact that the data can be ordered).
If you want guaranteed runtime, k-means with possibly many iterations is quite uncontrolled.
You can't easily add constraints such as the 4-cars rule into k-means
I believe the solution to your task (because of the data being 1 dimensional and the constraints you added) is:
Sort the integers
Divide the sorted list into k even-sized groups
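Those two steps might look like this in Python (`cluster_1d` is an illustrative name; it returns index groups so you can map clusters back to cars; the optimality of contiguous equal-size groups holds for 1-d within-group variation, but verify it against your exact cost function):

```python
def cluster_1d(values, k):
    """Sort once, then cut the sorted order into k equal-sized,
    contiguous groups. Returns k lists of indices into `values`."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    size = len(values) // k
    return [order[g * size:(g + 1) * size] for g in range(k)]
```

For the 16-car case this is one sort of 16 elements per update, which is far cheaper than running k-means, and with an insertion sort on the previous ordering each update is nearly linear.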