Given that there are quite a few bioinformaticians around these days, I'm wondering whether any of you have seen examples of hash functions (or mapping mechanisms) occurring in nature? And if so, how do they work?
Smell is a sort of hashing.
The sense of smell acts as a hash function: it maps a huge variety of chemical compounds to a small piece of information (several things may smell alike). Living organisms learn to distinguish the "buckets" such "hashing" puts objects into, and tend to build behavioral patterns that rely on such a function.
This applies to taste as well.
On a more humorous note, I would say that the nervous and sensory systems of most or all animals might be seen as a hash function designed to divide things into:
things to mate with
things to eat
things to run away from
rocks.
Kudos to Terry Pratchett for identifying this elaborate system of categories. ;)
The visual system (lens, retina, cortex) is a suite of serially composed hashing functions.
(1) projection from 3d to 2d.
(2) photon to neural stimulus
(3) spatial to pattern
(4) spatial and pattern to temporal
(5) (in predator species with aligned pairs of eyes) two sets of all the above back to 3d.
etc.
Hubel and Wiesel got the Nobel in 1981 for figuring this out at the expense of a bunch of kittehs. They didn't call them hash functions, but that's actually a terrific way of representing them.
I guess turning a color image into a B&W one could be seen as a mapping/hashing -- and such filters exist in nature as well as in photography.
I am coding a spell-casting system where you draw a symbol with your wand (mouse), and it can recognize said symbol.
There are two methods I believe might work: neural networks and an "invisible grid system".
The problem with the neural network approach is that it would likely be suboptimal in Roblox Luau and wouldn't match the performance or speed I'm after. (Although I may just be lacking in neural networking knowledge; please let me know whether I should keep trying to implement it this way.)
For the invisible grid system, I thought of converting the drawing into 1s and 0s (1 = drawn, 0 = blank), then seeing if it is similar to one of the symbols. I create the symbols by making a dictionary like:
local Symbol = { -- "Answer Key" shape, looks like a tilted square (rows as strings, since a Lua number literal like 00100 would just be 100)
    "00100",
    "01010",
    "10001",
    "01010",
    "00100",
}
The problem is that user error will likely make it inaccurate, as in this "spell"'s blue boxes, which show the user error/inaccuracy. I'm also sure that if I have multiple symbols, comparing every value in every symbol will not be quick.
Do you know an algorithm that could help me do this? Or just some alternative way of doing this I am missing? Thank you for reading my post.
I'm sorry if the format of this post is incorrect; this is my first Stack Overflow post, and I will gladly delete it if it doesn't abide by one of the rules. (Let me know if there are any tags I should add.)
One possible approach to solving this problem is to use a template matching algorithm. In this approach, you would create a "template" for each symbol that you want to recognize, which would be a grid of 1s and 0s similar to what you described in your question. Then, when the user draws a symbol, you would convert their drawing into a grid of 1s and 0s in the same way.
Next, you would compare the user's drawing to each of the templates using a similarity metric, such as the sum of absolute differences (SAD) or normalized cross-correlation (NCC). The template with the lowest SAD or highest NCC value would be considered the "best match" for the user's drawing, and therefore the recognized symbol.
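For illustration, here is a rough sketch of that comparison in Python (your question is about Luau, but the same logic ports over directly). The template name, the 5x5 string-grid format and the mismatch threshold below are just assumptions for the example, not part of your system.

# Compare a drawn 5x5 grid of '0'/'1' characters against each stored template
# using the sum of absolute differences (SAD); the lowest score wins.
TEMPLATES = {
    "tilted_square": [
        "00100",
        "01010",
        "10001",
        "01010",
        "00100",
    ],
}

def sad(drawing, template):
    # Count the cells where the drawing and the template disagree.
    return sum(
        abs(int(d) - int(t))
        for drow, trow in zip(drawing, template)
        for d, t in zip(drow, trow)
    )

def recognize(drawing, templates=TEMPLATES, max_mismatches=4):
    # Return the best-matching symbol name, or None if nothing is close enough.
    best_name, best_score = None, None
    for name, template in templates.items():
        score = sad(drawing, template)
        if best_score is None or score < best_score:
            best_name, best_score = name, score
    if best_score is not None and best_score <= max_mismatches:
        return best_name
    return None

# A slightly sloppy tilted square (one stray cell) still matches.
drawn = ["00100", "01010", "10001", "01110", "00100"]
print(recognize(drawn))  # -> tilted_square

The max_mismatches threshold is how you trade off tolerance for user error against the risk of confusing two similar symbols.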
There are a few advantages to using this approach:
It is relatively simple to implement, compared to a neural network.
It is fast, since you only need to compare the user's drawing to a small number of templates.
It can tolerate some user error, since the templates can be designed to be tolerant of slight variations in the user's drawing.
There are also some potential disadvantages to consider:
It may not be as accurate as a neural network, especially for complex or highly variable symbols.
The templates must be carefully designed to be representative of the expected variations in the user's drawings, which can be time-consuming.
Overall, whether this approach is suitable for your use case will depend on the specific requirements of your spell-casting system, including the number and complexity of the symbols you want to recognize, the accuracy and speed you need, and the resources (e.g. time, compute power) that are available to you.
I believe (from doing some reading) that reading/writing data across the bus from CPU caches to main memory places a considerable constraint on how fast a computational task (which needs to move data across the bus) can complete - the von Neumann bottleneck.
I have come across a few articles which mention that functional programming can be more performant than imperative paradigms such as OO (in certain models of computation).
Can someone please explain some of the ways that purely functional programming can reduce this bottleneck? I.e. are any of the following points found (in general) to be true?
Using immutable data structures means generally less data moving across that bus -- fewer writes?
Using immutable data structures means that data is possibly more likely to be hanging around in CPU cache, because fewer updates to existing state mean less flushing of objects from cache?
Is it possible that using immutable data structures means we may often never even read the data back from main memory? We might create an object during a computation, keep it in local cache, and then, within the same time slice, create a new immutable object from it (if an update is needed) and never touch the original again -- i.e. we would be working far more with objects that are sitting in local cache.
Oh man, that’s a classic. John Backus’ 1977 ACM Turing Award lecture is all about that: “Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs.” (The paper, “Lambda: The Ultimate Goto,” was presented at the same conference.)
I’m guessing that either you or whoever raised this question had that lecture in mind. What Backus called “the von Neumann bottleneck” was “a connecting tube that can transmit a single word between the CPU and the store (and send an address to the store).”
CPUs do still have a data bus, although in modern computers, it’s usually wide enough to hold a vector of words. Nor have we gotten away from the problem that we need to store and look up a lot of addresses, such as the links to daughter nodes of lists and trees.
But Backus was not just talking about physical architecture (emphasis added):
Not only is this tube a literal bottleneck for the data traffic of a problem, but, more importantly, it is an intellectual bottleneck that has kept us tied to word-at-a-time thinking instead of encouraging us to think in terms of the larger conceptual units of the task at hand. Thus programming is basically planning and detailing the enormous traffic of words through the von Neumann bottleneck, and much of that traffic concerns not significant data itself but where to find it.
In that sense, functional programming has been largely successful at getting people to write higher-level functions, such as maps and reductions, rather than “word-at-a-time thinking” such as the for loops of C. If you try to perform an operation on a lot of data in C, today, then just like in 1977, you need to write it as a sequential loop. Potentially, each iteration of the loop could do anything to any element of the array, or any other program state, or even muck around with the loop variable itself, and any pointer could potentially alias any of these variables. At the time, that was true of the DO loops of Backus’ first high-level language, Fortran, as well, except maybe the part about pointer aliasing. To get good performance today, you try to help the compiler figure out that, no, the loop doesn’t really need to run in the order you literally specified: this is an operation it can parallelize, like a reduction or a transformation of some other array or a pure function of the loop index alone.
But that’s no longer a good fit for the physical architecture of modern computers, which are all vectorized symmetric multiprocessors—like the Cray supercomputers of the late ’70s, but faster.
Indeed, the C++ Standard Template Library now has algorithms on containers that are totally independent of the implementation details or the internal representation of the data, and Backus’ own creation, Fortran, added FORALL and PURE in 1995.
When you look at today’s big data problems, you see that the tools we use to solve them resemble functional idioms a lot more than the imperative languages Backus designed in the ’50s and ’60s. You wouldn’t write a bunch of for loops to do machine learning in 2018; you’d define a model in something like TensorFlow and evaluate it. If you want to work with big data with a lot of processors at once, it’s extremely helpful to know that your operations are associative, and therefore can be grouped in any order and then combined, allowing for automatic parallelization and vectorization. Or that a data structure can be lock-free and wait-free because it is immutable. Or that a transformation on a vector is a map that can be implemented with SIMD instructions on another vector.
Examples
Last year, I wrote a couple short programs in several different languages to solve a problem that involved finding the coefficients that minimized a cubic polynomial. A brute-force approach in C11 looked, in relevant part, like this:
static variable_t ys[MOST_COEFFS];

// #pragma omp simd safelen(MOST_COEFFS)
for ( size_t j = 0; j < n; ++j )
    ys[j] = ((a3s[j]*t + a2s[j])*t + a1s[j])*t + a0s[j];

variable_t result = ys[0];
// #pragma omp simd reduction(min:result)
for ( size_t j = 1; j < n; ++j ) {
    const variable_t y = ys[j];
    if (y < result)
        result = y;
} // end for j
The corresponding section of the C++14 version looked like this:
const variable_t result =
(((a3s*t + a2s)*t + a1s)*t + a0s).min();
In this case, the coefficient vectors were std::valarray objects, a special type of object in the STL that has restrictions on how its components can be aliased and whose member operations are limited; a lot of the restrictions on which operations are safe to vectorize sound a lot like the restrictions on pure functions. The list of allowed reductions, like .min() at the end, is, not coincidentally, similar to the instances of Data.Semigroup. You’ll see a similar story these days if you look through <algorithm> in the STL.
Now, I’m not going to claim that C++ has become a functional language. As it happened, I made all the objects in the program immutable and automatically cleaned up by RAII, but that’s just because I’ve had a lot of exposure to functional programming and that’s how I like to code now. The language itself doesn’t impose such things as immutability, garbage collection or absence of side-effects. But when we look at what Backus in 1977 said was the real von Neumann bottleneck, “an intellectual bottleneck that has kept us tied to word-at-a-time thinking instead of encouraging us to think in terms of the larger conceptual units of the task at hand,” does that apply to the C++ version? The operations are linear algebra on coefficient vectors, not word-at-a-time. And the ideas C++ borrowed to do this—and the ideas behind expression templates even more so—are largely functional concepts. (Compare that snippet to how it would’ve looked in K&R C, and how Backus defined a functional program to compute inner product in section 5.2 of his Turing Award lecture in 1977.)
I also wrote a version in Haskell, but I don’t think it’s as good an example of escaping that kind of von Neumann bottleneck.
It’s absolutely possible to write functional code that meets all of Backus’ descriptions of the von Neumann bottleneck. Looking back on the code I wrote this week, I’ve done it myself. A fold or traversal that builds a list? They’re high-level abstractions, but they’re also defined as sequences of word-at-a-time operations, and half or more of the data passed through the bottleneck when you create and traverse a singly-linked list is the addresses of other data! They’re efficient ways to put data through the von Neumann bottleneck, and that’s basically why I did it: they’re great patterns for programming von Neumann machines.
If we’re interested in coding a different way, however, functional programming gives us tools to do so. (I’m not going to claim it’s the only thing that does.) Express a reduction as a foldMap, apply it to the right kind of vector, and the associativity of the monoidal operation lets you split up the problem into chunks of whatever size you want and combine the pieces later. Make an operation a map rather than a fold, on a data structure other than a singly-linked list, and it can be automatically parallelized or vectorized. Or transformed in other ways that produce the same result, since we’ve expressed the result at a higher level of abstraction, not a particular sequence of word-at-a-time operations.
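As a minimal sketch of that idea (in Python rather than Haskell, with the chunk size and the monoid chosen arbitrarily for illustration): express the reduction as "map each element into a monoid, reduce each chunk, combine the chunk results". Because the combining operation is associative, the chunk boundaries don't affect the answer, so the chunks could just as well be handed to different cores or SIMD lanes.

from functools import reduce

def fold_map_chunked(xs, to_monoid, combine, identity, chunk_size):
    # Reduce each chunk independently, then combine the partial results.
    chunks = [xs[i:i + chunk_size] for i in range(0, len(xs), chunk_size)]
    partials = [reduce(combine, map(to_monoid, c), identity) for c in chunks]
    return reduce(combine, partials, identity)

# The same minimum-finding reduction as the C/C++ snippets above, written as a
# foldMap over the monoid (min, +infinity).
ys = [3.5, -1.25, 7.0, 0.5, -2.75, 4.0]
print(fold_map_chunked(ys, lambda y: y, min, float("inf"), chunk_size=2))  # -> -2.75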
My examples so far have been about parallel programming, but I’m sure quantum computing will shake up what programs look like a lot more fundamentally.
How can I compare methods of collision resolution (i.e. linear probing, quadratic probing and double hashing) in hash tables? What data would best show the differences between them? Maybe someone has seen such comparisons.
There is no simple approach that's also universally meaningful.
That said, a good approach if you're tuning an actual app is to instrument (collect stats for) the hash table implementation you're using in the actual application of interest, with the real data it processes, and for whichever functions are of interest (insert, erase, find, etc.). When those functions are called, record whatever you want to know about the collisions that happen: depending on how thorough you want to be, that might include the number of collisions before the element was inserted or found, the number of CPU/memory cache lines touched during that probing, the elapsed CPU or wall-clock time, etc.
If you want a more general impression, instrument an implementation and throw large quantities of random data at it - but be aware that the real-world applicability of whatever conclusions you draw may only be as good as the random data is similar to the real-world data.
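As a sketch of what such instrumentation might look like (Python, with an arbitrary table size, load factor and probe sequences; a real measurement should use your own implementation and keys representative of your application), here is a toy open-addressing table that counts the occupied slots probed per insert for linear probing versus double hashing:

import random

TABLE_SIZE = 1009  # prime, so the double-hashing step is always co-prime with it

def linear_probe(key, i):
    return (hash(key) + i) % TABLE_SIZE

def double_hash_probe(key, i):
    step = 1 + hash(key) % (TABLE_SIZE - 1)
    return (hash(key) + i * step) % TABLE_SIZE

def count_insert_collisions(keys, probe):
    # Insert every key and count how many occupied slots were probed on the way.
    table = [None] * TABLE_SIZE
    collisions = 0
    for key in keys:
        i = 0
        while table[probe(key, i)] is not None:
            i += 1
            collisions += 1
        table[probe(key, i)] = key
    return collisions

keys = random.sample(range(10**9), 700)  # ~70% load factor
print("linear probing:", count_insert_collisions(keys, linear_probe))
print("double hashing:", count_insert_collisions(keys, double_hash_probe))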
There are also other, more subtle implications to the choice of collision-handling mechanism: linear probing allows an implementation to clean up "tombstone" buckets where deleted elements exist, which takes time but speeds later performance, so the mix of deletions amongst other operations can affect the stats you collect.
At the other extreme, you could try a mathematical comparison of the properties of different collision handling - that's way beyond what I'm able or interested in covering here.
I'm writing a program to significantly lessen the number of collisions that occur when using hash functions like 'key mod table_size'. For this I would like to use genetic programming/algorithms, but I don't know much about them. Even after reading many articles and examples, I don't know what, in my case (i.e. in the program definition), would be the fitness function, the target (the target is usually the required result), the population/individuals, the parents, etc.
Please help me in identifying the above and with a few codes/pseudo-codes snippets if possible as this is my project.
It's not necessary to use genetic programming/algorithms; anything using evolutionary programming/algorithms would do.
Thanks.
My advice would be: don't do this that way. The literature on hash functions is vast and we more or less understand what makes a good hash function. We know enough mathematics not to look for them blindly.
If you need a hash function to use, there is plenty to choose from.
However, if this is your uni project and you cannot possibly change the subject or steer it in a more manageable direction, then as you noticed there will be complex issues in getting the fitness function and mutation operators right. As far as I can tell off the top of my head, there are no obvious candidates.
You may look up e.g. 'strict avalanche criterion' and try to see if you can reason about it in terms of fitness and mutations.
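For example (a rough sketch, not part of any standard library): an avalanche-based fitness could flip each input bit of a candidate 32-bit hash and penalise how far each output bit's flip probability is from the ideal 1/2. The candidate function below is just a placeholder for whatever expression your GP/EA evolves.

import random

def candidate_hash(x):
    # Placeholder candidate; an evolutionary search would generate/mutate this expression.
    x = ((x ^ (x >> 16)) * 0x45d9f3b) & 0xFFFFFFFF
    return x ^ (x >> 13)

def avalanche_penalty(h, trials=1000, bits=32):
    # flips[i][j] counts how often flipping input bit i flipped output bit j.
    flips = [[0] * bits for _ in range(bits)]
    for _ in range(trials):
        x = random.getrandbits(bits)
        hx = h(x)
        for i in range(bits):
            diff = hx ^ h(x ^ (1 << i))
            for j in range(bits):
                if (diff >> j) & 1:
                    flips[i][j] += 1
    # 0 would be perfect: every output bit flips with probability exactly 1/2.
    return sum(abs(c / trials - 0.5) for row in flips for c in row)

print(avalanche_penalty(candidate_hash))  # lower is fitter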
Another question is how you want to represent your function. Just a boolean expression? Something built from word operations like AND, XOR, NOT, ROT?
Depending on your constraints (or rather, assumptions) the question of fitness and mutation will be different.
Broadly, fitness is clearly to minimize the number of collisions in your 'hash modulo table-size' model.
The obvious part is to take a suitably large and (very important) representative distribution of keys and chuck them through your 'candidate' function.
Then you might pass them through 'hash modulo table-size' for one or more values of table-size and evaluate some measure of 'niceness' of the arising distribution(s).
So what that boils down to is what table-sizes to try and what niceness measure to apply.
Niceness is context dependent.
You might measure 'fullest bucket' as a measure of 'worst case' insert/find time.
You might measure the sum of squares of bucket sizes as a measure of 'average' insert/find time, assuming look-ups are uniformly distributed amongst the keys.
Finally you would need to decide what table-size (or sizes) to test at.
Conventional wisdom often uses primes because hash modulo prime tends to be nicely volatile to all the bits in the hash, whereas something like hash modulo 2^n only involves the lower n bits.
To keep computation down you might consider the series of the next prime larger than each power of two: 5 (> 2^2), 11 (> 2^3), 17 (> 2^4), etc., up to and including the first power of 2 greater than your 'sample' size.
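Putting those pieces together, a fitness evaluation along these lines might look like the following sketch. The key sample and the identity "hash" baseline are placeholders; in practice you would feed it keys representative of your application and the candidate functions your search produces.

import random

def next_prime(n):
    def is_prime(k):
        return k >= 2 and all(k % d for d in range(2, int(k ** 0.5) + 1))
    while not is_prime(n):
        n += 1
    return n

def fitness(candidate_hash, keys):
    # Bucket the keys modulo the first prime above each power of two and score
    # each distribution by the sum of squared bucket sizes (lower is better).
    score = 0
    size = 4
    while size <= 2 * len(keys):
        table_size = next_prime(size + 1)  # 5, 11, 17, 37, 67, ...
        buckets = [0] * table_size
        for k in keys:
            buckets[candidate_hash(k) % table_size] += 1
        score += sum(b * b for b in buckets)
        size *= 2
    return score

keys = random.sample(range(10**6), 1000)  # stand-in for a representative key sample
print(fitness(lambda k: k, keys))         # identity "hash" as a baseline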
There are other ways of considering fitness but without a practical application the question is (of course) ill-defined.
If the potential hash functions in your 'space' don't all have the same execution time, you should also factor in 'cost'.
It's fairly easy to define very good hash functions but execution time can be a significant factor.
I'm doing an AutoLisp project which uses long associative structures to do heavy geometrical processing, so I'm curious about timing results for intensive use of association lists.
How simple/complex is the implementation? Does it use some special data structure, or a normal list of dotted pairs?
Are there any extensions for B-trees or something similar?
The turning point for SBCL on recent x86 hardware between alists and identity-based hash tables, assuming an even distribution of access, is around 30-40 elements.
In Common Lisp and Emacs Lisp association lists are linked lists, so they have linear search time. Assuming that AutoLisp is the same (and if it isn't, then their use of the term "Associative List" is misleading), you can assume that all operations will be linear in the length of the list. For example, an alist with 100 elements will, on average, need 50 accesses to find the thing that you are after.
Of course, most Scheme implementations (or maybe it's in the specs?) have hashtables, which use mostly the same API; but it's not transparent: when you ask for an alist, you get a list of pairs; if you want a hashtable, ask for it.
That said, it's important to remember that linear algorithms aren't slow; they're 'unscalable'. For a small number of elements, they'll outperform a more complex 'clever' algorithm. Just how large 'n' has to be depends a lot on the algorithm, and fast processors with big caches but slow RAM keep pushing it up. Also, heavily optimising compilers (like those available for some Lisps) generate very tight linear code.
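To make that concrete (not AutoLisp, just an illustration of the same scaling argument in Python, with arbitrary sizes and a worst-case lookup): time a linear scan over an association-list-style list of pairs against a hash-table lookup and see where the crossover lands on your hardware.

import timeit

def assoc(key, alist):
    # Linear search over (key, value) pairs, like Lisp's ASSOC.
    for k, v in alist:
        if k == key:
            return v
    return None

for n in (10, 40, 1000):
    alist = [(i, i * i) for i in range(n)]
    table = dict(alist)
    t_alist = timeit.timeit(lambda: assoc(n - 1, alist), number=20000)  # worst case: last element
    t_hash = timeit.timeit(lambda: table.get(n - 1), number=20000)
    print(f"n={n:5d}  alist: {t_alist:.4f}s  hash table: {t_hash:.4f}s")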
I have not worked with AutoLisp in about 10 years, but I never found any real performance issues with association list manipulation. And I wrote code that would do a fair amount of association list manipulation.
Working in VBA or ObjectARX might have some performance benefits, but you would probably need to run some comparison testing to see if it is really better.
There is no B-tree extension that I know of, but if you use Visual LISP you can use ActiveX objects and thus access most types of databases.