LightFM and other libraries ask for a 32-bit integer ID, e.g. for users. But our user ID is a UUID, e.g. 0003374a-a35c-46ed-96d2-0ea32b753199. I was wondering what you would recommend in scenarios like these. What I have come up with is:
Create a bidirectional dictionary, either in memory or in a database, to keep a UUID <-> int mapping, e.g. https://github.com/jab/bidict
Use a non-cryptographic hash function like MurmurHash3 or xxHash. For example, for 10 million UUIDs, I got around 11,521 collisions, or about 0.1%, using xxHash. Is that negligible for a recommender system?
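For what it's worth, here is a minimal sketch of how such a collision count can be measured, assuming the standard uuid module and the xxhash package (the exact numbers vary from run to run; N is kept smaller here than the 10 million figure above):

    import uuid
    import xxhash

    def hash32(u: str) -> int:
        # Map a UUID string to a 32-bit integer with xxHash (non-cryptographic).
        return xxhash.xxh32(u.encode("utf-8")).intdigest()

    # Count how many of N random UUIDs land on an already-used 32-bit value.
    N = 1_000_000
    seen = set()
    collisions = 0
    for _ in range(N):
        h = hash32(str(uuid.uuid4()))
        if h in seen:
            collisions += 1
        else:
            seen.add(h)

    print(f"{collisions} collisions out of {N} UUIDs ({collisions / N:.3%})")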
I'm also curious about how this would apply in an online prediction scenario where, given the UUID, the user interactions, and the model, I have to predict recommendations with a model that needs 32-bit integers. If I use the in-memory bidict approach, that won't work in this case, so I may have to create a persistent key-value store in the worst case.
This will definitely work, and is probably the solution the vast majority of users will choose. The disadvantage lies, of course, in having to maintain the mapping.
A hashing function will also work. There are, in fact, approaches which use hashing to reduce the dimensionality of the embedding layers required. One thing worth bearing in mind is that the resulting hash range should be relatively compact: most implementations will allocate parameters for all possible values, so a hashing function that can hash to very large values will require exorbitant amounts of memory. Hashing followed by a modulo function could work well; the trade-off then is between the memory required to hold all parameters and the collision probability.
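As a minimal sketch of the hash-then-modulo idea (again assuming the xxhash package; the bucket count is an arbitrary example value, and choosing it is exactly the memory-versus-collisions trade-off described above):

    import xxhash

    NUM_BUCKETS = 2 ** 20  # arbitrary: more buckets = more embedding rows, fewer collisions

    def user_index(user_uuid: str) -> int:
        # Hash to a 32-bit value, then fold it into a compact range so the model
        # only needs NUM_BUCKETS parameter rows instead of 2**32.
        return xxhash.xxh32(user_uuid.encode("utf-8")).intdigest() % NUM_BUCKETS

    print(user_index("0003374a-a35c-46ed-96d2-0ea32b753199"))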
In LightFM as well as most other implementations, recommendations can only be made for users and items (or at least for user and item features) that were present during the training. The mapping will then be a part of the model itself, and be effectively frozen until a new model is trained.
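To make that concrete, here is a rough sketch of how the UUID <-> int mapping from the question could be built at training time and persisted next to the model, assuming the bidict package (recent versions expose the reverse mapping as .inverse); the file name and the list of training UUIDs are made up for illustration:

    import json
    from bidict import bidict

    # Hypothetical UUIDs seen in the training data.
    training_user_uuids = [
        "0003374a-a35c-46ed-96d2-0ea32b753199",
        "91f6a5c8-9a62-4a5e-8c0e-3f2b5c6d7e8f",
    ]

    # Built once, at training time: UUID string <-> dense integer index.
    user_ids = bidict()
    for u in training_user_uuids:
        user_ids[u] = len(user_ids)

    # Persist alongside the trained model so online prediction can reuse it.
    with open("user_ids.json", "w") as f:
        json.dump(dict(user_ids), f)

    # At prediction time: forward lookup UUID -> int, inverse lookup int -> UUID.
    idx = user_ids["0003374a-a35c-46ed-96d2-0ea32b753199"]
    print(idx, user_ids.inverse[idx])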
Related
I wondered a while ago why no technology exists to equalize hash-creation speed across different CPUs/GPUs. I have no idea whether this is feasible or not, which is why I am asking here. The idea behind this is to make the proof of work a contest between just two parties, each with a 50% chance of creating the winning hash (equal hashing speed!). In combination with an easier-to-find nonce, this solution would be more energy-friendly than existing proof-of-work technologies, while the desired goal is still met.
This is more or less impossible for the simple reason that a faster machine is just … faster. If one of the two parties buys a faster machine, then they will compute the hash faster. That's just the way it is.
However, there is something we can do. Bitcoin, for example, is based on SHA-256 (the 256-bit version of SHA-2). SHA-2 is specifically designed to be fast, and to be easy to speed up with specialized hardware. And that is exactly what we see happening in the Bitcoin mining space, with the move from pure software-based mining to CPUs with built-in acceleration for SHA-2, to GPUs, to FPGAs, to ASICs.
The reason for this is that SHA-2 is designed as a general cryptographic hash function, and one of the main uses of cryptographic hashes is as the basis for TLS/SSL and digital signatures, where large amounts of data need to be hashed in a short amount of time.
But there are other use cases for cryptographic hash functions, in particular password hashing. For password hashing, we want the hash function to be slow and hard to speed up, since a legitimate user only needs to hash a very small amount of data (the password) once (when logging in), whereas an attacker needs to hash large numbers of passwords over and over again for a brute-force attack.
Examples of such hash functions are PBKDF2, bcrypt, scrypt, Catena, Lyra2, yescrypt, Makwa, and Argon2 (the latter being the winner of the Password Hashing Competition, which ran from 2013 to 2015). Scrypt in particular is designed to be hard to speed up using GPUs, FPGAs, and ASICs, as well as through space-time or time-space trade-offs. Scrypt uses a cryptographically secure pseudo-random number generator (CSPRNG) to initialize a huge array of pseudo-random numbers in memory, and then uses another CSPRNG to generate indices for accesses into this array, thus making both the memory contents and the memory access patterns pseudo-random.
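To illustrate the tunable cost, here is a minimal sketch using Python's hashlib.scrypt (the parameter values are arbitrary examples, not recommendations); doubling n roughly doubles both the memory and the time required:

    import hashlib
    import os
    import time

    password = b"correct horse battery staple"
    salt = os.urandom(16)

    # n is the CPU/memory cost (a power of two), r the block size, p the
    # parallelization factor; memory use is roughly 128 * n * r bytes.
    for n in (2 ** 12, 2 ** 13, 2 ** 14):
        start = time.perf_counter()
        key = hashlib.scrypt(password, salt=salt, n=n, r=8, p=1, dklen=32)
        print(f"n={n:5d}: {time.perf_counter() - start:.3f}s, key={key.hex()[:16]}...")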
Theoretically, of course, it would be possible to pre-compute the result; after all, accessing an array in some specific order is the same as accessing a much larger array in linear order. However, scrypt is designed in such a way that this pre-computed array would be prohibitively large. Plus, scrypt has a simple work-factor parameter that can be used to exponentially increase the size of this array if memory capacity increases. So, trading space for time is not possible.
Likewise, it would be possible to create a PRNG which combines the two pseudo-random processes into one process and computes the results on the fly. However, scrypt is designed in such a way that the computing time for this would be prohibitively long, and again, there is the exponential work-factor which can be used to drastically increase the computing time without changes to the algorithm. So, trading time for space is not possible.
The pseudo-random access pattern to the memory also defeats any sort of branch-prediction, memory prefetching or caching scheme of the CPU.
And lastly, since the large array is a shared global mutable state, and there is no way to sensibly divide the work into independent units, the algorithm is not sensibly parallelizable, which means you can't speed it up using GPUs.
And in fact, some newer cryptocurrencies, smart contracts, blockchains etc. use an scrypt-based proof-of-work scheme.
Note, however, that running scrypt on a faster machine is still faster than running scrypt on a slower machine. There is no way around that. It just means that we cannot get the ridiculous amounts of speedup we get from using specialized hardware for SHA-2, for example. But designing cryptographic algorithms is hard, and there actually are ASIC-based scrypt miners for Litecoin out there that do get a significant speedup, though still less than the impressive ones we see for SHA-2 / Bitcoin.
How can I compare methods of collision resolution (i.e. linear probing, quadratic probing, and double hashing) in hash tables? What data would be best to show the differences between them? Maybe someone has seen such comparisons.
There is no simple approach that's also universally meaningful.
That said, a good approach if you're tuning an actual app is to instrument (collect stats from) the hash table implementation you're using, in the actual application of interest, with the real data it processes, and for whichever operations are of interest (insert, erase, find, etc.). When those functions are called, record whatever you want to know about the collisions that happen: depending on how thorough you want to be, that might include the number of collisions before the element was inserted or found, the number of CPU/memory cache lines touched during that probing, the elapsed CPU or wall-clock time, etc.
If you want a more general impression, instrument an implementation and throw large quantities of random data at it - but be aware that the real-world applicability of whatever conclusions you draw may only be as good as the random data is similar to the real-world data.
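For example, here is a minimal sketch of that kind of instrumentation: a toy open-addressing table with linear probing that counts probes while being fed random keys (quadratic probing or double hashing would only change the probe-step logic):

    import random

    class LinearProbingTable:
        def __init__(self, capacity=1 << 16):
            self.slots = [None] * capacity
            self.capacity = capacity
            self.probe_count = 0   # total probes across all operations
            self.op_count = 0      # number of insert/contains calls

        def _probe(self, key):
            # Linear probing: step through consecutive slots from the home bucket.
            i = hash(key) % self.capacity
            while True:
                self.probe_count += 1
                yield i
                i = (i + 1) % self.capacity

        def insert(self, key):
            self.op_count += 1
            for i in self._probe(key):
                if self.slots[i] is None or self.slots[i] == key:
                    self.slots[i] = key
                    return

        def contains(self, key):
            self.op_count += 1
            for i in self._probe(key):
                if self.slots[i] is None:
                    return False
                if self.slots[i] == key:
                    return True

    # Fill to ~70% load with random keys and look at the average probe length.
    table = LinearProbingTable()
    keys = random.sample(range(10 ** 9), int(table.capacity * 0.7))
    for k in keys:
        table.insert(k)
    print(f"average probes per operation: {table.probe_count / table.op_count:.2f}")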
There are also other, more subtle implications to the choice of collision-handling mechanism: linear probing allows an implementation to clean up "tombstone" buckets where deleted elements exist, which takes time but speeds up later operations, so the mix of deletions amongst other operations can affect the stats you collect.
At the other extreme, you could try a mathematical comparison of the properties of different collision handling - that's way beyond what I'm able or interested in covering here.
I have a static dictionary containing millions of keys which refer to values in a sparse data structure stored out-of-core. The number of keys is a small fraction, say 10%, of the number of values. The key size is typically 64 bits. The keys are linearly ordered, and queries will often consist of keys which are close together in this order. Data compression is a factor, but it is the values which are expected to be the biggest contributor to data size rather than the keys. Key compression helps, but is not critical. Query time should be constant, if possible, and fast, since a user is interacting with the data.
Given these conditions I would like to know an effective way to query the dictionary to determine whether a specific key is contained in it. Query speed is the top priority, construction time is not as critical.
Currently I'm looking at cache-oblivious b+-trees and order-preserving minimal perfect hashes tied to external storage.
At this point CHD or some other form of hashing seems like a candidate. Since the keys are queried in approximately linear order, it seems that an order-preserving hash would avoid cache misses, but I'm not knowledgeable enough to say whether CHD can preserve the order of the keys. Constant-time queries are also desirable. The lookup is O(1), but the upper limit on query times across the key space is unknown to me.
Trees seem less attractive. Although there are some cache-oblivious and cache-specific approaches, I think much of the effort is aimed at range queries on dynamic dictionaries rather than constant-time membership queries. Processors and memories, in general, don't like branches.
There have been a number of questions asked along these lines, but this case (hopefully) constrains the problem in a manner that might be useful to others.
Any feedback would be appreciated, thanks.
Does anyone know if there's a real benefit, in terms of decreasing collision probability, to combining hash functions? I especially need to know this for 32-bit hashing, namely combining Adler32 and CRC32.
Basically, will adler32(crc32(data)) yield a smaller collision probability than crc32(data)?
The last comment here gives some test results in favor of combining, but no source is mentioned.
For my purpose, collisions are not critical (i.e. the task does not involve security), but I'd rather minimize the probability anyway, if possible.
PS: I'm just starting out in the wonderful world of hashing and doing a lot of reading about it. Sorry if I asked a silly question; I haven't even acquired the proper "hash dialect" yet, and my Google searches on this were probably also poorly formed.
Thanks.
Combining them in series like that doesn't make sense: you are hashing one 32-bit space onto another 32-bit space.
In the case of a crc32 collision in the first step, the final result is still a collision. Then you add on any potential collisions in the adler32 step. So it cannot get any better, and can only be the same or worse.
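A quick way to convince yourself of this empirically, assuming Python's zlib (the loop relies on the birthday bound, so it typically needs on the order of a hundred thousand random inputs):

    import os
    import zlib

    def composed(data: bytes) -> int:
        # adler32 over the 4-byte crc32 digest: still a 32-bit -> 32-bit mapping.
        return zlib.adler32(zlib.crc32(data).to_bytes(4, "big"))

    # Find two random inputs with the same crc32 (expected after roughly 2**16 tries).
    seen = {}
    while True:
        data = os.urandom(16)
        c = zlib.crc32(data)
        if c in seen and seen[c] != data:
            a, b = seen[c], data
            break
        seen[c] = data

    # Every crc32 collision is automatically a collision of the composed hash too.
    print(composed(a) == composed(b))  # True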
To reduce collisions, you might try something like using the two hashes independently to create a 64-bit output space:
adler32(data) << 32 | crc32(data)
Whether there is significant benefit in doing that, I'm not sure.
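In Python terms, a minimal sketch of that independent combination, using the checksums from the standard zlib module:

    import zlib

    def combined64(data: bytes) -> int:
        # Two independent 32-bit checksums packed into a single 64-bit value.
        return (zlib.adler32(data) << 32) | zlib.crc32(data)

    print(hex(combined64(b"http://example.com/some/url")))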
Note that the original comment you referred to was storing the hashes independently:
"Whichever algorithm you use there is going to be some chance of false positives. However, you can reduce these chances by a considerable margin by using two different hashing algorithms. If you were to calculate and store both the CRC32 and the Adler32 for each URL, the odds of a simultaneous collision for both hashes for any given pair of URLs are vastly reduced.
Of course that means storing twice as much information, which is a part of your original problem. However, there is a way of storing both sets of hash data such that it requires minimal memory (10 KB or so) whilst giving almost the same lookup performance (15 microseconds/lookup compared to 5 microseconds) as Perl's hashes."
I am facing an application that uses hashing, but I still cannot figure out how it works. Here is my problem: hashing is used to generate some indices, and with those indices I access different tables; then I add up the values I get from each table using those indices, and that gives me my final value. This is done to reduce the memory requirements. The input to the hashing function is the XOR of a random constant number and some parameters from the application.
Is this a typical hashing application? The thing that I do not understand is how using hashing can reduce the memory requirements. Can anyone clarify this?
Thank you
Hashing alone doesn't have anything to do with memory.
What it is often used for is a hashtable. Hashtables work by computing the hash of what you are keying off of, which is then used as an index into a data structure.
Hashing allows you to reduce the key (string, etc.) into a more compact value like an integer or set of bits.
That might be the memory savings you're referring to: reducing a large key to a simple integer.
Note, though, that hashes are not unique! A good hashing algorithm minimizes collisions, but they are not intended to reduce to a unique value; doing so isn't possible (e.g., if your hash outputs a 32-bit integer, your hash can take only 2^32 unique values).
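As a minimal sketch of that reduction (Python's built-in hash stands in for whatever hash function the application actually uses, and the bucket count is arbitrary):

    NUM_BUCKETS = 1024
    buckets = [[] for _ in range(NUM_BUCKETS)]

    def put(key: str, value) -> None:
        # The (possibly long) key is reduced to a small bucket index; colliding
        # keys share a bucket and are disambiguated by comparing the full key.
        buckets[hash(key) % NUM_BUCKETS].append((key, value))

    def get(key: str):
        for k, v in buckets[hash(key) % NUM_BUCKETS]:
            if k == key:
                return v
        return None

    put("some rather long key string", 42)
    print(get("some rather long key string"))  # 42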
Is it a Bloom filter you are talking about? This uses hash functions to get a space-efficient way to test membership of a set. If so, then see the link for an explanation.
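In case a Bloom filter is indeed what's being described, here is a minimal sketch of the idea (the sizes are arbitrary; a few hash-derived positions are set per item, and membership tests can return false positives but never false negatives):

    import hashlib

    M_BITS = 8 * 1024      # size of the bit array
    K_HASHES = 4           # number of hash-derived positions per item
    bits = bytearray(M_BITS // 8)

    def _positions(item: bytes):
        # Derive K_HASHES indices from slices of a single SHA-256 digest.
        digest = hashlib.sha256(item).digest()
        for i in range(K_HASHES):
            yield int.from_bytes(digest[4 * i: 4 * i + 4], "big") % M_BITS

    def add(item: bytes) -> None:
        for pos in _positions(item):
            bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(item: bytes) -> bool:
        return all(bits[pos // 8] & (1 << (pos % 8)) for pos in _positions(item))

    add(b"alice")
    print(might_contain(b"alice"), might_contain(b"bob"))  # True, (almost certainly) False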
Most good hash implementations are memory-inefficient; otherwise there would be more computing involved, and that would be missing the point of hashing entirely.
Hash implementations are used for processing efficiency, as they'll provide you with constant running time for operations like insertion, removal and retrieval.
You can think of it this way: with hashing, all your data, no matter its type or size, is always represented in a single fixed-length form.
This could be explained if the hashing being done isn't building a true hash table, but is just creating an index into a string/memory-block table. If you had the same string (or memory sequence) 20 times in your data, and you then replaced all 20 instances of that string with just its hash/table index, you could achieve data compression that way. If there is an actual collision chain contained in that table for each hash value, however, then what I just described is not what's going on; in that case, the reason for the hashing would most likely be to speed up execution (by providing quick access to stored values) rather than compression.
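A minimal sketch of that compression-by-indexing idea (a hash map deduplicates the strings; the repeated data becomes a stream of small integers plus one table of distinct strings):

    def compress(strings):
        table = {}      # string -> index; the hash map does the deduplication
        indices = []
        for s in strings:
            if s not in table:
                table[s] = len(table)
            indices.append(table[s])
        # Each distinct string is stored once; the bulk of the data is now integers.
        return list(table), indices

    data = ["GET /home", "GET /home", "POST /login", "GET /home"]
    table, indices = compress(data)
    print(table)    # ['GET /home', 'POST /login']
    print(indices)  # [0, 0, 1, 0]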