Efficient encoding of integers with constant digit sum

How can a large set of integers, all with a known constant digit sum and a constant number of digits, be encoded?
Example of integers in base 10, with digit sum 5, and 3 digits:
005, 014, 023, 032, 041, 050, 104, 113, ...
The most important factor is space, but computing time is not completely unimportant.

The simplest way would be to store the digit sum itself, and leave it at that.
But it's possible that I misunderstand the question.
Edit: Here's my takeaway: You want to encode the set itself; yeah?
Encoding the set itself is as easy as storing the base, the digit sum, and the number of digits, e.g. {10, 5, 3} in the example you give.
Most of the time, however, you'll find that the most compact representation of a number is the number itself, unless it is very large.
Also, because digit sum is commonly taken to be recursive, reduced to a single digit between one and nine inclusive, 203 has the same digit sum as 500, as 140, or as 950. This means that the set is huge for any such combination, and also that any set (except for certain degenerate cases) uses every available digit of the base it relates to.
So, you know, the most efficient encoding of the numbers themselves, when stored singly, becomes the number itself, especially considering that every number between -2,147,483,648 and 2,147,483,647 generally takes the same amount of space in memory, and often in storage.

When you have as clearly defined a set of possible values to encode as this, the straightforward coding-theoretic approach is to sequentially number all possible values, then store this number in as many bits as necessary. This is quite clearly optimal if the frequencies of the individual values are identical or unknown. If you know something about the frequency distribution, you'll instead have to use something like a Huffman code to get a truly optimal result; but that's rather complicated, and I'll handle only the other case.
For the uniformly distributed (or unknown) case the approach is as follows:
Imagine (you can pre-generate and store it, or generate it on the fly) a lexicographically sorted list of all your input values for the encoding. E.g. in your case the list would be (unless your digit sum is recursive): 005, 014, 023, 032, 041, 050, 104, 113, 122, 131, 140, 203, 212, 221, 230, 302, 311, 320, 401, 410, 500.
Then assign each item in the list an integer based on its position in the list: 005 becomes 0, 014 becomes 1, 023 becomes 2, and so on. There are 21 values in the list, for which you need 5 bits (2^5 = 32 >= 21) to encode any index into the list. This index is your encoded value, and encoding and decoding become obvious.
As for an algorithm to generate the list in the first place: the simplest way is to count from 000 to 999 and throw away everything that doesn't match your criterion. It is possible to be more clever about it by mimicking counting with carries (e.g. how 104 follows 050), but it's probably not worth the effort.
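A minimal sketch of that brute-force approach in Scala (object and method names are illustrative), which also shows the index-based encoding:

object DigitSumCode {
  // Count from 000 to 999 and keep only the strings whose digits sum to
  // the target; counting order is already lexicographic.
  def enumerate(digits: Int, sum: Int): Vector[String] = {
    val max = math.pow(10, digits).toInt
    (0 until max)
      .map(n => ("%0" + digits + "d").format(n))
      .filter(_.map(_.asDigit).sum == sum)
      .toVector
  }

  def main(args: Array[String]): Unit = {
    val list = enumerate(3, 5)
    println(list.size)              // 21 values, so 5 bits per index
    val code = list.indexOf("203")  // encoding: position in the list (11)
    println(list(code))             // decoding: index back into the list
  }
}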

Related

How can I calculate the impact on collision probability when truncating a hash?

I'd like to reduce an MD5 digest from 32 characters down to, ideally closer to 16. I'll be using this as a database key to retrieve a set of (public) user-defined parameters. I'm expecting the number of unique "IDs" to eventually exceed 10,000. Collisions are undesirable but not the end of the world.
I'd like to understand the viability of a naive truncation of the MD5 digest to achieve a shorter key. But I'm having trouble digging up a formula that I can understand (given my limited math background), let alone use to determine the impact truncating the hash would have on collision probability.
The shorter the better, within reason. I feel there must be a simple formula, but I'd rather have a definitive answer than do my own guesswork cobbled together from bits and pieces I have read around the web.
You can calculate the chance of collisions with this formula:
chance of collision = 1 - e^(-n^2 / (2 * d))
Where n is the number of messages, d is the number of possibilities, and e is the constant e (2.718281828...).
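As a minimal sketch (in Scala, with illustrative names), that formula translates directly to:

object Collision {
  // n = number of messages, d = number of possible hash values
  def probability(n: Double, d: Double): Double =
    1.0 - math.exp(-(n * n) / (2.0 * d))
}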
#mypetition's answer is great.
I found a few other equations that are more-or-less accurate and/or simplified, along with a great explanation and a handy comparison of real-world probabilities:
1 - e^(-k(k-1)/(2N))
k(k-1)/(2N)
k^2/(2N)
...where k is the number of IDs you'll be generating (the "messages") and N is the number of distinct values the hash digest, or your truncated hexadecimal number, can represent (technically the largest such value plus 1, to account for 0).
A bit more about "N"
If your original hash is, for example, "38BF05A71DDFB28A504AFB083C29D037" (32 hex chars) and you truncate it down to, say, 12 hex chars (e.g. "38BF05A71DDF"), the largest number you could produce in hexadecimal is 0xFFFFFFFFFFFF, i.e. 281474976710655, which is 16^12 - 1 (or 256^6 - 1 if you prefer to think in terms of bytes). But since 0 itself counts as one of the numbers you could theoretically produce, you add back that 1, which leaves you simply with 16^12.
So you can think of N as 16 ^ (numberOfHexDigits).
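Putting the pieces together, a quick Scala sketch using the 12-hex-char truncation from the example above and the roughly 10,000 IDs mentioned in the question:

val k = 10000.0                       // number of IDs ("messages")
val n = math.pow(16, 12)              // N = 16^12 possible truncated values
val p = 1.0 - math.exp(-(k * k) / (2.0 * n))
println(p)                            // ~1.8e-7, about 1 in 5.6 million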

Does 'BitSet' store bits or integers?

I am confused about BitSet. Does the BitSet data structure store 1s and 0s?
val b = BitSet(0, 2, 3)
means store 1s at bit locations 0, 2 and 3?
If so, what is the maximum number of bits, 32 or 64?
A BitSet in Scala is implemented as an Array[Long], where each bit signals the presence of a number in the set. A Long is 64 bits in Scala (on the JVM). One such Long can store values 0 to 63, the next one after it 64 to 127, and so on. This is possible since we're only talking about non-negative numbers and don't need to account for the sign.
Given your example:
BitSet(0, 2, 3)
We can store all these numbers inside a single Long, which in binary would be:
1101
Since we're in the range of 0 to 63, this works on a single Long value.
In general, the upper limit, i.e. the biggest value that can be stored inside a BitSet in Scala, is Int.MaxValue, meaning 2^31 - 1 (2147483647). In order to store it, you'd need an array of 2147483647 / 64, roughly 33,554,432 Longs (about 256 MB), since the array has to span every bit up to the largest one. This is why storing large numbers in a bit set can get quite expensive, and it is usually only recommended when you're dealing with numbers up to around the hundreds.
As a side note, immutable.BitSet has special implementations in Scala (of the BitSetLike trait), namely BitSet1 and BitSet2, which are backed by one and two Longs, respectively, avoiding the need to allocate an additional array to wrap them.
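You can inspect the backing words directly with toBitMask, which returns the Array[Long] described above:

import scala.collection.immutable.BitSet

val b = BitSet(0, 2, 3)
println(b.toBitMask.toList)   // List(13), and 13 in binary is 1101
println(13L.toBinaryString)   // "1101": bits 0, 2 and 3 are set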
From the documentation:
Bitsets are sets of non-negative integers which are represented as variable-size arrays of bits packed into 64-bit words. The memory footprint of a bitset is determined by the largest number stored in it.
Given that the API deals with adding and removing Ints, it seems reasonable to believe the maximum bit that can be set is the max integer, i.e. 2^31 - 1. Looking at the source for scala.collection.immutable.BitSet, there's also a require to disallow negative integers (which makes sense according to the above description):
def + (elem: Int): BitSet = {
  require(elem >= 0, "bitset element must be >= 0")
  ...
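For illustration, the element-to-word arithmetic implied by that layout looks like this (a sketch, not the actual library code):

def wordIndex(elem: Int): Int = elem >> 6         // elem / 64
def bitMask(elem: Int): Long = 1L << (elem & 63)  // bit within that word

println(wordIndex(130))  // 2: element 130 lives in the third Long
println(bitMask(130))    // 4: at bit position 2 within that word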

Design for max hash size given N-digit numerical input and collision related target

Assume a hacker obtains a data set of stored hashes, salts, pepper, and algorithm and has access to unlimited computing resources. I wish to determine a max hash size so that the certainty of determining the original input string is nominally equal to some target certainty percentage.
Constraints:
The input string is limited to exactly 8 numeric characters, uniformly distributed. There is no inter-digit relation such as a checksum digit.
The target nominal certainty percentage is 1%.
Assume the hashing function is uniform.
What is the maximum hash size in bytes so there are nominally 100 (i.e. 1% certainty) 8-digit values that will compute to the same hash? It should be possible to generalize to N numerical digits and X% from the accepted answer.
Please include whether there are any issues with using the first N bytes of the standard 20 byte SHA1 as an acceptable implementation.
It is recognized that this approach will greatly increase susceptibility to a brute force attack by increasing the possible "correct" answers so there is a design trade off and some additional measures may be required (time delays, multiple validation stages, etc).
It appears you want to ensure collisions, the idea being that if a hacker obtained everything, such that it's assumed they can brute-force all the hashed values, they will not end up with the original values, but only with a set of possible original values for each hashed value.
You could achieve this by executing a precursor step before your normal cryptographic hashing. This precursor step simply folds your set of possible values to a smaller set of possible values. This can be accomplished by a variety of means. Basically, you are applying an initial hash function over your input values. Using modulo arithmetic as described below is a simple variety of hash function. But other types of hash functions could be used.
If you have 8 digit original strings, there are 100,000,000 possible values: 00000000 - 99999999. To ensure that 100 original values hash to the same thing, you just need to map them to a space of 1,000,000 values. The simplest way to do that would be convert your strings to integers, perform a modulo 1,000,000 operation and convert back to a string. Having done that the following values would hash to the same bucket:
00000000, 01000000, 02000000, ....
The problem with that is that the hacker would not only know what 100 values a hashed value could be, but they would know with surety what 6 of the 8 digits are. If the real life variability of digits in the actual values being hashed is not uniform over all positions, then the hacker could use that to get around what you're trying to do.
Because of that, it would be better to choose your modulo value such that the full range of digits are represented fairly evenly for every character position within the set of values that map to the same hashed value.
If different regions of the original string have more variability than other regions, then you would want to adjust for that, since the static regions are easier to just guess anyway. The part the hacker would want is the highly variable part they can't guess. By breaking the 8 digits into regions, you can perform this pre-hash separately on each region, with your modulo values chosen to vary the degree of collisions per region.
As an example you could break the 8 digits thus: 000-000-00. The pre-hash would convert each region into a separate value, perform a modulo on each, concatenate them back into an 8-digit string, and then do the normal hashing on that. In this example, given the input of "12345678", you would do 123 % 139, 456 % 149, and 78 % 47, which produces 123 009 31. There are 139 * 149 * 47 = 973,417 possible results from this pre-hash, so there will be roughly 103 original values that map to each output value. To give an idea of how this works out, the following 3-digit original values in the first region would all map to 000: 000, 139, 278, 417, 556, 695, 834, 973. I made this up on the fly as an example, so I'm not specifically recommending these choices of regions and modulo values.
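A sketch of that pre-hash in Scala, using the same made-up regions (3-3-2) and moduli (139, 149, 47); preHash is an illustrative name, not a library function:

def preHash(input: String): String = {
  require(input.length == 8 && input.forall(_.isDigit))
  val a = input.substring(0, 3).toInt % 139  // first region
  val b = input.substring(3, 6).toInt % 149  // second region
  val c = input.substring(6, 8).toInt % 47   // third region
  f"$a%03d$b%03d$c%02d"  // concatenate, then apply the normal hash to this
}

println(preHash("12345678"))  // "12300931"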
If the hacker got everything, including source code, and brute-forced it all, he would end up with the values produced by the pre-hash. So for any particular hashed value, he would know that it is one of around 100 possible values. He would know all those possible values, but he wouldn't know which of those was THE original value that produced the hashed value.
You should think hard before going this route. I'm wary of anything that departs from standard, accepted cryptographic recommendations.

Should I use _0, _1, _2... as Strings or 0, 1, 2... as Numbers for _id values of a constant list of things?

Initially, I had documents in the form of resources:{wood:123, coal:1, silver:5} and boxes:{wood:999, coal:20}. In this example, my server's code tests (quite efficiently) if there is enough space for the wood (there is), enough space for the coal (there is), and enough space for the silver (there is not; if space is 0 I don't even include it in boxes), and then all is well.
I want to shorten the _id value from wood, coal, silver to a numeric representation, which in turn occupies less space; packets of information are smaller when communicating between the client and server, etc.
I am curious about using 0, 1, 2... as Numbers for the _id, or _0, _1, _2... as Strings.
What are the advantages of using Number or String? Are Numbers faster for queries? (ignoring index speed).
I am adding these values manually btw :P
The number of bytes necessary to represent an integer is, roughly, its logarithm in base 256, rounded up: each additional byte multiplies the representable range by 256. The number of bytes necessary to represent a string is the number of characters in the string (one byte per character in ASCII).
It takes only one byte to represent the numbers 87 or 202, but it takes two and three bytes to represent the same as a string (plus one more if you use the underscore).
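A quick way to check those counts (a Scala sketch; intBytes is an illustrative helper):

def intBytes(n: Long): Int = {
  // One base-256 digit per byte: divide by 256 until the value fits.
  var v = n; var bytes = 1
  while (v > 255) { v /= 256; bytes += 1 }
  bytes
}

println((intBytes(87), "87".length))    // (1, 2)
println((intBytes(202), "202".length))  // (1, 3)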
Integers are almost certainly what you want here. However, if you're concerned about over-the-wire size, then you might see gains by shortening your keys. Rather than using wood, coal, and silver, you could use w, c, and s, saving you 11 bytes per record pulled.

How do BigNum implementations work?

I wanted to know how BigInt and other such things are implemented. I tried to check out the Java source code, but it was all Greek and Latin to me.
Can you please explain the algorithm in words, no code, so that I understand what I am actually using when I use something from the Java API?
Conceptually, it works the same way you do arbitrary-size arithmetic by hand. You have something like an array of values, and algorithms for the various operations that work on the array.
Say you want to add 100 to 901. You start with the two numbers as arrays:
[0, 1, 0, 0]
[0, 9, 0, 1]
When you add, your addition algorithm starts from the right: 0+1 gives 1, 0+0 gives 0, and -- now the tricky part -- 1+9 gives 10, so we need to carry: we put (1+9) % 10 = 0 into the third column and add the carried 1 to the fourth, giving [1, 0, 0, 1], i.e. 1001.
When your numbers grow big enough -- greater than 9999 in this example -- then you have to allocate more space somehow.
This is, of course, somewhat simplified if you store the numbers in reverse order.
Real implementations use full words, so the modulus is really some large power of two, but the concept is the same.
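A toy Scala version of that scheme, storing base-10 digits least significant first (real implementations use machine words and a power-of-two base, as noted above):

def add(x: Vector[Int], y: Vector[Int]): Vector[Int] = {
  val n = math.max(x.length, y.length)
  val out = Vector.newBuilder[Int]
  var carry = 0
  for (i <- 0 until n) {
    val s = x.lift(i).getOrElse(0) + y.lift(i).getOrElse(0) + carry
    out += s % 10   // digit for this column
    carry = s / 10  // carry into the next column
  }
  if (carry > 0) out += carry  // allocate one more digit on overflow
  out.result()
}

// 100 + 901 = 1001, digits least significant first:
println(add(Vector(0, 0, 1), Vector(1, 0, 9)))  // Vector(1, 0, 0, 1)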
There's a very good section on this in Knuth (The Art of Computer Programming, Vol. 2: Seminumerical Algorithms).