Should I use _0, _1, _2... as Strings or 0, 1, 2... as Numbers for _id values of a constant list of things? - mongodb

Initially, I had documents in the form of: resources:{wood:123, coal:1, silver:5} and boxes:{wood:999, coal:20}. In this example, my server's code tests (quite efficiently) if there is enough space for the wood (and it is) and enough space for the coal (and it is) and enough space for the silver (there is not, if space is 0 I don't even include it in boxes) then all is well.
I want to shorten the _id value from wood, coal, silver to a numeric representation, which in turn occupies less space and makes the packets exchanged between the client and the server smaller.
I am curious about using 0, 1, 2... as Numbers for the _id, or _0, _1, _2... as Strings.
What are the advantages of using Number or String? Are Numbers faster for queries? (ignoring index speed).
I am adding these values manually btw :P

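The number of bytes necessary to represent a non-negative integer grows with its logarithm: take the logarithm base 256 of the value and round up (one byte covers 0 to 255, two bytes cover up to 65,535). The number of bytes necessary to represent a string is the number of characters in the string, assuming ASCII text.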
It takes only one byte to represent the numbers 87 or 202, but it takes two and three bytes to represent the same as a string (plus one more if you use the underscore).
Integers are almost certainly what you want here. However, if you're concerned about over-the-wire size, then you might see gains by shortening your keys. Rather than using wood, coal, and silver, you could use w, c, and s, saving you 11 bytes per record pulled.
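To make the byte counts above concrete, here is a minimal Python sketch. It assumes a minimal-width unsigned encoding for the integer and ASCII for the string; `int_bytes` and `str_bytes` are hypothetical helper names. A real BSON document stores numbers in fixed-width fields, so treat the figures as an illustration of the raw counts only.

```python
def int_bytes(n: int) -> int:
    """Minimum number of bytes for a non-negative integer (at least one)."""
    return max(1, (n.bit_length() + 7) // 8)


def str_bytes(n: int, prefix: str = "") -> int:
    """Bytes for the same value written as ASCII text, with an optional prefix."""
    return len((prefix + str(n)).encode("ascii"))


for value in (87, 202):
    # value, bytes as integer, bytes as string, bytes as "_"-prefixed string
    print(value, int_bytes(value), str_bytes(value), str_bytes(value, "_"))

# Key-name savings from the answer: wood/coal/silver versus w/c/s.
saved = sum(len(k) for k in ("wood", "coal", "silver")) - len("wcs")
print(saved)
```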

Related

Decoding Arbitrary-Length Values Using a Fixed Block Size?

Background
In the past I've written an encoder/decoder for converting an integer to/from a string using an arbitrary alphabet; namely this one:
abcdefghjkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ23456789
Lookalike characters are excluded, so 1, I, l, O, and 0 are not present in this alphabet. This was done for user convenience and to make it easier to read and to type out a value.
As mentioned above, my previous project, python-ipminify, converts a 32-bit IPv4 address to a string using an alphabet similar to the above, but excluding upper-case characters. In my current undertaking, I don't have the constraint of excluding upper-case characters.
I wrote my own Python implementation for this project using the excellent question and answer here on how to build a URL-shortener.
I have published a stand-alone example of the logic here as a Gist.
Problem
I'm now writing a performance-critical implementation of this in a compiled language, most likely Rust, but I'd need to port it to other languages as well. I also have to accept an arbitrary-length array of bytes, rather than an arbitrary-width integer, as is the case in Python.
I suppose that as long as I use an unsigned integer and use consistent endianness, I could treat the byte array as one long arbitrary-precision unsigned integer and do division over it, though I'm not sure how performance will scale with that. I'd hope that arbitrary-precision unsigned integer libraries would try to use vector instructions where possible, but I'm not sure how this would work when the input length does not match a specific instruction length, i.e. when the input size in bits is not evenly divisible by supported instructions, e.g. 8, 16, 32, 64, 128, 256, 512 bits.
I have also considered breaking up the byte array into 256-bit (32 byte) blocks and using SIMD instructions (I only need to support x86_64 on recent CPUs) directly to operate on larger unsigned integers, but I'm not exactly sure how to deal with size % 32 != 0 blocks; I'd probably need to zero-pad, but I'm not clear on how I would know when to do this during decoding, i.e. when I don't know the underlying length of the source value, only that of the decoded value.
Question
If I'm going the arbitrary unsigned integer width route, I'd essentially be at the mercy of the library author, which is probably fine; I'd imagine that these libraries would be fairly optimized to vectorize as much as possible.
If I try to go the block route, I'd probably zero-pad any remaining bits in the block if the input length was not divisible by the block size during encoding. However, would it even be possible to decode such a value without knowing the decoded value size?
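On the arbitrary-precision route, one way to avoid the "how much padding was there?" problem is to prepend a sentinel byte before converting, so the original length and any leading zero bytes survive the round trip. The sketch below is an assumption on my part, not the logic from the Gist: `encode` and `decode` are hypothetical names, the alphabet is the one quoted in the question, and it is written in Python for brevity rather than speed.

```python
ALPHABET = "abcdefghjkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ23456789"
BASE = len(ALPHABET)
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode(data: bytes) -> str:
    """Encode bytes as one big-endian integer written in the custom alphabet."""
    # The 0x01 sentinel keeps leading zero bytes (and thus the length) intact.
    n = int.from_bytes(b"\x01" + data, "big")
    digits = []
    while n:
        n, rem = divmod(n, BASE)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def decode(text: str) -> bytes:
    """Invert encode(); raises KeyError on characters outside the alphabet."""
    n = 0
    for ch in text:
        n = n * BASE + INDEX[ch]
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    if not raw or raw[0] != 1:
        raise ValueError("missing sentinel byte")
    return raw[1:]


print(decode(encode(b"\x00\x00hello")))
```

The same idea carries over to fixed-size blocks: as long as the encoded form records the true length (a sentinel or an explicit length prefix), zero padding can be stripped unambiguously when decoding.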

Compressing Sets of Integers Into Smaller Integers

Along the lines of How to encode integers into other integers, I am wondering if it is possible to encode one integer or a set of integers into one smaller integer or a smaller set of integers, and if so, how it is done. For example, encoding an 8-bit integer into a 4-bit integer, or a 256-bit integer into a 16-bit integer. It doesn't seem possible, but perhaps there is something along these lines. Basically, how to get a set of integers to take up less space. Not necessarily encoding into another sequence of bytes, but maybe even into a data structure that is more compact.
Sure, you can always encode them into fewer bits. However you won't be able to decode them back to the original bits. Though you neglected to mention that step, I'm guessing that's what you're looking for.
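The reason decoding fails is a counting argument: 256 possible inputs cannot map one-to-one onto 16 possible outputs. A minimal Python sketch, where `squeeze` is a hypothetical lossy mapping that keeps only the high four bits:

```python
from collections import Counter


def squeeze(value: int) -> int:
    """Hypothetical lossy map from 8 bits to 4 bits: keep the high nibble."""
    return value >> 4


# Count how many 8-bit inputs land on each 4-bit output.
buckets = Counter(squeeze(v) for v in range(256))
print(len(buckets), set(buckets.values()))
```

Every 4-bit output has several 8-bit inputs behind it, so no decoder can tell which one was intended.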

Does 'BitSet' store bits or integers?

I am confused about BitSet. Does a BitSet data structure store 1s and 0s?
val b = BitSet(0, 2, 3)
Does this mean it stores 1s at bit positions 0, 2 and 3?
If so, what is the maximum number of bits, 32 or 64?
A BitSet in Scala is implemented as an Array[Long], where each bit signals the presence of a number in the set. Long is 64 bits in Scala (on the JVM). One such Long can store values 0 to 63, the next one after it 64 to 127, and so on. This is possible since we're only talking about non-negative numbers and don't need to account for the sign.
Given your example:
BitSet(0, 2, 3)
We can store all these numbers inside a single Long, which in binary would be:
1101
Since we're in the range of 0 to 63, this works on a single Long value.
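The word-and-bit layout described above can be sketched in a few lines. This is a Python illustration of the idea only, not Scala's actual implementation; `to_words` and `contains` are hypothetical names.

```python
def to_words(elements):
    """Pack non-negative ints into a list of 64-bit words, one bit per element."""
    words = []
    for elem in elements:
        if elem < 0:
            raise ValueError("bitset element must be >= 0")
        index, bit = divmod(elem, 64)  # which word, and which bit inside it
        while len(words) <= index:
            words.append(0)
        words[index] |= 1 << bit
    return words


def contains(words, elem):
    """True if the bit for elem is set."""
    if elem < 0:
        return False
    index, bit = divmod(elem, 64)
    return index < len(words) and bool((words[index] >> bit) & 1)


words = to_words([0, 2, 3])
print(format(words[0], "b"))
```

An element of 64 or more simply lands in a later word, which is why the footprint depends on the largest value stored rather than on how many values there are.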
In general, the upper limit, or the biggest value that can be stored inside a BitSet in Scala, is Int.MaxValue, meaning 2^31-1 (2147483647). In order to store it, you'd need 2147483647 / 64 ≈ 33554432 Longs, which is about 256 MiB. This is why storing large numbers in a bit set can get quite expensive, and it is usually only recommended when you're dealing with numbers up to around the hundreds.
As a side note, immutable.BitSet has special implementations in Scala (of the BitSetLike trait), namely BitSet1 and BitSet2, which are backed by one and two Longs, respectively, avoiding the need to allocate an additional array to wrap them.
From the documentation:
Bitsets are sets of non-negative integers which are represented as
variable-size arrays of bits packed into 64-bit words. The memory
footprint of a bitset is determined by the largest number stored in
it.
Given that the API deals with adding and removing Ints, then it seems reasonable to believe the maximum bit that can be set is a max integer, i.e. 2^31-1. Looking at the source for scala.collection.immutable.BitSet, there's also an assert to disallow negative integers (which makes sense according to the above description):
def + (elem: Int): BitSet = {
require(elem >= 0, "bitset element must be >= 0")

MD5 digest vs. hexdigest collision risk

I am comparing personal info of individuals, specifically their name, birthdate, gender, and race by hashing a string containing all of this info, and comparing the hash objects' hexdigests. This produces a 32 digit hexadecimal number, which I am using as a primary key in a database. For example, using my identifying string would work like this:
>>> import hashlib
>>> id_string = "BrianPeterson08041993MW"
>>> byte_string = id_string.encode('utf-8')
>>> hash_id = hashlib.md5(byte_string).hexdigest()
>>> print(hash_id)
3b807ad8a8b3a3569f098a575091bc79
At this point, I am trying to ascertain collision risk. My understanding is that MD5 doesn't have significant collision risk, at least for strings that are relatively small, which mine are (about 20-40 characters in length). However, I am not using the 128-bit digest object, but the 32 digit hexdigest.
Now, I believe the hexdigest is a compression of the digest (that is, it's stored in fewer characters), so isn't there an increased risk of collision when comparing hexdigests? Or am I off-base?
[...]
I guess my question is: don't different representations have different chances to be non-unique based on how many units of information they use to do the representation vs. how many units of information the original message takes to encode? And if so, what is the best representation to use? Um, let me preface your next answer with: talk to me like I'm 10
Old question, but yes, you were a bit off base, so to speak.
It's the number of random bits that matters, not the length of the representation.
The digest is just a number, an integer, which can be converted to a string using different numbers of distinct digits. For example, a 128-bit number shown in some different radices:
"340106575100070649932820283680426757569" (base 10)
"ffde24cb47ecbff8d6e461a67c930dc1" (base 16, hexadecimal)
"7vroicmhvcnvsddp31kpu963e1" (base 32)
Shorter is nicer and more convenient (in auth tokens etc), but each representation has the exact same information and chance of collision. Shorter representations are shorter for the same reason as why "55" is shorter than "110111", while still encoding the same thing.
This answer might also clarify things, as well as toying with code like:
new BigInteger("340106575100070649932820283680426757569").toString(2)
...or something equivalent in other languages (Java/Scala above).
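The same point in Python: a minimal sketch showing that the raw digest and the hexdigest are two spellings of one 128-bit number. The input string is an arbitrary placeholder.

```python
import hashlib

payload = b"example input"  # placeholder, any bytes will do

digest = hashlib.md5(payload).digest()        # 16 raw bytes = 128 bits
hexdigest = hashlib.md5(payload).hexdigest()  # the same bits as 32 hex digits

as_int = int.from_bytes(digest, "big")

print(len(digest), len(hexdigest))
print(int(hexdigest, 16) == as_int)       # same integer either way
print(bytes.fromhex(hexdigest) == digest)  # lossless round trip
print(format(55, "b"))                     # "55" and "110111" are one value
```

Because the conversion is lossless in both directions, two inputs collide on the hexdigest exactly when they collide on the digest.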
On a more practical level,
[...] which I am using as a primary key in a database
I don't see why you wouldn't do away with any chance of collision by using a normal autoincremented id column (BIGINT AUTO_INCREMENT in MySQL, BIGSERIAL in PostgreSQL).
An abbreviated 32-bit hexdigest (8 hex characters) would not be long enough to effectively guarantee a collision-free database of users.
The formula for the birthday collision probability is here:
What is the probability of md5 collision if I pass in 2^32 sets of string?
Using a 32-bit key would mean that your software would start to break at around 10,000 users. The collision probability would be about 1%. It gets a lot worse very fast after that. At 100,000 users, the collision probability is 69%.
With a 64-bit key, 1 billion users is another breaking point, with a collision probability of about 2.7%; at 10 billion users it rises to roughly 93%.
For 100 billion users (a generous upper bound of the earth's population for the foreseeable future), a 96-bit key is a little risky in my opinion: the collision chance is about one in 16 million. Really, you need a 128-bit key, which gives you a collision probability of about 1.5 × 10^-17.
128-bit keys are 128/4 = 32 hex characters long. If you wanted to use a shorter key for aesthetic purposes, you would need 22 case-sensitive alphanumeric characters (62 symbols) to cover 128 bits. Or, if you use printable characters (ASCII 32-126), you could get away with 20 characters.
So when you're talking about users, you need at least 128 bits for a collision-free random key, or a 20-32 character long string, or a 128/8 = 16 byte binary representation.
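For anyone who wants to check figures like these, here is a minimal Python sketch of the standard birthday approximation, p ≈ 1 - exp(-n(n-1) / 2^(bits+1)), assuming uniformly random keys. The function names are hypothetical.

```python
import math


def collision_probability(n_keys: int, bits: int) -> float:
    """Approximate chance that at least two of n_keys random keys collide."""
    exponent = n_keys * (n_keys - 1) / 2 ** (bits + 1)
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-exponent)


def chars_needed(bits: int, alphabet_size: int) -> int:
    """Characters required to carry `bits` bits with the given alphabet."""
    return math.ceil(bits / math.log2(alphabet_size))


for users, bits in ((10_000, 32), (100_000, 32)):
    print(users, bits, collision_probability(users, bits))

print(chars_needed(128, 16), chars_needed(128, 95))
```

Calling `collision_probability` with other user counts and key widths makes it easy to see how quickly the risk grows once the number of keys approaches the square root of the key space.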

Design for max hash size given N-digit numerical input and collision related target

Assume a hacker obtains a data set of stored hashes, salts, pepper, and algorithm and has access to unlimited computing resources. I wish to determine a max hash size so that the certainty of determining the original input string is nominally equal to some target certainty percentage.
Constraints:
The input string is limited to exactly 8 numeric characters
uniformly distributed. There is no inter-digit relation such as a
checksum digit.
The target nominal certainty percentage is 1%.
Assume the hashing function is uniform.
What is the maximum hash size in bytes so there are nominally 100 (i.e. 1% certainty) 8-digit values that will compute to the same hash? It should be possible to generalize to N numerical digits and X% from the accepted answer.
Please include whether there are any issues with using the first N bytes of the standard 20 byte SHA1 as an acceptable implementation.
It is recognized that this approach will greatly increase susceptibility to a brute force attack by increasing the possible "correct" answers so there is a design trade off and some additional measures may be required (time delays, multiple validation stages, etc).
It appears you want to ensure collisions, with the idea that if a hacker obtained everything, such that it's assumed they can brute force all the hashed values, then they will not end up with the original values, but only a set of possible original values for each hashed value.
You could achieve this by executing a precursor step before your normal cryptographic hashing. This precursor step simply folds your set of possible values to a smaller set of possible values. This can be accomplished by a variety of means. Basically, you are applying an initial hash function over your input values. Using modulo arithmetic as described below is a simple variety of hash function. But other types of hash functions could be used.
If you have 8 digit original strings, there are 100,000,000 possible values: 00000000 - 99999999. To ensure that 100 original values hash to the same thing, you just need to map them to a space of 1,000,000 values. The simplest way to do that would be convert your strings to integers, perform a modulo 1,000,000 operation and convert back to a string. Having done that the following values would hash to the same bucket:
00000000, 01000000, 02000000, ....
The problem with that is that the hacker would not only know what 100 values a hashed value could be, but they would know with surety what 6 of the 8 digits are. If the real life variability of digits in the actual values being hashed is not uniform over all positions, then the hacker could use that to get around what you're trying to do.
Because of that, it would be better to choose your modulo value such that the full range of digits are represented fairly evenly for every character position within the set of values that map to the same hashed value.
If different regions of the original string have more variability than other regions, then you would want to adjust for that, since the static regions are easier to just guess anyway. The part the hacker would want is the highly variable part they can't guess. By breaking the 8 digits into regions, you can perform this pre-hash separately on each region, with your modulo values chosen to vary the degree of collisions per region.
As an example, you could break the 8 digits thus: 000-000-00. The prehash would convert each region into a separate value, perform a modulo on each, concatenate them back into an 8-digit string, and then do the normal hashing on that. In this example, given the input of "12345678", you would do 123 % 139, 456 % 149, and 78 % 47, which produces 123 009 31. There are 139*149*47 = 973,417 possible results from this pre-hash. So, there will be roughly 103 original values that will map to each output value. To give an idea of how this ends up working, the following 3-digit original values in the first region would map to the same value of 000: 000, 139, 278, 417, 556, 695, 834, 973. I made this up on the fly as an example, so I'm not specifically recommending these choices of regions and modulo values.
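A minimal Python sketch of this region-wise pre-hash, using the same made-up regions and moduli as the example above. `prehash` and `REGIONS` are hypothetical names, and the result would still be fed into the normal cryptographic hash afterwards.

```python
# (start, end, modulus) for the regions 000-000-00 from the example.
REGIONS = ((0, 3, 139), (3, 6, 149), (6, 8, 47))


def prehash(digits: str) -> str:
    """Fold an 8-digit string into a smaller space, one region at a time."""
    if len(digits) != 8 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected exactly 8 numeric characters")
    parts = []
    for start, end, modulus in REGIONS:
        width = end - start
        folded = int(digits[start:end]) % modulus
        parts.append(f"{folded:0{width}d}")  # keep each region zero-padded
    return "".join(parts)


print(prehash("12345678"))

# First-region values that fold to 000.
print([v for v in range(1000) if v % 139 == 0])
```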
If the hacker got everything, including source code, and brute forced it all, he would end up with the values produced by the pre-hash. So for any particular hashed value, he would know that it is one of around 100 possible values. He would know all those possible values, but he wouldn't know which of those was THE original value that produced the hashed value.
You should think hard before going this route. I'm wary of anything that departs from standard, accepted cryptographic recommendations.