Need a 9-char-length unique ID - distributed-computing

My application uses a 9-digit number (it can also be alphanumeric). I can start with any number and then increment it each time a value is needed. But my application is not a single-instance application, so if I run the exe as another instance, it should increment the latest value, and the previous instance should again increment the latest value when it next needs one. In other words, at all times the value should be the latest incremented value among all the instances I have open.
That is half of the problem. The other half is that the exes can run on any machine on the network, and each instance should keep incrementing (just as time never goes backwards) for another 2 years. My restriction is that I can't use files to store and retrieve the latest value in a common place.
How can I do that?
A 9-char/digit UNIQUE NUMBER would also work. The whole idea is to assign a number (a string of 9 characters) to each "confidential file", which then gets encrypted and so on (that part is not my job).
I tried with:
a GUID, which is unique across its full 128 bits, but not in just its first or last 9 characters
the tick count, which is more than 9 digits long
the MAC address, which is unique only at its full 12 characters
an ISBN (the book numbering system)
And so on ...

I think the best approach might be to have a unique-number server which each instance of your application queries over the network to get unique numbers.

First, you need to remove the distributed aspect from the problem. As user Hugo suggested, using the last 2 or 3 bytes of the IP address should work. Your problem is now reduced to a local problem for each single machine.
Your algorithm probably needs to be able to deal with a restart, and not start handing out the same numbers after a reboot. You state that you do not have the option to use a file to store and retrieve information about this mechanism via a file system. This means that a random number generator alone would not be good enough, and you need a time-based component in your number generator as well. If you use 4 bytes containing the number of seconds elapsed since some date you will have more than 100 years of uniqueness in that. However, ideally the time-scale to use here depends on the expected handout-frequency of your numbers. Your problem is now reduced to a local problem for each single machine for each single second.
The final 2 or 3 bytes are then available to ensure local uniqueness for the second. Depending on your requirements and operating system, there are multiple IPC mechanisms to manage this, like pipes, sockets or shared memory. Or you could think of more creative ways. If you know the number of participating processes on a node, you could assign a sequence number to each process at startup or configuration time, and 1 of the 2 or 3 bytes is used for that. Your uniqueness problem has now become local to your process to one second only, which should be doable.
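To make that concrete, here is a minimal sketch in Python. The field widths, the custom epoch, and the way the node id and per-process slot are obtained are assumptions for illustration, not part of the question: 32 bits of seconds plus a 6-bit node id, a 4-bit process slot and a 4-bit per-second counter pack into 46 bits, which fits in 9 base-36 characters (36^9 is roughly 1.0 * 10^14, which is larger than 2^46).

```python
import time

BASE36 = "0123456789abcdefghijklmnopqrstuvwxyz"
EPOCH = 1577836800          # 2020-01-01 UTC, a custom epoch (assumption)

def encode_base36(value, width=9):
    """Encode an integer as a fixed-width base-36 string (0-9, a-z)."""
    chars = []
    for _ in range(width):
        value, rem = divmod(value, 36)
        chars.append(BASE36[rem])
    return "".join(reversed(chars))

def make_id(node_id, proc_seq, counter):
    """Pack 32 bits of seconds + 6-bit node id + 4-bit process slot + 4-bit counter.

    node_id:   derived from the host part of the IP address (0-63)
    proc_seq:  per-process slot assigned at startup/configuration time (0-15)
    counter:   per-process counter within the current second (0-15)
    """
    seconds = int(time.time()) - EPOCH              # fits in 32 bits until ~2156
    packed = (seconds << 14) | (node_id << 8) | (proc_seq << 4) | counter
    return encode_base36(packed)

print(make_id(node_id=5, proc_seq=1, counter=0))    # 9-character alphanumeric id
```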

Why does it have to be EXACTLY 9? UUIDs would be great if that didn't limit you.
In any case, your best shot is to generate a random number. If all your PCs are on the same network, use the host digits of the IP address at the beginning to avoid collisions. This should be no more than 16 or 24 bits in most cases anyway, so you have 6 remaining digits.
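As a sketch of that idea, assuming a purely numeric 9-digit id, the last IPv4 octet as the host part, and a random 6-digit suffix (none of which is specified in the question):

```python
import random
import socket

def make_id_from_ip():
    """Prefix: last octet of this host's IPv4 address (3 digits);
    suffix: 6 random digits. Assumes all machines share one network,
    so the host part of the IP distinguishes them and the random
    suffix only has to be unique per host."""
    ip = socket.gethostbyname(socket.gethostname())   # may need a better lookup on some setups
    host_part = int(ip.split(".")[-1])                # last octet, 0-255
    suffix = random.randrange(10 ** 6)                # 6 random digits
    return f"{host_part:03d}{suffix:06d}"             # 9 digits total

print(make_id_from_ip())
```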

Related

Is there a possibility of the same ObjectIds being generated by MongoDB on 2 different machines?

The ObjectId in MongoDB has 3 parts, as per the official documentation:
a 4-byte timestamp value, representing the ObjectId’s creation, measured in seconds since the Unix epoch
a 5-byte random value
a 3-byte incrementing counter, initialized to a random value
In some other blogs and documentation, it is said that the 5-byte random value is really a combination of a 3-byte machine id and a 2-byte process id, to provide uniqueness.
As per my observation:
On my local machine, whenever I write to MongoDB through an application, the 5-byte random value only changes if I restart the application and write again, which suggests that the 5-byte random value depends on the process id. But it is not the case that the first 3 bytes (machine address) stay fixed while only the last 2 bytes (process id) change. Instead, the complete 5 bytes change.
I want to know whether it is just a random number, or whether it has some dependency on the machine id and process id. If it does depend on machine id + process id, can we assume it is highly unlikely that 2 ObjectIds on different machines will be the same at a given time?
The "random value" part of ObjectId used to be machine id + process id, now it is simply a random number. (See the rationale in the spec for the official statement.)
When using virtualization it is not uncommon to end up with the same machine id across multiple servers. See for example here.
Process id can also be the same across multiple servers if they all launch from the same image, thus have exactly the same boot sequence.
For these reasons ObjectId generation now uses a random counter.
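You can see the current layout by pulling an ObjectId apart yourself; here is a small sketch using the bson package that ships with pymongo (the slicing just follows the 4 + 5 + 3 byte layout quoted above):

```python
from bson import ObjectId   # part of the pymongo distribution

oid = ObjectId()
raw = oid.binary                                 # the 12 raw bytes

timestamp = int.from_bytes(raw[0:4], "big")      # seconds since the Unix epoch
random_part = raw[4:9].hex()                     # 5-byte random value, fixed per process
counter = int.from_bytes(raw[9:12], "big")       # 3-byte counter, starts at a random value

print(oid, timestamp, random_part, counter)
print(oid.generation_time)                       # the timestamp as a UTC datetime
```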
The random value (5 bytes) is a combination of the machine identifier (3 bytes) and the process id (2 bytes).
To get a duplicate ObjectId in MongoDB across two different machines, you would need the same hash for the machine identifier + process id combination, together with the same counter value (which starts at a random value), on both machines in the exact same second.
To answer your question, it is highly unlikely but not impossible.
The same is described in this MongoDB blog post, where it is mentioned that an ObjectId may not be globally unique in some edge cases.
Also refer to point 3 of the accepted answer to the question "Possibility of duplicate Mongo ObjectId's being generated in two different collections?"

Collision probability of ObjectId vs UUID in a large distributed system

Considering that a UUID (RFC 4122, 16 bytes) is much larger than a MongoDB ObjectId (12 bytes), I am trying to find out how their collision probabilities compare.
I know the answer is somewhere around "quite unlikely", but in my case most ids will be generated by a large number of mobile clients, not by a limited set of servers. I wonder whether, in this case, the concern is justified.
Compared to the normal case where all ids are generated by a small number of clients:
It might take months to detect a collision since the document creation
IDs are generated from a much larger client base
Each client has a lower ID generation rate
in my case most ids will be generated by a large number of mobile clients, not by a limited set of servers. I wonder whether, in this case, the concern is justified.
That sounds like very bad architecture to me. Are you using a two-tier architecture? Why would the mobile clients have direct access to the db? Do you really want to rely on network-based security?
Anyway, some deliberations about the collision probability:
Neither UUID nor ObjectId rely on their sheer size, i.e. both are not random numbers, but they follow a scheme that tries to systematically reduce collision probability. In case of ObjectIds, their structure is:
4 byte seconds since unix epoch
3 byte machine id
2 byte process id
3 byte counter
This means that, contrary to UUIDs, ObjectIds are monotonic (except within a single second), which is probably their most important property. Monotonic indexes will cause the B-Tree to be filled more efficiently, it allows paging by id and allows a 'default sort' by id to make your cursors stable, and of course, they carry an easy-to-extract timestamp. These are the optimizations you should be aware of, and they can be huge.
As you can see from the structure of the other 3 components, collisions become very likely if you're doing more than 16M inserts/s in a single process (enough to wrap the 3-byte counter within one second, which is not really achievable, not even on a server), or if the number of machines grows past about 10 (see birthday problem), or if the number of processes on a single machine grows too large (then again, process ids aren't random numbers; they are truly unique on a machine, but they must be shortened to two bytes).
Naturally, for a collision to occur, they must match in all these aspects, so even if two machines have the same machine hash, it'd still require a client to insert with the same counter value in the exact same second and the same process id, but yes, these values could collide.
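To put rough numbers on the birthday argument for the 3-byte machine hash, here is a quick sketch (the machine counts are arbitrary examples, not figures from the question):

```python
import math

def collision_probability(n, space):
    """Standard birthday approximation: probability that at least two of
    n uniformly random values out of `space` possibilities coincide."""
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * space))

machine_hash_space = 2 ** 24   # 3-byte machine id
for machines in (10, 100, 1000, 10000):
    p = collision_probability(machines, machine_hash_space)
    print(f"{machines:6d} machines -> P(two share a machine hash) = {p:.4%}")
```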
Let's look at the spec for "ObjectId" from the documentation:
Overview
ObjectId is a 12-byte BSON type, constructed using:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
So let us consider this in the context of a "mobile client".
Note: The context here does not mean using a "direct" connection of the "mobile client" to the database. That should not be done. But the "_id" generation can be done quite simply.
So the points:
Value for the "seconds since epoch". That is going to be fairly random per request. So minimal collision impact just on that component. Albeit in "seconds".
The "machine identifier". So this is a different client generating the _id value. This is removing possibility of further "collision".
The "process id". So where that is accessible to seed ( and it should be ) then the generated _id has more chance of avoiding collision.
The "random value". So another "client" somehow managed to generate all of the same values as above and still managed to generate the same random value.
Bottom line is, if that is not a convincing enough argument to digest, then simply provide your own "uuid" entries as the "primary key" values.
But IMHO, that should be a fairly convincing argument to consider that the collision aspects here are very broad. To say the least.
The full topic is probably just a little "too-broad". But I hope this moves consideration a bit more away from "Quite unlikely" and on to something a little more concrete.

Is it possible to brute-force my SHA-512 authentication algorithm?

I have an authentication application and don't know how secure it is.
Here is the algorithm:
1) A clientToken is generated by taking the SHA-512 hash of a new GUID. I have about 1000 client tokens generated and stored in the database.
Every time a caller calls my web service, it needs to provide the clientToken; if the clientToken does not exist in the database, then it is not a valid client.
The question is: how long would it take to brute-force an existing clientToken?
A GUID is a 128 bit value, with 6 bits held constant, so a total of 122 bits available. Since this is your input to the hash, you're not going to have 2^512 unique hashes in your application. This is roughly 5.3*10^36 values to check.
Say your attacker is able to calculate 1,000,000 (10^6) hashes per second (I'm not sure how reasonable that is for SHA-512, but at this size, a few orders of magnitude won't influence things that much). This works out to about 5.3*10^30 seconds to check the space (For reference, this will be far beyond the time all stars have gone dark). Also, unless you have several billion clients, a birthday attack probably will not remove too many orders of magnitude from this.
But, just for fun, let's say the attacker has some trick that lets him reduce the number of hashes to check by half (or some combination of reduced space to check and increased speed), either by you having that many users, or some flaw in your GUID generator, or what have you. We're still looking at well over 100 million years to find a collision.
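The arithmetic above, as a quick sketch (the attacker speed is the same assumed 10^6 hashes per second):

```python
guid_space = 2 ** 122                      # 122 variable bits in a GUID
hashes_per_second = 10 ** 6                # assumed attacker speed

seconds = guid_space / hashes_per_second
years = seconds / (60 * 60 * 24 * 365)
print(f"{seconds:.1e} seconds  ~  {years:.1e} years")   # ~5.3e30 s, ~1.7e23 years
```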
I think you're beyond safe and into somewhat overkill territory. Also note that hashing the GUID in effect does nothing, and that GUIDs probably are not generated via a secure random number generator. You'd actually be a bit better off just generating a 128 bits (16 bytes) of randomness via whatever secure random number generator your platform uses, and using that as the shared secret.
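A sketch of that last suggestion, using Python's standard secrets module as the platform's secure random number generator:

```python
import secrets

client_token = secrets.token_hex(16)   # 16 random bytes (128 bits) from the OS CSPRNG,
print(client_token)                    # rendered as 32 hex characters for storage
```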

Concept of "block size" in a cache

I am just beginning to learn the concept of Direct mapped and Set Associative Caches.
I have some very elementary doubts . Here goes.
Suppose addresses are 32 bits long and I have a 32 KB cache with a 64-byte block size and 512 frames. How much data is actually stored inside a "block"? If I have an instruction which loads a value from a memory location, and that value is a 16-bit integer, does one of the 64-byte blocks now store only a 16-bit (2-byte) integer value? What about the other 62 bytes within the block? If I now have another load instruction which also loads a 16-bit integer value, this value goes into another block of another frame depending on the load address (and if the address maps to the same frame as the previous instruction, then the previous value is evicted and the block again stores only 2 bytes out of 64). Correct?
Please forgive me if this seems like a very stupid doubt; it's just that I want to get my concepts right.
I typed up this email for someone to explain caches, but I think you might find it useful as well.
You have 32-bit addresses that can refer to bytes in RAM.
You want to be able to cache the data that you access, to use them later.
Let's say you want a 1-MiB (2^20 bytes) cache.
What do you do?
You have 2 restrictions you need to meet:
Caching should be as uniform as possible across all addresses. i.e. you don't want to bias toward any particular kind of address.
How do you do this? Use remainder! With mod, you can evenly distribute any integer over whatever range you want.
You want to help minimize bookkeeping costs. That means e.g. if you're caching in blocks of 1 byte, you don't want to store 4 bytes of data just to keep track of where 1 byte belongs to.
How do you do that? You store blocks that are bigger than just 1 byte.
Let's say you choose 16-byte (2^4-byte) blocks. That means you can cache 2^20 / 2^4 = 2^16 = 65,536 blocks of data.
You now have a few options:
You can design the cache so that data from any memory block could be stored in any of the cache blocks. This would be called a fully-associative cache.
The benefit is that it's the "fairest" kind of cache: all blocks are treated completely equally.
The tradeoff is speed: To find where to put the memory block, you have to search every cache block for a free space. This is really slow.
You can design the cache so that data from any memory block could only be stored in a single cache block. This would be called a direct-mapped cache.
The benefit is that it's the fastest kind of cache: you do only 1 check to see if the item is in the cache or not.
The tradeoff is that, now, if you happen to have a bad memory access pattern, you can have 2 blocks kicking each other out successively, with unused blocks still remaining in the cache.
You can do a mixture of both: map a single memory block into multiple blocks. This is what real processors do -- they have N-way set associative caches.
Direct-mapped cache:
Now you have 65,536 blocks of data, each block being of 16 bytes.
You store it as 65,536 "rows" inside your cache, with each "row" consisting of the data itself, along with the metadata (regarding where the block belongs, whether it's valid, whether it's been written to, etc.).
Question:
How does each block in memory get mapped to each block in the cache?
Answer:
Well, you're using a direct-mapped cache, using mod. That means addresses 0 to 15 will be mapped to block 0 in the cache; addresses 16 to 31 get mapped to block 1, etc... and it wraps around as you reach the 1-MiB mark.
So, given memory address M, how do you find the row number N? Easy: N = (M mod 2^20) / 2^4.
But that only tells you where to store the data, not how to retrieve it. Once you've stored it, and try to access it again, you have to know which 1-MB portion of memory was stored here, right?
So that's one piece of metadata: the tag bits. If it's in row N, all you need to know is what the quotient was, during the mod operation. Which, for a 32-bit address, is 12 bits big (since the remainder is 20 bits).
So your tag becomes 12 bits long -- specifically, the topmost 12 bits of any memory address.
And you already knew that the lowermost 4 bits are used for the offset within a block (since memory is byte-addressed, and a block is 16 bytes).
That leaves 16 bits for the "index" bits of a memory address, which can be used to find which row the address belongs to. (It's just a division + remainder operation, but in binary.)
You also need other bits: e.g. you need to know whether a block is in fact valid or not, because when the CPU is turned on, it contains invalid data. So you add 1 bit of metadata: the Valid bit.
There are other bits you'll learn about, used for optimization, synchronization, etc... but these are the basic ones. :)
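Here's a small sketch of that tag/index/offset split for the cache described above (1 MiB, 16-byte blocks, 65,536 rows, 32-bit byte addresses); the function name is just for illustration:

```python
BLOCK_SIZE = 16          # 2**4 bytes per block  -> 4 offset bits
NUM_ROWS   = 2 ** 16     # 65,536 rows           -> 16 index bits
                         # the remaining 12 bits of a 32-bit address are the tag

def split_address(addr):
    """Split a 32-bit byte address into (tag, index, offset) for a
    direct-mapped cache with the geometry above."""
    offset = addr % BLOCK_SIZE
    index = (addr // BLOCK_SIZE) % NUM_ROWS
    tag = addr // (BLOCK_SIZE * NUM_ROWS)
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
print(hex(tag), hex(index), hex(offset))   # 0x123 0x4567 0x8
```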
I'm assuming you know the basics of tag, index, and offset, but here's a short explanation as I learned it in my computer architecture class. Data is replaced in 64-byte blocks, so every time a new block is put into the cache it replaces all 64 bytes, regardless of whether you only need one byte. That's why, when addressing the cache, there is an offset that specifies which byte you want from the block. Take your example: if only a 16-bit integer is being loaded, the cache will look up the block by its index, check the tag to make sure it's the right data, and then get the bytes according to the offset. Now if you load another 16-bit value, say with the same index but a different tag, it will replace the whole 64-byte block with the new block and get the data from the specified offset (assuming direct-mapped).
I hope this helps! If you need more info or this is still fuzzy let me know, I know a couple of good sites that do a good job of teaching this.

hash fragments and collisions cont

For this application of mine, I feel like I can get away with a 40-bit hash key, which seems awfully low, but see if you can confirm my reasoning (I want a small key because I want a small filename, and the key will be converted to a filename):
(Note: only accidental collisions are a concern; there are no security issues.)
A key point here is that the population in question is divided into groups, and a collision is only relevant if it occurs within the same group. A "group" is a directory on a user's system (the contents of files are hashed, and a collision is only relevant if it occurs between files in the same directory). So, speculating roughly 100,000 potential users, say 2^17, that corresponds to 2^18 "groups", assuming 2 directories per user on average. So with a 40-bit key I can expect 2^(20+9) files to be created (among all users) before a collision occurs for some user somewhere (or, in other words, 2^((40+18)/2), due to the "birthday effect"). That's an average of 4096 unique files created per user, across 2^17 users, before a single collision occurs for some user somewhere. And then that long again before another collision occurs somewhere (right?)
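Here is the same estimate as a quick sketch (the helper name is just for illustration; the figures are the ones above):

```python
import math

def first_collision_estimate(key_bits, group_bits):
    """Birthday estimate: since a collision only counts within a group, the
    group id effectively adds group_bits to the key, so the first collision
    anywhere is expected around sqrt(2**(key_bits + group_bits)) files."""
    return math.sqrt(2 ** (key_bits + group_bits))

total = first_collision_estimate(key_bits=40, group_bits=18)   # 2^18 groups
users = 2 ** 17
print(f"~2^{math.log2(total):.0f} files overall, ~{total / users:.0f} per user")
# -> ~2^29 files overall, ~4096 per user
```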
Your math looks reasonable, but I'm left wondering why you'd bother with this at all. If you want to create unique file names, why not just assign a number to each user and keep a serial number for that user? When you need a file name, just concatenate the user number with the serial number (both padded to the correct number of digits). If you feel that you need to obfuscate those numbers, run the result through a 40-bit encryption (which will guarantee that a unique input produces a unique output).
If, for example, you assign 20 bits to each, you can have 2^20 users create 2^20 documents apiece before there's any chance of a collision at all.
If you don't mind serialized access to it, you could just use a single 40-bit counter instead. The advantage of this is that a single user wouldn't immediately use up 2^20 serial numbers, even though the average user is unlikely to ever create nearly that many documents.
Again, if you think you need to obfuscate this number for some reason, you can use a 40-bit encryption algorithm in counter mode (i.e. use a serial number, but encrypt it) which (again) guarantees that each input maps to a unique output. This guarantees no collision until/unless your users create 2^40 documents (i.e., the maximum possible with only 40 bits). Alternatively, you could create a 40-bit full-range linear feedback shift register to create your pseudo-random 40-bit numbers. This might be marginally less secure, but has the advantage of being faster and simpler to implement.
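A sketch of the LFSR variant: the tap positions 40, 38, 21, 19 are taken from a commonly cited maximal-length table (e.g. Xilinx XAPP052), so treat them as an assumption and verify the full period yourself before relying on it.

```python
def lfsr40(seed):
    """40-bit Fibonacci LFSR with taps at bits 40, 38, 21, 19 (1-based).
    Yields successive 40-bit states; the seed must be non-zero."""
    mask = (1 << 40) - 1
    state = seed & mask
    while True:
        bit = ((state >> 39) ^ (state >> 37) ^ (state >> 20) ^ (state >> 18)) & 1
        state = ((state << 1) | bit) & mask
        yield state

gen = lfsr40(seed=0xDEADBEEF42)
for _ in range(3):
    print(f"{next(gen):010x}")   # successive pseudo-random 40-bit values -> file names
```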