Using the HiLoIdGenerator in NoRM for MongoDB to create a unique identifier

I have been struggling a little with the HiLoIdGenerator that comes with NoRM (http://normproject.org/); I want to use it to generate a unique identifier that I can use as a SLUG for my blog posts. At present I use the ObjectId to uniquely identify a document within MongoDB, but as this is GUID-like and it doesn't look very good in a URL, I would prefer to have something like www.myblog.com/posts/1243 and so this is why I have decided to use the HiLoIdGenerator.
I would like to generate my HiLo ids on the client side. I read on Stuart Harris' blog (http://red-badger.com/Blog/post/A-simple-IRepository3cT3e-implementation-for-MongoDB-and-NoRM.aspx) that NoRM's new HiLo id generator allows this by allocating a range of integers to the client session that can be used with impunity (other clients will be using a different range). But when I opened the HiLoIdGenerator source, the class comment said: "Class that generates a new identity value using the HILO algorithm. Only one instance of this class should be used in your project."
I really have three questions:
1) If I had multiple instances of the HiLoIdGenerator in my application (say, an instance in my service class that called GenerateId for every new document), could I actually guarantee that all of my ids would be unique, given that the code for the HiLoIdGenerator class says there should be only a single instance of this class in an application?
2) The HiLoIdGenerator constructor takes a capacity argument, and I would like to know what it does. When I passed 0, all of the generated ids were the same. When I passed 1 (new HiLoIdGenerator(1)), the ids began at 1 and were incremented by 1. I presume it has something to do with the range of values that the generator can produce, but I am not sure. Could someone please explain this argument?
3) I think I understand the aim of the HiLo algorithm as explained in "What's the Hi/Lo algorithm?", but I don't understand whether the generated ids are globally unique. Suppose I have two different applications, each looking at a different instance of MongoDB, with both instances containing the same collection types. Could I use the ids the way I would a GUID, or are they unique only within a given instance of MongoDB, which would preclude merging both collections into a single instance of MongoDB at a later date?
thanks

See here for how to produce monotonically increasing ids:
http://www.mongodb.org/display/DOCS/Atomic+Operations#AtomicOperations-%22InsertifNotPresent%22

1) Yes, they would be unique. Each client (HiLoIdGenerator) requests a range of lo's that it can allocate, but the ids are only unique if all clients use the same capacity.
2) Capacity is the number of ids that the client can assign with impunity. Again, if clients use different capacities, you have the potential to create non-unique values. If you only want monotonically increasing ids, you are only ever assigning a single sequential value, so you do not need the HiLo algorithm; you just need a single place that holds a value that you can increment and assign to a new entity. See dm's answer for an implementation of this.
3) Yes, as long as both clients use the same collection that holds the Hi value, and as long as both clients use the same capacity for generating the lo's.
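To make the role of capacity more concrete, here is a minimal Python sketch of the general Hi/Lo idea. It is not NoRM's implementation: the names `InMemoryHiStore` and `HiLoGenerator`, the in-memory counter standing in for the shared collection that holds the Hi value, and the formula `hi * capacity + lo + 1` are all assumptions made for illustration.

```python
import threading


class InMemoryHiStore:
    """Stand-in for the shared place (e.g. a MongoDB collection) that holds the Hi value."""

    def __init__(self):
        self._hi = 0
        self._lock = threading.Lock()

    def next_hi(self):
        # A real store would do this with an atomic increment on the server.
        with self._lock:
            hi = self._hi
            self._hi += 1
            return hi


class HiLoGenerator:
    """Hands out ids from a block of `capacity` values reserved via the shared store."""

    def __init__(self, store, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._store = store
        self._capacity = capacity
        self._hi = None
        self._lo = capacity  # forces a block to be reserved on first use

    def generate_id(self):
        # When the current block is used up, reserve the next one.
        if self._lo >= self._capacity:
            self._hi = self._store.next_hi()
            self._lo = 0
        new_id = self._hi * self._capacity + self._lo + 1
        self._lo += 1
        return new_id


store = InMemoryHiStore()
client_a = HiLoGenerator(store, capacity=3)
client_b = HiLoGenerator(store, capacity=3)
ids_a = [client_a.generate_id() for _ in range(4)]  # [1, 2, 3, 4]
ids_b = [client_b.generate_id() for _ in range(4)]  # [7, 8, 9, 10]
```

In this sketch two generators sharing one store never overlap as long as their capacities match. With mismatched capacities the blocks can collide: a client with capacity 3 holding hi = 1 covers 4 to 6, while a client with capacity 2 holding hi = 2 covers 5 to 6.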

Related

How do I model a queue on top of a key-value store efficiently?

Suppose I have a key-value database, and I need to build a queue on top of it. How could I achieve this without poor performance?
One idea might be to store the queue inside an array, and simply store the array using a fixed key. This is a quite simple implementation, but it is very slow, because the complete array must be loaded or saved for every read or write access.
I could also implement a linked list with random keys, plus one fixed key that acts as the starting point to element 1. Depending on whether I prefer fast read or fast write access, I could let the fixed element point to the first or the last entry in the queue (so I have to traverse it forward / backward).
Or, to take that further, I could also have two fixed pointers: one for the first item, one for the last.
Any other suggestions on how to do this effectively?
Fundamentally, a key-value structure is very similar to raw memory, where the physical address in computer memory acts as the key. So any type of data structure, including a linked list, can certainly be modeled on top of key-value storage.
A linked list is a list of nodes, each holding the index information of the previous or the following node. The node itself can also be viewed as a small key-value structure. By adding a prefix to the key, the information in each node can be stored separately in a flat table of key-value pairs.
Going further, a special suffix on the key also makes it possible to get rid of the redundant pointer information. Such a list might look something like this:
pilot-last-index: 5
pilot-0: Rei Ayanami
pilot-1: Shinji Ikari
pilot-2: Soryu Asuka Langley
pilot-3: Touji Suzuhara
pilot-5: Makinami Mari
The corresponding algorithm is easy to imagine, I think. If you can have a daemon thread for manipulating these keys, pilot-5 could be renamed to pilot-4 in the above example. Even if an additional thread is not allowed in your situation, the contents of the queue itself are not affected; there is just some overhead for the gap in the sequence.
Which of the two approaches above to apply is a matter of balancing the cost of storage space against the overhead of CPU time.
Thread safety is indeed a problem, but an old one. Just as the classes implementing the ConcurrentMap interface in the JDK provide atomic operations on key-value data, some key-value middleware, such as memcached, offers similar methods that let you update a key or value separately and thread-safely. However, these implementations are a matter of the algorithm rather than of the key-value structure itself.
I think it depends on the kind of queue you want to implement, and no solution will be perfect because a key-value store is not the right data structure for this kind of task. There will be always some kind of hack involved.
For a simple first-in, first-out queue you could use a few key-value pairs like the following:
{
oldestIndex:5,
newestIndex:10
}
In this example there would be 6 items in the Queue (5,6,7,8,9,10). Item 0 to 4 are already done whereas there is no Item 11 or so for now. The producer worker would increment newestIndex and save his item under the key 11. The consumer takes the item under the key 5 and increments oldestIndex.
Note that this approach can lead to problems if you have multiple consumers/producers and if the queue is never empty, so that you can't reset the index.
But the multithreading problem is also true for linked lists etc.
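As a rough illustration of the oldestIndex/newestIndex approach, here is a Python sketch in which a plain dict stands in for the key-value database. The class name `KeyValueQueue` and the `queue:` key prefix are made up for this example, and nothing here is atomic, so a real store would need an atomic increment or compare-and-set for the two index keys.

```python
class KeyValueQueue:
    """FIFO queue stored as flat key-value pairs plus two index keys."""

    def __init__(self, store, name="queue"):
        self._store = store
        self._name = name
        # oldestIndex points at the next item to consume;
        # newestIndex points at the most recently produced item.
        self._store.setdefault(f"{name}:oldestIndex", 0)
        self._store.setdefault(f"{name}:newestIndex", -1)

    def enqueue(self, item):
        newest = self._store[f"{self._name}:newestIndex"] + 1
        self._store[f"{self._name}:{newest}"] = item
        self._store[f"{self._name}:newestIndex"] = newest

    def dequeue(self):
        oldest = self._store[f"{self._name}:oldestIndex"]
        if oldest > self._store[f"{self._name}:newestIndex"]:
            raise IndexError("dequeue from empty queue")
        # Remove the item and advance the consumer index.
        item = self._store.pop(f"{self._name}:{oldest}")
        self._store[f"{self._name}:oldestIndex"] = oldest + 1
        return item

    def __len__(self):
        return (self._store[f"{self._name}:newestIndex"]
                - self._store[f"{self._name}:oldestIndex"] + 1)


store = {}  # a plain dict stands in for the key-value database
q = KeyValueQueue(store)
for pilot in ["Rei", "Shinji", "Asuka"]:
    q.enqueue(pilot)
first = q.dequeue()
```

Each enqueue and dequeue touches only a constant number of keys, which avoids loading and saving the whole array on every access.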

How do I generate a unique id from an auto incremented integer?

I have an auto-incremented id (an int) that I want to convert into something less "mine-able". Basically I don't want people to be able to access data/0, data/1, data/2, etc. and rip through our entire database. I was thinking of just hashing the ID but I wasn't sure if I could guarantee uniqueness.
Let's say the value range is from 1 to a couple hundred million. It may be that one of the hash algorithms can guarantee uniqueness within those parameters.
If not, what would be a good approach to take?
I did consider hashing and then appending the ID.
I'm trying to avoid using a GUID because it would require a lot of changes to existing code so I'd prefer to transform the data I have.
EDIT:
To further explain the situation - these are static resources that are being hit. I don't have to go to a database and reverse it or look it up against something else. Imagine a listing of products - a user might have a link to a specific page but I don't want them to be able to programmatically go through every page, so I need a non-incrementing ID.
As far as I know, hashing is intended to create a unique ID based on some concrete data (e.g. name, surname etc.). Hashing an auto-incremented ID won't help you much. If someone searches through your database by entering an auto-incremented ID, that ID will be passed to the hash function as a parameter and he will still get the data he wants. So I think that a better solution would be to hash some other data in order to get a unique ID. If you do so, then a person who searches through your database would have to know the exact data that is stored in there (e.g. he would have to know the exact name of your employee, or his SSN).
Hope that helps!
Use something pseudo-random to salt the value before hashing if there is no need for reverse lookup.
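One way to read the "salt before hashing" suggestion, combined with the asker's idea of appending the ID, is a keyed hash (HMAC). The Python sketch below is only an illustration under assumptions: `SECRET_KEY` and `public_token` are hypothetical names, and truncating the tag to 16 hex characters is an arbitrary choice. Because the id stays in the token, two different ids can never produce the same token, while the tag cannot be computed without the key.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be long, random, and kept out of source control.
SECRET_KEY = b"replace-with-a-long-random-secret"


def public_token(record_id: int) -> str:
    """Return '<id>-<tag>' where the tag cannot be computed without SECRET_KEY."""
    message = str(record_id).encode("ascii")
    tag = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()[:16]
    return f"{record_id}-{tag}"


token_1 = public_token(1)
token_2 = public_token(2)
```

Since the resources are static, the tokens could be computed once when the pages are generated and used as the file or path names.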

Algorithm to generate a user unique, 6-character confirmation code?

I'd like to be able to create an algorithm that generates a 6 character confirmation code (e.g. A1JU2Z) that will be unique for a given (user, code) pair. The reason is, I'd like to keep the code at 6 characters, but using a trimmed set of alphanumerics (to avoid confusion with 1 and I, etc) only allows for ~300 million codes before collisions occur. Sure I may never need 300 million codes, but if I do, it will be a huge pain to go back and fix this.
So is there a way to utilize the user ... say their username, to generate unique codes such that if the same user wants to generate another code, it's guaranteed that it is unique for them? (This is of course assuming a single user doesn't generate over 300 million codes.)
Thanks!
If the ID only has to be unique for the current user, you can just generate each character of the ID randomly. As long as the user is not expected to generate a large number of such IDs, you will have a reasonable chance of not generating the same ID more than once (you need to do some math to get exact numbers for the expected chance of collision as the number of generated IDs grows).
If collisions must be avoided at all costs, you need to either keep all previously generated IDs and compare each new one against them, or keep a count of the generated IDs (this requires a scheme where the ID generation is deterministic based on the count, but also unique -- a very simple case would be {ID=count; ++count;})
I think you can use a simple password generator like this : http://www.webtoolkit.info/php-random-password-generator.html
in combination with a check algorithm to be sure it is not already used.
// Keep generating until the code is not already in use.
do {
    $pass = generate_password();
    $found = find_password($pass);
} while ($found);
// $user is assumed to be set by the caller.
save_password($user, $pass);
generate_password() is the function referred to in the link.
find_password() is a function you have to write to check already generated codes in a database.
save_password() is a function you have to write to store the generated code in a database.
The code is in PHP, but the logic carries over to any language.
The password generator in the link is easy to understand, you can get 6 chars long, with the character rules you want.
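The same generate-and-check loop, scoped to a single user, could look roughly like this in Python. It is a sketch under assumptions: the alphabet is one possible trimmed character set, and the `issued` dict stands in for whatever database table would store the (user, code) pairs.

```python
import secrets

# Alphabet without easily confused characters (0/O, 1/I/L); an assumption, adjust as needed.
ALPHABET = "ABCDEFGHJKMNPQRSTUVWXYZ23456789"
CODE_LENGTH = 6


def generate_code():
    """Return a random 6-character code drawn from ALPHABET."""
    return "".join(secrets.choice(ALPHABET) for _ in range(CODE_LENGTH))


def new_code_for_user(user, issued):
    """Return a code not yet issued to `user`; `issued` maps user -> set of codes."""
    codes = issued.setdefault(user, set())
    while True:
        code = generate_code()
        if code not in codes:
            codes.add(code)
            return code


issued = {}  # stands in for a database table keyed by (user, code)
codes = [new_code_for_user("alice", issued) for _ in range(100)]
```

Because uniqueness is only checked within one user's set, the same code may be handed to two different users, which is what the question allows.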

Are NSManagedObject objectIDs unique across space and time like a CFUUID?

The NSManagedObjectID documentation states:
An NSManagedObjectID object is a compact, universal, identifier for a managed object. This forms the basis for uniquing in the Core Data Framework. A managed object ID uniquely identifies the same managed object both between managed object contexts in a single application, and in multiple applications (as in distributed systems).
Translation in my head: "There is probably no way that any two NSManagedObjectIDs are ever the same across the set of all instances of my application."
The CFUUID documentation states:
UUIDs ... are 128-bit values guaranteed to be unique. A UUID is made unique over both space and time by combining a value unique to the computer on which it was generated—usually the Ethernet hardware address—and a value representing the number of 100-nanosecond intervals since October 15, 1582 at 00:00:00.
Translation in my head: "There is definitely no way that any two CFUUIDs are ever the same across the set of all instances of my application."
The fact that NSManagedObjectIDs are described as a "universal identifier" makes me almost certain that they offer the same uniqueness as a CFUUID, whereas "unique across space and time" leaves absolutely no room for doubt. Can anybody with more Core Data experience than me confirm or deny my thoughts?
Beyond uniqueness, there is one case where the object ID will change, and that's if you query it before persisting the object to disk. After saving, it will have a different ID. Beyond that point, the ID will not change. I just wanted to point this out because it caused me a bit of confusion until I figured out what was happening.
I can't comment on the hashing used to generate the NSManagedObjectID, but it does seem like the odds of it matching another NSManagedObject are vanishingly small, based on looking at the IDs generated.

How can I generate a unique, small, random, and user-friendly key?

A few months back I was tasked with implementing a unique and random code for our web application. The code would have to be user friendly and as small as possible, but still be essentially random (so users couldn't easily predict the next code in the sequence).
It ended up generating values that looked something like this:
Af3nT5Xf2
Unfortunately, I was never satisfied with the implementation. GUIDs were out of the question; they were simply too big and difficult for users to type in. I was hoping for something more along the lines of 4 or 5 characters/digits, but our particular implementation would generate noticeably patterned sequences if we encoded to fewer than 9 characters.
Here's what we ended up doing:
We pulled a unique sequential 32-bit id from the database. We then inserted it into the center bits of a 64-bit RANDOM integer. We created a lookup table of easily typed and recognized characters (A-Z, a-z, 2-9, skipping easily confused characters such as L, l, 1, O, 0, etc.). Finally, we used that lookup table to base-54 encode the 64-bit integer. The high bits were random, the low bits were random, but the center bits were sequential.
The final result was a code that was much smaller than a GUID and looked random, even though it absolutely wasn't.
I was never satisfied with this particular implementation. What would you guys have done?
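For readers who want to see the scheme from the question spelled out, here is a rough Python sketch. The exact bit layout (16 random bits, 32 id bits, 16 random bits) and the exact 54-character alphabet are guesses, since the question does not give them, and the function names are made up for this example.

```python
import secrets

# 54 characters: A-Z and a-z without I, L, O (either case), plus 2-9. An assumed table.
ALPHABET = "ABCDEFGHJKMNPQRSTUVWXYZabcdefghjkmnpqrstuvwxyz23456789"
BASE = len(ALPHABET)


def encode(value):
    """Encode a non-negative integer using ALPHABET as digits."""
    if value == 0:
        return ALPHABET[0]
    chars = []
    while value:
        value, digit = divmod(value, BASE)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


def decode(text):
    """Inverse of encode()."""
    value = 0
    for ch in text:
        value = value * BASE + ALPHABET.index(ch)
    return value


def make_key(sequential_id):
    """Wrap a 32-bit sequential id in 16 random bits on each side, then encode."""
    if not 0 <= sequential_id < 2**32:
        raise ValueError("sequential_id must fit in 32 bits")
    high = secrets.randbits(16)
    low = secrets.randbits(16)
    return encode((high << 48) | (sequential_id << 16) | low)


def extract_id(key):
    """Recover the sequential id from the center bits of a key."""
    return (decode(key) >> 16) & 0xFFFFFFFF


key = make_key(1243)
```

Because the sequential id can be recovered from the key, two different ids can never map to the same key; the random bits only affect how the key looks.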
Here's how I would do it.
I'd obtain a list of common English words with usage frequency and some grammatical information (like is it a noun or a verb?). I think you can look around the intertubes for some copy. Firefox is open-source and it has a spellchecker... so it must be obtainable somehow.
Then I'd run a filter on it to remove obscure words and words that are too long.
Then my generation algorithm would pick 2 words from the list, concatenate them, and add a random 3-digit number.
I could also randomize the word selection pattern between verbs and nouns, like
eatCake778
pickBasket524
rideFlyer113
etc..
The case needn't be camel case; you can randomize that as well. You can also randomize the placement of the number and the verb/noun.
And since that's a lot of randomizing, Jeff's "The Danger of Naïveté" is a must-read. Also make sure to study dictionary attacks well in advance.
And after I'd implemented it, I'd run a test to check how often my algorithm produces collisions. If the collision rate was high, then I'd play with the parameters (number of nouns used, number of verbs used, length of the random number, total number of words, different kinds of casing etc.)
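A toy version of this word-based generator in Python might look like the following. The two word lists are tiny placeholders taken from the examples above, and `word_key` and `keyspace_size` are names invented for this sketch.

```python
import secrets

# Placeholder lists; a real word list would be much larger.
VERBS = ["eat", "pick", "ride"]
NOUNS = ["Cake", "Basket", "Flyer"]


def word_key():
    """Return verb + noun + a zero-padded random number from 000 to 999."""
    number = secrets.randbelow(1000)
    return f"{secrets.choice(VERBS)}{secrets.choice(NOUNS)}{number:03d}"


def keyspace_size():
    """Number of distinct keys this generator can produce."""
    return len(VERBS) * len(NOUNS) * 1000


key = word_key()
```

With these placeholder lists there are only 9000 possible keys, so collisions would show up quickly; the size of the word lists is the main parameter to tune.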
In .NET you can use the RNGCryptoServiceProvider method GetBytes(), which will "fill an array of bytes with a cryptographically strong sequence of random values" (from the MS documentation).
// Requires: using System.Security.Cryptography;
byte[] randomBytes = new byte[4];
RNGCryptoServiceProvider rng = new RNGCryptoServiceProvider();
rng.GetBytes(randomBytes); // fills the array with random bytes
You can increase the length of the byte array and pluck out the character values you want to allow.
In C#, I have used the 'System.IO.Path.GetRandomFileName() : String' method... but I was generating salt for debug file names. This method returns stuff that looks like your first example, except with a random '.xyz' file extension too.
If you're in .NET and just want a simpler (but not 'nicer' looking) solution, I would say this is it... you could remove the random file extension if you like.
At the time of this writing, this question's title is:
How can I generate a unique, small, random, and user-friendly key?
To that, I should note that it's not possible in general to create a random value that's also unique, at least if each random value is generated independently of any other. In addition, there are many things you should ask yourself if you want to generate unique identifiers (which come from my section on unique random identifiers):
Can the application easily check identifiers for uniqueness within the desired scope and range (e.g., check whether a file or database record with that identifier already exists)?
Can the application tolerate the risk of generating the same identifier for different resources?
Do identifiers have to be hard to guess, be simply "random-looking", or be neither?
Do identifiers have to be typed in or otherwise relayed by end users?
Is the resource an identifier identifies available to anyone who knows that identifier (even without being logged in or authorized in some way)?
Do identifiers have to be memorable?
In your case, you have several conflicting goals: You want identifiers that are—
unique,
easy to type by end users (including small), and
hard to guess (including random).
Important points you don't mention in the question include:
How will the key be used?
Are other users allowed to access the resource identified by the key, whenever they know the key? If not, then additional access control or a longer key length will be necessary.
Can your application tolerate the risk of duplicate keys? If so, then the keys can be completely randomly generated (such as by a cryptographic RNG). If not, then your goal will be harder to achieve, especially for keys intended for security purposes.
Note that I don't go into the issue of formatting a unique value into a "user-friendly key". There are many ways to do so, and they all come down to mapping unique values one-to-one with "user-friendly keys" — if the input value was unique, the "user-friendly key" will likewise be unique.
If by user friendly, you mean that a user could type the answer in then I think you would want to look in a different direction. I've seen and done implementations for initial random passwords that pick random words and numbers as an easier and less error prone string.
If, though, you're looking for a way to encode a random code in the URL string, which is an issue I've dealt with for a while, then what I have done is use 64-bit encoded GUIDs.
You could load your list of words as chakrit suggested into a data table or xml file with a unique sequential key. When getting your random word, use a random number generator to determine what words to fetch by their key. If you concatenate 2 of them, I don't think you need to include the numbers in the string unless "true randomness" is part of the goal.