Facebook user_id as MongoDB BSON ObjectId?

I'm rebuilding Lovers on Facebook with Sinatra & Redis. I like Redis because it doesn't have the long (12-byte) BSON ObjectIds and I am storing sets of Facebook user_ids for each user. The sets are requests_sent, requests_received, & relationships, and they all contain Facebook user ids.
I'm thinking of switching to MongoDB because I want to use its geospatial indexing. If I do, I'd want to use the FB user ids as the _id field because I want the sets to be small and I want the JSON responses to be small. But is the BSON ObjectId better (more efficient for MongoDB) to use than just an integer (fb user_id)?

There are no major efficiency differences as far as I know, except in certain cases like ordering by date (since ObjectIds have the creation datetime in them).
For example, you'd lose the ability to get creation order by simply sorting on the _id, and you'd also lose the benefits for sharding and distribution. Aside from that, while I'd still personally use ObjectIds anyhow, as long as the int is unique (of course) you should be just fine.
Since the _id always "comes back" in a query I suppose you'd save a little time and data transfer (a bitty bit.)
You can even make your _id a compound value if you wanted (an embedded document; current MongoDB versions reject an array as _id), and it'll all index nicely; see this answer (not that I'd necessarily recommend that most of the time).
Also see: Optimizing Object IDs

Related

Are there any reasons why I should/shouldn't use ObjectIds in my RESTful URLs

I'm using mongoDB for the first time in a RESTful service. Previously the id column in my SQL databases was an incrementing integer so my RESTful endpoints would look something like /rest/objectType/1. Is there any reason why I shouldn't just use mongoDB's ObjectId's in the same role, or is it wiser to maintain a separate incrementing integer id column and use this for urls?
Having used ObjectIds in RESTful APIs several times, the biggest downside is really that they are very noisy in terms of having a clean URL. You'll either leave it as a HEX number, or convert it to a very large integer number, both making for a somewhat unfriendly URL:
/rest/resource/52435dbecb970072ec3a780f
/rest/resource/25459211534898951476729247759
I've added a "title" to the URL (like StackOverflow does) to make them slightly more friendly:
/rest/resource/52435dbecb970072ec3a780f/FriendlyResourceName
Of course, the "title" is ignored in software, but the user sees it and can mentally ignore the crazy ID segment.
There's very little useful information about your infrastructure that could be learned by exposing them:
Timestamp
Machine ID
Process ID
Random incrementing value
Other than potentially gathering Machine IDs (which generally would indicate the number of clients creating ObjectIds), there's not much there.
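To make the list above concrete, here is a minimal Python sketch that splits the example id from the URLs earlier in this answer into those four fields. It assumes the legacy 12-byte layout that the list describes (4-byte timestamp, 3-byte machine id, 2-byte process id, 3-byte counter); newer drivers replace the machine and process fields with a single random value, and reading the process id as big-endian is an assumption. The name decode_object_id is hypothetical and does not come from any driver.

```python
from datetime import datetime, timezone


def decode_object_id(hex_id: str) -> dict:
    """Split a 24-character hex ObjectId into its legacy components."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError("an ObjectId is exactly 12 bytes (24 hex characters)")
    seconds = int.from_bytes(raw[0:4], "big")  # seconds since the Unix epoch
    return {
        "timestamp": seconds,
        "created": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "machine": raw[4:7].hex(),                # assumed legacy machine id
        "pid": int.from_bytes(raw[7:9], "big"),   # assumed legacy process id
        "counter": int.from_bytes(raw[9:12], "big"),
    }


parts = decode_object_id("52435dbecb970072ec3a780f")
print(parts["created"].isoformat())
```

Anyone who sees the id can run the same few lines, which is why the creation time should be treated as public once the id appears in a URL.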
ObjectIds aren't random, so you couldn't use them for security. You'll always need to secure the data. While they may not increment in an obvious way, it would be easy to find other resources through brute force. However, if you were using auto-incrementing IDs before, this isn't a new problem for you.
If you know you aren't creating many new documents at any given time, it might be worth using one of the patterns here to create a simpler ID. In one app I wrote, I used an auto-inc technique for some of the document IDs that were shown in URLs, and for those that were Ajax-only, I used ObjectIds. I really wanted some URLs to be easily "typed". No form of an ObjectId is easily typed by an end user. That's one of the strengths of MongoDB -- that you can use any _id format you want. :)
It's wiser to use the ObjectIds, because keeping an incrementing counter can be a bottleneck. Also, since ObjectIds contain a timestamp and increase over time, they can be helpful in optimizing queries.
The ObjectIds can be guessed, but since that is definitely true for incrementing IDs, I suspect you didn't rely on security through obscurity before, so that's no trouble for you.
A downside, albeit a small one, is that the creation time on your server leaks to the user, i.e. if the user is able to identify this as an ObjectId, she can reverse-engineer the creation time of the object. That's the only potential issue I see.

Looking up submissions by ID with MongoDB/Mongoose

So I'm used to looking up submissions by an auto increment primary ID with MySQL, but after using the MongoDB ORM wrapper, Mongoose, I'm finding that since Mongo stores data within collections differently, there is not really any concept of a traditional auto-increment ID.
I'm stuck trying to figure out how to grab a submission now because normally I'd structure my URL like so:
submission/34/category/slug-goes-here.
Since the 34 now becomes an ugly string-based UUID with Mongo, I don't necessarily want to display that in my URLs, but I want a unique URL in order to look up my submissions.
I'm thinking of maybe having a set method that when I insert the submission into my database, it generates some kind of 6 character hash e.g. zhXk40 and looks it up like that.
I'm wondering if I do it like this what the performance trade-offs would be. If I made constraints on the slug and then looked them up with the slug, and verified that the category matched, would that be more efficient? Either way I'm going to have to check if the category and slug match, but I'm not sure if an ID is even really necessary in this case.
What's the best practice for creating a route + looking up some piece of data from the db based on that route?
The first thing you should know is:
The _id property doesn't necessarily have to be that "ugly" ObjectId string.
Actually, the _id just needs to be unique within its collection, so if you want to use auto-incrementing IDs, there is no problem. However...
If you plan to use sharding within your database, then using an auto-incrementing field as the _id is a poor choice.
Why? Read the accepted answer here: Should I implement auto-incrementing in MongoDB?
In my application, as we're not going to shard it, we use an indexed, numeric ID just for easier usability for the end user, and internally all references are ObjectIds.
Also, here's a good tutorial for creating an auto-incrementing field in MongoDB: http://docs.mongodb.org/manual/tutorial/create-an-auto-incrementing-field/
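The usual shape of such a field is one counter per sequence that is incremented and read back in a single atomic step. Below is a minimal in-memory Python sketch of that idea, not the tutorial's code: CounterStore and next_sequence are hypothetical names, and the lock stands in for whatever atomic increment-and-return operation the database provides.

```python
import threading


class CounterStore:
    """In-memory stand-in for a collection of named counters (hypothetical)."""

    def __init__(self) -> None:
        self._seq = {}
        self._lock = threading.Lock()

    def next_sequence(self, name: str) -> int:
        # Increment and read back under one lock so that two concurrent
        # callers can never be handed the same number.
        with self._lock:
            self._seq[name] = self._seq.get(name, 0) + 1
            return self._seq[name]


counters = CounterStore()
submission = {"_id": counters.next_sequence("submissions"), "slug": "slug-goes-here"}
print(submission)
```

Every insert has to pass through that single counter, which is the reason this approach fits an unsharded application better than a sharded one.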
Sheesh! If we were able to truly know best practice here we'd all be better off. The talking heads are still talking and will be for some time.
How I would approach this is to go as pretty as I could. If I had a usable text string to make it semantic that'd be best case.
If I couldn't do that I'd go with the hash thing you suggested.
With both solutions the challenge will be to ensure that it remains unique. That means lookups before you save.
On performance, it's the same as SQL: index what you use to look things up with. Mongo does well with compound indexes, so category name and hash will look up pretty quickly.

Working with ugly mongodb _ids on front end

A kinda subjective question, but I have a few concerns about working with mongodb _ids on the client side. I would rather use something like s52ruf6wst or xR2ru286zjI for RESTful resources and for working with small collections of items.
1) I'm starting to depend on a proprietary implementation of the backend database (_id field name and implementation). If I stick with these _ids it is harder to replace the back-end db later.
2) I've got huge ugly URLs containing mongo _ids (even for REST endpoints - I don't like it)
3) For hackers and "curious users" it becomes obvious which back-end db is used.
As I see it, most web applications use their own conventions for how ids, uids and uuids should look, and to me that looks more professional (than using the straightforward ugly implementation from the db vendor).
So the question is when it is good to use standard mongo _ids and use them across back and front ends? And what can be done to improve the situation?
when it is good to use standard mongo _ids
Always. Except when you simply don't like it. But your personal preferences have nothing to do with security. Mongo's object ids are not inherently less safe than any other identifier type (integer, UUID, etc.)
ObjectID is designed to be unique across your cluster and this is very important, because mongodb is a distributed DB. It also has a nice property: values are monotonically increasing with time. This property may or may not be useful to you.
1) I'm starting to depend on a proprietary implementation of the backend database (_id field name and implementation). If I stick with these _ids it is harder to replace the back-end db later.
This is where abstraction layers, frameworks and ODMs (ORMs) come in. They provide a standardised layer (e.g. Doctrine 2) to query multiple different types of database. As an example, id translates in many ORMs to _id, ID or id depending on which database you are using.
As said before, the ObjectId has no inherent security flaws. It isn't even that useful in general to other users: even though the ObjectId has a time part, this time part cannot easily be used to work out what the next object is (unlike an auto-incrementing ID). The only way to do this reliably would be to test all times up until now and all pid numbers to detect whether a hidden object exists. So it is not at all easy to crawl ObjectId URLs, and in fact they are not very SEO friendly for that exact reason. But yes, users could work out which database you are using.
That being said, yes, they are ugly, but they are that long and ugly in order to be, as @Sergio says, unique. Making your own will be just as bad. I suppose you could shrink it a little by base64 encoding the bytes behind the hexadecimal representation of the ObjectId.
However I am unsure if you really need that.
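For a sense of how much the base64 idea saves, here is a small Python sketch that encodes the 12 raw bytes of an ObjectId with the URL-safe base64 alphabet and decodes them again. The names shorten and expand are hypothetical helpers, and the sample id is only an illustration.

```python
import base64


def shorten(hex_id: str) -> str:
    """Encode the 12 raw bytes of an ObjectId as URL-safe base64."""
    return base64.urlsafe_b64encode(bytes.fromhex(hex_id)).decode("ascii")


def expand(short_id: str) -> str:
    """Reverse shorten() and return the 24-character hex form."""
    return base64.urlsafe_b64decode(short_id).hex()


short = shorten("52435dbecb970072ec3a780f")
print(len(short), short)
```

Twelve bytes always encode to 16 characters with no padding, so the gain over the 24-character hex form is modest, which fits the doubt above about whether it is really needed.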

strategy for creating MongoDB short ids that scale

I want to have friendlier-looking ids (i.e. YouTube style: /posts/cxB6Ey6) than MongoDB's ObjectId.
I read that for scalability it's best to leave _id as an ObjectId, so I thought about two solutions:
1) add an indexed postid field to each document
2) create a mapping collection between _id and the postid
In both cases, use something like https://github.com/dylang/shortid to generate the short id, and while generating make sure that the id is unique by querying the database.
(Can this query-generate-insert be an atomic operation?)
Will those solutions have a noticeable impact on performance?
What's the best strategy for doing this?
The normal method of doing this is to base64 encode a unique id but:
add an indexed postid field to each document
You definitely want to go for this method. Of the two, I would say it is easily the more scalable and performant: for one, it needs only one round trip to get a short URL's details, whereas the second option would take two. Another consideration is that it avoids the index overhead of maintaining an extra collection, so this is a bit of a no-brainer.
I would not replace the _id field within the document either since the default ObjectId could still be useful in the foreseeable future.
So this limits it down to a separate field and index (unique key) for the short code of a URL.
The next thing is that you don't want an ID which forces you to query the database for uniqueness prior to every insert. This is where the ObjectId shines: it can be generated within the client application and still be unique in the database, without having to query to confirm it.
Unique ids that do not require querying the database first are normally time based. In PHP ( http://php.net/manual/en/function.uniqid.php ) and in the MongoDB Drivers ( http://docs.mongodb.org/manual/core/object-id/ ) and even the plug-in you linked on github ( https://github.com/dylang/shortid/blob/master/lib/shortid.js#L50 ) they all use time as a basis for being unique.
Considering the plug-in you linked does not query the database to check its own IDs' uniqueness, I would say that this plug-in is probably quite performant, and if you use it with the first solution you stated you should get a good benchmark out of it.
If you want to replace the built-in ObjectId with custom user-friendly short ids, then do it. You can either use the built-in _id field or add a new unique-indexed field for your custom ids. The benefit of using built-in ObjectIds is that they won't duplicate even if your database is extremely large. So, by replacing them with short ids you take on the risk of id duplication.
Now about performance. I think the best solution is not to query the DB for ids, because with a properly chosen id length the probability of duplication is extremely small. So the best way to handle id duplication in this model is to check Mongo's responses: if it responds with a "duplicate key error", generate a new id and retry.
And now about scaling. To scale your custom ids you can just add a few more characters to them. A "duplicate key error" should be the trigger for making that change. Normally there should be no such errors, so if they start to appear then it's time to scale.
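The insert-then-retry approach can be sketched in Python as follows. Everything here is a stand-in rather than a real driver API: Posts imitates a collection with a unique index on postid, DuplicateKeyError imitates the error such an index raises, and insert_with_short_id, its parameters and the 7-character default are hypothetical.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


class DuplicateKeyError(Exception):
    """Stand-in for the error a unique index raises on a duplicate value."""


class Posts:
    """In-memory stand-in for a collection with a unique index on postid."""

    def __init__(self) -> None:
        self._by_postid = {}

    def insert(self, doc: dict) -> None:
        if doc["postid"] in self._by_postid:
            raise DuplicateKeyError(doc["postid"])
        self._by_postid[doc["postid"]] = doc


def insert_with_short_id(posts, doc, length=7, max_attempts=5, new_id=None):
    """Insert doc under a random short id, generating a new id on collision."""
    make = new_id or (lambda: "".join(secrets.choice(ALPHABET) for _ in range(length)))
    for _ in range(max_attempts):
        candidate = dict(doc, postid=make())
        try:
            posts.insert(candidate)  # no uniqueness query before the insert
            return candidate["postid"]
        except DuplicateKeyError:
            continue  # collision: loop again with a fresh id
    raise RuntimeError("too many collisions; consider a longer id")


posts = Posts()
print(insert_with_short_id(posts, {"title": "hello"}))
```

Hitting the max_attempts limit, or seeing retries at all in the logs, is the signal described above that the id length should grow.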
I don't think that generating an ObjectId for the _id field directly affects scalability or performance. How would that happen?
The main difference is that ObjectIds are created for you by MongoDB, so you don't burden yourself with that responsibility. Otherwise you must determine the optimal id size yourself and ensure a unique value for the _id field of every document stored in the collection. That is required because _id is used as the primary key. It can be justified if your collection is not very big and you need a custom identifier value.
But an _id field that stores ObjectId values gives you additional benefits, such as the opportunity to create ObjectIds from a time and use that to your advantage in queries. You can also get the timestamp of an ObjectId's creation with the getTimestamp() method. And sorting on _id in this case is roughly equivalent to sorting by creation time.
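As an illustration of creating ids from a time, the Python sketch below builds the smallest possible ObjectId hex string for a given moment: the 4-byte timestamp followed by eight zero bytes. The helper name object_id_floor and the sample id are assumptions for illustration; with a real driver the two boundary strings would be turned into ObjectId values and used as lower and upper bounds in a range query on _id.

```python
from datetime import datetime, timezone


def object_id_floor(moment: datetime) -> str:
    """Smallest ObjectId hex string whose timestamp part equals `moment`."""
    seconds = int(moment.timestamp())
    # 8 hex digits of timestamp, then 16 zero digits for the remaining 8 bytes.
    return format(seconds, "08x") + "0" * 16


start = object_id_floor(datetime(2013, 9, 25, tzinfo=timezone.utc))
end = object_id_floor(datetime(2013, 9, 26, tzinfo=timezone.utc))
sample = "52435dbecb970072ec3a780f"

# Lowercase hex strings of equal length compare in the same order as the
# underlying bytes, so a plain string comparison is enough for this check.
print(start <= sample < end)
```

Because the timestamp is the leading part of the id, this kind of range works against the default _id index without any extra date field.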
But if you're going to use ObjectIds in URLs or HTML, then for security reasons you can encrypt them, to prevent leaking information such as the object's creation time. That may be a security risk.
About your solutions:
1) I suppose this is a very convenient and flexible solution. In this case you can specify any value in postId, and it doesn't depend directly on _id.
A small disadvantage of this solution is that you need an extra field and an extra index, while _id is indexed automatically.
2) I don't think this is a good solution from the point of view of performance or of the philosophy of the NoSQL approach.

Mongodb: object id as short primary key within a collection

How can I make better use of the ObjectId generated by MongoDB? I am not an expert user, but so far I have ended up creating a separate id for my objects (userid, postid, etc.) because the ObjectId is too long and makes the URL ugly if used as the main ID. I keep the _id intact as it helps indexing etc. I was wondering about any better strategy so that one can use the mongo ObjectId as a more URL-friendly and easy-to-remember key. I read the key is a combination of date etc., so could any of its parts be used as a unique key within a collection for this purpose?
thanks,
bsr/
If you have an existing ID (say from an existing data set), then it's perfectly OK to override _id with the one you have.
...keep the _id intact as it helps indexing etc
MongoDB indexes the _id field by default. If you start putting integers in the _id field, they will be indexed like everything else.
So most RDBMSs provide an "auto-increment" ID. This is nice for small datasets, but really poor in terms of scalability. If you're trying to insert data to 20 servers at once, how do you keep the "auto-increment" intact?
The normal answer is that you don't. Instead, you end up using things like GUIDs for those IDs. In the case of MongoDB, the ObjectId is already provided.
I was wondering about any better strategy so that one can use mongo objectId as more url friendly and easy to remember key
So the problem here is that "easy to remember" ID doesn't really mesh with "highly scalable database". When you have a billion documents, the IDs are not really "easy to remember".
So you have to make the trade-off here. If you have a table that can get really big, I suggest using the ObjectId. If you have a table that's relatively small and doesn't get updated often, (like a "lookup" table) then you can build your own auto-increment.
The choice is really up to you.
You can overwrite the _id yourself. There is no obligation for using the auto-generated object id. What is the problem with overriding _id inside your app according to your own needs?