Working with ugly MongoDB _ids on the front end

A somewhat subjective question, but I have a few concerns about working with MongoDB _ids on the client side. I would rather use something like s52ruf6wst or xR2ru286zjI for RESTful resources and for working with small collections of items.
1) I become dependent on a proprietary implementation detail of the back-end database (the _id field name and its format). If I stick with these _ids, it is harder to replace the back-end db later.
2) I get huge, ugly URLs containing Mongo _ids (even for REST endpoints, and I don't like it).
3) For hackers and "curious users" it becomes obvious which back-end db is used.
As far as I can see, most web applications use their own conventions for how ids, uids and uuids should look, and to me that looks more professional than using the db vendor's straightforward, ugly implementation.
So the question is: when is it good to use standard Mongo _ids across both back end and front end? And what can be done to improve the situation?

when it is good to use standard mongo _ids
Always. Except when you simply don't like it. But your personal preferences have nothing to do with security. Mongo's object ids are not inherently less safe than any other identifier type (integer, UUID, etc.)
ObjectID is designed to be unique across your cluster and this is very important, because mongodb is a distributed DB. It also has a nice property: values are monotonically increasing with time. This property may or may not be useful to you.

1) I'm starting to be dependent on proprietary implementation of backend database (_id field name and implementation). If I stick with these _ids it is harder to replace back-end db later.
This is where abstraction layers, frameworks and ODMs (ORMs) come in. They provide a standardised layer (e.g. Doctrine 2) for querying multiple different types of database. As an example, the id field translates in many ORMs to _id, ID or id, depending on which database you are using.
As said before, the ObjectId has no inherent security flaws. It isn't even that useful to other users in general: even though the ObjectId has a time part, that time part cannot easily be used to work out what the next object is (unlike an auto-incrementing ID). The only way to do this reliably would be to test all times up until now and all PID numbers to detect whether a hidden object exists. So it is not at all easy to crawl ObjectId URLs, and in fact they are not very SEO friendly for that exact reason. But yes, users could work out which database you are using.
That being said, yes, they are ugly, but they are that long and ugly in order to be, as @Sergio says, unique. Making your own will be just as bad. I suppose you could shrink it a little by base64 encoding the ObjectId's 12 bytes instead of using its 24-character hexadecimal representation.
However I am unsure if you really need that.
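For what it's worth, the base64 idea could look roughly like the sketch below. It assumes the id arrives as the usual 24-character hex string; the function names are made up for illustration and nothing here comes from a MongoDB driver. Because 12 bytes divide evenly into base64 groups, the result is 16 characters with no padding.

```python
import base64


def shorten_object_id(hex_id: str) -> str:
    """Encode a 24-character hex ObjectId as 16 URL-safe base64 characters."""
    raw = bytes.fromhex(hex_id)  # an ObjectId is 12 raw bytes
    if len(raw) != 12:
        raise ValueError("expected a 24-character hex ObjectId")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def restore_object_id(short_id: str) -> str:
    """Reverse shorten_object_id and return the original hex string."""
    return base64.urlsafe_b64decode(short_id).hex()


# Hypothetical usage with an id taken from a URL.
short_id = shorten_object_id("52435dbecb970072ec3a780f")
original = restore_object_id(short_id)
```

This only saves 8 characters per id, so whether it is worth the extra conversion step is a judgment call.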

Related

Are there any advantages to using a custom _id for documents in MongoDB?

Let's say I have a collection called Articles. If I were to insert a new document into that collection without providing a value for the _id field, MongoDB will generate one for me that is specific to the machine and the time of the operation (e.g. sdf4sd89fds78hj).
However, I do have the ability to pass a value for MongoDB to use as the value of the _id key (e.g. 1).
My question is, are there any advantages to using my own custom _ids, or is it best to just let Mongo do its thing? In what scenarios would I need to assign a custom _id?
Update
For anyone else that may find this. The general idea (as I understand it) is that there's nothing wrong with assigning your own _ids, but it forces you to maintain unique values within your application layer, which is a PITA, and requires an extra query before every insert to make sure you don't accidentally duplicate a value.
Sammaye provides an excellent answer here:
Is it bad to change _id type in MongoDB to integer?
Advantages with generating your own _ids:
You can make them more human-friendly, by assigning incrementing numbers: 1, 2, 3, ...
Or you can make them more human-friendly, using random strings: t3oSKd9q
(That doesn't take up too much space on screen, could be picked out from a list, and could potentially be copied manually if needed. However you do need to make it long enough to prevent collisions.)
If you use randomly generated strings they will have an approximately even sharding distribution, unlike the standard mongo ObjectIds, which tend to group records created around the same time onto the same shard. (Whether that is helpful or not really depends on your sharding strategy.)
Or you may like to generate your own custom _ids that will group related objects onto one shard, e.g. by owner, or geographical region, or a combination. (Again, whether that is desirable or not depends on how you intend to query the data, and/or how rapidly you are producing and storing it. You can also do this by specifying a shard key, rather than the _id itself. See the discussion below.)
Advantages to using ObjectIds:
ObjectIds are very good at avoiding collisions. If you generate your own _ids randomly or concurrently, then you need to manage the collision risk yourself.
ObjectIds contain their creation time within them. That can be a cheap and easy way to retain the creation date of a document, and to sort documents chronologically. (On the other hand, if you don't want to expose/leak the creation date of a document, then you must not expose its ObjectId!)
The nanoid module can help you to generate short random ids. They also provide a calculator which can help you choose a good id length, depending on how many documents/ids you are generating each hour.
Alternatively, I wrote mongoose-generate-unique-key for generating very short random ids (provided you are using the mongoose library).
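To give a feel for the "make it long enough to prevent collisions" point, here is a minimal sketch using only the Python standard library. It is not how nanoid or mongoose-generate-unique-key work internally; the alphabet, the function names and the birthday-bound approximation are my own assumptions.

```python
import math
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 62 symbols (an assumed alphabet)


def random_id(length: int = 8) -> str:
    """Return a random id drawn from ALPHABET using a cryptographic RNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def collision_probability(num_ids: int, length: int,
                          alphabet_size: int = len(ALPHABET)) -> float:
    """Approximate chance that at least two of num_ids random ids collide.

    Uses the birthday approximation p ~ 1 - exp(-n(n-1) / (2N)),
    where N is the number of possible ids.
    """
    space = alphabet_size ** length
    return -math.expm1(-num_ids * (num_ids - 1) / (2 * space))


# Hypothetical comparison: one million ids at two different lengths.
p_short = collision_probability(1_000_000, 8)
p_long = collision_probability(1_000_000, 12)
```

Even with a low estimated probability you would still want a unique index on the field, so that a collision fails loudly instead of silently overwriting something.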
Sharding strategies
Note: Sharding is only needed if you have a huge number of documents (or very heavy documents) that cannot be managed by one server. It takes quite a bit of effort to set up, so I would not recommend worrying about it until you are sure you actually need it.
I won't claim to be an expert on how best to shard data, but here are some situations we might consider:
An astronomical observatory or particle accelerator handles gigabytes of data per second. When an interesting event is detected, they may want to store a huge amount of data in only a few seconds. In this case, they probably want an even distribution of documents across the shards, so that each shard will be working equally hard to store the data, and no one shard will be overwhelmed.
You have a huge amount of data and you sometimes need to process all of it at once. In this case (but depending on the algorithm) an even distribution might again be desirable, so that all shards can work equally hard on processing their chunk of the data, before combining the results at the end. (Although in this scenario, we may be able to rely on MongoDB's balancer, rather than our shard key, for the even distribution. The balancer runs in the background after data has been stored. After collecting a lot of data, you may need to leave it to redistribute the chunks overnight.)
You have a social media app with a large amount of data, but this time many different users are making many light queries related mainly to their own data, or their specific friends or topics. In this case, it doesn't make sense to involve every shard whenever a user makes a little query. It might make sense to shard by userId (or by topic or by geographical region) so that all documents belonging to one user will be stored on one shard, and when that user makes a query, only one shard needs to do work. This should leave the other shards free to process queries for other users, so many users can be served at once.
Sharding documents by creation time (which the default ObjectIds will give you) might be desirable if you have lots of light queries looking at data for similar time periods. For example many different users querying different historical charts.
But it might not be so desirable if most of your users are querying only the most recent documents (a common situation on social media platforms) because that would mean one or two shards would be getting most of the work. Distributing by topic or perhaps by region might provide a flatter overall distribution, whilst also allowing related documents to clump together on a single shard.
You may like to read the official docs on this subject:
https://docs.mongodb.com/manual/sharding/#shard-key-strategy
https://docs.mongodb.com/manual/core/sharding-choose-a-shard-key/
I can think of one good reason to generate your own ID up front. That is for idempotency. For example so that it is possible to tell if something worked or not after a crash. This method works well when using re-try logic.
Let me explain. The reason people might consider re-try logic:
Inter-app communication can sometimes fail for different reasons (especially in a microservice architecture). The app would be more resilient and self-healing if it were coded to re-try rather than give up right away. This rides over odd blips that might occur without the consumer ever being affected.
For example when dealing with mongo, a request is sent to the DB to store some object, the DB saves it, but just as it is trying to respond to the client to say everything worked fine, there is a network blip for whatever reason and the “OK” is never received. The app assumes it didn't work and so the app may end up re-trying the same data and storing it twice, or worse it just blows up.
Creating the ID up front is an easy, low overhead way to help deal with re-try logic. Of course one could think of other schemes too.
Although this sort of resiliency may be overkill in some types of projects, it really just depends.
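The lost-acknowledgement scenario above can be sketched without a real database. Everything in this example is a stand-in: FlakyStore, DuplicateKeyError and save_with_retry are hypothetical names, and the store simply pretends that the first "OK" never reaches the client. The point is only that, with the _id chosen up front, a re-try cannot create a second copy.

```python
import uuid


class DuplicateKeyError(Exception):
    """Hypothetical stand-in for a driver's duplicate-_id error."""


class FlakyStore:
    """In-memory stand-in for a collection whose first acknowledgement is lost."""

    def __init__(self):
        self.docs = {}
        self.lose_next_ack = True

    def insert(self, doc):
        if doc["_id"] in self.docs:
            raise DuplicateKeyError(doc["_id"])
        self.docs[doc["_id"]] = doc  # the write itself succeeds
        if self.lose_next_ack:
            self.lose_next_ack = False
            raise ConnectionError("write succeeded, but the OK was lost")


def save_with_retry(store, payload, attempts=3):
    """Insert payload under an _id generated before the first attempt."""
    doc = {"_id": uuid.uuid4().hex, **payload}
    for _ in range(attempts):
        try:
            store.insert(doc)
            return doc["_id"]
        except DuplicateKeyError:
            # An earlier attempt already stored this exact document.
            return doc["_id"]
        except ConnectionError:
            continue  # unknown outcome: try again with the same _id
    raise RuntimeError("gave up after %d attempts" % attempts)


store = FlakyStore()
saved_id = save_with_retry(store, {"amount": 42})
```

Treating a duplicate-key error as success is only safe here because the id is random and was generated for this one document; with guessable or reused ids you would need to compare the stored document as well.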
I have used custom ids a couple of times and it was quite useful.
In particular I had a collection where I would store stats by date, so the _id was actually a date in a specific format. I did that mostly because I would always query by date. Keep in mind that this approach can simplify your indexes: no extra index is needed, because the built-in index on _id is sufficient.
Sometimes the ID is something more meaningful than a randomly generated one. For example, a user collection may use the email address as the _id instead. In my project I generate IDs that are much shorter than the ones MongoDB uses so that the ID shown in the URL is much shorter.
I'll use an example. I created a property management tool that had multiple collections. For simplicity, some fields were duplicated across them, for example the payment. When I needed to update these records, the update had to happen simultaneously in every collection the payment appeared in, so I assigned each payment a custom payment id. That way, when a delete or query is performed, it affects all instances of the payment database-wide.

Are there any reasons why I should/shouldn't use ObjectId's in my RESTful url's

I'm using mongoDB for the first time in a RESTful service. Previously the id column in my SQL databases was an incrementing integer so my RESTful endpoints would look something like /rest/objectType/1. Is there any reason why I shouldn't just use mongoDB's ObjectId's in the same role, or is it wiser to maintain a separate incrementing integer id column and use this for urls?
Having used ObjectIds in RESTful APIs several times, the biggest downside is really that they are very noisy in terms of having a clean URL. You'll either leave it as a HEX number, or convert it to a very large integer number, both making for a somewhat unfriendly URL:
/rest/resource/52435dbecb970072ec3a780f
/rest/resource/25459211534898951476729247759
I've added a "title" to the URL (like StackOverflow does) to make them slightly more friendly:
/rest/resource/52435dbecb970072ec3a780f/FriendlyResourceName
Of course, the "title" is ignored in software, but the user sees it and can mentally ignore the crazy ID segment.
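A rough sketch of how the "title is ignored in software" part might be handled when parsing the path. The /rest/resource/ prefix comes from the URLs above; the regular expression and the function name are assumptions for illustration, not part of any framework.

```python
import re
from typing import Optional

# 24 lowercase hex characters, optionally followed by a single title segment.
RESOURCE_PATH = re.compile(r"^/rest/resource/([0-9a-f]{24})(?:/[^/]*)?$")


def extract_id(path: str) -> Optional[str]:
    """Return the ObjectId segment of the path, ignoring any trailing title."""
    match = RESOURCE_PATH.match(path)
    return match.group(1) if match else None


with_title = extract_id("/rest/resource/52435dbecb970072ec3a780f/FriendlyResourceName")
without_title = extract_id("/rest/resource/52435dbecb970072ec3a780f")
```

Both calls resolve to the same id, so the title can change (or be mistyped) without breaking the link.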
There's very little useful information about your infrastructure that could be learned from exposing them. An ObjectId is made up of:
Timestamp
Machine ID
Process ID
Random incrementing value
Other than potentially gathering Machine IDs (which generally would indicate the number of clients creating ObjectIds), there's not much there.
ObjectIds aren't random, so you couldn't use them for security. You'll always need to secure the data. While they may not increment in an obvious way, it would be easy to find other resources through brute force. However, if you were using auto-incrementing IDs before, this isn't a new problem for you.
If you know you aren't creating many new documents at any given time, it might be worth using one of the patterns here to create a simpler ID. In one app I wrote, I used an auto-inc technique for some of the document IDs that were shown in URLs, and for those that were Ajax-only, I used ObjectIds. I really wanted some URLs to be easily "typed". No form of an ObjectId is easily typed by an end user. That's one of the strengths of MongoDB -- that you can use any _id format you want. :)
It's wiser to use the ObjectIds, because keeping an incrementing counter can be a bottleneck. Also, since ObjectId contains a timestamp and is monotonic, they can be helpful in optimizing queries.
The ObjectIds can be guessed, but since that is definitely true for incrementing IDs, I suspect you didn't rely on security through obscurity before, so that's no trouble for you.
A downside, albeit a small one, is that the creation time on your server leaks to the user, i.e. if the user is able to identify this as an ObjectId, she can reverse-engineer the creation time of the object. That's the only potential issue I see.
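To show how little effort that reverse-engineering takes, here is a small sketch. It relies on the first 4 bytes of an ObjectId being a big-endian Unix timestamp in seconds; the function name is my own.

```python
from datetime import datetime, timezone


def creation_time(hex_id: str) -> datetime:
    """Read the creation time from the first 4 bytes (8 hex chars) of an ObjectId."""
    seconds = int(hex_id[:8], 16)  # seconds since the Unix epoch
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# The example id used in the URLs earlier in this thread.
created = creation_time("52435dbecb970072ec3a780f")
```

If that is information you would rather not hand out (say, when a user signed up), a separate public id is the safer choice.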

Looking up submissions by ID with MongoDB/Mongoose

So I'm used to looking up submissions by an auto increment primary ID with MySQL, but after using the MongoDB ORM wrapper, Mongoose, I'm finding that since Mongo stores data within collections differently, there is not really any concept of a traditional auto-increment ID.
I'm stuck trying to figure out how to grab a submission now because normally I'd structure my URL like so:
submission/34/category/slug-goes-here.
Since the 34 now becomes an ugly string-based UUID with Mongo, I don't necessarily want to display that in my URLs, but I want a unique URL in order to look up my submissions.
I'm thinking of maybe having a set method that when I insert the submission into my database, it generates some kind of 6 character hash e.g. zhXk40 and looks it up like that.
I'm wondering if I do it like this what the performance trade-offs would be. If I made constraints on the slug and then looked them up with the slug, and verified that the category matched, would that be more efficient? Either way I'm going to have to check if the category and slug match, but I'm not sure if an ID is even really necessary in this case.
What's the best practice for creating a route + looking up some piece of data from the db based on that route?
The first thing you should know is:
The _id property doesn't necessarily have to be that "ugly" ObjectId string.
Actually, the _id just needs to be unique within its collection, so if you want to use auto-incrementing IDs, there is no problem. However...
If you plan to use sharding within your database, then using an auto-incrementing field as the _id is overkill.
Why? Read the accepted answer here: Should I implement auto-incrementing in MongoDB?
In my application, as we're not going to shard it, we use an indexed, numeric ID just for easier usability for the end user, and internally all references are ObjectIds.
Also, here's a good tutorial for creating an auto-incrementing field in MongoDB: http://docs.mongodb.org/manual/tutorial/create-an-auto-incrementing-field/
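The general shape of a counter-based id can be sketched in memory. This is not the code from the tutorial: CounterStore is a hypothetical stand-in for a counters collection, and the lock plays the role of the database's atomic increment-and-return operation.

```python
import threading


class CounterStore:
    """In-memory stand-in for a `counters` collection (hypothetical helper)."""

    def __init__(self):
        self._seq = {}
        self._lock = threading.Lock()

    def next_id(self, name: str) -> int:
        # Increment and read must happen as one atomic step, otherwise two
        # concurrent inserts could be handed the same number.
        with self._lock:
            self._seq[name] = self._seq.get(name, 0) + 1
            return self._seq[name]


counters = CounterStore()
submission = {"_id": counters.next_id("submissions"), "slug": "slug-goes-here"}
```

Every insert has to pass through that single counter, which is why this pattern tends to be reserved for collections that are not written to heavily.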
Sheesh! If we were able to truly know best practice here we'd all be better off. The talking heads are still talking and will be for some time.
How I would approach this is to go as pretty as I could. If I had a usable text string to make it semantic that'd be best case.
If I couldn't do that I'd go with the hash thing you suggested.
With both solutions the challenge will be to ensure that it remains unique. That means lookups before you save.
On performance, it's the same as SQL. Index what you use to look things up with. Mongo does well with composite indexes, so category name and hash will look up pretty quickly.

Pseudo primary keys in MongoDB - bad idea?

http://code.flickr.com/blog/page/4/
This blog post is from the devs at Flickr, and outlines their simplified approach to generating GUIDs for photos in a sharded database environment using mysql.
I am working on an app that uses MongoDB for data store that has a similar requirement for items stored in embedded documents. Basically, a document in the collection represents a list of items, and then individual items inside that document each need to have some kind of identifier as well for lookup purposes. I'd rather not put items in a different collection since the list keys that aren't items are really just metadata and don't need to have their own collection. Ideally it should be one document.
I was thinking the kind of approach detailed in the blog post could be implemented to solve this problem - one endpoint that generates GUIDs for these entries and saves the last used value. The problem is that I am not certain if this approach introduces problems when sharding the data store in mongo. I don't have any experience distributing Mongo over several machines. I assume I could have the application layer check this endpoint when the data is saved and set the _id key appropriate, but I don't know how this would affect queries against the data set.
Would setting up this kind of GUID system be a flawed idea? I realize it runs counter to some of the principles of NoSQL in general, but since the documents are embedded, what alternative is there?
I think ObjectID is the way to go. They are stored much more compactly than GUIDs/UUIDs and maintain a roughly increasing order, which has benefits for indexing. ObjectIds are also designed to be generated client-side, without the need for a ticket server as described in the article. The only real downside compared with Flickr's solution is that ObjectIds use 12 bytes while an int64 uses 8 (GUIDs/UUIDs use 16 in binary or 32 in hex, plus a few bytes of overhead). One other potential downside (which is more likely to be a benefit in most cases) is that, because the creation time is encoded in the ObjectId, using them as publicly visible identifiers can leak possibly unwanted information to users, such as when another user signed up for your service.

CouchDB and MongoDB really search over each document with JavaScript?

From what I understand about these two "Not only SQL" databases. They search over each record and pass it to a JavaScript function you write which calculates which results are to be returned by looking at each one.
Is that actually how it works? Sounds worse than using a plain RDBMS without any indexed keys.
I built my schemas so they don't require join operations which leaves me with simple searches on indexed int columns. In other words, the columns are in RAM and a quick value check through them (WHERE user_id IN (12,43,5,2) or revision = 4) gives the database a simple list of ID's which it uses to find in the actual rows in the massive data collection.
So I'm trying to imagine how in the world looking through every single row in the database could be considered acceptable (if indeed this is how it works). Perhaps someone can correct me because I know I must be missing something.
@Xeoncross
I built my schemas so they don't require join operations which leaves me with simple searches on indexed int columns. In other words, the columns are in RAM and a quick value check through them (WHERE user_id IN (12,43,5,2) or revision = 4)
Well then, you'll love MongoDB. MongoDB supports indexes, so you can index user_id and revision and this query will be able to return relatively quickly.
However, please note that many NoSQL DBs only support key lookups and don't necessarily support "secondary indexes", so you have to do your homework on this one.
So I'm trying to imagine how in the world looking through every single row in the database could be considered acceptable (if indeed this is how it works).
Well if you run a query in an SQL-based database and you don't have an index that database will perform a table scan (i.e.: looking through every row).
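The difference between a scan and an indexed lookup can be illustrated with plain Python data structures. This is only an analogy: real databases typically use B-tree indexes rather than a hash map, and the rows, field names and helper functions below are made up for the example.

```python
from collections import defaultdict

# 1000 fake rows; user_id cycles through 0..49 and revision through 0..6.
rows = [{"id": i, "user_id": i % 50, "revision": i % 7} for i in range(1000)]


def table_scan(rows, wanted_user_ids):
    """No index: every row is examined, like a full table scan."""
    return [row["id"] for row in rows if row["user_id"] in wanted_user_ids]


def build_index(rows, field):
    """Build a field value -> list of row ids mapping, a stand-in for an index."""
    index = defaultdict(list)
    for row in rows:
        index[row[field]].append(row["id"])
    return index


def indexed_lookup(index, wanted_user_ids):
    """With an index: only the entries for the requested keys are touched."""
    return sorted(i for uid in wanted_user_ids for i in index.get(uid, []))


user_index = build_index(rows, "user_id")
scanned = sorted(table_scan(rows, {12, 43, 5, 2}))
looked_up = indexed_lookup(user_index, {12, 43, 5, 2})
```

Both approaches return the same rows; the index just avoids looking at the other 920.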
They search over each record and pass it to a JavaScript function you write which calculates which results are to be returned by looking at each one.
So in practice most NoSQL databases support this. But please never use it for real-time queries. This option is primarily for performing map-reduce operations that are used to summarize data.
Here's maybe a different take on NoSQL. SQL is really good at relational operations, however relational operations don't scale very well. Many of the NoSQL are focused on Key-Value / Document-oriented concepts instead.
SQL works on the premise that you want normalized, non-repeated data and that you want to grab that data in big sets. NoSQL works on the premise that you want fast queries for certain "chunks" of data, but that you're willing to wait for data dependent on "big sets" (running map-reduces in the background).
It's a big trade-off, but it makes a lot of sense for modern web apps. Most of the time is spent loading one page (blog post, wiki entry, SO question) and most of the data is really tied to or "hanging off" that element. So the concept of grabbing everything you need with one horizontally scalable query is really useful.
It's not the solution for everything, but it is a really good option for lots of use cases.
In terms of CouchDB, the Map function can be Javascript, but it can also be Erlang. (or another language altogether, if you pull in a 3rd Party View Server)
Additionally, Views are calculated incrementally. In other words, the map function is run on all the documents in the database upon creation, but further updates to the database only affect the related portions of the view.
The contents of a view are, in some ways, similar to an indexed field in an RDBMS. The output is a set of key/value pairs that can be searched very quickly, as they are stored as b-trees, which some RDBMSs use to store their indexes.
I think CouchDB stores the docs in a B-tree according to the "index" (view) and just walks this tree, so it's not searching every document.
see http://guide.couchdb.org/draft/btree.html
You should study them a bit more. It's not "worse" than an RDBMS, it's different ... in fact, given certain domains/functions, the "NoSQL" paradigm works out to be much quicker than traditional and, in some opinions, outdated RDBMS implementations. Think Google's Big Table platform and you get what MongoDB, Riak, CouchDB, Cassandra (Facebook) and many, many others are trying to accomplish. The primary difference is that most of these NoSQL solutions focus on Key/Value stores (some call these "document" databases) and have limited to no concept of relationships (in the primary/foreign key respect) and joins. Join operations on tables can be very expensive. Also, let's not forget the object-relational impedance mismatch issue... You don't need an ORM to access MongoDB. It can actually store your code object (or document) as it is in memory. Can you imagine the savings in lines of code and complexity!? db4o is another lightweight solution that does this.
I don't know what you mean when you say "Not only SQL" database. It's a NoSQL paradigm, wherein no SQL is used to query the underlying data store of the system. NoSQL also means not an RDBMS, which SQL is generally built on top of. That said, MongoDB does have an SQL-like syntax that can be used from .NET when retrieving data; it's called NoRM.
I will say I've only really worked with Riak and MongoDB... I'm by no means familiar with Cassandra or CouchDB past a reading level and feature set comprehension. I prefer to use MongoDB over them all. Riak was nice too, but not for what I needed. You should download a few of these NoSQL solutions and you will get the concept. Check out db4o, MongoDB and Riak, as I've found them to be the easiest, with more support for .NET based languages. It will just make sense for certain applications. All in all, the NoSQL or document database or OODBMS ... whatever you want to call it, is very appealing and gaining lots of momentum.
I also forgot about your JavaScript question... MongoDB has JavaScript "bindings" that enable it to be used as one method of searching for data. Riak handles data via a JSON format. MongoDB uses BSON, I believe, and I can't remember what the others use. In any case, the point is that instead of using SQL (structured query language) to "ask" the database for information, some of these (MongoDB being one) use JavaScript and/or RESTful syntax to ask the NoSQL system for data. I believe CouchDB and Riak can be queried over HTTP, which makes them very accessible. Not to mention, that's pretty frickin cool.
Do your research.... download them, they are all free and OSS.
db4o: http://www.db4o.com/ (Java & .NET versions)
MongoDB: mongodb.org/
Riak: http://www.basho.com/Riak.html
NoRM: http://thechangelog.com/post/436955815/norm-bringing-mongodb-to-net-linq-and-mono