What is considered a document in MongoDB?

I have been learning MongoDB for the past two days. I've been confused about documents and their limits, and how to overcome those limits. What is the difference between a document and a collection?

To roughly compare it with a traditional RDBMS:
RDBMS       | MongoDB
____________|____________________________
Database    | Database
Table       | Collection
Tuple/Row   | Document
Column      | Field
Primary Key | Primary Key (default _id field provided by MongoDB itself)
Processes (database server daemon and client):
RDBMS          | MongoDB
_______________|________
mysqld/oracle  | mongod
mysql/sqlplus  | mongo
mongos is the routing server used for load balancing in a sharded cluster (which may also involve replica sets).
Hence, in MongoDB, a database consists of collections of documents, and each document contains any number of key-value pairs.
Different documents in the same collection can have different numbers and types of keys. You could say it is schema-less; at the most basic level, values are retrieved by looking up keys in the document and in indexes built over those keys.
Within a document, keys can have values ranging from primitive datatypes (string, int, binary data, etc.) to embedded documents and arrays.
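As an illustration, here is a sketch using plain JavaScript objects to stand in for BSON documents (the field names are made up, not from the question):

```javascript
// Two documents that can coexist in the same collection even though
// they have different fields and types — MongoDB enforces no schema.
const doc1 = {
  name: "Alice",                          // string
  age: 30,                                // int
  address: { city: "Oslo", zip: "0150" }, // embedded document
  tags: ["admin", "beta"]                 // array
};
const doc2 = {
  name: "Bob",
  signupDate: new Date("2020-01-01")      // a field doc1 does not have at all
};

console.log(doc1.address.city); // → Oslo
console.log(doc2.age);          // → undefined
```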
As for the document size limit, 16 MB is usually sufficient, since MongoDB stores your data in Binary JSON (BSON) format. Also, embedded documents are not stored separately from the parent document and are counted within the 16 MB limit. Here are a few more links on the document limit:
latest change of mongodb document limit
doc limit
link
deciding factor
more info

Related

MongoDB - Weird difference in _id Index size

I have two sharded collections on 12 shards, with the same number of documents. The shard key of Collection1 is compound (two fields are used), and its documents consist of 4 fields. The shard key of Collection2 is a single field, and its documents consist of 5 fields.
Via db.collection.stats() command, I get the information about the indexes.
What seems strange to me is that for Collection1, the total size of the _id index is 1342 MB, while for Collection2 it is 2224 MB. Is this difference reasonable? I was expecting the total sizes to be more or less the same because of the equal number of documents. Note that the shard key of neither collection includes the _id field.
MongoDB uses prefix compression for indexes.
This means that if sequential values in the index begin with the same series of bytes, the bytes are stored for the first value, and subsequent values contain a tag indicating the length of the prefix.
Depending on the datatype of the _id values, the shared prefix can account for quite a bit of the index size.
There may also be orphaned documents causing one node to have more entries in its _id index.
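The prefix-compression idea above can be illustrated with a toy sketch (this is not WiredTiger's actual on-disk format, just the principle):

```javascript
// Store sorted keys as (sharedPrefixLength, suffix) pairs, the way
// prefix compression avoids repeating the leading bytes of similar keys.
function prefixCompress(sortedKeys) {
  const out = [];
  let prev = "";
  for (const key of sortedKeys) {
    let shared = 0;
    while (shared < Math.min(prev.length, key.length) && prev[shared] === key[shared]) {
      shared++;
    }
    out.push([shared, key.slice(shared)]);
    prev = key;
  }
  return out;
}

const keys = ["5fdedb7c25ab", "5fdedb7c25ac", "5fdedb7c30ff"];
console.log(prefixCompress(keys));
// → [[0, "5fdedb7c25ab"], [11, "c"], [8, "30ff"]]
```

Keys that share long leading byte sequences (for example, ObjectIds created close together in time) compress far better than high-entropy keys, which is one plausible reason two _id indexes over the same number of documents end up with different sizes.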

MongoDB shard key as (ObjectId, ObjectId, RandomKey). Unbalanced collections

I am trying to shard a collection with approximately 6M documents. Here are some details about the sharded cluster:
mongod version 2.6.7, two shards, 40% writes, 60% reads.
My database has a collection events with around 6M documents. A typical document looks like this:
{
    _id: ObjectId,
    sector_id: ObjectId,
    subsector_id: ObjectId,
    ...                      // many event-specific fields go here
    created_at: Date,
    updated_at: Date,
    uid: 16DigitRandomKey
}
Each sector has multiple (1, 2, ..., 100) subsectors, and each subsector has multiple events. There are 10,000 such sectors, 30,000 subsectors and 6M events, and the numbers keep growing.
The normal read query includes sector_id, subsector_id. Every write operation includes sector_id, subsector_id, uid (randomly generated unique key) and rest of the data.
I tried/considered following shard keys and the results are described below:
a. _id: hashed --> will not provide query isolation; reason: _id is not passed in the read query.
b. sector_id: 1, subsector_id: 1, uid: 1 --> strange distribution: a few sectors with old ObjectIds go to shard 1, sectors with mid-age ObjectIds are well balanced and equally distributed among both shards, and sectors with recent ObjectIds stay on shard 0.
c. subsector_id: hashed --> results were the same as with shard key b.
d. subsector_id: 1, uid: 1 --> same as b.
e. subsector_id: hashed, uid: 1 --> cannot create such an index.
f. uid: 1 --> writes are distributed, but there is no query isolation.
What may be the reason for this uneven distribution? What would be the right shard key given this data?
I see this as expected behaviour, Astro. The sector_id and subsector_id fields are of type ObjectId, which contains a timestamp in its first 4 bytes. That timestamp is monotonic in nature, so nearby values always go to the same chunk (and hence the same shard); it fails to provide randomness, as you also pointed out in (b).
The best way to choose a shard key is to pick a key that has business meaning (unlike some ObjectId field), mixed with some hash as a suffix to ensure a good random spread for equal distribution. If you have a sectorName and subsectorName, please try those out and let us know whether it works.
You may consider this link on choosing the right shard key.
MongoDB shard by date on a single machine
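The hashed-suffix suggestion above could be sketched like this (sectorName is a hypothetical field and toyHash is a toy stand-in; a real deployment would use MongoDB's built-in hashed index rather than hashing in application code):

```javascript
// Toy 32-bit string hash (a stand-in for a real hash function).
function toyHash(s) {
  let h = 0;
  for (const ch of s) {
    h = (Math.imul(h, 31) + ch.charCodeAt(0)) | 0;
  }
  return h >>> 0;
}

// Route a document to one of numShards shards by hashing a business key
// together with the random uid, so monotonic ObjectId-style prefixes
// cannot pile up on a single shard.
function pickShard(sectorName, uid, numShards) {
  return toyHash(sectorName + ":" + uid) % numShards;
}

const shard = pickShard("energy", "1234567890123456", 2);
console.log(shard); // deterministic: always the same shard for the same key
```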

Collection ID length in MongoDB

I am new to MongoDB and Stack Overflow.
I want to know why the MongoDB document ID is 24 hex characters long. What is the importance of that?
Why is the default _id a 24 character hex string?
The default unique identifier generated as the primary key (_id) for a MongoDB document is an ObjectId. This is a 12-byte binary value which is often represented as a 24-character hex string, and one of the standard field types supported by the MongoDB BSON specification.
The 12 bytes of an ObjectId are constructed using:
a 4-byte value representing the seconds since the Unix epoch
a 3-byte machine identifier
a 2-byte process id
a 3-byte counter (starting with a random value)
What is the importance of an ObjectId?
ObjectIds (or similar identifiers generated according to a GUID formula) allow unique identifiers to be independently generated in a distributed system.
The ability to independently generate a unique ID becomes very important as you scale up to multiple application servers (or perhaps multiple database nodes in a sharded cluster). You do not want to have a central coordination bottleneck like a sequence counter (eg. as you might have for an auto-incrementing primary key), and you will want to insert new documents without risk that a new identifier will turn out to be a duplicate.
An ObjectId is typically generated by your MongoDB client driver, but can also be generated on the MongoDB server if your client driver or application code hasn't already added an _id field.
Do I have to use the default ObjectId?
No. If you have a more suitable unique identifier to use, you can always provide your own value for _id. This can either be a single value or a composite value using multiple fields.
The main constraints on _id values are that they have to be unique for a collection and you cannot update or remove the _id for an existing document.
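For example, documents like these are valid to insert (the values here are illustrative, not from the question):

```javascript
// Supplying your own _id instead of letting the driver generate an ObjectId.
const byEmail = { _id: "alice@example.com", name: "Alice" }; // single value
const composite = {                                          // composite value
  _id: { zip: "90210", seq: 42 },
  city: "Beverly Hills"
};

console.log(byEmail._id);       // → alice@example.com
console.log(composite._id.seq); // → 42
```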
As of MongoDB 4.2, the ObjectId is still 12 bytes but consists of 3 parts.
ObjectIds are small, likely unique, fast to generate, and ordered.
ObjectId values are 12 bytes in length, consisting of:
a 4-byte timestamp value, representing the ObjectId’s creation, measured in seconds since the Unix epoch
a 5-byte random value
a 3-byte incrementing counter, initialized to a random value
Create an ObjectId and get the timestamp from it:
> x = ObjectId()
ObjectId("5fdedb7c25ab1352eef88f60")
> x.getTimestamp()
ISODate("2020-12-20T05:05:00Z")
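Outside the shell, the embedded timestamp can be recovered from the hex string directly. A minimal JavaScript sketch, using the ObjectId from the example above:

```javascript
// Extract the creation time from a 24-character ObjectId hex string.
// The first 8 hex characters (4 bytes) are seconds since the Unix epoch.
function objectIdTimestamp(hexId) {
  const seconds = parseInt(hexId.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

const ts = objectIdTimestamp("5fdedb7c25ab1352eef88f60");
console.log(ts.toISOString()); // → 2020-12-20T05:05:00.000Z
```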
Reference: the official MongoDB documentation on ObjectId.

MongoDB Sharded collection shard key confusion

Suppose I have a DB called 'mydb' and a collection in that DB called 'people' and documents in mydb.people all have a 5 digit US zip code field: ex) 90210. If I set up a sharded collection by splitting up this collection in to 2 shards using the zip code as the shard key, how would document insertion be handled?
So if I insert a document with zipcode = 00000 would that go to the first shard because this zip code value is less than the center zipcode value of 50000? And if I insert a document with zipcode = 99999 would it be inserted into the second shard?
I set up a sharded cluster according to http://docs.mongodb.org/manual/tutorial/deploy-shard-cluster/ with a collection sharded on a common zipcode key across 2 shards, and I am not finding this even distribution of documents.
Also, what do they mean by chunk size? A chunk is basically a range of the shard key, right? Why do they talk about chunk sizes in MB and not in terms of ranges of the shard key?
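A rough sketch of the range-based routing the question describes (the fixed 50000 split point is illustrative only; in practice the balancer splits and migrates chunks dynamically based on chunk size, which is why there is no guarantee of an even 50/50 document split):

```javascript
// Chunks are contiguous ranges of the shard key ([min, max)); a document
// is routed to the shard that owns the chunk containing its key value.
const chunks = [
  { min: "00000", max: "50000", shard: 0 },
  { min: "50000", max: null,    shard: 1 }, // null = MaxKey (no upper bound)
];

function routeByZip(zipcode) {
  const chunk = chunks.find(c => zipcode >= c.min && (c.max === null || zipcode < c.max));
  return chunk.shard;
}

console.log(routeByZip("00000")); // → 0
console.log(routeByZip("99999")); // → 1
```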

Mongo _id Insert Uniqueness Check

I have a medium-to-large Mongo collection containing image metadata for >100k images. I am generating a UUID for each image and using it as the _id field in the imageMeta.insert() call.
I know for a fact that these _id's are unique, or at least as unique as I can expect from boost's UUID implementation, but as the collection grows larger, the time to insert a record has grown as well.
I feel like, to ensure uniqueness of the _id field, Mongo must be double-checking these against the other _ids in the database. How is this implemented, and how should I expect the insert time to grow with respect to the collection size?
The _id field in mongo is required to be unique and indexed. When an insert is performed, all indexes in the collection are updated, so it's expected to see insert time increase with the number of indexes and/or documents. Namely, all collections have at least one index (on the _id field), but you've likely created indexes on fields that you frequently query, and those indexes also get updated on every insert (adding to the latency).
One way to reduce perceived database latency is to specify a write concern to your driver. Note that the default write concern prior to November 2012 was 'unacknowledged', but it has since been changed to 'acknowledged'.