Can't figure out why mongo database becomes bigger after migration?

I'm new to mongodb. I have a local server and a remote server. After migrating the mongo database from the local server to the remote server using the mongodump/mongorestore tools, I found that the size of the database had become bigger on the remote server.
Here is my sample:
On the local server (Ubuntu 14.04.2 LTS, mongo 3.0.5):
> show dbs
Daily_data 7.9501953125GB
Monthly_data 0.453125GB
Weekly_data 1.953125GB
On the remote server (CentOS 6.7, mongo 2.4.3):
> show dbs
Daily_data 9.94921875GB
Monthly_data 0.953125GB
Weekly_data 3.9521484375GB
I also checked the stats of one collection to compare: the count is the same, but the sizes (indexSize, totalIndexSize, etc.) have changed.
These are the stats of the collection on the local server:
> db.original_prices.stats()
{
"ns" : "Daily_data.original_prices",
"count" : 9430984,
"size" : 2263436160,
"avgObjSize" : 240,
"numExtents" : 21,
"storageSize" : 2897301504,
"lastExtentSize" : 756662272,
"paddingFactor" : 1,
"paddingFactorNote" : "paddingFactor is unused and unmaintained in 3.0. It remains hard coded to 1.0 for compatibility only.",
"userFlags" : 1,
"capped" : false,
"nindexes" : 2,
"indexDetails" : {
},
"totalIndexSize" : 627777808,
"indexSizes" : {
"_id_" : 275498496,
"symbol_1_dateTime_1" : 352279312
},
"ok" : 1
}
These are the stats of the collection on the remote server:
> db.original_prices.stats()
{
"ns" : "Daily_data.original_prices",
"count" : 9430984,
"size" : 1810748976,
"avgObjSize" : 192.00000508960676,
"storageSize" : 2370023424,
"numExtents" : 19,
"nindexes" : 2,
"lastExtentSize" : 622702592,
"paddingFactor" : 1,
"systemFlags" : 1,
"userFlags" : 0,
"totalIndexSize" : 639804704,
"indexSizes" : {
"_id_" : 305994976,
"symbol_1_dateTime_1" : 333809728
},
"ok" : 1
}
Is mongodump/mongorestore a good and safe way to migrate a mongo database?

The problem here, as you seem to have already noticed, is the indexes: as the stats clearly show, it is the indexSize that has grown, and there is a perfectly logical explanation.
When running the restore, the indexes are rebuilt in a way that avoids blocking the other write operations happening during the restore. This is similar to the process employed by Build Indexes in the Background as described in the documentation; not exactly the same, but close.
In order to get the smallest possible index size, it is best to first drop the indexes from the target database and use the --noIndexRestore option with the mongorestore command, as this will prevent index building during the data load.
Then, when the load is complete, you can run a regular createIndex without any usage of the "background" option, so the indexes are created in the foreground. The result is that the database is blocked for reads and writes during index creation, but the resulting indexes will be of a smaller size.
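As a minimal sketch of that sequence (the host and dump path are hypothetical; the index keys are taken from the stats above, and on an older shell ensureIndex() can be used in place of createIndex()):

# load the data without building the secondary indexes
mongorestore --host remote-server --noIndexRestore /backup/dump

# then build the indexes in the foreground from the mongo shell
> use Daily_data
> db.original_prices.createIndex({ symbol: 1, dateTime: 1 })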
As for general practice, you will note that the other data sizes do in fact come out "smaller": in the process of rebuilding, any slack space present in the source is not recreated when the data is restored.
The data from mongodump is in a binary format and should always be preferred over the textual format of mongoexport and the related mongoimport when moving data from one MongoDB instance to another, since data migration is not the purpose of those tools.
Other alternatives are filesystem copies such as an LVM snapshot, which will of course restore to exactly the same state the backup copy was made in.

Factors that can affect the disk size of your collection include the underlying hardware, filesystem, and configuration. In your case, the prevailing factor seems to be a difference in the storage engine used on the local and remote servers: your local server is running Mongo 3.0 while the remote is running an older version. This is apparent from the presence of the paddingFactorNote property; however, you can confirm it by running db.version() in both environments.
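For example (output matching the versions stated in the question):

// on the local server
> db.version()
3.0.5
// on the remote server
> db.version()
2.4.3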
Between Mongo 2.4/2.6 and Mongo 3.0 there were several important changes to how collections are stored, not least the addition of the WiredTiger storage engine as an alternative to the default mmapv1 storage engine. There were also changes to how the mmapv1 engine (which you are using) pads documents during allocation to accommodate growth in document size.
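On the 3.0 server you can also see which engine is actually in use, since serverStatus reports it there (the field does not exist on 2.4; output abbreviated):

> db.serverStatus().storageEngine
{ "name" : "mmapv1" }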
The other major reason for the size differences comes from your use of mongorestore. During normal usage, mongo databases are not stored in a way that minimizes disk usage. However, mongorestore rebuilds the database/collection in a compact way, which is why for the collection you posted, the remote storageSize is smaller.

Related

How to get current connection pool occupancy on client using mongo .net driver?

I want to monitor current connection pool occupancy using the .NET mongo driver to produce stats each minute (like 5/MaxConnectionPoolSize is busy).
But I can't see any option to do it in my driver (2.8.1). Is it even possible?
I've found some answers to a similar question for the JS client but sadly can't apply them to my situation.
How to get the number of connections used (and free) to the MongoDB (from a client perspective)?
you can use this command:
db.serverStatus()['connections']
{
"current" : 18,
"available" : 999982,
"totalCreated" : 2175,
"active" : 8,
"exhaustIsMaster" : 6,
"awaitingTopologyChanges" : 6
}
To run it via the driver, you should use:
var doc = db.RunCommand<BsonDocument>("{ serverStatus : 1 }");
where db is an IMongoDatabase instance (e.g. client.GetDatabase("admin")).

Weird operation in db.currentOp() output in MongoDB

Just after starting my MongoDB server (standalone instance, version 4.2.2) if I run db.currentOp() I see this operation:
{
"type" : "op",
"host" : "menzo:27017",
"desc" : "waitForMajority",
"active" : true,
"currentOpTime" : "2020-05-06T16:16:33.077+0200",
"opid" : 2,
"op" : "none",
"ns" : "",
"command" : {
},
"numYields" : 0,
"waitingForLatch" : {
"timestamp" : ISODate("2020-05-06T14:02:55.895Z"),
"captureName" : "WaitForMaorityService::_mutex"
},
"locks" : {
},
"waitingForLock" : false,
"lockStats" : {
},
"waitingForFlowControl" : false,
"flowControlStats" : {
}
}
It seems that this operation is always there, no matter how much time passes. In addition, it is a weird operation in some respects:
It has a very low opid number (2)
Its op is "none"
It doesn't have the usual secs_running or microsecs_running parameters
It mentions "majority" in some literals, but I'm running a standalone instance, not a replica set
I guess it is some kind of internal operation (maybe a kind of "waiting thread"?), but I haven't found anything about it in the currentOp command documentation.
Do anybody knows about this operation and/or could point to documentation where it is described, please? Thanks in advance!
Wait for majority service is defined here. Looking at the history of this file, it appears to have been added as part of work on improving startup performance.
Reading the ticket description, it seems that during startup, multiple operations may need to wait for a majority commit. Previously each may have created a separate thread for waiting; with the majority wait service, there is only one thread which is waiting for the most recent required commit.
Therefore:
Its op is "none"
The thread isn't performing an operation as defined in the docs.
It has a very low opid number (2)
This is because this thread is started when the server starts.
It mentions "majority" in some literals, but I'm running a standalone instance, not a replica set
It is possible to instruct mongod to run in replica set mode and point it at a data directory created by a standalone node, and vice versa. In these cases you'd probably expect the process to preserve the information already in the database, such as transactions that are pending (or need to be aborted). Hence the startup procedure may perform operations not intuitively falling under the operation mode requested.
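As an aside, db.currentOp() accepts a filter document, so such idle internal threads can be excluded when inspecting operations; a minimal sketch:

> // show only operations that are actually doing something
> db.currentOp({ op: { $ne: "none" }, active: true })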

MongoDB URL without replicaSet optional parameter

I have a mongoDB cluster
server1:27017
server2:27017
server3:27017
For historical reasons, the IT team could not provide the replicaSet name for this cluster.
My question is: without knowing the replicaSet name, is the following MongoDB URL legal, and will omitting the optional replicaSet parameter cause any problems in the future?
mongodb://username:password@server1:27017,server2:27017,server3:27017
I am using Java to set up the MongoDB connection as follows:
String MONGODB_REPLICA_SET = "mongodb://username:password@server1:27017,server2:27017,server3:27017";
MongoClientURI mongoClientURI = new MongoClientURI(MONGODB_REPLICA_SET);
mongoClient = new MongoClient(mongoClientURI);
To clarify, although it may be functional to connect to the replica set this way, it would be preferable to specify the replicaSet option.
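For example, once the set name is known (myReplSet is a hypothetical name):

mongodb://username:password@server1:27017,server2:27017,server3:27017/?replicaSet=myReplSet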
Depending on the MongoDB driver that you're using, it may behave slightly differently. For example, quoting the Server Discovery and Monitoring spec on the initial topology type:
In the Java driver a single seed means Single, but a list containing one seed means Unknown, so it can transition to replica-set monitoring if the seed is discovered to be a replica set member. In contrast, PyMongo requires a non-null setName in order to begin replica-set monitoring, regardless of the number of seeds.
There are variations, and it's best to check whether the connection can still handle topology discovery and failover.
For historical reasons, the IT team could not provide the replicaSet name for this cluster.
If you have access to the admin database, you could execute rs.status() in the mongo shell to find out the name of the replica set. See also replSetGetStatus for more information.
It should be possible to find out the name of the replica set, to avoid this worry. Open a connection to any one of your nodes (e.g. direct to server1:27017), and run rs.status(); that will tell you the name of your replica set as well as lots of other useful data such as the complete set of configured nodes and their individual statuses.
In this example of the output, "rsInternalTest" is the replica set name:
{
"set" : "rsInternalTest",
"date" : ISODate("2018-05-01T11:38:32.608Z"),
"myState" : 1,
"term" : NumberLong(123),
"heartbeatIntervalMillis" : NumberLong(2000),
"optimes" : {
...
},
"members" : [
{
"_id" : 1,
"name" : "server1:27017",
"health" : 1.0,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 1652592,
"optime" : {
"ts" : Timestamp(1525174711, 1),
"t" : NumberLong(123)
},
"optimeDate" : ISODate("2018-05-01T11:38:31.000Z"),
"electionTime" : Timestamp(1524371004, 1),
"electionDate" : ISODate("2018-04-22T04:23:24.000Z"),
"configVersion" : 26140,
"self" : true
}
...
],
"ok" : 1.0
}
Note that you will need the login of a high-level user account, otherwise you won't have permission to run rs.status().

MongoDB not showing collections information even though I am sure its there

So I am using MongoDB 3.2 version.
I created a db and its collections via a Clojure wrapper called Monger.
But when I connect via the mongo shell and check whether the collections were created, I can't see them.
Here's the code:
PRIMARY> use db_name
PRIMARY> db.version()
3.2.3
PRIMARY> db.stats()
{
"db" : "db_name",
"collections" : 4,
"objects" : 0,
"avgObjSize" : 0,
"dataSize" : 0,
"storageSize" : 16384,
"numExtents" : 0,
"indexes" : 9,
"indexSize" : 36864,
"ok" : 1
}
PRIMARY> show collections
PRIMARY> db.coll1.getIndexes()
[ ]
PRIMARY> db.getCollectionInfos()
Tue May 24 16:29:44 TypeError: db.getCollectionInfos is not a function (shell):1
PRIMARY>
But when I check via Clojure whether the collections were created, I can see the information.
user=> (monger.db/get-collection-names mongo-db*)
#{"coll1" "coll2" "coll3" "coll4"}
What is going on?
Found the issue. It turns out that if the mongo shell and the running mongo instance are of two different versions, then db.getCollectionNames() and db.collection.getIndexes() will return no output.
This can happen if you are connecting to a remote mongo instance and the shell you are connecting through is, say, a 2.x shell version (you can see this when you start the shell) while the running mongo is version 3.x.
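You can check both versions from a single session: the version() helper reports the shell's own version, while db.version() reports the server's (the 2.6.12 below is a hypothetical older shell):

> version()
2.6.12
> db.version()
3.2.3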
According to the documentation:
For MongoDB 3.0 deployments using the WiredTiger storage engine, if you run db.getCollectionNames() and db.collection.getIndexes() from a version of the mongo shell before 3.0 or a version of the driver prior to 3.0 compatible version, db.getCollectionNames() and db.collection.getIndexes() will return no data, even if there are existing collections and indexes. For more information, see WiredTiger and Driver Version Compatibility.
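One workaround, assuming a 3.0+ server, is to bypass the shell helpers and issue the underlying command directly; this works even from an older shell:

> db.runCommand({ listCollections: 1 })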
Spent almost an hour trying to figure this out, thought this might be helpful to others.

Why has my newly created mongodb local database grown to 24GB?

I set up a mongodb replica set a few days ago and ran some small tests on it, and everything was working well. Today I found that its local database has grown to 24GB!
rs0:PRIMARY> show dbs
local 24.06640625GB
test 0.203125GB
The other collections look normal except "oplog.rs":
rs0:PRIMARY> db.oplog.rs.stats()
{
"ns" : "local.oplog.rs",
"count" : 322156,
"size" : 119881336,
"avgObjSize" : 372.12200300475547,
"storageSize" : 24681987920,
"numExtents" : 12,
"nindexes" : 0,
"lastExtentSize" : 1071292416,
"paddingFactor" : 1,
"systemFlags" : 0,
"userFlags" : 0,
"totalIndexSize" : 0,
"indexSizes" : {
},
"capped" : true,
"max" : NumberLong("9223372036854775807"),
"ok" : 1
}
This is my mongodb.conf:
dbpath=/data/db
#where to log
logpath=/var/log/mongodb/mongodb.log
logappend=true
port = 27017
fork = true
replSet = rs0
How can I solve it? Many thanks.
The oplog, which keeps an ongoing log of the operations on the primary and is used for the replication mechanism, is allocated by default at 5% of the available free disk space (for Linux/Unix systems, not Mac OS X or Windows). So if you have a lot of free disk space, MongoDB will make a large oplog, which means you have a large time window within which you could restore back to any point in time, for instance. In your case, a ~24GB oplog suggests the volume had on the order of 490GB free when the oplog was allocated (5% × 490GB ≈ 24.5GB). Once the oplog reaches its maximum size, it simply rolls over (ring buffer).
You can specify the size of the oplog when initializing the database using the oplogSize option; see http://docs.mongodb.org/manual/core/replica-set-oplog/ for details.
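For example, in the ini-style mongodb.conf shown in the question (the value is in megabytes; 10240 is an arbitrary choice here, and the option only takes effect before the oplog is first created):

oplogSize = 10240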
Bottom line: Unless you are really short of disk space (which apparently you aren't, otherwise the oplog wouldn't have been created so big), don't worry about it. It provides extra security.