MongoDB/GridFS out of disk space - mongodb

I'm using MongoDB 2.0.6, and I ran out of disk space and can't seem to do anything to clear some up.
I have 3 shards (not replicated, since I can regenerate the photos).
I have one collection called photos, with sharding enabled. I store everything in GridFS, so the data lives in photos.fs.files and photos.fs.chunks.
All the files ended up on my primary shard and are not distributed, even though sharding is enabled on the collection.
When I try:
> db.runCommand({ shardcollection : "photos.fs.chunks", key : { files_id : 1 }})
I get an error about needing to create the index.
When I try to create the index I get "Can't take a write lock while out of disk space".
When I try to delete things:
> db.fs.files.remove({$and: [{_id: {$gt: "1667"} }, { _id: {$lt: "5000"} }]});
Can't take a write lock while out of disk space
Any ideas?
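For reference, a rough sketch of the commands involved once some disk space has been freed (for example by adding storage or removing files out-of-band); this is 2.0-era shell syntax and does not by itself resolve the "out of disk space" lock:

// build the index the shardcollection command is asking for,
// then shard the GridFS chunks collection on files_id
use photos
db.fs.chunks.ensureIndex({ files_id : 1 })

use admin
db.runCommand({ shardcollection : "photos.fs.chunks", key : { files_id : 1 } })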

Related

Mongo sharding not removing data of sharded collection in source shard

I have MongoDB 3.2.6 installed on 5 machines which together form a sharded cluster consisting of 2 shards (each a replica set with a primary-secondary-arbiter configuration).
I also have a database with a very large collection (~50M records, 200 GB) that was imported through mongos, which put it on the primary shard along with the other collections.
I generated a hashed index on _id for that collection, which will be my shard key.
After that I sharded the collection with:
> use admin
> db.runCommand( { enablesharding : "my-database" } )
> use my-database
> sh.shardCollection("my-database.my-collection", { "_id": "hashed" } )
The command returned:
{ "collectionsharded" : "my-database.my-collection", "ok" : 1 }
And it actually started to shard. The shard distribution looks like this:
> db.getCollection("my-collection").getShardDistribution()
Totals
data : 88.33GiB docs : 45898841 chunks : 2825
Shard my-replica-1 contains 99.89% data, 99.88% docs in cluster, avg obj size on shard : 2KiB
Shard my-replica-2 contains 0.1% data, 0.11% docs in cluster, avg obj size on shard : 2KiB
This all looks OK, but the problem is that when I count my-collection through mongos, the number keeps increasing.
When I log in to the primary replica set (my-replica-1) I see that the number of records in my-collection is not decreasing, although the number in my-replica-2 is increasing (which is expected), so I guess MongoDB is not removing chunks from the source shard while migrating them to the second shard.
Does anyone know whether this is normal, and if not, why it is happening?
EDIT: Actually, it has now started to decrease on my-replica-1, although the count through mongos still grows (sometimes it goes a little down and then up). Maybe this is normal behaviour when migrating a large collection, I don't know.
Ivan
According to the documentation here, you are observing a valid situation.
When a document is moved from shard A to shard B, it is counted twice until shard A receives confirmation that the migration was successful.
On a sharded cluster, db.collection.count() can result in an inaccurate count if orphaned documents exist or if a chunk migration is in progress.
To avoid these situations, on a sharded cluster, use the $group stage of the db.collection.aggregate() method to $sum the documents. For example, the following operation counts the documents in a collection:
db.collection.aggregate(
    [
        { $group: { _id: null, count: { $sum: 1 } } }
    ]
)
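To make the difference concrete, here is a hedged illustration using the database and collection names from the question, run through mongos; count() may be inflated while chunks are migrating, while the aggregation is not:

db = db.getSiblingDB("my-database")
db.getCollection("my-collection").count()      // may double-count documents that are mid-migration
db.getCollection("my-collection").aggregate([
    { $group: { _id: null, count: { $sum: 1 } } }
])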

collection count increasing after sharding mongodb

I am not able to understand why the collection count keeps increasing after sharding in MongoDB.
I have a collection of 20M records; after I sharded it, the count keeps increasing. Please help me out.
Cluster configuration:
3 shards
3 config servers
6 query routers
If I get it right, you mean that db.shardedCollection.count() returns more documents than you expect. This is a known bug (SERVER-3645).
TL;DR
The problem is that, the way sharding works, so-called orphaned documents can exist after a chunk migration. These are documents which exist as duplicates on a shard that is not responsible for the key range the document falls into. For almost all practical purposes this is not a problem, since the mongos takes care of "sorting them out" (which is a bit simplified, but sufficient in this context).
However, when calling a db.collection.count() on a sharded collection, this query gets routed to all shards, since it does not contain the shard key.
Disclaimer: from here on, this is my theory, deduced from the observed behavior.
Since the orphaned documents still technically exist on a shard, they seem to get counted, and the result of the count as a whole is reported back to the mongos, which simply sums up all the results. I assume .count() takes a shortcut on the individual shards, possibly simply counting the entries of the _id index for performance reasons.
Workaround
As written in the ticket, using an aggregation mitigates the problem:
db.collection.aggregate({ $group: { _id: "uniqueDocs", count: { $sum: 1 } } })
However, this aggregation is not ideal; it should show better performance when changed as below, if you have a shard key other than _id:
db.books.aggregate([
    { $project: { _id: 0, yourShardKey: 1 } },
    { $group: { _id: "uniqueDocs", count: { $sum: 1 } } }
])
or
db.books.aggregate([
    { $project: { _id: 1 } },
    { $group: { _id: "uniqueDocs", count: { $sum: 1 } } }
])
if you use _id as your shard key.
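Not mentioned in the answer above, but related: later MongoDB versions (2.6 through 4.2) also provide a cleanupOrphaned command that deletes orphaned documents directly. It has to be run on each shard's primary mongod, not through mongos; the namespace below is illustrative:

use admin
// cleans one orphaned range per call; rerun from the returned stoppedAtKey
// until the command reports nothing left to clean
db.runCommand({ cleanupOrphaned: "mydb.shardedCollection" })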

Is a hashed shard key right for me?

Assume my documents look something like this:
{
"_id" : ObjectId("53d9560f2521e7a28f550a78"),
"tenantId" : "tenant1",
"body" : "Some text - it's the point of the document."
}
There are a couple of obviously bad shard key choices:
{tenantId : 1} This would eventually give me large, unsplittable chunks.
{_id : 1} There are a lot of writes and no updates. The ascending key would give me hotspots.
I think I'm left with two possibilities:
{tenantId : 1, _id : 1} The hotspot problem with _id is mitigated by the addition of tenantId. I can easily search with this full key.
{_id : "hashed"} No hotspots, but I have concerns....
My concern with the hashed key is that it's now random. In Scaling MongoDB, the author warns against random keys because:
The configuration server notices that Shard 2 has 10 more chunks than Shard 1 and decides it should even things out. MongoDB now has to load a random five chunks' worth of data into memory and send it to Shard 1. This is data that wouldn't have been in memory ordinarily, because it's a completely random order of data. So, now MongoDB is going to be putting a lot more pressure on RAM and there's going to be a lot of disk IO going on (which is always slow).
So, my question is: Are hashed keys only a good choice if your only other choice is a monotonically ascending key? In my case, would the combination of tenantId and _id be better?
Update: To answer a question in the comments, we only ever retrieve these documents one-by-one. So depending on which shard key we choose, queries would be like these:
{_id : ObjectId("53d9560f2521e7a28f550a78")}
or
{_id : ObjectId("53d9560f2521e7a28f550a78"), tenantId : "tenant1"}
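For reference, the two candidate keys would be declared roughly like this (only one of them, of course; the database and collection names are made up for illustration):

sh.enableSharding("mydb")

// option 1: compound key; the one-by-one lookups must include both fields to stay targeted
sh.shardCollection("mydb.documents", { tenantId : 1, _id : 1 })

// option 2: hashed key; equality lookups on _id alone remain targeted to a single shard
sh.shardCollection("mydb.documents", { _id : "hashed" })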

MongoDB collection locking - how does it work?

I have a not-so-big collection of about 500k records, but it's mission critical.
I want to add one field and remove another one. I was wondering whether this would lock the collection for inserts/updates (I really don't want any downtime).
I've made an experiment, and it looks like it doesn't block them:
// mongo-console 1
use "my_db"
// add new field
db.my_col.update(
    {},
    { $set:
        { foobar : "bizfoo" }
    },
    { multi: true }
);
// mongo-console 2
use "my_db"
db.my_col.insert({_id: 1, foobar: 'Im in'});
db.my_col.findOne({_id: 1});
=>{ "_id" : 1, "foo" : "bar" }
Although I don't really understand why, because db.currentOp() shows that there are write locks on it.
Also, on the production system I have a replica set, and I was curious how that impacts the migration.
Can someone answer these questions, or point me to some article where it's nicely explained?
Thanks!
(The MongoDB version I use is 2.4.)
MongoDB 2.4 locks on the database level per shard. You mentioned you have a replica set. Replica sets have no impact on the locking. Shards do. If you have your data sharded, when you perform an update, the lock will only lock the database on the shard where the data lives. If you don't have your data sharded, then the database is locked during the write operation.
In order to see impact, you'll need a test that does a significant amount of work.
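As a rough way to observe this while the multi-update from the question is running, you can inspect db.currentOp() from a second shell; the exact output fields vary by version, so treat this as a sketch:

db.currentOp(true).inprog.forEach(function (op) {
    // only show operations touching the my_db database from the question
    if (op.ns && op.ns.indexOf("my_db") === 0) {
        printjson({ opid: op.opid, op: op.op, ns: op.ns, waitingForLock: op.waitingForLock });
    }
});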
You can read more at:
http://www.mongodb.org/display/DOCS/How+does+concurrency+work

mongodb- huge insert time on sharding

I am having a problem with the insertion time of 300,000,000 documents into a collection.
I checked the insertion performance for the same number of documents on a single node. The time taken was approximately 23 minutes.
I created 2 shards and tried to insert the same number of documents. The insert time is more than 25 hours.
The two shards are on machines with 8 GB RAM and 8 cores.
The config server and router are on the same machine, which has 4 GB RAM and 4 cores.
I am using the C# driver in my app, creating BSON documents for insertion.
The collection structure is:
Logs{
"_id"
"LID"
"Ver"
"Y"
"M"
"D"
"H"
"Min"
"Sec"
"MSec"
"FID"
}
The shard key is the _id field. The chunkSize for sharding is set to 1 MB.
What are the things I should check to find where the performance problem is?
Can anyone suggest a solution, or the things I should look into to find the factors that are increasing the insertion time?
Thanks in advance.
I think that the problem is due to chunk migration. Basically, while you are inserting, the data is also being moved from one shard to another, and then it might move back to the same shard. Another factor can be the indexes eating some of your time (it is a common thing in databases that creating an index and then inserting data is slower than inserting the data first and creating the index afterwards).
So if I were you, I would do the following:
create a single-node mongod and insert all the data into it, not with db.coll.insert() but by using mongodump and mongorestore.
then create indexes on whatever fields are needed.
then shard your collection.
Also, you might try to disable the balancer for the duration of the insertion (sketched below).
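A minimal sketch of that last suggestion, run via mongos; the balancer is paused so chunk migrations don't compete with the bulk load, then re-enabled afterwards:

sh.stopBalancer()        // or sh.setBalancerState(false)
sh.getBalancerState()    // should now report false

// ... run the bulk load / mongorestore here ...

sh.startBalancer()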