Advice on MongoDB sharding - mongodb

I have read the MongoDB sharding guide, but I am not sure what kind of shard key suits my application. Any suggestions are welcome.
My application is a huge database of network events. Each document has a time field and a couple of network-related values such as an IP address and a port number. The insert rate is 100-1000 documents per second. In my experience, a single mongod has no problem with this insert rate.
But I use the aggregation framework extensively on large amounts of data. All the aggregations are time-bounded, mostly to the most recent month or week. I tested the aggregations on a single mongod: a query with a response time of 5 minutes while insertion is paused can take as long as two hours when 200 inserts per second are running.
Can I improve MongoDB aggregation query response time by sharding?
If yes, I think I have to use time as the shard key, because every query in my application is restricted to a time range (e.g. top IP addresses in the recent month), and if the shard receiving inserts can be separated from the shard serving the query, MongoDB could work much faster.
But the documentation says:
"If the shard key is a linearly increasing field, such as time, then all requests for a given time range will map to the same chunk, and thus the same shard. In this situation, a small set of shards may receive the majority of requests and the system would not scale very well."
So what shall I do?

Can I improve MongoDB aggregation query response time by sharding?
Yes.
If you shard your database across different machines, aggregations can be computed in parallel. The important thing here is the distribution of the data across your shards: it should be uniform.
If you choose a monotonically increasing (or decreasing) field of the document, like time, as your shard key and use hashed sharding, this will provide a uniform distribution over the cluster.
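The distribution effect can be sketched with a toy simulation. The hash below is a simple integer scrambler used as a stand-in, not MongoDB's actual hashed-index function (which is md5-based), and the shard count and timestamp values are made up for illustration:

```javascript
// Toy illustration of why a hashed shard key helps with a monotonically
// increasing field such as time. toyHash is NOT MongoDB's real hash.
function toyHash(n) {
  let x = n >>> 0;
  x = Math.imul((x >>> 16) ^ x, 0x45d9f3b) >>> 0;
  x = Math.imul((x >>> 16) ^ x, 0x45d9f3b) >>> 0;
  return ((x >>> 16) ^ x) >>> 0;
}

const SHARDS = 4;
const ranged = new Array(SHARDS).fill(0);
const hashed = new Array(SHARDS).fill(0);

// Look only at the most recent 1000 "timestamps": these represent
// the inserts happening right now.
for (let t = 99000; t < 100000; t++) {
  // Ranged sharding on time: the newest values all fall in the last
  // chunk, so a single shard absorbs every current insert.
  ranged[Math.min(SHARDS - 1, Math.floor(t / 25000))]++;
  // Hashed sharding on time: consecutive values scatter across shards.
  hashed[toyHash(t) % SHARDS]++;
}

console.log("ranged:", ranged); // all 1000 recent inserts land on one shard
console.log("hashed:", hashed); // spread across all shards
```

The ranged counters show the hotspot the documentation warns about, while the hashed counters show the write load spread across the cluster.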

Related

How to count number of read/write/delete operations in mongo for a period?

I need to find out how many entity writes/reads or deletes a mongo cluster does within a specific period for internal metrics.
So far I have only found db.currentOp().inprog.length, which counts the currently running operations.
Obviously, it would be great if I didn't need to do this in code, but could get it from the sharded cluster out of the box.
Later edit: logging all queries would be another option, but not on my production DB, as it is too much overhead, and I need to measure over at least a 30-day period to get a good average.
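One low-overhead option (an assumption on my part, not something from the thread) is to sample `db.serverStatus().opcounters`, which holds cumulative operation counts since the mongod/mongos process started, at the start and end of the window and take the difference. A minimal sketch of the diffing step, with made-up sample values:

```javascript
// Sketch: derive per-window op counts from two serverStatus().opcounters
// samples. The counters are cumulative since process start, so the
// difference gives the operations performed between the two samples.
// The numbers below are illustrative placeholders, not real output.
const sampleStart = { insert: 1000, query: 5000, update: 200, delete: 50 };
const sampleEnd   = { insert: 1900, query: 8000, update: 350, delete: 80 };

function opDelta(a, b) {
  const delta = {};
  for (const op of Object.keys(a)) {
    delta[op] = b[op] - a[op];
  }
  return delta;
}

console.log(opDelta(sampleStart, sampleEnd));
// { insert: 900, query: 3000, update: 150, delete: 30 }
```

Caveat: the counters reset on process restart, and in a sharded cluster each mongos (or shard) keeps its own counters, so you would sum the deltas across processes.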

picking a shard key for mongodb

I want to shard my MongoDB database. I have a high insert rate and want to distribute my documents on two shards evenly.
I have considered range-based sharding, because I have range queries, but I cannot settle on a good shard key.
{
    Timestamp: ISODate("2016-10-02T00:01:00.000Z"),
    Machine_ID: "100",
    Temperature: "50"
}
If this is my document and I have 100,000 different machines, would Machine_ID be a suitable shard key? And if so, how will MongoDB distribute it across the shards, i.e. do I have to specify the shard ranges myself, e.g. put Machine_ID 0-49,999 on shard A and 50,000-100,000 on shard B?
I think Machine_ID would be a suitable shard key if your subsequent queries are per machine, i.e. get all the temperatures for a specific machine for a certain time range. More about choosing shard keys can be found here: Choosing a shard key
MongoDB has two kinds of sharding, hashed sharding and ranged sharding, which you can read more about here: Sharding strategies. That said, you don't need to specify the shard ranges yourself; mongo takes care of it. In particular, when the time comes to add a new shard, mongo will rebalance the chunks onto the new shard.
If your cluster has only two shards, then it isn't difficult to design for. However, if your data will continue to grow and you end up having a lot more shards, then the choice of shard key is more difficult.
For example, if some machines have many more records than others (e.g. one machine has 3,000 records, i.e. 3% of the total), that doesn't cause problems with only two shards. But if your data grows so that you need 100 shards, and one machine still has 3% of the total, then Machine_ID is no longer a good choice: all of a single machine's records must stay in a single chunk, which cannot be distributed across several shards.
In that case, a better strategy might be to use a hash of the Timestamp, but it depends on the overall shape of your dataset.
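For reference, the two strategies described above would be declared roughly like this in the mongo shell. The `sensors.readings` namespace is made up for illustration, and a collection can only be sharded once, so you would pick one of the two:

```javascript
// Ranged sharding on Machine_ID -- mongos picks the chunk boundaries itself.
sh.enableSharding("sensors")
sh.shardCollection("sensors.readings", { Machine_ID: 1 })

// Or: hashed sharding on Timestamp -- spreads a monotonically increasing
// key evenly across shards, at the cost of efficient range queries.
sh.shardCollection("sensors.readings", { Timestamp: "hashed" })
```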

how to divide mongodb data without stopping service

I have a MongoDB instance running online which contains about 200 million objects, with a file size of about 20GB. I found that the insert speed has become very slow (about 2,000 per second, down from more than 10,000 in the beginning), so I decided to divide the data to optimize insert speed.
I would like to know if I can divide the MongoDB data without stopping the service, and how.
You just described "sharding". Luckily for you, MongoDB has nice sharding features out of the box.
Your migration will consist of:
Create 3 mongo config servers
Create more than 1 mongos router
Add your current replica set as a shard
Point your application to connect to your mongos
Configure shard keys for your current collections
Then, you are set to add shards as needed
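Once the config servers and mongos routers are up, the last few steps above boil down to a handful of shell commands against a mongos. The hostnames, database, and collection names below are placeholders:

```javascript
// Run in a mongo shell connected to a mongos.
// "rs0" and db1.example.net are placeholders for your existing replica set.
sh.addShard("rs0/db1.example.net:27017")

// Enable sharding for a database, then shard a collection within it.
sh.enableSharding("mydb")
sh.shardCollection("mydb.events", { time: "hashed" }) // pick a key that fits your queries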
For detailed instructions, see 10gen's sharding overview
At MongoHQ, we convert replica sets to shards all the time. It should be quick, painless, and without downtime.

Do Cassandra read latency stats count Secondary Index queries piecewise?

I am trying to understand whether read latency stats obtained through
nodetool cfstats
nodetool cfhistograms
count each read within secondary index queries separately.
I guess the answer might depend on whether secondary index queries are handled by Thrift clients or internally by Cassandra. I don't know that either.
[1] Cassandra - cfstats and meaning of read/write latency
Cassandra read latency is calculated by averaging the time taken by each read query.
A single get query or a multiget query is considered a single read, and the average is calculated over those. [StorageProxy.read() is the function where Cassandra collects the time taken for each query.]

MongoS Call Distributions Analysis

I would like to see how well my shard key performs, and I am thinking of monitoring how many calls mongos sends to each shard for every 100 parallel batch inserts that I do. I can probably do this at the application layer, but is there a way to record this at the mongos level?
I am using mongostat, but I want the details of mongos. Also, the mongos log does not say much from what I gather.
Do you have trending graphs or some other form of monitoring software? 10gen actually provides a free one called MMS.
If you are monitoring the activity on your shards, that should correlate with the calls being made from mongos. The only caveat is that activity is not broken out by collection or by database, so if you're sharding multiple DBs on the same instances this may not work.
Otherwise, just look at the activity on the shards and that should clearly tell you what's happening.
If you use mongostat --discover you can see the traffic per shard as well as the total traffic going through the mongos that mongostat is connected to. This should give you full insight into your load distribution in real time.
Note that your shard key always "works" in the sense that MongoDB splits chunks at the median of your shard key rather than simply splitting the data in two. So provided your shard key has high enough cardinality, your data will always be well balanced (assuming the balancer has had time to move the chunks appropriately).
In addition to MMS and mongostat, you can also see the overall status and health of your sharded cluster via the printShardingStatus() function, as outlined here.
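For example, in a shell connected to a mongos:

```javascript
// Prints shard membership, sharded databases, and chunk distribution.
db.printShardingStatus()

// sh.status() is the equivalent shell helper with the same output.
sh.status()
```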