MongoDB chunk migration steps in changelog collection

Can someone clarify all the steps for "moveChunk.from" and "moveChunk.to"? I want to know which operations are performed at each step (I guess the value of each step is the time, in ms, it took for that step). This will help me identify the slowest step occurring during chunk migration.
{
    "_id" : "bdvlpabhishekk-2013-07-20T17:46:28-51eaccf40c5c5c12e0e451d5",
    "server" : "bdvlpabhishekk",
    "clientAddr" : "127.0.0.1:50933",
    "time" : ISODate("2013-07-20T17:46:28.589Z"),
    "what" : "moveChunk.from",
    "ns" : "test.test",
    "details" : {
        "min" : { "key1" : 151110 },
        "max" : { "key1" : 171315 },
        "step1 of 6" : 0,
        "step2 of 6" : 1,
        "step3 of 6" : 60,
        "step4 of 6" : 2067,
        "step5 of 6" : 7,
        "step6 of 6" : 0
    }
}
{
    "_id" : "bdvlpabhishekk-2013-07-20T17:46:31-51eaccf7d6a98a5663942b06",
    "server" : "bdvlpabhishekk",
    "clientAddr" : ":27017",
    "time" : ISODate("2013-07-20T17:46:31.671Z"),
    "what" : "moveChunk.to",
    "ns" : "test.test",
    "details" : {
        "min" : { "key1" : 171315 },
        "max" : { "key1" : 192199 },
        "step1 of 5" : 0,
        "step2 of 5" : 0,
        "step3 of 5" : 1712,
        "step4 of 5" : 0,
        "step5 of 5" : 344
    }
}

All these steps are explained in the "M202: MongoDB Advanced Deployment and Operations" course, which is available online for free (I can't post the link here because of Stack Overflow's limit on the number of posted URLs; just search for the course on Google).
Related videos from this course are "Anatomy of a migration: overview" and "Anatomy of a migration: deep dive".
The explanation follows.
All time values are in milliseconds.
Let's say F is "moveChunk.from" and T is "moveChunk.to". The steps are F1..F6 and T1..T5. They are performed sequentially: F1, F2, F3, F4: {T1, T2, T3, T4, T5}, F5, F6. Step F4 includes {T1..T5}, and the timing of F4 is roughly the sum of T1..T5 (though there is no exact match).
F1 - mongos sends 'moveChunk' command to F (primary of the shard to migrate from)
F2 - sanity checking of the command
F3 - F sends the command to T (read this chunk from me)
F4,T1..T3 - the transfer starts; T performs sanity checking, index checking, etc.
F4,T4 - catch up on subsequent operations (if there were inserts into the chunk while the transfer was in progress, F sends those updates to T)
F4,T5 - steady state (changes are written to the primary log)
F5 - about to commit the new chunk location to the config servers (critical section)
F6 - clean up

All chunk migrations use the following procedure:
The balancer process sends the moveChunk command to the source shard.
The source starts the move with an internal moveChunk command. During the migration process, operations to the chunk route to the source shard. The source shard is responsible for incoming write operations for the chunk.
The destination shard begins requesting documents in the chunk and starts receiving copies of the data.
After receiving the final document in the chunk, the destination shard starts a synchronization process to ensure that it has the changes to the migrated documents that occurred during the migration.
When fully synchronized, the destination shard connects to the config database and updates the cluster metadata with the new location for the chunk.
After the destination shard completes the update of the metadata, and once there are no open cursors on the chunk, the source shard deletes its copy of the documents.
Taken from the MongoDB documentation on chunk migration.
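To come back to the original question about deriving the slowest step: the step timings are stored right in the config.changelog documents, so they can be pulled out and compared directly. Below is a minimal sketch in Python/pymongo (the connection string is a placeholder and must point at a mongos of the cluster):

from pymongo import MongoClient

# Placeholder connection string; it should point at a mongos of the sharded cluster.
client = MongoClient("mongodb://localhost:27017")
changelog = client["config"]["changelog"]

# Report the slowest step of every recorded migration event.
for event in changelog.find({"what": {"$in": ["moveChunk.from", "moveChunk.to"]}}):
    steps = {k: v for k, v in event["details"].items() if k.startswith("step")}
    if not steps:
        continue
    slowest = max(steps, key=steps.get)
    print(event["what"], event["ns"], "->", slowest, "=", steps[slowest], "ms")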

MongoDB query to compute percentage

I am new to MongoDB and kind of stuck at this query; any help/guidance will be highly appreciated. I am not able to calculate the percentage in the desired way: something is wrong with my pipeline, because the prerequisites for the percentage are not computed correctly. Below I provide my unsuccessful attempt along with the desired output.
A single entry in the collection looks like this:
_id : ObjectId("602fb382f060fff5419fd0d1")
time : "2019/05/02 00:00:00"
station_id : 3544
station_name : "Underhill Ave & Pacific St"
station_status : "In Service"
latitude : 40.6804836
longitude : -73.9646795
zipcode : 11238
borough : "Brooklyn"
neighbourhood : "Prospect Heights"
available_bikes : 5
available_docks : 21
The query I am trying to solve is:
Given a station_id (e.g., 522) and a num_hours (e.g., 3) passed as parameters:
- Consider only the measurements where station_status = "In Service".
- Consider only the measurements for that concrete station_id.
- Compute the percentage of measurements with available_bikes = 0 for each hour of the day (e.g., for the period [8am, 9am) the percentage is 15.06% and for the period [9am, 10am) the percentage is 27.32%).
- Sort the percentage results in decreasing order.
- Return the top num_hours documents.
The desired output is:
--- DOCUMENT 0 INFO ---
---------------------------------
hour : 19
percentage : 65.37
total_measurements : 283
zero_bikes_measurements : 185
---------------------------------
--- DOCUMENT 1 INFO ---
---------------------------------
hour : 21
percentage : 64.79
total_measurements : 284
zero_bikes_measurements : 184
---------------------------------
--- DOCUMENT 2 INFO ---
---------------------------------
hour : 00
percentage : 63.73
total_measurements : 284
zero_bikes_measurements : 181
My attempt is:
command_1 = {"$match": {"station_status": "In Service", "station_id": station_id, "available_bikes": 0}}
my_query.append(command_1)
command_2 = {"$group": {"_id": "null", "total_measurements": {"$sum": 1}}}
my_query.append(command_2)
command_3 = {"$project": {"_id": 0,
"station_id": 1,
"station_status": 1,
"hour": {"$substr": ["$time", 11, 2]},
"available_bikes": 1,
"total_measurements": {"$sum": 1}
}
}
my_query.append(command_3)
command_4 = {"$group": {"_id": "$hour", "zero_bikes_measurements": {"$sum": 1}}}
my_query.append(command_4)
command_5 = {"$project": {"percent": {
"$multiply": [{"$divide": ["$total_measurements", "$zero_bikes_measurements"]},
100]}}}
my_query.append(command_5)
I've taken a look at this and I'm going to offer some sincere advice:
Don't try to do this in an aggregation query. Just go back to basics, pull the numbers out using find()s, and then calculate the numbers in Python.
If you want to persist with an aggregation query, note that your $match stage filters on available_bikes equal to zero. You never have the total number of measurements, so you can never find the percentage. Also, once you have done your first $group, you lose your projection, so at that point in the pipeline you only have total_measurements and that's it (comment out commands 3 to 5 to see what I mean).
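If you do want to keep it in a single pipeline, here is a minimal sketch of one way it could look (pymongo syntax; the database and collection names are placeholders, and the hour is extracted with the same $substr trick used in the question). The key points are to match only on station and status so the per-hour totals stay available, and to count the zero-bike measurements with a conditional sum inside the same $group:

from pymongo import MongoClient

def top_zero_bike_hours(coll, station_id, num_hours):
    pipeline = [
        # Match only on status and station, NOT on available_bikes,
        # so the per-hour totals are still available later.
        {"$match": {"station_status": "In Service", "station_id": station_id}},
        # Extract the hour from the "YYYY/MM/DD hh:mm:ss" string and flag zero-bike rows.
        {"$project": {
            "hour": {"$substr": ["$time", 11, 2]},
            "is_zero": {"$cond": [{"$eq": ["$available_bikes", 0]}, 1, 0]},
        }},
        # One document per hour, carrying both counters.
        {"$group": {
            "_id": "$hour",
            "total_measurements": {"$sum": 1},
            "zero_bikes_measurements": {"$sum": "$is_zero"},
        }},
        {"$project": {
            "_id": 0,
            "hour": "$_id",
            "total_measurements": 1,
            "zero_bikes_measurements": 1,
            "percentage": {"$multiply": [
                {"$divide": ["$zero_bikes_measurements", "$total_measurements"]}, 100]},
        }},
        {"$sort": {"percentage": -1}},
        {"$limit": num_hours},
    ]
    return list(coll.aggregate(pipeline))

client = MongoClient()  # placeholder connection
for doc in top_zero_bike_hours(client["bikes"]["measurements"], 522, 3):
    print(doc["hour"], round(doc["percentage"], 2),
          doc["total_measurements"], doc["zero_bikes_measurements"])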

Kafka consumer test and reported metrics

I want to understand how the Kafka consumer perf test works and how to interpret some of the numbers it reports.
Below is the test I ran and the output I got. My questions are:
The values reported for rebalance.time.ms, fetch.time.ms, fetch.MB.sec and fetch.nMsg.sec are 1593109326098, -1593108732333, -0.0003 and -0.2800. Can you explain how it can report such huge and negative numbers? They don't make sense to me.
Everything from the "Metric Name Value" line onwards is reported because of the --print-metrics flag. What is the difference between the metrics reported by default and those printed with this flag? How are they calculated, and where can I read about what they mean?
No matter whether I scale the number of consumers running in parallel or the network and I/O threads at the broker, the consumer-fetch-manager-metrics:fetch-latency-avg metric stays almost the same. Can you explain this? With more consumers pulling data, fetch latency should go up; similarly, for a given consumption rate, if I reduce the broker's I/O and network thread settings, shouldn't latency scale higher?
Here is the command I ran:
[root@oak-clx17 kafka_2.12-2.5.0]# bin/kafka-consumer-perf-test.sh --topic topic_test8_cons_test1 --threads 1 --broker-list clx20:9092 --messages 500000000 --consumer.config config/consumer.properties --print-metrics
And the results:
start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec, rebalance.time.ms,fetch.time.ms, fetch.MB.sec, fetch.nMsg.sec
WARNING: Exiting before consuming the expected number of messages: timeout (10000 ms) exceeded. You can use the --timeout option to increase the timeout.
2020-06-25 11:22:05:814, 2020-06-25 11:31:59:579, 435640.7686, 733.6922, 446096147, 751300.8463, 1593109326098, -1593108732333, -0.0003, -0.2800
 
Metric Name Value
consumer-coordinator-metrics:assigned-partitions:{client-id=consumer-perf-consumer-25533-1} : 0.000
consumer-coordinator-metrics:commit-latency-avg:{client-id=consumer-perf-consumer-25533-1} : 2.700
consumer-coordinator-metrics:commit-latency-max:{client-id=consumer-perf-consumer-25533-1} : 4.000
consumer-coordinator-metrics:commit-rate:{client-id=consumer-perf-consumer-25533-1} : 0.230
consumer-coordinator-metrics:commit-total:{client-id=consumer-perf-consumer-25533-1} : 119.000
consumer-coordinator-metrics:failed-rebalance-rate-per-hour:{client-id=consumer-perf-consumer-25533-1} : 0.000
consumer-coordinator-metrics:failed-rebalance-total:{client-id=consumer-perf-consumer-25533-1} : 1.000
consumer-coordinator-metrics:heartbeat-rate:{client-id=consumer-perf-consumer-25533-1} : 0.337
consumer-coordinator-metrics:heartbeat-response-time-max:{client-id=consumer-perf-consumer-25533-1} : 6.000
consumer-coordinator-metrics:heartbeat-total:{client-id=consumer-perf-consumer-25533-1} : 197.000
consumer-coordinator-metrics:join-rate:{client-id=consumer-perf-consumer-25533-1} : 0.000
consumer-coordinator-metrics:join-time-avg:{client-id=consumer-perf-consumer-25533-1} : NaN
consumer-coordinator-metrics:join-time-max:{client-id=consumer-perf-consumer-25533-1} : NaN
consumer-coordinator-metrics:join-total:{client-id=consumer-perf-consumer-25533-1} : 1.000
consumer-coordinator-metrics:last-heartbeat-seconds-ago:{client-id=consumer-perf-consumer-25533-1} : 2.000
consumer-coordinator-metrics:last-rebalance-seconds-ago:{client-id=consumer-perf-consumer-25533-1} : 593.000
consumer-coordinator-metrics:partition-assigned-latency-avg:{client-id=consumer-perf-consumer-25533-1} : NaN
consumer-coordinator-metrics:partition-assigned-latency-max:{client-id=consumer-perf-consumer-25533-1} : NaN
consumer-coordinator-metrics:partition-lost-latency-avg:{client-id=consumer-perf-consumer-25533-1} : NaN
consumer-coordinator-metrics:partition-lost-latency-max:{client-id=consumer-perf-consumer-25533-1} : NaN
consumer-coordinator-metrics:partition-revoked-latency-avg:{client-id=consumer-perf-consumer-25533-1} : 0.000
consumer-coordinator-metrics:partition-revoked-latency-max:{client-id=consumer-perf-consumer-25533-1} : 0.000
consumer-coordinator-metrics:rebalance-latency-avg:{client-id=consumer-perf-consumer-25533-1} : NaN
consumer-coordinator-metrics:rebalance-latency-max:{client-id=consumer-perf-consumer-25533-1} : NaN
consumer-coordinator-metrics:rebalance-latency-total:{client-id=consumer-perf-consumer-25533-1} : 83.000
consumer-coordinator-metrics:rebalance-rate-per-hour:{client-id=consumer-perf-consumer-25533-1} : 0.000
consumer-coordinator-metrics:rebalance-total:{client-id=consumer-perf-consumer-25533-1} : 1.000
consumer-coordinator-metrics:sync-rate:{client-id=consumer-perf-consumer-25533-1} : 0.000
consumer-coordinator-metrics:sync-time-avg:{client-id=consumer-perf-consumer-25533-1} : NaN
consumer-coordinator-metrics:sync-time-max:{client-id=consumer-perf-consumer-25533-1} : NaN
consumer-coordinator-metrics:sync-total:{client-id=consumer-perf-consumer-25533-1} : 1.000
consumer-fetch-manager-metrics:bytes-consumed-rate:{client-id=consumer-perf-consumer-25533-1, topic=topic_test8_cons_test1} : 434828205.989
consumer-fetch-manager-metrics:bytes-consumed-rate:{client-id=consumer-perf-consumer-25533-1} : 434828205.989
consumer-fetch-manager-metrics:bytes-consumed-total:{client-id=consumer-perf-consumer-25533-1, topic=topic_test8_cons_test1} : 460817319851.000
consumer-fetch-manager-metrics:bytes-consumed-total:{client-id=consumer-perf-consumer-25533-1} : 460817319851.000
consumer-fetch-manager-metrics:fetch-latency-avg:{client-id=consumer-perf-consumer-25533-1} : 58.870
consumer-fetch-manager-metrics:fetch-latency-max:{client-id=consumer-perf-consumer-25533-1} : 503.000
consumer-fetch-manager-metrics:fetch-rate:{client-id=consumer-perf-consumer-25533-1} : 48.670
consumer-fetch-manager-metrics:fetch-size-avg:{client-id=consumer-perf-consumer-25533-1, topic=topic_test8_cons_test1} : 9543108.526
consumer-fetch-manager-metrics:fetch-size-avg:{client-id=consumer-perf-consumer-25533-1} : 9543108.526
consumer-fetch-manager-metrics:fetch-size-max:{client-id=consumer-perf-consumer-25533-1, topic=topic_test8_cons_test1} : 11412584.000
consumer-fetch-manager-metrics:fetch-size-max:{client-id=consumer-perf-consumer-25533-1} : 11412584.000
consumer-fetch-manager-metrics:fetch-throttle-time-avg:{client-id=consumer-perf-consumer-25533-1} : 0.000
consumer-fetch-manager-metrics:fetch-throttle-time-max:{client-id=consumer-perf-consumer-25533-1} : 0.000
consumer-fetch-manager-metrics:fetch-total:{client-id=consumer-perf-consumer-25533-1} : 44889.000
Exception in thread "main" java.util.IllegalFormatConversionException: f != java.lang.Integer
at java.base/java.util.Formatter$FormatSpecifier.failConversion(Formatter.java:4426)
at java.base/java.util.Formatter$FormatSpecifier.printFloat(Formatter.java:2951)
at java.base/java.util.Formatter$FormatSpecifier.print(Formatter.java:2898)
at java.base/java.util.Formatter.format(Formatter.java:2673)
at java.base/java.util.Formatter.format(Formatter.java:2609)
at java.base/java.lang.String.format(String.java:2897)
at scala.collection.immutable.StringLike.format(StringLike.scala:354)
at scala.collection.immutable.StringLike.format$(StringLike.scala:353)
at scala.collection.immutable.StringOps.format(StringOps.scala:33)
at kafka.utils.ToolsUtils$.$anonfun$printMetrics$3(ToolsUtils.scala:60)
at kafka.utils.ToolsUtils$.$anonfun$printMetrics$3$adapted(ToolsUtils.scala:58)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at kafka.utils.ToolsUtils$.printMetrics(ToolsUtils.scala:58)
at kafka.tools.ConsumerPerformance$.main(ConsumerPerformance.scala:82)
at kafka.tools.ConsumerPerformance.main(ConsumerPerformance.scala)
I've written a post on this: https://medium.com/metrosystemsro/apache-kafka-how-to-test-performance-for-clients-configured-with-ssl-encryption-3356d3a0d52b
According to the relevant KIP, rebalance.time.ms and fetch.time.ms are expected to show the total rebalance time for the consumer group and the total fetching time excluding the rebalance time.
As far as I can tell, as of Apache Kafka 2.6.0 this is still a work in progress, and currently the output contains raw Unix epoch timestamps, which is why you see one huge positive value and one huge negative one.
fetch.MB.sec and fetch.nMsg.sec are intended to show the average quantity of data consumed per second (in MB and as a message count).
See https://kafka.apache.org/documentation/#consumer_group_monitoring for the consumer group metrics listed with the --print-metrics flag.
fetch-latency-avg (the average time taken for a fetch request) will vary, but this depends a lot on the test setup.
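As for how the default columns are derived, MB.sec and nMsg.sec are simply the consumed totals divided by the elapsed wall-clock time between start.time and end.time, as the numbers confirm. A quick sanity check in plain Python, using the values copied from the output line above:

from datetime import datetime

# Timestamps copied from the start.time / end.time columns above.
start = datetime.strptime("2020-06-25 11:22:05:814", "%Y-%m-%d %H:%M:%S:%f")
end   = datetime.strptime("2020-06-25 11:31:59:579", "%Y-%m-%d %H:%M:%S:%f")
elapsed = (end - start).total_seconds()   # ~593.765 seconds

data_mb  = 435640.7686    # data.consumed.in.MB
messages = 446096147      # data.consumed.in.nMsg

print(round(data_mb / elapsed, 4))    # ~733.6922    -> matches MB.sec
print(round(messages / elapsed, 4))   # ~751300.8463 -> matches nMsg.sec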

print strings and variables in powershell for loop

I need to extract alerts from a REST API and send them to a file with PowerShell.
I was able to extract the alert outputs by looping over the XML:
foreach ($c in $temp){$c.timeOfAlertFormatted,$c.parent,$c.child,$c.category,$c.servicePlanDisplayName,$c.message}
Thu 09/19/2019 12:00:19 AM
IL
Servername
Phase Failure
Gold
One or more source luns do not have a remote target specified/mapped.
Wed 09/18/2019 02:18:25 PM
IL
Server2
Phase Failure
Gold
One or more source luns do not have a remote target specified/mapped
I am new to PS. What I want to achieve is to add a descriptive string to each field, i.e.:
Time: Thu 09/19/2019 12:00:19 AM
Country: IL
Server: servername
etc., for the rest of the fields.
I tried:
foreach ($c in $temp){Write-Host "Time : $($c.timeOfAlertFormatted)"}
Time :
Time :
Time :
Time :
Time :
Time :
Time :
Time :
Time :
Time :
Time :
Time :
Time : Thu 09/19/2019 12:00:19 AM
It's printing empty "Time" fields.
Here is an example of the XML:
It looks like you have already loaded the xml and filtered out the properties you need in a variable $temp.
I think what you want can be achieved by doing:
$temp | Select-Object @{Name = 'Time'; Expression = {$_.timeOfAlertFormatted}},
    @{Name = 'Country'; Expression = {$_.parent}},
    @{Name = 'ServerName'; Expression = {$_.child}},
    Category, ServicePlanDisplayName, Message
The above should output something like
Time : Thu 09/19/2019 12:00:19 AM
Country : IL
ServerName : Servername
Category : Phase Failure
ServicePlanDisplayName : Gold
Message : One or more source luns do not have a remote target specified/mapped.
Time : Wed 09/18/2019 02:18:25 PM
Country : IL
ServerName : Server2
Category : General Failure
ServicePlanDisplayName : Gold
Message : One or more source luns do not have a remote target specified/mapped.
If your variable $temp is NOT what I suspect it to be, please edit your question and show us the XML aswell as the code you use to extract the alerts from it.

Handling hash tables in custom objects

Our monitoring system has a REST API that I am trying to retrieve properties from, then output them to a custom object. A couple of the property values are hash tables.
In addition to keeping the custom objects for use in PowerShell, I would like to export them to CSV, so I need those hash tables to be something else so that I don't end up with System.Object[] in those columns.
The object returned from the API looks like this (truncated):
$allServices[0]
alertStatus : none
ignoreSSL : True
description : Test service
stopMonitoring : False
stopMonitoringByFolder : False
useDefaultLocationSetting : False
serviceProperties : {}
transition : 1
alertStatusPriority : 100000
serviceFolderId : 1
script :
disableAlerting : False
individualAlertLevel : warn
checkpoints : {@{id=1; geoInfo=Overall; smgId=0}, @{id=2; geoInfo=US - Los Angeles; smgId=1}, @{id=3; geoInfo=US - Washington DC; smgId=2}, @{id=4; geoInfo=US - San Francisco; smgId=3}...}
pageLoadAlertTimeInMS : 30000
sdtStatus : none-none-none
serviceStatus : alive
method : tabledriven
id : 1
Then checkpoints looks like this:
$allServices[0].checkpoints
id geoInfo            smgId
-- -------            -----
 1 Overall                0
 2 US - Los Angeles       1
 3 US - Washington DC     2
 4 US - San Francisco     3
 5 Europe - Dublin        4
 6 Asia - Singapore       5
What is the best way to deal with the checkpoints property?
Thanks.
Convert checkpoints (and possibly serviceProperties) to a JSON string.
$allServicesCSV = foreach ($srv in $allServices) {
    $srv = $srv.PSObject.Copy() # shallow copy
    $srv.checkpoints = ConvertTo-Json $srv.checkpoints -Compress
    $srv
}
Well, the requirements changed, so I only need to output to json/XML and not to CSV, making this much easier. Thanks.

Mongo stalled with too many insertions

I am trying to use MongoDB to run a multi-agent simulation.
I have one mongod instance on the same server that runs the simulation program, but when I have too many agents (~100,000 over 10 simulation steps) MongoDB stalls for seconds at a time.
The code for inserting data into Mongo is similar to:
// This runs once per agent iteration: connect, insert one document, disconnect.
if( mongo_client( &m_conn, m_dbhost.c_str(), m_dbport ) != MONGO_OK ) {
    cout << "failed to connect '" << m_dbhost << ":" << m_dbport << "'\n";
    cout << " mongo error: " << m_conn.err << endl;
    return;
}
bson_init( &b );
bson_append_new_oid( &b, "_id" );
bson_append_double( &b, "time", time );
bson_append_double( &b, "x", posx );
bson_append_double( &b, "y", posy );
bson_finish( &b );
if( mongo_insert( &m_conn, ns.c_str(), &b, NULL ) != MONGO_OK ){
    cout << "failed to insert in mongo\n";
}
bson_destroy( &b );
mongo_disconnect( &m_conn );
Also, during the simulation, if I try to access the database using the mongo shell, I get errors:
$ mongo
MongoDB shell version: 2.4.1
connecting to: test
Wed Apr 3 10:10:24.870 JavaScript execution failed: Error: couldn't connect to server 127.0.0.1:27017 at src/mongo/shell/mongo.js:L112
exception: connect failed
After the simulation has ended, the mongo shell becomes responsive again, and I can check that there is data in the database, but it has gaps. In the example, the agent m0n999 saved only 6 of its 10 steps:
> show dbs
dB0B7F527F0FA45518712C8CB27611BD7 5.951171875GB
local 0.078125GB
> db.ins.m0n999.find()
{ "_id" : ObjectId("515bdf564c60ec1e000003e7"), "time" : 1, "x" : 1.1, "y" : 8.1 }
{ "_id" : ObjectId("515be0214c60ec1e0001075f"), "time" : 2, "x" : 1.2000000000000002, "y" : 8.2 }
{ "_id" : ObjectId("515be1c04c60ec1e0002da3a"), "time" : 4, "x" : 1.4000000000000004, "y" : 8.399999999999999 }
{ "_id" : ObjectId("515be2934c60ec1e0003b82c"), "time" : 5, "x" : 1.5000000000000004, "y" : 8.499999999999998 }
{ "_id" : ObjectId("515be3664c60ec1e000497cf"), "time" : 6, "x" : 1.6000000000000005, "y" : 8.599999999999998 }
{ "_id" : ObjectId("515be6cc4c60ec1e000824b2"), "time" : 10, "x" : 2.000000000000001, "y" : 8.999999999999996 }
>
How can I solve this problem? How can I avoid the loss of connections and recover from the Mongo stalls?
UPDATE
I'm getting errors like these in the global log:
"Wed Apr 3 11:53:00.379 [conn1378573] error: hashtable namespace index max chain reached:1335",
"Wed Apr 3 11:53:00.379 [conn1378573] error: hashtable namespace index max chain reached:1335",
"Wed Apr 3 11:53:00.379 [conn1378573] error: hashtable namespace index max chain reached:1335",
"Wed Apr 3 11:53:00.379 [conn1378573] error: hashtable namespace index max chain reached:1335",
"Wed Apr 3 11:53:00.379 [conn1378573] end connection 127.0.0.1:40748 (1 connection now open)",
I solved the problem; it had two errors:
I was creating too many collections. I changed from one collection per agent to only one collection per simulation process.
I was creating too many connections. I changed from one connection per agent iteration to only one connection per simulation step.
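For illustration, here is the same fix sketched in Python/pymongo (the original code is C, and all names here are placeholders): keep a single client open for the whole simulation, write to a single collection per simulation run, and do one bulk insert per step instead of one connect/insert/disconnect per agent.

from pymongo import MongoClient

client = MongoClient("localhost", 27017)   # opened once, reused for the whole simulation
steps = client["simulation"]["steps"]      # one collection per simulation run

def save_step(step_time, agents):
    # agents: iterable of (agent_id, posx, posy) tuples -- illustrative structure
    docs = [{"agent": agent_id, "time": step_time, "x": posx, "y": posy}
            for agent_id, posx, posy in agents]
    if docs:
        steps.insert_many(docs)            # one round trip per simulation step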