apache kafka NoReplicaOnlineException - apache-kafka

Using Apache Kafka with a single node (1 Zookeeper, 1 Broker) I get this exception (repeated multiple times):
kafka.common.NoReplicaOnlineException: No replica in ISR for partition __consumer_offsets-2 is alive. Live brokers are: [Set()], ISR brokers are: [0]
What does it mean? Note that I am starting the KafkaServer programmatically, and I am able to send to and consume from a topic using the CLI tools.
It seems I should tell this node that it is operating in standalone mode - how should I do this?
This seems to happen during startup.
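For reference, the broker is started roughly like the sketch below. This is a minimal sketch rather than my exact code: the paths, ports, and the kafka.server.KafkaServerStartable entry point are assumptions for a Kafka 1.x-era broker, and the replication-related settings are forced to 1 because there is only one broker.

    import java.util.Properties;
    import kafka.server.KafkaServerStartable;

    public class SingleNodeBroker {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("broker.id", "0");
            props.put("listeners", "PLAINTEXT://localhost:9092"); // placeholder
            props.put("log.dirs", "/tmp/kafka-logs");             // placeholder
            props.put("zookeeper.connect", "localhost:2181");     // placeholder
            // With a single broker the internal topics can only have one replica.
            props.put("offsets.topic.replication.factor", "1");
            props.put("transaction.state.log.replication.factor", "1");
            props.put("transaction.state.log.min.isr", "1");

            KafkaServerStartable broker = KafkaServerStartable.fromProps(props);
            broker.startup();
            Runtime.getRuntime().addShutdownHook(new Thread(broker::shutdown));
            broker.awaitShutdown();
        }
    }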
Full exception:
17-11-07 19:43:44 NP-3255AJ193091.home ERROR [state.change.logger:107]
- [Controller id=0 epoch=54] Initiated state change for partition __consumer_offsets-16 from OfflinePartition to OnlinePartition failed
kafka.utils.ShutdownableThread.run ShutdownableThread.scala: 64
kafka.controller.ControllerEventManager$ControllerEventThread.doWork ControllerEventManager.scala: 52
kafka.metrics.KafkaTimer.time KafkaTimer.scala: 31
kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply ControllerEventManager.scala: 53 (repeats 2 times)
kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply$mcV$sp ControllerEventManager.scala: 53
kafka.controller.KafkaController$Startup$.process KafkaController.scala: 1581
kafka.controller.KafkaController.elect KafkaController.scala: 1681
kafka.controller.KafkaController.onControllerFailover KafkaController.scala: 298
kafka.controller.PartitionStateMachine.startup PartitionStateMachine.scala: 58
kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange PartitionStateMachine.scala: 81
scala.collection.TraversableLike$WithFilter.foreach TraversableLike.scala: 732
scala.collection.mutable.HashMap.foreach HashMap.scala: 130
scala.collection.mutable.HashMap.foreachEntry HashMap.scala: 40
scala.collection.mutable.HashTable$class.foreachEntry HashTable.scala: 236
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply HashMap.scala: 130 (repeats 2 times)
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply TraversableLike.scala: 733
kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply PartitionStateMachine.scala: 81
kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply PartitionStateMachine.scala: 84
kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange PartitionStateMachine.scala: 163
kafka.controller.PartitionStateMachine.electLeaderForPartition PartitionStateMachine.scala: 303
kafka.controller.OfflinePartitionLeaderSelector.selectLeader PartitionLeaderSelector.scala: 65
kafka.common.NoReplicaOnlineException: No replica in ISR for partition __consumer_offsets-16 is alive. Live brokers are: [Set()], ISR brokers are: [0]

Related

Flink task manager distributes tasks unevenly

I'm using Kafka streaming to handle calculation work triggered by Kafka request messages. However, I have just discovered that the Flink task managers seem to distribute the Kafka messages to the workers in batches, so some workers get more batches of work than others. As a result, the entire job runs longer than expected because the work is not distributed equally across all workers.
Does anyone know how we can change the batch size to a smaller number so that the Flink task managers can distribute the work more evenly?
Here are the numbers:
Thread        Tasks   Elapsed (s)
thread 101    463     2217
thread 103    464     2757
thread 94     232     1493
thread 95     232     1376
thread 96     232     1277
thread 97     463     2098
thread 98     232     1008
thread 99     232     1252
We can see from the table above that some threads got more work than others and took more time to complete all of their tasks.
Is it possible to change the batch size? I couldn't find any parameters relevant to message batching in Flink.
Thanks!
How can we adjust the batch size used by the FlinkKafkaConsumer010 in Flink?
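For context, the consumer is wired up roughly as in the sketch below. This is a minimal sketch, not our production code: the topic, group id, bootstrap servers and parallelism are placeholders, and the fetch-related properties shown are standard Kafka consumer settings passed through to the underlying KafkaConsumer, not Flink-specific batch parameters.

    import java.util.Properties;
    // In older Flink versions this class lives in org.apache.flink.streaming.util.serialization.
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;

    public class CalcJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder
            props.setProperty("group.id", "calc-workers");        // placeholder
            // Standard Kafka consumer settings that bound how much is fetched per poll;
            // they are handed straight to the KafkaConsumer inside the connector.
            props.setProperty("max.poll.records", "50");
            props.setProperty("max.partition.fetch.bytes", "262144");

            FlinkKafkaConsumer010<String> source =
                new FlinkKafkaConsumer010<>("calc-requests", new SimpleStringSchema(), props); // topic is a placeholder

            // Each Kafka partition is assigned to exactly one source subtask, so the
            // per-subtask load also depends on how many partitions each subtask receives.
            DataStream<String> requests = env.addSource(source).setParallelism(8);
            requests.print();

            env.execute("calc job");
        }
    }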

Separate Apache Kafka clusters unreachable at the same time - kafka_network_socketserver_networkprocessoravgidlepercent goes to zero

We have 4 Kafka clusters:
ENV1: 6 brokers and 3 zookeepers
ENV2: 6 brokers and 3 zookeepers
ENV3: 8 brokers (on 2 DCs, 4-4 brokers) and 9 zookeepers (on 3 DCs, 3-3-3 nodes)
ENV4: 16 brokers (on 2 DCs, 8-8 brokers) and 9 zookeepers (on 3 DCs, 3-3-3 nodes)
All of the Kafka brokers are on version 2.7.0, and all of the ZK nodes are on version 3.4.13. Every Kafka broker and ZK node is a VM. Each of the four environments runs in a separate subnet. Swap is turned off everywhere. All of the clusters are Kerberized and use a separate, highly available AD, which contains 7 Kerberos servers.
VM parameters:
ENV1:
  Kafka brokers: 16 GB RAM, 8 vCPU, 1120 GB hard disk, RHEL 7.9
  ZK nodes: 4 GB RAM, 2 vCPU, 105 GB hard disk, RHEL 8.5
ENV2:
  Kafka brokers: 16 GB RAM, 8 vCPU, 1120 GB hard disk, RHEL 7.9
  ZK nodes: 4 GB RAM, 2 vCPU, 105 GB hard disk, RHEL 8.5
ENV3:
  Kafka brokers: 24 GB RAM, 8 vCPU, 2120 GB hard disk, RHEL 7.9
  ZK nodes: 8 GB RAM, 2 vCPU, 200 GB hard disk, RHEL 8.5
ENV4:
  Kafka brokers: 24 GB RAM, 8 vCPU, 7145 GB hard disk, RHEL 7.9
  ZK nodes: 8 GB RAM, 2 vCPU, 200 GB hard disk, RHEL 7.9
We have the following issue in every environment at the same time, 3-4 times a day, for a few seconds: our kafka_network_socketserver_networkprocessoravgidlepercent metric drops to zero on every broker and the cluster becomes unreachable; even the brokers cannot communicate with each other when this happens. Here is a picture of it from our Grafana dashboard:
We can see the following ERRORs in the server log, but we suppose all of them are just consequences:
ERROR Error while processing notification change for path = /kafka-acl-changes (kafka.common.ZkNodeChangeNotificationListener)
kafka.zookeeper.ZooKeeperClientExpiredException: Session expired either before or while waiting for connection
at kafka.zookeeper.ZooKeeperClient.$anonfun$waitUntilConnected$3(ZooKeeperClient.scala:270)
at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:258)
at kafka.zookeeper.ZooKeeperClient.$anonfun$waitUntilConnected$1(ZooKeeperClient.scala:252)
at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:252)
at kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1730)
at kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1700)
at kafka.zk.KafkaZkClient.retryRequestUntilConnected(KafkaZkClient.scala:1695)
at kafka.zk.KafkaZkClient.getChildren(KafkaZkClient.scala:719)
at kafka.common.ZkNodeChangeNotificationListener.kafka$common$ZkNodeChangeNotificationListener$$processNotifications(ZkNodeChangeNotificationListener.scala:83)
at kafka.common.ZkNodeChangeNotificationListener$ChangeNotification.process(ZkNodeChangeNotificationListener.scala:120)
at kafka.common.ZkNodeChangeNotificationListener$ChangeEventProcessThread.doWork(ZkNodeChangeNotificationListener.scala:146)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
ERROR [ReplicaManager broker=1] Error processing append operation on partition topic-4 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.OutOfOrderSequenceException: Out of order sequence number for producerId 461002 at offset 5761036 in partition topic-4: 1022 (incoming seq. number), 1014 (current end sequence number)
ERROR [KafkaApi-5] Number of alive brokers '0' does not meet the required replication factor '3' for the offsets topic (configured via 'offsets.topic.replication.factor'). This error can be ignored if the cluster is starting up and not all brokers are up yet. (kafka.server.KafkaApis)
ERROR [KafkaApi-11] Error when handling request: clientId=broker-12-fetcher-0, correlationId=8972304, api=FETCH, version=12, body={replica_id=12,max_wait_ms=500,min_bytes=1,max_bytes=10485760,isolation_level=0,session_id=683174603,session_epoch=47,topics=[{topic=__consumer_offsets,partitions=[{partition=36,current_leader_epoch=294,fetch_offset=1330675527,last_fetched_epoch=-1,log_start_offset=0,partition_max_bytes=1048576,_tagged_fields={}},{partition=25,current_leader_epoch=288,fetch_offset=3931235489,last_fetched_epoch=-1,log_start_offset=0,partition_max_bytes=1048576,_tagged_fields={}}],_tagged_fields={}}],forgotten_topics_data=[],rack_id=,_tagged_fields={}} (kafka.server.KafkaApis)
org.apache.kafka.common.errors.NotLeaderOrFollowerException: Leader not local for partition topic-42 on broker 11
ERROR [GroupCoordinator 5]: Group consumer-group_id could not complete rebalance because no members rejoined (kafka.coordinator.group.GroupCoordinator)
ERROR [Log partition=topic-1, dir=/kafka/logs] Could not find offset index file corresponding to log file /kafka/logs/topic-1/00000000000001842743.log, recovering segment and rebuilding index files... (kafka.log.Log)
[ReplicaFetcher replicaId=16, leaderId=10, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=16, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1439560581, epoch=1138570), rackId=) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 10 was disconnected before the response was read
at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100)
at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:110)
at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:211)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:301)
at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:136)
at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:135)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:118)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
I will update the question if needed, e.g. with relevant Kafka configs.
We know that it might be an issue with our infrastructure, but we cannot see a problem on the network side, nor on the Kerberos side, which is why I'm asking for help here. Do you have any idea what may cause this issue? Every idea is welcome, because we have run out of them.
Thanks in advance!
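In case it is useful, the raw broker metric could be cross-checked directly over JMX, independently of Grafana, to rule out a scraping artifact. A minimal sketch, assuming JMX is enabled on the broker JVM (the host, port and polling interval are placeholders):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class NetworkIdleProbe {
        public static void main(String[] args) throws Exception {
            // Placeholder host/port; requires the broker to expose JMX (e.g. JMX_PORT=9999).
            JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // The broker metric behind kafka_network_socketserver_networkprocessoravgidlepercent.
                ObjectName idle = new ObjectName(
                    "kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent");
                for (int i = 0; i < 30; i++) {
                    // A value close to 0.0 means the network threads are fully busy.
                    Object value = mbs.getAttribute(idle, "Value");
                    System.out.println(System.currentTimeMillis() + " idle=" + value);
                    Thread.sleep(1000L);
                }
            } finally {
                connector.close();
            }
        }
    }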

Multiple Kafka producers with different `ack`

I'm wondering about a scenario. Let's assume we have a Kafka cluster with 3 nodes (97, 98, 99) and topic A with replication.factor=3, min.insync.replicas=2 and unclean.leader.election.enable=false. We start two producers: one with acks=all and a second with acks=1. Both are continuously sending messages to topic A. What happens when:
t[0] broker 97 is the partition leader
t[1] the acks=all producer sends a message to the leader; the message gets offset 12
t[2] message 12 is replicated to broker 98 but not to broker 99
t[3] the producer receives an acknowledgement
t[4] brokers 97 and 98 go down
t[5] the acks=1 producer sends a message
Is it possible that broker 99 is elected as leader and that, at t[6], the producer receives an acknowledgement?
If not - how does Kafka know that broker 99 cannot be elected as leader? Is it somehow connected with the high watermark?
If yes - what offset does the message sent at t[5] get? And what happens when brokers 97 and 98 come back up afterwards?
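For concreteness, a minimal sketch of the two producer configurations being compared (bootstrap servers, topic and keys are placeholders; the only intended difference between the two producers is the acks setting, while min.insync.replicas is a topic/broker-side setting):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class TwoProducers {
        private static KafkaProducer<String, String> newProducer(String acks) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
                    "broker97:9092,broker98:9092,broker99:9092"); // placeholder
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            // "all": the leader waits for the in-sync replicas (subject to min.insync.replicas).
            // "1":   the leader acknowledges as soon as it has appended the record locally.
            props.put(ProducerConfig.ACKS_CONFIG, acks);
            return new KafkaProducer<>(props);
        }

        public static void main(String[] args) {
            try (KafkaProducer<String, String> ackAll = newProducer("all");
                 KafkaProducer<String, String> ackOne = newProducer("1")) {
                ackAll.send(new ProducerRecord<>("A", "k", "message acknowledged by the ISR"));
                ackOne.send(new ProducerRecord<>("A", "k", "message acknowledged by the leader only"));
                ackAll.flush();
                ackOne.flush();
            }
        }
    }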

Apache storm kafka spout only reading from half of a topic's partitions

A problem developed on our production Storm cluster that we cannot figure out or work around.
At some point it appears that the kafka spout stopped reading from half of the topic partitions. There are 40 partitions, and it is only reading from 20 of them. We cannot find any changes that we made to either the storm cluster or kafka at the time this started happening.
We changed the consumer group id and set the spout config startOffsetTime to OffsetRequest.LatestTime to try to get it reading fresh data from all partitions. It still only connects to the same 20 partitions. We've looked at the node /<topic>/<consumer_group> inside the Storm zookeeper and see only 20 partitions there.
We have verified that messages are being published to all 40 partitions.
Kafka version is 0.9.0.1, Storm version is 1.1.0.
Any tips on how to debug or where to look would be greatly appreciated. Did I mention that this is happening in production? Did I mention it started a week ago, and we just noticed this morning? :(
Additional info: we found some errors in the Kafka state change log (partition 9 is one of the affected partitions and the timestamp in the log looks to be about the time that the problem started)
kafka.common.NoReplicaOnlineException: No replica for partition [transcription-results,9] is alive. Live brokers are: [Set()], Assigned replicas are: [List(1, 4, 0)]
[2018-03-14 03:11:40,863] TRACE Controller 0 epoch 44 changed state of replica 1 for partition [transcription-results,9] from OnlineReplica to OfflineReplica (state.change.logger)
[2018-03-14 03:11:41,141] TRACE Controller 0 epoch 44 sending become-follower LeaderAndIsr request (Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44) to broker 4 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,145] TRACE Controller 0 epoch 44 sending become-follower LeaderAndIsr request (Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44) to broker 0 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,208] TRACE Controller 0 epoch 44 changed state of replica 4 for partition [transcription-results,9] from OnlineReplica to OnlineReplica (state.change.logger)
[2018-03-14 03:11:41,218] TRACE Controller 0 epoch 44 changed state of replica 1 for partition [transcription-results,9] from OfflineReplica to OnlineReplica (state.change.logger)
[2018-03-14 03:11:41,226] TRACE Controller 0 epoch 44 sending become-follower LeaderAndIsr request (Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44) to broker 4 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,230] TRACE Controller 0 epoch 44 sending become-follower LeaderAndIsr request (Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44) to broker 1 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,450] TRACE Broker 0 received LeaderAndIsr request (LeaderAndIsrInfo:Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44),ReplicationFactor:3),AllReplicas:1,4,0) correlation id 158 from controller 0 epoch 44 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,454] TRACE Broker 0 handling LeaderAndIsr request correlationId 158 from controller 0 epoch 44 starting the become-follower transition for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,455] ERROR Broker 0 received LeaderAndIsrRequest with correlation id 158 from controller 0 epoch 44 for partition [transcription-results,9] but cannot become follower since the new leader -1 is unavailable. (state.change.logger)
//... removed some TRACE statements here
[2018-03-14 03:11:41,908] WARN Broker 0 ignoring LeaderAndIsr request from controller 1 with correlation id 1 epoch 47 for partition [transcription-results,9] since its associated leader epoch 441 is old. Current leader epoch is 441 (state.change.logger)
[2018-03-14 03:11:41,982] TRACE Broker 0 cached leader info (LeaderAndIsrInfo:Leader:1,ISR:0,1,4,LeaderEpoch:441,ControllerEpoch:44),ReplicationFactor:3),AllReplicas:1,4,0) for partition [transcription-results,9] in response to UpdateMetadata request sent by controller 1 epoch 47 with correlation id 2 (state.change.logger)
[2018-03-22 14:43:36,098] TRACE Broker 0 received LeaderAndIsr request (LeaderAndIsrInfo:Leader:-1,ISR:,LeaderEpoch:444,ControllerEpoch:47),ReplicationFactor:3),AllReplicas:1,4,0) correlation id 679 from controller 1 epoch 47 for partition [transcription-results,9] (state.change.logger)
Possibly caused by this bug: https://issues.apache.org/jira/browse/KAFKA-3963
But how can we recover from it?
I'd start by looking in Kafka's Zookeeper under /brokers/topics to verify that all partitions are listed. That's where storm-kafka reads the partitions from.
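A minimal sketch of that check with the plain ZooKeeper Java client (the ensemble address is a placeholder; the topic name is taken from the logs above, and the assumed layout is the standard /brokers/topics/<topic>/partitions/<n>/state):

    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import org.apache.zookeeper.ZooKeeper;

    public class ListTopicPartitions {
        public static void main(String[] args) throws Exception {
            // Connect to the ZooKeeper ensemble the Kafka brokers use (placeholder address).
            ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000, event -> { });
            try {
                // Kafka registers one child znode per partition under this path.
                List<String> partitions =
                    zk.getChildren("/brokers/topics/transcription-results/partitions", false);
                System.out.println("partitions (" + partitions.size() + "): " + partitions);
                // Each partition's current leader and ISR live in .../partitions/<n>/state.
                for (String p : partitions) {
                    byte[] state = zk.getData(
                        "/brokers/topics/transcription-results/partitions/" + p + "/state", false, null);
                    System.out.println(p + " -> " + new String(state, StandardCharsets.UTF_8));
                }
            } finally {
                zk.close();
            }
        }
    }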

Mesos master not elected as leader

I am deploying Mesos/Marathon/ZooKeeper on a cluster of one physical machine: is this a viable configuration?
(I have two machines, but I am starting with the first one.)
When launching Mesos I get the following result on the page: "No master is currently leading. This master is not the leader ..."
I set the quorum to 1 - is that a valid value? I tried 0, but Mesos doesn't even start.
EDIT
cat mesos-master.INFO
Log file created at: 2015/10/19 15:58:15
Running on machine: 192.168.0.38
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1019 15:58:15.592771 8 logging.cpp:172] INFO level logging started!
I1019 15:58:15.593093 8 main.cpp:229] Build: 2015-10-12 20:57:28 by root
I1019 15:58:15.593111 8 main.cpp:231] Version: 0.25.0
I1019 15:58:15.593123 8 main.cpp:234] Git tag: 0.25.0
I1019 15:58:15.593135 8 main.cpp:238] Git SHA: 2dd7f7ee115fe00b8e098b0a10762a4fa8f4600f
I1019 15:58:15.593276 8 main.cpp:252] Using 'HierarchicalDRF' allocator
I1019 15:58:15.660604 8 leveldb.cpp:176] Opened db in 67.183194ms
I1019 15:58:15.678915 8 leveldb.cpp:183] Compacted db in 18.242065ms
I1019 15:58:15.678962 8 leveldb.cpp:198] Created db iterator in 14924ns
I1019 15:58:15.678982 8 leveldb.cpp:204] Seeked to beginning of db in 1323ns
I1019 15:58:15.678998 8 leveldb.cpp:273] Iterated through 0 keys in the db in 2556ns
I1019 15:58:15.679056 8 replica.cpp:744] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
I1019 15:58:15.680054 30 log.cpp:238] Attempting to join replica to ZooKeeper group
I1019 15:58:15.680531 36 recover.cpp:449] Starting replica recovery
I1019 15:58:15.680735 36 recover.cpp:475] Replica is in EMPTY status
I1019 15:58:15.683269 8 main.cpp:465] Starting Mesos master
I1019 15:58:15.684293 35 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I1019 15:58:15.684648 37 recover.cpp:195] Received a recover response from a replica in EMPTY status
I1019 15:58:15.685711 31 recover.cpp:566] Updating replica status to STARTING
I1019 15:58:15.688724 31 master.cpp:376] Master 74ee40e5-16b6-4a40-8288-4f563806b5cb (192.168.0.38) started on 192.168.0.38:5050
I1019 15:58:15.688765 31 master.cpp:378] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="false" --authenticators="crammd5" --authorizers="local" --cluster="" --framework_sorter="drf" --help="false" --hostname="192.168.0.38" --hostname_lookup="true" --initialize_driver_logging="true" --ip="192.168.0.38" --log_auto_initialize="true" --log_dir="/etc/mesos/logs" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="1" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" --registry_strict="false" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui" --work_dir="/var/lib/mesos" --zk="zk://192.168.0.38:2181/mesos" --zk_session_timeout="10secs"
I1019 15:58:15.689028 31 master.cpp:425] Master allowing unauthenticated frameworks to register
I1019 15:58:15.689049 31 master.cpp:430] Master allowing unauthenticated slaves to register
I1019 15:58:15.689071 31 master.cpp:467] Using default 'crammd5' authenticator
W1019 15:58:15.689100 31 authenticator.cpp:505] No credentials provided, authentication requests will be refused
I1019 15:58:15.689422 31 authenticator.cpp:512] Initializing server SASL
I1019 15:58:15.695821 31 contender.cpp:149] Joining the ZK group
I1019 15:58:15.699548 34 group.cpp:331] Group process (group(2)#192.168.0.38:5050) connected to ZooKeeper
I1019 15:58:15.706737 30 master.cpp:1542] Successfully attached file '/etc/mesos/logs/mesos-master.INFO'
I1019 15:58:15.706755 34 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (1, 0, 0)
I1019 15:58:15.706826 34 group.cpp:403] Trying to create path '/mesos/log_replicas' in ZooKeeper
I1019 15:58:15.710538 35 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 24.699241ms
I1019 15:58:15.710598 35 replica.cpp:323] Persisted replica status to STARTING
I1019 15:58:15.710695 37 recover.cpp:475] Replica is in STARTING status
I1019 15:58:15.710979 32 replica.cpp:641] Replica in STARTING status received a broadcasted recover request
I1019 15:58:15.711148 31 recover.cpp:195] Received a recover response from a replica in STARTING status
I1019 15:58:15.711293 37 recover.cpp:566] Updating replica status to VOTING
I1019 15:58:15.723206 37 group.cpp:331] Group process (group(1)#192.168.0.38:5050) connected to ZooKeeper
I1019 15:58:15.730281 37 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I1019 15:58:15.730325 37 group.cpp:403] Trying to create path '/mesos/log_replicas' in ZooKeeper
I1019 15:58:15.731256 33 group.cpp:331] Group process (group(4)#192.168.0.38:5050) connected to ZooKeeper
I1019 15:58:15.731343 33 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I1019 15:58:15.731359 33 group.cpp:403] Trying to create path '/mesos' in ZooKeeper
I1019 15:58:15.731947 30 group.cpp:331] Group process (group(3)#192.168.0.38:5050) connected to ZooKeeper
I1019 15:58:15.733675 30 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (1, 0, 0)
I1019 15:58:15.734716 30 group.cpp:403] Trying to create path '/mesos' in ZooKeeper
I1019 15:58:15.734612 31 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 23.245997ms
I1019 15:58:15.734902 31 replica.cpp:323] Persisted replica status to VOTING
I1019 15:58:15.734987 32 recover.cpp:580] Successfully joined the Paxos group
I1019 15:58:15.735080 32 recover.cpp:464] Recover process terminated
I1019 15:58:15.745573 33 network.hpp:415] ZooKeeper group memberships changed
I1019 15:58:15.745687 32 group.cpp:674] Trying to get '/mesos/log_replicas/0000000030' in ZooKeeper
I1019 15:58:15.749068 35 network.hpp:463] ZooKeeper group PIDs: { log-replica(1)#192.168.0.38:5050 }
I1019 15:58:15.750468 36 contender.cpp:265] New candidate (id='30') has entered the contest for leadership
I1019 15:58:15.751054 33 detector.cpp:156] Detected a new leader: (id='30')
I1019 15:58:15.751231 34 group.cpp:674] Trying to get '/mesos/json.info_0000000030' in ZooKeeper
I1019 15:58:15.751909 33 detector.cpp:481] A new leading master (UPID=master#192.168.0.38:5050) is detected
I1019 15:58:15.752105 34 master.cpp:1603] The newly elected leader is master#192.168.0.38:5050 with id 74ee40e5-16b6-4a40-8288-4f563806b5cb
I1019 15:58:15.752182 34 master.cpp:1616] Elected as the leading master!
I1019 15:58:15.752208 34 master.cpp:1376] Recovering from registrar
I1019 15:58:15.752290 33 registrar.cpp:309] Recovering registrar
I1019 15:58:15.752581 35 log.cpp:661] Attempting to start the writer
I1019 15:58:15.753067 34 replica.cpp:477] Replica received implicit promise request with proposal 1
I1019 15:58:15.774773 34 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 21.627305ms
I1019 15:58:15.774910 34 replica.cpp:345] Persisted promised to 1
I1019 15:58:15.775125 34 coordinator.cpp:231] Coordinator attemping to fill missing position
I1019 15:58:15.775501 33 replica.cpp:378] Replica received explicit promise request for position 0 with proposal 2
I1019 15:58:15.790747 33 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 15.185008ms
I1019 15:58:15.790833 33 replica.cpp:679] Persisted action at 0
I1019 15:58:15.791260 33 replica.cpp:511] Replica received write request for position 0
I1019 15:58:15.791342 33 leveldb.cpp:438] Reading position from leveldb took 23444ns
I1019 15:58:15.803988 33 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 12.606014ms
I1019 15:58:15.804051 33 replica.cpp:679] Persisted action at 0
I1019 15:58:15.804256 32 replica.cpp:658] Replica received learned notice for position 0
I1019 15:58:15.815990 32 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 11.675285ms
I1019 15:58:15.816064 32 replica.cpp:679] Persisted action at 0
I1019 15:58:15.816087 32 replica.cpp:664] Replica learned NOP action at position 0
I1019 15:58:15.816222 34 log.cpp:677] Writer started with ending position 0
I1019 15:58:15.816537 32 leveldb.cpp:438] Reading position from leveldb took 9246ns
I1019 15:58:15.817867 36 registrar.cpp:342] Successfully fetched the registry (0B) in 65.515008ms
I1019 15:58:15.817951 36 registrar.cpp:441] Applied 1 operations in 11601ns; attempting to update the 'registry'
I1019 15:58:15.819144 30 log.cpp:685] Attempting to append 173 bytes to the log
I1019 15:58:15.819236 32 coordinator.cpp:341] Coordinator attempting to write APPEND action at position 1
I1019 15:58:15.819448 30 replica.cpp:511] Replica received write request for position 1
I1019 15:58:15.832018 30 leveldb.cpp:343] Persisting action (192 bytes) to leveldb took 12.520293ms
I1019 15:58:15.832092 30 replica.cpp:679] Persisted action at 1
I1019 15:58:15.832268 35 replica.cpp:658] Replica received learned notice for position 1
I1019 15:58:15.844065 35 leveldb.cpp:343] Persisting action (194 bytes) to leveldb took 11.769077ms
I1019 15:58:15.844130 35 replica.cpp:679] Persisted action at 1
I1019 15:58:15.844175 35 replica.cpp:664] Replica learned APPEND action at position 1
I1019 15:58:15.844462 31 registrar.cpp:486] Successfully updated the 'registry' in 26.468864ms
I1019 15:58:15.844506 30 log.cpp:704] Attempting to truncate the log to 1
I1019 15:58:15.844571 31 registrar.cpp:372] Successfully recovered registrar
I1019 15:58:15.844599 30 coordinator.cpp:341] Coordinator attempting to write TRUNCATE action at position 2
I1019 15:58:15.844714 31 master.cpp:1413] Recovered 0 slaves from the Registry (134B) ; allowing 10mins for slaves to re-register
I1019 15:58:15.844790 37 replica.cpp:511] Replica received write request for position 2
I1019 15:58:15.858352 37 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 13.469108ms
I1019 15:58:15.858502 37 replica.cpp:679] Persisted action at 2
I1019 15:58:15.858723 37 replica.cpp:658] Replica received learned notice for position 2
I1019 15:58:15.874608 37 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 15.810315ms
I1019 15:58:15.874725 37 leveldb.cpp:401] Deleting ~1 keys from leveldb took 27159ns
I1019 15:58:15.874747 37 replica.cpp:679] Persisted action at 2
I1019 15:58:15.874764 37 replica.cpp:664] Replica learned TRUNCATE action at position 2
I1019 15:58:19.851761 37 master.cpp:2179] Received SUBSCRIBE call for framework 'marathon' at scheduler-45acfb6e-2a61-46e8-bef3-7bc5d0e76567#192.168.0.38:33773
I1019 15:58:19.851974 37 master.cpp:2250] Subscribing framework marathon with checkpointing enabled and capabilities [ ]
I1019 15:58:19.852358 30 hierarchical.hpp:515] Added framework a927b696-0597-4762-9969-f1fe2a5d7a2e-0000
I1019 15:58:34.856995 34 master.cpp:4640] Performing implicit task state reconciliation for framework a927b696-0597-4762-9969-f1fe2a5d7a2e-0000 (marathon) at scheduler-45acfb6e-2a61-46e8-bef3-7bc5d0e76567#192.168.0.38:33773
w3m http://192.168.0.38:5050
[loading]
Toggle navigation Mesos
• Frameworks
• Slaves
• Offers
• {{state.cluster}}
No master is currently leading ...
× This master is not the leader, redirecting in {{redirect / 1000}} seconds ... go now
{{ alert.title }}
{{ alert.message }}
• {{ bullet }}