Bluemix hyperledger block height/blocks out of sync - ibm-cloud

I have a Blockchain implementation on Bluemix with 4 peers, and I've been deploying new chaincode to it. Most recently, peer 3 took a long time to deploy, and I eventually thought stopping and restarting peer 3 would help. It didn't.
So while I've been deploying and invoking various chaincode, peer 3 is stale; new chaincode appears to be running on only 3 out of 4 peers.
I see errors in the sample logs below. How do I get peer 3 back in sync with the rest of the peers?
OUT - 18:34:30.273 [consensus/pbft] execDoneSync -> INFO 06b[0m Replica 3 finished execution 28, trying next
OUT - 18:48:07.588 [consensus/pbft] executeOne -> INFO 06c[0m Replica 3 executing/committing request batch for view=0/seqNo=29 and digest 5trDGesTKJPWIWy/RKjTq5vY2tIQZ/L/a7C7LvYurk/H2zYorDAN7zsTnbqq2kcR1HcqPcnpXK1Gqu8q1ItgFA==
OUT - 2017/02/20 18:54:10 transport: http2Client.notifyError got notified that the client transport was broken EOF.
OUT - [31m18:54:10.162 [peer] handleChat -> ERRO 06d[0m Error during Chat, stopping handler: stream error: code = 1 desc = "context canceled"
OUT - [31m18:54:10.162 [peer] handleChat -> ERRO 06e[0m Error during Chat, stopping handler: rpc error: code = 13 desc = transport is closing
OUT - [31m18:54:10.162 [peer] chatWithPeer -> ERRO 06f[0m Ending Chat with peer address 5cc24f88bbcc414a96962ea1c37c3aea-vp2.us.blockchain.ibm.com:30001 due to error: Error during Chat, stopping handler: rpc error: code = 13 desc = transport is closing
OUT - 2017/02/20 18:54:11 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 172.16.6.8:30001: getsockopt: connection refused"; Reconnecting to {"5cc24f88bbcc414a96962ea1c37c3aea-vp2.us.blockchain.ibm.com:30001" <nil>}
OUT - [31m18:54:11.668 [peer] handleChat -> ERRO 070[0m Error handling message: Peer FSM failed while handling message (DISC_HELLO): current state: created, error: transition canceled with error: Error registering Handler: Duplicate Handler error: {name:"vp2" 5cc24f88bbcc414a96962ea1c37c3aea-vp2.us.blockchain.ibm.com:30001 VALIDATOR `�ބ��M�U�d,��������9(ˑ(����}
OUT - [35m18:54:11.806 [consensus/pbft] recvCheckpoint -> CRIT 071[0m Network unable to find stable certificate for seqNo 30 (3 different values observed already)
OUT - panic: Network unable to find stable certificate for seqNo 30 (3 different values observed already)
OUT -
OUT - goroutine 71 [running]:
OUT - panic(0xc137a0, 0xc82032f9e0)
OUT - /opt/go/src/runtime/panic.go:464 +0x3e6
OUT - github.com/hyperledger/fabric/vendor/github.com/op/go-logging.(*Logger).Panicf(0xc8201ae4e0, 0x103cd40, 0x5d, 0xc8206863e0, 0x2, 0x2)
OUT - /opt/gopath/src/github.com/hyperledger/fabric/vendor/github.com/op/go-logging/logger.go:194 +0x11e
OUT - github.com/hyperledger/fabric/consensus/pbft.(*pbftCore).recvCheckpoint(0xc820069d40, 0xc8206863a0, 0x0, 0x0)
OUT - /opt/gopath/src/github.com/hyperledger/fabric/consensus/pbft/pbft-core.go:1185 +0xcc7
OUT - github.com/hyperledger/fabric/consensus/pbft.(*pbftCore).ProcessEvent(0xc820069d40, 0xdf2b40, 0xc8206863a0, 0x0, 0x0)
OUT - /opt/gopath/src/github.com/hyperledger/fabric/consensus/pbft/pbft-core.go:349 +0x571
OUT - github.com/hyperledger/fabric/consensus/pbft.(*obcBatch).ProcessEvent(0xc820220600, 0xdf2b40, 0xc8206863a0, 0x0, 0x0)
OUT - /opt/gopath/src/github.com/hyperledger/fabric/consensus/pbft/batch.go:429 +0x6b4
OUT - github.com/hyperledger/fabric/consensus/util/events.SendEvent(0x7f0e948fdbe0, 0xc820220600, 0xda32e0, 0xc82032f760)
OUT - /opt/gopath/src/github.com/hyperledger/fabric/consensus/util/events/events.go:113 +0x45
OUT - github.com/hyperledger/fabric/consensus/util/events.(*managerImpl).Inject(0xc820331920, 0xda32e0, 0xc82032f760)
OUT - /opt/gopath/src/github.com/hyperledger/fabric/consensus/util/events/events.go:123 +0x4f
OUT - github.com/hyperledger/fabric/consensus/util/events.(*managerImpl).eventLoop(0xc820331920)
OUT - /opt/gopath/src/github.com/hyperledger/fabric/consensus/util/events/events.go:132 +0xdb
OUT - created by github.com/hyperledger/fabric/consensus/util/events.(*managerImpl).Start
OUT - /opt/gopath/src/github.com/hyperledger/fabric/consensus/util/events/events.go:100 +0x35
OUT - 2017-02-20 18:54:11,817 INFO exited: start_peer (exit status 2; expected)
OUT - 2017-02-20 18:54:12,819 INFO spawned: 'start_peer' with pid 37
OUT - 18:54:12.869 [nodeCmd] serve -> INFO 001[0m Security enabled status: true
OUT - 18:54:12.869 [nodeCmd] serve -> INFO 002[0m Privacy enabled status: false
OUT - 18:54:12.869 [eventhub_producer] start -> INFO 003[0m event processor started
OUT - 18:54:12.869 [db] open -> INFO 004[0m Setting rocksdb maxLogFileSize to 10485760
OUT - 18:54:12.869 [db] open -> INFO 005[0m Setting rocksdb keepLogFileNum to 10
OUT - 18:54:12.960 [crypto] RegisterValidator -> INFO 006[0m Registering validator [peer3] with name [peer3]...
OUT - 18:54:12.961 [crypto] RegisterValidator -> INFO 007[0m Registering validator [peer3] with name [peer3]...done!
OUT - 18:54:12.962 [crypto] InitValidator -> INFO 008[0m Initializing validator [peer3]...
OUT - 18:54:12.964 [crypto] InitValidator -> INFO 009[0m Initializing validator [peer3]...done!
OUT - 18:54:12.965 [chaincode] NewChaincodeSupport -> INFO 00a[0m Chaincode support using peerAddress: 5cc24f88bbcc414a96962ea1c37c3aea-vp3.us.blockchain.ibm.com:30001
OUT - [33m18:54:12.965 [sysccapi] RegisterSysCC -> WARN 00b[0m Currently system chaincode does support security(noop,github.com/hyperledger/fabric/bddtests/syschaincode/noop)
OUT - 18:54:12.965 [state] loadConfig -> INFO 00c[0m Loading configurations...
OUT - 18:54:12.965 [state] loadConfig -> INFO 00d[0m Configurations loaded. stateImplName=[buckettree], stateImplConfigs=map[maxGroupingAtEachLevel:%!s(int=5) bucketCacheSize:%!s(int=100) numBuckets:%!s(int=1000003)], deltaHistorySize=[500]
OUT - 18:54:12.965 [state] NewState -> INFO 00e[0m Initializing state implementation [buckettree]
OUT - 18:54:12.965 [buckettree] initConfig -> INFO 00f[0m configs passed during initialization = map[string]interface {}{"numBuckets":1000003, "maxGroupingAtEachLevel":5, "bucketCacheSize":100}
OUT - 18:54:12.965 [buckettree] initConfig -> INFO 010[0m Initializing bucket tree state implemetation with configurations &{maxGroupingAtEachLevel:5 lowestLevel:9 levelToNumBucketsMap:map[6:8001 0:1 9:1000003 3:65 2:13 8:200001 7:40001 4:321 1:3 5:1601] hashFunc:0xab4dc0}
OUT - 18:54:12.966 [buckettree] newBucketCache -> INFO 011[0m Constructing bucket-cache with max bucket cache size = [100] MBs
OUT - 18:54:12.966 [buckettree] loadAllBucketNodesFromDB -> INFO 012[0m Loaded buckets data in cache. Total buckets in DB = [72]. Total cache size:=10240
OUT - 18:54:12.967 [consensus/controller] NewConsenter -> INFO 013[0m Creating consensus plugin pbft
OUT - 18:54:12.967 [consensus/pbft] newPbftCore -> INFO 014[0m PBFT type = *pbft.obcBatch
OUT - 18:54:12.967 [consensus/pbft] newPbftCore -> INFO 015[0m PBFT Max number of validating peers (N) = 4
OUT - 18:54:12.967 [consensus/pbft] newPbftCore -> INFO 016[0m PBFT Max number of failing peers (f) = 1
OUT - 18:54:12.967 [consensus/pbft] newPbftCore -> INFO 017[0m PBFT byzantine flag = false
OUT - 18:54:12.967 [consensus/pbft] newPbftCore -> INFO 018[0m PBFT request timeout = 30s
OUT - 18:54:12.967 [consensus/pbft] newPbftCore -> INFO 019[0m PBFT view change timeout = 30s
OUT - 18:54:12.967 [consensus/pbft] newPbftCore -> INFO 01a[0m PBFT Checkpoint period (K) = 10
OUT - 18:54:12.967 [consensus/pbft] newPbftCore -> INFO 01b[0m PBFT broadcast timeout = 1s
OUT - 18:54:12.967 [consensus/pbft] newPbftCore -> INFO 01c[0m PBFT Log multiplier = 4
OUT - 18:54:12.967 [consensus/pbft] newPbftCore -> INFO 01d[0m PBFT log size (L) = 40
OUT - 18:54:12.967 [consensus/pbft] newPbftCore -> INFO 01e[0m PBFT null requests disabled
OUT - 18:54:12.967 [consensus/pbft] newPbftCore -> INFO 01f[0m PBFT automatic view change disabled
OUT - 18:54:13.088 [consensus/pbft] restoreLastSeqNo -> INFO 020[0m Replica 3 restored lastExec: 28
OUT - 18:54:13.101 [consensus/pbft] restoreState -> INFO 021[0m Replica 3 restored state: view: 0, seqNo: 30, pset: 10, qset: 10, reqBatches: 10, chkpts: 1 h: 20
OUT - 18:54:13.101 [consensus/pbft] newObcBatch -> INFO 022[0m PBFT Batch size = 1000
OUT - 18:54:13.102 [consensus/pbft] newObcBatch -> INFO 023[0m PBFT Batch timeout = 1s
OUT - 18:54:13.102 [nodeCmd] serve -> INFO 024[0m Starting peer with ID=name:"vp3" , network ID=5cc24f88bbcc414a96962ea1c37c3aea, address=5cc24f88bbcc414a96962ea1c37c3aea-vp3.us.blockchain.ibm.com:30001, rootnodes=5cc24f88bbcc414a96962ea1c37c3aea-vp0.us.blockchain.ibm.com:30001,5cc24f88bbcc414a96962ea1c37c3aea-vp1.us.blockchain.ibm.com:30001,5cc24f88bbcc414a96962ea1c37c3aea-vp2.us.blockchain.ibm.com:30001, validator=true
OUT - 18:54:13.108 [rest] StartOpenchainRESTServer -> INFO 025[0m Initializing the REST service on 0.0.0.0:5001, TLS is enabled.
OUT - 18:54:13.109 [consensus/statetransfer] SyncToTarget -> INFO 026[0m Syncing to target 7f9573db0cae463b3f02b37312525e8f128d1415e05357d04751a88c01f831ff35e631a732c01c917aa9991a3c122a6e4be48ff50cf28f8e82b73729a4851087 for block number 28 with peers []
OUT - [31m18:54:13.180 [peer] handleChat -> ERRO 027[0m Error handling message: Peer FSM failed while handling message (DISC_HELLO): current state: created, error: transition canceled with error: Error registering Handler: Duplicate Handler error: {name:"vp2" 5cc24f88bbcc414a96962ea1c37c3aea-vp2.us.blockchain.ibm.com:30001 VALIDATOR `�ބ��M�U�d,��������9(ˑ(����}
OUT - [31m18:54:13.414 [peer] handleChat -> ERRO 028[0m Error handling message: Peer FSM failed while handling message (DISC_HELLO): current state: created, error: transition canceled with error: Error registering Handler: Duplicate Handler error: {name:"vp0" 5cc24f88bbcc414a96962ea1c37c3aea-vp0.us.blockchain.ibm.com:30001 VALIDATOR 2�)���J��;B���C��6U&�~ᑀ�A� }
OUT - [31m18:54:13.415 [peer] handleChat -> ERRO 029[0m Error handling message: Peer FSM failed while handling message (DISC_HELLO): current state: created, error: transition canceled with error: Error registering Handler: Duplicate Handler error: {name:"vp0" 5cc24f88bbcc414a96962ea1c37c3aea-vp0.us.blockchain.ibm.com:30001 VALIDATOR 2�)���J��;B���C��6U&�~ᑀ�A� }
OUT - [31m18:54:13.415 [peer] handleChat -> ERRO 02a[0m Error handling message: Peer FSM failed while handling message (DISC_HELLO): current state: created, error: transition canceled with error: Error registering Handler: Duplicate Handler error: {name:"vp0" 5cc24f88bbcc414a96962ea1c37c3aea-vp0.us.blockchain.ibm.com:30001 VALIDATOR 2�)���J��;B���C��6U&�~ᑀ�A� }
OUT - 18:54:13.478 [consensus/statetransfer] blockThread -> INFO 02b[0m Validated blockchain to the genesis block
OUT - 18:54:13.478 [consensus/pbft] ProcessEvent -> INFO 02c[0m Replica 3 application caught up via state transfer, lastExec now 28
OUT - [31m18:54:13.478 [consensus/pbft] Checkpoint -> ERRO 02d[0m Attempted to checkpoint a sequence number (28) which is not a multiple of the checkpoint interval (10)
OUT - [31m18:54:13.502 [peer] handleChat -> ERRO 02e[0m Error handling message: Peer FSM failed while handling message (DISC_HELLO): current state: created, error: transition canceled with error: Error registering Handler: Duplicate Handler error: {name:"vp1" 5cc24f88bbcc414a96962ea1c37c3aea-vp1.us.blockchain.ibm.com:30001 VALIDATOR �7��$iAG��zr-����8���f��8�q�<}
OUT - [31m18:54:13.526 [peer] handleChat -> ERRO 02f[0m Error handling message: Peer FSM failed while handling message (DISC_HELLO): current state: created, error: transition canceled with error: Error registering Handler: Duplicate Handler error: {name:"vp1" 5cc24f88bbcc414a96962ea1c37c3aea-vp1.us.blockchain.ibm.com:30001 VALIDATOR �7��$iAG��zr-����8���f��8�q�<}
OUT - [31m18:54:13.537 [peer] handleChat -> ERRO 030[0m Error handling message: Peer FSM failed while handling message (DISC_HELLO): current state: created, error: transition canceled with error: Error registering Handler: Duplicate Handler error: {name:"vp1" 5cc24f88bbcc414a96962ea1c37c3aea-vp1.us.blockchain.ibm.com:30001 VALIDATOR �7��$iAG��zr-����8���f��8�q�<}
OUT - 2017-02-20 18:54:28,551 INFO success: start_peer entered RUNNING state, process has stayed up for > than 15 seconds (startsecs)
OUT - /scripts/start.sh -network_id 5cc24f88bbcc414a96962ea1c37c3aea -peer_id vp3 -chaincode_host prod-us-01-chaincode-swarm-vp3.us.blockchain.ibm.com -chaincode_port 3383 -network_name us.blockchain.ibm.com -port_discovery 30001 -port_rest 5001 -port_event 31001 -peer_enrollid peer3 -chaincode_tls true -peer_tls true -num_peers 4
OUT - Enrollment secret is not passed calculating the default

This is certainly the behavior I would expect from the steps you describe. The actual synchronization of the peers needs a little more explanation, and depends on some of the configuration parameters set on the blockchain.
By stopping vp3 you have effectively taken vp3 out of consensus and caused it to advance its view. The blockchain can proceed fine with only 3 peers participating in consensus, and that is what is currently happening: the other three peers are participating and proceeding as normal, and they are happy with the state and view they are at. You may see some messages from vp3 to the other peers requesting a view change, but since they are perfectly fine without it, they will ignore the request for now.
From vp3's perspective, it knows it is out of line and holds itself out of consensus because of it. If the network stays in its current state (vp3 out of consensus and one view ahead; vp0, vp1 and vp2 in consensus, all in the same view but one view behind vp3), then based on some of vp3's PBFT configuration variables (on the Starter Network you are using, a 40-block log window with 10-block checkpoints) it will not worry about synchronization right away. Once it falls 40 blocks behind it will initiate a catch-up via state transfer, but it uses the next two 10-block checkpoints to accomplish this, so with the current configuration you will only see vp3 advance its chain when it is about 60 blocks behind the others. Please note that this just ensures vp3 does not fall too far behind; it will not necessarily put it back in consensus.
You can find more information on the PBFT in general and the way it is implemented on the Starter Network plan here ==> https://console.ng.bluemix.net/docs/services/blockchain/etn_pbft.html
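For reference, the window described above corresponds to the PBFT parameters visible in your log (K = 10, log multiplier = 4, L = 40). Below is a rough sketch of how these appear in the Fabric v0.6 consensus/pbft config.yaml; the key names and layout may differ slightly between releases, so treat it as illustrative rather than authoritative.
general:
    mode: batch            # matches "PBFT type = *pbft.obcBatch" in the log
    N: 4                   # number of validating peers
    f: 1                   # max faulty peers tolerated, (N-1)/3
    K: 10                  # checkpoint period
    logmultiplier: 4       # log size L = K * logmultiplier = 40
    viewchangeperiod: 0    # 0 = automatic view changes disabled
    timeout:
        request: 30s
        viewchange: 30s
        batch: 1s
        broadcast: 1s
On a locally hosted peer these can typically be overridden with the corresponding CORE_PBFT_GENERAL_* environment variables; on the Bluemix Starter Network they are fixed by the plan.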
Now, as to re-synchronization, it can happen a few different ways.
1) The other peers decide to change their view. This could happen for a number of reasons: a networking or communication issue between the peers, heavy load where the participating peers vote in a new leader they believe might be faster (and hence a view change), and others. When they vote to change their view, they advance to where vp3 is already waiting; vp3 then rapidly synchronizes itself with the blockchain and starts participating in consensus again, and all peers are back in sync at that point. This could happen at any time.
2) You can try to "force" resynchronization, that is, attempt to make the other peers advance their view to meet vp3. One way of doing this is to stop vp3, then stop another peer (for example vp2), advance the chain by doing an invoke directed at one of the remaining running peers, then restart vp2, then restart vp3 (a rough sketch of this sequence for a local docker network appears after this list). This realigns the peers in most cases, although timing can be a factor: there is a chance that all 4 peers advance their views (still leaving vp3 one view ahead) or that the other 3 peers advance their views past vp3, leaving vp3 one behind. If you are just looking to play around and see how the blockchain reacts in these situations, you could try this.
3) If you had your own local blockchain using the published docker images here ==> https://hub.docker.com/r/ibmblockchain/fabric-peer/ you could set configuration options that force automatic view changes on certain boundaries, which would bring an out-of-sync peer back in line on a more consistent basis (a hypothetical example appears below). That is not something you can do on the Starter Network on Bluemix that you seem to be using (from your screenshot).
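To make option 2 concrete, here is a rough sketch of the stop/invoke/restart sequence for a local docker-based network. The container names (vp0..vp3) and the chaincode name and arguments are placeholders; on the Bluemix Starter Network you would do the stop/start steps from the dashboard instead.
# Hypothetical local-network walkthrough of option 2
docker stop vp3        # vp3 is already out of sync; keep it down
docker stop vp2        # take a second peer down
# Advance the chain with an invoke sent to one of the peers still running
# (point the CLI at vp0 or vp1; chaincode name and arguments are placeholders)
peer chaincode invoke -n mycc -c '{"Function":"invoke","Args":["a","b","10"]}'
docker start vp2       # bring vp2 back first
docker start vp3       # then bring vp3 back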
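For option 3, automatic view changes in the v0.6 PBFT plugin are governed by the viewchangeperiod setting (your log shows "PBFT automatic view change disabled", i.e. it is 0). A hypothetical docker run for a locally hosted peer might look like the following; the exact CORE_PBFT_GENERAL_* environment-variable mapping and the other values shown are assumptions to verify against your image's config.yaml.
# Hypothetical: trigger an automatic view change every 2 sequence numbers on a local peer
docker run -d --name vp3 \
  -e CORE_PEER_ID=vp3 \
  -e CORE_PBFT_GENERAL_MODE=batch \
  -e CORE_PBFT_GENERAL_N=4 \
  -e CORE_PBFT_GENERAL_VIEWCHANGEPERIOD=2 \
  ibmblockchain/fabric-peer peer node start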
How this would all run, or would be configured, for a real solution environment depends heavily on your application and your intended use cases. Peer synchronization can be achieved at the cost of increased communication between peers, but the idea is not so much to keep all peers in sync as it is to ensure that what is written to the blockchain has been agreed to through a consensus process.
Hope this helps!

Related

Losing connection to Kafka. What happens?

A JobManager and a TaskManager are running on a single VM. Kafka also runs on the same server.
I have 10 tasks; all of them read from different Kafka topics, process messages and write back to Kafka.
Sometimes I find that my TaskManager is down and nothing is working. I tried to figure out the problem by checking the logs, and I believe it is a problem with the Kafka connection (or maybe a network problem? But everything is on a single server).
What I want to ask is: if I lose the connection to Kafka for a short period, what happens? Why do the tasks fail, and most importantly, why does the TaskManager crash?
Some logs:
2022-11-26 23:35:15,626 INFO org.apache.kafka.clients.NetworkClient [] - [Producer clientId=producer-15] Disconnecting from node 0 due to request timeout.
2022-11-26 23:35:15,626 INFO org.apache.kafka.clients.NetworkClient [] - [Producer clientId=producer-8] Disconnecting from node 0 due to request timeout.
2022-11-26 23:35:15,626 INFO org.apache.kafka.clients.NetworkClient [] - [Consumer clientId=cpualgosgroup1-1, groupId=cpualgosgroup1] Disconnecting from node 0 due to request timeout.
2022-11-26 23:35:15,692 INFO org.apache.kafka.clients.NetworkClient [] - [Consumer clientId=telefilter1-0, groupId=telefilter1] Cancelled in-flight FETCH request with correlation id 3630156 due to node 0 being disconnected (elapsed time since creation: 61648ms, elapsed time since send: 61648ms, request timeout: 30000ms)
2022-11-26 23:35:15,702 INFO org.apache.kafka.clients.NetworkClient [] - [Producer clientId=producer-15] Cancelled in-flight PRODUCE request with correlation id 2159429 due to node 0 being disconnected (elapsed time since creation: 51069ms, elapsed time since send: 51069ms, request timeout: 30000ms)
2022-11-26 23:35:15,702 INFO org.apache.kafka.clients.NetworkClient [] - [Consumer clientId=cpualgosgroup1-1, groupId=cpualgosgroup1] Cancelled in-flight FETCH request with correlation id 2344708 due to node 0 being disconnected (elapsed time since creation: 51184ms, elapsed time since send: 51184ms, request timeout: 30000ms)
2022-11-26 23:35:15,702 INFO org.apache.kafka.clients.NetworkClient [] - [Producer clientId=producer-15] Cancelled in-flight PRODUCE request with correlation id 2159430 due to node 0 being disconnected (elapsed time since creation: 51069ms, elapsed time since send: 51069ms, request timeout: 30000ms)
2022-11-26 23:35:15,842 WARN org.apache.kafka.clients.producer.internals.Sender [] - [Producer clientId=producer-15] Received invalid metadata error in produce request on partition tele.alerts.cpu-4 due to org.apache.kafka.common.errors.NetworkException: Disconnected from node 0. Going to request metadata update now
2022-11-26 23:35:15,842 WARN org.apache.kafka.clients.producer.internals.Sender [] - [Producer clientId=producer-8] Received invalid metadata error in produce request on partition tele.alerts.cpu-6 due to org.apache.kafka.common.errors.NetworkException: Disconnected from node 0. Going to request metadata update now
and then
2022-11-26 23:35:56,673 WARN org.apache.flink.runtime.taskmanager.Task [] - CPUTemperatureAnalysisAlgorithm -> Sink: Writer -> Sink: Committer (1/1)#0 (619139347a459b6de22089ff34edff39_d0ae1ab03e621ff140fb6b0b0a2932f9_0_0) switched from RUNNING to FAILED with failure cause: org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for 8d57994a59ab86ea9ee48076e80a7c7f.
at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1702)
...
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with id 99d52303d7e24496ae661ddea2b6a372 timed out.
2022-11-26 23:35:56,682 INFO org.apache.flink.runtime.taskmanager.Task [] - Triggering cancellation of task code CPUTemperatureAnalysisAlgorithm -> Sink: Writer -> Sink: Committer (1/1)#0 (619139347a459b6de22089ff34edff39_d0ae1ab03e621ff140fb6b0b0a2932f9_0_0).
2022-11-26 23:35:57,199 INFO org.apache.flink.runtime.taskmanager.Task [] - Attempting to fail task externally TemperatureAnalysis -> Sink: Writer -> Sink: Committer (1/1)#0 (619139347a459b6de22089ff34edff39_15071110d0eea9f1c7f3d75503ff58eb_0_0).
2022-11-26 23:35:57,202 WARN org.apache.flink.runtime.taskmanager.Task [] - TemperatureAnalysis -> Sink: Writer -> Sink: Committer (1/1)#0 (619139347a459b6de22089ff34edff39_15071110d0eea9f1c7f3d75503ff58eb_0_0) switched from RUNNING to FAILED with failure cause: org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for 8d57994a59ab86ea9ee48076e80a7c7f.
at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1702)
Why does the TaskExecutor lose its connection to the JobManager?
If I don't care about data loss, how should I configure the Kafka clients and Flink recovery? I just want the Kafka client not to die. In particular, I don't want my tasks or TaskManagers to crash. If I lose the connection, is it possible to configure Flink to simply wait? If we can't read, wait; and if we can't write back to Kafka, just wait?
The heartbeat of JobManager with id 99d52303d7e24496ae661ddea2b6a372 timed out.
Sounds like the server is somewhat overloaded. But you could try increasing the heartbeat timeout.
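The heartbeat settings live in flink-conf.yaml. A minimal sketch follows; the values are examples only, and the defaults noted in the comments are the usual Flink defaults, which may differ in your version:
# flink-conf.yaml
heartbeat.interval: 10000     # how often heartbeats are sent, in ms (default 10000)
heartbeat.timeout: 180000     # consider the peer dead after this many ms without a heartbeat (default 50000)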

Hyperledger Fabric on Kubernetes - Restarting a peer throws errors for a few minutes

I have a setup for 3 organisations on a Kubernetes cluster, and it gives me the following error when I restart a peer node pod and run peer channel list in a bash shell:
[comm.tls] ClientHandshake -> ERRO 026 Client TLS handshake failed after 2.997205009s with error: context canceled remoteaddress=10.0.94.178:7051
[grpc] WarningDepth -> DEBU 027 [core]grpc: addrConn.createTransport failed to connect to {peer0-org1:7051 peer0-org1:7051 <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: context canceled". Reconnecting...
Error: error getting endorser client for channel: endorser client failed to connect to peer0-org1:7051: failed to create new connection: context deadline exceeded
If I try the same command after some time (roughly 10-15 minutes),
[comm.tls] ClientHandshake -> DEBU 024 Client TLS handshake completed in 1.48399ms remoteaddress=10.0.94.178:7051
[grpc] InfoDepth -> DEBU 025 [core]Subchannel Connectivity change to READY
It then works and gives me the channel list. I am not able to find out the reason behind this; please help.

Error: error getting endorser client for channel: endorser client failed to connect to peer-govt:7051: failed to create new connection: context

I have been trying to deploy a Hyperledger Fabric network with 3 CAs, 1 orderer and 2 peer nodes. I am able to create the channel with the OSADMIN command of Fabric, but when I try to join the channel from a peer node, I get Error: error getting endorser client for channel: endorser client failed to connect to peer-govt:7051: failed to create new connection: context...... .
Here are the logs from the terminal (local host machine):
2021-06-01 06:38:54.509 UTC [common.tools.configtxgen] main -> INFO 001 Loading configuration
2021-06-01 06:38:54.522 UTC [common.tools.configtxgen.localconfig] completeInitialization -> INFO 002 orderer type: etcdraft
2021-06-01 06:38:54.522 UTC [common.tools.configtxgen.localconfig] completeInitialization -> INFO 003 Orderer.EtcdRaft.Options unset, setting to tick_interval:"500ms" election_tick:10 heartbeat_tick:1 max_inflight_blocks:5 snapshot_interval_size:16777216
2021-06-01 06:38:54.522 UTC [common.tools.configtxgen.localconfig] Load -> INFO 004 Loaded configuration: /etc/hyperledger/clipod/configtx/configtx.yaml
2021-06-01 06:38:54.712 UTC [common.tools.configtxgen] doOutputBlock -> INFO 005 Generating genesis block
2021-06-01 06:38:54.712 UTC [common.tools.configtxgen] doOutputBlock -> INFO 006 Creating application channel genesis block
2021-06-01 06:38:54.712 UTC [common.tools.configtxgen] doOutputBlock -> INFO 007 Writing genesis block
cli-dd4cc5fbf-pdcgb
Status: 201
{
"name": "commonchannel",
"url": "/participation/v1/channels/commonchannel",
"consensusRelation": "consenter",
"status": "active",
"height": 1
}
cli-dd4cc5fbf-pdcgb
Error: error getting endorser client for channel: endorser client failed to connect to peer-govt:7051: failed to create new connection: context deadline exceeded
command terminated with exit code 1
Error: error getting endorser client for channel: endorser client failed to connect to peer-general:9051: failed to create new connection: context deadline exceeded
command terminated with exit code 1
One thing to note here is that I am using Kubernetes with CLUSTER_IP services for all the pods.
Here are the logs from one of the peer pods (the same for the other):
2021-06-01 06:38:42.180 UTC [nodeCmd] registerDiscoveryService -> INFO 01b Discovery service activated
2021-06-01 06:38:42.180 UTC [nodeCmd] serve -> INFO 01c Starting peer with ID=[peer-govt], network ID=[dev], address=[peer-govt:7051]
2021-06-01 06:38:42.180 UTC [nodeCmd] func6 -> INFO 01d Starting profiling server with listenAddress = 0.0.0.0:6060
2021-06-01 06:38:42.180 UTC [nodeCmd] serve -> INFO 01e Started peer with ID=[peer-govt], network ID=[dev], address=[peer-govt:7051]
2021-06-01 06:38:42.181 UTC [kvledger] LoadPreResetHeight -> INFO 01f Loading prereset height from path [/var/hyperledger/production/ledgersData/chains]
2021-06-01 06:38:42.181 UTC [blkstorage] preResetHtFiles -> INFO 020 No active channels passed
2021-06-01 06:38:56.006 UTC [core.comm] ServerHandshake -> ERRO 021 Server TLS handshake failed in 24.669µs with error tls: first record does not look like a TLS handshake server=PeerServer remoteaddress=172.17.0.1:13258
2021-06-01 06:38:57.007 UTC [core.comm] ServerHandshake -> ERRO 022 Server TLS handshake failed in 17.772µs with error tls: first record does not look like a TLS handshake server=PeerServer remoteaddress=172.17.0.1:29568
2021-06-01 06:38:58.903 UTC [core.comm] ServerHandshake -> ERRO 023 Server TLS handshake failed in 13.581µs with error tls: first record does not look like a TLS handshake server=PeerServer remoteaddress=172.17.0.1:32615
To overcome this issue, I tried disabling TLS by setting CORE_PEER_TLS_ENABLED to FALSE;
the proposal then gets submitted, but the orderer pod throws the same TLS handshake failed error...
Here are the commands I am using to join the channel from the CLI pod:
kubectl -n hyperledger -it exec $CLI_POD -- sh -c "export FABRIC_CFG_PATH=/etc/hyperledger/clipod/config && export CORE_PEER_LOCALMSPID=GeneralMSP && export CORE_PEER_TLS_ROOTCERT_FILE=/etc/hyperledger/clipod/organizations/peerOrganizations/general.example.com/peers/peer0.general.example.com/tls/ca.crt && export CORE_PEER_MSPCONFIGPATH=/etc/hyperledger/clipod/organizations/peerOrganizations/general.example.com/users/Admin#general.example.com/msp && export CORE_PEER_ADDRESS=peer-general:9051 && peer channel join -b /etc/hyperledger/clipod/channel-artifacts/$CHANNEL_NAME.block -o orderer:7050 --tls --cafile /etc/hyperledger/clipod/organizations/ordererOrganizations/example.com/orderers/orderer.example.com/msp/tlscacerts/tlsca.example.com-cert.pem"
I am stuck on this problem; any help will be appreciated.
Thank you
I have fixed it. The issue was that I had not set CORE_PEER_TLS_ENABLED = true for the CLI pod.
One thing I have learned from this whole exercise: whenever you see a TLS issue, the first thing to check is the CORE_PEER_TLS_ENABLED variable. Make sure it is set for every pod or container you are trying to interact with; it can be false (no TLS) or true (use TLS) depending on your deployment.
The other thing to keep in mind is using the correct Fabric variables, including FABRIC_CFG_PATH, CORE_PEER_LOCALMSPID, CORE_PEER_TLS_ROOTCERT_FILE, CORE_PEER_MSPCONFIGPATH and others, depending on your command.
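For illustration, the fix amounts to exporting CORE_PEER_TLS_ENABLED alongside the variables already present in the CLI command shown in the question; the paths below are copied from that command and are specific to this deployment:
# Inside the CLI pod, before running peer channel join:
export FABRIC_CFG_PATH=/etc/hyperledger/clipod/config
export CORE_PEER_LOCALMSPID=GeneralMSP
export CORE_PEER_TLS_ENABLED=true    # this was the missing variable
export CORE_PEER_TLS_ROOTCERT_FILE=/etc/hyperledger/clipod/organizations/peerOrganizations/general.example.com/peers/peer0.general.example.com/tls/ca.crt
export CORE_PEER_MSPCONFIGPATH=/etc/hyperledger/clipod/organizations/peerOrganizations/general.example.com/users/Admin#general.example.com/msp
export CORE_PEER_ADDRESS=peer-general:9051
peer channel join -b /etc/hyperledger/clipod/channel-artifacts/$CHANNEL_NAME.block -o orderer:7050 --tls --cafile /etc/hyperledger/clipod/organizations/ordererOrganizations/example.com/orderers/orderer.example.com/msp/tlscacerts/tlsca.example.com-cert.pem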

Hyperledger Fabric - Channel hasn't started yet

After an unfortunate docker outage, one of our test channels has stopped working properly. This is the output of a previously working “peer chaincode invoke” command:
Error: error sending transaction for invoke: got unexpected status: SERVICE_UNAVAILABLE -- will not enqueue, consenter for this channel hasn't started yet - proposal response: version:1 response:<status:200 payload:"be85bda14845a33cd07db9825d2e473dc65902e6986fdfccea30d8c32385f758" > payload:"\n \364\242+\t\222\216\361\020\024}d7\203\277WY04\233\225vA\376u\330r\2045\312\206\304\333\022\211\001\n3\022\024\n\004lscc\022\014\n\n\n\004strs\022\002\010\004\022\033\n\004strs\022\023\032\021\n\tkeepalive\032\004ping\032E\010\310\001\032#be85bda14845a33cd07db9825d2e473dc65902e6986fdfccea30d8c32385f758\"\013\022\004strs\032\0031.0" endorsement:<endorser:"\n\013BackboneMSP\022\203\007-----BEGIN CERTIFICATE----- etc. more output removed from here
I find this in the orderer's log:
2019-07-29 14:46:50.930 UTC [orderer/consensus/kafka] try -> DEBU 3c10 [channel: steel] Connecting to the Kafka cluster
2019-07-29 14:46:50.931 UTC [orderer/consensus/kafka] try -> DEBU 3c11 [channel: steel] Need to retry because process failed = kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
2019-07-29 14:46:56.967 UTC [common/deliver] Handle -> WARN 3c12 Error reading from 10.0.0.4:32800: rpc error: code = Canceled desc = context canceled
2019-07-29 14:46:56.967 UTC [orderer/common/server] func1 -> DEBU 3c13 Closing Deliver stream
2019-07-29 14:46:56.972 UTC [orderer/common/server] Deliver -> DEBU 3c14 Starting new Deliver handler
2019-07-29 14:46:56.972 UTC [common/deliver] Handle -> DEBU 3c15 Starting new deliver loop for 10.0.0.4:32802
2019-07-29 14:46:56.973 UTC [common/deliver] Handle -> DEBU 3c16 Attempting to read seek info message from 10.0.0.4:32802
2019-07-29 14:46:56.973 UTC [common/deliver] deliverBlocks -> WARN 3c17 [channel: steel] Rejecting deliver request for 10.0.0.4:32802 because of consenter error
2019-07-29 14:46:56.973 UTC [common/deliver] Handle -> DEBU 3c18 Waiting for new SeekInfo from 10.0.0.4:32802
2019-07-29 14:46:56.973 UTC [common/deliver] Handle -> DEBU 3c19 Attempting to read seek info message from 10.0.0.4:32802
2019-07-29 14:46:56.995 UTC [common/deliver] Handle -> WARN 3c1a Error reading from 10.0.0.23:49844: rpc error: code = Canceled desc = context canceled
2019-07-29 14:46:56.995 UTC [orderer/common/server] func1 -> DEBU 3c1b Closing Deliver stream
And this is from the endorser peer’s log:
2019-07-29 15:14:17.829 UTC [ConnProducer] DisableEndpoint -> WARN 3d6 Only 1 endpoint remained, will not black-list it
2019-07-29 15:14:17.834 UTC [blocksProvider] DeliverBlocks -> WARN 3d7 [steel] Got error &{SERVICE_UNAVAILABLE}
2019-07-29 15:14:27.839 UTC [blocksProvider] DeliverBlocks -> WARN 3d8 [steel] Got error &{SERVICE_UNAVAILABLE}
I use these docker images:
hyperledger/fabric-kafka:0.4.10
hyperledger/fabric-orderer:1.2.0
hyperledger/fabric-peer:1.2.0
Based on the above, I assume that consistency between the orderer and the corresponding Kafka topic is broken. It also doesn't help if I redirect requests to another orderer or force a change of the Kafka topic leader. Is it correct that if KAFKA_LOG_RETENTION_MS=-1 had been set, this error would probably have been prevented?
After reviewing the archives, I found that it is not possible to fix this error. As I see it, I can't shut down only one channel, and I would even have to stop all the peers subscribed to the channel if I don't want continuous error messages in the orderer logs. What is the best practice in cases like mine?
Regards;
Sandor
The "consenter error" means the request arrived before the connection between Kafka and the orderer was established; in other words, there is a problem between Kafka and the orderer, most likely an unstable or misconfigured connection.
Check the orderer logs: whenever the orderer connects to Kafka it posts a message to the channel's topic, and a log entry confirming a successful post is the sign that the connection is configured correctly. Make sure the connection between Kafka and the orderer is configured correctly.
2019-07-29 14:46:50.931 UTC [orderer/consensus/kafka] try -> DEBU 3c11 [channel: steel] Need to retry because process failed = kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
Based on that log line, the problem is entirely on the Kafka side and has nothing to do with the orderer itself, so that is where to start looking.
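On the KAFKA_LOG_RETENTION_MS=-1 question in the post: the Fabric documentation for the Kafka-based ordering service does recommend disabling time-based retention precisely so that topic offsets are never pruned, which is the failure mode the "offset is outside the range" message points at. A hedged sketch of the broker settings that documentation recommends, expressed as environment variables for the fabric-kafka image (verify the exact names and values against the docs for your release):
# Kafka broker settings recommended by the Fabric docs for the ordering service
KAFKA_LOG_RETENTION_MS=-1              # disable time-based retention so offsets are never pruned
KAFKA_UNCLEAN_LEADER_ELECTION_ENABLE=false
KAFKA_MIN_INSYNC_REPLICAS=2
KAFKA_DEFAULT_REPLICATION_FACTOR=3
KAFKA_MESSAGE_MAX_BYTES=103809024      # keep in step with the orderer's AbsoluteMaxBytes
KAFKA_REPLICA_FETCH_MAX_BYTES=103809024
This will not repair a channel whose offsets have already been pruned, but it prevents the situation from arising on a fresh deployment.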

Kafka gives Invalid receive size with Hyperledger Fabric Orderer connection

I was setting up a new cluster for Hyperledger Fabric on EKS. The cluster has 4 Kafka nodes, 3 ZooKeeper nodes, 4 peers, 3 orderers and 1 CA. All the containers come up individually, and the Kafka/ZooKeeper backend is stable. I can SSH into any Kafka or ZooKeeper node and check connections to any other node, create topics, post messages, etc. Kafka is reachable via telnet from all orderers.
When I try to create a channel I get the following error from the orderer:
2019-04-25 13:34:17.660 UTC [orderer.common.broadcast] ProcessMessage -> WARN 025 [channel: channel1] Rejecting broadcast of message from 192.168.94.15:53598 with SERVICE_UNAVAILABLE: rejected by Consenter: backing Kafka cluster has not completed booting; try again later
2019-04-25 13:34:17.660 UTC [comm.grpc.server] 1 -> INFO 026 streaming call completed grpc.service=orderer.AtomicBroadcast grpc.method=Broadcast grpc.peer_address=192.168.94.15:53598 grpc.code=OK grpc.call_duration=14.805833ms
2019-04-25 13:34:17.661 UTC [common.deliver] Handle -> WARN 027 Error reading from 192.168.94.15:53596: rpc error: code = Canceled desc = context canceled
2019-04-25 13:34:17.661 UTC [comm.grpc.server] 1 -> INFO 028 streaming call completed grpc.service=orderer.AtomicBroadcast grpc.method=Deliver grpc.peer_address=192.168.94.15:53596 error="rpc error: code = Canceled desc = context canceled" grpc.code=Canceled grpc.call_duration=24.987468ms
And the Kafka leader reports the following error:
[2019-04-25 14:07:09,453] WARN [SocketServer brokerId=2] Unexpected error from /192.168.89.200; closing connection (org.apache.kafka.common.network.Selector)
org.apache.kafka.common.network.InvalidReceiveException: Invalid receive (size = 369295617 larger than 104857600)
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:132)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:93)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:231)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:192)
at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:528)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:469)
at org.apache.kafka.common.network.Selector.poll(Selector.java:398)
at kafka.network.Processor.poll(SocketServer.scala:535)
at kafka.network.Processor.run(SocketServer.scala:452)
at java.lang.Thread.run(Thread.java:748)
[2019-04-25 14:13:53,917] INFO [GroupMetadataManager brokerId=2] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
The error indicates that the broker is receiving a request larger than the permitted maximum size, which defaults to ~100 MB. Try increasing the following property in the server.properties file so that it can accommodate a larger receive (in this case at least 369295617 bytes):
# Set to 500MB
socket.request.max.bytes=500000000
and then restart your Kafka Cluster.
If this doesn't work for you, then my guess is that you are trying to connect to a non-SSL listener. In that case, verify that the broker's SSL listener port is 9092 (or the corresponding port if you are not using the default). The following should do the trick:
listeners=SSL://:9092
advertised.listeners=SSL://:9092
inter.broker.listener.name=SSL
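One supporting detail for the non-SSL-listener theory (not part of the original answer): Kafka reads the first four bytes of whatever arrives on the socket as the request size, and 369295617 happens to decode as the beginning of a TLS handshake record, which is exactly what a TLS client (here, the orderer with TLS enabled) would send to a plaintext port:
369295617 = 0x16030101
  0x16   -> TLS record type 22 (handshake, i.e. a ClientHello)
  0x0301 -> record-layer version TLS 1.0
  0x01.. -> first byte of the record length field
So if bumping socket.request.max.bytes does not help, check whether the orderer speaks TLS while the broker listener it targets is plaintext (or vice versa).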