Nifi fail to initialize processor when one node in a cluster goes down - apache-zookeeper

I have set up 2 node Nifi cluster with embedded zookeeper. It was working fine. Today in the morning due to some reason Node1 went down(It is not available now). Now on node2 nifi is running and it is showing disconnected from cluster(see the screenshot):
Now if I execute this flow nifi is generating below error:
ListSFTP[id=174ee859-f525-316f-8160-5a745a63d3cf] Failed to properly
initialize Processor. If still scheduled to run, NiFi will attempt to
initialize and run the Processor again after the 'Administrative Yield
Duration' has elapsed. Failure is due to java.io.IOException: Failed
to obtain value from ZooKeeper for component with ID
174ee859-f525-316f-8160-5a745a63d3cf with exception code
CONNECTIONLOSS: java.io.IOException: Failed to obtain value from
ZooKeeper for component with ID 174ee859-f525-316f-8160-5a745a63d3cf
with exception code CONNECTIONLOSS
As far as I understand, if one node goes down in the nifi cluster other node will become the primary node and do the work then why I am receiving this error.
Please Note that this flow was working fine when all the nodes were up in cluster.

Related

ADF Dataflow stuck IN progress and fail with below errors

ADF Pipeline DF task is Stuck in Progress. It was working seamlessly last couple of months but suddenly Dataflow stuck in progress and Time out after certain time. We are using IR managed Virtual Network. I am using forereach loop to run data flow for multiple entities parallel, it always randomly get stuck on last Entity.
What can I try to resolve this?
Error in Dev Environment
Error Code 4508
Spark cluster not found
Error in Prod Environment:
Error code
5000
Failure type
User configuration issue
Details
[plugins.*** ADF.adf-ir-001 WorkspaceType:<ADF> CCID:<f289c067-7c6c-4b49-b0db-783e842a5675>] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
Images:
I tried below steps:
By changing IR configuring as below
Tried DF Retry and retry Interval
Also, tried For each loop one batch at a time instead of 4 batch parallel. None of the above trouble-shooting steps worked. These PL is running last 3-4 months without a single failure, suddenly they started to fail last 3 days consistently. DF flow always stuck in progress randomly for different entity and times out in one point by throwing above errors.
Error Code 4508 Spark cluster not found.
This error can cause because of two reasons.
The debug session is getting closed till the dataflow finish its transformation in this case recommendation is to restart the debug session
the second reason is due to resource problem, or an outage in that particular region.
Error code 5000 Failure type User configuration issue Details [plugins.*** ADF.adf-ir-001 WorkspaceType: CCID:] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
A temporary error is one that says "Livy job state dead caused by unknown error." At the backend of the dataflow, a spark cluster is used, and this error is generated by the spark cluster. to get the more information about error go to StdOut of sparkpool execution.
The backend cluster may be experiencing a network problem, a resource problem, or an outage.
If error persist my suggestion is to raise Microsoft support ticket here

EMR cluster automatically terminating after few days

I've a AWS EMR cluster executing a spark streaming job. It takes streaming data from Kinesis stream and process it. It works fine for few days but after 12-15 days the cluster terminates automatically. I checked in events tab, it shows
cluster has terminated with errors with a reason of STEP_FAILURE.
Anyone has any idea why step failure can occur when the step successfully ran for few days ?
Go to the EMR console, and check the step option. If it is set as follows:
Action on failure:Terminate cluster
then the cluster will be terminated when the step failed.

Warning when deleting processors using Nifi 1.6

I recently upgraded the test environment to Nifi 1.6 on Centos 7. I am running a three node cluster using the Nifi internal zookeeper version. If I drag in a new processor and then delete it I get the following stack trace in the logs.
WARN StandardStateManagerProvider Component with ID {} was removed from Nifi instance but failed to clear clustered state for the component.
java.lang.IllegalArgumentException: Path must start with / character
at org.apache.zookeeper.common.PathUtils.validationPath(PathUtils.java:51)
at org.apache.zookeeper.Zookeeper.setData(ZooKeeper.java:1257).

how to update cluster status in dataproc

I changed my initialization script after creating a cluster with 2 worker nodes for spark. Then I changed the script a bit and tried to update the cluster with 2 more worker nodes. The script failed because I simply forgot to apt-get update before apt-get install, so dataproc reports error and the cluster's status changed to ERROR. When I try to reduce the size back to 2 nodes again, it doesn't work anymore with the following message
ERROR: (gcloud.dataproc.clusters.update) Cluster 'cluster-1' must be running before it can be updated, current cluster state is 'ERROR'.
The two worker nodes are still added, but they don't seem to be detected by a running spark application at first because no more executors are added. I manually reset the two instances on the Google Compute Engine page, and then 4 executors are added. So it seems everything is working fine again except that the cluster's status is still ERROR, and I cannot increase or decrease the number of worker nodes anymore.
How can I update the cluster status back to normal (RUNNING)?
In your case ERROR indicates that workflow to re-configure the cluster has failed, and Dataproc is not sure of its health. At this point Dataproc cannot guarantee that another reconfigure attempt will succeed so further updates are disallowed. You can however submit jobs.
Your best bet is to delete it and start over.

Mesos-master: Shutdown failed on fd=25: Transport endpoint is not connected [107]

When I run 3 mesos-master with QUORUM=2, they fail 1 minute after being elected as the leader, giving errors:
E1015 11:50:35.539562 19150 socket.hpp:174] Shutdown failed on fd=25: Transport endpoint is not connected [107]
E1015 11:50:35.539897 19150 socket.hpp:174] Shutdown failed on fd=24: Transport endpoint is not connected [107]
They keep electing one another in a loop, consistently failing and re-electing.
If I set QUORUM=1, everything works well. What could be the reason for this?
One problem was that AWS firewall was blocking reaching public IPs of the server and zookeeper was broadcasting public IP (set in advertise_ip) so nobody was able to connect each other. Slaves also couldn't connect to the masters with the same error.
When I set local IP to advertise_ip (so that Zookeeper broadcasted local IPs), masters could communicate and QUORUM=2 worked. When I removed the firewall rule, slaves could connect to the master.
We had a similar problem yesterday, marathon was a little weird because some applications were not been deployed. The strange was that the application goes up but the health check never turns green, and so nixy wasn't updating nginx.
After a lot of investigation we came to this very same error:
E0718 18:51:05.836688 5049 socket.hpp:107] Shutdown failed on fd=46: Transport endpoint is not connected [107]
In the end we discovery that the problem was in the election, even that our QUORUM=1 (we have 2 masters) somehow it looses itself and one master wasn't communicating with the other.
To solve this we triggered a new election using Marathon API /v2/leader DELETE method and everything worked fine after that.
We had the same problem, the mesos-master log flooding with messages like:
mesos-master[27499]: E0616 14:29:39.310302 27523 socket.hpp:174] Shutdown failed on fd=67: Transport endpoint is not connected [107]
Turned out it was the loadbalancers health check to /stats.json