Druid services going down without any specific error - druid

I see these log messages before the services go down:
2020-07-26T18:48:25,466 INFO [Thread-131] org.apache.druid.java.util.common.lifecycle.Lifecycle - Lifecycle [module] running shutdown hook
2020-07-26T18:48:25,468 INFO [Thread-131] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [ANNOUNCEMENTS]
2020-07-26T18:48:25,469 INFO [Thread-131] org.apache.druid.curator.announcement.Announcer - Unannouncing
This happens for all of the services, which go down after 2-3 hours. I have set the log level to debug and still do not see any specific errors.
My services are configured like this:
The ZooKeeper cluster runs on 3 nodes, separate from the Druid cluster nodes.
The Druid 0.18.1 cluster runs on 6 nodes:
2 for data services - i3.4xlarge
2 for master - m5.2xlarge
2 for query - m5.2xlarge
The Overlord runs within the Coordinator.
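For what it's worth, the "running shutdown hook" line above is printed from a JVM shutdown hook, and shutdown hooks only run when the process is asked to stop (SIGTERM/SIGINT from the OS, a supervisor, or a deploy script) or calls System.exit() itself - they don't run on a hard kill or a crash. A minimal Java sketch of that pattern (illustrative only, not Druid's actual Lifecycle code):

public class ShutdownHookDemo {
    public static void main(String[] args) throws InterruptedException {
        // Runs only on SIGTERM/SIGINT or System.exit(); never on SIGKILL or a JVM crash.
        Runtime.getRuntime().addShutdownHook(new Thread(() ->
                System.out.println("Lifecycle running shutdown hook, stopping stages...")));
        System.out.println("Service started; send SIGTERM (kill <pid>) to trigger the hook.");
        Thread.currentThread().join(); // block forever, like a long-running service
    }
}

If that is what is happening here, the cause is likely outside the Druid logs (whatever is sending the signal), which would explain why DEBUG logging shows nothing.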
Any ideas?

Related

Wildfly Singleton Service election in cluster gives Out of Memory/Metaspace Issues

WildFly re-runs its singleton service election in the cluster around midnight or midday, like this:
2022-03-18 00:00:07,151 INFO [org.wildfly.clustering.server] (LegacyDistributedSingletonService - 1) WFLYCLSV0003: alp-esb-app02:masterdata-batch-02 elected as the singleton provider of the jboss.deployment.unit."masterdata-emp-org-powerdata-1.4.war".installer service
Across the 3 cluster nodes, many integrations jump from node 02 to node 03 and vice versa, and in between we run into Metaspace errors. Basically it undeploys one integration and deploys another integration from the other server.
Why does this behavior occur, why does it always involve Metaspace, and how can it be fixed?
This would happen if one of your cluster members is removed from the cluster (based on the criteria of your JGroups failure-detection configuration). Your logs should indicate the reason.
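If you want to confirm that, one way is to watch the cluster view changes directly. A minimal standalone JGroups sketch (default protocol stack and a placeholder cluster name - to join your actual WildFly cluster it would need the same stack configuration; running two copies of it and killing one reproduces the same kind of member-removal event that triggers the singleton re-election):

import org.jgroups.JChannel;
import org.jgroups.ReceiverAdapter;
import org.jgroups.View;

public class ViewLogger {
    public static void main(String[] args) throws Exception {
        // Default protocol stack; adjust to match the cluster you want to observe.
        JChannel channel = new JChannel();
        channel.setReceiver(new ReceiverAdapter() {
            @Override
            public void viewAccepted(View view) {
                // Printed every time a member joins or is dropped by failure detection.
                System.out.println("New cluster view: " + view.getMembers());
            }
        });
        channel.connect("view-logger-demo"); // placeholder cluster name
        Thread.currentThread().join();
    }
}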

Kubernetes nodes on different datacenters causes slowness

We have 2 datacenters (A and B) in 2 different regions (Eastern and Central). I've set up a single K8S cluster where the master is in datacenter A and there are a few nodes in both datacenters. Our application consists of 3 images (AppServer, DB, ReportingServer). Consider these 2 scenarios when the application is deployed:
1 - All 3 pods are created on nodes belonging to datacenter A. Everything works fine.
2 - The DB pod or the AppServer pod is created on a node belonging to datacenter B and the other 2 pods are created in datacenter A. In this case the application is very slow: it takes 10-15 minutes to reach the running state (instead of 2-3 minutes), the login page loads very slowly, and logging in to the application usually throws an error due to a timeout.
My question: Is it normal for K8S to behave like this if nodes are in different datacenters? Or is my setup wrong?

Microsoft Service Fabric - fabric:/System/ImageStoreService not running

I am trying to copy an app to the service fabric image store.
I am not able to copy the application via VS or PowerShell (probably because fabric:/System/ImageStoreService is in the Error state). The operation times out when done using Visual Studio, and when done using PowerShell it just stays stuck indefinitely.
I don't know how to approach services that are not running on the Service Fabric Cluster. I have other services failing on the cluster as well - this is a new test cluster created using the Azure portal yesterday (see attached screenshot).
Error event: SourceId='System.FM', Property='State'. Partition is below target replica or instance count.
ImageStoreService 3 3 00000000-0000-0000-0000-000000003000
N/P InBuild _nodr_0 131636712421204228
(Showing 1 out of 1 replicas. Total available replicas: 0)

Running a Flink job when a TaskManager is killed / lost

What I want to achieve is a Flink cluster that will automatically re-allocate resources to run the job when there is a resource interruption, e.g. a Kubernetes pod scale-down or the loss of an existing TaskManager.
I tested with a Flink cluster of:
one JobManager, 2 TaskManagers (2 task slots each),
restart strategy fixedDelayRestart(2, 2000),
checkpointing and state configured to HDFS.
The job started with parallelism 4, which utilized all the available slots (roughly as in the sketch below).
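For reference, this is roughly how that setup looks in the job code (a sketch using the DataStream API; the checkpoint interval and the pipeline are placeholders, while the restart strategy and parallelism mirror the values above):

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.concurrent.TimeUnit;

public class JobSetupSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 2 restart attempts, 2000 ms between attempts (matching fixedDelayRestart(2, 2000) above)
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(2, Time.of(2000, TimeUnit.MILLISECONDS)));
        // Checkpointing enabled; the state backend / checkpoint directory point to HDFS in flink-conf.yaml
        env.enableCheckpointing(60_000);
        // Parallelism 4 consumes all 4 slots (2 TMs x 2 slots), so losing a TM leaves too few slots
        env.setParallelism(4);
        env.fromElements(1, 2, 3).print(); // placeholder pipeline
        env.execute("parallelism-4-sketch");
    }
}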
This cluster will later run on top of Kubernetes and be managed by autoscaling.
Scenario :
When I kill one of the TaskManagers, the Flink cluster runs with 1 JM and 1 TM; the job then restarts and eventually fails, as it starts from the previous state (parallelism 4) and complains about unavailable resources in the Flink cluster.
Is there a way for me to restart the job by dynamically re-allocating the available resources instead of using the previous state?
I'd appreciate it if someone could shed some light on this.

Mesos cluster fails to elect master when using replicated_log

Test environment: multi-node mesos 0.27.2 cluster on AWS (3 x masters, 2 x slaves, quorum=2).
Tested persistence with zkCli.sh and it works fine.
If I start the masters with --registry=in_memory, it works fine: a master is elected and I can start tasks via Marathon.
If I use the default (--registry=replicated_log), the cluster fails to elect a master:
https://gist.github.com/mitel/67acd44408f4d51af192
EDIT: apparently the problem was the firewall. I applied an allow-all type of rule to all my security groups and now I have a stable master. Once I figure out what was blocking the communication I'll post it here.
Discovered that Mesos masters also initiate connections to other masters on 5050. After adding the egress rule to the masters' security group, the cluster is stable and master election happens as expected.
UPDATE: for those who try to build an internal firewall between the various components of mesos/zk/.. - don't do it. It's better to design the security as in Mesosphere's DC/OS.
First off, let me briefly clarify the flag's meaning for posterity. --registry does not influence leader election; it specifies the persistence strategy for the registry (where Mesos tracks data that should be carried over across failovers). The in_memory value should not be used in production and may even be removed in the future.
Leader election is performed by ZooKeeper. According to your log, you use the following ZooKeeper cluster: zk://10.1.69.172:2181,10.1.9.139:2181,10.1.79.211:2181/mesos.
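As an aside (and not Mesos's actual implementation): the election amounts to each master contending for an ephemeral znode under the configured ZooKeeper path, and a new leader is chosen as soon as the current holder's session goes away. A minimal Curator sketch of that idea, with a placeholder election path:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        // ZooKeeper ensemble taken from the question's zk:// URL
        String zk = "10.1.69.172:2181,10.1.9.139:2181,10.1.79.211:2181";
        CuratorFramework client = CuratorFrameworkFactory.newClient(zk, new ExponentialBackoffRetry(1000, 3));
        client.start();
        try (LeaderLatch latch = new LeaderLatch(client, "/leader-election-demo")) { // placeholder path
            latch.start();
            latch.await(); // blocks until this contender holds the leadership znode
            System.out.println("Elected leader: " + latch.getId());
            Thread.sleep(60_000); // hold leadership for a while; closing the latch releases it
        }
        client.close();
    }
}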
Now, from your log, the cluster did not fail to elect a master; it actually did so twice:
I0313 18:35:28.257139 3253 master.cpp:1710] The newly elected leader is master@10.1.69.172:5050 with id edd3e4a7-ede8-44fe-b24c-67a8790e2b79
...
I0313 18:35:36.074087 3257 master.cpp:1710] The newly elected leader is master@10.1.9.139:5050 with id c4fd7c4d-e3ce-4ac3-9d8a-28c841dca7f5
I can't say exactly why the leader was elected twice; for that I would need the logs from the 2 other masters as well. According to your log, the last elected master is at 10.1.9.139:5050, which is most probably not the one you provided the log from.
One suspicious thing I see in the log is that master IDs differ for the same IP:port. Do you have an idea why?
I0313 18:35:28.237251 3244 master.cpp:374] Master 24ecdfff-2c97-4de8-8b9c-dcea91115809 (10.1.69.172) started on 10.1.69.172:5050
...
I0313 18:35:28.257139 3253 master.cpp:1710] The newly elected leader is master@10.1.69.172:5050 with id edd3e4a7-ede8-44fe-b24c-67a8790e2b79