Aurora instances are always in PENDING status after submitting a Heron topology - scheduled-tasks

I deployed a Twitter Heron cluster with Aurora and Mesos. The components of the cluster are as follows:
Scheduler: Aurora scheduler
State Manager: zookeeper
Uploader: HDFS
The Aurora instances are always in PENDING status after I submitted the example topology named WordCountTopology. The following are screenshots of the running cluster.
Mesos agents:
Aurora scheduler:
Where is the problem? Is it that the machines' resources in the cluster cannot meet the needs of the topology's tasks? Thanks for your help.

Well, the message in the lower screenshot clearly indicates that you did not assign enough memory to the agents.
481 MB and 485 MB do not seem to be enough.

Definitely, it looks like a resourcing issue.


Airflow fault tolerance

I have 2 questions:
First, what does it mean that the Kubernetes executor is fault tolerant? In other words, what happens if one of the worker nodes goes down?
Second, is it possible that the whole Airflow server goes down? If yes, is there a backup that runs automatically to continue the work?
Note: I have started learning Airflow recently.
Thanks in advance
This is a theoretical question I ran into while learning Apache Airflow. I have read the documentation,
but it does not mention how fault tolerance is handled.
What does it mean that the Kubernetes executor is fault tolerant?
The Airflow scheduler uses a Kubernetes API watcher to watch the state of the worker pods (tasks) on each change in order to discover failed pods. When a worker pod goes down, the scheduler detects this failure and changes the state of the failed tasks in the metadata database; these tasks can then be rescheduled and executed based on their retry configuration.
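For instance, the retry behaviour is configured on the task itself. Here is a minimal Airflow 2.x sketch of a task that would be rescheduled after its pod dies; the DAG and task names are made up for illustration:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="retry_example",            # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,            # triggered manually
) as dag:
    BashOperator(
        task_id="flaky_task",          # hypothetical task name
        bash_command="exit 1",         # always fails, just to exercise the retry path
        retries=3,                     # rescheduled up to 3 times if the task fails or its pod is lost
        retry_delay=timedelta(minutes=5),
    )

If the pod running flaky_task is killed, the scheduler marks the task for retry and the Kubernetes executor launches a fresh pod for the next attempt.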
Is it possible that the whole Airflow server goes down?
Yes, it is possible for different reasons, and there are different solutions/tips for each one:
A problem with the metadata database: the most important part of Airflow is the metadata database. It is the central point used to communicate between the different schedulers and workers; it stores the state of all the DAG runs and tasks, shares messages between tasks, and holds variables and connections. When it goes down, everything fails:
You can use a managed service (AWS RDS or Aurora, GCP Cloud SQL or Cloud Spanner, ...)
You can deploy it on your K8s cluster, but in HA mode (see the PostgreSQL docs)
A problem with the scheduler: the scheduler runs as a pod, and there is a possibility of losing it, depending on how you deploy it:
Request enough resources (especially memory) to avoid OOM problems
Avoid running it on spot/preemptible VMs
Create multiple replicas (minimum 3) of the scheduler to activate HA mode; in this case, if one scheduler goes down, there will still be other schedulers up (see the scaling sketch after this list)
A problem with the webserver pod: it doesn't affect your workloads, but you will not be able to access the UI/API during the downtime:
Request enough resources (especially memory) to avoid OOM problems
It's a stateless service, so you can create multiple replicas without any problem; if one goes down, you can access the UI/API through the other replicas
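As a hedged sketch of the replica tip above, this is one way to scale the scheduler (or webserver) Deployment with the official kubernetes Python client; the Deployment and namespace names are assumptions, and a Helm values override or kubectl would do the same job:

# Sketch only: scale the Airflow scheduler Deployment to 3 replicas.
# "airflow-scheduler" and "airflow" are assumed names, adjust to your install.
from kubernetes import client, config

config.load_kube_config()                 # or config.load_incluster_config() inside the cluster
apps = client.AppsV1Api()
apps.patch_namespaced_deployment_scale(
    name="airflow-scheduler",
    namespace="airflow",
    body={"spec": {"replicas": 3}},
)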

Cassandra is logging URGENT_MESSAGES timeouts on a node

URGENT_MESSAGES-[no-channel] dropping message of type GOSSIP_DIGEST_SYN whose timeout expired before reaching the network
Thank you for your message. Yesterday we solved the problem.
The reason was a "dead node" obviously left over from a change in the Kubernetes deployment.
So always look out for dead nodes after changing something in the cluster deployment.
You didn't provide a lot of information, but I'm assuming that your cluster is running into a known issue where gossip messages are dropped during startup of a Cassandra node (CASSANDRA-16877).
The starting node sends GOSSIP_DIGEST_SYN with a high priority (URGENT_MESSAGES), but for large clusters, Cassandra 4.0 nodes cannot serialise the gossip state when its size exceeds 128 KB, so no acknowledgement gets sent. Since the node cannot gossip with other nodes, it fails to start.
This was urgently fixed in Cassandra 4.0.1 last year. Upgrade the binaries on the affected Cassandra 4.0 nodes and that should allow them to start successfully and join the ring. Cheers!

Apache flink on Kubernetes - Resume job if jobmanager crashes

I want to run a Flink job on Kubernetes, using a (persistent) state backend. It seems like crashing TaskManagers are no issue, as they can ask the JobManager which checkpoint they need to recover from, if I understand correctly.
A crashing JobManager seems to be a bit more difficult. On this FLIP-6 page I read that ZooKeeper is needed to know which checkpoint the JobManager needs to use to recover, and for leader election.
Seeing as Kubernetes will restart the JobManager whenever it crashes, is there a way for the new JobManager to resume the job without having to set up a ZooKeeper cluster?
The current solution we are looking at is to create a savepoint when Kubernetes wants to kill the JobManager (because it wants to move it to another VM, for example), but this would only work for graceful shutdowns.
Edit:
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-HA-with-Kubernetes-without-Zookeeper-td15033.html seems to be interesting but has no follow-up
Out of the box, Flink requires a ZooKeeper cluster to recover from JobManager crashes. However, I think you can have a lightweight implementation of the HighAvailabilityServices, CompletedCheckpointStore, CheckpointIDCounter and SubmittedJobGraphStore which can bring you quite far.
Given that you have only one JobManager running at all times (not entirely sure whether K8s can guarantee this) and that you have a persistent storage location, you could implement a CompletedCheckpointStore which retrieves the completed checkpoints from the persistent storage system (e.g. reading all stored checkpoint files). Additionally, you would have a file which contains the current checkpoint id counter for CheckpointIDCounter and all the submitted job graphs for the SubmittedJobGraphStore. So the basic idea is to store everything on a persistent volume which is accessible by the single JobManager.
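Just to make the file-based idea concrete, here is a minimal, language-agnostic sketch of a checkpoint ID counter kept on a persistent volume; a real implementation would be Java code against Flink's CheckpointIDCounter interface, and the path and function names below are hypothetical:

# Sketch only: the counter lives in a file on a persistent volume, so a
# restarted JobManager can continue from the last assigned checkpoint ID.
from pathlib import Path

COUNTER_FILE = Path("/flink-ha/checkpoint-id-counter")  # assumed persistent-volume mount

def get_and_increment() -> int:
    current = int(COUNTER_FILE.read_text()) if COUNTER_FILE.exists() else 1
    tmp = COUNTER_FILE.with_suffix(".tmp")
    tmp.write_text(str(current + 1))
    tmp.replace(COUNTER_FILE)  # atomic rename: a crash never leaves a half-written counter
    return current

The completed-checkpoint store and job graph store would follow the same pattern: write files to the persistent volume, and read them back when a fresh JobManager starts.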
I implemented a light version of file-based HA, based on Till's answer and Xeli's partial implementation.
You can find the code in this GitHub repo - it runs well in production.
I also wrote a blog series explaining how to run a job cluster on K8s in general, and about this file-based HA implementation specifically.
For everyone interested in this, I am currently evaluating and implementing a similar solution using Kubernetes ConfigMaps and a blob store (e.g. S3) to persist job metadata across JobManager restarts. There is no need for local storage, as the solution relies on state persisted in the blob store.
GitHub: thmshmm/flink-k8s-ha
There is still some work to do (persisting checkpoint state), but the basic implementation works quite nicely.
If someone would like to use multiple JobManagers, Kubernetes provides an interface for leader election which could be leveraged for this.

spark-shell on a multi-node Spark cluster fails to spawn an executor on a remote worker node

I installed a Spark cluster in standalone mode with 2 nodes: the Spark master runs on the first node and a Spark worker on the other. When I run spark-shell on the worker node with word count code, it runs fine, but when I run spark-shell on the master node, it gives the following output:
WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
The executor is not triggered to run the job. Even though there is a worker available to the Spark master, it gives this problem. Any help is appreciated, thanks.
You use client deploy mode, so the best bet is that the executor nodes cannot connect to the driver port on the local machine. It could be a firewall issue or a problem with the advertised IP / hostname. Please make sure that:
spark.driver.bindAddress
spark.driver.host
spark.driver.port
use the expected values. Please refer to the networking section of the Spark documentation.
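As a hedged example, here is how those properties can be pinned when building a session from PySpark; the master URL, host address, and port are placeholders for your environment (with spark-shell you would pass the same keys via --conf):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://master-host:7077")             # placeholder standalone master URL
    .config("spark.driver.bindAddress", "0.0.0.0")  # listen on all interfaces
    .config("spark.driver.host", "10.0.0.5")        # address the workers can actually reach (placeholder)
    .config("spark.driver.port", "40000")           # fixed port you can open in the firewall
    .getOrCreate()
)

# Quick smoke test: if the executors can connect back to the driver, this succeeds.
print(spark.sparkContext.parallelize(range(100)).count())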
Less likely, it is a lack of resources. Please check that you don't request more resources than the workers provide.

DRP for postgres-xl

I have installed and set up a 2-node cluster of Postgres-XL 9.2, where the coordinator and GTM are running on node1 and the datanode is set up on node2.
Now, before I use it in production, I have to deliver a DRP solution.
Does anyone have a DR plan for a Postgres-XL 9.2 architecture?
Best Regards,
Aviel B.
So from what you described, you only have one of each node... What are you expecting to recover to?
Postgres-XL is a clustered solution. If you only have one of each node, then you have no cluster; not only are you not getting any scaling advantage, it is actually going to run slower than standalone Postgres. Plus, you have nothing to recover to. If you lose either node, you have completely lost the database.
Also, the docs recommend putting the coordinator and data nodes on the same server if you are going to combine nodes.
So for the simplest solution in replication mode, you would need something like:
Server1 GTM
Server2 GTM Proxy
Server3 Coordinator 1 & DataNode 1
Server4 Coordinator 2 & DataNode 2
Postgres-XL has no failover support, so any failure will require manual intervention.
If you use the REPLICATION distribution option (DISTRIBUTE BY REPLICATION), you would just remove the failing node from the cluster and restart everything.
If you use any other DISTRIBUTE BY option, the data is spread over multiple nodes, which means that if you lose any node you lose everything. So for this option you will need a slave instance of every datanode and coordinator node you have. If one of the nodes fails, you would remove that node from the cluster, replace it with its slave backup node, and then restart it all.
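To make the distinction concrete, here is a hedged sketch of the two distribution modes discussed above, issued against a coordinator with psycopg2; the connection details and table definitions are assumptions, not taken from the original question:

import psycopg2

# Assumed connection details for Coordinator 1 from the layout above.
conn = psycopg2.connect(host="server3", dbname="mydb", user="postgres")

with conn, conn.cursor() as cur:
    # Replicated table: every datanode holds a full copy, so losing one
    # datanode does not lose data (recovery is still a manual step).
    cur.execute("""
        CREATE TABLE lookup_codes (code text PRIMARY KEY, label text)
        DISTRIBUTE BY REPLICATION
    """)
    # Hash-distributed table: rows are spread across datanodes, so every
    # datanode (or a slave copy of it) is needed to keep the full data set.
    cur.execute("""
        CREATE TABLE events (id bigint, payload text)
        DISTRIBUTE BY HASH (id)
    """)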