How to get the worker info of a task id in JStorm?

When I'm implementing a CustomStreamGrouping, the WorkerTopologyContext context can give me getThisWorkerTasks() and the component of a given task id.
But I want the worker info for every task id, not only the tasks of this worker.
There seems to be no worker id that identifies a worker instance. So is there a way to get the hosts and ports of all the workers as well? And a way to find out which worker a given task id is running in?

Related

Is there a Cadence metric that can help spot overloads for each specific activity worker?

My company would like to automatically scale each activity worker and each workflow worker independently, according to the load of a tasklist.
Reading the docs I have found the following metrics for activity workers:
cadence_activity_scheduled_to_start_latency_bucket
cadence_activity_scheduled_to_start_latency_count
cadence_activity_scheduled_to_start_latency_sum
However these seem to be global metrics for activity workers. Is there a Cadence metric that would allow me to spot overloads for each specific activity worker?
Example:
We have 4 different activity workers: A, B, C and D
We would like to scale A, B, C or D independently, without impacting the others
Understand scheduled_to_start_latency
scheduled_to_start_latency measures the time from when a task is scheduled until a worker starts it. During that interval the task is transferred from the matching service to an activity worker.
These are the potential hotspots when this latency gets high:
The matching service is too hot to dispatch tasks -- in this case, check the CPU/memory of the matching nodes
The tasklist is overloaded because by default it has one partition, which is mapped to only one matching node: https://cadenceworkflow.io/docs/operation-guide/maintain/#scale-up-a-tasklist-using-scalable-tasklist-feature -- in this case, use the tasks-per-second metrics to confirm the task rate of the tasklist
The activity worker is overloaded.
How to monitor whether an activity worker is overloaded
CPU/memory/thread usage/garbage collection of the activity worker is usually enough to make sure a worker is not overloaded
You can also use scheduled_to_start_latency, but high latency could mean any of the things listed above. Use other metrics to rule out the other causes.
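As a rough illustration of reading the _sum and _count series per worker group, the sketch below computes the mean scheduled-to-start latency between two scrapes. It assumes the samples are already grouped by some label (for example a tasklist label); whether your metrics carry such a label depends on how the reporter is configured, so treat the keys "A" and "B" and the function name as hypothetical.

```python
def mean_latency(before, after):
    """Mean latency per group between two scrapes.

    `before` and `after` map a group key (assumed: one key per tasklist)
    to a (latency_sum, latency_count) pair of cumulative counters.
    """
    means = {}
    for key, (sum_after, count_after) in after.items():
        sum_before, count_before = before.get(key, (0.0, 0))
        started = count_after - count_before
        if started > 0:  # skip groups with no new tasks in the interval
            means[key] = (sum_after - sum_before) / started
    return means


# Hypothetical counter values for two groups at two scrape times.
earlier = {"A": (10.0, 100), "B": (5.0, 50)}
later = {"A": (70.0, 130), "B": (5.5, 60)}
print(mean_latency(earlier, later))
```

A group whose mean rises while the matching nodes stay healthy is a candidate for scaling, subject to the caveats listed above.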

Using Celery with multiple workers in different pods

I'm trying to use Celery with Kubernetes. I'm using Redis as the message broker in a separate pod, and I have multiple pods, one for each Celery queue.
Imagine I have 3 queues: I would have 3 different pods (i.e. workers) that can accept and handle the requests.
Everything is working fine so far, but my question is: what would happen if I clone the pod of one of the queues to have two pods for a single queue?
I think the client (i.e. Django) creates a new message through Redis to send to the worker and start the job, but it's not clear to me what happens when two pods are listening to the same queue. Does the first pod accept the request and start the job, and prevent the other pod from accepting it?
(I searched the Celery documentation for clues but couldn't find any. That's why I'm asking this question.)
I guess you are using the basic task type, which employs a 'direct' queue rather than a 'fanout' or 'topic' queue. The latter two behave quite differently and will not be discussed here.
When Redis is the broker transport, celery/kombu uses a Redis list object as the storage for the queue (source), uses the LPUSH command to publish a message, and BRPOP to consume it.
In short, BRPOP (doc) blocks the connection when there are no elements to pop from the given lists; if the list is not empty, an element is popped from the tail of the given list. This operation is guaranteed to be atomic: no two connections can get the same element.
Celery leverages this feature to guarantee at-least-once message delivery. Use of acknowledgment doesn't affect this guarantee.
In your case, there are multiple Celery workers across multiple pods, but all of them are connected to the same Redis server, all of them block on the same key, and all of them try to pop an element from the same list object. When a new message arrives, one and only one worker gets that message.
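The competing-consumer behaviour described above can be imitated in a single process. This is only an analogy, not Celery or Redis code: queue.Queue.get stands in for BRPOP (it blocks while the queue is empty and hands each item to exactly one caller), and the worker names are made up.

```python
import queue
import threading


def run_competing_workers(messages, worker_names):
    """Deliver each message to exactly one of several blocked consumers."""
    q = queue.Queue()
    handled = {name: [] for name in worker_names}
    stop = object()  # sentinel telling a worker to exit

    def worker(name):
        while True:
            msg = q.get()  # blocks while the queue is empty, like BRPOP
            if msg is stop:
                break
            handled[name].append(msg)

    threads = [threading.Thread(target=worker, args=(n,)) for n in worker_names]
    for t in threads:
        t.start()
    for m in messages:
        q.put(m)
    for _ in threads:
        q.put(stop)  # one sentinel per worker
    for t in threads:
        t.join()
    return handled


result = run_competing_workers(list(range(10)), ["pod-1", "pod-2"])
# Every message shows up once in total; how they split between pods varies.
print(sorted(result["pod-1"] + result["pod-2"]))
```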
A task message is not removed from the queue until that message has been acknowledged by a worker. A worker can reserve many messages in advance and even if the worker is killed – by power failure or some other reason – the message will be redelivered to another worker.
More: http://docs.celeryproject.org/en/latest/userguide/tasks.html
The two workers (pods) will receive tasks and complete them independently. It's like having a single pod, but processing tasks at twice the speed.

Distributed queue consumers in an unstable net

I'm working on the design of a distributed system. The system consists of multiple producers, a distributed queue and multiple consumers, aka workers.
Worker instances reside in datacentres in different locations. Sometimes one location is manually disconnected.
In such a case, the issue is that a worker in the disconnected location has taken a task from the queue and then shuts down before completing it. I want:
workers from a live location to be able to pick up such a task and complete it
when a disconnected worker finally comes back up, it should determine whether the task was already completed by another worker and decide what to do with it
What is a convenient way to solve such an issue?
This design might help you. Every time a worker consumes a task, move the task from the queue to a separate distributed list of consumed tasks. In this list, keep a timestamp with every task.
Then the worker that consumed the task should send some kind of still-alive message every second or so (similar to Hadoop's heartbeat message) that updates the timestamp of the task in the consumed-tasks list. This indicates that the worker that consumed the task is still alive and was heard from recently.
Now, implement a daemon that monitors the consumed-tasks list and moves tasks whose timestamp is older than a threshold number of seconds (allowing for message loss) back to the queue.
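The design above could be sketched roughly as follows. This is an in-memory stand-in, not a distributed implementation: the class and method names are hypothetical, and recording the owning worker next to the timestamp is an added assumption so that a worker returning after a disconnect can tell that its task was handed to someone else.

```python
import time
from collections import deque


class LeasedQueue:
    """In-memory stand-in for the queue plus the consumed-tasks list."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.pending = deque()
        self.consumed = {}  # task -> (worker, time of last heartbeat)
        self.timeout_s = timeout_s
        self.clock = clock

    def put(self, task):
        self.pending.append(task)

    def consume(self, worker):
        """Move the next task to the consumed list; None if the queue is empty."""
        if not self.pending:
            return None
        task = self.pending.popleft()
        self.consumed[task] = (worker, self.clock())
        return task

    def _owns(self, task, worker):
        return task in self.consumed and self.consumed[task][0] == worker

    def heartbeat(self, task, worker):
        """Refresh the timestamp; False means the task is no longer ours."""
        if not self._owns(task, worker):
            return False
        self.consumed[task] = (worker, self.clock())
        return True

    def complete(self, task, worker):
        """Remove a finished task; False means another worker took it over."""
        if not self._owns(task, worker):
            return False
        del self.consumed[task]
        return True

    def requeue_stale(self):
        """Daemon step: move tasks with an old timestamp back to the queue."""
        now = self.clock()
        stale = [t for t, (_, ts) in self.consumed.items()
                 if now - ts > self.timeout_s]
        for task in stale:
            del self.consumed[task]
            self.pending.append(task)
        return stale
```

A worker that reconnects would call heartbeat or complete first; a False result tells it the task was requeued, so it can drop or discard its partial work.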

Communication with two agents within a single block in Anylogic

As seen below in my flowchart I am trying to model jobs that are being sent to servers. In the service block, my resource pool is servers.
My current model has agents of type 'Jobs' created in the Source. They are then sent to the Queue and on to the Service block, where the Service block seizes a server (Server agent) from the resource pool.
I have set out my simulation so that servers are deleted at random times.
My trouble is: when a server that is currently working on a Job is deleted (at a random time), how can the Job be sent back to the queue?
I'm having trouble getting the service block/server pool to access the Jobs agent.
I'm not sure how you're deleting your servers, but if you're doing so by reducing the capacity of the resource pool, my answer will work as you desire.
To return the job to the queue, you'll first need some changes to your flowchart. (See Image)
Then, in your service block, change your settings to match mine:
And voilà, that's it. If you're using a different type of deletion and this approach doesn't work, let me know.
Cheers,
Luís Pereira

Spark: How to debug/log a task at specific index

I have a process that gets stuck at the same point. All I know is the task's index on the Details page (in the Dashboard UI).
How can I debug/log exactly the task at that specific index?
Based on the answer in:
How to get ID of a map task in Spark?
I can see how to get task info. But what are the IDs in the UI dashboard referred to in that object?
Is ID = org.apache.spark.scheduler.TaskInfo.id and Index = org.apache.spark.scheduler.TaskInfo.partitionId?
The IDs in the dashboard refer to partitions in Spark. Whenever a job is launched, your input data is partitioned and, depending on the number of partitions, the partitions are mapped to task IDs.
It's not a trivial task to debug Spark jobs, as they're map-reduce tasks over your data done by your algorithm. It's fairly easy, though, to add logs to debug your job after the fact. The logs would have to be collected on the workers, or in each executor's working directory.
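One way to add such logs for a single partition, sketched in Python for PySpark: RDD.mapPartitionsWithIndex passes each partition's index to the function, so the function can log only the records of the partition of interest. This assumes the Index shown in the UI matches that partition index, which is not guaranteed in every case (for example on stage retries). The helper name, the logger name and the index 7 are hypothetical.

```python
import logging


def make_partition_logger(target_index, sink=None):
    """Build a function for RDD.mapPartitionsWithIndex that logs one partition.

    `sink` is any callable taking a string; by default records are written
    with the standard logging module, so they end up in the executor logs.
    """
    def trace(index, rows):
        emit = sink or logging.getLogger("partition-debug").warning
        for position, row in enumerate(rows):
            if index == target_index:
                emit(f"partition {index}, record {position}: {row!r}")
            yield row  # pass every record through unchanged
    return trace


# Hypothetical usage on an existing RDD named `rdd`, for the task at index 7:
# traced = rdd.mapPartitionsWithIndex(make_partition_logger(7))
```

The last record logged before the task hangs narrows down which input triggers the problem.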