Parallelism static job with static work assignment - kubernetes

This document describes options for running parallel jobs. The one I'm interested in is "Single job with static work assignment" Where I would create a job, set parallelism > 0 and completions == count of my work items.
The problem is, I don't know how would the pods know which item they should process? environment variables look identical... not sure if k8s passes some sequence number I can use... ideas?

The best approach in your case is to use some MQ services.
Your actions order may be the next:
Start MQ service
Create a queue and fill it with messages
Start a Job that works on tasks from the queue
Here you can find the latest documentation on Job Patterns.
Also look at kube examples that you can find here and here.
In these examples/tasks, Kubernetes uses RabbitMQ.

Related

How to configure channels and AMQ for spring-batch-integration where all steps are run as slaves on another cluster member

Followup to Configuration of MessageChannelPartitionHandler for assortment of remote steps
Even though the first question was answered (I think well), I think I'm confused enough that I'm not able to ask the right questions. Please direct me if you can.
Here is a sketch of the architecture I am attempting to build. Today, we have a job that runs a step across the cluster that works. We want to extend the architecture to run n (unbounded and different) jobs with n (unbounded and different) remote steps across the cluster.
I am not confusing job executions and job instances with jobs. We already run multiple job instances across the cluster. We need to be able to run other processes that are scalable in hte same way as the one we have defined already.
The source data is all coming from database which are known to the steps. The partitioner is defining the range of data for the "where" clause in the source database and putting that in the stepContext. All of the actual work happens in the stepContext. The jobContext simply serves to spawn steps, wait for completion, and provide the control API.
There will be 0 to n jobs running concurrently, with 0 to n steps from however many jobs running on the slave VM's concurrently.
Does each master job (or step?) require its own request and reply channel, and by extension its own OutboundChannelAdapter? Or are the request and reply channels shared?
Does each master job (or step?) require its own aggregator? By implication this means each job (or step) will have its own partition handler (which may be supported by the existing codebase)
The StepLocator on the slave appears to require a single shared replyChannel across all steps, but it appears to me that the messageChannelpartitionHandler requires a separate reply channel per step.
What I think is unclear (but I can't tell since it's unclear) is how the single reply channel is picked up by the aggregatedReplyChannel and then returned to the correct step. Of course I could be so lost I'm asking the wrong questions.
Thank you in advance
Does each master job (or step?) require its own request and reply channel, and by extension its own OutboundChannelAdapter? Or are the request and reply channels shared?
No, there is no need for that. StepExecutionRequests are identified with a correlation Id that makes it possible to distinguish them.
Does each master job (or step?) require its own aggregator? By implication this means each job (or step) will have its own partition handler (which may be supported by the existing codebase)
That should not be the case, as requests are uniquely identified with a correlation ID (similar to the previous point).
The StepLocator on the slave appears to require a single shared replyChannel across all steps, but it appears to me that the messageChannelpartitionHandler requires a separate reply channel per step.
The messageChannelpartitionHandler should be step or job scoped, as mentioned in the Javadoc (see recommendation in the last note). As a side note, there was an issue with message crossing in a previous version due to the reply channel being instance based, but it was fixed here.

k8s job with an unlimited/unknown number of work-items (completions)

I have a queue which filled (consistently) with work-item by one of my k8s pod.
I want to use k8s job to process each work item as I want each job to be handle by new pod as I want to use multiple container as suggest here, but jobs don't seem to support an infinite number of completions.
When I use spec.completions: null I got BackoffLimitExceeded.
Any Idea how to implement job without the need to specify the number of work item?
Is there alternative to job for implementing background worker in k8s ?
Thank you
My suggestion is to use Kubernetes resources in the way they have been designed: Job resources are one-off tasks that can ben triggered many times but they differ from the background jobs you're eager to implement.
If your application is popping jobs from a queue/backend, it's better to put in a Deployment with a for loop (YMMV according to programming language) and, eventually, scale it to down with another component if you don't want to allocate unused resources.
Another solution could be the specialization of each Job, using UUID as the name of the jobs and labelling them in order to group them: anyway, first suggestion of using Kubernetes in the Kubernetes way is strongly recommended.

Reliably running hundreds of scheduled functions every minute

I am building an application that will need to run hundreds of short running tasks every minute. These functions are not doing anything special other than making calls to an HTTP endpoint. I need a reliable mechanism for scheduling these invocations every minute indefinitely. Failures to run at the scheduled time cannot be tolerated. I have considered the following options for the scheduler:
AWS Lambda
Mesosphere Chronos
Cron
Python Celery
Obviously there is a trade off between cost, maintainability (I will need to update the logic of these functions every once in a while), and reliability.
My question is, which of these options would be the most appropriate if I am most concerned about consistency/reliability? Are there options I'm missing that I should consider?
As you already mentioned, there are multiple technologies that could help you do this, I would say that the trick is more to find the logic flow/model to use.
For example, If the number of tasks are not fixed, a publish/subscribe pattern could apply, for this something like rabbitMQ or AWS SQS could be used.
There are multiple ways about how to submit a task to the queue and also how to de-queue, you could have multiple workers reading/waiting for events in where they could read one by one or by chunks (based on the num of cores per server) all this bound to the speed and precision you may want.
Scaling I would say is easier since if need more speed (precision to do all tasks every minute) just need to add more workers.
For more ideas check this article Using AWS Lambda with Amazon DynamoDB it covers a stream-based model / event-sourcing.

Kubernetes - identical jobs, different parameters

What's the easiest way to run a configurable number of identical jobs on Kubernetes but give each of them different parameters (like job number)?
1) You could either just have a template job and use bash expansions to have multiple job specifications based of that initial template.
As shown in the official Parallel Processing using Expansions user guide:
mkdir ./jobs
for i in apple banana cherry
do
cat job.yaml.txt | sed "s/\$ITEM/$i/" > ./jobs/job-$i.yaml
done
kubectl create -f ./jobs
2) Or you could create a queue and have a specified number of parallel workers/jobs to empty the queue. The contents of the queue would then be the input for each worker and Kubernetes could spawn parallel jobs. That's best described in the Coarse Parallel Processing using a Work Queue user guide.
The first approach is simple and straight forward but lacks flexibility
The second requires a message queue as "overhead" but you'll gain flexibility

Apache Spark - How does internal job scheduler in spark define what are users and what are pools

I am sorry about being a little general here, but I am a little confused about how job scheduling works internally in spark. From the documentation here I get that it is some sort of implementation of Hadoop Fair Scheduler.
I am unable to come around to understand that who exactly are users here (are the linux users, hadoop users, spark clients?). I am also unable to understand how are the pools defined here. For example, In my hadoop cluster I have given resource allocation to two different pools (lets call them team 1 and team 2). But in spark cluster, wont different pools and the users in them instantiate their own spark context? Which again brings me to question that what parameters do I pass when I am setting property to spark.scheduler.pool.
I have a basic understanding of how driver instantiates a spark context and then splits them into task and jobs. May be I am missing the point completely here but I would really like to understand how Spark's internal scheduler works in context of actions, tasks and job
I find official documentation quite thorough and covering all your questions. However, one might find it hard to digest from the first time.
Let us put some definitions and rough analogues before we delve into details. application is what creates SparkContext sc and may be referred to as something you deploy with spark-submit. job is an action in spark definition of transformation and action meaning anything like count, collect etc.
There are two main and in some sense separate topics: Scheduling Across applications and Scheduling Within application. The former relates more to Resource Managers including Spark Standalone FIFO only mode and also concept of static and dynamic allocation.
The later, Scheduling Within Spark application is the matter of your question, as I understood from your comment. Let me try to describe what happens there at some level of abstraction.
Suppose, you submitted your application and you have two jobs
sc.textFile("..").count() //job1
sc.textFile("..").collect() //job2
If this code happens to be executed in the same thread there is no much interesting happening here, job2 and all its tasks get resources only after job1 is done.
Now say you have the following
thread1 { job1 }
thread2 { job2 }
This is getting interesting. By default, within your application scheduler will use FIFO to allocate resources to all the tasks of whichever job happens to appear to scheduler as first. Tasks for the other job will get resources only when there are spare cores and no more pending tasks from more "prioritized" first job.
Now suppose you set spark.scheduler.mode=FAIR for your application. From now on each job has a notion of pool it belongs to. If you do nothing then for every job pool label is "default". To set the label for your job you can do the following
sc.setLocalProperty("spark.scheduler.pool", "pool1").textFile("").count() // job1
sc.setLocalProperty("spark.scheduler.pool", "pool2").textFile("").collect() // job2
One important note here is that setLocalProperty is effective per thread and also all spawned threads. What it means for us? Well if you are within the same thread it means nothing as jobs are executed one after another.
However, once you have the following
thread1 { job1 } // pool1
thread2 { job2 } // pool2
job1 and job2 become unrelated in the sense of resource allocation. In general, properly configuring each pool in fairscheduler file with minShare > 0 you can be sure that jobs from different pools will have resources to proceed.
However, you can go even further. By default, within each pool jobs are queued up in a FIFO manner and this situation is basically the same as in the scenario when we have had FIFO mode and jobs from different threads. To change that you you need to change the pool in the xml file to have <schedulingMode>FAIR</schedulingMode>.
Given all that, if you just set spark.scheduler.mode=FAIR and let all the jobs fall into the same "default" pool, this is roughly the same as if you would use default spark.scheduler.mode=FIFO and have your jobs be launched in different threads. If you still just want single "default" fair pool just change config for "default" pool in xml file to reflect that.
To leverage the mechanism of pools you need to define the concept of user which is the same as setting "spark.scheduler.pool" from a proper thread to a proper value. For example, if your application listens to JMS, then a message processor may set the pool label for each message processing job depending on its content.
Eventually, not sure if the number of words is less than in the official doc, but hopefully it helps is some way :)
By default spark works with FIFO scheduler where jobs are executed in FIFO manner.
But if you have your cluster on YARN, YARN has pluggable scheduler, it means in YARN you can scheduler of your choice. If you are using YARN distributed by CDH you will have FAIR scheduler by deafult but you can also go for Capacity scheduler.
If you are using YARN distributed by HDP you will have CAPACITY scheduler by default and you can move to FAIR if you need that.
How Scheduler works with spark?
I'm assuming that you have your spark cluster on YARN.
When you submit a job in spark, it first hits your resource manager. Now your resource manager is responsible for all the scheduling and allocating resources. So its basically same as that of submitting a job in Hadoop.
How scheduler works?
Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, tasks slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time(using preemption killing all over used tasks). Unlike the default Hadoop scheduler(FIFO), which forms a queue of jobs, this lets short jobs finish in reasonable time while not starving long jobs. It is also a reasonable way to share a cluster between a number of users. Finally, fair sharing can also work with job priorities - the priorities are used as weights to determine the fraction of total compute time that each job should get.
The CapacityScheduler is designed to allow sharing a large cluster while giving each organization a minimum capacity guarantee. The central idea is that the available resources in the Hadoop Map-Reduce cluster are partitioned among multiple organizations who collectively fund the cluster based on computing needs. There is an added benefit that an organization can access any excess capacity no being used by others. This provides elasticity for the organizations in a cost-effective manner.
Spark internally uses FIFO/FCFS job scheduler. But, when you talk about the tasks, it works in a Round Robin fashion. It will be clear if we concentrate on the below example:
Suppose, the first job in Spark's own queue doesn't require all the resources of the cluster to be utilized; so, immediately second job in the queue will also start getting executed. Now, both jobs are running simultaneously. Each job has few tasks to be executed in order to execute the whole job. Assume, the first job assigns 10 tasks and the second one assigns 8. Then, those 18 tasks will share the CPU cycles of the whole cluster in a preemptive manner. If you want to further drill down, lets start with executors.
There will be few executors in the cluster. Assume the number is 6. So, in an ideal condition, each executor will be assigned 3 tasks and those 3 tasks will get same CPU time of the executors(separate JVM).
This is how spark internally schedules the tasks.