Running certain Hadoop jobs only on a chosen node and not on the others, managing the process with an Oozie workflow

Is that even possible? I've searched quite a bit and I'd say it's not, but it seems strange to me that such basic functionality has not been foreseen.
If I have a cluster of 3 machines and one of them is dedicated to one part (let's say an action in Oozie) of the bigger process, can't I tell Oozie to run that job only on node X and not on the other nodes?

I do not think you can force the Oozie launcher mapper to run on a certain node.

Related

Does assigning more nodes to a job on a SLURM server increase available RAM?

I am working with a program that needs a lot of RAM. Currently I am running it on a SLURM cluster. Each node has 125GB RAM. When submitting the job to a single node it eventually fails as it runs out of memory. My rather naive question, as I am new to working on servers, is:
Does assigning more nodes with the --nodes flag increase available RAM for the submitted job?
For example:
When assigning 10 nodes instead of 1, with the command below, the program fails at the same point as with one node.
#SBATCH --nodes=10
Is there some other way to combine RAM from multiple nodes for a single job?
Any and all advice is welcome!
That depends on your program, but most likely no.
To use multiple nodes on a Slurm cluster (or any cluster, for that matter), your program needs to be set up in a very specific way, i.e. it needs inter-node communication. This is usually done via MPI, and the whole program has to be designed around it.
So if your program uses MPI, it may be able to split the workload over several nodes. Even that does not guarantee lower memory use per node, as that is usually not the goal of such a parallelization.
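To make that last point concrete, here is a rough back-of-the-envelope sketch in Python (it is not an MPI program). It assumes a hypothetical job whose memory footprint consists of a part that is divided among the nodes and a part that every node must hold in full. The function name and both example jobs are made up for illustration; only the 125 GB node size comes from the question.

```python
def per_node_memory_gb(partitioned_gb: float, replicated_gb: float, nodes: int) -> float:
    """Estimate the memory needed on each node for a hypothetical MPI-style job.

    partitioned_gb: data that is divided evenly among the nodes
    replicated_gb:  data of which every node must hold a full copy
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return partitioned_gb / nodes + replicated_gb


NODE_RAM_GB = 125  # per-node RAM from the question

# Hypothetical job A: 200 GB that can be split, 10 GB replicated on every node.
# Hypothetical job B: 20 GB that can be split, 190 GB replicated on every node.
for name, split_gb, repl_gb in [("A", 200, 10), ("B", 20, 190)]:
    for nodes in (1, 10):
        need = per_node_memory_gb(split_gb, repl_gb, nodes)
        verdict = "fits" if need <= NODE_RAM_GB else "does not fit"
        print(f"job {name}, {nodes:2d} node(s): {need:6.1f} GB per node -> {verdict}")
```

In this sketch job A fits once its data is spread over ten nodes, while job B never does because most of its memory is replicated. A program that was not written for multiple nodes cannot use the memory of the other nodes at all.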

Is it possible to run a single container Flink cluster in Kubernetes with high-availability, checkpointing, and savepointing?

I am currently running a Flink session cluster (Kubernetes, 1 JobManager, 1 TaskManager, Zookeeper, S3) in which multiple jobs run.
As we are working on adding more jobs, we are looking to improve our deployment and cluster management strategies. We are considering migrating to job clusters; however, there are reservations about the number of containers that will be spawned. One container per job is not an issue, but two containers (1 JM and 1 TM) per job raises concerns about memory consumption. Several of the jobs need high availability and the ability to use checkpoints and to take and restore from savepoints, as they aggregate events over a window.
From my reading of the documentation and spending time on Google, I haven't found anything that seems to state whether or not what is being considered is really possible.
Is it possible to do any of these three things:
run both the JobManager and TaskManager as separate processes in the same container and have that serve as the Flink cluster, or
run the JobManager and TaskManager as literally the same process, or
run the job as a standalone JAR with the ability to recover from/take checkpoints and the ability to take a savepoint and restore from that savepoint?
(If anyone has any better ideas, I'm all ears.)
One of the responsibilities of the job manager is to monitor the task manager(s), and initiate restarts when failures have occurred. That works nicely in containerized environments when the JM and TMs are in separate containers; otherwise it seems like you're asking for trouble. Keeping the TMs separate also makes sense if you are ever going to scale up, though that may be moot in your case.
What might be workable, though, would be to run the job using a LocalExecutionEnvironment (so that everything is in one process -- this is sometimes called a Flink minicluster). This path strikes me as feasible, if you're willing to work at it, but I can't recommend it. You'll have to somehow keep track of the checkpoints, and arrange for the container to be restarted from a checkpoint when things fail. And there are other things that may not work very well -- see this question for details. The LocalExecutionEnvironment wasn't designed with production deployments in mind.
What I'd suggest you explore instead is to see how far you can go toward making the standard, separate container solution affordable. For starters, you should be able to run the JM with minimal resources, since it doesn't have much to do.
Check this operator, which automates the lifecycle of deploying and managing Flink in Kubernetes. The project is in beta, but you can still get some idea of how to do it, or use the operator directly if it fits your requirements. Here the JobManager and TaskManager are separate Kubernetes deployments.

Apache Spark standalone settings

I have an Apache Spark standalone setup.
I wish to start 3 workers to run in parallel:
I use the commands below.
./start-master.sh
SPARK_WORKER_INSTANCES=3 SPARK_WORKER_CORES=2 ./start-slaves.sh
I tried to run a few jobs, and below are the Spark UI results:
Ignore the last three applications that failed. Below are my questions:
Why do I have just one worker displayed in the UI despite asking spark to start 3 each with 2 cores?
I want to partition my input RDD for better performance. For the first two jobs, with no explicit partitions, I had a time of 2.7 mins. Here my Scala source code had the following.
val tweets = sc.textFile("/Users/soft/Downloads/tweets").map(parseTweet).persist()
In my third job (4.3 min) I had the below:
val tweets = sc.textFile("/Users/soft/Downloads/tweets",8).map(parseTweet).persist()
I expected a shorter time with more partitions (8). Why was the result the opposite of what I expected?
Apparently you have only one active worker; you need to investigate why the other workers are not reported by checking the Spark logs.
More partitions don't always mean that the application runs faster. You need to check how you are creating partitions from the source data, how much data is partitioned, how much data is being shuffled, etc.
If you are running on a local machine, it is quite normal to start just a single worker with several CPUs, as shown in the output. It will still split your tasks across the available CPUs in the machine.
Partitioning your file will happen automatically depending on the amount of available resources, and it works quite well most of the time. Spark (and partitioning the files) comes with some overhead, so often, especially on a single machine, Spark adds so much overhead that it will slow down your process. The added value comes with large amounts of data on a cluster of machines.
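As a rough illustration of why more partitions can be slower, here is a toy cost model in Python. It is not Spark code and does not reproduce the timings from the question. It assumes that tasks run in waves of one task per core and that every task pays a fixed scheduling overhead; the function name and all numbers are hypothetical.

```python
import math


def estimated_runtime_s(total_work_s: float, partitions: int, cores: int,
                        per_task_overhead_s: float) -> float:
    """Toy estimate: one task per partition, executed in waves of `cores` tasks."""
    if partitions < 1 or cores < 1:
        raise ValueError("partitions and cores must be at least 1")
    work_per_task_s = total_work_s / partitions
    waves = math.ceil(partitions / cores)
    # Each wave lasts as long as one task: its share of the work plus the overhead.
    return waves * (work_per_task_s + per_task_overhead_s)


# Hypothetical job: 120 s of pure work, 2 cores, 5 s overhead per task.
for partitions in (1, 2, 8):
    runtime = estimated_runtime_s(120, partitions, cores=2, per_task_overhead_s=5)
    print(f"{partitions} partition(s): about {runtime:.0f} s")
```

Under these assumptions the runtime drops when going from one partition to one per core, and rises again with eight partitions on two cores, because the work per wave no longer shrinks while the per-task overhead keeps adding up.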
Assuming that you are starting a standalone cluster, I would suggest using the configuration files to set up the cluster and using start-all.sh to start it.
First, in your spark/conf/slaves (copied from spark/conf/slaves.template), add the IPs (or server names) of your worker nodes.
Configure spark/conf/spark-defaults.conf (copied from spark/conf/spark-defaults.conf.template). Set at least the master node to the server that runs your master.
Use the spark-env.sh (copied from spark-env.sh.template) to configure the cores per worker, memory etc:
export SPARK_WORKER_CORES="2"
export SPARK_WORKER_MEMORY="6g"
export SPARK_DRIVER_MEMORY="4g"
export SPARK_REPL_MEM="4g"
Since it is standalone (and not hosted on a Hadoop environment) you need to share (or copy) the configuration (or rather the complete spark directory) to all nodes in your cluster. Also the data you are processing needs to be available on all nodes e.g. directly from a bucket or a shared drive.
As suggested by @skjagini, check out the various log files in spark/logs/ to see what's going on. Each node will write its own log files.
See https://spark.apache.org/docs/latest/spark-standalone.html for all options.
(We have had a setup like this running for several years and it works great!)

Spark fails with too many open files on HDInsight YARN cluster

I am running into the same issue as in this thread with my Scala Spark Streaming application: Why does Spark job fail with "too many open files"?
But I am using Azure HDInsight to deploy my YARN cluster, and I don't think I can log into the machines and update the ulimit on all of them.
Is there any other way to solve this problem? I cannot reduce the number of reducers by too much either, or my job will become much slower.
You can ssh into all nodes from the head node (the Ambari UI shows the FQDN of all nodes).
ssh sshuser@nameofthecluster.azurehdinsight.net
You can then write a custom action that alters the settings on the necessary nodes if you want to automate this.

How to force condor to submit job to all nodes in the cluster?

I have a condor cluster with multiple nodes active.
But when I submit a job, it only runs on a single node (i.e. the master node). I'm aware that Condor automatically distributes jobs based on available resources.
But what if I want to force Condor to make use of all the nodes, just for the sake of comparing processing time on multiple nodes vs. a single node?
I have tried adding requirements = Machine == "hostname1" && Machine == "hostname2" in the submit file, but it isn't working.
Depending on what you're trying to do, you might want to use the parallel universe as outlined here: http://research.cs.wisc.edu/htcondor/manual/current/2_9Parallel_Applications.html
With a parallel universe job you indicate the machine count via machine_count and only need to queue a single task.
I am afraid that I do not fully understand what you are asking. Let's see if I can help somehow. I can see a few scenarios:
1. Condor is only scheduling your jobs to run on the master node, regardless of how many machines are available.
2. Condor is scheduling jobs on all available machines. However, what you are trying to do is get a particular job to make use of more than one machine.
In case 1, something fishy is going on with either your submit file or your pool setup. I will assume that condor_status returns more than one machine and that your pool setup is OK. The typical gotcha in this case is the following: if you do not specify a Requirement for your job, Condor will insert one for you. By default Condor will request that the job runs on a machine that has the same OS and architecture as the submit node. This one did bite me a few times with heterogeneous pools ;-)
In case 2, you will have to make sure that your executable can make use of multiple machines (e.g. by way of MPI) and you need to tell Condor about it. One way to do that is to use the Parallel universe. Another way is to use a classic master/worker architecture where the workers are persistent Condor jobs.
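A side note on the requirements expression from the question: Condor evaluates requirements against each candidate machine separately, and a machine advertises a single Machine value, so two different hostnames joined with && can never both match. The Python sketch below only mimics that boolean logic. It is not ClassAd syntax, and the three hostnames are placeholders.

```python
# Hypothetical pool: each machine advertises exactly one Machine name.
machines = ["hostname1", "hostname2", "hostname3"]


def matches_and(machine: str) -> bool:
    # Mirrors: Machine == "hostname1" && Machine == "hostname2"
    return machine == "hostname1" and machine == "hostname2"


def matches_or(machine: str) -> bool:
    # Mirrors: Machine == "hostname1" || Machine == "hostname2"
    return machine == "hostname1" or machine == "hostname2"


and_matches = [m for m in machines if matches_and(m)]
or_matches = [m for m in machines if matches_or(m)]

print("&& matches:", and_matches)
print("|| matches:", or_matches)
```

The && version matches no machine at all, while the || version matches either host. Note that even the || form only restricts where a job may run; a single job still lands on one machine, so it does not by itself spread the job over all nodes.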
Condor is limited in that it can only execute a command (much like system()). If your program does not create many subtasks, you will not experience any speed improvement.
Please post a short snippet of your job description (file).