Config files in DataStage

We can have multiple config files in a project, and we can run a parallel job with different config files on different runs. But can a parallel job use multiple config files at the same time?

A job run uses exactly one configuration file, but you can limit the resources for certain stages by constraining them to certain nodes (node pools).
Check out the node pool and resource constraints on the Advanced tab of each stage.
This means you can prepare a single config file that supports multiple scenarios and multiple levels of parallelism.
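For illustration, a minimal configuration file with two nodes and named node pools might look like the following (the host names, pool names and disk paths are placeholders, not taken from the question):
{
    node "node1"
    {
        fastname "etlhost1"
        pools "" "extract"
        resource disk "/data/ds/node1/disk" {pools ""}
        resource scratchdisk "/data/ds/node1/scratch" {pools ""}
    }
    node "node2"
    {
        fastname "etlhost2"
        pools "" "transform"
        resource disk "/data/ds/node2/disk" {pools ""}
        resource scratchdisk "/data/ds/node2/scratch" {pools ""}
    }
}
A stage constrained to the extract pool on its Advanced tab would then run only on node1, while stages without a constraint stay in the default pool "" and can use both nodes.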

Related

Is it possible in Airflow to run a single task on multiple worker nodes, i.e. to run a task in a distributed way?

I am using Spring Batch to create a workflow of batch jobs. A single batch job takes about 2 hours to complete (roughly 1 million records to process), so I decided to run it in a distributed way, with one task spread across multiple worker nodes so that it finishes in a shorter time. The other jobs in the workflow (all of which also run in a distributed manner) need to run sequentially, one after another; they are multi-node distributed jobs (master/slave architecture).
Now I am considering deploying the workflow on Airflow. While exploring it, I could not find any way to run a single task distributed across multiple machines. Is this possible in Airflow?
Yes, you can create a task using the Spark framework; Spark lets you process the data on multiple nodes in a distributed fashion.
You can then use the SparkSubmitOperator to place that task in your DAG.
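A minimal sketch of such a DAG is shown below. The DAG name, application path and connection id are placeholders, and the import path of SparkSubmitOperator can differ between Airflow versions:
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Illustrative DAG: the heavy step is submitted to a Spark cluster,
# so Spark (not Airflow) distributes the work across its worker nodes.
with DAG(
    dag_id="distributed_batch_workflow",   # placeholder name
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    process_records = SparkSubmitOperator(
        task_id="process_records",
        application="/path/to/your_spark_job.py",  # placeholder path to the Spark application
        conn_id="spark_default",                   # Spark connection defined in Airflow
    )
Downstream jobs that must run sequentially can then be chained with the usual >> operator.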

Apache Spark standalone settings

I have an Apache Spark standalone setup.
I want to start 3 workers to run in parallel.
I use the commands below:
./start-master.sh
SPARK_WORKER_INSTANCES=3 SPARK_WORKER_CORES=2 ./start-slaves.sh
I tried to run a few jobs and checked the results in the Spark UI. Ignoring the last three applications, which failed, here are my questions:
Why is only one worker displayed in the UI, even though I asked Spark to start 3, each with 2 cores?
I want to partition my input RDD for better performance. For the first two jobs, with no explicit partitioning, the run time was 2.7 minutes. For those, my Scala source code had the following:
val tweets = sc.textFile("/Users/soft/Downloads/tweets").map(parseTweet).persist()
In my third job (4.3 minutes) I had the following:
val tweets = sc.textFile("/Users/soft/Downloads/tweets",8).map(parseTweet).persist()
I expected a shorter time with more partitions (8). Why did the opposite happen?
Apparently you have only one active worker; you need to investigate why the other workers are not reported by checking the Spark logs.
More partitions doesn't always mean the application runs faster. You need to check how you are creating partitions from the source data, how much data ends up in each partition, how much data is being shuffled, and so on.
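As a quick check (a sketch that assumes the tweets RDD from the question), you can print how many partitions Spark actually created and roughly how the records are spread across them:
// Number of partitions Spark created for the input
println(s"partitions = ${tweets.getNumPartitions}")
// Approximate record count per partition, to spot skew
tweets.mapPartitions(it => Iterator(it.size)).collect().foreach(println)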
In case you are running on a local machine, it is quite normal to start just a single worker with several CPUs, as shown in the output. It will still split your task over the available CPUs in the machine.
Partitioning your file happens automatically depending on the amount of available resources, and it works quite well most of the time. Spark (and partitioning the files) comes with some overhead, so often, especially on a single machine, Spark adds so much overhead that it slows down your process. The added value comes with large amounts of data on a cluster of machines.
Assuming that you are starting a standalone cluster, I would suggest using the configuration files to set up the cluster and start-all.sh to start it.
First, in spark/conf/slaves (copied from spark/conf/slaves.template), add the IPs (or server names) of your worker nodes.
Then configure spark/conf/spark-defaults.conf (copied from spark/conf/spark-defaults.conf.template). Set at least the master URL to the server that runs your master (see the example snippets after the spark-env.sh settings below).
Use spark-env.sh (copied from spark-env.sh.template) to configure the cores per worker, memory, etc.:
export SPARK_WORKER_CORES="2"
export SPARK_WORKER_MEMORY="6g"
export SPARK_DRIVER_MEMORY="4g"
export SPARK_REPL_MEM="4g"
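For illustration, the slaves and spark-defaults.conf files from the first two steps could look like this (the host names are placeholders for your own servers):
# conf/slaves: one worker host per line
worker1.example.com
worker2.example.com

# conf/spark-defaults.conf: point every node at the master
spark.master    spark://master.example.com:7077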
Since it is standalone (and not hosted on a Hadoop environment), you need to share (or copy) the configuration, or rather the complete Spark directory, to all nodes in your cluster. The data you are processing also needs to be available on all nodes, e.g. directly from a bucket or a shared drive.
As suggested by @skjagini, check out the various log files in spark/logs/ to see what's going on. Each node writes its own log files.
See https://spark.apache.org/docs/latest/spark-standalone.html for all options.
(We have had a setup like this running for several years and it works great!)

In a Spring-XD job module, how can I write data to a file at the same path on each of the two servers where XD containers are running?

My XD cluster has two node servers, so there are two XD admins and two XD containers. When I launch the job in the XD shell through one admin server, the job writes data to the file on one node only; on the other node, no data is written.
I want the job to write the data on both servers under the installation directory, e.g. /app/spring-xd-1.3.1.RELEASE/xd/files/.
In the job, I use the java.util.Properties class to write data to files:
// file is the path where I want to put the file
properties.store(new FileOutputStream(file), "update properties");
My job definition looks like this:
job create --name Generate_properties_to_file --definition "GenProperties --proFile=/app/spring-xd-1.3.1.RELEASE/xd/files/test.properties --otherparameters=''" --deploy
How should I modify the job module to implement this?

How can I identify processed files in a Dataflow job?

How can I identify already-processed files in a Dataflow job? I am using a wildcard to read files from Cloud Storage, but every time the job runs, it re-reads all the files.
This is a batch job, and the following is the sample TextIO read that I am using:
PCollection<String> filePColection = pipeline.apply("Read files from Cloud Storage ", TextIO.read().from("gs://bucketName/TrafficData*.txt"));
To see a list of files that match your wildcard you can use gsutil, the Cloud Storage command line utility. You'd do the following:
gsutil ls gs://bucketName/TrafficData*.txt
Now, when it comes to running a batch job multiple times, your pipeline has no way to know which files it has already analyzed. To avoid analyzing the same files again, you could do either of the following:
Define a streaming job and use TextIO's watchForNewFiles functionality (see the sketch after these two options). You would have to leave your job running for as long as you want to keep processing files.
Find a way to provide your pipeline with the files that have already been analyzed. For this, every time you run your pipeline you could generate a list of files to analyze, put it into a PCollection, read each with TextIO.readAll(), and store the list of analyzed files somewhere. Later, when you run your pipeline again, you can use this list as a blacklist of files that you don't need to read again.
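For the first option, a minimal sketch based on the read from your question could look like the following; the poll interval and the never-terminating condition are illustrative, and this turns the pipeline into a streaming one (it needs org.joda.time.Duration and org.apache.beam.sdk.transforms.Watch imported):
PCollection<String> filePColection = pipeline.apply("Watch Cloud Storage for new files",
    TextIO.read()
        .from("gs://bucketName/TrafficData*.txt")
        .watchForNewFiles(Duration.standardMinutes(5), Watch.Growth.never()));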
Let me know in the comments if you want to work out a solution around one of these two options.

Copy a file from HDFS to a local directory for multiple tasks on a node?

So, basically, I have a read-only file (several GB in size, so broadcasting is not an option) that must be copied to a local folder on the node, because each task internally runs a program (via os.system in Python or the ! operator in Scala) that reads from a local file (it can't read from HDFS). The problem, however, is that several tasks may be running on one node. If the file is not already present on the node, it should be copied from HDFS to a local directory. But how can I have one task fetch the file from HDFS while the other tasks on that node wait for it (note that the tasks run in parallel on the node)? Which file-synchronization mechanism could I use in Spark for this purpose?