In a Spring-XD job module, how can I write data to a file at the same path on each of the two servers where XD containers are running? - spring-batch

My XD cluster has two node servers, so there are two XD admin servers and two XD containers. When I launch the job from the XD shell through one admin server, the job writes data to the file on one node only; on the other node, no data is written.
I want the job to write the data on both servers under the installation directory, for example /app/spring-xd-1.3.1.RELEASE/xd/files/.
In the job, I use the java.util.Properties class to write the data to a file:
// "file" is the path where the properties file should be written
try (FileOutputStream out = new FileOutputStream(file)) {
    properties.store(out, "update properties");
}
My job definition looks like this:
job create --name Generate_properties_to_file --definition "GenProperties --proFile=/app/spring-xd-1.3.1.RELEASE/xd/files/test.properties --otherparameters=''" --deploy
How can I modify the job module so that the file is written on both nodes?

Related

Config files in DataStage

We can have multiple config files in a project, and we can run a parallel job with different config files. But can a parallel job use multiple config files at the same time?
One job run uses exactly one configuration file, but you can limit resources for certain stages by running them on certain nodes (node pools).
Check out the node pool and resource constraints on the Advanced tab of the stages.
This means you can prepare a config file to support multiple scenarios and multiple levels of parallelism.
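For example, here is a minimal sketch of such a parallel-engine configuration file with a named node pool (the server names, directories, and the pool name "bigjobs" are purely illustrative):
{
  node "node1"
  {
    fastname "etlserver1"
    pools ""
    resource disk "/data/ds/d1" {pools ""}
    resource scratchdisk "/scratch/ds/s1" {pools ""}
  }
  node "node2"
  {
    fastname "etlserver2"
    pools "" "bigjobs"
    resource disk "/data/ds/d2" {pools ""}
    resource scratchdisk "/scratch/ds/s2" {pools ""}
  }
}
A stage constrained to the "bigjobs" pool on its Advanced tab would then run only on node2, while unconstrained stages run across both nodes.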

Reading a file from the local file system after reading it from the Hadoop file system

I am trying to read a file from my local EMR file system. It exists at /emr/myFile.csv. However, I keep getting a FileNotFoundException. Here is the line of code that I use to read it:
val myObj: File = new File("/emr/myFile.csv")
I also added a file://// prefix to the file path because I have seen that work for others, but that still did not work. So I also tried to read directly from the Hadoop file system, where it is stored at /emr/CNSMR_ACCNT_BAL/myFile.csv, because I thought it might be checking HDFS by default. However, that also results in a FileNotFoundException. Here is the code for that:
val myObj: File = new File("/emr/CNSMR_ACCNT_BAL/myFile.csv")
How can I read this file into a File?
For your first problem:
When you submit a Hadoop job, the application master can be created on any of your worker nodes, including the master node (depending on your configuration).
If you are using EMR, the application master is by default created on one of your worker (CORE) nodes, not on the master.
When you write file:///emr/myFile.csv, the file exists on your local file system (I'm assuming that means on the master node), but your program searches for it on the node where the application master is running, and that is evidently not your master node, because otherwise you wouldn't get any error.
For your second problem:
When you try to access a file in HDFS using java.io.File, you won't be able to reach it, because File only works with the local file system.
You need to use the Hadoop FileSystem API (org.apache.hadoop.fs.FileSystem) to interact with an HDFS file.
Also use an HDFS URI such as hdfs://<namenode>:<port>/emr/CNSMR_ACCNT_BAL/myFile.csv.
If your core-site.xml contains a value for fs.defaultFS, you don't need the namenode and port information; simply use hdfs:///emr/CNSMR_ACCNT_BAL/myFile.csv.
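For example, here is a minimal Scala sketch of reading that file through the FileSystem API (it assumes the Hadoop client libraries are on the classpath and that core-site.xml is picked up):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

val conf = new Configuration()   // loads core-site.xml, including fs.defaultFS
val fs = FileSystem.get(conf)
val in = fs.open(new Path("hdfs:///emr/CNSMR_ACCNT_BAL/myFile.csv"))
try {
  Source.fromInputStream(in).getLines().foreach(println)  // stream the CSV line by line
} finally {
  in.close()
}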
So what is the better option for accessing a file in a Hadoop cluster?
The answer depends on your use case, but in most cases putting the file in HDFS is much better, because you don't have to worry about where your application master is: each and every node has access to HDFS.
Hope that resolves your problem.

How to cache jars for DataProc Spark job submission

I am submitting a Spark job to Dataproc using either gcloud or the Google Cloud Dataproc API. One of the arguments is '--jars' (or its Java API equivalent), where I supply a comma-separated list of JAR files to be provided to the executor and driver classpaths:
gs://google-storage-bucket/lib/x1.jar,gs://google-storage-bucket/lib/x2.jar, etc...
The same JAR files are copied from the Google Storage bucket to the working directory of each SparkContext on the executor nodes every time I submit a job, and it takes about 2 minutes before the job really starts executing (I can see that in the Google Cloud console - https://console.cloud.google.com/dataproc/jobs/...).
Is it possible to somehow cache these jar files on Spark nodes and use them in the classpath with every job submission? That would save about 50% of the run time.
Thanks,
Victor
Indeed, if you pass in arguments of the form file:///your/path/on/the/cluster/nodes/filesystem then it will be interpreted as referring to files on the cluster nodes themselves.
You can either copy the files from GCS onto the nodes at cluster creation time using an initialization action, run some kind of Spark job to do it on an existing cluster, or manually SSH in to stage those jars.
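A rough sketch of the initialization-action route (the local directory /usr/local/lib/myjars is an arbitrary choice; the script would be uploaded to GCS and passed to gcloud dataproc clusters create via --initialization-actions):
#!/bin/bash
# Runs once on every node at cluster creation time: stage the shared JARs locally.
mkdir -p /usr/local/lib/myjars
gsutil cp 'gs://google-storage-bucket/lib/*.jar' /usr/local/lib/myjars/
Jobs could then reference the jars as local files, for example --jars file:///usr/local/lib/myjars/x1.jar,file:///usr/local/lib/myjars/x2.jar, so nothing has to be copied from GCS at submission time.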

DASK with local files on WORKER systems

I am working with multiple systems as workers.
Each worker system has a part of the data stored locally, and I want the computation to be done by each worker on its respective file only.
I have tried using:
distributed.scheduler.decide_worker()
send_task_to_worker(worker, key)
but I could not automate assigning the task for each file.
Also, is there any way I can access the local files of a worker? Using the TCP address, I only have access to a temp folder of the worker created for Dask.
You can target computations to run on certain workers using the workers= keyword to the various methods on the client. See http://distributed.readthedocs.io/en/latest/locality.html#user-control for more information.
You might run a function on each of your workers that tells you which files are present:
>>> client.run(os.listdir, my_directory)
{'192.168.0.1:52523': ['myfile1.dat', 'myfile2.dat'],
'192.168.0.2:4244': ['myfile3.dat'],
'192.168.0.3:5515': ['myfile4.dat', 'myfile5.dat']}
You might then submit computations to run on those workers specifically.
future = client.submit(load, 'myfile1.dat', workers='192.168.0.1:52523')
If you are using dask.delayed, you can also pass workers= to the persist method. See http://distributed.readthedocs.io/en/latest/locality.html#user-control for more information.
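For instance, a minimal sketch of the delayed/persist variant, reusing the hypothetical load function and worker address from the examples above:
from dask import delayed
lazy = delayed(load)('myfile1.dat')  # build the task without executing it yet
[persisted] = client.persist([lazy], workers='192.168.0.1:52523')  # pin execution to that worker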

Copy a file from HDFS to a local directory for multiple tasks on a node?

So, basically, I have a read-only file (several GB in size, so broadcasting is not an option) that must be copied to a local folder on the node, because each task internally runs a program (via os.system in Python or the ! operator in Scala) that reads from a local file and cannot read from HDFS. The problem, however, is that several tasks could be running on one node. If the file is not already there on the node, it should be copied from HDFS to a local directory. But how could I have one task fetch the file from HDFS while the other tasks on that node wait for it (note that the tasks run in parallel on a node)? Which file synchronization mechanism could I use in Spark for that purpose?