Running multiple tasks in parallel with NAnt - nant

I am trying to run multiple tasks in parallel to reduce the time of my builds, but it seems like the tasks are executed one after another, sequentially, which takes too long. Is there any way to run multiple tasks in parallel in NAnt?

Looks like this has been asked before; see NAnt: parallel threads for execution of tasks; does this exist?
It is not the best question/answer pair, since the answer was moved out of a comment, but it links to a blog post that does have code: AsyncExec and WaitForExit Speeding Up The Build To Do More.
Even that only gets as far as an example of how to use the code, which is in the Async Tasks zip file.
I have not used the code myself; I only came across it while searching for NAnt material. The only parallelism I have used is multiple build machines.

Have a look at https://github.com/NAntCrossCompile/NAnt.Parallel. There is a plugin in development which allows the parallel execution of tasks based on sets of files, folders, strings, etc.

Related

Talend dynamic jobs slower than tFilterRow

I've got a parent job reading files/messages and, based on the type of message, it should run a child job that contains all the logic required for processing that type of message. Using dynamic jobs makes this incredibly slow and I don't know why.
Currently we have a growing tree of specific jobs, which I have trouble grouping back together for a finishing task that they will all share.
I tried to use a dynamic job because the job that needs to be called is known in the model; however, this makes the job painfully slow.
Are there 'better' solutions, ways to speed this up, or anything that I could be missing?

Running Dymola parallel on the cluster

I am trying to run Dymola on a cluster so that everyone in my research group can submit model and simulation jobs. Is it possible to run Dymola on a cluster and utilize the power of HPC?
I can use some flags to make Dymola run in parallel on a many-core computer, but how do I run a parallel simulation across many computers?
Parallelization on a single computer:
Parallelizing a Modelica model is possible, but the model needs to be either suitable by nature (which doesn't happen too often, at least in my experience; for some examples where it works well see e.g. here) or modified manually by the modeler to allow parallelization, e.g. by introducing delay blocks (see here, or a similar approach here).
Often Dymola will output No Parallelization in the translation log, presumably due to the model not allowing parallelization (efficiently). Also the manual states: It should be noted that for many kinds of models the internal dependencies don’t allow efficient parallelization for getting any substantial speed-up.
I'm not an expert on this, but to my understanding HPC depends on massive parallelization; therefore, models generated by Dymola do not seem to be a very good application to run on HPC clusters.
Dymola on multiple computers:
Running a single Dymola-simulation on multiple computers in parallel is not possible as far as I know.
I think there are several answers to this question.
The flags under Translation all refer to parallelized code inside a single simulation executable. If you mention HPC, I do not think you need to consider this.
To run multiple simulations on a single multi-core computer, there is built-in support in Dymola. The relevant function is simulateModelMulti, etc. The Sweep Parameters feature uses this automatically.
There is no built-in support to distribute the simulation on several computers in a cluster. However, if you generate your dymosim.exe with the Binary Model Export option enabled, it can be run on other computers. You need to distribute dymosim.exe, dsin.txt and any data files you read across the cluster. I'm sure your HPC cluster has tools for that.
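As a rough sketch of that last point: once dymosim.exe and the input files have been copied to a worker node, each parameter variant is just an independent process. The file names, worker count, and the assumption that dymosim accepts an input file and a result file on the command line are all mine, and concurrent.futures merely stands in for whatever scheduler your cluster provides.

```python
# Minimal sketch: run several exported dymosim simulations as independent
# processes, one per dsin-style input file. Paths and naming are assumptions.
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def run_variant(dsin_file: Path) -> int:
    result_file = dsin_file.with_name(dsin_file.stem + "_result.mat")
    # One independent dymosim run per input file; nothing is shared between runs.
    completed = subprocess.run(
        ["./dymosim.exe", str(dsin_file), str(result_file)],
        capture_output=True, text=True,
    )
    return completed.returncode

if __name__ == "__main__":
    variants = sorted(Path("variants").glob("dsin_*.txt"))  # hypothetical inputs
    with ProcessPoolExecutor(max_workers=4) as pool:
        for dsin, code in zip(variants, pool.map(run_variant, variants)):
            print(dsin.name, "ok" if code == 0 else "failed")
```

On an actual cluster you would replace the local process pool with one scheduler job per variant; the per-run command line stays the same.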

Complex Data Pipeline Migration Plan Question

My plan:
Move all data processing to Spark (PySpark preferably), with the final (consumer-facing) output data going to Redshift only. Spark seems to connect well to all the various sources (DynamoDB, S3, Redshift). Output to Redshift/S3 etc. depending on customer need. This avoids having multiple Redshift clusters, broken or overused internal unsupported ETL tools, copies of the same data across clusters, views and tables, etc. (which is the current setup).
Use Luigi to build a web UI to monitor pipelines daily, visualise the dependency tree, and schedule ETLs. Email notifications should be an option for failures as well. An alternative is AWS Data Pipeline, but Luigi seems to have a better UI for seeing what is happening when many dependencies are involved (some trees are 5 levels deep, though perhaps this can also be avoided with better Spark code).
Questions:
Does Luigi integrate with Spark? (I have only used PySpark before, not Luigi, so this is a learning curve for me.) The plan was to schedule 'applications', and Spark itself has ETL capabilities too I believe, so I'm unsure how Luigi integrates here.
How do I account for the fact that some pipelines may be 'real time'? Would I need to spin up the Spark/EMR job hourly, for example?
I'm open to thoughts / suggestions / better ways of doing this too!
To answer your questions directly,
1) Yes, Luigi does play nicely with PySpark, just like any other library. We certainly have it running without issue; the only caveat is that you have to be a little careful with imports and keep them within the methods of the Luigi class, as in the background it spins up new Python instances (see the sketch after point 2 for what this looks like).
2) There are ways of getting Luigi to slurp in streams of data, but it is tricky to do. Realistically, you'd fall back to running an hourly cron cycle to just call the pipeline and process any new data. This sort of reflects Spotify's use case for Luigi, where they run daily jobs to calculate top artists, etc.
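To illustrate the import caveat from point 1, here is a minimal sketch using a plain luigi.Task (the task name, paths, and logic below are made up, and no Spark-specific Luigi helper is assumed). The pyspark import lives inside run() so it happens in whichever Python process Luigi spawns to execute the task:

```python
import luigi

class DailyAggregate(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget("/tmp/aggregate_%s" % self.date)

    def run(self):
        # Import here, not at module level, so the import happens inside the
        # worker process that actually executes this task.
        from pyspark import SparkContext

        sc = SparkContext(appName="daily-aggregate")
        try:
            counts = (sc.textFile("/data/events/%s/*.log" % self.date)  # hypothetical input
                        .map(lambda line: (line.split(",")[0], 1))
                        .reduceByKey(lambda a, b: a + b)
                        .collect())
            with self.output().open("w") as out:
                for key, count in counts:
                    out.write("%s\t%d\n" % (key, count))
        finally:
            sc.stop()

if __name__ == "__main__":
    luigi.run()
```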
As @RonD suggests, if I were building a new pipeline now, I'd skip Luigi and go straight to Airflow. If nothing else, look at the release history: Luigi hasn't really been significantly worked on for a long time (because it works for the main dev), whereas Airflow is actively being incubated by Apache.
Instead of Luigi, use Apache Airflow for workflow orchestration (code is written in Python). It has a lot of operators and hooks built in which you can call in DAGs (workflows). For example, create a task that calls an operator to start up an EMR cluster, another to run a PySpark script located in S3 on the cluster, and another to watch the run for its status. You can use tasks to set up dependencies, etc. too.
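A rough sketch of that pattern follows. The cluster configuration, bucket names, and script path are placeholders, and the imports assume a recent Airflow with the apache-airflow-providers-amazon package installed; check the provider documentation for the exact operator arguments in your version.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrCreateJobFlowOperator,
    EmrAddStepsOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

SPARK_STEP = [{
    "Name": "run_etl",
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "s3://example-bucket/etl/job.py"],  # hypothetical script
    },
}]

with DAG("emr_pyspark_etl", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as dag:
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_cluster",
        job_flow_overrides={"Name": "etl-cluster"},  # minimal placeholder config
    )
    add_step = EmrAddStepsOperator(
        task_id="add_step",
        job_flow_id=create_cluster.output,
        steps=SPARK_STEP,
    )
    watch_step = EmrStepSensor(
        task_id="watch_step",
        job_flow_id=create_cluster.output,
        step_id="{{ task_instance.xcom_pull(task_ids='add_step')[0] }}",
    )

    create_cluster >> add_step >> watch_step
```

The sensor keeps the DAG run open until the EMR step finishes, so a failed step shows up in the Airflow UI and can trigger the usual email alerting.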

How to do parallel pipeline?

I have built a Scala application on Spark v1.6.0 that combines various functionalities. I have code for scanning a dataframe for certain entries, code that performs certain computations on a dataframe, code for creating an output, etc.
At the moment the components are 'statically' combined, i.e., in my code I call the code from a component X doing a computation, I take the resulting data and call a method of component Y that takes the data as input.
I would like to make this more flexible, letting a user simply specify a pipeline (possibly one with parallel executions). I would assume that the workflows are rather small and simple, as in the following picture:
However, I do not know how best to approach this problem.
I could build the whole pipeline logic myself, which will probably result in quite some work and possibly some errors too...
I have seen that Apache Spark comes with a Pipeline class in the ML package; however, it does not support parallel execution if I understand correctly (in the example, the two ParquetReaders could read and process their data at the same time).
There is apparently also the Luigi project that might do exactly this (however, its page says that Luigi is for long-running workflows, whereas I just need short-running workflows; Luigi might be overkill?).
What would you suggest for building work/dataflows in Spark?
I would suggest using Spark's MLlib pipeline functionality; what you describe sounds like it would fit that case well. One nice thing about it is that it allows Spark to optimize the flow for you, probably in a smarter way than you could.
You mention it can't read the two Parquet files in parallel, but it can read each separate file in a distributed way. So rather than having N/2 nodes process each file separately, you would have N nodes process them in series, which I'd expect to give you a similar runtime, especially if the mapping to y-c is 1-to-1. Basically, you don't have to worry about Spark underutilizing your resources (if your data is partitioned properly).
But actually things may even be better, because Spark is smarter at optimising the flow than you are. An important thing to keep in mind is that Spark may not do things exactly in the way and in the separate steps as you define them: when you tell it to compute y-c it doesn't actually do that right away. It is lazy (in a good way!) and waits until you've built up the whole flow and ask it for answers, at which point it analyses the flow, applies optimisations (e.g. one possibility is that it can figure out it doesn't have to read and process a large chunk of one or both of the Parquet files, especially with partition discovery), and only then executes the final plan.
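To make the laziness point concrete, here is a minimal PySpark sketch against the Spark 1.6 DataFrame API (the question uses Scala; the paths and column names here are made up). Nothing is read or computed until the final action:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="lazy-pipeline-sketch")
sqlContext = SQLContext(sc)

# Both "reads" are only nodes in the logical plan at this point; no I/O yet.
left = sqlContext.read.parquet("/data/input_a")   # hypothetical path
right = sqlContext.read.parquet("/data/input_b")  # hypothetical path

# Transformations are also lazy; Spark just extends the plan.
combined = left.join(right, "id").filter(left["value"] > 0)

# Only an action (write/count/collect) triggers analysis, optimisation of the
# whole flow, and distributed execution of the final plan.
combined.write.parquet("/data/output")            # hypothetical path
```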

Clarifications in Electric commander and tutorial

I was searching the net for tutorials on Electric Cloud but found nothing, and I could not find good blogs dealing with it either. Can somebody point me in the right direction?
We are also planning on using Electric Cloud for executing Perl scripts in parallel. We are not going to build software; we are trying to test our hardware in parallel by executing the same Perl script in parallel using ElectricCommander. But I think ElectricCommander might not be the right tool, given its cost. Can you suggest some of the pros and cons of using ElectricCommander for this, and any other features which might be useful for our testing?
Thanks...
RE #1: All of the ElectricCommander documentation is in the Electric Cloud Knowledge Base at https://electriccloud.zendesk.com/entries/229369-documentation.
ElectricCommander can also be a valuable application to drive your tests in parallel. Here are just a few aspects for consideration:
Subprocedures: With EC, you can just take your existing scripts, drop them into a procedure definition and call that procedure multiple times (concurrently) in a single procedure invocation. If you want, you can further decompose your scripts into more granular subprocedures. This will drive reuse, lower cost of administration, and it will enable your procedures to run as fast as possible (see parallelism below).
Parallelism: Enabling a script to run in parallel is literally as simple as checking a box within EC. I'm not just referring to running 2 procedures at the same time without risk of data collision; I'm referring to the ability to run multiple steps within a procedure concurrently. Coupled with the subprocedure capability mentioned above, this enables your procedures to run as fast as possible, as you can nest subprocedures within other subprocedures and enable everything to run in parallel where the tests allow it.
Root-cause Analysis: Tests can generate an immense amount of data, but often only the failures, warnings, etc. are relevant (tell me what's broken). EC can be configured to look for very specific strings in your test output and will produce diagnostics based on that configuration. So if your test produces a thousand lines of output but only 5 lines reference errors, EC will automatically highlight those 5 lines for you. This makes it much easier for developers to quickly identify the root cause.
Results Tracking: ElectricCommander's properties mechanism allows you to store any piece of information that you determine to be relevant. These properties can be associated with any object in the system whether it be the procedure itself or the job that resulted from the invocation of a procedure. Coupled with EC's reporting capabilities, this means that you can produce valuable metrics indicating your overall project health or throughput without any constraint.
Defect Tracking Integration: With EC, you can automatically file bugs in your defect tracking system when tests fail or you can have EC create a "defect triage report" where developers/QA review the failures and denote which ones should be auto-filed by EC. This eliminates redundant data entry and streamlines overall software development.
In short, EC will behave exactly the way you want it to. It will not force you to change your process to fit the tool. As far as cost goes, Electric Cloud provides a version known as ElectricCommander Workgroup Edition for cost-sensitive customers. It is available for a small annual subscription fee and is something that you may want to follow up on.
I hope this helps. Feel free to contact your account manager or myself directly if you have additional questions (dfarhang@electric-cloud.com).
Maybe you could execute the same Perl script on several machines by using r-commands, or cron, or something similar.
To further address the parallel aspect of your question: the command-line interface lets you write scripts to construct procedures, including this kind of subprocedure with parallel steps. So you are not limited, in the number of parallel steps, to what you wrote previously: you can write a procedure which dynamically sizes itself to (for example) the number of steps you would like to run in parallel, or the number of resources you have to run steps in parallel.
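For example, a generation script along these lines could size the procedure to the hardware available. The ectool subcommand and flag names used here (createProcedure, createStep, --parallel) are assumptions based on the EC command-line interface, so check the command reference for your ElectricCommander version; the project, procedure, and Perl script names are made up.

```python
# Sketch: drive ectool from Python to build a procedure whose number of
# parallel steps is decided at generation time (all names are hypothetical).
import subprocess

PROJECT = "HardwareTests"           # hypothetical project name
PROCEDURE = "RunPerlTestsParallel"  # hypothetical procedure name
NUM_BOARDS = 8                      # size the procedure to the boards you have

def ectool(*args):
    subprocess.run(["ectool", *args], check=True)

ectool("createProcedure", PROJECT, PROCEDURE)
for board in range(NUM_BOARDS):
    ectool(
        "createStep", PROJECT, PROCEDURE, "test_board_%d" % board,
        "--command", "perl run_hw_test.pl --board %d" % board,  # hypothetical script
        "--parallel", "1",  # assumed flag: let this step run concurrently with its siblings
    )
```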