Google Dataprep: number of instances and architecture optimisation

I have noticed that every destination in Google Dataprep (whether manual or scheduled) spins up a Compute Engine instance. The quota limit for a normal account is a maximum of 8 instances.
Look at this flow:
[image: Dataprep flow]
Since data wrangling is composed of multiple layers, and you might want to materialize intermediate steps with exports, what is the best approach/architecture for running Dataprep flows?
Option A
Run two separate flows and schedule them 15 minutes apart:
the first flow exports only the final step
the other flow exports only the intermediate steps
This way you're not hitting the quota limit, but you're still calculating the early stages of the same flow multiple times.
Option B
Leave the flow as it is and request a higher Compute Engine quota: the computational effort is the same, I will just have more instances running in parallel instead of sequentially.
Option C
Give each step its own flow and create reference datasets:
this way each flow will only run a single step.
E.g.
When I run the job "1549_first_repo" I will no longer calculate the three previous steps, only the last one: the transformations between the referenced "5912_first" table and "1549_first_repo".
This last option seems the most reasonable to me, as each transformation is run at most once. Am I missing something?
Also, is there a way to run each export sequentially instead of in parallel?
-- EDIT 30 May --
It turns out option C is not the way to go, as "referencing" is a pure continuation of the previous flow. You can think of the flow before and after the referenced dataset as a single flow.
Still trying to figure out how to achieve modularity without redundantly calculating the same operations.

Both options A and B are good, the difference being the quota increase. If you are expecting to upgrade sooner or later, might as well do it sooner.
Another option, if you are familiar with Java or Python and Dataflow, is to create a pipeline with a combination of numWorkers, workerMachineType, and maxNumWorkers that fits within your trial limit of 8 cores (or virtual CPUs). Here are the pipeline options and here is a tutorial that can give you a better view of the product.
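For example, here is a minimal sketch of capping the worker footprint from the Beam Python SDK so the Dataflow job stays within an 8-vCPU quota; the project, region, bucket, and file paths are placeholders:

# Minimal sketch, assuming the Apache Beam Python SDK; the project, region,
# bucket, and paths below are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    '--region=us-central1',
    '--temp_location=gs://my-bucket/tmp',
    '--num_workers=2',                      # start small
    '--max_num_workers=4',                  # 4 x n1-standard-2 = 8 vCPUs, i.e. the quota
    '--worker_machine_type=n1-standard-2',
])

with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/*.csv')
     | 'Write' >> beam.io.WriteToText('gs://my-bucket/output/result'))

The autoscaler can still scale the job down, but it will never exceed maxNumWorkers, so the job cannot blow past the instance quota.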

Related

Performance issues when merging large files into one single file

I have a pipeline containing multiple copy activities, and the main purpose of these activities is to merge multiple files into one single file.
The problem with this pipeline is that it takes about 4 hours to execute (to merge the files). Is there any way to reduce the duration?
Thanks for your reply.
If the copy operation is being performed on an Azure integration runtime, the following steps must be followed:
For Data Integration Units (DIU) and parallel copy settings, start with the default values.
If you're using a self-hosted integration runtime, you'll need to do the following:
I would recommend that you run the IR on a separate machine, kept isolated from the data store server. Start with the default values for the parallel copy settings and with the self-hosted IR on a single node.
Otherwise, you may leverage:
A Data Integration Unit (DIU)
It is a measure that represents the power of a single unit in Azure Data Factory and Synapse pipelines. Power is a combination of CPU, memory, and network resource allocation. DIU only applies to Azure integration runtime. DIU does not apply to self-hosted integration runtime.
Parallel Copy
You can set the parallelCopies property to indicate the parallelism you want the copy activity to use. Think of this property as the maximum number of threads within the copy activity. The threads operate in parallel, either reading from your source or writing to your sink data stores.
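As a rough sketch of where these knobs live, here are the relevant copy activity settings shown as a Python dict mirroring the pipeline JSON; the activity name and the source/sink types are placeholders that depend on your datasets:

# Rough sketch of a copy activity definition; names and source/sink types
# are placeholders that depend on your datasets.
copy_activity = {
    "name": "MergeFiles",                             # placeholder activity name
    "type": "Copy",
    "typeProperties": {
        "source": {"type": "DelimitedTextSource"},    # depends on your source dataset
        "sink": {
            "type": "DelimitedTextSink",              # depends on your sink dataset
            "copyBehavior": "MergeFiles",             # merge input files into one output file
        },
        "parallelCopies": 8,          # maximum number of threads for this copy activity
        "dataIntegrationUnits": 16,   # DIU; only applies to the Azure integration runtime
    },
}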
Here is the MSFT document on troubleshooting copy activity performance.
When copying data into Azure Table, the default parallel copy is 4. The range of the DIU setting is 2-256. However, the specific behavior of DIU differs between copy scenarios, even if you set the number yourself.
Please see the table listed there. DIU has some limitations, as you have seen, so you should choose the optimal setting for your particular scenario.
If you are copying 1 GB of data, DIU will likely never go beyond 4. But if you copy 10 GB of data, you will notice DIU starting to scale up beyond 4.
Here is the list of the Data Integration Units.

Is Apache Beam the right tool for feature pre processing?

So this is a bit of a weird question as it isn't related to how to use the tool but more about why to use it.
I'm deploying a model and thinking of using Apache Beam to run the feature-processing tasks using its Python API. The documentation is pretty big and complex, but I went through most of it, even built a small working pipeline, and it is still not clear whether this would be the right tool for me.
An example of what I need is the following:
Input data structure:
ID | Timestamp | category
output needed:
category | category count for last 30 minutes (feature example)
This process needs to run every 5 minutes and update the counts.
===> What I fail to understand is whether Apache Beam can run this pipeline every 5 minutes, read whatever new input data was generated, and update the counts from the previous time it ran. And if so, can someone point me in the right direction?
Thank you!
When you run a Beam pipeline manually, it's expected to be started only once. It can then be either a bounded (batch) or unbounded (streaming) pipeline. In the first case, it will stop after all your bounded data has been processed; in the second case, it will run continuously and wait for new data to arrive (until it is stopped manually).
Usually, the type of pipeline depends on the data source that you have (Beam IO connectors). For example, if you read from files, then by default it's assumed to be a bounded source (a limited number of files), but it could be an unbounded source as well if you expect more new files to arrive and want to read them in the same pipeline.
Also, you can run your batch pipeline periodically with automated tools, like Apache Airflow (or just a unix crontab). So, it all depends on your needs and the type of data source. I could probably give more specific advice if you could share more details of your data pipeline: the type of your data source and environment, an example of input and output results, how often your input data is updated, and so on.
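For the concrete example in the question (a per-category count over the last 30 minutes, refreshed every 5 minutes), a streaming pipeline with sliding windows is the natural fit. A minimal sketch, assuming the Beam Python SDK and an unbounded source such as Pub/Sub; the topic, message format, and sink are placeholders:

# Minimal sketch: rolling 30-minute counts per category, emitted every 5 minutes.
# The Pub/Sub topic, message format, and sink are placeholders.
import json
import apache_beam as beam
from apache_beam.transforms.window import SlidingWindows
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromPubSub(topic='projects/my-project/topics/events')
     # Here the Pub/Sub publish time serves as the event time; to window on the
     # record's own Timestamp column you would assign timestamps explicitly.
     | 'Parse' >> beam.Map(lambda msg: json.loads(msg.decode('utf-8'))['category'])
     | 'Window' >> beam.WindowInto(SlidingWindows(size=30 * 60, period=5 * 60))
     | 'Count' >> beam.combiners.Count.PerElement()
     | 'Print' >> beam.Map(print))   # replace with a real sink, e.g. BigQuery

The same windowing logic also works in a bounded pipeline that Airflow or cron launches every 5 minutes, as long as each run reads at least the last 30 minutes of input.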

Google Dataprep (Cloud Dataprep by Trifacta) tip: jobs will not be able to run if they are too large

During my cloud dataprep adventures I have come across yet another very annoying bug.
The problem occurs when creating complex flow structures which need to be connected through reference datasets. If a certain limit is crossed in performing a number of unions or joins with these sets, Dataflow is unable to start the job.
I have had a lot of contact with support and they are working on the issue:
"Our Systems Engineer Team was able to determine the root cause resulting into the failed job. They mentioned that the job is too large. That means that the recipe (combined from all datasets) is too big, and Dataflow rejects it. Our engineering team is still investigating approaches to address this.
A workaround is to split the job into two smaller jobs. The first runs the flow for the data enrichment, and then the output is used as input in the other flow. While it is not ideal, this would be a working solution for the time being."
I ran into the same problem and have a fairly educated guess as to the answer. Keep in mind that Dataprep simply takes all your GUI-based inputs and translates them into Apache Beam code. When you pass in a reference dataset, it probably writes some Beam code that turns the reference dataset into a side input (https://beam.apache.org/documentation/programming-guide/). Dataflow will perform a Parallel Do (ParDo) function where it takes each element from a PCollection, sends it to a worker node, and then applies the side-input data for the transformation.
So I am pretty sure that if the reference sets get too big (which can happen with joins), the underlying code will take an element from dataset A and pass it to a function with side input B... but if side input B is very big, it won't fit into the worker's memory. Take a look at the Stackdriver logs for your job to investigate whether this is the case. If you see 'GC (Allocation Failure)' in your logs, this is a sign of not enough memory.
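To make that concrete, here is a minimal sketch (with made-up data and names) of a reference dataset used as a side input to a ParDo-style transform; the whole side input is materialized on every worker, which is exactly where the memory pressure comes from:

import apache_beam as beam

with beam.Pipeline() as p:
    # Main dataset A: (key, value) records.
    main = p | 'CreateMain' >> beam.Create([('a', 1), ('b', 2), ('c', 3)])

    # Reference dataset B, turned into a side input. AsDict ships the whole
    # reference set to every worker, so it must fit in worker memory.
    reference = p | 'CreateRef' >> beam.Create([('a', 'alpha'), ('b', 'beta')])

    def enrich(element, ref):
        key, value = element
        return (key, value, ref.get(key, 'unknown'))

    (main
     | 'Join' >> beam.Map(enrich, ref=beam.pvalue.AsDict(reference))
     | 'Print' >> beam.Map(print))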
You can try doing this: suppose you have two CSV files to read in and process; file A is 4 GB and file B is also 4 GB. If you kick off a job to perform some type of join, it will very quickly outgrow the worker memory and puke. If you CAN, see if you can pre-process in a way where one of the files stays in the MB range and only the other file grows.
If your data structures don't lend themselves to that option, you could do what the Sys Engs suggested, split one file up into many small chunks and then feed it to the recipe iteratively against the other larger file.
Another option to test is specifying the compute type for the workers. You can iteratively grow the compute type larger and larger to see if it finally pushes through.
The other option is to code it all up yourself in Apache Beam, test locally, then port to Google Cloud DataFlow.
Hopefully these guys fix the problem soon, they don't make it easy to ask them questions, that's for sure.

Parallel processing input/output, queries, and indexes AS400

IBM V6.1
When using System i Navigator and you click System Values, the following is displayed.
By default, "Do not allow parallel processing" is selected.
What will the impact be on processing in programs when you choose multiple processes? We have a lot of RPG IV programs and SQL queries being executed, and I think it will increase performance.
Basically, I want to turn this on in the production environment, but I am not sure if I will break anything by doing this, for example input or output of different programs running in parallel, or data getting out of sequence.
I did do some research:
https://publib.boulder.ibm.com/iseries/v5r2/ic2924/index.htm?info/rzakz/rzakzqqrydegree.htm
I understand each option, but I do not know the risk of changing it from the default to multiple.
First off, in order to get the most out of *MAX and *OPTIMIZE, you'd need a system with more than one core (enabled for IBM i / DB2) along with the DB2 Symmetric Multiprocessing (SMP) (57xx-SS1 option 26) licensed program installed, thus allowing the system to use SMP for queries and index builds.
For *IO, the system can use multiple tasks via simultaneous multithreading (SMT) even on a single-core POWER5 or later box. SMT is enabled via the Processor multitasking (QPRCMLTTSK) system value.
You're unlikely to "break" anything by changing the value, as long as your applications don't make bad assumptions about result set ordering. For example, CPYxxxIMPF makes use of SQL behind the scenes; with anything but *NONE you might end up with the rows in your DB2 table in a different order from the rows in the import file.
You will most certainly increase CPU usage. This is not a bad thing, unless you're regularly pushing 90%+ CPU usage already. If you're only using 50% of your CPU, it's probably a good thing to make use of SMT/SMP to provide better response time, even if it increases CPU utilization to 60%.
Having said that, here's a story of it being a problem... http://archive.midrange.com/midrange-l/200304/msg01338.html
Note that in the above case, the OP was pre-building work tables at sign-on in order to minimize the wait when it was time to use them. A great idea 20 years ago with single-threaded systems. Today, the alternative would be to take advantage of SMP/SMT and build only what's needed when needed.
As you note in a comment, this kind of change is difficult to test in non-production environments, since the workloads in DEV & TEST are different. So it's important to collect good performance data before and after the change. You might also consider moving it in stages: *NONE --> *IO --> *OPTIMIZE, and then *MAX if you wish. I'd spend at least a month at each level if you have periodic month-end jobs.
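One low-risk way to experiment before touching the system value is to override the degree for a single database job with CHGQRYA and measure the effect. A hedged sketch, assuming an ODBC connection to the box and that the QSYS2.QCMDEXC procedure is available on your release (on older releases the classic QSYS.QCMDEXC program with a length parameter does the same job); the DSN, credentials, and query are placeholders:

import time
import pyodbc

# Placeholder DSN and credentials.
conn = pyodbc.connect('DSN=MYIBMI;UID=TESTUSER;PWD=secret')
cur = conn.cursor()

# CHGQRYA only affects the job serving this connection, so the system
# value (and every other job on the box) is left untouched.
cur.execute("CALL QSYS2.QCMDEXC('CHGQRYA DEGREE(*OPTIMIZE)')")

start = time.time()
cur.execute("SELECT COUNT(*) FROM MYLIB.MYTABLE")   # placeholder query to benchmark
print(cur.fetchone()[0], 'rows counted in', round(time.time() - start, 2), 'seconds')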

Clarifications in Electric commander and tutorial

I was searching for tutorials on Electric Cloud on the net but found nothing. I also could not find good blogs dealing with it. Can somebody point me in the right direction?
Also, we are planning on using Electric Cloud to execute Perl scripts in parallel. We are not going to build software. We are trying to test our hardware in parallel by executing the same Perl script in parallel using ElectricCommander. But I think ElectricCommander might not be the right tool given its cost. Can you suggest some of the pros and cons of using ElectricCommander for this, and any other features which might be useful for our testing?
Thanks...
RE #1: All of the ElectricCommander documentation is located in the Electric Cloud Knowledge Base located at https://electriccloud.zendesk.com/entries/229369-documentation.
ElectricCommander can also be a valuable application to drive your tests in parallel. Here are just a few aspects for consideration:
Subprocedures: With EC, you can just take your existing scripts, drop them into a procedure definition and call that procedure multiple times (concurrently) in a single procedure invocation. If you want, you can further decompose your scripts into more granular subprocedures. This will drive reuse, lower cost of administration, and it will enable your procedures to run as fast as possible (see parallelism below).
Parallelism: Enabling a script to run in parallel is literally as simple as checking a box within EC. I'm not just referring to running 2 procedures at the same time without risk of data collision. I'm referring to the ability to run multiple steps within a procedure concurrently. Coupled with the subprocedure capability mentioned above, this enables your procedures to run as fast as possible, as you can nest subprocedures within other subprocedures and enable everything to run in parallel where the tests allow it.
Root-cause Analysis: Tests can generate an immense amount of data, but often only the failures, warnings, etc. are relevant (tell me what's broken). EC can be configured to look for very specific strings in your test output and will produce diagnostics based on that configuration. So if your test produces a thousand lines of output but only 5 lines reference errors, EC will automatically highlight those 5 lines for you. This makes it much easier for developers to quickly identify the root cause.
Results Tracking: ElectricCommander's properties mechanism allows you to store any piece of information that you determine to be relevant. These properties can be associated with any object in the system whether it be the procedure itself or the job that resulted from the invocation of a procedure. Coupled with EC's reporting capabilities, this means that you can produce valuable metrics indicating your overall project health or throughput without any constraint.
Defect Tracking Integration: With EC, you can automatically file bugs in your defect tracking system when tests fail or you can have EC create a "defect triage report" where developers/QA review the failures and denote which ones should be auto-filed by EC. This eliminates redundant data entry and streamlines overall software development.
In short, EC will behave exactly the way you want it to. It will not force you to change your process to fit the tool. As far as cost goes, Electric Cloud provides a version known as ElectricCommander Workgroup Edition for cost-sensitive customers. It is available for a small annual subscription fee and is something that you may want to follow up on.
I hope this helps. Feel free to contact your account manager or myself directly if you have additional questions (dfarhang#electric-cloud.com).
Maybe you could execute the same Perl script on several machines by using r-commands, or cron, or something similar.
To further address the parallel aspect of your question:
The command-line interface lets you write scripts to construct procedures, including this kind of subprocedure with parallel steps. So you are not limited, in the number of parallel steps, to what you wrote previously: you can write a procedure which dynamically sizes itself to (for example) the number of steps you would like to run in parallel, or the number of resources you have to run steps in parallel.