Hi, pretty new to the Azure suite. Anyway, I'm looking to do a copy activity in a Data Factory pipeline from one Azure storage container to another, to be used as a backup. This will be copied every 2 weeks, with the old copy deleted when the new one is written. The pipeline carries out the copy with 167 GB read and the same amount written, based on the debug details.
Azure pricing says that every 4 MB is one operation, so 167,000 / 4 = 41,750, x 2 for read and write transactions. When I put that into the pricing calculator it comes out pretty cheap, only about $4-5 a month including storage. Are there other costs I'm missing? Cluster costs etc.? Or have I totally missed something else? I removed iterative write operations, which seem to be the most costly ops in the calculator, because I can't see that I'm doing those in this case; I'm simply copying data with no transformations.
Any info on this would be much appreciated, found it a little difficult to fully grasp.
Thanks in advance.
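For reference, here's my back-of-the-envelope math written out; the per-operation prices are assumed placeholders rather than official rates, so check your storage tier and region in the calculator.

```python
# Back-of-the-envelope storage transaction math for one backup run.
# The per-10,000-operation prices are assumed placeholders, not official rates.
data_gb = 167
op_size_mb = 4                                # blob operations are billed in up to 4 MB units
read_ops = data_gb * 1000 / op_size_mb        # ~41,750 reads from the source container
write_ops = read_ops                          # same again written to the destination

read_price_per_10k = 0.004                    # ASSUMED example rate
write_price_per_10k = 0.05                    # ASSUMED example rate

cost_per_run = (read_ops / 10_000) * read_price_per_10k + (write_ops / 10_000) * write_price_per_10k
print(f"{read_ops + write_ops:,.0f} operations, ~${cost_per_run:.2f} per run, ~${cost_per_run * 2:.2f}/month")
```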
When running a pipeline with copy activities, the other cost you incur is for pipeline orchestration and execution.
You can view pipeline run consumption from the monitoring section in the Data Factory UI by clicking the consumption link next to the pipeline run entry.
That will show you the pipeline run consumption breakdown in terms of activity runs, data movement DIU-hours and execution hours.
For a simple copy task like you described this cost is insignificant, but you can click the pricing calculator link in the consumption view and play around with different scenarios to estimate the cost of your ADF pipelines.
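To make that concrete, here's roughly how a consumption readout translates into a monthly number for a copy that runs twice a month; the per-unit rates and the DIU-hours value below are assumed examples, so plug in whatever your own consumption view and the pricing page show.

```python
# Turn an ADF consumption readout into a rough monthly cost for a twice-monthly backup copy.
# The rates are approximate pay-as-you-go list prices (region-dependent); the DIU-hours
# figure is an assumed example, so substitute the value your consumption view reports.
runs_per_month = 2
activity_runs = 1 * runs_per_month            # one copy activity per pipeline run
diu_hours_per_run = 2.0                       # ASSUMED: read this off the consumption view

activity_run_rate = 1.00 / 1000               # ~$1 per 1,000 activity runs on the Azure IR
diu_hour_rate = 0.25                          # ~$0.25 per DIU-hour of data movement

monthly = activity_runs * activity_run_rate + runs_per_month * diu_hours_per_run * diu_hour_rate
print(f"~${monthly:.2f}/month for orchestration and data movement")
```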
Situation: Every day a bunch of JSON files are generated and put into Azure Blob storage. Also every day, an Azure Data Factory copy job does a lookup in the blob storage and applies a "Filter by last modified":
Start time: @adddays(utcnow(),-2)
End time: @utcnow()
The files are copied to Azure Datalake Gen2.
On normal days with 50-100 new JSON files the copy job goes fine, but on the last day of every quarter the number of new JSON files increases to 10,000+ and the copy job fails with the message "ErrorCode=SystemErrorFailToInsertSubJobForTooLargePayload,….."
Therefore I have made a new copy job that uses a ForEach loop to run parallel copy jobs. This can copy much larger volumes of files, but it still takes a couple of minutes per file, and I have not seen more than around 500 files per hour being copied, so that is still not fast enough.
So I am searching for more ways to optimize the copy. I have put in a couple of screenshots but can give more details on specifics.
The issue is with the size of the payload, which cannot be processed with the current configuration (assuming you are using the default settings).
You can optimize Copy activity performance by tuning the following in your Azure Data Factory (ADF) environment:
Data Integration Units
Self-hosted integration runtime scalability
Parallel copy
Staged copy
You can try these performance tuning steps in your ADF to increase performance.
Configure the copy optimization features in the Settings tab.
Refer to Copy activity performance optimization features for more details and a better understanding.
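If you end up keeping the ForEach approach from the question, another way to stay under the sub-job payload limit is to pre-build the file list yourself and feed it to the ForEach in batches. Below is a minimal sketch of that listing/batching step using the azure-storage-blob SDK; the connection string, container name, and batch size are placeholder assumptions, not recommendations.

```python
from datetime import datetime, timedelta, timezone

from azure.storage.blob import ContainerClient

# Placeholders: connection string, container name, and batch size are assumptions to tune.
container = ContainerClient.from_connection_string("<connection-string>", "json-landing")
cutoff = datetime.now(timezone.utc) - timedelta(days=2)

# List only the blobs modified in the look-back window (same filter the copy job uses).
recent = [b.name for b in container.list_blobs() if b.last_modified >= cutoff]

# Split the file list into batches small enough for one copy run each.
batch_size = 500
batches = [recent[i:i + batch_size] for i in range(0, len(recent), batch_size)]
print(f"{len(recent)} files -> {len(batches)} copy runs")
```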
I have a pipeline that contains multiple copy activities, and the main purpose of these activities is to merge multiple files into one single file.
The problem with this pipeline is that it takes about 4 hours to execute (to merge the files). Is there any way to reduce the duration, please?
Thanks for your reply.
If the copy operation is being performed on an Azure integration runtime, the following steps must be followed:
For Data Integration Units (DIU) and parallel copy settings, start with the default values.
If you're using a self-hosted integration runtime, you'll need to do the following:
I would recommend that you run the IR on a separate machine, kept isolated from the data store server. Start with the default values for the parallel copy settings and the self-hosted IR on a single node.
Otherwise you may leverage:
A Data Integration Unit (DIU)
It is a measure that represents the power of a single unit in Azure Data Factory and Synapse pipelines. Power is a combination of CPU, memory, and network resource allocation. DIU only applies to Azure integration runtime. DIU does not apply to self-hosted integration runtime.
Parallel Copy
You can set the parallelCopies property to indicate the parallelism you want the copy activity to use. Think of this property as the maximum number of threads within the copy activity; the threads operate in parallel, either reading from your source or writing to your sink data stores.
Here is the Microsoft documentation on how to troubleshoot copy activity performance.
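To make that concrete, here is roughly where those two knobs sit in the copy activity's typeProperties, sketched as a Python dict. The values are arbitrary starting points, the source/sink types depend on your connector, and the MergeFiles copy behavior shown on the sink is the setting that merges many source files into one output.

```python
# Rough sketch of the copy activity typeProperties that control throughput, as a Python dict.
# Values are arbitrary starting points; the exact source/sink types and where copyBehavior
# sits depend on your dataset/connector, so treat this purely as an illustration.
copy_type_properties = {
    "source": {"type": "BlobSource"},
    "sink": {
        "type": "BlobSink",
        "copyBehavior": "MergeFiles",   # merge the many source files into one output file
    },
    "dataIntegrationUnits": 32,         # DIUs on the Azure IR ("Auto" lets the service decide)
    "parallelCopies": 16,               # upper bound on parallel read/write threads
}
```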
When copying data into Azure Table, the default parallel copy is 4. The range of the DIU setting is 2-256. However, the specific behavior of DIUs differs between copy scenarios, even if you set the number yourself.
Please see the table listed here, especially the part on how DIUs behave in each copy scenario.
DIUs have some limitations, as you have seen, so you should choose the optimal setting for your particular scenario.
If you are trying to copy 1 GB of data, the DIU count never goes beyond 4.
But if you try to copy 10 GB of data, you will notice the DIU count start scaling up beyond 4.
Here is the list of the Data Integration Units.
I want to copy a CSV file in Blob storage to another Blob with Data Factory, and I want to estimate the cost I will have to pay. So, what will affect the amount, and what do I have to check?
Thanks for your help
Data Factory pricing has several factors. You can find the current pricing here.
If you are copying files from one blob to another, you can use the Azure integration runtime. The act of copying the file is considered an activity. DIUs are the compute resources that do the work. When the copy activity runs on an Azure integration runtime, the minimal allowed Data Integration Units (formerly known as Data Movement Units) is two.
You didn't specify how often you want to copy the files from one blob to another, but that factors into how many DIU-hours you will use.
So if I assume the following:
1000 activity runs
30 DIU-hours
No self-hosted IR, no data flows
1 entity unit of read/write operations
1 entity unit of monitoring operations
The Azure Pricing calculator gives me about $9 per month. You may come up with a different number if you adjust the activity runs and DIU-hours. If it is a single small file you are copying daily, you probably wouldn't use 30 DIU-hours. If it is actually 5 of the world's largest CSVs, and you copy them hourly, you might. The only real way to know the DIU hours is to build a pipeline and execute it once.
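For reference, here is roughly how those assumptions turn into the ~$9 figure. The unit prices below are approximate pay-as-you-go rates at the time of writing and vary by region, so treat them as placeholders rather than quotes.

```python
# Reproduce the ballpark from the assumptions above, using approximate pay-as-you-go rates.
activity_runs = 1000          # orchestration, billed per 1,000 runs
diu_hours = 30                # data movement on the Azure IR
rw_units = 1                  # read/write operations, per 50,000 entities
monitoring_units = 1          # monitoring operations, per 50,000 run records

cost = (
    activity_runs / 1000 * 1.00   # ~$1 per 1,000 activity runs
    + diu_hours * 0.25            # ~$0.25 per DIU-hour
    + rw_units * 0.50             # ~$0.50 per 50,000 read/write entities
    + monitoring_units * 0.25     # ~$0.25 per 50,000 monitoring records
)
print(f"~${cost:.2f}/month")      # about $9 with these example rates
```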
The estimated amount you will pay is measured in AU-hours. So you can run a job, see how much it cost, and then just do the math.
For example, this job ran 5 AUs for 33 seconds and cost 0.07 EUR.
In the AU analysis you can switch the number of AUs to use and see the cost, and check whether it is more effective to use more AUs to lower the cost, or whether fewer AUs can do the same job for less money.
You can also try calculating the cost of any Azure service with the calculator provided by Azure:
https://azure.microsoft.com/en-gb/pricing/calculator/
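To spell that math out: AU-hours are just allocated AUs multiplied by runtime, so the example above works out roughly as follows. The per-AU-hour rate is an assumption based on typical pay-as-you-go pricing, so check your region.

```python
# AU-hours = allocated AUs x job runtime in hours (the example above).
aus = 5
runtime_seconds = 33
au_hours = aus * runtime_seconds / 3600            # ~0.046 AU-hours

rate_eur_per_au_hour = 1.50                        # ASSUMED approximate pay-as-you-go rate
print(f"{au_hours:.3f} AU-hours -> ~{au_hours * rate_eur_per_au_hour:.2f} EUR")   # ~0.07 EUR
```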
My plan:
Move all data processing to Spark (PySpark preferably), with final output (consumer-facing) data going to Redshift only. Spark seems to connect to all the various sources well (DynamoDB, S3, Redshift). Output to Redshift/S3 etc. depending on customer need. This avoids having multiple Redshift clusters, broken/overused internal unsupported ETL tools, copies of the same data across clusters, views and tables etc. (which is the current setup).
Use Luigi to build a web UI to monitor pipelines daily, visualise the dependency tree, and schedule ETLs. Email notifications for failures should be an option too. An alternative is AWS Data Pipeline, but Luigi seems to have a better UI for seeing what is happening when many dependencies are involved (some trees are 5 levels deep, but perhaps this can also be avoided with better Spark code).
Questions:
Does Luigi integrate with Spark? (I have only used PySpark before, not Luigi, so this is a learning curve for me.) The plan was to schedule 'applications', and Spark itself can do ETL too I believe, so I'm unsure how Luigi integrates here.
How to account for the fact that some pipelines may be 'real time' - would I need to spin up the Spark / EMR job hourly for example then?
I'm open to thoughts / suggestions / better ways of doing this too!
To answer your questions directly,
1) Yes, Luigi does play nicely with PySpark, just like any other library. We certainly have it running without issue -- the only caveat is that you have to be a little careful with imports and have them within the functions of the Luigi class as, in the background, it is spinning up new Python instances.
2) There are ways of getting Luigi to slurp in streams of data, but it is tricky to do. Realistically, you'd fall back to running an hourly cron cycle to just call the pipeline and process any new data. This sort of reflects Spotify's use case for Luigi, where they run daily jobs to calculate top artists, etc.
As @RonD suggests, if I was building a new pipeline now, I'd skip Luigi and go straight to AirFlow. If nothing else, look at the release history. Luigi hasn't really been significantly worked on for a long time (because it works for the main dev). Whereas AirFlow is actively being incubated by Apache.
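To illustrate point 1, here is a minimal sketch of a Luigi task that runs a PySpark job, with the pyspark import kept inside run() for the reason mentioned above. The paths, app name, and aggregation are made-up placeholders.

```python
import luigi


class DailyAggregate(luigi.Task):
    """Hypothetical task: aggregate a day's events with PySpark."""
    date = luigi.DateParameter()

    def output(self):
        # Placeholder output path; Luigi uses it to decide whether the task already ran.
        return luigi.LocalTarget(f"/data/out/daily_aggregate_{self.date}.parquet")

    def run(self):
        # Import inside the task, since Luigi workers spin up fresh Python processes.
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("daily-aggregate").getOrCreate()
        df = spark.read.json(f"/data/in/events_{self.date}/*.json")   # placeholder input
        df.groupBy("user_id").count().write.parquet(self.output().path)
        spark.stop()


if __name__ == "__main__":
    luigi.run()
```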
Instead of Luigi use Apache Airflow for workflow orchestration (code is written in Python). It has a lot of operators and hooks built in which you can call in DAGs (Workflows). For example create task to call operator to start up EMR cluster, another to run PySpark script located in s3 on cluster, another to watch the run for status. You can use tasks to set up dependencies etc too.
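A rough sketch of that Airflow pattern, using the EMR operators from the Amazon provider package; exact import paths and the job flow/step definitions vary by Airflow and provider version, and the S3 script path is a placeholder.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

# Placeholder PySpark step; the script location is an assumption.
SPARK_STEP = [{
    "Name": "run_pyspark_etl",
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "s3://my-bucket/scripts/etl_job.py"],
    },
}]

with DAG("emr_pyspark_etl", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:

    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_cluster",
        job_flow_overrides={"Name": "etl-cluster"},   # trimmed; add instances, release, etc.
    )

    add_step = EmrAddStepsOperator(
        task_id="add_step",
        job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster', key='return_value') }}",
        steps=SPARK_STEP,
    )

    watch_step = EmrStepSensor(
        task_id="watch_step",
        job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster', key='return_value') }}",
        step_id="{{ task_instance.xcom_pull(task_ids='add_step', key='return_value')[0] }}",
    )

    create_cluster >> add_step >> watch_step
```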
During my cloud dataprep adventures I have come across yet another very annoying bug.
The problem occurs when creating complex flow structures which need to be connected through reference datasets. If a certain limit is crossed in the number of unions or joins performed with these sets, Dataflow is unable to start the job.
I have had a lot of contact with support and they are working on the issue:
"Our Systems Engineer Team was able to determine the root cause resulting into the failed job. They mentioned that the job is too large. That means that the recipe (combined from all datasets) is too big, and Dataflow rejects it. Our engineering team is still investigating approaches to address this.
A workaround is to split the job into two smaller jobs. The first run the flow for the data enrichment, and then use the output as input in the other flow. While it is not ideal, this would be a working solution for the time being."
I ran into the same problem and have a fairly educated guess as to the answer. Keep in mind that DataPrep simply takes all your GUI-based inputs and translates them into Apache Beam code. When you pass in a reference dataset, it probably writes some Beam code that turns the reference dataset into a side input (https://beam.apache.org/documentation/programming-guide/). DataFlow will perform a Parallel Do (ParDo) function where it takes each element from a PCollection, stuffs it into a worker node, and then applies the side-input data for transformation.
So I am pretty sure if the reference sets get too big (which can happen with Joins), the underlying code will take an element from dataset A, pass it to a function with side-input B...but if side-input B is very big, it won't be able to fit into the worker memory. Take a look at the Stackdriver logs for your job to investigate if this is the case. If you see 'GC (Allocation Failure)' in your logs this is a sign of not enough memory.
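For anyone curious, this is roughly what that side-input pattern looks like in Beam's Python SDK; a minimal sketch with made-up data, not what DataPrep actually generates.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    # Main dataset: (user_id, amount) pairs.
    main = p | "Main" >> beam.Create([("u1", 10), ("u2", 25), ("u1", 5)])

    # Reference dataset, passed as a side input. Beam materializes it and ships it to
    # every worker, which is why a very large reference set can exhaust worker memory.
    reference = p | "Reference" >> beam.Create([("u1", "gold"), ("u2", "silver")])

    enriched = main | "Join" >> beam.Map(
        lambda kv, tiers: (kv[0], kv[1], tiers.get(kv[0], "unknown")),
        tiers=beam.pvalue.AsDict(reference),
    )

    enriched | beam.Map(print)
```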
You can try doing this: suppose you have two CSV files to read in and process, file A is 4 GB and file B is also 4 GB. If you kick off a job to perform some type of Join, it will very quickly outgrow the worker memory and puke. If you CAN, see if you can pre-process in a way where one of the files is in the MB range and just grow the other file.
If your data structures don't lend themselves to that option, you could do what the Sys Engs suggested, split one file up into many small chunks and then feed it to the recipe iteratively against the other larger file.
Another option to test is specifying the compute type for the workers. You can iteratively grow the compute type larger and larger to see if it finally pushes through.
The other option is to code it all up yourself in Apache Beam, test locally, then port to Google Cloud DataFlow.
Hopefully these guys fix the problem soon, they don't make it easy to ask them questions, that's for sure.