I want to copy a CSV file in Blob storage to another Blob using Data Factory, and I want to estimate the cost I will have to pay. What factors affecting the amount do I need to check?
Thanks for your help
Data Factory pricing has several factors. You can find the current pricing here.
This is what it looks like today.
If you are copying files from one blob to another, you can use the Azure integration runtime. The act of copying the file is considered an activity. DIUs are the compute resources that do the work. When the copy activity runs on an Azure integration runtime, the minimum allowed number of Data Integration Units (formerly known as Data Movement Units) is two.
You didn't specify how often you want to copy the files from one blob to another, but that factors into how many DIU-hours you will use.
So if I assume the following:
1000 activity runs
30 DIU-hours
No self-hosted IR, no data flows
1 entity unit of read/write operations
1 entity unit of monitoring operations
The Azure Pricing calculator gives me about $9 per month. You may come up with a different number if you adjust the activity runs and DIU-hours. If it is a single small file you are copying daily, you probably wouldn't use 30 DIU-hours. If it is actually 5 of the world's largest CSVs, and you copy them hourly, you might. The only real way to know the DIU hours is to build a pipeline and execute it once.
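If you want to reproduce the calculator's arithmetic yourself, here is a minimal sketch. The per-unit rates are assumptions based on the published Azure integration runtime prices at the time of writing; plug in the current figures from the pricing page before relying on the result.

# Rough ADF cost estimate for the assumptions above.
# The per-unit rates are assumptions based on the published Azure IR
# prices at the time of writing -- check the current pricing page.
activity_runs = 1000          # orchestration: activity runs per month
diu_hours = 30                # data movement: DIU-hours per month
read_write_entities = 1       # units of 50,000 read/write operations
monitoring_entities = 1       # units of 50,000 monitoring operations

PRICE_PER_1000_RUNS = 1.00    # USD per 1,000 activity runs (assumed)
PRICE_PER_DIU_HOUR = 0.25     # USD per DIU-hour on the Azure IR (assumed)
PRICE_READ_WRITE = 0.50       # USD per 50,000 read/write operations (assumed)
PRICE_MONITORING = 0.25       # USD per 50,000 monitoring operations (assumed)

total = (activity_runs / 1000 * PRICE_PER_1000_RUNS
         + diu_hours * PRICE_PER_DIU_HOUR
         + read_write_entities * PRICE_READ_WRITE
         + monitoring_entities * PRICE_MONITORING)

print(f"Estimated monthly cost: ${total:.2f}")   # about $9.25, in line with the figure above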
The estimated amount you will pay is measured in AU-hours, so you can run a job, see how much it will cost, and then just do the math.
For example, this job ran 5 AUs for 33 seconds and cost 0.07 EUR.
In the AU analysis, you can change the number of AUs used and see the resulting cost, and check whether it is more effective to use more AUs to lower the cost, or whether fewer AUs can do the same job for less money.
You can also try to calculate the cost of any Azure service at the link provided by Azure:
https://azure.microsoft.com/en-gb/pricing/calculator/
Related
Situation: Every day a bunch of JSON files are generated and put into Azure Blob storage. Also every day, an Azure Data Factory copy job does a lookup in the blob storage and applies a "Filter by last modified":
Start time: @adddays(utcnow(),-2)
End time: @utcnow()
The files are copied to Azure Datalake Gen2.
On normal days with 50-100 new JSON files the copy job goes fine, but on the last day of every quarter the number of new JSON files increases to 10,000+ and then the copy job fails with the message "ErrorCode=SystemErrorFailToInsertSubJobForTooLargePayload,….."
Therefore I have made a new copy job that uses a for each loop to run parallel copy jobs. This can copy much larger volumes of files, but it still takes a couple of minutes per file and I have not seen more than around 500 files per hour being copied, so that is still not fast enough.
Therefore I am searching for more ways to optimize the copy. I have put in a couple of screen shots but can give more details on specifics.
The issue is with the size of the payload, which cannot be processed with the current configuration (assuming you are using the default settings).
You can optimize the copy activity performance by considering the following options in your Azure Data Factory (ADF) environment:
Data Integration Units
Self-hosted integration runtime scalability
Parallel copy
Staged copy
You can try these performance tuning steps in your ADF to increase performance.
Configure the copy optimization features in the Settings tab, along the lines of the sketch below.
Refer to Copy activity performance optimization features for more details and a better understanding.
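As a rough illustration of where these knobs live, here is a minimal sketch of a copy activity's typeProperties, expressed as a Python dict that mirrors the pipeline JSON. The numbers, the staging linked service name, and the container path are placeholders, not recommendations.

# Sketch of the copy activity settings behind the tuning options above,
# mirroring the JSON fragment in the pipeline definition. Values are placeholders.
import json

copy_activity_type_properties = {
    # source and sink definitions omitted for brevity
    "dataIntegrationUnits": 32,      # DIUs used on the Azure IR (placeholder value)
    "parallelCopies": 16,            # upper bound on parallel threads (placeholder value)
    "enableStaging": True,           # staged copy through an interim Blob store
    "stagingSettings": {
        "linkedServiceName": {
            "referenceName": "StagingBlobStorage",   # hypothetical linked service
            "type": "LinkedServiceReference",
        },
        "path": "staging-container",                 # hypothetical container
    },
}

print(json.dumps(copy_activity_type_properties, indent=2))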
I have a pipeline that contains multiple copy activities, and the main purpose of these activities is to merge multiple files into one single file.
The problem with this pipeline is that it takes about 4 hours to execute (to merge the files). Is there any way to reduce the duration, please?
Thanks for your reply.
If the copy operation is being performed on an Azure integration runtime, the following steps must be followed:
For the Data Integration Units (DIU) and parallel copy settings, start with the default values.
If you're using a self-hosted integration runtime, you'll need to do the following:
I would recommend that you run the IR on a separate computer. The machine should be kept isolated from the data store server. Start with the default values for the parallel copy settings and use the self-hosted IR on a single node.
Otherwise, you may leverage:
Data Integration Units (DIU)
A DIU is a measure that represents the power of a single unit in Azure Data Factory and Synapse pipelines. Power is a combination of CPU, memory, and network resource allocation. DIUs only apply to the Azure integration runtime; they do not apply to the self-hosted integration runtime.
Parallel Copy
You can set the parallelCopies property to indicate the parallelism you want the copy activity to use. Think of this property as the maximum number of threads within the copy activity. The threads operate in parallel, either reading from your source or writing to your sink data stores.
Here is the Microsoft documentation on troubleshooting copy activity performance.
When copying data into Azure Table, the default parallel copy is 4. The range of the DIU setting is 2-256. However, the specific behavior of DIUs differs between copy scenarios, even if you set the number yourself.
Please see the table listed here, especially the part below.
DIUs have some limitations, as you have seen, so you can choose the optimal setting for your particular scenario.
If you are trying to copy 1 GB of data, the DIU count somehow never crosses 4.
But if you try to copy 10 GB of data, you will notice the DIU count starts scaling up beyond 4.
Here is the list of the Data Integration Units.
Hi, I'm pretty new to the Azure suite. I'm looking to do a copy activity in a Data Factory pipeline from one Azure storage container to another, to be used as a backup. This will be copied every 2 weeks, and the old copy deleted when the new one is written. The pipeline carries out the copy with 167 GB of data read and the same written, based on the debug details. Azure pricing says that every 4 MB is one operation, so 167,000 / 4 = 41,750, x 2 for read and write transactions. When I put that into the pricing calculator, I get it as being pretty cheap, only about $4-5 a month including storage. Are there other costs I'm missing? Cluster costs etc.? Or have I totally missed something else? I removed iterative write operations, which seem to be the most costly operations in the calculator, because I can't see that I'm doing those in this case, simply copying data with no transformations.
Any info on this would be much appreciated, found it a little difficult to fully grasp.
Thanks in advance.
When running a pipeline with copy activities, the other cost you incur is related to pipeline orchestration and execution.
You can view pipeline run consumption from the monitoring section in the Data Factory UI by clicking on the consumption link next to the pipeline run entry.
That will show you the pipeline run consumption breakdown in terms of activity runs, data movement DIU-hours, and execution hours.
For a simple copy task like the one you described, this cost is insignificant, but you can click on the pricing calculator link in the consumption view and play around with different scenarios to estimate the cost of your ADF pipelines.
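As a rough sketch of how you might turn that consumption breakdown into a monthly figure, here is the arithmetic with placeholder numbers; the DIU count, the copy duration, and the rates are assumptions you should replace with the values from your own run and the current price list.

# Back-of-envelope for the ADF side of the fortnightly 167 GB backup copy.
# The DIU count and duration are placeholders -- read the real values from
# the "Consumption" link on your pipeline run, and check current prices.
runs_per_month = 2                  # one backup every two weeks
activities_per_run = 1              # a single copy activity

dius_used = 4                       # placeholder from a hypothetical debug run
copy_duration_hours = 0.5           # placeholder from a hypothetical debug run

PRICE_PER_1000_RUNS = 1.00          # USD, assumed Azure IR orchestration rate
PRICE_PER_DIU_HOUR = 0.25           # USD, assumed Azure IR data movement rate

orchestration = runs_per_month * activities_per_run / 1000 * PRICE_PER_1000_RUNS
data_movement = runs_per_month * dius_used * copy_duration_hours * PRICE_PER_DIU_HOUR

print(f"ADF orchestration: ${orchestration:.4f}/month")   # a fraction of a cent
print(f"ADF data movement: ${data_movement:.2f}/month")   # about $1 with these placeholders

With numbers in this range, the ADF side stays around a dollar a month, consistent with the point above that it is insignificant for a simple copy.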
I have files that accumulate in Blob Storage on Azure that are moved each hour to ADLS with Data Factory... there are around 1000 files per hour, and they are 10 to 60 KB per file...
What is the best combination of:
"parallelCopies": ?
"cloudDataMovementUnits": ?
and also,
"concurrency": ?
to use?
Currently I have all of these set to 10, and each hourly slice takes around 5 minutes, which seems slow?
Could ADLS or Blob be getting throttled, and how can I tell?
There won't be one solution that fits all scenarios when it comes to optimizing a copy activity. However, there are a few things you can check out to find a balance. A lot of it depends on the pricing tiers, the type of data being copied, and the type of source and sink.
I am pretty sure that you would have come across this article.
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance
This is a reference performance sheet; the values will definitely differ depending on the pricing tiers of your source and destination items.
Parallel Copy:
This happens at the file level, so it is beneficial if your source files are big, as it chunks the data (from the article):
Copy data between file-based stores: between 1 and 32. Depends on the size of the files and the number of cloud data movement units (DMUs) used to copy data between two cloud data stores, or the physical configuration of the Self-hosted Integration Runtime machine.
The default value is 4.
The copy behavior setting is important: if it is set to MergeFiles, then parallel copy is not used.
Concurrency:
This is simply how many instances of the same activity you can run in parallel.
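For reference, the three settings from the question sit in different places in the (older, v1-style) copy activity definition. Below is a minimal sketch, mirrored as a Python dict with placeholder values; the property placement is based on the ADF v1 JSON schema and is worth double-checking against your own pipeline definition.

# Where the three knobs from the question sit in an ADF v1 copy activity,
# mirrored as a Python dict. All values are placeholders; as noted above,
# there is no single right combination.
copy_activity = {
    "name": "BlobToAdlsCopy",             # hypothetical activity name
    "type": "Copy",
    "typeProperties": {
        # source and sink definitions omitted for brevity
        "parallelCopies": 10,             # threads used within a single copy run
        "cloudDataMovementUnits": 10,     # compute units allocated to the copy
    },
    "policy": {
        "concurrency": 10,                # how many slices of this activity may run at once
    },
}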
Other considerations:
Compression:
Codec
Level
The bottom line is that you can pick and choose the compression: faster compression will increase network traffic, while slower compression will increase the time consumed.
Region:
The location or region of the data factory, source, and destination might affect performance and especially the cost of the operation. Having them all in the same region might not always be feasible depending on your business requirements, but it is definitely something you can explore.
Specific to Blobs
https://learn.microsoft.com/en-us/azure/storage/common/storage-performance-checklist#blobs
This article gives you a good number of metrics to improve performance; however, when using Data Factory I don't think there is much you can do at this level. You can use application monitoring to check the throughput while your copy is going on.
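To get a feel for why each hourly slice takes minutes despite the small data volume, here is a quick back-of-envelope using the figures from the question; with files this small, per-file overhead rather than raw bandwidth tends to dominate.

# Rough check on where the 5 minutes per hourly slice goes, using the
# numbers from the question (about 1000 files of 10-60 KB each per hour).
files_per_slice = 1000
avg_file_kb = 35                      # midpoint of 10-60 KB
slice_seconds = 5 * 60

total_mb = files_per_slice * avg_file_kb / 1024
throughput_kb_s = files_per_slice * avg_file_kb / slice_seconds
seconds_per_file = slice_seconds / files_per_slice

print(f"Data per slice: ~{total_mb:.0f} MB")              # ~34 MB
print(f"Effective throughput: ~{throughput_kb_s:.0f} KB/s")
print(f"Overhead per file: ~{seconds_per_file:.2f} s")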
I have a lot of small unstructured JSON files (less than 1 KB each) that I want to store on Google Cloud Storage somehow (using streaming). I would prefer to avoid putting them into zip files (I think), since I'm thinking of using Apache Drill to run queries against them. Would it be more cost-effective to merge multiple JSON documents together rather than storing them one by one? (I assume that writing the files in batches would be a good thing regardless of whether they're merged or stored separately.)
Well...maybe. It depends on your usage pattern.
GCS does not have a per-object charge. Instead, it charges per Gigabyte stored per month. Breaking the files up won't affect that at all.
However, GCS also charges a per-operation fee. At time of writing, every 10,000 downloads will cost you a penny, and every 10,000 uploads will cost you a dime. If you only have a few thousand files or only access a few files at a time, this might not make a big difference, but if you need to download all of the files frequently, or if you need to replace them frequently, and you're doing millions or billions of separate uploads per day, suddenly using a few big files instead could save you a lot of money.
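To put rough numbers on that, here is a small sketch using the per-operation figures quoted above; the monthly request counts and the batch size are made-up assumptions purely for illustration.

# Compare per-operation cost of storing objects individually vs. merged,
# using the figures quoted above (a penny per 10,000 downloads, a dime per
# 10,000 uploads). Per-GB storage cost is the same either way, so it is ignored.
def monthly_operation_cost(uploads: int, downloads: int) -> float:
    """Rough per-operation cost in USD for a month's worth of requests."""
    return uploads / 10_000 * 0.10 + downloads / 10_000 * 0.01

small_files = 3_000_000          # assumed: ~100k new JSON documents a day for a month
batch_size = 1_000               # assumed: documents merged per object

individually = monthly_operation_cost(uploads=small_files, downloads=small_files)
merged = monthly_operation_cost(uploads=small_files // batch_size,
                                downloads=small_files // batch_size)

print(f"Stored one by one: ${individually:.2f}/month")   # ~$33 with these assumptions
print(f"Merged in batches: ${merged:.2f}/month")         # ~$0.03 with these assumptions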
If you can estimate how many downloads and uploads you'll be doing under each scenario, Google provides a calculator to let you know what it will cost: https://cloud.google.com/products/calculator/