Performance issues when merging large files into one single file - azure-data-factory

I have a pipeline that contains multiple copy activities, and the main purpose of these activities is to merge multiple files into one single file.
The problem with this pipeline is that it takes about 4 hours to execute (to merge the files). Is there any way to reduce the duration?
Thanks for your reply.

If the copy operation is being performed on an Azure integration runtime, follow these steps:
Start with the default values for the Data Integration Units (DIU) and parallel copy settings.
If you're using a self-hosted integration runtime, do the following:
I would recommend running the IR on a dedicated machine, kept separate from the server hosting the data store. Start with the default values for the parallel copy settings and run the self-hosted IR on a single node.
Otherwise, you can leverage:
Data Integration Units (DIU)
A DIU is a measure that represents the power of a single unit in Azure Data Factory and Synapse pipelines. Power is a combination of CPU, memory, and network resource allocation. DIU applies only to the Azure integration runtime; it does not apply to the self-hosted integration runtime.
Parallel copy
You can set the parallelCopies property to indicate the parallelism you want the copy activity to use. Think of this property as the maximum number of threads within the copy activity. The threads operate in parallel, either reading from your source or writing to your sink data stores.
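To make those two knobs concrete, here is a minimal sketch of a copy activity fragment, written as a Python dict only so it can be dumped as the JSON you would put in the pipeline definition. The activity and dataset names are placeholders and the values are illustrative starting points, not recommendations.

```python
import json

# Minimal, illustrative copy activity fragment. Activity and dataset names are
# placeholders; dataIntegrationUnits and parallelCopies are the typeProperties
# that control DIU and parallel copy.
copy_activity = {
    "name": "MergeSourceFiles",
    "type": "Copy",
    "inputs": [{"referenceName": "SourceFilesDataset", "type": "DatasetReference"}],
    "outputs": [{"referenceName": "MergedFileDataset", "type": "DatasetReference"}],
    "typeProperties": {
        "source": {"type": "BlobSource"},
        "sink": {
            "type": "BlobSink",
            "copyBehavior": "MergeFiles"   # merge many source files into one sink file
        },
        "dataIntegrationUnits": 32,  # upper bound; the service scales within 2-256
        "parallelCopies": 8          # max threads; may not apply when merging files
    }
}

print(json.dumps(copy_activity, indent=2))
```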
Here is the Microsoft documentation on troubleshooting copy activity performance.
When copying data into Azure Table storage, the default parallel copy is 4. The range of the DIU setting is 2-256. However, the specific behavior of DIU differs between copy scenarios, even if you set the number yourself.
Please see the table here, especially the part below.
DIU has some limitations, as you can see, so choose the optimal setting for your own scenario.
If you are copying 1 GB of data, the DIU will never go above 4. But if you copy 10 GB of data, you will notice the DIU scaling up beyond 4.
Here is the list of Data Integration Units.

Related

Optimize Azure Data Factory copy of 10,000+ JSON files from BLOB storage to ADLS G2

Situation: Every day a bunch of JSON files are generated and put into Azure Blob storage. Also every day an Azure Data Factory copy job does a lookup in the blob storage and applies a "Filter by last modified":
Start time: #adddays(utcnow(),-2)
End time: #utcnow()
The files are copied to Azure Datalake Gen2.
On normal days with 50-100 new JSON files the copy job goes fine, but on the last day of every quarter the number of new JSON files increases to 10,000+ and the copy job fails with the message "ErrorCode=SystemErrorFailToInsertSubJobForTooLargePayload,…..".
Therefore I have made a new copy job that uses a ForEach loop to run parallel copy jobs. This can copy much larger volumes of files, but it still takes a couple of minutes per file, and I have not seen more than around 500 files per hour being copied, so that is still not fast enough.
Therefore I am searching for more ways to optimize the copy. I have put in a couple of screenshots but can give more details on specifics.
The issue is with the size of the payload, which cannot be processed with the current configuration (assuming you are using default settings).
You can optimize Copy activity performance by considering the following changes in your Azure Data Factory (ADF) environment:
Data Integration Units
Self-hosted integration runtime scalability
Parallel copy
Staged copy
You can try these performance tuning steps in ADF to increase performance.
Configure the copy optimization features in the Settings tab.
Refer to Copy activity performance optimization features for more details and a better understanding.
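As a rough sketch of how those features can be switched on together, here is the relevant typeProperties fragment, written as a Python dict that mirrors the copy activity JSON. The staging linked service name and path are placeholders for your own, and the values are starting points rather than recommendations.

```python
import json

# Illustrative typeProperties for a tuned copy activity. The staging linked
# service name and container path are placeholders; values are starting points.
type_properties = {
    "source": {"type": "JsonSource"},
    "sink": {"type": "JsonSink"},
    "dataIntegrationUnits": 16,   # Data Integration Units (Azure IR only)
    "parallelCopies": 16,         # parallel threads within the single copy activity
    "enableStaging": True,        # staged copy: land data in interim storage first
    "stagingSettings": {
        "linkedServiceName": {
            "referenceName": "StagingBlobStorage",  # hypothetical linked service
            "type": "LinkedServiceReference"
        },
        "path": "staging-container"                 # hypothetical staging path
    }
}

print(json.dumps(type_properties, indent=2))
```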

Price of copy activity in Azure Data Factory

Hi, I'm pretty new to the Azure suite. I'm looking to do a copy activity in a Data Factory pipeline from one Azure storage container to another, to be used as a backup. The copy will run every 2 weeks, and the old backup will be deleted when the new one is written. Based on the debug details, the pipeline reads 167 GB of data and writes the same amount. Azure pricing says that every 4 MB is one operation, so 167,000 / 4 = 41,750, times 2 for read and write transactions. When I put this into the pricing calculator it comes out pretty cheap, only about $4-5 a month including storage. Are there other costs I'm missing? Cluster costs, etc.? Or have I totally missed something else? I removed iterative write operations, which seem to be the most costly operations in the calculator, because I can't see that I'm doing those here; I'm simply copying data with no transformations.
Any info on this would be much appreciated; I found it a little difficult to fully grasp.
Thanks in advance.
When running a pipeline with copy activities, the other cost you incur is related to pipeline orchestration and execution.
You can view pipeline run consumption from the monitoring section in the Data Factory UI by clicking the consumption link next to the pipeline run entry.
That will show you the pipeline run consumption breakdown in terms of activity runs, data movement DIU-hours, and execution hours.
For a simple copy task like you described, this cost is insignificant, but you can click the pricing calculator link in the consumption view and play around with different scenarios to estimate the cost of your ADF pipelines.
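To make the arithmetic concrete, here is a small back-of-the-envelope sketch that mirrors the numbers from the question. The meter rates are left as function parameters, since they vary by region and change over time, and the DIU-hours and execution hours should be taken from the consumption view for your own run.

```python
# Storage-transaction math from the question (167 GB read and written, 4 MB per operation).
data_gb = 167
mb_per_transaction = 4

transactions_one_way = data_gb * 1000 / mb_per_transaction   # ~41,750 operations
total_transactions = transactions_one_way * 2                # read + write
print(f"storage transactions per backup run: {total_transactions:,.0f}")

# ADF meters from the consumption breakdown. Plug in your own figures from the
# consumption view and the current rates from the pricing page; the rates here
# are parameters, not quoted prices.
def adf_run_cost(activity_runs, diu_hours, execution_hours,
                 rate_per_1000_runs, rate_per_diu_hour, rate_per_execution_hour):
    """Orchestration + data movement + activity execution for one pipeline run."""
    return (activity_runs / 1000) * rate_per_1000_runs \
        + diu_hours * rate_per_diu_hour \
        + execution_hours * rate_per_execution_hour
```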

Create data as a load on JMeter

I'm working on Kubernetes on Microsoft Azure with real data. Now I need to generate a sample of data in JMeter and then use it as a workload to stress the CPU of the Tea-Store microservices on Kubernetes. Any hint or source on how to do that, and which types of files work with JMeter?
If you want a specific answer, you need to ask a more specific question.
The most common parameterization options are:
If you need to ingest data from external data sources:
CSV Data Set Config allows reading CSV files into JMeter Variables, so each virtual user on each iteration reads the next line from the CSV file (see the generation sketch after this list)
The __CSVRead() function does more or less the same; however, it can be declared/used at runtime, so you can have a dynamic filename/path and you decide when to proceed to the next column/row
The JDBC Request sampler allows reading test data from a database or creating test data in a database
The __StringFromFile() function reads the next line from a file each time it is called
The __FileToString() function reads a whole file into memory/a variable
If you need to generate brand new/random data:
__threadNum() - number of current thread
__time() and __timeShift() - current timestamp in various formats plus possibility to generate dates in future or past
__Random() - generate a random number
__RandomString() - generate a random string out of provided characters
__UUID() - generate unique GUID-like structure
__groovy() - for everything else, it executes arbitrary Groovy code and returns the result
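For example, if you go the CSV Data Set Config route mentioned above, a throwaway script like the one below is enough to generate a data file for your virtual users. The file name, column names, and value ranges are invented for illustration; adjust them to whatever parameters your requests actually need.

```python
import csv
import random
import string
import uuid

# Generate a throwaway CSV that CSV Data Set Config (or __CSVRead) can consume.
def random_string(length=8):
    return "".join(random.choices(string.ascii_lowercase, k=length))

with open("testdata.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["userId", "productId", "quantity", "sessionToken"])
    for _ in range(10_000):
        writer.writerow([
            random_string(),          # userId
            random.randint(1, 500),   # productId
            random.randint(1, 5),     # quantity
            str(uuid.uuid4()),        # sessionToken
        ])
```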
In addition to Dmitri's great answer, I would like to add my few cents.
Please take a look at the 13-Step Guide to Performance Testing in Kubernetes, especially
Step 12: Automating the Performance Tests
When running performance tests, we need to run these tests for a range
of workload scenarios (e.g. concurrency levels, heap sizes, message
sizes, etc.). Running the tests manually for each of these scenarios
is time-consuming and likely to cause errors. Therefore it is
important to automate the performance tests prior to executing them.
We automate our performance tests using a shell script:
start_performance_test.sh.
This script can give you an idea for something similar. Also, overall the article introduces JMeter usage with some examples.

Google Dataprep: number of instances and architecture optimisation

I have noticed that every destination in Google Dataprep (be it manual or scheduled) spins up a Compute Engine instance. The quota limit for a normal account is 8 instances max.
Look at this flow:
dataprep flow
Since data wrangling is composed of multiple layers and you might want to materialize intermediate steps with exports, what is the best approach/architecture for running Dataprep flows?
Option A
run 2 separate flows and schedule them 15 minutes apart:
first flow will export only the final step
other flow will export intermediate steps only
this way you're not hitting the quota limit, but you're still calculating the early stages of the same flow multiple times
Option B
leave the flow as it is and request more Compute Engine quota: the computational effort is the same, I will just have more instances running in parallel instead of sequentially
Option C
each step has its own flow + create a reference dataset:
this way each flow will only run one single step.
E.g.
when I run the job "1549_first_repo" I will no longer calculate the 3 previous steps but only the last one: the transformations between the referenced "5912_first" table and "1549_first_repo".
This last option seems the most reasonable to me, as each transformation is run at most once. Am I missing something?
Also, is there a way to run each export sequentially instead of in parallel?
-- EDIT 30. May--
It turns out option C is not the way to go, as "referencing" is a pure continuation of the previous flow. You can think of the flow before and after the referenced dataset as one single flow.
Still trying to figure out how to achieve modularity without redundantly calculating the same operations.
Both options A and B are good, the difference being the quota increase. If you are expecting to upgrade sooner or later, you might as well do it sooner.
Another option, if you are familiar with Java or Python and Dataflow, is to create a pipeline with a combination of numWorkers, workerMachineType, and maxNumWorkers that fits within your trial limit of 8 cores (virtual CPUs). Here are the pipeline options, and here is a tutorial that can give you a better view of the product.
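As a minimal sketch of that idea using Beam's Python SDK: the project, region, and bucket names are placeholders, the read/write steps are just a stand-in for your real transforms, and the worker options are one example of staying inside an 8-vCPU quota (note the Python option names differ slightly from the Java ones quoted above).

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Cap the job at 8 small workers so the total vCPU count stays within the quota.
options = PipelineOptions(
    flags=[],
    runner="DataflowRunner",
    project="my-gcp-project",          # placeholder
    region="us-central1",              # placeholder
    temp_location="gs://my-bucket/tmp",
    num_workers=2,                     # numWorkers
    max_num_workers=8,                 # maxNumWorkers
    machine_type="n1-standard-1",      # workerMachineType (1 vCPU per worker)
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/part")
    )
```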

Data Factory Copy Activity Blob -> ADLS

I have files that accumulate in Blob Storage on Azure and are moved each hour to ADLS with Data Factory. There are around 1,000 files per hour, each 10 to 60 KB.
What is the best combination of:
"parallelCopies": ?
"cloudDataMovementUnits": ?
and also,
"concurrency": ?
to use?
Currently I have all of these set to 10, and each hourly slice takes around 5 minutes, which seems slow.
Could ADLS or Blob be getting throttled, and how can I tell?
There won't be one solution that fits all scenarios when it comes to optimizing a copy activity. However, there are a few things you can check to find a balance. A lot of it depends on the pricing tiers, the type of data being copied, and the type of source and sink.
I am pretty sure you have already come across this article:
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance
It contains a reference performance sheet; the values will definitely differ depending on the pricing tiers of your source and destination items.
Parallel copy:
This happens at the file level, so it is beneficial if your source files are big, as the data is chunked. From the article:
Copy data between file-based stores: between 1 and 32. Depends on the size of the files and the number of cloud data movement units (DMUs) used to copy data between two cloud data stores, or the physical configuration of the self-hosted integration runtime machine.
The default value is 4.
The copy behavior is important: if it is set to mergeFile, then parallel copy is not used.
Concurrency:
This is simply how many instances of the same activity you can run in parallel.
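To tie the three settings from the question together, here is a sketch of where each one lives, using the older property names (cloudDataMovementUnits was later renamed to dataIntegrationUnits). The activity name and the values are illustrative starting points, not recommendations.

```python
import json

# parallelCopies and cloudDataMovementUnits sit in typeProperties;
# concurrency sits on the activity policy. Values are illustrative only.
activity = {
    "name": "BlobToAdls",
    "type": "Copy",
    "typeProperties": {
        "source": {"type": "BlobSource"},
        "sink": {"type": "AzureDataLakeStoreSink"},
        "parallelCopies": 8,          # per-activity, file-level parallelism
        "cloudDataMovementUnits": 4   # DMUs; small files rarely push this higher
    },
    "policy": {
        "concurrency": 4              # how many runs/slices of this activity in parallel
    }
}

print(json.dumps(activity, indent=2))
```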
Other considerations:
Compression:
Codec
Level
The bottom line is that you can pick and choose the compression: faster compression will increase network traffic, slower compression will increase the time consumed.
Region:
The location or region of the data factory, source, and destination might affect performance and especially the cost of the operation. Having them all in the same region might not always be feasible depending on your business requirements, but it is definitely something you can explore.
Specific to Blobs
https://learn.microsoft.com/en-us/azure/storage/common/storage-performance-checklist#blobs
This article gives you a good number of metrics for improving performance; however, when using Data Factory I don't think there is much you can do at this level. You can use application monitoring to check throughput while your copy is running.