Hi, I am new to Azure Data Factory and not at all familiar with the back-end processing that runs behind the scenes. I am wondering if there is a performance impact to running a couple of data flows in parallel compared to having all the transformations in one data flow.
I am trying to stage some data with a not exists transformation, and I have to do it for multiple tables. When I test-ran two data flows in parallel, the clusters for both data flows were brought up simultaneously. But I am not sure whether it is better to distribute the loading of tables across a couple of data flows or to have all the transformations in one data flow.
1: If you execute data flows in a pipeline in parallel, ADF will spin-up separate Spark clusters for each based on the settings in your Azure Integration Runtime attached to each activity.
2: If you put all of your logic inside a single data flow, then it will all execute in that same job execution context on a single Spark cluster instance.
3: Another option is to execute the activities in serial in the pipeline. If you have set a TTL on the Azure IR configuration, then ADF will reuse the compute resources (VMs), but you will still get a brand-new Spark context for each execution.
All are valid practices and which one you choose should be driven by your requirements for your ETL process.
No. 3 will likely take the longest time to execute end-to-end. But it does provide a clean separation of operations in each data flow step.
No. 2 could be more difficult to follow logically and doesn't give you much re-usability.
No. 1 is really similar to #3, but you run them all in parallel. Of course, not every end-to-end process can run in parallel. You may require a data flow to finish before starting the next, in which case you're back in #3 serial mode.
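To make the trade-off between No. 1 and No. 3 concrete, here is a back-of-the-envelope model in Python. It is only a sketch: the function names and every number in it are hypothetical, and it assumes nothing about ADF beyond what is described above (parallel data flows each start their own cluster, while serial data flows with a TTL reuse the VMs but start a new Spark context each time).

```python
def parallel_elapsed(run_minutes, cluster_startup):
    """Option 1: every data flow starts its own cluster at the same time,
    so the slowest data flow determines the end-to-end time.
    run_minutes must be a non-empty list."""
    return cluster_startup + max(run_minutes)


def serial_elapsed(run_minutes, cluster_startup, context_startup):
    """Option 3 with a TTL: only the first data flow pays the cluster
    start-up; each later one pays for a fresh Spark context instead."""
    later_flows = len(run_minutes) - 1
    return cluster_startup + sum(run_minutes) + context_startup * later_flows


# Hypothetical timings in minutes, for illustration only.
flows = [10, 6, 4]
print(parallel_elapsed(flows, cluster_startup=5))
print(serial_elapsed(flows, cluster_startup=5, context_startup=1))
```

Plug in timings measured from your own pipeline runs; the model ignores cost, and parallel execution uses more clusters at once.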
Related
As a part of our architecture, we have Kinesis streams which will send data streams to Redshift. The current thought process is to create an external schema on top of the Kinesis streams and then use materialized views to persist the data with minimal transformations as needed.
However, in addition to this, there is a need to perform a series of (quite complex) transformations on this data and load it into target tables. I was thinking about using stored procedures to perform these transformations and load into a target table. So the flow is like Kinesis Streams -> External View (Real time) -> Batch Processing (Materialized view and stored procedure).
To call the stored procedures (SP) on a regular schedule, the idea was to schedule the SQL queries. Being fairly new to AWS Redshift and Kinesis streaming ingestion, and still exploring the available options, I would like to get thoughts on the above approach.
I also understand that there are limitations with stored procedures (https://docs.aws.amazon.com/redshift/latest/dg/stored-procedure-constraints.html), and that scheduling SQL queries might not allow a good orchestration flow. Hence I would like to understand what alternative methods are available to implement the above, specifically for the processing after the data is available in the external view.
I have a Spring Batch solution which reads several tables in an Oracle database, does some flattening and cleaning of data, and sends it to a RESTful API which is our BI platform. The Spring Batch job breaks this data down into chunks by date and not by size. It may happen that on a particular day, one chunk consists of a million rows. We are running the complete end-to-end flow in the following way:
Control-M sends a trigger to the Load Balancer at a scheduled time
Through the Load Balancer, the request lands on an instance of the Spring Batch app
Spring Batch reads data for that day in chunks from Oracle database
Chunks are then sent to target API
My problems are:
The chunks can get heavier. If a chunk contains a million rows, the instance's heap usage increases and at some point chunks get processed at a trickling pace
One instance bears the load of entire batch processing
How can I distribute this processing across a group of instances? Is parallel processing achievable and if yes then how can I make sure that the same rows are not read by multiple instances (to avoid duplication)? Any other suggestions?
Thanks.
You can use a (locally or remotely) partitioned step where each worker step is assigned a distinct dataset. You can find more details and a code example in the documentation here:
https://docs.spring.io/spring-batch/docs/current/reference/html/spring-batch-integration.html#remote-partitioning
https://github.com/spring-projects/spring-batch/tree/main/spring-batch-samples#partitioning-sample
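The core idea behind partitioning, that each worker is assigned a distinct dataset, can be sketched independently of Spring Batch. The Python below is not Spring Batch code; it is a minimal illustration that assumes the rows have a numeric key column, and the function name `partition_id_range` is made up for this example. It splits a key range into non-overlapping sub-ranges, so no row can be read by two workers.

```python
def partition_id_range(min_id, max_id, grid_size):
    """Split the inclusive range [min_id, max_id] into at most grid_size
    contiguous, non-overlapping (low, high) ranges, one per worker."""
    if grid_size < 1:
        raise ValueError("grid_size must be at least 1")
    if max_id < min_id:
        return []
    total = max_id - min_id + 1
    size = -(-total // grid_size)  # ceiling division
    ranges = []
    low = min_id
    while low <= max_id:
        high = min(low + size - 1, max_id)
        ranges.append((low, high))
        low = high + 1
    return ranges


# Each worker would then read only its own slice, for example with a
# query like: SELECT ... WHERE id BETWEEN :low AND :high
for low, high in partition_id_range(1, 10, 3):
    print(low, high)
```

Because the ranges never overlap and together cover the whole key range, duplicates are avoided by construction rather than by coordination between instances.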
We are in the process of designing an ETL process, where we’ll be getting a daily account file (maybe half a million records, could grow) from the client and we’ll be loading that file to our database.
Our current process splits the file into smaller files and loads them to staging. Sometimes, if the process fails, we try to figure out how many records we have processed and then start again from that point. Is there a better alternative to this?
We are thinking about using Kafka. I’m pretty new to Kafka. I would really appreciate some feedback on whether Kafka is the way to go or whether it is overkill for a simple ETL process where we just load the data to a staging table and finally to the destination table.
Apache Kafka® is a distributed streaming platform. What exactly does that mean?
A streaming platform has three key capabilities:
Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
Store streams of records in a fault-tolerant durable way.
Process streams of records as they occur.
Kafka is generally used for two broad classes of applications:
Building real-time streaming data pipelines that reliably get data between systems or applications
Building real-time streaming applications that transform or react to the streams of data
https://kafka.apache.org/intro
If you encounter errors which make you check the last committed record in your staging database, and you need the system to manage this automatically, Kafka can help ease the process.
Though Kafka is built to work with massive data loads spread across a cluster, you certainly can use it for smaller problems and utilize its queuing functionality and offset management, even with one broker (server) and a low number of partitions (level of parallelism).
If you don't anticipate any scale at all, I would suggest you consider RabbitMQ.
RabbitMQ is a message-queueing software also known as a message broker or queue manager. Simply said; it is software where queues are defined, to which applications connect in order to transfer a message or messages.
https://www.cloudamqp.com/blog/2015-05-18-part1-rabbitmq-for-beginners-what-is-rabbitmq.html
“How to know if Apache Kafka is right for you” by Amit Rathi
https://link.medium.com/enGzNaNvT4
In case you choose Kafka:
When you receive a file, create a process which iterates over its lines and sends them to Kafka (Kafka Producer).
Create another process which continuously receives events from Kafka (Kafka Consumer) and writes them in mini batches to the database (similar to your small files).
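The shape of those two processes can be sketched without a Kafka cluster. In the Python below, a standard-library `Queue` stands in for the Kafka topic, and the names `produce`, `consume`, and `write_batch` are made up for this example; it does not use a real Kafka client API. The `None` end-of-file marker is also an assumption of the sketch, since a real consumer simply keeps polling.

```python
from queue import Queue


def produce(lines, topic):
    """Producer side: send each line of the file with its position (offset)."""
    for offset, line in enumerate(lines):
        topic.put((offset, line))
    topic.put(None)  # end-of-file marker, only needed for this sketch


def consume(topic, write_batch, batch_size):
    """Consumer side: write mini batches and remember the last offset that
    was written, so a restart can resume after it instead of recounting."""
    batch = []
    committed = -1
    while True:
        item = topic.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) == batch_size:
            write_batch([line for _, line in batch])
            committed = batch[-1][0]
            batch = []
    if batch:  # flush the final, smaller batch
        write_batch([line for _, line in batch])
        committed = batch[-1][0]
    return committed


topic = Queue()
written = []  # stands in for the staging table
produce(["a", "b", "c", "d", "e"], topic)
last_offset = consume(topic, written.append, batch_size=2)
print(written, last_offset)
```

The point of the committed offset is that the "how many records did we process" question from the original process gets answered by the system rather than by manual inspection.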
Set up Kafka:
https://dzone.com/articles/kafka-setup
Kafka Consumer/Producer simple example:
http://www.stackframelayout.com/programowanie/kafka-simple-producer-consumer-example/
Don't assume importing data is as easy as dumping it in your database and having the computer handle all the processing work. As you've discovered, an automated load can have problems.
First, database ELT processes depreciate the hard drive. Do not stage the data into one table prior to inserting it in its native table. Your process should only import the data one time to its native table to protect hardware.
Second, you don't need third-party software to middle-man the work. You need control so you're not manually inspecting what was inserted. This means your process is to first clean / transform the data prior to import. You want to prevent all problems prior to load by cleaning and structuring and even processing the data. The load should only be an SQL insert script. I have torn apart many T-SQL scripts where someone thought it convenient to integrate processing with database commands. Don't do it.
Here's how I manage imports from spreadsheet reports. Excel formulas are better than learning ETL tools like SSIS. I use cell formulas to check whether the record is valid to go into our system. This result is its own column, and then if that column is true, a concatenation column displays an insert script.
=if(J1, concatenate("('", A1, "', ", B1, "),"), "")
If the column is false, the concat column shows nothing. This allows me to copy/paste the inserts into SSMS and conduct mass inserts via "insert into table values" scripts.
If this is actually updating existing records, as your comment appears to suggest, then you need to master the data, organizing what's changed in logs for your users.
Synchronization steps:
1: Log what is there before you update
2: Download and compare local vs remote copies for differences; you cannot compare the two without a) having them both in the same physical location or b) controlling the other system
3: Log what you're updating with, and timestamp when you're updating it
4: Save and close the logs
5: Only when 1-4 are done should you post an update to production
My guide to synchronizing data sources and handling Creates/Updates/Deletes:
sync local files with server files
How does DataStage parallelism help with performance improvement? What is the relationship between parallelism and performance?
Thanks & Regards,
Subhasree
This question is very broad - please try to be more specific next time.
There are several different parallel approaches in DataStage:
Pipeline Parallelism: Imagine a job where data is read from a database, transformed, and written to another database. While data is still being read from the source, some rows are being transformed and others, already transformed, are being written to the target.
Because you do not have to wait for one step to finish before the next one starts, this improves performance.
Partitioning Parallelism: Data is read, e.g. from a sequential file, and then split up into different data partitions (the number of partitions is determined by the configuration file). Parallel stages, although designed only once, are instantiated once per partition, and therefore extra threads are spawned. These threads run in parallel and again provide better performance (throughput).
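Partitioning parallelism can be illustrated outside DataStage with a few lines of Python. This is only a sketch of the idea, not how DataStage implements it: the function names are made up, the round-robin split is just one possible partitioning method, and a fixed `partitions` argument plays the role of the configuration file.

```python
from concurrent.futures import ThreadPoolExecutor


def round_robin_partition(rows, partitions):
    """Deal the rows out to the partitions in turn."""
    return [rows[i::partitions] for i in range(partitions)]


def run_partitioned(rows, transform, partitions):
    """Define the transform once, then run one instance of it per partition."""
    parts = round_robin_partition(rows, partitions)

    def run_stage(part):
        return [transform(row) for row in part]

    with ThreadPoolExecutor(max_workers=partitions) as pool:
        # map() returns the results in partition order
        return list(pool.map(run_stage, parts))


print(run_partitioned([1, 2, 3, 4, 5], lambda x: x * 10, partitions=2))
```

In CPython, threads mainly help with I/O-bound work, so treat this as a picture of the structure (one stage definition, many instances) rather than a benchmark.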
Hope this helps.
We process lots of files (around 500) overnight, and those files arrive every few minutes. But when they arrive, they come in groups of 30-50. Is it a good idea to launch a job for each file, or to group them and process them using a multithreaded step?
Instead of going multithreaded directly or job per file, I'd recommend using partitioning. Using the MultiResourcePartitioner, you can create a partition per file, which means each file gets its own step. By doing this, you can avoid some of the threading complexities (step scope stateful components), and still maintain things like restartability and the independent execution of each file within the "batch" (run of the job). You can read more about partitioning in the documentation here: http://docs.spring.io/spring-batch/trunk/reference/html/scalability.html
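As a rough picture of what "a partition per file" buys you, here is a Python sketch of the pattern. It is not Spring Batch and does not use MultiResourcePartitioner; the names `process_files`, `failed_only`, and `process_one` are hypothetical. Each file is handled independently, a failure is recorded for that file only, and a rerun can be limited to the failed files.

```python
from concurrent.futures import ThreadPoolExecutor


def process_files(paths, process_one, max_workers=4):
    """Run process_one on every file independently and record a status
    per file, so one bad file does not stop the others."""

    def run(path):
        try:
            return path, "COMPLETED", process_one(path)
        except Exception as exc:  # record the failure for this file only
            return path, "FAILED", str(exc)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, paths))


def failed_only(results):
    """Files to hand to the next run when restarting."""
    return [path for path, status, _ in results if status == "FAILED"]
```

A usage example would pass the 30-50 file names of one arriving group as `paths` and your own per-file loader as `process_one`.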
It looks like the order in which files are processed does not matter.
I would use an instance of the batch job per file and not a multi-threaded step. Some advantages of using separate job instances are
It is easier to implement than a multi-threaded step.
Errors in one file will not affect the processing of other files.
If your files are very large, you can implement a multi-threaded step to process the records of one file in parallel. This is something I would consider only if performance does not meet expectations.
Multi-threaded programming in general is hard. Spring Batch does a good job of abstracting the complexities of parallel processing, but I have found that there are usually nuances to deal with, so it is best to avoid multi-threaded steps if you can.