Do activities in the Data Flow count towards the Pipeline Activity limit of 40? - azure-data-factory

I essentially want two sinks for 1 output. It's seeming like I'll have to duplicate my Pipeline and change the sink to the other Sink location.
I'd like to avoid that as much as possible.
So, I have a pipeline with 30 Copy Activities. Plain and simple, source to sink.
If I changed those into Data flows which split the Sink between two different sources (using the new Branch feature), would that increase the count of activities or do Data flows count as 1 activity?

Data Flow is one active. But in Data Flow active, we can create more flows to copy data or do data conversion from source and sink.
We can create more sources to one sink, but one sink for one output, just for now there we can't achieve two sinks for one output.
The max number of 40 activities allowed per pipeline. Data Flow doesn't have the source and sink limits. I just tested and we can create more than 40 flows. That mean that we can create 40 data flows in one pipeline, and each data flow can contains more than 40 flows(source to sink).
Like you said, you have a pipeline with 30 Copy Activities, you have two ways to build the pipeline:
30 actives: copy active 1 + copy active 2 + copy active 3 + ... + copy active 30.
1 active Data Flow: source1-->sink 1, source2-->sink2, ... ,source30-->sink30.
Data Factory doesn't charge for the actives number, only charge the for the amount of the amount of data transferring and how many resource you used in Data Factory.

Related

Calling Trigger once in Databricks to process Kinesis Stream

I am looking a way to trigger my Databricks notebook once to process Kinesis Stream and using following pattern
import org.apache.spark.sql.streaming.Trigger
// Load your Streaming DataFrame
val sdf = spark.readStream.format("json").schema(my_schema).load("/in/path")
// Perform transformations and then write…
sdf.writeStream.trigger(Trigger.Once).format("delta").start("/out/path")
It looks like it's not possible with AWS Kinesis and that's what Databricks documentation suggest as well. My Question is what else can we do to Achieve that?
As you mentioned in the question the trigger once isn't supported for Kinesis.
But you can achieve what you need by adding into the picture the Kinesis Data Firehose that will write data from Kinesis into S3 bucket (you can select format that you need, like, Parquet, ORC, or just leave in JSON), and then you can point the streaming job to given bucket, and use Trigger.Once for it, as it's a normal streaming source (For efficiency it's better to use Auto Loader that is available on Databricks). Also, to have the costs under the control, you can setup retention policy for your S3 destination to remove or archive files after some period of time, like 1 week or month.
A workaround is to stop after X runs, without trigger. It'll guarantee a fix number of rows per run.
The only issue is that if you have millions of rows waiting in the queue you won't have the guarantee to process all of them
In scala you can add an event listener, in python count the number of batches.
from time import sleep
s = sdf.writeStream.format("delta").start("/out/path")
#by defaut keep spark.sql.streaming.numRecentProgressUpdates=100 in the list. Stop after 10 microbatch
#maxRecordsPerFetch is 10 000 by default, so we will consume a max value of 10x10 000= 100 000 messages per run
while len(s.recentProgress) < 10:
print("Batchs #:"+str(len(s.recentProgress)))
sleep(10)
s.stop()
You can have a more advanced logic counting the number of message processed per batch and stopping when the queue is empty (the throughput should lower once it's all consumed as you'll only get the "real-time" flow, not the history)

Kafka - Difference between Events with batch data and Streams

What is the fundamental difference between an event with a batch of data attached and a kafka stream that occasionally sends data ? Can they be used interchangeably ? When should I use the first and when the latter ? Could you provide some simple use cases ?
Note: There is some info in the comments of this question but I would ask for a more well rounded answer.
I assume that with "difference" between streams and events with batched data you are thinking of:
Stream: Every event of interest is sent immediately to the stream. Those individual events are therefore fine-grained, small(er) in size.
Events with data batch: Multiple individual events get aggregated into a larger batch, and when the batch reaches a certain size, a certain time has passed, or a business transaction has completed, the batch event is sent to the stream. Those batch events are therefore more coarse-grained and large(r) in size.
Here is a list of characteristics that I can think of:
Realtime/latency: End-to-end processing time will typically be smaller for individual events, and longer for batch events, because the publisher may wait with sending batch events until enough individual events have accumulated.
Throughput: Message brokers differ in performance characteristics regarding max. # of in/out events / sec at comparable in/out amounts of data. For example, comparing Kinesis vs. Kafka, Kinesis has a lower max. # of in/out events / sec it can handle than a finely tuned Kafka cluster. So if you were to use Kinesis, batch events may make more sense to achieve the desired throughput in terms of # of individual events. Note: From what I know, the Kinesis client library has a feature to transparently batch individual events if desired/possible to increase throughput.
Order and correlation: If multiple individual events belong to one business transaction and need to be processed by consumers together and/or possibly in order, then batch events may make this task easier because all related data becomes available to consumers at once. With individual events, you would have to put appropriate measures in place like selecting appropriate partition keys to guarantee that individual events get processed in order and possibly by the same consumer worker instance.
Failure case: If batch events contain independent individual events, then it may happen that a subset of individual events in a batch fails to process (irrelevant whether temporary or permanent failure). In such a case, consumers may not be able to simply retry the entire event because parts of the batch event has already caused state changes. Explicit logic (=additional effort) may be necessary to handle partial processing failure of batch events.
To answer your question whether the two can be used interchangeably, I would say in theory yes, but depending on the specific use case, one of the two approaches will likely result better performance or result in less complex design/code/configuration.
I'll edit my answer if I can think of more differentiating characteristics.

Synchronize Data From Multiple Data Sources

Our team is trying to build a predictive maintenance system whose task is to look at a set of events and predict whether these events depict a set of known anomalies or not.
We are at the design phase and the current system design is as follows:
The events may occur on multiple sources of an IoT system (such as cloud platform, edge devices or any intermediate platforms)
The events are pushed by the data sources into a message queueing system (currently we have chosen Apache Kafka).
Each data source has its own queue (Kafka Topic).
From the queues, the data is consumed by multiple inference engines (which are actually neural networks).
Depending upon the feature set, an inference engine will subscribe to
multiple Kafka topics and stream data from those topics to continuously output the inference.
The overall architecture follows the single-responsibility principle meaning that every component will be separate from each other and run inside a separate Docker container.
Problem:
In order to classify a set of events as an anomaly, the events have to occur in the same time window. e.g. say there are three data sources pushing their respective events into Kafka topics, but due to some reason, the data is not synchronized.
So one of the inference engines pulls the latest entries from each of the kafka topics, but the corresponding events in the pulled data do not belong to the same time window (say 1 hour). That will result in invalid predictions due to out-of-sync data.
Question
We need to figure out how can we make sure that the data from all three sources are pushed in-order so that when an inference engine requests entries (say the last 100 entries) from multiple kakfa topics, the corresponding entries in each topic belong to the same time window?
I would suggest KSQL, which is a streaming SQL engine that enables real-time data processing against Apache Kafka. It also provides nice functionality for Windowed Aggregation etc.
There are 3 ways to define Windows in KSQL:
hopping windows, tumbling windows, and session windows. Hopping and
tumbling windows are time windows, because they're defined by fixed
durations they you specify. Session windows are dynamically sized
based on incoming data and defined by periods of activity separated by
gaps of inactivity.
In your context, you can use KSQL to query and aggregate the topics of interest using Windowed Joins. For example,
SELECT t1.id, ...
FROM topic_1 t1
INNER JOIN topic_2 t2
WITHIN 1 HOURS
ON t1.id = t2.id;
Some suggestions -
Handle delay at the producer end -
Ensure all three producers always send data in sync to Kafka topics by using batch.size and linger.ms.
eg. if linger.ms is set to 1000, all messages would be sent to Kafka within 1 second.
Handle delay at the consumer end -
Considering any streaming engine at the consumer side (be it Kafka-stream, spark-stream, Flink), provides windows functionality to join/aggregate stream data based on keys while considering delayed window function.
Check this - Flink windows for reference how to choose right window type link
To handle this scenario, data sources must provide some mechanism for the consumer to realize that all relevant data has arrived. The simplest solution is to publish a batch from data source with a batch Id (Guid) of some form. Consumers can then wait until the next batch id shows up marking the end of the previous batch. This approach assumes sources will not skip a batch, otherwise they will get permanently mis-aligned. There is no algorithm to detect this but you might have some fields in the data that show discontinuity and allow you to realign the data.
A weaker version of this approach is to either just wait x-seconds and assume all sources succeed in this much time or look at some form of time stamps (logical or wall clock) to detect that a source has moved on to the next time window implicitly showing completion of the last window.
The following recommendations should maximize success of event synchronization for the anomaly detection problem using timeseries data.
Use a network time synchronizer on all producer/consumer nodes
Use a heartbeat message from producers every x units of time with a fixed start time. For eg: the messages are sent every two minutes at the start of the minute.
Build predictors for producer message delay. use the heartbeat messages to compute this.
With these primitives, we should be able to align the timeseries events, accounting for time drifts due to network delays.
At the inference engine side, expand your windows at a per producer level to synch up events across producers.

Need advice on storing time series data in aligned 10 minute batches per channel

I have time series data in Kafka. The schema is quite simple - the key is the channel name, and the values are Long/Double tuples of the timestamp and the value (in reality it's a custom Avro object but it boils down to this). They always come in correct chronological order.
The wanted end result is data packaged in 10 minute batches, aligned at 10 minutes (i.e., 00:00 < t <= 00:10, 00:10 < t <= 00:20, ..., 23: 50 < t <= 00:00). Each package is to contain only data of one channel.
My idea is to have two Spark Streaming jobs. The first one takes the data from the Kafka topics and dumps it to a table in a Cassandra database where the key is the timestamp and the channel name, and every time such an RDD hits a 10 minute boundary, this boundary is posted to another topic, alongside the channel whose boundary is hit.
The second job listens to this "boundary topic", and for every received 10 minute boundary, the data is pulled from Cassandra, some calculations like min, max, mean, stddev are done and the data and these results are packaged to a defined output directory. That way, each directory contains the data from one channel and one 10 minute window.
However, this looks a bit clunky and like a lot of extra work to me. Is this a feasible solution or are there any other more efficient tricks to it, like some custom windowing of the Kafka data?
I agree with your intuition that this solution is clunky. How about simply using the time windowing functionality built into the Streams DSL?
http://kafka.apache.org/11/documentation/streams/developer-guide/dsl-api.html#windowing
The most natural output would be a new topic containing the windowed aggregations, but if you really need it written to a directory that should be possible with Kafka Connect.
I work with the Flink Stream Processing, not Spark-streaming but I guess the programming concept of both of them is alike. So supposing data are ordered chronologically and you want to aggregate data for every 10 minutes and do some processing on aggregated data, I think the best approach is to use the Streaming Window Functions. I suggest to define a function to map every incoming data's timestamp to the last 10 minutes:
12:10:24 ----> 12:10:00
12:10:30 ----> 12:10:00
12:25:24 ----> 12:20:00
So you can create a keyed stream object like:
StreamObject<Long, Tuple<data>>
That the Long field is the mapped timestamp of every message. Then you can apply a window. You should search what kind of window is more appropriate for your case.
Point: Setting a key for the data stream will cause the window function to consider a logical window for every key.
In the most simple case, you should define a time window of 10 minutes and aggregate all data incoming on that period of time.
The other approach, if you know the rate of generating of data and how many messages will be generated in a period of 10 minutes, is to use Count window. For example, a window with the count of 20 will listen to the stream and aggregate all the messages with the same key in a logical window and apply the window function just when the number of messages in the window reaches 20.
After messages aggregated in a window as desired, you could apply your processing logic using a reduce function or some action like that.

Spring Batch - gridSize

I want some clear picture in this.
I have 2000 records but I limit 1000 records in the master for partitioning using rownum with gridSize=250 and partition across 5 slaves running in 10 machines.
I assume 1000/250= 4 steps will be created.
Whether data info sent to 4 slaves leaving 1 slave idle? If number
of steps is more than the number of slave java process, I assume the
data would be eventually distributed across all slaves.
Once all steps completed, would the slave java process memory is
freed (all objects are freed from memory as the step exists)?
If all steps completed for 1000/250=4 steps, to process the
remaining 1000 records, how can I start my new job instance without
scheduler triggers the job.
Since, you have not shown your Partitioner code, I would try to answer only on assumptions.
You don't have to assume about number of steps ( I assume 1000/250= 4 steps will be created ), it would be number of entries you create in java.util.Map<java.lang.String,ExecutionContext> that you return from your partition method of Partitioner Interface.
partition method takes gridSize as argument and its up to you to make use of this parameter or not so if you decide to do partitioning based on some other parameter ( instead of evenly distributing count ) then you can do that. Eventually, number of partitions would be number of entries in returned map and values stored in ExecutionContext can be used for fetching data in readers and so on.
Next, you can choose about number of steps to be started in parallel by setting appropriate TaskExecutor and concurrencyLimit values i.e. you might create 100 steps in partition but want to start only 4 steps in parallel and that can very well be achieved by configuration settings on top of partitioner.
Answer#1: As already pointed, data distribution has to be coded by you in your reader using ExecutionContext information you created in partitioner. It doesn't happen automatically.
Answer#2: Not sure what you exactly mean but yes, everything gets freed after completion and information is saved in meta data.
Answer#3: As already pointed out, all steps would be created in one go for all the data. Which steps run for which data and how many run in parallel can be controlled by readers and configuration.
Hope it helps !!