How to get the result of a Kafka Streams aggregate task and send the data to another service? - apache-kafka

I use Kafka Streams to process real-time data and I need to do some aggregate operations over a windowed period of time.
I have two questions about the aggregate operation:
How do I get the aggregated data? I need to send it to a third-party service.
After the aggregate operation, I can't send a message to the third-party service; the code never runs.
Here is my code:
stream = builder.stream("topic");
windowedKStream = stream.map(XXXXX).groupByKey().windowedBy(TimeWindows.of(Duration.ofMinutes(5)));
ktable = windowedKStream.aggregate(() -> "", new Aggregator(K, V, result));
// My data is stored in the 'result' variable, but I can't get it at the end of the 5-minute window.
// I need to send 'result' to the third-party service, but I don't know where to temporarily store it and then how to get it.
// Below is the code that calls the third-party service, but this code is never executed (reached).
// I think it should be executed every 5 minutes when the window is over, but it isn't.
result = httpclient.execute(result);

I guess you might want to do something like:
ktable.toStream().foreach((k,v) -> httpclient.execute(v));
Each time the KTable is updated (with caching disabled), the update record will be sent downstream, and foreach will be executed with v being the current aggregation result.
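For completeness, here is a minimal sketch of that approach in the Java DSL. The topic name, serdes, map and aggregation logic, and the httpclient call are placeholders taken from the question, not a tested setup:

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> stream = builder.stream("topic");

KTable<Windowed<String>, String> aggregated = stream
        .map((k, v) -> KeyValue.pair(k, v))                         // your map(XXXXX) logic goes here
        .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
        .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
        .aggregate(
                () -> "",                                           // initializer
                (key, value, agg) -> agg + value,                   // your Aggregator logic goes here
                Materialized.with(Serdes.String(), Serdes.String()));

// Every update to the windowed KTable is forwarded downstream;
// foreach sees the current aggregation result for that window.
aggregated.toStream()
        .foreach((windowedKey, result) -> httpclient.execute(result)); // 'httpclient' is the question's placeholder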

Related

Batch processing with Vert.x

Is there a way to do typical batch processing with Vert.x, like providing a file or DB query as input and letting each record be processed by a verticle in a non-blocking way?
In the Vert.x examples, a server is defined at startup, and even though multiple verticles are deployed, the server is created only once. This means the Vert.x engine has a built-in concept of a server and knows how to dispatch incoming requests to each verticle for processing.
The same happens with the event bus.
But is there a way to define a verticle with a handler for processing data from a general stream: a query, a file, etc.?
I am particularly interested in spreading data processing over cluster nodes.
One way I can think of is to execute a query the regular way and then publish the data to the event bus for processing. But that means that if I have to process a few million records, I will run out of memory. Of course I could do paging, etc., but there is no coordination between retrieving and processing the data.
Thanks
Andrius
If you are using the JDBC Client, you can stream the query result:
(using vertx-rx-java2)
JDBCClient client = ...;
JsonArray params = new JsonArray().add(dataCategory);
client.rxQueryStreamWithParams("SELECT * FROM data WHERE data.category = ?", params)
  .flatMapObservable(SQLRowStream::toObservable)
  .subscribe(
    (JsonArray row) -> vertx.eventBus().send("data.process", row)
  );
This way each row is sent to the event bus. If you then have multiple verticle instances that each listen to this address, you spread the data processing over multiple threads.
If you are using another SQL client, have a look at its documentation; maybe it has a similar method.
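For the consuming side, a minimal sketch (assuming the same "data.process" address as above; the verticle class name and the row handling are illustrative only):

import io.vertx.core.AbstractVerticle;
import io.vertx.core.DeploymentOptions;
import io.vertx.core.json.JsonArray;

public class DataProcessVerticle extends AbstractVerticle {
    @Override
    public void start() {
        vertx.eventBus().<JsonArray>consumer("data.process", message -> {
            JsonArray row = message.body();
            // process one row here without blocking the event loop
        });
    }
}

// Deploy several instances so the rows are spread across event-loop threads:
vertx.deployVerticle("com.example.DataProcessVerticle", new DeploymentOptions().setInstances(4));

Because send (unlike publish) delivers each message to just one consumer, the deployed instances share the rows in a round-robin fashion.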

Batching Kafka Events using Faust

I have a Kafka topic we will call ingest that receives an entry every x seconds. I have a process that I want to run on this data but it has to be run on 100 events at a time. Thus, I want to batch the entries together and send them to a new topic called batched-ingest. The two topics will look like this...
ingest = [entry, entry, entry, ...]
batched-ingest = [[entry_0, entry_1, ..., entry_99]]
What is the correct way to do this using faust? The solution I have right now is this...
app = faust.App("explore", value_serializer="raw")
ingest = app.topic('ingest')
ingest_batch = app.topic('ingest-batch')
@app.agent(ingest, sink=[ingest_batch])
async def test(stream):
    async for values in stream.take(10, within=1000):
        yield values
I am not sure if this is the correct way to do this in Faust. If so, what should I set within to in order to make it always wait until len(values) = 100?
As mentioned in the Faust take documentation, if you omit within from take(100, within=10), the code will block forever if only 99 messages have arrived and the hundredth message never comes. To avoid this, add a within timeout so that up to 100 values are processed within 10 seconds; if there is a 10-second period with no new events, the agent will still process what it has gathered.

Databricks Scala : How to stream result from sql select

I need to send data from a Databricks Delta table into Azure Event Hubs.
The data will be selected with a SQL select:
spark.sql("SELECT [columns] FROM table WHERE [where clause]")
This select will return many, many rows, and after it I will apply some transformations (mainly to conform to the Event Hubs event data message format).
At the end I will send it to Event Hubs.
As far as I can tell, at the moment of writing, I need to use writeStream, but is this enough? How can I control how many messages are sent per batch? Do I even need to care about it, or does the library handle it?
Another question I have: from the moment I use writeStream, the command hangs in a running/streaming state forever. Is this correct, or am I just not being patient enough? If I'm right, how can I stop it (in a non-manual way) after all the data has been sent?
Notes:
This will be running in a job that is to be triggered manually
The library I use for the Event Hubs connection is com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.14.1
Once you have the final records that you want to send to Event Hubs, you need to call the .start() method after your write command; that is what starts the stream that writes the data to Event Hubs.
Also, if your job fails, you need to stop your SparkContext using sc.stop() or spark.sparkContext.stop()
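A rough Java sketch of that write path (the Scala API is analogous). The Delta path, the Event Hubs option key, and the payload column are assumptions to verify against the azure-eventhubs-spark documentation, which expects the payload in a column named "body":

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class DeltaToEventHubs {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().getOrCreate();

        Dataset<Row> source = spark.readStream()
                .format("delta")
                .load("/mnt/delta/my_table");                      // hypothetical table path

        // Put the serialized payload into the "body" column expected by the Event Hubs sink.
        Dataset<Row> events = source.selectExpr("to_json(struct(*)) AS body");

        StreamingQuery query = events.writeStream()
                .format("eventhubs")
                .option("eventhubs.connectionString", "<event-hubs-connection-string>") // assumption: prefer EventHubsConf per the connector docs
                .option("checkpointLocation", "/mnt/checkpoints/eventhubs")
                .start();                                          // nothing runs until .start() is called

        query.awaitTermination();                                  // blocks while the stream is running
    }
}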

Read Data, Hold Data for N Seconds, Write Data (Kafka, Flink)

The application reads from a Kafka topic.
Each message must be unique (duplicates are ignored),
must be held for 'N' seconds,
and must then be written to a different Kafka topic as an individual message.
Is there a way to hold a message for 'N' seconds and then write it to Kafka?
Each message must be written to the same topic 'N' seconds after the time it came in.
Currently I'm holding the data in a JSON structure in memory, and every time a message comes in, I loop through all the messages I have and compare times.
Naturally this is not the way to do it.
val some_consumer = new FlinkKafkaConsumer09(data_topic,
  new JSONKeyValueDeserializationSchema(false), properties)
some_consumer.setStartFromLatest()

val in_stream = env.addSource(some_consumer)
  .filter(!_.isNull)
  .map(x => processMessage(x))

def processMessage(x: ObjectNode): Unit = {
  // store the message in the JSON structure if it is not already there
  // loop through the entire set and compare times
  // if 'N' seconds have passed,
  // write to Kafka
  kafka_producer.send(new ProducerRecord[String, String](output_topic, the_unique_message))
}
You should hold the messages in Flink state, so that they are checkpointed, and will be restored in the case of failures.
To de-duplicate the stream, you can key the stream by whatever attribute makes an event unique, i.e., keyBy(x -> x.uniqueId). Then I would use a KeyedProcessFunction, and buffer the first event for each key in a ValueState<Event>. You can use either an EventTimeTimer or a ProcessingTimeTimer to trigger sending out the event (whichever is appropriate). If the scope of de-duplication is N seconds, then you can clear the state at the same time you emit the event.
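A hedged Java sketch of that pattern, using a processing-time timer (the Event type, the uniqueId key extractor, and N follow the placeholders above; a Scala implementation has the same structure):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class HoldForNSeconds extends KeyedProcessFunction<String, Event, Event> {
    private final long delayMs;
    private transient ValueState<Event> buffered;

    public HoldForNSeconds(long delayMs) { this.delayMs = delayMs; }

    @Override
    public void open(Configuration parameters) {
        buffered = getRuntimeContext().getState(
                new ValueStateDescriptor<>("buffered", Event.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Event> out) throws Exception {
        if (buffered.value() == null) {                            // later duplicates of the same key are ignored
            buffered.update(event);
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + delayMs);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Event> out) throws Exception {
        Event event = buffered.value();
        if (event != null) {
            out.collect(event);                                    // forwarded to the Kafka sink after N seconds
            buffered.clear();                                      // de-duplication scope ends with the emission
        }
    }
}

// Usage: in_stream.keyBy(x -> x.uniqueId).process(new HoldForNSeconds(nSeconds * 1000L)).addSink(kafkaProducer);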
You can use Tumbling Windows
https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html#tumbling-windows
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
The above example means the data goes out every 5 seconds, and you can see it clearly when printing to the console.
In your case you don't need event time and can use processing time.
You also don't need keyBy(); you can just use an all-window, although it's not a bad idea to use keyBy() so you gain parallelism.
After window(), you can add a Flink Kafka sink, because the window will periodically emit events every X minutes/seconds, as you wish.
Be careful about the memory limit, because the data kept in the window is stored in memory.

How to schedule periodical task based on number of processed messages?

I want to use Kafka Processor API to process messages from Kafka.
I would like to call some function periodically, something like:
context.schedule(intervalMs, punctuationType, somePunctuator), where somePunctuator performs some periodic job; but instead of using an interval of time as the trigger, I would like to invoke that task after processing a certain number of messages.
Is it possible to do such triggering in Kafka Streams?
Yes, it's possible using a Kafka Streams state store.
The logic depends on what exactly you need to do when the number of processed messages is reached.
If you need to propagate data to the next processor or sink node, store the aggregated values as a list of objects inside a key-value state store. Inside Processor.process(..), put the data into the key-value store, then check whether the number of items has reached the limit and perform the required logic (like processorContext.forward(..)). Please take a look at a similar example here.
If you only need to run some logic after reaching the count and don't need the values themselves, you could store only a counter and increment it inside Processor.process(..).
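A hedged sketch of the counter variant using the Processor API (the store name, the limit, and what happens on the trigger are illustrative; the store still has to be registered with the topology and connected to this processor):

import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

public class CountTriggerProcessor extends AbstractProcessor<String, String> {
    private static final long LIMIT = 1000L;                       // hypothetical: trigger after this many messages
    private KeyValueStore<String, Long> counterStore;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        super.init(context);
        counterStore = (KeyValueStore<String, Long>) context.getStateStore("counter-store");
    }

    @Override
    public void process(String key, String value) {
        Long current = counterStore.get("count");
        long next = (current == null ? 0L : current) + 1;
        counterStore.put("count", next);

        if (next >= LIMIT) {
            // do the periodic work here, e.g. context().forward(key, value);
            counterStore.put("count", 0L);                         // reset the counter for the next batch
        }
    }
}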