Microservices with Kafka: how can we know when a service has successfully processed a message?

We currently have a topic that is being consumed by two services, as outlined in the architecture below. One is an NLP service and the other is a CV (Computer Vision) service. They are separate because they belong to different teams.
Let's say the original message is like this:
{
  "id": 1234,
  "text": "I love pizza",
  "photo": "https://photo.service/photo001"
}
The NLP service will process the message and produce a new message to topic 1 as below:
{
  "id": 1234,
  "text": "I love pizza",
  "nlp": "pizza",
  "photo": "https://photo.service/photo001"
}
And the CV service will process it and produce the message below to topic 2:
{
  "id": 1234,
  "text": "I love pizza",
  "photo": "https://photo.service/photo001",
  "cv": ["pizza", "restaurant", "cup", "spoon", "folk"]
}
Lastly, there's a final service that needs both pieces of information from the two services above. However, the NLP service and the CV service take different amounts of time. As the final service, how do I grab both messages from topic 1 and topic 2 for this particular message with id 1234?

You can use Kafka Streams or ksqlDB to run a join query. Otherwise, you'd use an external database to do the same.
E.g. you'd create a table for whichever events finish "first", then join the second incoming stream against that table on the ID keys. Without a persistent table, you can join two streams directly, but this assumes there is a time window in which both events will exist.
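For the stream-stream join route, here is a minimal Kafka Streams sketch; the topic names, String serdes, the 10-minute window and the naive merge function are assumptions for illustration, not part of the original setup. It assumes both services key their output records by the message id.
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.StreamJoined;
import java.time.Duration;

StreamsBuilder builder = new StreamsBuilder();

// Both input topics are assumed to be keyed by the message id ("1234").
KStream<String, String> nlp = builder.stream("topic-1");
KStream<String, String> cv  = builder.stream("topic-2");

// Join records with the same id that arrive within a 10-minute window of each other.
KStream<String, String> combined = nlp.join(
    cv,
    (nlpJson, cvJson) -> nlpJson + cvJson,   // replace with real JSON merge logic
    JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(10)),
    StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String())
);

combined.to("combined-topic");
If one side can lag far behind the other, materialize the faster stream as a KTable and join the slower stream against it instead, which is the table approach described above.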
Alternatively, don't split the incoming stream.
A -> NLP -> CV -> final service

Understanding the use case for the max.in.flight.requests.per.connection property in Kafka

I'm building a Spring Boot consumer-producer project with Kafka as a middleman between two microservices. The theme of the project is a basketball game. Here is a small state machine diagram in which the events are displayed. There will be many more different events; this is just a snippet.
Start event:
{
  "id": 5,
  "actualStartTime": "someStartTime"
}
Point event:
{
  "game": 5,
  "type": "POINT",
  "payload": {
    "playerId": 44,
    "value": 3
  }
}
Assist event:
{
  "game": 4,
  "type": "ASSIST",
  "payload": {
    "playerId": 278,
    "value": 1
  }
}
Jump event:
{
  "game": 2,
  "type": "JUMP",
  "payload": {
    "playerId": 55,
    "value": 1
  }
}
End event:
{
  "id": 5,
  "endTime": "someStartTime"
}
The main thing to note here is that if there is an Assist event, it must be followed by a Point event.
Since I'm new to Kafka, I'll keep things simple and have one broker with one topic and one partition. For my use case I need to maintain the ordering of these events as they actually happen live on the court (I have a JSON file with 7000 lines containing these and other events).
So, let's say that from the Admin UI someone is sending these events (for instance via WebSockets) to the producer app. The producer app will do some simple validation or whatever it needs to do. Now, we can also imagine that we have two instances of the producer app, one at ip:8080 (prd1) and the other at ip:8081 (prd2).
In reality, the sequence of these three events happened as: Assist -> Point -> Jump. The operator on the court sent those three events in that order.
The Assist event was sent via prd1 and the Point event via prd2. Let's now imagine that there was a network glitch in the communication between prd1 and the Kafka cluster. Since we are using the latest Kafka at the time of this writing, enable.idempotence=true is already set and the Assist event will not be sent twice.
During the retry of the Assist event on prd1 (towards Kafka), the Point event on prd2 went through successfully. Then the Assist event went through, and after it the Jump event (from either producer) also ended up in Kafka.
Now the topic contains: Point -> Assist -> Jump. This is not allowed.
My question is whether these types of problems should be handled by the application's business logic (for example Spring State Machine) or whether this ordering can be handled by Kafka.
If the latter, is the property max.in.flight.requests.per.connection=1 responsible for ordering? Are there any other properties which might preserve ordering?
On a side note, is it a good tactic to use a single partition per match and multiple consumers for any of the partitions? Most probably I would be streaming different types of matches (basketball, soccer, golf, across different leagues and nations) and most of them will require some sort of ordering.
This could maybe be done with Kafka Streams, but I'm still on Kafka's steep learning curve.
Update 1 (after Jessica Vasey's comments):
Hi, thanks for the very thorough comments. Unfortunately I didn't quite get all the pieces of the puzzle. What confuses me the most is some of the terminology you use and the order in which things happen. Not saying it isn't correct, just that I didn't understand it.
I'll have two microservices, so two producers. I have to be able to understand Kafka in the microservices world, since I'm a Java Spring developer and it's all about microservices and multiple instances.
So let's say that on prd1 a few DTO events came along [Start -> Point -> Assist] and they are sent as a ProducerRequest (https://kafka.apache.org/documentation/#recordbatch), placed in the RECORDS field. On prd2 we got [Point -> Jump], also as a ProducerRequest. They are, in my understanding, two independent in-flight requests (out of 5 possible)? Is their ordering based on a timestamp?
So when joining the cluster, Kafka assigns an id to each producer, let's say '0' for prd1 and '1' for prd2 (I guess it also depends on the topic-partition they have been assigned). I don't understand whether each RecordBatch has its own monotonically increasing sequence number, or each Kafka message within a RecordBatch has its own monotonically increasing sequence number, or both? Also the 'time to recover' part is bugging me. Like, if I got an OutOfOrderSequenceException, does it mean that the [Point -> Jump] batch (with possibly other in-flight requests and other batches in the producer's buffer) will sit there until either delivery.timeout.ms expires or until [Start -> Point -> Assist] is finally sent successfully?
Sorry for confusing you further, it's some complex logic you have! Hopefully, I can clarify some points for you. I assumed you had one producer, but after re-reading your post I see you have two producers.
You cannot guarantee the order of messages across both producers. You can only guarantee the order for each individual producer. This post explains it quite nicely: Kafka ordering with multiple producers on same topic and partition
On this question:
They are, in my understanding, two independent in-flight requests (out of 5 possible)? Is their ordering based on a timestamp?
Yes, each producer will have max.in.flight.requests.per.connection set to 5 by default.
You could provide a timestamp in your producer, which could help with your situation. However, I won't go into too much detail on that right now and will first answer your questions.
I don't understand whether each RecordBatch has its own monotonically increasing sequence number, or each Kafka message within a RecordBatch has its own monotonically increasing sequence number, or both? Also the 'time to recover' part is bugging me. Like, if I got an OutOfOrderSequenceException, does it mean that the [Point -> Jump] batch (with possibly other in-flight requests and other batches in the producer's buffer) will sit there until either delivery.timeout.ms expires or until [Start -> Point -> Assist] is finally sent successfully?
Each message is assigned a monotonically increasing sequence number. This LinkedIn post explains it better than I ever could!
Yes, other batches will sit on the producer until either the previous batch is acknowledged (which could be less than 2 mins) OR delivery.timeout.ms expires.
Even if max.in.flight.requests.per.connection > 1, setting enable.idempotence=true should preserve message order, as this assigns the messages a sequence number. When a batch fails, all subsequent batches to the same partition fail with an OutOfOrderSequenceException.
The number of partitions should be determined by your target throughput. If you want to send basketball matches to one partition and golf to another, you can use keys to determine which message should be sent where.
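As a rough sketch of that last point (the topic name, serializers and sample payload here are placeholders): keying every record by the match id keeps all events for one match on the same partition, and with idempotence enabled the per-producer order within that partition is preserved.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // assigns sequence numbers, implies acks=all

try (Producer<String, String> producer = new KafkaProducer<>(props)) {
    String matchId = "5"; // key: every event of match 5 lands on the same partition
    String event = "{\"game\":5,\"type\":\"POINT\",\"payload\":{\"playerId\":44,\"value\":3}}";
    producer.send(new ProducerRecord<>("game-events", matchId, event));
}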

How to split a Kafka topic into other topics by condition in real time?

I am using Kafka Connect to send Elasticsearch data to Kafka.
Once the connector is running, a topic is automatically created whose name is the Elasticsearch index name with a prefix.
Now, I would like to split this topic into N topics based on a condition.
All messages on my output Kafka topic look like this:
{"schema":
{"type":"struct",
"fields":[
{"type":"string","optional":true,"field":"nature"},
{"type":"string","optional":true,"field":"description"},
{"type":"string","optional":true,"field":"threshold"},
{"type":"string","optional":true,"field":"quality"},
{"type":"string","optional":true,"field":"rowid"},
{"type":"string","optional":true,"field":"avrotimestamp"},
{"type":"array","items":{"type":"string","optional":true},"optional":true,"field":"null"},
{"type":"string","optional":true,"field":"domain"},
{"type":"string","optional":true,"field":"name"},
{"type":"string","optional":true,"field":"avroversion"},
{"type":"string","optional":true,"field":"esindex"},
{"type":"string","optional":true,"field":"value"},
{"type":"string","optional":true,"field":"chrono"},
{"type":"string","optional":true,"field":"esid"},
{"type":"string","optional":true,"field":"ts"}],"optional":false,"name":"demofilter"},
"payload":
{
"nature":"R01",
"description":"Energy",
"threshold":"","quality":"192",
"rowid":"34380941",
"avrotimestamp":"2022-09-20T04:00:11.939Z",
"null":["TGT BQ 1B"],
"domain":"CFO",
"name":"RDC.R01.RED.MES",
"avroversion":"1",
"esindex":"demo_filter",
"value":"4468582",
"chrono":"133081200000000000",
"esid":"nuWIrYMBHyNMgyhJYscV",
"ts":"2022-09-20T02:00:00.000Z"
}
}
The description field takes several values but should contain one of these keywords: energy, electric, or temperature (example: life energy, body temperature, car energy).
The goal is that when the description field has the energy keyword, the data must be sent to the energy topic, and so on, all in real time of course.
What I was looking for:
According to my research, Kafka Streams is an option; unfortunately, from the word-count example I can't figure out how to do it. (I'm learning Kafka Streams for data processing.)
Using Python to sort the data after consuming it, but that takes time and loses the real-time aspect.
What should I do?
Using Kafka Streams, you can make dynamic routing decisions in to() based on whatever is in the payload of an event. Here, the name of the output topic is derived from the event data.
myStream.to(
    (eventId, event, record) -> "topic-prefix-" + event.methodOfYourEventLikeGetTypeName()
);
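Applied to this question, a hedged sketch might look like the following; the input topic name is a placeholder and the keyword check is a naive string match, so in practice you would deserialize the JSON payload and inspect the description field properly.
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source = builder.stream("es-index-topic"); // placeholder input topic

source.to(
    (key, value, recordContext) -> {
        String v = value.toLowerCase();
        if (v.contains("energy")) return "energy";
        if (v.contains("electric")) return "electric";
        if (v.contains("temperature")) return "temperature";
        return "other"; // fallback topic for records matching none of the keywords
    },
    Produced.with(Serdes.String(), Serdes.String())
);
Note that Kafka Streams only auto-creates its internal topics; the energy, electric, temperature and other output topics generally need to exist already.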

How to store/aggregate correlated CDC events with Flink?

I have a Kafka queue where multiple CDC events are coming from a database.
Suppose the following three tables implementing a student-course n:n association:
STUDENT
COURSE
STUDENT_COURSE
I can have the following "business" events:
A new student enrolls in a course: in this case I would receive 3 events on my Kafka queue; said events can come in any order, but I'd like to emit a "business" event like this one: {"type": "enroll", "student": {"name": "Jhon", "age": ...}, "course": {"name": "physics", "teacher": "Foo", ...}}
A student changes their course: in this case I would only receive 1 event on my Kafka queue (on STUDENT_COURSE) and I'd emit a "business" event like this one: {"type": "change", "student": {"name": "Jhon", "age": ...}, "newcourse": {"name": "maths", "teacher": "Foo", ...}}
Updates on STUDENT information (say email, phone, ...) or COURSE information (time, teacher, ...): 1 event on either table
My issue is that I don't know how to store and correlate these CDC events to build a business event together; in fact I'd need to do something like this:
Receive the event and store it in an "uncertain" state, wait for a reasonable time, say 10 sec
If an event on another table is received then I'm in case 1
Otherwise I'm in case 2 or 3
Is there a way to obtain this behavior in Flink?
Looks like you could start with a streaming SQL join on the 3 dynamic tables derived from these CDC streams, which will produce an update stream along the lines of what you're looking for.
Some of the examples in https://github.com/ververica/flink-sql-cookbook should provide inspiration for getting started. https://github.com/ververica/flink-sql-CDC and https://github.com/ververica/flink-cdc-connectors are also good resources.
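As a rough illustration of that join (the column names and the omitted CDC-backed table definitions are assumptions, and a recent Flink version with the Table API is assumed), it could look something like this:
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

// STUDENT, COURSE and STUDENT_COURSE are assumed to be registered as CDC-backed
// dynamic tables, e.g. via CREATE TABLE ... WITH ('connector' = 'mysql-cdc', ...).
tEnv.executeSql(
    "SELECT s.name AS student_name, s.age, c.name AS course_name, c.teacher " +
    "FROM STUDENT_COURSE sc " +
    "JOIN STUDENT s ON sc.student_id = s.id " +
    "JOIN COURSE  c ON sc.course_id  = c.id"
).print();
The result is an updating stream: whenever a row changes in any of the three tables, the affected joined rows are re-emitted, which is close to the correlated business events described in the question.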

Client for Kafka testing

I am trying to set up a Kafka environment. I have already implemented a Kafka producer and consumer in my code.
Is there a Kafka test client I can use to test this setup?
Basically, what I want is this:
My code publishes some event.
The test client gets it.
My test client publishes some event.
My code gets it.
Is there a Kafka test client that can be used to do the above? I tried searching the Kafka website and found nothing.
The Java API comes with MockConsumer and MockProducer for unit testing, as well as TopologyTestDriver for Kafka Streams.
If you want integration testing with a real broker, you can use Testcontainers (i.e. Docker), or use spring-kafka-test (Spring not required).
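For the unit-testing route, a minimal sketch with MockProducer (JUnit 5; the topic name and payload are placeholders) might look like this:
import org.apache.kafka.clients.producer.MockProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

class PublisherTest {

    @Test
    void publishesOneEvent() {
        // Records are captured in memory instead of being sent to a broker.
        MockProducer<String, String> producer =
            new MockProducer<>(true, new StringSerializer(), new StringSerializer());

        // In a real test, the code under test would receive this producer via injection.
        producer.send(new ProducerRecord<>("events", "1234", "{\"id\":1234}"));

        assertEquals(1, producer.history().size());
        assertEquals("events", producer.history().get(0).topic());
    }
}
MockConsumer works similarly on the consuming side, letting you pre-load records and assert on how your code handles them.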
The Cucumblan-message library contains predefined Gherkin step definitions for Kafka message/event testing.
There is no need to create any consumer or producer; the framework takes care of that.
This code base will help to test any Kafka messages.
Refer to the following code, based on your sample:
https://tutorials.virtualan.io/#/Cucumblan-message
https://github.com/virtualansoftware/cucumblan/blob/master/samples/cucumblan-message-testing/src/test/resources/features/kafka.feature#L47
You need to implement a simple JSON-based consumer class to read and validate the message.
https://github.com/virtualansoftware/cucumblan/blob/master/samples/cucumblan-message-testing/src/test/java/io/virtualan/test/msgtype/impl/JSONMessage.java
Scenario: check produce and consume event validation 1
Given Send inline message pets for event MOCK_REQUEST on pet with type JSON
| { "category": { "id": 100, "name": "Fish-POST" }, "id": 100, "name": "GoldFish-POST", "photoUrls": [ "/fish/" ], "status": "available", "tags": [ { "id": 100, "name": "Fish-POST" } ] } |
And Pause message PROCESSING for process for 2000 milliseconds
When Verify-by-elements for pets for event MOCK_RESPONSE contains 101 on pet with type JSON
| id | i~101 |
| category.name | german shepherd |
Then Verify for pets for event MOCK_RESPONSE contains 101 on pet with type JSON
| id,name, category/id:name,status |
| i~101,Rocky,i~100:german shepherd,available |
And Verify for pets for event MOCK_RESPONSE contains 101 on pet with type JSON
| id,name, category/id:name,tags/id:name,status,photoUrls |
| i~101,Rocky,i~100:german shepherd,i~101:brown\|,available,string\| |

OPC Publisher doesn't send data in the order generated by the OPC simulation server

I have been trying to retrieve sensor data generated by an OPC simulation server (the data is listed in an Excel file and read by the OPC simulation) into one of the custom modules in Azure IoT Edge. When the data is logged in the console, it shows that the data has not been logged in order. The following is the JSON for the OPC Publisher hosted in IoT Edge as a module.
"OPCPublisher": {
"version": "1.0",
"type": "docker",
"status": "running",
"restartPolicy": "always",
"settings": {
"image": "mcr.microsoft.com/iotedge/opc-publisher:2.8",
"createOptions": {
"Hostname": "publisher",
"Cmd": [
"publisher",
"--pf=/appdata/publishednodes.json",
"--lf=/appdata/publisher.log",
"--aa"
],
"HostConfig": {
"Binds": [
"/home/sineth/iiotedge:/appdata"
]
}
}
}
}
The following is the published nodes JSON on the gateway device.
The following is a screenshot of my Excel sheet data.
But the OPC Publisher does not route the data into the modules in order; it may start from anywhere, but it should at least be in order.
For example, it sends starting from the row with value 11 for Tag11, and then sends the next row which has the value 17 for Tag11. And sometimes it sends a batch of data; there is no proper order.
This is not an issue with the OPC server simulation, since I have tested the simulation server with a standalone OPC client and it gets the data in order. The Excel file is read by the simulation server.
The following image is a screenshot of my IoT Edge module (Python) where I log to the console the data retrieved from the OPC Publisher routing.
Appreciate any help on this.
Thanks a lot.
Adding a summary of the GitHub Issues discussions here:
OPC Publisher generates a unique message id for each OPC UA endpoint (auto-incrementing by one).
The Python client code above logs the same message more than 3500 times.
Receiving a message doesn't seem to block, and therefore the same message is handled over and over again.
receive_on_message_input is deprecated and should not be used anymore; see the API documentation.
Without the duplicates, all value changes are in order, but the behavior is still not what the OP needs.
More than one message (containing value changes for all three tags) is batched.
OPC Publisher tries to optimize for cost and performance; sending one message at a time achieves neither, but it is possible to configure OPC Publisher to send data directly by setting the batch size to one.
Command line argument: --bs=1
Not starting with the first value:
OPC Publisher establishes a connection to the OPC UA server and creates monitored items for every OPC UA node in its config file. By default, an OPC UA monitored item will send a notification with the current value. If you want to ignore it, you can use skip first.
Command line argument: --sk=true
But in the case described above the first value is also relevant. If the first message (message id = 1) doesn't contain the first value, then the OPC server simulation changed it beforehand.
Please be aware that the OPC Publisher can only publish once the OPC UA client/server connection is fully established (including the trusting of certificates) and the subscriptions and monitored items are created. This time also depends on the performance of the OPC UA server and network.
Proposals:
Change the OPC simulation to only start the simulation sequence once a client connection is fully established.
Retrieving the same message multiple times:
If the messages are received multiple times, it could be an error with the routing of the messages from one IoT Edge module to another. Please make sure to explicitly name the sending module (in this case the OPC Publisher):
"$edgeHub": {
"properties.desired": {
"schemaVersion": "1.2",
"routes": {
"opcPublisherToPyDataConsumer": "FROM /messages/modules/opc-publisher INTO BrokeredEndpoint(\"/modules/PyDataConsumer/inputs/fromOPC\")"
}
}
}
So some questions:
What version of IoT Edge are you using?
Is it just that the logs are not in order, or are messages being received out of order?
What protocol are you using, MQTT or AMQP?