Integrating WSO2 Siddhi CEP and Kafka

I'm currently in the process of integrating WSO2's Siddhi CEP and Kafka. I want to produce a Siddhi stream by receiving events from Kafka. The Kafka data being received is in JSON format, where each event looks something like this:
{
  "event": {
    "orderID": "1532538588320",
    "timestamps": [
      15325,
      153
    ],
    "earliestTime": 1532538
  }
}
The SiddhiApp that I'm trying to run in the WSO2 stream processor looks like this:
@App:name('KafkaSiddhi')
@App:description('Consume events from a Kafka Topic and print the output.')

-- Streams
@source(type='kafka',
        topic.list = 'order-aggregates',
        partition.no.list = '0',
        threading.option = 'single.thread',
        group.id = 'time-aggregates',
        bootstrap.servers = 'localhost:9092, localhost:2181',
        @map(type='json'))
define stream TimeAggregateStream (orderID string, timestamps object, earliestTime long);

@sink(type="log")
define stream TimeAggregateResultStream (orderID string, timestamps object, earliestTime long);

-- Queries
from TimeAggregateStream
select orderID, timestamps, earliestTime
insert into TimeAggregateResultStream;
Running this app should log all of the data being published to the order-aggregates Kafka topic that I'm listening to, but I see no output whatsoever when I click run.
I can tell that there is some interaction between the WSO2 Stream Processor and the order-aggregates topic, because error messages are printed in real time whenever I run the application with data types that are inconsistent with my stream schema. The error messages look like this:
[2018-07-25_10-14-37_224] ERROR {org.wso2.extension.siddhi.map.json.sourcemapper.JsonSourceMapper} - Json message {"event":{"orderID":"210000000016183","timestamps":[1532538627000],"earliestTime":1532538627000}} contains incompatible attribute types and values. Value 210000000016183 is not compatible with type LONG. Hence dropping the message. (Encoded)
However, when I have the schema set up correctly, I receive no output at all when I run the application. I really don't know how to make sense of this. When I try to debug by putting a breakpoint on the line containing 'insert into', the debugger never stops at that line.
Can anyone offer some insight on how to approach this issue?

We have added object support to the JSON mapper extension in its latest release. Please download the extension [1] and replace the siddhi-map-json jar in /lib.
[1] https://store.wso2.com/store/assets/analyticsextension/details/0e6a6b38-f1d1-49f5-a685-e8c16741494d
Best Regards,
Ramindu.

Related

How to route data from a Kafka topic to other topics by condition in real time?

I am using Kafka Connect to send Elasticsearch data to Kafka.
Once the connector is running, a topic is automatically created whose name is the Elasticsearch index name with a prefix.
Now, I would like to split this topic into N topics by condition.
All messages in my output Kafka topic look like this:
{"schema":
{"type":"struct",
"fields":[
{"type":"string","optional":true,"field":"nature"},
{"type":"string","optional":true,"field":"description"},
{"type":"string","optional":true,"field":"threshold"},
{"type":"string","optional":true,"field":"quality"},
{"type":"string","optional":true,"field":"rowid"},
{"type":"string","optional":true,"field":"avrotimestamp"},
{"type":"array","items":{"type":"string","optional":true},"optional":true,"field":"null"},
{"type":"string","optional":true,"field":"domain"},
{"type":"string","optional":true,"field":"name"},
{"type":"string","optional":true,"field":"avroversion"},
{"type":"string","optional":true,"field":"esindex"},
{"type":"string","optional":true,"field":"value"},
{"type":"string","optional":true,"field":"chrono"},
{"type":"string","optional":true,"field":"esid"},
{"type":"string","optional":true,"field":"ts"}],"optional":false,"name":"demofilter"},
"payload":
{
"nature":"R01",
"description":"Energy",
"threshold":"","quality":"192",
"rowid":"34380941",
"avrotimestamp":"2022-09-20T04:00:11.939Z",
"null":["TGT BQ 1B"],
"domain":"CFO",
"name":"RDC.R01.RED.MES",
"avroversion":"1",
"esindex":"demo_filter",
"value":"4468582",
"chrono":"133081200000000000",
"esid":"nuWIrYMBHyNMgyhJYscV",
"ts":"2022-09-20T02:00:00.000Z"
}
}
The description field takes several values but should contain one of these keywords: energy, electric, or temperature (for example: life energy, body temperature, car energy).
The goal is that when the description field contains the energy keyword, the data must be sent to the energy topic, and so on, all in real time of course.
What I have looked into:
According to my research Kafka Streams is an option, but unfortunately I can't figure out from the WordCount example how to do this (I'm still learning Kafka Streams for data processing).
Using Python to sort the data after consuming it, but that takes time and loses the real-time aspect.
What should I do?
Using Kafka Streams, you can make dynamic routing decisions in the to() call based on whatever is in the payload of an event. Here, the name of the output topic is derived from the event data:
myStream.to(
    (eventId, event, record) -> "topic-prefix-" + event.methodOfYourEventLikeGetTypeName()
);
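As a rough illustration of how that pattern could be applied to the question above, here is a minimal sketch; the topic names (source-topic, energy-topic, etc.) and the simple substring check on the raw JSON value are assumptions for the example, not part of the original answer:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class DescriptionRouter {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "description-router");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // "source-topic" stands in for the topic created by the Elasticsearch connector.
        KStream<String, String> source = builder.stream("source-topic");

        // Route each record to a topic derived from the keyword found in the value.
        // A real implementation would parse the JSON and inspect the "description"
        // field properly; here we simply search the raw string.
        source.to((key, value, recordContext) -> {
            String v = value.toLowerCase();
            if (v.contains("energy")) {
                return "energy-topic";
            } else if (v.contains("electric")) {
                return "electric-topic";
            } else if (v.contains("temperature")) {
                return "temperature-topic";
            }
            return "other-topic"; // fallback for records matching none of the keywords
        });

        new KafkaStreams(builder.build(), props).start();
    }
}

Note that with dynamic routing the target topics generally need to be created ahead of time; Kafka Streams will not pre-create them for you.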

Unable to get Repeatedly.forever triggering working in my pipeline

Currently developing a Beam pipeline using the FlinkRunner and Apache Beam 2.29.
As per the suggestion in Beam File processing patterns, I have an unbounded pipeline listening to a Kafka topic for a CSV filename, and once a filename is received it is processed through TextIO readFile().
We end up with two PCollections, one from the file being processed and the other from a lookup against an external datastore. The PCollections are joined using the Join extension, which forces us to set up some triggering on these two PCollections. So I have defined something like the below for each PCollection, in the hope that the end result following the join would produce some new output every time a new filename arrives from the Kafka topic we are monitoring.
PCollection<KV<String, Map<String, AttributeValue>>> lookupTable = LookupTable.getPspLookupData(p, lookupTableName, lookupTableRegionFilter)
.apply("WindowB", Window.<KV<String, Map<String, AttributeValue>>>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(15))))
.withAllowedLateness(Duration.standardSeconds(5))
.discardingFiredPanes()
);
But it simply does not work more than once. It seems that if I send one or more Kafka messages before the 15 seconds defined in plusDelayOf(), the data gets processed, but anything sent past those 15 seconds (from pipeline startup) is never processed and the pipeline is simply "stuck", despite having defined a trigger of Repeatedly.forever...
I have tried numerous combinations and I simply can't get it to work. Would welcome any ideas or suggestions to get this to work. Feels like I am missing something basic, but I have been at this for hours.
Thanks,
Serge

Can the Kafka Connect JDBC Sink dump raw data?

Partly for testing and debugging, but also to work around an issue we are seeing in a topic where we are unable to change the producer, I would like to be able to store the value as a string in a CLOB in a database table.
I have this working as a Java based consumer but I am looking at whether this could be achieved using Kafka Connect.
Everything I have read says you need a schema, the reasoning being that otherwise the connector wouldn't know how to map the data into columns (which makes sense). But I don't want to do any processing of the data (which could be JSON but might just be text); I just want to treat the whole value as a string and load it into one column.
Is there any way this can be done within the Connect config, or am I looking at adding extra processing to update the message (in which case the Java client is probably going to end up being simpler)?
No, the JDBC Sink connector requires a schema to work. You could modify the source code to add in this behaviour.
I would personally try to stick with Kafka Connect for streaming data to a database since it does all the difficult stuff (scale out, restarts, etc etc etc) very well. Depending on the processing that you're talking about, it could well be that Single Message Transform would be very applicable, since they fit into the Kafka Connect pipeline. Or for more complex processing, Kafka Streams or ksqlDB.
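If the Single Message Transform route mentioned above were worth exploring, one hypothetical (untested) custom transform could wrap the raw value in a single-field struct, so the JDBC sink sees a schema with exactly one string column. The class name and the raw_value field name below are made up for illustration, and this assumes the value arrives as a String (e.g. via the StringConverter):

import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.transforms.Transformation;

// Sketch of an SMT that puts the whole record value into one string field.
public class WrapValueAsString<R extends ConnectRecord<R>> implements Transformation<R> {

    private static final Schema VALUE_SCHEMA = SchemaBuilder.struct()
            .field("raw_value", Schema.OPTIONAL_STRING_SCHEMA)
            .build();

    @Override
    public R apply(R record) {
        // Wrap the value (assumed to already be a String) in a single-field struct.
        Struct wrapped = new Struct(VALUE_SCHEMA)
                .put("raw_value", record.value() == null ? null : record.value().toString());
        return record.newRecord(record.topic(), record.kafkaPartition(),
                record.keySchema(), record.key(),
                VALUE_SCHEMA, wrapped, record.timestamp());
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public void configure(Map<String, ?> configs) {
        // no configuration needed for this sketch
    }

    @Override
    public void close() {
        // nothing to clean up
    }
}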

Avoid Data Loss While Processing Messages from Kafka

I am looking for the best approach for designing my Kafka consumer. Basically, I would like to know the best way to avoid data loss in case there are any exceptions/errors while processing the messages.
My use case is as below.
a) The reason I am using a SERVICE to process the message is that in the future I am planning to write an ERROR PROCESSOR application, which would run at the end of the day and try to process the failed messages again (not all messages, but messages which fail because of dependencies like a missing parent).
b) I want to make sure there is zero message loss and so I will save the message to a file in case there are any issues while saving the message to DB.
c) In a production environment there can be multiple instances of the consumer and services running, so there is a high chance that multiple applications try to write to the same file.
Q-1) Is writing to a file the only option to avoid data loss?
Q-2) If it is the only option, how do I make sure multiple applications can write to the same file and read from it at the same time? Please consider that in the future, once the error processor is built, it might be reading messages from the same file while another application is trying to write to it.
ERROR PROCESSOR - Our source follows event-driven mechanics, and there is a high chance that sometimes the dependent event (for example, the parent entity for something) might get delayed by a couple of days. So in that case, I want my ERROR PROCESSOR to be able to process the same messages multiple times.
I've run into something similar before. So, diving straight into your questions:
Not necessarily; you could perhaps send those messages back to Kafka in a new topic (let's say - error-topic). So, when your error processor is ready, it could just listen to this error-topic and consume those messages as they come in.
I think this question has been addressed in response to the first one. So, instead of writing to and reading from a file and opening multiple file handles to do this concurrently, Kafka might be a better choice as it is designed for such problems.
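For illustration, here is a minimal sketch of that idea, where a message that fails processing is parked on a hypothetical error-topic instead of a shared file; the topic name, key, payload and processing call are all placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ErrorTopicExample {

    private static final String ERROR_TOPIC = "error-topic"; // name taken from the suggestion above

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            String key = "order-123";           // placeholder key
            String value = "{\"orderId\":123}"; // placeholder payload

            try {
                processMessage(value);          // your SERVICE call that may throw
            } catch (Exception e) {
                // Instead of writing to a shared file, park the failed message in Kafka.
                // The ERROR PROCESSOR can consume error-topic later and retry at its own pace.
                producer.send(new ProducerRecord<>(ERROR_TOPIC, key, value));
            }
        }
    }

    private static void processMessage(String value) {
        // placeholder for the real processing/DB logic
    }
}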
Note: The following point is just some food for thought based on my limited understanding of your problem domain. So, you may just choose to ignore this safely.
One more point worth considering on your design for the service component - You might as well consider merging points 4 and 5 by sending all the error messages back to Kafka. That will enable you to process all error messages in a consistent way as opposed to putting some messages in the error DB and some in Kafka.
EDIT: Based on the additional information on the ERROR PROCESSOR requirement, here's a diagrammatic representation of the solution design.
I've deliberately kept the output of the ERROR PROCESSOR abstract for now just to keep it generic.
I hope this helps!
If you don't commit the consumed message before writing to the database, then nothing would be lost while Kafka retains the message. The tradeoff of that would be that if the consumer did commit to the database, but a Kafka offset commit fails or times out, you'd end up consuming records again and potentially have duplicates being processed in your service.
Even if you did write to a file, you wouldn't be guaranteed ordering unless you opened a file per partition and ensured all consumers only ran on a single machine (because you're preserving state there, which isn't fault-tolerant). Deduplication would still need to be handled as well.
Also, rather than writing your own consumer to a database, you could look into the Kafka Connect framework. For validating a message, you can similarly deploy a Kafka Streams application to filter bad messages from an input topic out into a separate topic to send to the DB.
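To illustrate the offset-commit point above, a minimal sketch of a consumer that disables auto-commit and only commits after the database write has succeeded might look roughly like this; the topic name and the DB call are placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "db-writer");
        props.put("enable.auto.commit", "false"); // commit manually, only after the DB write succeeds
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("input-topic")); // placeholder topic

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    writeToDatabase(record.value()); // placeholder for your DB insert
                }
                // Only acknowledge the batch once every record has been persisted.
                // If the process dies before this point, the records are re-consumed (at-least-once),
                // which is the duplicate-processing trade-off described above.
                consumer.commitSync();
            }
        }
    }

    private static void writeToDatabase(String value) {
        // placeholder for the actual JDBC/DAO logic
    }
}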

How to implement a microservice Event Driven architecture with Spring Cloud Stream Kafka and Database per service

I am trying to implement an event driven architecture to handle distributed transactions. Each service has its own database and uses Kafka to send messages to inform other microservices about the operations.
An example:
Order service -------> | Kafka |------->Payment Service
| |
Orders MariaDB DB Payment MariaDB Database
The Order service receives an order request. It has to store the new Order in its DB and publish a message so that the Payment Service realizes it has to charge for the item:
private OrderBusiness orderBusiness;

@PostMapping
public Order createOrder(@RequestBody Order order) {
    logger.debug("createOrder()");
    //a.- Save the order in the DB
    orderBusiness.createOrder(order);
    //b. Publish in the topic so that Payment Service charges for the item.
    try {
        orderSource.output().send(MessageBuilder.withPayload(order).build());
    } catch (Exception e) {
        logger.error("{}", e);
    }
    return order;
}
These are my doubts:
Steps a.- (save in Order DB) and b.- (publish the message) should be performed in a transaction, atomically. How can I achieve that?
This is related to the previous one: I send the message with orderSource.output().send(MessageBuilder.withPayload(order).build()). This operation is asynchronous and ALWAYS returns true, no matter if the Kafka broker is down. How can I know that the message has reached the Kafka broker?
Steps a.- (save in Order DB) and b.- (publish the message) should be
performed in a transaction, atomically. How can I achieve that?
Kafka currently does not support transactions (and thus also no rollback or commit), which you'd need to synchronize something like this. So in short: you can't do what you want to do. This will change in the near-ish future, when KIP-98 is merged, but that might take some time yet. Also, even with transactions in Kafka, an atomic transaction across two systems is a very hard thing to do; everything that follows will only be improved upon by transactional support in Kafka, but it will still not entirely solve your issue. For that you would need to look into implementing some form of two-phase commit across your systems.
You can get somewhat close by configuring producer properties, but in the end you will have to choose between at least once or at most once for one of your systems (MariaDB or Kafka).
Let's start with what you can do in Kafka to ensure delivery of a message, and further down we'll dive into your options for the overall process flow and what the consequences are.
Guaranteed delivery
You can configure how many brokers have to confirm receipt of your messages before the request is returned to you with the parameter acks: by setting this to all you tell the broker to wait until all replicas have acknowledged your message before returning an answer to you. This is still no 100% guarantee that your message will not be lost, since it has only been written to the page cache and there are theoretical scenarios with a broker failing before it is persisted to disk, where the message might still be lost. But this is as good a guarantee as you are going to get.
You can further reduce the risk of data loss by lowering the interval at which brokers force an fsync to disk (flush.messages and/or flush.ms), but please be aware that these values can bring with them heavy performance penalties.
In addition to these settings you will need to wait for your Kafka producer to return the response for your request to you and check whether an exception occurred. This sort of ties into the second part of your question, so I will go into that further down.
If the response is clean, you can be as sure as possible that your data got to Kafka and start worrying about MariaDB.
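For reference, a small sketch of producer properties reflecting the settings discussed above; the broker address and serializers are assumptions for the example:

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class DeliveryGuaranteeConfig {

    public static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        // Wait until all in-sync replicas have acknowledged the write before the send completes.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Retry transient broker errors rather than dropping the record.
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }
}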
Everything we have covered so far only addresses how to ensure that Kafka got your messages, but you also need to write data into MariaDB, and this can fail as well, which would make it necessary to recall a message you potentially already sent to Kafka - and this you can't do.
So basically you need to choose one system in which you are better able to deal with duplicates/missing values (depending on whether or not you resend partial failures) and that will influence the order you do things in.
Option 1
In this option you initialize a transaction in MariaDB, then send the message to Kafka, wait for a response and if the send was successful you commit the transaction in MariaDB. Should sending to Kafka fail, you can rollback your transaction in MariaDB and everything is dandy.
If however, sending to Kafka is successful and your commit to MariaDB fails for some reason, then there is no way of getting back the message from Kafka. So you will either be missing a message in MariaDB or have a duplicate message in Kafka, if you resend everything later on.
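A rough sketch of Option 1, assuming a JDBC DataSource and an already configured Kafka producer; class, method, column and topic names here are illustrative only:

import java.sql.Connection;
import javax.sql.DataSource;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class Option1Sketch {

    // dataSource and producer are assumed to be configured elsewhere
    private final DataSource dataSource;
    private final Producer<String, String> producer;

    public Option1Sketch(DataSource dataSource, Producer<String, String> producer) {
        this.dataSource = dataSource;
        this.producer = producer;
    }

    public void createOrder(String orderId, String orderJson) throws Exception {
        try (Connection connection = dataSource.getConnection()) {
            connection.setAutoCommit(false);
            try {
                insertOrder(connection, orderId, orderJson);   // placeholder DB insert

                // Block until Kafka confirms (or rejects) the write; get() rethrows send failures.
                producer.send(new ProducerRecord<>("orders", orderId, orderJson)).get();

                // Only commit MariaDB once Kafka has accepted the message.
                connection.commit();
            } catch (Exception e) {
                // The send (or the insert) failed: undo the DB work.
                connection.rollback();
                throw e;
            }
        }
    }

    private void insertOrder(Connection connection, String orderId, String orderJson) {
        // placeholder for the actual JDBC insert
    }
}

The one failure mode this cannot cover is a MariaDB commit that fails after Kafka has already accepted the message, which is exactly the gap described above.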
Option 2
This is pretty much just the other way around, but you are probably better able to delete a message that was written in MariaDB, depending on your data model.
Of course you can mitigate both approaches by keeping track of failed sends and retrying just these later on, but all of that is more of a bandaid on the bigger issue.
Personally I'd go with approach 1, since the chance of a commit failing should be somewhat smaller than that of the send itself, and implement some sort of dupe check on the other side of Kafka.
This is related to the previous one: I send the message with:
orderSource.output().send(MessageBuilder.withPayload(order).build());
This operations is asynchronous and ALWAYS returns true, no matter if
the Kafka broker is down. How can I know that the message has reached
the Kafka broker?
Now first off, I'll admit I am unfamiliar with Spring, so this may not be of use to you, but the following code snippet illustrates one way of checking produce responses for exceptions.
By calling flush you block until all sends have finished (and either failed or succeeded) and then check the results.
Producer<String, String> producer = new KafkaProducer<>(myConfig);
final ArrayList<Exception> exceptionList = new ArrayList<>();
for(MessageType message : messages){
producer.send(new ProducerRecord<String, String>("myTopic", message.getKey(), message.getValue()), new Callback() {
#Override
public void onCompletion(RecordMetadata metadata, Exception exception) {
if (exception != null) {
exceptionList.add(exception);
}
}
});
}
producer.flush();
if (!exceptionList.isEmpty()) {
// do stuff
}
I think the proper way to implement Event Sourcing is by having Kafka filled directly with events pushed by a plugin that reads from the RDBMS binlog, e.g. using Confluent Bottled Water (https://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/) or the more active Debezium (http://debezium.io/). Then the consuming microservices can listen to those events, consume them and act on their respective databases, which end up eventually consistent with the RDBMS database.
Have a look at my full answer here for a guideline:
https://stackoverflow.com/a/43607887/986160