Does Kafka have rule-engine capabilities?

We are using Kafka for messaging and a lot more, but now there is a requirement for some kind of rule engine for data processing based on rules. Does Kafka have any such capability (a rule engine), or do we have to use a third-party rule engine (e.g. https://camunda.com/dmn/ ) and integrate it with Kafka?

There is no need to use a third-party rule engine with Apache Kafka. The project includes Kafka Streams and, to reduce the need to write Java code to express rules, there is also ksqlDB, which is based on a subset of ANSI SQL.
While these options are not necessarily a rule engine per se, they share the same semantics: given some intermediate processing, output the relevant result based on the computation. The difference is in the how, not in the what, so I think they are decent replacements.
You can always integrate both as well. Several rule engines, such as Drools from Red Hat, are Java-based and can therefore be accessed easily from a Kafka Streams processor. As long as the if-then-else rules run in the same JVM as the Kafka Streams application, you won't pay any performance penalty other than a possibly bigger JVM heap.
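To make this concrete, here is a minimal Kafka Streams sketch of expressing a rule as a predicate. The topic names ("orders", "orders-flagged"), the broker address, and the rule itself are placeholders; the same lambda could just as well delegate to an in-JVM rule engine such as Drools:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class OrderRulesApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-rules");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");

        // The "rule" is just a predicate over the record; matching records are
        // routed to their own topic. A Drools session could be consulted here instead.
        orders.filter((orderId, payload) -> payload.contains("\"amount\":0"))
              .to("orders-flagged");

        new KafkaStreams(builder.build(), props).start();
    }
}
```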

Related

How to do stream processing with Redpanda?

Redpanda seems easy to work with, but how would one process streams in real-time?
We have a few thousand IoT devices that send us data every second. We would like to get the running average of the data from the last hour for each of the devices. Can the built-in WebAssembly stuff be used for this, or do we need something like Materialize?
Given that it is marketed as "Kafka Compatible," any Kafka library should work with RedPanda, including Kafka Streams, KSQL, Apache Spark, Flink, Storm, etc.
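To sketch what that could look like with Kafka Streams pointed at Redpanda: the topic name "device-readings", the broker address, the serdes, and the tumbling one-hour window below are all assumptions (a hopping window would track "the last hour" more closely):

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.TimeWindows;

public class HourlyDeviceAverage {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "device-hourly-average");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // Redpanda brokers

        StreamsBuilder builder = new StreamsBuilder();

        // Assumed input: topic "device-readings", keyed by device id, with double readings.
        builder.stream("device-readings", Consumed.with(Serdes.String(), Serdes.Double()))
               .groupByKey(Grouped.with(Serdes.String(), Serdes.Double()))
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
               // Keep a running "sum,count" pair per device and window; a proper
               // aggregate type with its own serde would be cleaner in real code.
               .aggregate(() -> "0.0,0",
                          (deviceId, reading, agg) -> {
                              String[] parts = agg.split(",");
                              double sum = Double.parseDouble(parts[0]) + reading;
                              long count = Long.parseLong(parts[1]) + 1;
                              return sum + "," + count;
                          },
                          Materialized.with(Serdes.String(), Serdes.String()))
               .mapValues(agg -> {
                   String[] parts = agg.split(",");
                   return Double.parseDouble(parts[0]) / Long.parseLong(parts[1]);
               })
               .toStream()
               .foreach((windowedDeviceId, avg) ->
                       System.out.println(windowedDeviceId.key() + " -> " + avg));

        new KafkaStreams(builder.build(), props).start();
    }
}
```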
Thanks, folks. Since it hasn't been mentioned I'm going to add my own answer as well here.
We ended up using Bytewax.
It worked great with our existing Kubernetes setup. It supports stateful operations and scales horizontally into multiple pods if needed. It's pretty performant (1), and since it's basically just a Python program it can be customized to read from and write to whatever you want.
(1) The Bytewax pod actually uses less CPU than our KafkaJS pod, which just stores all messages to a DB.
Here's more information about stream processors that work with Redpanda.
https://redpanda.com/blog/kafka-stream-processors

Real-time processing: Storm / Flink vs standard application (Java, C#...)

I am wondering how to implement an application that processes events coming from Kafka. I have two architecture patterns in mind:
an application developed using the Apache Storm or Apache Flink framework that would process events consumed from Kafka
a Java application (or Python, C#...), deployed X times (scaled depending on traffic), which would process events coming from Kafka
I find it hard to tell which of these scenarios is the more interesting one.
Could someone help me with this topic?
It's hard to give definitive advice with so little information available, so I'll keep my response general until you provide more specific information:
Choosing a processing framework over a native implementation gives you the following advantages:
Parallel processing with (in theory) infinite scalability: If you ever expect that you cannot process all events in a single thread in a timely manner, you first need to scale up (more threads) and eventually scale out (more machines). A framework takes care of all synchronization between threads and machines, so you just need to write sequential code glued together with some high-level primitives (similar to LINQ in C#); a minimal sketch follows this answer.
Fault tolerance: What happens when your code screws up (some edge case not implemented)? When you run out of resources? When the network (to Kinesis or other machines) temporarily breaks? A framework takes care of all these nasty little details.
In case of failure, when you restart the application, most frameworks give you some form of exactly-once processing: How do you avoid losing data? How do you avoid duplicates when reprocessing old data?
Managed state: If your application needs to remember things for a certain time (calculating sums/average or joining data), how do you ensure that the state is kept in sync with data in case of failure?
Advanced features: time triggers, complex event processing (= pattern matching on events), writing to different sinks (Kafka for low latency, S3 for batch processing)
Flexibility of storage: if you want to try out a different storage system, it's much easier to change the source/sink in an application written in a framework.
Integration with deployment platforms: If you want to scale to several machines, it's usually much easier to scale on a platform that already offers the related integration (at the time of writing that is mostly Kubernetes). But all frameworks also support simple local setups where you just scale up on one (bigger) machine.
Low-level optimizations: When using new engines with higher abstractions, it's possible that the frameworks generate code that is much more efficient than what you can implement yourself (with specific memory layout or serialized data processing).
The big downsides are usually:
Complexity of the framework: you need to understand how the framework works from a user's perspective. However, you usually save time by not going into the details of writing a custom consumer/producer, so it's not as bad as it initially seems.
Flexibility in code: you cannot write arbitrary code anymore. Since the framework handles parallelism for you, you need to think in terms of chunks of data and adjust your algorithms accordingly. Standard SQL operations are usually directly supported though in one form or another.
Less control over resource usage: since the platform schedules the tasks across machines, you may end up with unfortunate assignments, and the platform may give you too few options to fix them. Note, though, that most applications suffer more from poor resource utilization caused by data skew and suboptimal algorithms than from the platform itself.
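To make the "sequential code with high-level primitives" point concrete, here is a minimal Flink sketch; the broker address, topic name, and filter predicate are placeholders, and the same shape exists in Kafka Streams or Spark Structured Streaming:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EventFilterJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("events")
                .setGroupId("event-filter")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // The code reads sequentially; the framework decides how to split it across
        // task slots and machines, checkpoints state, and restarts failed tasks.
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-events")
           .filter(value -> value.contains("ERROR"))
           .print();

        env.execute("event-filter");
    }
}
```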

Sharing partitioning logic across polyglot producers with Kafka

We are building an event sourced system at my company, relying on Kafka.
In order to be GDPR compliant, we need to be able to update the events.
Our idea is to use the compaction and tombstone capabilities.
This means that we cannot use the default partitioning strategy, as we want each message to have a unique key (in order to be able to overwrite a specific message), but we still want events occurring on the same aggregate to end up on the same partition.
Which brings us to the creation of a custom partitioner (basically copying the "hash modulo" logic of the default partitioner, but using a different value than the message key to compute the hash).
The issue is that we're evolving in a polyglot environment (we have PHP, Python and Java/Kotlin services publishing and consuming events).
We want to ensure that all these services will produce messages to the same partition given a specific partition key (in case different services will publish events to the same topic).
Our main idea was to use a common hashing algorithm, but it is hard to find one with both a strong distribution guarantee and good stability (i.e. not just part of an experimental lib).
PHP natively supports a wide range of hashing algorithms, but it is hard to find the same support in the other languages.
As Kafka's default partitioner relies on murmur2, we started looking in that direction as well. Unfortunately, it is not natively supported by PHP (although some implementations exist). Furthermore, this algorithm uses a seed, which means that we will need to use the exact same seed across all our publisher services, which is starting to make the approach look quite complex.
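For reference, the hash-modulo logic we are copying looks roughly like this on the Java side (a sketch using the Java client's own Utils.murmur2; the other languages would need a murmur2 implementation that produces identical values, seed included):

```java
import java.nio.charset.StandardCharsets;

import org.apache.kafka.common.utils.Utils;

public final class AggregatePartitioning {

    // Mirrors the hash-modulo rule Kafka's default partitioner applies to keyed
    // records, but applied to an explicit partition key (e.g. the aggregate id)
    // instead of the message key.
    public static int partitionFor(String aggregateId, int numPartitions) {
        byte[] bytes = aggregateId.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(bytes)) % numPartitions;
    }
}
```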
However, we could be looking at the design from the wrong angle. Sharing event store write capabilities across polyglot services might not be a good idea, and each service could have its own partitioning logic as long as it ensures the "one partition per aggregate" requirement. The thing is that we have to think this through ahead of time, because no technical safeguard will prevent a future service from publishing to a "shared" event stream (and not using the exact same partitioning logic will have a huge impact when that happens).
Does anyone have experience building an event store with Kafka in a polyglot environment who could enlighten us on this specific topic, please?

Microservices & Kafka: To couple or not to couple

I'm having a problem wrapping my mind around a probably quite normal setup of microservices and Kafka that we are currently putting in place.
We have one topic in Kafka and multiple consumers reading from this topic via separate consumer groups.
But somehow I think this could lead to coupling between the microservices, as we have two consumers reading the exact same data from the same topic. Additionally, we do not have any retention limit on the messages, and therefore I'm treating Kafka as some kind of data store. So I would think we should rather replicate the messages into a topic of its own for the other service/consumer.
We have different opinions on whether this is coupling or decoupling, and I'd like to hear your opinions on what I'm getting wrong, because I feel like I am. Thank you for your support!
In my opinion, using a Kafka topic for multiple services or apps to consume is the right approach as long as your services don't rely on it repeatedly. Meaning: a service should read the topic once, translate the data into whatever it requires, and store it itself if required. This way the topic doesn't become a permanent data store but rather a decoupled way to feed in data (as if you were to call the service directly with that raw data, but in a more decoupled fashion, by allowing the service to read the topic whenever it is ready, at whatever frequency is required). This increases the resilience of your overall system.
And there is a coupling: the raw data. But from my perspective it is totally OK for multiple services to understand the same data format (of the topic), as long as that format is mostly stable. The assumption here is that this is raw data that each service has to transform into a form that is useful for itself. You just have to make sure the raw data format is versioned correctly whenever changes are necessary, and, to allow services to keep working, you will potentially have to deliver multiple versions concurrently until all services support the latest one. This architectural style is used by many large systems and works, as long as you don't have a scenario where the raw data format needs to change very frequently in a way that makes it incompatible with your service designs. (If that were the case, you'd probably need another layer of stable meta-model below it that can describe the dynamic raw data.)
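As a small illustration of that decoupling: each service subscribes with its own group.id and keeps its own offsets, so it reads the shared topic independently of every other consumer (the topic and group names below are just placeholders):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class BillingServiceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Each service uses its own group.id, so it gets its own view of the
        // stream and its own committed offsets.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "billing-service");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // Translate the shared raw event into this service's own model
                    // and persist it in the service's own store.
                    System.out.printf("billing view of %s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```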

Suggested Hadoop-based Design / Component for Ingestion of Periodic REST API Calls

We are planning to use REST API calls to ingest data from an endpoint and store the data to HDFS. The REST calls are done in a periodic fashion (daily or maybe hourly).
I've already done Twitter ingestion using Flume, but I don't think Flume would suit my current use case, because I am not consuming a continuous data firehose like the Twitter one, but rather making discrete, regular, time-bound invocations.
The idea I have right now is to use custom Java code that takes care of the REST API calls and saves the data to HDFS, and then to use an Oozie coordinator to schedule that Java jar.
I would like to hear suggestions / alternatives (if there is something easier than what I'm thinking right now) about the design and which Hadoop-based component(s) to use for this use case. If you feel I can stick with Flume, then kindly also give me an idea of how to do that.
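For concreteness, the custom Java idea would be roughly the following (a hypothetical sketch; the endpoint, target path, and error handling are placeholders, and it assumes the Hadoop client configuration is on the classpath), with an Oozie coordinator triggering it on a schedule:

```java
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RestToHdfsJob {
    public static void main(String[] args) throws Exception {
        String endpoint = args[0]; // e.g. the REST endpoint to pull from
        String target = args[1];   // e.g. an HDFS path partitioned by date

        // One periodic invocation: pull the endpoint once...
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint)).GET().build();
        HttpResponse<InputStream> response =
                http.send(request, HttpResponse.BodyHandlers.ofInputStream());

        // ...and stream the payload into HDFS.
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf);
             InputStream body = response.body();
             FSDataOutputStream out = fs.create(new Path(target))) {
            body.transferTo(out);
        }
    }
}
```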
As stated on the Apache Flume website:
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
As you can see, among the features attributed to Flume is the gathering of data. "Pushing-like" or "emitting-like" data sources are easy to integrate thanks to HttpSource, AvroSource, ThriftSource, etc. In your case, where the data must be, let's say, "actively pulled" from an HTTP-based service, the integration is not so obvious, but it can be done; for instance, by using ExecSource, which runs a script that fetches the data and pushes it into the Flume agent.
If you use proprietary code in charge of pulling the data and writing it into HDFS, such a design will be OK, but you will be missing some interesting built-in Flume features (which you will probably have to implement yourself):
Reliability. Flume has mechanisms to ensure the data is really persisted in the final storage, retrying until it is effectively written. This is achieved through an internal channel that buffers data both at the input (absorbing load peaks) and at the output (retaining data until it is effectively persisted), together with the transaction concept.
Performance. The usage of transactions and the possibility of configuring multiple parallel sinks (data processors) will make your deployment able to deal with really large amounts of data generated per second.
Usability. By using Flume you don't need to deal with the storage details (e.g. the HDFS API). Moreover, if some day you decide to change the final storage, you only have to reconfigure the Flume agent to use the corresponding new sink.