Will Esper CEP work on a Hadoop or Spark platform?

I have used Esper CEP in a standalone system. Is Esper cloud compatible? Will it run on Hadoop or Spark platforms?
Thanks

Esper runs in any JVM environment with any JVM programming language, and Esper also has a .NET version that runs on any CLR. So Esper can run as part of a Spark and/or Hadoop stack. If you use Esper in a stateless way and only do filtering, transformation and so on, you don't need to worry about where the state lives when something fails, and the world is easy. If you want to use Esper in a stateful way, you must worry about where the state lives, and that requires more thinking. By stateful I mean using aggregations, data windows, patterns and so on. So when you have stateful use, such as a count for example, and you need to make sure the count is not lost when the job moves or the system reboots, that is when you may need EsperHA.
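To make the distinction concrete, here is a minimal Scala sketch against the Esper 8 compile-and-deploy API (a sketch only: the SensorEvent class and both EPL statements are made up for illustration, and older Esper versions use the EPServiceProvider API instead):

import com.espertech.esper.common.client.configuration.Configuration
import com.espertech.esper.compiler.client.{CompilerArguments, EPCompilerProvider}
import com.espertech.esper.runtime.client.EPRuntimeProvider
import scala.beans.BeanProperty

// Hypothetical event class; @BeanProperty gives Esper the JavaBean getters it expects.
case class SensorEvent(@BeanProperty deviceId: String, @BeanProperty temperature: Double)

object EsperSketch extends App {
  val config = new Configuration()
  config.getCommon.addEventType(classOf[SensorEvent])

  // Stateless: pure filtering/transformation, nothing to lose when the job moves.
  val statelessEpl = "select deviceId, temperature from SensorEvent where temperature > 100"

  // Stateful: a data window plus an aggregation; this count lives in runtime memory
  // and is exactly the state EsperHA (or your own checkpointing) would need to protect.
  val statefulEpl = "select deviceId, count(*) from SensorEvent#time(1 hour) group by deviceId"

  val compiled = EPCompilerProvider.getCompiler.compile(s"$statelessEpl;\n$statefulEpl", new CompilerArguments(config))
  val runtime  = EPRuntimeProvider.getDefaultRuntime(config)
  runtime.getDeploymentService.deploy(compiled)
  // Attach an UpdateListener to the deployed statements to receive results.

  // Feed events in from wherever your Spark/Hadoop job gets them.
  runtime.getEventService.sendEventBean(SensorEvent("dev-1", 120.0), "SensorEvent")
}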

Related

How to do stream processing with Redpanda?

Redpanda seems easy to work with, but how would one process streams in real-time?
We have a few thousand IoT devices that send us data every second. We would like to get the running average of the data from the last hour for each of the devices. Can the built-in WebAssembly stuff be used for this, or do we need something like Materialize?
Given that it is marketed as "Kafka compatible," any Kafka library should work with Redpanda, including Kafka Streams, KSQL, Apache Spark, Flink, Storm, etc.
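For the hourly per-device average specifically, a Kafka Streams job pointed at the Redpanda brokers should be enough; the only Redpanda-specific part is bootstrap.servers. Here is a rough sketch in the Kafka Streams Scala DSL (the topic names, the reading encoding, the string-based aggregate serde and the windowing method names are assumptions based on recent Kafka versions):

import java.time.Duration
import java.util.Properties
import org.apache.kafka.common.serialization.Serde
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.kstream.TimeWindows
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes
import org.apache.kafka.streams.scala.serialization.Serdes._

object DeviceAverages extends App {
  // Running (sum, count) so the average can be updated incrementally per window.
  case class Agg(sum: Double, count: Long) { def avg: Double = if (count == 0) 0.0 else sum / count }

  // Throwaway string-encoded serde for the aggregate, just for the sketch.
  implicit val aggSerde: Serde[Agg] = Serdes.fromFn[Agg](
    agg => s"${agg.sum}:${agg.count}".getBytes("UTF-8"),
    bytes => { val Array(s, c) = new String(bytes, "UTF-8").split(":"); Some(Agg(s.toDouble, c.toLong)) }
  )

  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "device-averages")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "redpanda:9092") // Redpanda speaks the Kafka protocol

  val builder = new StreamsBuilder()
  builder
    .stream[String, Double]("device-readings")                        // key = device id, value = reading (assumed)
    .groupByKey
    // Tumbling hourly windows for simplicity; a hopping window gives a continuously sliding "last hour".
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
    .aggregate(Agg(0.0, 0L))((_, reading, agg) => Agg(agg.sum + reading, agg.count + 1))
    .mapValues(_.avg)
    .toStream
    .map((windowedKey, avg) => (windowedKey.key, avg))
    .to("device-hourly-averages")

  new KafkaStreams(builder.build(), props).start()
}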
Thanks, folks. Since it hasn't been mentioned I'm going to add my own answer as well here.
We ended up using Bytewax.
It worked great with our existing Kubernetes setup. It supports stateful operations and scales horizontally into multiple pods if needed. It's pretty performant (1), and since it's basically just a Python program it can be customized to read and write to whatever you want.
(1) The Bytewax pod actually uses less CPU than our KafkaJS pod, which just stores all messages to a DB.
Here's more information about stream processors that work with Redpanda.
https://redpanda.com/blog/kafka-stream-processors

PyFlink performance compared to Scala

How does PyFlink performance compare to Flink + Scala?
Big Picture.
The goal is to build a Lambda architecture with a Cold and a Hot Tier.
Cold (Batch) Tier will be implemented with Apache Spark (PySpark).
For the Hot (Streaming) Tier, however, there are different options: Spark Streaming or Flink.
Since Apache Flink is pure streaming rather than Spark's micro-batches, I tend to choose Apache Flink.
But my only point of concern is the performance of PyFlink. Will it have lower latency than PySpark streaming? Is it slower than Scala-written Flink code? In what cases is it slower?
Thank you in advance!
I have implemented something very similar, and from my experience here are a few things:
The performance of the job depends entirely on the kind of code you write. If you use custom UDFs written in Python during extraction, performance will be slower than doing the same thing with Scala-based code; this happens mainly because of the conversion of Python objects to JVM objects and vice versa. The same applies when you use PySpark.
Flink is a true streaming process; the micro-batches in Spark are not. So if your use case really needs a true streaming service, go with Flink.
If you stick to the native functions provided by PyFlink, you will not observe any noticeable difference in performance.

Real-time processing: Storm / flink vs standard application (java, c#...)

I am wondering about the choice for implementing an application that processes events coming from Kafka. I have two architecture patterns in mind:
an application developed using the Apache Storm or Apache Flink framework that would process events consumed from Kafka
a Java application (or python, C#...), deployed X times (scalable depending on traffic), which would process events coming from Kafka
I find it difficult to see which of the scenarios is the most interesting.
Could someone help me on this topic?
It's hard to give definitive advice with so little information available, so I will leave my response vague until you provide more specific information:
Choosing a processing framework over a native implementation gives you the following advantages:
Parallel processing with (in theory) infinite scalability: If you ever expect that you cannot process all events in a single thread in a timely manner, you first need to scale up (more threads) and eventually scale out (more machines). A framework takes care of all synchronization between threads and machines, so you just need to write sequential code glued together with some high-level primitives (similar to LINQ in C#); see the sketch after this list.
Fault tolerance: What happens when your code screws up (some edge case not implemented)? When you run out of resources? When the network (to Kinesis or other machines) temporarily breaks? A framework takes care of all these nasty little details.
In case of failure, when you restart the application, most frameworks give you some form of exactly-once processing: How do you avoid losing data? How do you avoid duplicates when reprocessing old data?
Managed state: If your application needs to remember things for a certain time (calculating sums/average or joining data), how do you ensure that the state is kept in sync with data in case of failure?
Advanced features: time triggers, complex event processing (=pattern matching on events), writing to different sinks (Kafka for low latency, s3 for batch processing)
Flexibility of storage: if you want to try out a different storage system, it's much easier to change source/sink in an application writing in a framework.
Integration in deployment platforms: If you want to scale to several machines, it's usually much easier to scale a platform that already offers related integration (at the time of writing that should be mostly Kubernetes). But all frameworks also support simple local setups where you just scale-up on one (bigger) machine.
Low-level optimizations: When using new engines with higher abstractions, it's possible that the frameworks generate code that is much more efficient than what you can implement yourself (with specific memory layout or serialized data processing).
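To make the "sequential code plus high-level primitives" point concrete, here is a rough Flink sketch (Scala DataStream API; the broker address, topic, group id and the comma-separated event format are assumptions, and the connector class names follow recent Flink releases). Parallelism, per-key managed state, checkpointing and Kafka offset tracking all come from the framework rather than from hand-written code:

import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.connector.kafka.source.KafkaSource
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

object EventCounter extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.enableCheckpointing(60000) // operator state and Kafka offsets are snapshotted for recovery

  // Hypothetical topic/broker; events are assumed to be "<key>,<payload>" strings.
  val source = KafkaSource.builder[String]()
    .setBootstrapServers("kafka:9092")
    .setTopics("events")
    .setGroupId("event-counter")
    .setStartingOffsets(OffsetsInitializer.earliest())
    .setValueOnlyDeserializer(new SimpleStringSchema())
    .build()

  env.fromSource(source, WatermarkStrategy.noWatermarks[String](), "kafka")
    .map(line => (line.split(",")(0), 1L))                      // sequential-looking transformation
    .keyBy(_._1)                                                // the framework shards state per key across workers
    .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
    .sum(1)                                                     // managed, fault-tolerant aggregation state
    .print()

  env.execute("event-counter")
}

The equivalent hand-rolled consumer would have to manage its own thread pool, per-key state, checkpointing of that state, and offset commits.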
The big downsides are usually:
Complexity of the framework: you need to understand how the framework works from a user's perspective. However, you usually save time by not going into the details of writing a custom consumer/producer, so it's not as bad as it initially seems.
Flexibility in code: you cannot write arbitrary code anymore. Since the framework handles parallelism for you, you need to think in terms of chunks of data and adjust your algorithms accordingly. Standard SQL operations are usually directly supported though in one form or another.
Less control over resource usage: since the platform schedules the task across machines, you may end up with unfortunate assignments, and the platform may give you too few options to fix them. Note, though, that in most applications poor resource utilization comes more from data skew and suboptimal algorithms than from scheduling.

Does Kafka have the capability of a rule engine?

We are using Kafka for messaging and a lot more, but now there is a requirement for some kind of rule engine for data processing based on rules. Does Kafka hold any capability like this (rule engine), or do we have to use a third-party rule engine (e.g. https://camunda.com/dmn/) and integrate it with Kafka?
There is no need to use a third-party rule engine with Apache Kafka. As part of the project there is Kafka Streams, and, to ease the need to write Java code to express rules, there is also ksqlDB, which is based on a subset of ANSI SQL.
While these options are not necessarily a rule engine per se, they share the same semantics: given an intermediate processing step, output the relevant result based on the computation. The difference is in the how, not in the what. So I think they are decent replacements.
You can always integrate both as well. Several rule engines, such as Drools from Red Hat, are Java-based and can thus be easily accessed from a Kafka Streams processor. As long as the if-then-else rules run in the same JVM space as the Kafka Streams application, you won't have any performance penalty other than a possibly bigger JVM heap.
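As a rough sketch of what a simple rule looks like when expressed directly in the Kafka Streams Scala DSL (the topic names and the "customerId,amount" value encoding are made up; branch is the older API, newer Kafka versions use split()):

import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._

object OrderRules extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-rules")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092")

  val builder = new StreamsBuilder()
  val orders = builder.stream[String, String]("orders") // value assumed to be "customerId,amount"

  // The "rule" expressed as stream operations: route large orders for manual review.
  val branches = orders.branch(
    (_, value) => value.split(",")(1).toDouble > 10000, // rule condition
    (_, _) => true                                      // everything else
  )
  branches(0).to("orders-for-review")
  branches(1).to("orders-approved")

  // For richer rule sets, this is the spot where you would call a Java-based engine
  // such as Drools, e.g. orders.mapValues(order => evaluateRules(order)), keeping the
  // rule session in the same JVM as the Streams application (evaluateRules is hypothetical).

  new KafkaStreams(builder.build(), props).start()
}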

How to maintain Alpakka/Akka Streams source state across application restarts?

I am new to Alpakka and am considering using it for system integration. What would be the ideal way to maintain the state of Akka Streams sources across application restarts?
For example: let's assume I'm using something as follows to continuously read some input data and dump it somewhere. What if it runs for like 4h, then the full JVM crashes and restarts (e.g. k8s restarts my pod or so):
someSource
.via(someTransformation)
.via(someOtherTransformation)
.to(...)
.run()
I understand that if someSource is a Kafka source or Kinesis source or some other stateful source, they can keep track of their offset or checkpoint and restart more or less where they left off.
However, many other sources have no such concept, e.g. the Cassandra source, the File source or the RDBMS source. For example, if I shut down and restart the code provided in the RDBMS example, it will restart from the top each time.
Am I understanding correctly that there is no mechanism to address that out of the box, so that we have to handle it manually? I would have imagined that this feature would be so commonly desired that it would be handled somehow. If not, how do people typically address it? Do you use Akka Persistence to store some cursors in a few actors? Or do you store the origin offset together with the output data and re-read it on startup?
Or am I looking at all this the wrong way?
It is a feature that is extremely commonly desired, for the reason you suggest.
However, the only generic, reliable way to implement this would be using Akka Persistence, which is probably the single heaviest dependency in the Akka ecosystem (e.g. it requires choosing a database). Beyond that, it's going to be somewhat source-specific. Some sources (e.g. Kafka, Kinesis) have a means of doing this that will fit the bill in nearly every scenario, but for the others, the details of how to store the state of consumption are something on which there will be a lot of differences of opinion. Akka and Alpakka in general tend to shy away from opinionation.
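For the sources without an offset concept, the manual approach the question hints at (store the origin offset or cursor together with the output and re-read it on startup) usually ends up looking something like this sketch; loadLastProcessedId, fetchBatchAfter and writeToSink are hypothetical placeholders for whatever durable store you choose:

import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

object ResumableRdbmsStream extends App {
  implicit val system: ActorSystem = ActorSystem("resumable")
  import system.dispatcher

  // Hypothetical helpers: persist the cursor next to the output data (or in any durable store).
  def loadLastProcessedId(): Future[Long] = ???                        // e.g. max(id) already written to the sink
  def fetchBatchAfter(id: Long): Future[Vector[(Long, String)]] = ???  // e.g. SELECT id, payload ... WHERE id > ? LIMIT n
  def writeToSink(row: (Long, String)): Future[Unit] = ???             // write the output together with the source id

  val resumable =
    Source.futureSource {
      loadLastProcessedId().map { startId =>
        // Re-read the source from the stored cursor on every (re)start; ends when no rows are left.
        Source.unfoldAsync(startId) { lastId =>
          fetchBatchAfter(lastId).map { batch =>
            if (batch.isEmpty) None else Some((batch.last._1, batch))
          }
        }.mapConcat(identity)
      }
    }

  resumable
    .mapAsync(parallelism = 1)(writeToSink)
    .runWith(Sink.ignore)
}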