Well, I'm a newbie to Apache Flink and have been reading some source code found on the Internet.
Sometimes I see StreamExecutionEnvironment, but I have also seen StreamTableEnvironment.
I've read the official docs, but I still can't figure out the difference between them.
Furthermore, I'm trying to write a Flink streaming job that receives data from Kafka. Which kind of environment should I use in that case?
A StreamExecutionEnvironment is used with the DataStream API. You need a StreamTableEnvironment if you are going to use the higher-level Table or SQL APIs.
The section of the docs on creating a TableEnvironment covers this in more detail.
Which one you should use depends on whether you want to work with lower-level data streams or a higher-level relational API. Both can be used to implement a streaming job that reads from Kafka.
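As a rough illustration, here is a minimal sketch that creates both environments and reads from Kafka with the DataStream API. It assumes the flink-connector-kafka dependency, a local broker, and a topic named "events"; all names here are illustrative, not something your project necessarily has.

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

    public class KafkaEnvironmentsSketch {
        public static void main(String[] args) throws Exception {
            // DataStream API entry point
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Table/SQL API entry point, layered on top of the same execution environment
            StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

            // A Kafka source for the DataStream API (broker address and topic are placeholders)
            KafkaSource<String> source = KafkaSource.<String>builder()
                    .setBootstrapServers("localhost:9092")
                    .setTopics("events")
                    .setGroupId("demo")
                    .setStartingOffsets(OffsetsInitializer.earliest())
                    .setValueOnlyDeserializer(new SimpleStringSchema())
                    .build();

            DataStream<String> stream =
                    env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source");

            // The same stream can be exposed to the relational API and queried with SQL
            Table table = tableEnv.fromDataStream(stream);
            tableEnv.createTemporaryView("events_view", table);

            stream.print();
            env.execute("kafka-environments-sketch");
        }
    }

Note how the StreamTableEnvironment is created on top of the StreamExecutionEnvironment: the Table/SQL API is a layer over the DataStream runtime, so using it does not exclude the lower-level API.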
The documentation includes a pair of code walkthroughs introducing both APIs, which should help you figure out which API is better suited for your use case. See the DataStream walkthrough and the Table walkthrough.
To learn the DataStream API, spend a day working through the self-paced training. There's also training for Flink SQL. Both are free.
There are a lot of articles about Kafka and event sourcing. Most of them argue that Kafka is not really useful for event sourcing, because you cannot query the events for just a given aggregate id.
If you store the events in a topic, then yes, this is true, because we would need to read all the events and skip those that are not relevant.
But what about storing the events in RocksDB? Then we can actually query all the events for a given aggregate by using the aggregate id as a prefix and doing a range query in RocksDB.
Is this a good approach? I know that this will use large state and can be problematic when a rebalance occurs. But again, maybe static membership in Kafka will help with this.
Kafka Streams's default disk-based StateStore is RocksDB, so yes, this is a perfectly valid approach.
You'd query the store via Kafka Streams APIs, not RocksDB directly, however.
Then we can actually query all the events for a given aggregate by using the aggregate id as a prefix
It's unclear what you mean by prefix. The stores are built exclusively from Kafka record keys, not from prefixed values. However, as linked in the comments, the store does support prefix scanning, but I assume that would be a prefix of the Kafka record keys.
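To make the prefix-scan point concrete, here is a hedged sketch of querying a key-value store through the Kafka Streams API. The store name and the "aggregateId-sequence" key format are assumptions of this example, not something your topology necessarily has.

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StoreQueryParameters;
    import org.apache.kafka.streams.state.KeyValueIterator;
    import org.apache.kafka.streams.state.QueryableStoreTypes;
    import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

    public class EventStoreQuerySketch {

        // Reads all events whose key starts with the given aggregate id,
        // assuming the topology materialized events into a store named "events-store"
        // with String keys shaped like "<aggregateId>-<sequence>".
        public static void printEventsForAggregate(KafkaStreams streams, String aggregateId) {
            ReadOnlyKeyValueStore<String, String> store = streams.store(
                    StoreQueryParameters.fromNameAndType("events-store",
                            QueryableStoreTypes.<String, String>keyValueStore()));

            // prefixScan performs a range query over the RocksDB-backed store
            try (KeyValueIterator<String, String> it =
                         store.prefixScan(aggregateId, Serdes.String().serializer())) {
                while (it.hasNext()) {
                    System.out.println(it.next());
                }
            }
        }
    }

The important point is that the query goes through the Kafka Streams store abstraction; RocksDB is only the physical store underneath.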
this will use large state and can be problematic
You can refer to the Memory Management page on handling state and what to tune to control its size.
Kafka Streams with RocksDB is a really good solution for getting started quickly and understanding the concepts, but I am not sure it is a good idea for production in the long term.
My personal experience with RocksDB in production was not that brilliant; if you plan to use a key/value database in production, Apache Cassandra seems to be a much better solution.
But you are also right that querying things only by primary key is not that flexible, so implementing CQRS makes much more sense, giving you much more flexibility on the query side.
As someone who has walked the path you plan to walk, you can find my proposed solution in my blog :)
Redpanda seems easy to work with, but how would one process streams in real-time?
We have a few thousand IoT devices that send us data every second. We would like to get the running average of the data from the last hour for each of the devices. Can the built-in WebAssembly stuff be used for this, or do we need something like Materialize?
Given that it is marketed as "Kafka Compatible," any Kafka library should work with Redpanda, including Kafka Streams, KSQL, Apache Spark, Flink, Storm, etc.
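For the hourly running average itself, any of those frameworks can do it with a keyed window. As one hedged sketch (the framework choice, window sizes, and field layout are this example's assumptions), a per-device sliding-window average could look like this in Flink's DataStream API:

    import org.apache.flink.api.common.functions.AggregateFunction;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class DeviceAverageSketch {

        // readings is a stream of (deviceId, value) pairs, e.g. consumed from a Redpanda topic
        public static DataStream<Double> hourlyAverages(DataStream<Tuple2<String, Double>> readings) {
            return readings
                    .keyBy(reading -> reading.f0)
                    // average over the last hour, refreshed every minute
                    .window(SlidingProcessingTimeWindows.of(Time.hours(1), Time.minutes(1)))
                    .aggregate(new AverageAggregate());
        }

        // Accumulates (sum, count) per device and window, then emits sum / count
        public static class AverageAggregate
                implements AggregateFunction<Tuple2<String, Double>, Tuple2<Double, Long>, Double> {

            @Override
            public Tuple2<Double, Long> createAccumulator() {
                return Tuple2.of(0.0, 0L);
            }

            @Override
            public Tuple2<Double, Long> add(Tuple2<String, Double> value, Tuple2<Double, Long> acc) {
                return Tuple2.of(acc.f0 + value.f1, acc.f1 + 1);
            }

            @Override
            public Double getResult(Tuple2<Double, Long> acc) {
                return acc.f1 == 0 ? 0.0 : acc.f0 / acc.f1;
            }

            @Override
            public Tuple2<Double, Long> merge(Tuple2<Double, Long> a, Tuple2<Double, Long> b) {
                return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
            }
        }
    }

Kafka Streams, ksqlDB, or Materialize would express the same idea with their own windowing constructs; the shape of the computation (key by device, window by time, aggregate) stays the same.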
Thanks, folks. Since it hasn't been mentioned I'm going to add my own answer as well here.
We ended up using Bytewax.
It worked great with our existing Kubernetes setup. It supports stateful operations and scales horizontally into multiple pods if needed. It's pretty performant (1), and since it's basically just a Python program, it can be customized to read from and write to whatever you want.
(1) The Bytewax pod actually uses less CPU than our KafkaJS pod, which just stores all messages to a DB.
Here's more information about stream processors that work with Redpanda.
https://redpanda.com/blog/kafka-stream-processors
We are using Kafka for messaging and a lot more, but now there is a requirement for some kind of rule engine for data processing based on rules. Does Kafka have any capability like this (a rule engine), or do we have to use a third-party rule engine (e.g. https://camunda.com/dmn/) and integrate it with Kafka?
There is no need to use a third-party rule engine with Apache Kafka. The project includes Kafka Streams, and, to reduce the need to write Java code to express rules, there is also ksqlDB, which is based on a subset of ANSI SQL.
While these options are not rule engines per se, they share the same semantics: given some input, produce the relevant output based on a computation. The difference is in the how, not the what, so I think they are decent replacements.
You can also integrate both. Several rule engines, such as Drools from Red Hat, are Java-based and thus can be easily accessed from a Kafka Streams processor. As long as the if-then-else rules run in the same JVM as the Kafka Streams application, you won't have any performance penalty other than a possibly bigger JVM heap.
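As a small hedged sketch of that idea (the topic names and the rule itself are made up for illustration), an if-then-else rule expressed directly in Kafka Streams could look like this:

    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.Produced;

    public class RuleEngineSketch {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            KStream<String, String> orders =
                    builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()));

            // "Rule": route large orders to one topic and everything else to another.
            orders.filter((key, value) -> isLargeOrder(value))
                  .to("orders-large", Produced.with(Serdes.String(), Serdes.String()));
            orders.filterNot((key, value) -> isLargeOrder(value))
                  .to("orders-regular", Produced.with(Serdes.String(), Serdes.String()));

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "rule-engine-sketch");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            new KafkaStreams(builder.build(), props).start();
        }

        // Placeholder "rule"; a real implementation might parse JSON here,
        // or delegate to a Drools session running in the same JVM.
        private static boolean isLargeOrder(String value) {
            try {
                return Double.parseDouble(value) > 1000.0;
            } catch (NumberFormatException e) {
                return false;
            }
        }
    }

In a Drools-based variant, the predicate would simply call into the rule session instead of hard-coding the condition, which keeps the routing topology unchanged while the rules evolve.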
We are planning to use REST API calls to ingest data from an endpoint and store the data to HDFS. The REST calls are done in a periodic fashion (daily or maybe hourly).
I've already done Twitter ingestion using Flume, but I don't think Flume would suit my current use case, because I am not consuming a continuous data firehose like Twitter's, but rather making discrete, regular, time-bound invocations.
The idea I have right now is to use custom Java code that takes care of the REST API calls and saves the data to HDFS, and then to use an Oozie coordinator to schedule that Java jar.
I would like to hear suggestions/alternatives (if there is something easier than what I'm thinking right now) about the design and which Hadoop-based component(s) to use for this use case. If you feel I can stick with Flume, then kindly also give me an idea of how to do this.
As stated on the Apache Flume website:
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
As you can see, among the features attributed to Flume is the gathering of data. "Pushing-like" or "emitting-like" data sources are easy to integrate thanks to HttpSource, AvroSource, ThriftSource, etc. In your case, where the data must be, let's say, "actively pulled" from an HTTP-based service, the integration is not so obvious, but it can be done, for instance by using ExecSource, which runs a script that gets the data and pushes it to the Flume agent.
If you use custom code in charge of pulling the data and writing it to HDFS, such a design will be OK, but you will be missing some interesting built-in Flume characteristics (which you will probably have to implement yourself); a sketch of the custom-code route is shown after this list:
Reliability. Flume has mechanisms to ensure the data is really persisted in the final storage, retrying until it is effectively written. This is achieved through the use of an internal channel that buffers data both at the input (absorbing load peaks) and the output (retaining data until it is effectively persisted), together with the transaction concept.
Performance. The use of transactions and the possibility of configuring multiple parallel sinks (data processors) will make your deployment able to deal with really large amounts of data generated per second.
Usability. By using Flume you don't need to deal with the storage details (e.g. the HDFS API). If some day you decide to change the final storage, you only have to reconfigure the Flume agent to use the corresponding sink.
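For completeness, here is a minimal sketch of the custom-code route mentioned above, using the Java 11+ HTTP client and the Hadoop FileSystem API. The endpoint URL, HDFS path, and absence of error handling are all placeholders; an Oozie coordinator would simply run this on a schedule.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;
    import java.time.LocalDate;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RestToHdfsSketch {
        public static void main(String[] args) throws Exception {
            // 1. Pull the data from the REST endpoint (placeholder URL)
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://example.com/api/data"))
                    .GET()
                    .build();
            String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

            // 2. Write it to HDFS, one file per run
            Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
            try (FileSystem fs = FileSystem.get(conf);
                 FSDataOutputStream out =
                         fs.create(new Path("/data/ingest/" + LocalDate.now() + ".json"))) {
                out.write(body.getBytes(StandardCharsets.UTF_8));
            }
        }
    }

This is roughly the code you would have to own and harden yourself (retries, partial writes, backpressure), which is exactly the set of concerns the Flume features listed above cover for you.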
I wonder if it is possible, or if someone has tried, to set up Apache Kafka as a consumer of PostgreSQL's logical log stream? Does that even make sense?
https://wiki.postgresql.org/wiki/Logical_Log_Streaming_Replication
I have a legacy source system that I need to build a real-time dashboard from. For various reasons I can't hook into the application events (by the way, it's a Java app). Instead, I'm thinking of some kind of lambda architecture: when the dashboard initializes, it reads from persisted data in a "data warehouse", which gets there after some ETL, and then change events are streamed via Kafka to the dashboard.
Another use of the events stored in Kafka would be a kind of change data capture approach for populating the data warehouse. This is necessary because there is no commercial CDC tool that supports PostgreSQL, and the source application updates tables without keeping history.
A combination of xstevens's PostgreSQL WAL-to-protobuf project, decoderbufs (https://github.com/xstevens/decoderbufs), and his pg_kafka producer (https://github.com/xstevens/pg_kafka) might be a start.
Take a look at Bottled Water, which:
"uses the logical decoding feature (introduced in PostgreSQL 9.4) to extract a consistent snapshot and a continuous stream of change events from a database. The data is extracted at a row level, and encoded using Avro. A client program connects to your database, extracts this data, and relays it to Kafka."
They also have Docker images, so it looks like it'd be easy to try out.
The Debezium project provides a CDC connector for streaming data changes from Postgres into Apache Kafka. Currently it supports Decoderbufs and wal2json as logical decoding plug-ins. Bottled Water, referenced in Steve's answer, is comparable, but it is no longer actively maintained.
Disclaimer: I'm the project lead of Debezium
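To give a flavour of what this looks like in practice, here is a hedged sketch of registering a Debezium Postgres connector against a Kafka Connect worker via its REST API. The host names, credentials, and topic prefix are placeholders, and the exact property names vary across Debezium versions, so verify them against the documentation for the version you run.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RegisterPostgresConnectorSketch {
        public static void main(String[] args) throws Exception {
            // Connector configuration; values and plug-in choice are placeholders.
            String connectorJson = """
                    {
                      "name": "inventory-connector",
                      "config": {
                        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
                        "plugin.name": "wal2json",
                        "database.hostname": "localhost",
                        "database.port": "5432",
                        "database.user": "postgres",
                        "database.password": "postgres",
                        "database.dbname": "inventory",
                        "database.server.name": "dbserver1"
                      }
                    }
                    """;

            // Register the connector with a Kafka Connect worker assumed to run on localhost:8083
            HttpResponse<String> response = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder()
                            .uri(URI.create("http://localhost:8083/connectors"))
                            .header("Content-Type", "application/json")
                            .POST(HttpRequest.BodyPublishers.ofString(connectorJson))
                            .build(),
                    HttpResponse.BodyHandlers.ofString());

            System.out.println(response.statusCode() + " " + response.body());
        }
    }

Once the connector is registered, each captured table's changes appear as a Kafka topic, which the dashboard and the data-warehouse ETL described in the question can both consume.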