I have a Google Cloud Dataflow pipeline (written with the Apache Beam SDK) that, in its normal mode of operation, handles event data published to Cloud Pub/Sub.
In order to bring the pipeline state up to date, and to create the correct outputs, there is a significant amount of historical event data which must be processed first. This historical data is available via JDBC. In testing, I am able to use the JdbcIO.Read PTransform to read and handle all historical state, but I'd like to initialize my production pipeline using this JDBC event data, and then cleanly transition to reading events from Pub/Sub. This same process may happen again in the future if the pipeline logic is ever altered in a backward incompatible way.
Note that while this historical read is happening, new events continue to arrive in Pub/Sub (and these end up in the database as well), so there should be a clean cutover between historical events read only from JDBC and newer events read only from Pub/Sub.
Some approaches I have considered:
Have a pipeline that reads from both inputs, but filters data from JDBC to before a certain timestamp and data from Pub/Sub to after that timestamp. Once the pipeline is caught up, deploy an update removing the JDBC input.
I don't think this will work, because removal of an I/O transform is not backward compatible. Alternatively, the JDBC part of the pipeline would have to stay there forever, burning CPU cycles for no good reason.
Write a one-time job that populates pub/sub with the entirety of the historical data, and then starts the main pipeline reading only from pub/sub.
This seems to use more Pub/Sub resources than necessary, and I think newer data interleaved in the pipeline with much older data will cause watermarks to be advanced too early.
Variation of option #2 -- stop creating new events until the historical data is handled, to avoid messing up watermarks.
This requires downtime.
It seems like it would be a common requirement to backfill historical data into a pipeline, but I haven't been able to find a good approach to this.
Your first option, reading from a bounded source (filtered to timestamp <= cutoff) and from Pub/Sub (filtered to timestamp > cutoff), should work well.
Because JdbcIO.Read() is a bounded source, it will read all the data and then "finish", i.e. never produce any more data, advance its watermark to +infinity, and not be invoked again (so there's no concern about it consuming CPU cycles, even though it remains in your graph).
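A minimal sketch of the cutover predicates (plain Java; the class and method names are hypothetical, and the Beam wiring they would sit in is described in the comments):

```java
// Sketch of the cutover predicates. In the real pipeline these checks would
// sit in Filter transforms applied to the JdbcIO.Read and Pub/Sub outputs
// before a Flatten merges the two PCollections.
public class Cutover {
    // Events at or before the cutoff are taken from JDBC only.
    public static boolean fromHistorical(long eventMillis, long cutoffMillis) {
        return eventMillis <= cutoffMillis;
    }

    // Events strictly after the cutoff are taken from Pub/Sub only.
    // Together the two predicates cover every event exactly once, which is
    // what makes the cutover clean even though events appear in both inputs.
    public static boolean fromLive(long eventMillis, long cutoffMillis) {
        return eventMillis > cutoffMillis;
    }
}
```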
There are a lot of articles across the internet about the usage of the Kafka Streams, but almost nothing about how it's done internally.
Does it use any Kafka features outside the standard set (let's call the librdkafka implementation "standard")?
If it saves state in RocksDB (or any custom StateStore), how does it guarantee that saving the state and committing the offsets happen in one transaction?
The same question applies when the state is saved in a compacted log (the commit and the log update should be in one transaction).
Thank you.
I found the answers by combining information from several threads here.
It uses transactions (see https://stackoverflow.com/a/54593467/414016), which are not (yet) supported by librdkafka.
It doesn't really rely on RocksDB; instead, it saves the state changes into a compacted changelog topic (see https://stackoverflow.com/a/50264900/414016).
It does this using the transactions mentioned above.
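For reference, this transactional behavior is what Kafka Streams switches on via a single setting (a plain config fragment, not specific to any one application):

```properties
# Kafka Streams config: route state-store changelog writes, output records,
# and consumer-offset commits through one Kafka transaction.
processing.guarantee=exactly_once
# Newer client/broker versions also offer the more efficient exactly_once_v2.
```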
We have a micro-services architecture, with Kafka used as the communication mechanism between the services. Some of the services have their own databases. Say the user makes a call to Service A, which should result in a record (or set of records) being created in that service’s database. Additionally, this event should be reported to other services, as an item on a Kafka topic. What is the best way of ensuring that the database record(s) are only written if the Kafka topic is successfully updated (essentially creating a distributed transaction around the database update and the Kafka update)?
We are thinking of using spring-kafka (in a Spring Boot WebFlux service), and I can see that it has a KafkaTransactionManager, but from what I understand this is more about Kafka transactions themselves (ensuring consistency across the Kafka producers and consumers), rather than synchronising transactions across two systems (see here: “Kafka doesn't support XA and you have to deal with the possibility that the DB tx might commit while the Kafka tx rolls back.”). Additionally, I think this class relies on Spring’s transaction framework which, at least as far as I currently understand, is thread-bound, and won’t work if using a reactive approach (e.g. WebFlux) where different parts of an operation may execute on different threads. (We are using reactive-pg-client, so are manually handling transactions, rather than using Spring’s framework.)
Some options I can think of:
Don’t write the data to the database: only write it to Kafka. Then use a consumer (in Service A) to update the database. This might not be the most efficient, and has the problem that the service the user called cannot immediately see the database changes it should have just created.
Don’t write directly to Kafka: write to the database only, and use something like Debezium to report the change to Kafka. The problem here is that the changes are based on individual database records, whereas the business significant event to store in Kafka might involve a combination of data from multiple tables.
Write to the database first (if that fails, do nothing and just throw the exception). Then, when writing to Kafka, assume that the write might fail. Use the built-in auto-retry functionality to get it to keep trying for a while. If that eventually completely fails, try to write to a dead letter queue and create some sort of manual mechanism for admins to sort it out. And if writing to the DLQ fails (i.e. Kafka is completely down), just log it some other way (e.g. to the database), and again create some sort of manual mechanism for admins to sort it out.
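A rough sketch of that failure ladder (the Sender interface and all names here are hypothetical; in a real service the senders would wrap a KafkaProducer, a dead-letter topic, and a database-backed log for admins):

```java
import java.util.function.Consumer;

// Option 3's failure ladder: retry Kafka, then fall back to a DLQ, then to a
// last-resort log. The DB write has already committed by the time publish()
// is called, so every rung only deals with the Kafka side.
public class FallbackPublisher {
    public interface Sender { void send(String event) throws Exception; }

    private final Sender kafka;
    private final Sender deadLetterQueue;
    private final Consumer<String> lastResortLog;
    private final int maxRetries;

    public FallbackPublisher(Sender kafka, Sender deadLetterQueue,
                             Consumer<String> lastResortLog, int maxRetries) {
        this.kafka = kafka;
        this.deadLetterQueue = deadLetterQueue;
        this.lastResortLog = lastResortLog;
        this.maxRetries = maxRetries;
    }

    public void publish(String event) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try { kafka.send(event); return; } catch (Exception retryable) { /* keep trying */ }
        }
        try { deadLetterQueue.send(event); }                  // Kafka gave up: park it for admins
        catch (Exception e) { lastResortLog.accept(event); }  // DLQ down too: log some other way
    }
}
```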
Anyone got any thoughts or advice on the above, or able to correct any mistakes in my assumptions above?
Thanks in advance!
I'd suggest using a slightly altered variant of approach 2.
Write into your database only, but in addition to the actual table writes, also write "events" into a special table within that same database; these event records would contain the aggregations you need. In the easiest way, you'd simply insert another entity e.g. mapped by JPA, which contains a JSON property with the aggregate payload. Of course this could be automated by some means of transaction listener / framework component.
Then use Debezium to capture the changes just from that table and stream them into Kafka. That way you have both: eventually consistent state in Kafka (the events in Kafka may trail behind or you might see a few events a second time after a restart, but eventually they'll reflect the database state) without the need for distributed transactions, and the business level event semantics you're after.
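A minimal in-memory sketch of that write side (all names are hypothetical; a real implementation would write to two tables within one JDBC/JPA transaction):

```java
import java.util.ArrayList;
import java.util.List;

// The business entity and the aggregated event land in the same database
// inside the same transaction, so they commit or roll back together.
// Debezium then streams only the outbox table into Kafka.
public class OutboxWrite {
    final List<String> ordersTable = new ArrayList<>();
    final List<String> outboxTable = new ArrayList<>();

    // Stands in for a single ACID transaction spanning both tables.
    public void placeOrder(String orderRow, String aggregateEventJson) {
        ordersTable.add(orderRow);           // the actual table write(s)
        outboxTable.add(aggregateEventJson); // e.g. a JSON property with the aggregate payload
    }
}
```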
(Disclaimer: I'm the lead of Debezium; funnily enough I'm just in the process of writing a blog post discussing this approach in more detail)
Here are the posts
https://debezium.io/blog/2018/09/20/materializing-aggregate-views-with-hibernate-and-debezium/
https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/
First of all, I have to say that I'm neither a Kafka nor a Spring expert, but I think this is more of a conceptual challenge when writing to independent resources, and the solution should be adaptable to your technology stack. Furthermore, I should say that this solution tries to solve the problem without an external component like Debezium, because in my opinion each additional component brings challenges in testing, maintaining and running an application that are often underestimated when choosing such an option. Also, not every database can be used as a Debezium source.
To make sure that we are talking about the same goals, let's clarify the situation with a simplified airline example where customers can buy tickets. After a successful order, the customer receives a message (mail, push notification, ...) sent by an external messaging system (the system we have to talk to).
In a traditional JMS world with an XA transaction between our database (where we store orders) and the JMS provider, it would look like the following: the client sends the order to our app, where we start a transaction. The app stores the order in its database, then the message is sent to JMS, and the transaction is committed. Both operations participate in the transaction even though they talk to their own resources. As the XA transaction guarantees ACID, we're fine.
Let's bring Kafka (or any other resource that cannot participate in the XA transaction) into the game. As there is no coordinator syncing both transactions anymore, the main idea of the following is to split processing into two parts with a persistent state.
When you store the order in your database, you can also store the message (with aggregated data) that you want to send to Kafka afterwards in the same database (e.g. as JSON in a CLOB column). Same resource, so ACID is guaranteed; everything is fine so far. Now you need a mechanism that polls your "KafkaTasks" table for new tasks that should be sent to a Kafka topic (e.g. with a timer service; maybe the @Scheduled annotation can be used in Spring). After a message has been successfully sent to Kafka, you delete the task entry. This ensures that the message is only sent to Kafka when the order has also been successfully stored in the application database.
Did we achieve the same guarantees as with an XA transaction? Unfortunately no, as there is still the chance that writing to Kafka works but deleting the task fails. In that case the retry mechanism (you would need one, as mentioned in your question) would reprocess the task and send the message twice. If your business case is happy with this "at-least-once" guarantee, you're done here, with an (imho) semi-complex solution that could easily be implemented as framework functionality so not everyone has to bother with the details.
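The polling side can be sketched like this (in-memory stand-ins for the task table and the producer; all names are hypothetical):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

// A scheduled job reads pending tasks, sends each one to Kafka, and deletes
// the row only after a successful send. If the delete fails after the send
// succeeded, the next cycle re-sends the task: that is exactly the
// at-least-once duplicate described above.
public class TaskPoller {
    private final Map<Long, String> kafkaTasks = new LinkedHashMap<>(); // id -> payload
    private final Consumer<String> kafkaSend; // stands in for a blocking producer send

    public TaskPoller(Consumer<String> kafkaSend) { this.kafkaSend = kafkaSend; }

    public void addTask(long id, String payload) { kafkaTasks.put(id, payload); }

    // One polling cycle, e.g. driven by @Scheduled in Spring.
    public void poll() {
        for (Map.Entry<Long, String> task : new ArrayList<>(kafkaTasks.entrySet())) {
            kafkaSend.accept(task.getValue()); // throws on failure -> row kept for next cycle
            kafkaTasks.remove(task.getKey());  // delete only after the send succeeded
        }
    }

    public int pendingTasks() { return kafkaTasks.size(); }
}
```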
If you need "exactly-once", then you cannot store your state in the application database (in this case, "deletion of a task" is the state); instead, you must store it in Kafka (assuming ACID guarantees between two Kafka topics, i.e. Kafka transactions). An example: say you have 100 tasks in the table (IDs 1 to 100) and the task job processes the first 10. You write your Kafka messages to their topic and another message with the ID 10 to "your topic", all in the same Kafka transaction. In the next cycle you consume your topic (the value is 10) and take this value to fetch the next 10 tasks (and delete the already-processed tasks).
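The batch-selection arithmetic from that example, as a small sketch (hypothetical helper; the cursor value itself would be read from and written to Kafka transactionally):

```java
// Given the highest task ID committed to the cursor topic, select the next
// batch of task IDs to process. With lastProcessedId = 10 and batchSize = 10,
// this yields IDs 11..20, matching the example above.
public class TaskCursor {
    public static long[] nextBatch(long lastProcessedId, long maxTaskId, int batchSize) {
        long from = lastProcessedId + 1;
        long to = Math.min(lastProcessedId + batchSize, maxTaskId);
        if (from > to) return new long[0]; // nothing left to process
        long[] ids = new long[(int) (to - from + 1)];
        for (int i = 0; i < ids.length; i++) ids[i] = from + i;
        return ids;
    }
}
```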
If there are easier (in-application) solutions with the same guarantees, I'm looking forward to hearing from you!
Sorry for the long answer but I hope it helps.
All the approaches described above are good ways to tackle the problem, and they are well-defined patterns. You can explore them via the links provided below.
Pattern: Transactional outbox
Publish an event or message as part of a database transaction by saving it in an OUTBOX in the database.
http://microservices.io/patterns/data/transactional-outbox.html
Pattern: Polling publisher
Publish messages by polling the outbox in the database.
http://microservices.io/patterns/data/polling-publisher.html
Pattern: Transaction log tailing
Publish changes made to the database by tailing the transaction log.
http://microservices.io/patterns/data/transaction-log-tailing.html
Debezium is a valid answer, but (as I've experienced) it can require some extra overhead of running an extra pod and making sure that pod doesn't fall over. This could just be me griping about a few back-to-back incidents where pods OOM-errored and didn't come back up, networking rule rollouts dropped some messages, and WAL access to an AWS Aurora DB started behaving oddly... It seems that everything that could have gone wrong, did. I'm not saying Debezium is bad; it's fantastically stable, but often for devs running it becomes a networking skill rather than a coding skill.
As a KISS solution using normal coding techniques that will work 99.99% of the time (and inform you of the 0.01%):
Start a transaction.
Synchronously save to the DB.
-> If that fails, bail out.
Asynchronously send the message to Kafka.
Block until the topic reports that it has received the message.
-> If it times out or fails, abort the transaction.
-> If it succeeds, commit the transaction.
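A sketch of that flow with in-memory stand-ins (all names are hypothetical; with a real producer the acknowledgement future would come from producer.send(record)):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// The Kafka send is started asynchronously, but the database transaction is
// only committed once the broker acknowledges the record; a timeout, a send
// failure, or a DB save failure aborts the transaction instead.
public class KissPublisher {
    public interface Tx { void commit(); void abort(); }

    public static boolean saveAndPublish(Tx tx, Runnable dbSave,
                                         CompletableFuture<?> kafkaAck,
                                         long timeoutMs) {
        try {
            dbSave.run();                                    // sync save; throws -> bail out
            kafkaAck.get(timeoutMs, TimeUnit.MILLISECONDS);  // block for the broker's ack
            tx.commit();
            return true;
        } catch (Exception e) {
            tx.abort();
            return false;
        }
    }
}
```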
I'd suggest using a new approach: the 2-phase message. With this approach much less code is needed, and you don't need Debezium any more.
https://betterprogramming.pub/an-alternative-to-outbox-pattern-7564562843ae
For this new approach, what you need to do is:
When writing your database, write an event record to an auxiliary table.
Submit a 2-phase message to DTM
Write a service that queries whether an event is saved in the auxiliary table.
With the help of the DTM SDK, you can accomplish the above 3 steps with 8 lines of Go, much less code than other solutions.
    msg := dtmcli.NewMsg(DtmServer, gid).
        Add(busi.Busi+"/TransIn", &TransReq{Amount: 30})
    err := msg.DoAndSubmitDB(busi.Busi+"/QueryPrepared", db, func(tx *sql.Tx) error {
        return AdjustBalance(tx, busi.TransOutUID, -req.Amount)
    })
    app.GET(BusiAPI+"/QueryPrepared", dtmutil.WrapHandler2(func(c *gin.Context) interface{} {
        return MustBarrierFromGin(c).QueryPrepared(db)
    }))
Each of your original options has its disadvantages:
The user cannot immediately see the database changes they have just created.
Debezium will capture the database log, which may be much larger than the events you want. Also, deployment and maintenance of Debezium is not an easy job.
"Built-in auto-retry functionality" is not cheap; it may require a lot of code or maintenance effort.
We are aggregating in session windows by using the following code:
.windowedBy(SessionWindows.with(...))
.aggregate(..., ..., ...)
The state store that is created for us automatically is backed by a changelog topic with cleanup.policy=compact.
When redeploying our topology, we found that restoring the state store took much longer than expected (10+ minutes). The explanation seems to be that even though a session has been closed, it is still present in the changelog topic.
We noticed that session windows have a default maintain duration of one day but even after the inactivity + maintain durations have been exceeded, it does not look like messages are removed from the changelog topic.
a) Do we need to manually delete "old" (by our definition) messages to keep the size of the changelog topic under control? (This may be the case, as hinted at by [1].)
b) Would it be possible to somehow have the changelog topic created with cleanup.policy=compact,delete and would that even make sense?
[1] A session store seems to be created internally by Kafka Stream's UnwindowedChangelogTopicConfig (and not WindowedChangelogTopicConfig) which may make this comment from Kafka Streams - reducing the memory footprint for large state stores relevant: "For non-windowed store, there is no retention policy. The underlying topic is compacted only. Thus, if you know, that you don't need a record anymore, you would need to delete it via a tombstone. But it's a little tricky to achieve... – Matthias J. Sax Jun 27 '17 at 22:07"
You are hitting a bug. I just created a ticket for this: https://issues.apache.org/jira/browse/KAFKA-7101.
I would recommend that you modify the topic config manually to fix the issue in your deployment.
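For example, with the stock kafka-configs tool (the topic name and retention value below are placeholders; depending on your broker version the tool may need --zookeeper instead of --bootstrap-server):

```shell
# Switch the internal changelog topic to compact+delete so that closed
# sessions eventually age out; retention.ms here is an example value (1 day).
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name my-app-store-changelog \
  --alter --add-config 'cleanup.policy=[compact,delete],retention.ms=86400000'
```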
We are using Debezium + PostgreSQL.
Notice that we get 4 types of events for create, read, update and delete - c, r, u and d.
The read type of event is unused by our application. Actually, I could not think of a use case for the 'r' events unless we are doing auditing or mirroring the activities of a transaction.
We are facing difficulties scaling, and I suspect it is because the network is getting hogged by read-type events.
How do we filter out those events in PostgreSQL itself?
I got a clue from one of the contributors to use snapshot.mode. I guess it is something that has to be configured around when Debezium creates a snapshot, but I am unable to figure out how to do that.
It is likely that your database has existed for some time and contains data and changes that have been purged from the logical decoding logs. If you then start using the Debezium PostgreSQL connector to start capturing changes into Kafka, the question becomes what a consumer of the events in Kafka should be able to see.
One scenario is that a consumer should be able to see events for all rows in the database, even those that existed prior to the start of CDC. For example, this allows a consumer to completely reproduce/replicate all of the existing data and keep that data in sync over time. To accomplish this, the Debezium PostgreSQL connector can begin by creating a snapshot of the database contents before it starts capturing changes. This is done atomically, so even if the snapshot process takes a while to run, the connector will still see all of the events that occurred since the snapshot was started. These events are represented as "read" events, since in effect the connector is simply reading the existing rows. However, they are structured identically to "insert" events, so any application could treat reads and inserts in the same way.
On the other hand, if consumers of the events in Kafka do not need to see events for all existing rows, then the connector can be configured to avoid the snapshot and to instead begin by capturing the changes. This may be useful in some scenarios where the entire database state need not be found in Kafka, but instead the goal is to simply capture the changes that are occurring.
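Assuming the connector is deployed via Kafka Connect, skipping the snapshot (and with it the 'r' events) is a single connector setting; the connector name and hostname below are placeholders:

```json
{
  "name": "my-postgres-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "db.example.com",
    "snapshot.mode": "never"
  }
}
```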
The Debezium PostgreSQL connector will work either way, so you should use the approach that works for how you're consuming the events.