JDBC sink connector insert/upsert based on max timestamp? - apache-kafka

I'm very new to Kafka Connect.
I am inserting records from multiple sources into one table.
In some cases, some records may arrive before others.
Since I cannot control which source will pull which record first, I want to add a check on the timestamp key of the record.
I have a key called "LastModified_timestamp" in my schema where I store the timestamp of the latest state of my record.
I want to add a check to my JDBC sink connector where I can upsert a record based on comparing the value of LastModified_timestamp.
I want to ignore records which have an older timestamp and only want to upsert/insert the latest one. I couldn't find any configuration to achieve this.
Is there any way by which I can achieve this?
Will writing a custom query help in this case?

The JDBC Sink connector does not support this kind of feature. You have two options to consider:
Single Message Transforms (SMTs) - these apply logic to records as they pass through Kafka Connect. SMTs are great for things like dropping columns, changing datatypes, etc., BUT they are not appropriate for more complex processing and logic, including logic which needs to span multiple records, as yours does here
Process the data in the source Kafka topic first, to apply the necessary logic. You can do this with Kafka Streams, KSQL, or several other stream processing frameworks (e.g. Spark, Flink, etc.). You'd need some kind of stateful logic that could work out whether a record is older than one already processed (a minimal sketch of this idea follows below).
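
To make the second option concrete, here is a minimal Kafka Streams sketch, not part of the original answer: it keeps, per key, whichever record carries the greatest LastModified_timestamp, so the JDBC sink only ever upserts the latest state. The topic names, broker address, String serdes, and the timestamp-extraction helper are all assumptions to be adapted to your actual schema and serialization.

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Produced;

    import java.util.Properties;

    public class LatestByTimestamp {

        // Hypothetical helper: pull LastModified_timestamp out of the record value.
        // Values are assumed to be JSON strings with a numeric timestamp field;
        // replace this naive extraction with your real serde/schema handling.
        static long lastModified(String value) {
            return Long.parseLong(
                value.replaceAll(".*\"LastModified_timestamp\":(\\d+).*", "$1"));
        }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "latest-by-timestamp"); // assumption
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumption
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();

            builder.<String, String>stream("source-topic")          // topic names are assumptions
                .groupByKey()
                // Keep, per key, whichever record has the greater LastModified_timestamp.
                .reduce((current, incoming) ->
                    lastModified(incoming) >= lastModified(current) ? incoming : current)
                .toStream()
                .to("latest-state-topic", Produced.with(Serdes.String(), Serdes.String()));

            new KafkaStreams(builder.build(), props).start();
        }
    }

The JDBC sink can then read the latest-state topic in upsert mode. When an older record arrives, the reduce keeps the existing value, so at worst the sink re-upserts the same latest state rather than overwriting it with stale data.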
Can you describe more about your upstream source for the data? It may be that there's a better way to orchestrate the data coming through to enforce the ordering.
A final idea would be to land all records in your target DB and then use logic in the database query that consumes it to select the most recent record (based on LastModified_timestamp) for a given key.
Disclaimer: I work for Confluent, the company behind the open-source KSQL project.

Related

How to create kafka event from database?

There is a legacy service that writes values to the database.
I need to convert the values to events and then send them to Kafka.
I'm planning to build a service that checks for new records on a fixed delay and sends them, also writing the IDs of the submitted records to a technical table, but maybe there is some other way, a best practice, or a pattern.
You may want to look into Debezium, which implements Change Data Capture (CDC) on relational and NoSQL data stores and streams the data into Kafka.
https://github.com/debezium/debezium
https://debezium.io/documentation
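
For context on what the downstream side looks like once Debezium is capturing the table, here is a hedged sketch of a plain Java consumer reading one of the change-event topics; the broker address, group id, and topic name (Debezium creates one topic per captured table, and this name is hypothetical) are assumptions.

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class ChangeEventConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "legacy-table-events");     // assumption
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Debezium publishes one topic per captured table; this name is hypothetical.
                consumer.subscribe(Collections.singletonList("dbserver1.public.orders"));
                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                        // Each value is a change event describing the row change, which the
                        // new eventing service can transform and forward as needed.
                        System.out.printf("key=%s value=%s%n", record.key(), record.value());
                    }
                }
            }
        }
    }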

Filter Repeated Messages In Kafka

PREFACE:
In our organization we're trying to use Kafka to solve a problem that involves capturing changes in an Oracle database and sending them through Kafka. It is in fact CDC; we are using a Kafka connector for that.
We capture the changes in Oracle using Oracle Flashback queries, which gives us the timestamp of the change and the operation involved (insert, delete, update).
Once a change is made to a table we observe, the Kafka connector publishes it to a topic, which we then read using Kafka Streams.
The problem is that sometimes identical rows appear in the Flashback query, either because of an update to the table that didn't actually change anything (this triggers a Flashback change too), or because the table has 100 columns and we watch only 20 of them, so we end up seeing repeated rows in the query because none of those 20 fields changed.
We use Flashback to get changed rows (including deleted ones). In the connector we use timestamp+incrementing mode (the timestamp is obtained from the versions_starttime field of the Flashback query).
Important: we can't touch the DB more than this; I mean, we can't create triggers instead of using the Flashback scheme that's already in place.
THE QUESTION
We're trying to filter records in Kafka: if some (key, value) pair is identical in content to one we've already seen, we want to discard it. Note that this is not exactly-once semantics - the record may be repeated with large timestamp differences.
If I use a KTable to check the last value of some record, how efficient will this be after a long period?
I mean, the internal state storage of the consumers is handled by RocksDB and a backing Kafka topic, and if I use a non-windowed KTable this internal space could end up being very large.
What is considered a good approach in this scenario, so as not to overload the Kafka consumers' internal state storage while still being able to know whether the current record was already processed some time ago?
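
One common way to keep that state bounded is to deduplicate with a windowed store instead of an unwindowed KTable, so entries older than a chosen retention period are dropped automatically. Below is a minimal sketch of that idea; the store and topic names, String serdes, and the 7-day retention are assumptions, and comparing full values is only suitable if the values are reasonably small.

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.Transformer;
    import org.apache.kafka.streams.processor.ProcessorContext;
    import org.apache.kafka.streams.state.Stores;
    import org.apache.kafka.streams.state.WindowStore;
    import org.apache.kafka.streams.state.WindowStoreIterator;

    import java.time.Duration;
    import java.time.Instant;

    public class DedupTopology {

        static final String STORE = "seen-values";            // hypothetical store name
        static final Duration RETENTION = Duration.ofDays(7); // how long a duplicate is remembered

        public static StreamsBuilder build() {
            StreamsBuilder builder = new StreamsBuilder();

            // A windowed store bounds the state: entries older than RETENTION are dropped,
            // so RocksDB and the changelog topic cannot grow without limit.
            builder.addStateStore(Stores.windowStoreBuilder(
                Stores.persistentWindowStore(STORE, RETENTION, RETENTION, false),
                Serdes.String(), Serdes.String()));

            KStream<String, String> input =
                builder.<String, String>stream("oracle-cdc-topic");    // assumed topic name

            input.transform(() -> new Transformer<String, String, KeyValue<String, String>>() {
                    private WindowStore<String, String> store;
                    private ProcessorContext context;

                    @Override
                    @SuppressWarnings("unchecked")
                    public void init(ProcessorContext context) {
                        this.context = context;
                        this.store = (WindowStore<String, String>) context.getStateStore(STORE);
                    }

                    @Override
                    public KeyValue<String, String> transform(String key, String value) {
                        Instant now = Instant.ofEpochMilli(context.timestamp());
                        // Has an identical value for this key been seen within the retention window?
                        try (WindowStoreIterator<String> it =
                                 store.fetch(key, now.minus(RETENTION), now)) {
                            while (it.hasNext()) {
                                if (it.next().value.equals(value)) {
                                    return null;               // duplicate: emit nothing
                                }
                            }
                        }
                        store.put(key, value, context.timestamp());
                        return KeyValue.pair(key, value);      // first occurrence: forward it
                    }

                    @Override
                    public void close() { }
                }, STORE)
                .to("deduped-cdc-topic");                      // assumed topic name

            return builder;
        }
    }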

Beam / Cloud Dataflow: How to Add Kafka (or PubSub) topics to Running Stream

(How) is it possible to dynamically add topics to or remove topics from a running pipeline as a source or sink (Kafka or PubSub)? Or to have a dynamic pattern as a sink, as is possible with BigQuery table names?
Some background: we have different topics, one per customer, to better facilitate downstream aggregations and also to clean up / add them on the fly. Kafka is used to be able to backfill calculations over periods longer than is possible with PubSub.
The options I have in mind right now are either extending KafkaIO to support this, or updating the pipeline each time a topic is added or removed (meaning there will be some lag in the stream while it's updated). Or maybe I have the wrong design pattern in my head and there are other solutions for this.
You are correct that right now the easiest solution is updating the pipeline.
However, a new API called Splittable DoFn (SDF) is currently in active development; it is already available in the Cloud Dataflow runner in streaming mode and in the Direct runner, and implementation is in progress in Flink and Apex runners.
It makes it possible to do things like "create a PCollection of Kafka topic names and read each of those topics", so you can have one pipeline stage produce names of topics to be read (e.g. the names themselves could arrive over Kafka or Pubsub every time a customer is added, or you could write an SDF to watch the result of a database query returning a list of customers and emit new ones), and another stage reading those topics.
See http://s.apache.org/splittable-do-fn for the design doc of the API, and http://s.apache.org/textio-sdf for an example proposed refactoring of TextIO using this API - you may want to try to modify KafkaIO yourself in a similar fashion.

Using kafka streams to create a table based on elasticsearch events

Is it possible to use Kafka Streams to create a pipeline that reads JSON from a Kafka topic, does some logic with it, and sends the results to another Kafka topic or something else?
For example, I populate my topic using logs from elasticsearch. That is pretty easy using a simple logstash pipeline.
Once I have my logs in the Kafka topic, I want to extract some pieces of information from each log and put them in a sort of "table" with N columns (is Kafka capable of this?) and then put the table somewhere else (another topic or a DB).
I didn't find any example that satisfies my criteria.
thanks
Yes, it's possible.
There is no concept of columns in Kafka or Kafka Streams. However, you typically just define a plain old Java object of your choice, with the fields that you want (fields being the equivalent of columns in this case). You produce the output in that format to an output topic (using an appropriately chosen serializer). Finally, if you want to store the result in a relational database, you map the fields into columns, typically using a Kafka Connect JDBC sink:
http://docs.confluent.io/current/connect/connect-jdbc/docs/sink_connector.html
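
As a rough sketch of that pattern (the topic names, the chosen fields, and the use of plain JSON strings are all assumptions), a Kafka Streams application can read the raw log JSON, keep only the fields you care about, and write the result to an output topic for the JDBC sink to map into columns:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.databind.node.ObjectNode;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Produced;

    import java.util.Properties;

    public class LogsToTable {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "logs-to-table");     // assumption
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            ObjectMapper mapper = new ObjectMapper();
            StreamsBuilder builder = new StreamsBuilder();

            builder.<String, String>stream("raw-logs")              // topic names are assumptions
                .mapValues(value -> {
                    try {
                        JsonNode log = mapper.readTree(value);
                        // Build the "row": each field here becomes a column downstream.
                        ObjectNode row = mapper.createObjectNode();
                        row.put("host", log.path("host").asText());     // field names are assumptions
                        row.put("level", log.path("level").asText());
                        row.put("message", log.path("message").asText());
                        return mapper.writeValueAsString(row);
                    } catch (Exception e) {
                        return null;                                    // skip unparseable logs
                    }
                })
                .filter((key, value) -> value != null)
                .to("log-table", Produced.with(Serdes.String(), Serdes.String()));

            new KafkaStreams(builder.build(), props).start();
        }
    }

In practice you would usually produce the output with Avro or JSON-with-schema serdes so the JDBC sink knows the column names and types; plain JSON strings are used above only to keep the sketch short.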

Query Kafka topic for specific record

Is there an elegant way to query a Kafka topic for a specific record? The REST API that I'm building gets an ID and needs to look up records associated with that ID in a Kafka topic. One approach is to check every record in the topic via a custom consumer and look for a match, but I'd like to avoid the overhead of reading a bunch of records. Does Kafka have a fast, built-in filtering capability?
The only fast way to search for a record in Kafka (to oversimplify) is by partition and offset. The new producer class can return, via futures, the partition and offset into which a message was written. You can use these two values to very quickly retrieve the message.
So if you make the ID out of the partition and offset then you can implement your fast query. Otherwise, not so much. This means that the ID for an object isn't part of your data model, but rather is generated by the Kafka-knowledgeable code.
Maybe that works for you, maybe it doesn't.
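A minimal sketch of that approach (the topic name, broker address, and String serializers are assumptions): the producer's send() future yields the partition and offset, which together form the ID, and a consumer can later be assigned to exactly that partition and seek to that offset to fetch the single record.

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class LookupByPartitionOffset {

        public static void main(String[] args) throws Exception {
            String topic = "records";                                              // assumption

            Properties producerProps = new Properties();
            producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
            producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            // The send() future resolves to the partition and offset the message landed in;
            // together they form the "ID" the answer describes.
            RecordMetadata meta;
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
                meta = producer.send(new ProducerRecord<>(topic, "some-key", "some-value")).get();
            }
            String id = meta.partition() + "-" + meta.offset();
            System.out.println("stored record id = " + id);

            // Later, the REST API resolves the ID back to the exact record with a targeted read.
            Properties consumerProps = new Properties();
            consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
            consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "record-lookup");           // assumption
            consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
                TopicPartition tp = new TopicPartition(topic, meta.partition());
                consumer.assign(Collections.singletonList(tp));
                consumer.seek(tp, meta.offset());
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println("found: " + record.value());
                    break;                                 // only the single sought record is needed
                }
            }
        }
    }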
This might be late for you, but it will help others who see this question: there is now KSQL, an open-source streaming SQL engine for Kafka.
https://github.com/confluentinc/ksql/