I have three Kafka topics: entity.created, entity.deleted, entity.attribute. They all have some entity ID as the message key and a Protobuf message value.
I'm trying to use ksqlDB to aggregate messages from the first two topics and produce messages to the third one:
-- TODO: this might be possible to do with just one table
CREATE TABLE apples_created AS
SELECT references->basket_id, count(1) AS count
FROM entity_created
WHERE type = 'APPLE'
GROUP BY references->basket_id
EMIT CHANGES;
CREATE TABLE apples_deleted AS
SELECT references->basket_id, count(1) AS count
FROM entity_deleted
WHERE type = 'APPLE'
GROUP BY references->basket_id
EMIT CHANGES;
CREATE TABLE apples_any AS
SELECT apples_created.basket_id AS basket_id, (apples_deleted.count IS NULL OR apples_created.count > apples_deleted.count) AS has_apples
FROM apples_created
LEFT JOIN apples_deleted ON apples_created.basket_id = apples_deleted.basket_id
EMIT CHANGES;
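(As an aside, I suspect the single-table variant from the TODO comment would look roughly like the following, replacing the three statements above by merging both event streams into one +1/-1 delta stream; this is untested and the stream/column names are just placeholders:)
CREATE STREAM apple_events AS
SELECT references->basket_id AS basket_id, 1 AS delta
FROM entity_created
WHERE type = 'APPLE';
INSERT INTO apple_events
SELECT references->basket_id AS basket_id, -1 AS delta
FROM entity_deleted
WHERE type = 'APPLE';
CREATE TABLE apples_any AS
SELECT basket_id, SUM(delta) > 0 AS has_apples
FROM apple_events
GROUP BY basket_id
EMIT CHANGES;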
However, I have found no way to get the output of apples_any to my entity.attribute topic. For example:
CREATE TABLE attr_testing WITH (kafka_topic = 'entity.attribute') AS
SELECT basket_id, 'BASKET' AS type, 'hasApples' AS name, CASE WHEN has_apples THEN 'true' ELSE '' END AS value
FROM apples_any
EMIT CHANGES;
fails, because it tries to register its own schema for the existing topic:
Could not register schema for topic: Schema being registered is incompatible with an earlier schema for subject "entity.attribute-value"; error code: 409
I will have multiple Kafka producers writing to entity.attribute and want ksqlDB to be just one of them. Is writing to an existing topic possible with ksqlDB? If not, I imagine I need a way to copy messages from apples_any (the underlying topic) to entity.attribute. What would be the preferred solution in the Kafka ecosystem?
I'm using ksqlDB 0.17.0.
Edit:
This is the value schema used by the entity.attribute topic:
message EntityAttribute {
  EntityType type = 2;
  string name = 3;
  string value = 4;
}

enum EntityType {
  APPLE = 0;
  BASKET = 1;
}
Edit 2:
The error in ksqlDB logs I get when I try to create that attr_testing table is:
WARN Found incompatible change: Difference{fullPath='#/EntityAttribute', type=MESSAGE_REMOVED} (io.confluent.kafka.schemaregistry.protobuf.ProtobufSchema)
If I understand correctly, ksqlDB is trying to create a completely new schema with a different message name. This leads me to believe that whatever schema I use, ksqlDB will ignore it. Also, I have no idea how I could tell ksqlDB to create an enum field (EntityType) instead of a string. That's why I feel like producing to an existing topic is not a ksqlDB use case...
Related
I want to change the PostgreSQL schema used on insert, depending on a value in the message, when I produce messages through Kafka Connect. How can I do that?
I tried using a separate topic for each schema type, but this is not what I want.
e.g. topic name: country_city
{ cityId: 1 }
will insert into schema: country_1
{ cityId: 2 }
will insert into schema: country_2
The schema/table that is used depends exclusively on the topic name, not values within each record.
You will need to use Kafka Connect Transforms to override the "outgoing topic name" to control where data is written in a database.
However, if you have numbered tables in your database, that is likely an anti-pattern and you should be using relations instead.
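For example, if the Confluent kafka-connect-transforms plugin is available, the ExtractTopic SMT can set the outgoing topic (and therefore the target table/schema) from a field of the record. The sketch below assumes you first add a field, here hypothetically named target_schema, that already contains the desired name such as country_1:
{
  ...
  "transforms": "routeBySchema",
  "transforms.routeBySchema.type": "io.confluent.connect.transforms.ExtractTopic$Value",
  "transforms.routeBySchema.field": "target_schema",
  ...
}
Deriving a name like country_<cityId> from the cityId value itself is beyond the stock transforms and would need a custom SMT.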
I have a requirement to produce data from multiple MongoDB tables and push it to the same Kafka topic using the mongo-kafka connector. I also have to ensure that data with the same table key column value always goes to the same partition, to preserve message ordering.
For example :
tables --> customer, address
table key columns --> CustomerID (for table customer), AddressID (for table address)
For CustomerID = 12345, it will always go to partition 1
For AddressID = 54321, it will always go to partition 2
For a single table, the second requirement is easy to achieve using chained transformations. However, for multiple tables going to one topic, I'm finding it difficult, since each of these tables has a different key column name.
Is there any way available to fulfil both requirements using the Kafka connector?
If you use ExtractField$Key transform and IntegerConverter, all matching IDs should go to the same partition.
If you have two columns and one table, or end up with keys like {"CustomerID": 12345}, then you have a composite/object key, meaning the whole key will be hashed when computing the partition, not the ID itself.
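As a rough sketch for that single-table case (connector and field names are assumptions, and the config assumes the source key struct actually contains a CustomerID field):
{
  ...
  "transforms": "extractId",
  "transforms.extractId.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
  "transforms.extractId.field": "CustomerID",
  "key.converter": "org.apache.kafka.connect.converters.IntegerConverter",
  ...
}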
You cannot set partition for specific fields within any record without setting producer.override.partitioner.class in Connector config. In other words, you need to implement a partitioner that will deserialize your data, parse the values, then compute and return the respective partition.
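A minimal sketch of such a partitioner, assuming JSON-serialized keys, Jackson on the classpath, and the CustomerID/AddressID field names from the question (the class name is hypothetical):

import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class TableAwarePartitioner implements Partitioner {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        try {
            // Parse the JSON key and pick whichever ID field is present.
            JsonNode keyNode = MAPPER.readTree(keyBytes);
            JsonNode id = keyNode.has("CustomerID") ? keyNode.get("CustomerID") : keyNode.get("AddressID");
            // Hash only the ID value so the same ID always lands on the same partition.
            return Utils.toPositive(Utils.murmur2(id.asText().getBytes(StandardCharsets.UTF_8))) % numPartitions;
        } catch (Exception e) {
            // Fall back to hashing the raw key bytes.
            return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        }
    }

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public void close() { }
}

You would then reference it via producer.override.partitioner.class in the connector config, assuming client overrides are permitted by the worker's connector.client.config.override.policy.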
Desired functionality: For a given key, key123, numerous services are running in parallel and reporting their results to a single location; once all results are gathered for key123, they are passed to a new downstream consumer.
Original idea: Using AWS DynamoDB to hold all results for a given entry. Every time a result is ready a micro-service does a PATCH operation to the database on key123. An output stream checks each UPDATE to see if the entry is complete, if so, it is forwarded downstream.
New idea: Use Kafka Streams and KSQL to reach the same goal. All services write their output to the results topic, the topic forms a changelog KStream, and we query it with KSQL for completed entries. Something like:
CREATE STREAM completed_results AS SELECT * FROM results_stream WHERE (all results are non-NULL);
The part I'm not sure how to do is the equivalent of the PATCH operation on the stream: how do I have the output show the accumulation of all messages for key123 instead of just the most recent one?
KSQL users, does this even make sense? Am I close to a solution that someone has done before?
If you can produce all your events to the same topic, with the key set, then you can collect all of the events for a specific key using an aggregation in ksqlDB such as:
CREATE STREAM source (
  KEY INT KEY,   -- example key to group by
  EVENT STRING   -- example event to collect
) WITH (
  kafka_topic='source',  -- or whatever your source topic is called.
  value_format='json'    -- or whatever value format you need.
);

CREATE TABLE agg AS
  SELECT
    key,
    COLLECT_LIST(event) AS events
  FROM source
  GROUP BY key;
This will create a changelog topic called AGG by default. As new events are received for a specific key on the source topic, ksqlDB will produce messages to the AGG topic, with the key set to key and the value containing the list of all the events seen for that key.
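For example, after two events arrive for key 123, the record on AGG might look something like this (illustrative only; the exact key serialization depends on your key format):
key:   123
value: {"EVENTS": ["event-1", "event-2"]}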
You can then import this changelog as a stream:
CREATE STREAM agg_stream (
  KEY INT KEY,
  EVENTS ARRAY<STRING>
) WITH (
  kafka_topic='AGG',
  value_format='json'
);
And you can then apply some criteria to filter the stream to only include your final results:
CREATE STREAM completed_results AS
  SELECT
    *
  FROM agg_stream
  WHERE ARRAY_LENGTH(EVENTS) = 5; -- example 'complete' criteria.
You may even want to use a user-defined function to define your complete criteria:
CREATE STREAM completed_results AS
  SELECT
    *
  FROM agg_stream
  WHERE IS_COMPLETE(EVENTS);
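A minimal sketch of such a UDF (the "at least five events" rule below is just a placeholder for your real criteria):

import io.confluent.ksql.function.udf.Udf;
import io.confluent.ksql.function.udf.UdfDescription;
import io.confluent.ksql.function.udf.UdfParameter;
import java.util.List;

@UdfDescription(name = "is_complete", description = "Returns true once all expected results have arrived.")
public class IsCompleteUdf {

    @Udf(description = "Checks whether the collected events form a complete result set.")
    public boolean isComplete(@UdfParameter("events") final List<String> events) {
        // Placeholder criteria: replace with whatever defines 'complete' for your data.
        return events != null && events.size() >= 5;
    }
}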
Is there a way to pass in or get access to the message key from the join section in a Kafka Stream DSL join?
I have something like this right now:
KStream<String, GenericRecord> completedEventsStream = inputStartKStream.join(
    inputEndKStream,
    (leftValue, rightValue) -> customLambda((Record) leftValue, (Record) rightValue),
    JoinWindows.of(windowDuration),
    Joined.with(stringSerde, genericAvroSerde, genericAvroSerde)
);
However, the leftValue and rightValue records passed to customLambda don't have access to the Kafka message key, because that's a separate string. The only content they have is the message itself, not the key.
Is there a way to get access to the key from inside the join lambda? One thing I could do is simply add the message key as part of the message itself, and access it as a regular field there, but I was wondering if the framework provides a way to access it directly?
Most of the time the key is also available in the value of the record; is this not the case for your app?
It looks like the ValueJoiner interface has an improvement filed as part of KIP-149, but that part hasn't been completed, unlike the other interfaces covered by that KIP: ValueTransformer and ValueMapper.
You could add a step before your join to extract the key and include it in the value of your message before doing the join using ValueMapperWithKey.
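For example, something along these lines (a rough sketch; it assumes your Avro value schema already has a spare string field, here called messageKey, to carry the key):

// Copy the message key into the value before the join so customLambda can read it.
KStream<String, GenericRecord> startWithKey = inputStartKStream.mapValues(
    (key, value) -> {
        value.put("messageKey", key); // mutating in place for brevity; building a copy of the record is safer
        return value;
    });

KStream<String, GenericRecord> completedEventsStream = startWithKey.join(
    inputEndKStream,
    (leftValue, rightValue) -> customLambda((Record) leftValue, (Record) rightValue),
    JoinWindows.of(windowDuration),
    Joined.with(stringSerde, genericAvroSerde, genericAvroSerde)
);

Inside customLambda you can then read the key back with leftValue.get("messageKey").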
I have a table containing a large number of records. There's a column defining the type of the record. I'd like to collect records with a specific value in that column. Kind of:
SELECT * FROM myVeryOwnTable WHERE type = 'VERY_IMPORTANT_TYPE'
What I've noticed is that I can't use a WHERE clause in a custom query when I choose incremental (+ timestamp) mode; otherwise I'd need to take care of the filtering on my own.
The background of what I'd like to achieve: I use Logstash to transfer some types of data from MySQL to ES. That's easily achievable there by using a query that can contain a WHERE clause. However, with Kafka I can transfer my data much more quickly (almost instantly) after new rows are inserted into the DB.
Thank you for any hints or advice.
Thanks to @wardziniak I was able to set it up.
query=select * from (select * from myVeryOwnTable p where type = 'VERY_IMPORTANT_TYPE') p
topic.prefix=test-mysql-jdbc-
incrementing.column.name=id
However, I was expecting a topic named test-mysql-jdbc-myVeryOwnTable, so I had registered my consumer on that. It turns out that when a custom query is used, the table name is not appended, so my topic was named exactly as the prefix defined above. I've just updated my property to topic.prefix=test-mysql-jdbc-myVeryOwnTable and it seems to be working just fine.
You can use a subquery in your JDBC Source Connector query property. In incremental mode the connector appends its own WHERE clause (on the incrementing/timestamp column) to the query you provide, so wrapping your own filter in a subquery keeps the two from clashing.
Sample JDBC Source Connector configuration:
{
  ...
  "query": "select * from (select * from myVeryOwnTable p where type = 'VERY_IMPORTANT_TYPE') p",
  "incrementing.column.name": "id",
  ...
}