I have a use case where contractors work for a specified period in my housing project. I want to map this to Kafka and thought of a topic like:
key : {"validFrom":"2019-09-01", "validTill":"2019-10-10", "name":"contractor1"}
The message is more complicated, for example costs that vary depending on which weekday "contractor1" is working for me.
Another service of mine will query the topic for "2019-10-02", and the message linked to the key whose validFrom - validTill range contains that date will be returned.
Is this a meaningful way to use Kafka, or am I thinking in the wrong direction? (The key will be unique.)
If by "point in time" you mean the time of message creation, then you can search by message timestamp - that search is very efficient because timestamp is indexed on the server side.
If you want to find a message based on the value of some message field, like "validFrom" - that will take some time for large topics - you'll have to scan every message in the topic. So, it would make sense to use combination of both methods.
Some UI tools allow you to do this type of search out-of-the-box, take a look at Kafka Magic https://www.kafkamagic.com - it allows writing complex queries using standard JavaScript in combination with timestamp/partition/offset filters.
If you are writing your own solution - standard Kafka client SDK for many languages has methods for locating messages by the timestamp - point your consumer to the start timestamp and read message after message until you find what you are looking for. That is a perfectly valid method.
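If you go that route with the Java client, a minimal sketch might look like this (topic name, partition, group id and the search timestamp are assumptions for illustration):

    import java.time.Duration;
    import java.time.Instant;
    import java.util.*;
    import org.apache.kafka.clients.consumer.*;
    import org.apache.kafka.common.TopicPartition;

    public class TimestampSearch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Assumed topic name; a single partition for brevity
                TopicPartition tp = new TopicPartition("contractors", 0);
                consumer.assign(Collections.singletonList(tp));

                // Find the earliest offset whose timestamp is >= the start of the search window
                long startTs = Instant.parse("2019-10-02T00:00:00Z").toEpochMilli();
                Map<TopicPartition, OffsetAndTimestamp> offsets =
                        consumer.offsetsForTimes(Collections.singletonMap(tp, startTs));

                OffsetAndTimestamp oat = offsets.get(tp);
                if (oat != null) {
                    consumer.seek(tp, oat.offset());
                    // Read message after message from here, filtering on the key/payload
                    // fields (validFrom/validTill) yourself; keep polling as needed
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                    records.forEach(r -> System.out.println(r.key() + " -> " + r.value()));
                }
            }
        }
    }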
I have a materialized view that is updated from many streams. Each one enriches it partially. Order doesn't matter, and updates come in at unspecified times. Is the following algorithm a good approach?
An update comes in and I check what is stored in the materialized view via get(); if nothing is there yet, this is the initial update, so I enrich and save it.
A second one comes in and get() shows that a partial update already exists - I add the next piece of information.
... and I continue in the same style.
If there is a query/join, the stored object has a method, isValid(), that indicates whether the update is complete and that could be used in KafkaStreams#filter().
Could you please tell me whether this is a good plan? Is there any pattern in the Kafka Streams world that handles this case?
Please advise.
Your plan looks good; you have the general idea, but you'll have to use the lower-level Kafka Streams API: the Processor API.
There is a .transform operator that allows you to access a KeyValueStore; inside this operation's implementation you are free to decide whether your current aggregated value is valid or not.
You can then either send it downstream or return null while waiting for more information.
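A rough sketch of that approach (the store name, the Update and Aggregate types, and their merge()/isValid() methods are placeholders, not a drop-in implementation):

    // Inside the topology-building code
    StreamsBuilder builder = new StreamsBuilder();

    // Register the state store the transformer will use; "aggregateSerde" is a
    // placeholder serde for the partially-built Aggregate type
    builder.addStateStore(Stores.keyValueStoreBuilder(
            Stores.persistentKeyValueStore("partial-updates"),
            Serdes.String(), aggregateSerde));

    KStream<String, Update> updates = builder.stream("updates-topic");

    KStream<String, Aggregate> completed = updates.transform(
            () -> new Transformer<String, Update, KeyValue<String, Aggregate>>() {
                private KeyValueStore<String, Aggregate> store;

                @Override
                public void init(ProcessorContext context) {
                    store = (KeyValueStore<String, Aggregate>) context.getStateStore("partial-updates");
                }

                @Override
                public KeyValue<String, Aggregate> transform(String key, Update update) {
                    Aggregate current = store.get(key);          // what we have so far, or null
                    Aggregate enriched = (current == null)
                            ? Aggregate.from(update)             // first partial update
                            : current.merge(update);             // add the next piece of information
                    store.put(key, enriched);

                    if (enriched.isValid()) {                    // complete: emit it and clean up
                        store.delete(key);
                        return KeyValue.pair(key, enriched);
                    }
                    return null;                                 // incomplete: wait for more updates
                }

                @Override
                public void close() { }
            },
            "partial-updates");

    completed.to("materialized-view-topic");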
I have a situation where an Apache Kafka topic contains numerous record types.
For example:
UserCreated
UserUpdated
UserDeleted
AnotherRecordType
...
I wish to implement CDC on the three listed User* record types such that at the end, I have an up-to-date KTable with all user information.
How can I do this in ksqlDB? Since, as far as I know, Debezium and other CDC connectors also source their data from a single topic, I at least know it should be possible.
I've been reading through the Confluent docs for a while now, but I can't seem to find anything quite pertinent to my use case (CDC using existing topic). If there is anything I've overlooked, I would greatly appreciate a link to the relevant documentation as well.
I assume that, at the very least, the records must have the same key for ksqlDB to be able to match them. So my questions boil down to:
How would I teach ksqlDB which record is an insert, which is an update, and which is a delete?
Is the key matching a hard requirement, or are there other join/match predicates that we can use?
One possibility that I can think of is basically how CDC already does it: treat each incoming record as a new entry so that I can have something like a slowly changing dimension in the KTable, grouping on the key and selecting entries with e.g. the latest timestamp.
So, is something like the following:
CREATE TABLE users AS
  SELECT user.user_id,
         latest_by_offset(user.name) AS name,
         latest_by_offset(user.email) AS email,
         CASE WHEN record.key = 'UserDeleted' THEN true ELSE false END AS deleted,
         user.timestamp,
         ...
  FROM users
  GROUP BY user.user_id
  EMIT CHANGES;
possible (using e.g. ROWKEY for record.key)? If not, how does e.g. Debezium do it?
The general pattern is to not have different schema types; just User. Then, the first record of any unique key (userid, for example) is an insert. Afterwards, any non-null value for the same key is an update (generally requiring all fields to be part of the value, effectively doing a "replace" operation in the table). Deletes are caused by sending a null value for a key (tombstone events).
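For illustration, a tombstone is simply a record with a key and a null value; a minimal sketch with the plain Java producer (topic name, key and serializers are assumptions):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.*;

    public class UserTombstone {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // A delete is just the key with a null value (a tombstone)
                producer.send(new ProducerRecord<>("users", "user-42", null));
            }
        }
    }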
If you have multiple schemas, it might be better to create a new stream that nulls out any of the delete events, unifies the creates and updates into a common schema with the information you want, and filters out the event types that you want to ignore.
how does e.g. Debezium do it?
For consuming data coming from Debezium topics, you can use a transform to "extract the new record state". It doesn't create any tables for you.
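That transform is Debezium's ExtractNewRecordState single message transform; a minimal connector-config sketch (exact property names and defaults can vary slightly between Debezium versions):

    # Unwrap the Debezium change-event envelope so only the row state remains in the value
    transforms=unwrap
    transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
    # Keep tombstones so a downstream compacted table still sees deletes
    transforms.unwrap.drop.tombstones=false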
I have a requirement where I need to read a CSV and publish to a Kafka topic in Avro format. During the publish, I need to set the message key as the combination of two attributes. Let's say I have an attribute called id and an attribute called group. I need my message key to be id+"-"+group. Is there a way I can achieve this in an Apache NiFi flow? Setting the message key to a single attribute works fine for me.
Yes, in PublishKafka_2_0 (or whichever version you're using), set the Kafka Key property to construct your message key using NiFi Expression Language. For your example, the expression ${id}-${group} will form it (e.g. id=myId & group=MyGroup -> myId-myGroup).
If you don't populate this property explicitly, the processor looks for the attribute kafka.key, so if you had previously set that value, it would be passed through.
Additional information after comment 2020-06-15 16:49
Ah, so the PublishKafkaRecord will publish multiple messages to Kafka, each correlating with a record in the single NiFi flowfile. In this case, the property is asking for a field (a record term meaning some element of the record schema) to use to populate that message key. I would suggest using UpdateRecord before this processor to add a field called messageKey (or whatever you like) to each record using Expression Language, then reference this field in the publishing processor property.
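A sketch of that UpdateRecord configuration, assuming the record fields are named id and group and the new field is called messageKey (concat is a RecordPath function):

    Replacement Value Strategy : Record Path Value
    /messageKey                : concat(/id, '-', /group)

You would then point the message-key property of the record-based publish processor (Message Key Field in the 2.x processors) at messageKey.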
Also notice the (?) next to each processor property, which indicates what is or isn't allowed there.
When a property doesn't accept Expression Language, use an UpdateAttribute processor to set the combined value you need, then use that combined value downstream.
Thank you for your inputs. I had to change my initial design of producing with a key combination to actually partitioning the file based on a specific field using the PartitionRecord processor. I have a date field in my CSV file and there can be multiple records per date. I partition based on this date field and produce to the Kafka topics using the id field as the key per partition. The Kafka topic name is dynamic and is suffixed with the date value. Since I plan to use Kafka Streams to read data from these topics, this is a much better design than the initial one.
I have read all those discussions and mailing lists which propose how to standardize the inclusion of a timestamp in a pubsub item, but there is no answer on how it is done in practice today. Do I have to tweak my server to include the creation timestamp (because each server stores that information for some reason ;), or are there plugins or known source-code modifications for Openfire or ejabberd?
I have entries which are published in JSON format and not in Atom. By the way, where do the "published" and "updated" timestamps in XEP-0060 come from? Will the timestamp be added automatically if I configure the node to publish in Atom?
You would have to tweak your server, since the timestamp is not part of the spec.
The only way to have timestamps right now is to put them in at the application level, which means the publisher inserts them into the payload.
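For example, the publish request would carry the timestamp inside the publisher's own payload element; everything below the <item> element (namespace, field names, values) is just an illustrative placeholder:

    <iq type='set' to='pubsub.example.com' id='pub1'>
      <pubsub xmlns='http://jabber.org/protocol/pubsub'>
        <publish node='my-node'>
          <item id='item-1'>
            <entry xmlns='urn:example:myapp'>
              <payload>{"price": 42, "currency": "EUR"}</payload>
              <created>2019-10-02T12:00:00Z</created>
            </entry>
          </item>
        </publish>
      </pubsub>
    </iq>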
I have already gone through this:
How best to design a REST API with multiple filters?
This does help when you have, say, 3 or 4 filtering criteria and can accommodate them in the query string.
However let's take this example
You want to get call details for 20 telephone numbers between a certain start date and end date.
Now, I do agree that ideally one would make individual queries for each number and then collate all the data on the client side.
However, for certain live systems that would mean 20 rounds of queries on the switches or CDR databases. That is 20 request-response cycles, plus the client having to collate and order the results again based on time. At the database level it would have been a single simple query that returns ordered data, which could then be transformed into a REST XML response that the client can embed in their system.
If we use GET, the query string gets really confusing and has a length limit as well.
Any suggestions to get around this issue?
Of course we could send a POST request with an XML body containing all the numbers, but using POST for a read goes against REST principles for GET.
In case of GET, use OData queries. For example, when your start and end dates are represented as numbers (Unix time), the URI could look like:
GET http://operatorcalls.com/Calls/Details?$filter=Date le 1342699200 and Date gt 1342526400
What you seem to be missing is an important concept of REST: caching. This can be done, for example, in the browser for a single client, or as a shared cache between all the clients and the live production system (whatever it may be), thus reducing queries against the live production system - or, in your example, the actual switches.
You should really take some time to read Fielding's thesis and understand that REST is an architectural style.
I found a solution here: Handling multiple parameters in a URI (RESTfully) in Java,
but I'm not quite happy with it.
So in effect we will end up using /cdr?numbers=number1,number2,number3 ...
However, I'm not too pleased with it, as there is a limit on the query string in the URL and it doesn't really seem like an elegant solution. Has anyone found a solution to this in their own implementation?
Basically, not using POST for this kind of fetch request and also not using cumbersome and lengthy query strings.
We are using Jersey but are also open to using CXF or Spring REST.
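For what it's worth, a Jersey/JAX-RS sketch of the comma-separated variant could look like the following; the path, parameter names and the commented-out CdrService call are assumptions, not an established API:

    import java.util.Arrays;
    import java.util.List;
    import javax.ws.rs.*;
    import javax.ws.rs.core.MediaType;
    import javax.ws.rs.core.Response;

    @Path("/cdr")
    public class CdrResource {

        @GET
        @Produces(MediaType.APPLICATION_XML)
        public Response getCallDetails(@QueryParam("numbers") String numbers,
                                       @QueryParam("start") String start,
                                       @QueryParam("end") String end) {
            if (numbers == null || numbers.isEmpty()) {
                return Response.status(Response.Status.BAD_REQUEST)
                               .entity("<error>numbers is required</error>")
                               .build();
            }
            // Split the comma-separated list into individual telephone numbers
            List<String> numberList = Arrays.asList(numbers.split(","));

            // Single back-end query for all numbers in the date range (hypothetical service):
            // CdrReport report = cdrService.findCalls(numberList, start, end);
            // return Response.ok(report).build();

            return Response.ok("<cdrReport/>").build();  // placeholder body
        }
    }

A request would then look like GET /cdr?numbers=number1,number2,number3&start=1342526400&end=1342699200.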