What I want to do is write an Order service project without using the traditional RDBMS(eg. mysql).After having read the docs from Kafka and confluent for several days, I think Kafka stream and state store could help me to implement this without using RDBMS.
But how to store the data that has a long list result in state store?Especially when I need to query result like using limit and offset?
eg:
I have a table, the columns are:
| userId | orderId | ... |
One user may have many order rows to store, so I need a query method to search the key:userId result in the state store with start and limit.But I can't find a method in the Kafka stream State Store interface.
Do I have to do this using an RDBMS? What is the standard method for me to implement an application like this?
Related
I have a materialized view that is updated from many streams. Every one enrich it partially. Order doesn't matter. Updates comes in not specified time. Is following algorithm is a good approach:
Update comes and I check what is stored in materialized view via get(), that this is an initial one so enrich and save.
Second comes and get() shows that partial update exist - add next information
... and I continue with same style
If there is a query/join, object that is stored has a method that shows that the update is not complete isValid() that could be used in KafkaStreams#filter().
Could you share please is this a good plan? Is there any pattern in Kafka streams world that handle this case?
Please advice.
Your plan looks good , you have the general idea, but you'll have to use the lower Kafka Stream API : Processor API.
There is a .transform operator that allow you to access a KeyValueStatestore, inside this operation implementation you are free to decide if you current aggregated value is valid or not.
Therefore send it downstream or returning null waiting for more information.
I have the following situation where I have an Apache Kafka topic containing numerous record types.
For example:
UserCreated
UserUpdated
UserDeleted
AnotherRecordType
...
I wish to implement CDC on the three listed User* record types such that at the end, I have an up-to-date KTable with all user information.
How can I do this in ksqlDB? Since, as far as I know, Debezium and other CDC connectors also source their data from a single topic, I at least know it should be possible.
I've been reading through the Confluent docs for a while now, but I can't seem to find anything quite pertinent to my use case (CDC using existing topic). If there is anything I've overlooked, I would greatly appreciate a link to the relevant documentation as well.
I assume that, at the very least, the records must have the same key for ksqlDB to be able to match them. So my questions boil down to:
How would I teach ksqlDB which is an insert, an update and a delete?
Is the key matching a hard requirement, or are there other join/match predicates that we can use?
One possibility that I can think of is basically how CDC already does it: treat each incoming record as a new entry so that I can have something like a slowly changing dimension in the KTable, grouping on the key and selecting entries with e.g. the latest timestamp.
So, is something like the following:
CREATE TABLE users AS
SELECT user.user_id,
latest_by_offset(user.name) AS name,
latest_by_offset(user.email),
CASE WHEN record.key = UserDeleted THEN true ELSE FALSE END,
user.timestamp,
...
FROM users
GROUP BY user.user_id
EMIT CHANGES;
possible (using e.g. ROWKEY for record.key)? If not, how does e.g. Debezium do it?
The general pattern is to not have different schema types; just User. Then, the first record of any unique key (userid, for example) is an insert. Afterwards any non null values for the same key are updates (generally requiring all fields to be part of the value, effectively going a "replace" operation in the table). Deletes are caused by sending null values for the key (tombstone events).
If you have multiple schemas, it might be better to create a new stream that nulls out any of the delete events, unifies the creates and updates to a common schema that you want information for, and filter event types that you want to ignore.
how does e.g. Debezium do it?
For consuming data coming from Debezium topics, you can use a transform to "extract the new record state". It doesn't create any tables for you.
currently I am working with a pipeline node.js base with backpressure in order to download all data from several API endpoints, some of these endpoints have data related between them.
The idea is to make a refactor in order to build an application more maintainable than the current in order to improve the updates over the data, and the current approach is the following.
The first step (map) is to download all data from endpoints and push into several topics, some of this data is complex and it is necessary data from one endpoint in order to retrieve data from another endpoint.
And the second step (reduce) is to get that data from all topics and push into a SQL but only the data that we need.
The question are
Could be a good approach to this problem
Is better to use Kafka streams in order to use KSQL to make transforms and only use a microservice to publish into the database.
The architecture schema is the following, and real time is not necessary for this data:
Thanks
I've started working with KSQL and quite living the experience. I'm trying to work with Table and Stream join and the scenario is as below.
I have a sample data set like this:
"0117440512","0134217727","US","United States","VIRGINIA","Vienna","DoD Network Information Center"
"0134217728","0150994943","US","United States","MASSACHUSETTS","Woburn","Genuity"
in my kafka topic-1. Is a static data set loaded to Table and might get updated once in a month or so.
I have one more data set like:
{"state":"AD","id":"020","city":"Andorra","port":"02","region":"Canillo"}
{"state":"GD","id":"024","city":"Arab","port":"29","region":"Ordino"}
in kafka topic-2. Is a stream of data being loaded to streams.
Since Table cant be created without specifying the Key, my data don't have a unique column to do so. So while loading data from topic-1 to Table, what exactly should my key be? Remember my Table might get populated/updated once in a month or so with same data and new once too. With new data being loaded I can replace them with the key.
I tried to find if there's something like incremental value as we call PrimaryKey in SQL, but didn't find any.
Can someone help me in correcting my approach towards the implementation or a query to create a PrimaryKey if exists. Thanks
No, KSQL doesn't have the concept of a self-incrementing key. You have to define the key when you produce the data into the topic on which the KSQL Table is defined.
--- EDIT
If you want to set the key on a message as it's ingested through Kafka Connect you can use Single Message Transform (SMT).
"transforms":"createKey,extractInt",
"transforms.createKey.type":"org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields":"id",
"transforms.extractInt.type":"org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractInt.field":"id"
See here for more details.
I'm running into Scala,Apache Spark world and I'm trying to understand how to create a "pipeline" that will generate a DataFrame based on the events I receive.
For instance, the idea is that when I receive a specific log/event I have to insert/update a row in the DF.
Let's make a real example.
I would like to create a DataFrame that will represent the state of the users present in my database(postgres,mongo whatever).
When i say state, I mean the current state of the user(ACTIVE,INCOMPLETE,BLOCKED, etc). This states change based on the users activity, so then I will receive logs(JSON) with key "status": "ACTIVE" and so on.
So for example, I'm receiving logs from a Kafka topic.. at some point I receive a log which I'm interested because it defines useful information about the user(the status etc..)
I take this log, and I create a DF with this log in it.
Then I receive the 2nd log, but this one was performed by the same user, so the row needs to be updated(if the status changed of course!) so no new row but update the existing one. Third log, new user, new information so store as a new row in the existing DF.. and so on.
At the end of this process/pipeline, I should have a DF with the information of all the users present in my db and their "status" so then I can say "oh look at that, there are 43 users that are blocked and 13 that are active! Amazing!"
This is the idea.. the process must be in real time.
So far, I've tried this using files not connecting with a kafka topic.
For instance, I've red file as follow:
val DF = mysession.read.json("/FileStore/tables/bm2ube021498209258980/exampleLog_dp_api-fac53.json","/FileStore/tables/zed9y2s11498229410434/exampleLog_dp_api-fac53.json")
which generats a DF with 2 rows with everything inside.
+--------------------+-----------------+------+--------------------+-----+
| _id| _index|_score| _source|_type|
+--------------------+-----------------+------+--------------------+-----+
|AVzO9dqvoaL5S78GvkQU|dp_api-2017.06.22| 1|[2017-06-22T08:40...|DPAPI|
| AVzO9dq5S78GvkQU|dp_api-2017.06.22| 1|[null,null,[Wrapp...|DPAPI|
+--------------------+-----------------+------+--------------------+-----+
in _source there are all the nested things(the status I mentioned is here!).
Then I've selected some useful information like
DF.select("_id", "_source.request.user_ip","_source.request.aw", "_type").show(false)
+--------------------+------------+------------------------------------+-----+
|_id |user_ip |aw |_type|
+--------------------+------------+------------------------------------+-----+
|AVzO9dqvoaL5S78GvkQU|111.11.11.12|285d5034-dfd6-44ad-9fb7-ba06a516cdbf|DPAPI|
|AVzO9dq5S78GvkQU |111.11.11.82|null |DPAPI|
+--------------------+------------+------------------------------------+-----+
again, the idea is to create this DF with the logs arriving from a kafka topic and upsert the log in this DF.
Hope I explained well, I don't want a "code" solution I'd prefer hints or example on how to achieve this result.
Thank you.
As you are looking for resources I would suggest the following:
Have a look at the Spark Streaming Programming Guide (https://spark.apache.org/docs/latest/streaming-programming-guide.html) and the Spark Streaming + Kafka Integration Guide (https://spark.apache.org/docs/2.1.0/streaming-kafka-0-10-integration.html).
Use the Spark Streaming + Kafka Integration Guide for information on how to open a Stream with your Kafka content.
Then have a look at the possible transformations you can perform with Spark Streaming on it in chapter "Transformations on DStreams" in the Spark Streaming Programming Guide
Once you have transformed the stream in a way that you can perform a final operation on it have a look at "Output Operations on DStreams" in the Spark Streaming Programming Guide. I think especially .forEachRDD could be what you are looking for - as you can do an operation (like checking whether a certain key word is in your string and based on this do a database call) for each element of the stream.