I am doing a PoC using .NET and KSQL.
https://github.com/pablocastilla/kafkiano/
The overall idea is to see if I can implement business logic using KSQL. In the example I add devices to the stock and place orders against it. The example consists of the following:
Two main streams:
The inventory stream receives events that add products to the inventory.
The orders stream receives orders for products.
With those streams I create two tables:
ProductsStock: accumulates the quantity added to the stock per product
ProductsOrdered: accumulates the quantity ordered per product
From those two tables I create another table with the difference between the stock and the orders, just to know how many products are left.
By joining that last table with the orders stream I can get the stock left at the moment each order is processed.
I am producing the events using the product name as the key. So far it works well on my machine, but my questions are:
Is this consistent in a large production environment? I would like to know the restrictions, i.e. under what circumstances consistency breaks when many events are received in parallel.
How can I know which queries are executed before others? I need to compute the difference between inventory and orders before I join that difference with the orders stream.
Thanks
KSQL:
-- INVENTORY STREAM
CREATE STREAM InventoryEventsStream (ProductName VARCHAR, Quantity INT) WITH (kafka_topic='INVENTORYEVENTS', key='ProductName', value_format='json');
-- TABLE GROUPING BY PRODUCT
CREATE TABLE ProductsStock AS SELECT ProductName, SUM(Quantity) AS Stock FROM InventoryEventsStream GROUP BY ProductName;
-- ORDERS STREAM
CREATE STREAM OrdersCreatedStream (ProductName VARCHAR, Quantity INT, OrderId VARCHAR, User VARCHAR) WITH (kafka_topic='ORDERSEVENTS', key='ProductName', value_format='json');
-- TABLE GROUPING BY PRODUCT
CREATE TABLE ProductsOrdered AS SELECT ProductName AS ProductName, SUM(Quantity) AS Orders FROM OrdersCreatedStream GROUP BY ProductName;
-- join with the difference
CREATE TABLE StockByProductTable AS SELECT ps.ProductName AS ProductName, ps.Stock - op.Orders AS Stock FROM ProductsOrdered op JOIN ProductsStock ps ON op.ProductName = ps.ProductName;
-- logic: I want the stock left when I make an order
SELECT ocs.OrderId, ocs.User, sbpt.Stock FROM OrdersCreatedStream ocs JOIN StockByProductTable sbpt ON sbpt.ProductName = ocs.ProductName;
I copy and paste the GitHub answer I got from the Confluent team:
"I get your question. A minimal answer for your question is, it executes as soon as your message is available in the stream.
A good analogy would be a machinery which is always running.
Whenever a payload enters inside, it will just process it. Now it comes to you on the chaining part. Are you inserting some payload to a new record stream after processing? Then yes, you can call it 'chaining'. Once you run / execute CTAS/ CSAS statements you see something like 'Table/Stream created and Running' , that is exactly what it means.
You have ignited an always running query!"
Related
I am not able to fetch data across multiple tables that have relationships, so can anyone help, as per the image below?
Finally I need data like
ORDER_ID, USER_FULL_NAME, PRODUCT_NAME, PRODUCT_PRICE from 3 different tables.
Please help me out.
Sembast does not provide a way to query multiple stores in one join query. However, getting an entity by id (here a user or a product) is almost immediate (store.record(id).get(db) or store.records(ids).get(db)), so I think your best bet is to query the order_items store and fetch the users and products by their ids.
Basically, using 3 requests in a transaction to ensure data integrity should perform the join you want.
Let's say that I have three Kafka topics filled with events representing business events occurring in different aggregates (an event-sourcing application). These events allow building aggregates with the following attributes:
users: userId, name
modules of an application: moduleId, name
grants of users for modules of the application: grantId, userId, moduleId, scope
Now I want to create a stream of all grants with the names of users and modules (instead of their ids).
I thought to do it like this:
create a KTable for users by grouping events by userId. The KTable has userId as its key. That is OK.
create a KTable for modules by grouping events by moduleId. The KTable has moduleId as its key. That is OK.
create a stream from the stream of grants, joining it with the two KTables.
That is not OK. The problem is that joins seem to be possible only on primary keys, but the key of the stream is a technical identifier of the grant, and the keys of the users and modules tables are not (they are agnostic of the grant).
So how should I proceed?
Well, there is no direct support for foreign-key joins at the moment in Kafka Streams.
There is an open KIP for this, tracked at https://issues.apache.org/jira/browse/KAFKA-3705.
For now, there is a workaround to solve this problem: you can use a KStream-KTable join.
First, aggregate the user stream and the module stream into their respective KTables, each holding an aggregated collection of events (or the latest event):
KTable<String,Object> userTable = userStream.groupBy(<UserId>).aggregate(<... build collection/latest event>);
KTable<String,Object> moduleTable = moduleStream.groupBy(<ModuleId>).aggregate(<... build collection/latest event>);
Now select moduleId as the key of the grants stream:
KStream<String,Object> grantRekeyedStream = grantStream.selectKey(<moduleId>);
This changes the key to moduleId. Now you can perform a stream-table join with moduleTable. It joins every record on the left side with the matching record on the right side for the same key. The result stream has the grant and module data combined, keyed by moduleId:
KStream<String,Object> grantModuleStream = grantRekeyedStream.join(moduleTable, <valueJoiner>);
The next step is to join with userTable. Hence you need to rekey grantModuleStream again, this time by userId:
KStream<String,Object> grantModuleRekeyedStream = grantModuleStream.selectKey(<userId>);
Now grantModuleRekeyedStream can be joined with userTable using another KStream-KTable join:
KStream<String,Object> grantModuleUserStream = grantModuleRekeyedStream.join(userTable, <valueJoiner>);
The resulting stream has userId as its key and contains all grant and module details for that user.
This feature was released as part of Kafka Streams 2.4.0.
Here's an official tutorial on using this feature.
I'm developing an application with SQLAlchemy and PostgreSQL. Users of the system modify data in 8 or so tables. Consider this contrived example schema:
I want to add visible logging to the system to record what has changed, but not necessarily how it has changed. For example: "User A modified product Foo", "User A added user B" or "User C purchased product Bar". So basically I want to store:
Who made the change
A message describing the change
Enough information to reference the object that changed, e.g. the product_id and customer_id when an order is placed, so the user can click through to that entity
I want to show each user a list of recent and relevant changes when they log in to the application (a bit like the main timeline in Facebook etc). And I want to store subscriptions, so that users can subscribe to changes, e.g. "tell me when product X is modified", or "tell me when any products in store S are modified".
I have seen the audit trigger recipe, but I'm not sure it's what I want. That audit trigger might do a good job of recording changes, but how can I quickly filter it to show recent, relevant changes to the user? Options that I'm considering:
Have one column per ID type in the log and subscription tables, with an index on each column
Use full text search, combining the ID types as a tsvector
Use an hstore or json column for the IDs, and index the contents somehow
Store references as URIs (strings) without an index, and walk over the logs in reverse date order, using application logic to filter by URI
Any insights appreciated :)
Edit: It seems what I'm talking about is an activity stream. The suggestion in this answer to filter by time first sounds pretty good.
Since the objects all use uuid for the id field, I think I'll create the activity table like this:
Have a generic reference to the target object, with a uuid column with no foreign key, and an enum column specifying the type of object it refers to.
Have an array column that stores generic uuids (maybe as text[]) of the target object and its parents (e.g. parent categories, store and organisation), and search the array for matching subscriptions. That way a subscription for a parent category can match a child in one step (denormalised).
Put a btree index on the date column, and (maybe) a GIN index on the array UUID column.
I'll probably filter by time first to reduce the amount of searching required. Later, if needed, I'll look at using GIN to index the array column (this partially answers my question "Is there a trick for indexing an hstore in a flexible way?")
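For reference, a minimal sketch of what those two tables might look like (column names and types are assumptions inferred from the queries below, not the exact schema):
CREATE TABLE activity (
    id          uuid PRIMARY KEY,
    created     timestamptz NOT NULL DEFAULT now(),
    actor_id    uuid NOT NULL,     -- who made the change
    object_type text NOT NULL,     -- discriminator for the target object (could be a proper enum type)
    object_ref  uuid[] NOT NULL,   -- target object uuid plus its parents (category, store, organisation)
    message     text NOT NULL      -- human-readable description of the change
);
CREATE TABLE subscription (
    user_id    uuid NOT NULL,
    object_id  uuid NOT NULL,      -- uuid of the object (or parent) being watched
    subscribed boolean NOT NULL DEFAULT true
);
CREATE INDEX activity_created_idx ON activity USING btree (created);
CREATE INDEX activity_object_ref_idx ON activity USING gin (object_ref);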
Update: this is working well. The SQL to fetch a timeline looks something like this:
SELECT *
FROM (
    SELECT DISTINCT ON (activity.created, activity.id) *
    FROM activity
    LEFT OUTER JOIN unnest(activity.object_ref) WITH ORDINALITY AS act_ref
        ON true
    LEFT OUTER JOIN subscription
        ON subscription.object_id = act_ref.act_ref
    WHERE activity.created BETWEEN :lower_date AND :upper_date
      AND subscription.user_id = :user_id
    ORDER BY activity.created DESC,
             activity.id,
             act_ref.ordinality DESC
) AS sub
WHERE sub.subscribed = true;
Joining with unnest(...) WITH ORDINALITY, ordering by ordinality, and selecting distinct on the activity ID filters out activities that have been unsubscribed from at a deeper level. If you don't need to do that, then you can avoid the unnest and just use the array containment operator @>, with no subquery:
SELECT *
FROM activity
JOIN subscription ON activity.object_ref @> ARRAY[subscription.object_id]
WHERE subscription.user_id = :user_id
  AND activity.created BETWEEN :lower_date AND :upper_date
ORDER BY activity.created DESC;
You could also join with the other object tables to get the object titles - but instead, I decided to add a title column to the activity table. This is denormalised, but it doesn't require a complex join with many tables, and it tolerates objects being deleted (which might be the action that triggered the activity logging).
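In terms of the sketch above, that is just one more column, e.g. ALTER TABLE activity ADD COLUMN title text; populated when the activity row is written.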
My question is a variation on one already asked and answered (TSQL Delete Using Inner Joins) but I have a different level of complexity and I couldn't see a solution to it.
My requirement is to delete Special Prices which haven't been accessed in 90 days. Special Prices are keyed on Customer ID and Product ID, and the products have to be matched to a Customer Order Detail table which also contains a Customer ID and a Product ID. I want to write one function that will look at the Special Price table for each Customer, compare each Product for that Customer with the Customer Order Detail table, and if the Maximum Order Date is more than 90 days earlier than today, delete it from the Special Price table.
I know I can use a CURSOR (slow but effective) but would prefer to have a single query like the one in the TSQL Delete Using Inner Joins example. Any ideas and/or is more information required?
I cannot dig deeper into the situation of your system, but if it is OK for you, check the MERGE statement; it might help instead of using cursors. Check this link: MERGE STATEMENT
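For the single-statement DELETE the question asks for, a minimal sketch might look like this (SpecialPrice, CustomerOrderDetail and the column names are assumptions based on the description above, so adjust them to your schema):
-- Delete special prices whose most recent matching order is more than 90 days old.
-- Rows with no matching orders at all are left alone (the subquery returns NULL).
DELETE sp
FROM SpecialPrice AS sp
WHERE (
    SELECT MAX(cod.OrderDate)
    FROM CustomerOrderDetail AS cod
    WHERE cod.CustomerID = sp.CustomerID
      AND cod.ProductID = sp.ProductID
) < DATEADD(DAY, -90, GETDATE());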
I'm writing a stored procedure in SQL Server and hoping someone can suggest a more computationally efficient way to handle this problem:
I have a table of Customer Orders (i.e., "product demand") data that contains 3000 line items. Each record expresses the Order Qty for a specific product.
I also have another table of Production Orders (i.e., "product supply") data that contains about 200 line items. Each record expresses the Qty Available for each specific product.
The problem is that there is typically less supply than demand and, therefore, the Customer Order table contains an Allocation Priority value that shows each Customer Order's position in line to receive product.
What's the best way to allocate Qty Available in Production Orders to the Order Qty in Customer Orders? Note that you can't allocate more to each Customer Order than has been ordered.
I can do this by creating a WHILE loop and doing the allocation product-by-product, line-by-line but it is very slow.
Is there a faster set-based way to approach this problem?
I don't have data to test against. This would not try to fill partial quantities.
-- For each order line, compute the cumulative demand (cumu) for its product across
-- all orders of equal or higher priority, and keep only the lines whose cumulative
-- demand still fits within the available quantity.
select orders.custID, orders.priority, orders.prodID, orders.qty, SUM(cumu.qty) as cumu
from orders
join orders as cumu
  on cumu.prodID = orders.prodID
  and cumu.priority <= orders.priority
join available
  on available.prodID = orders.prodID
group by orders.custID, orders.priority, orders.prodID, orders.qty, available.qty
having SUM(cumu.qty) < available.qty
order by orders.custID, orders.priority, orders.prodID
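If you do need partial fills, the same cumulative-sum idea can be expressed with a window function (SQL Server 2012 or later) to compute an allocated quantity per line in one pass. This is only a sketch, assuming the same orders (custID, priority, prodID, qty) and available (prodID, qty) tables as above:
with demand as (
    select o.custID, o.priority, o.prodID, o.qty,
           sum(o.qty) over (partition by o.prodID
                            order by o.priority
                            rows unbounded preceding) as running_qty
    from orders as o
)
select d.custID, d.priority, d.prodID, d.qty,
       case
           when d.running_qty <= a.qty then d.qty                                  -- order fits entirely
           when d.running_qty - d.qty < a.qty then a.qty - (d.running_qty - d.qty) -- partial fill of the first order that does not fit
           else 0                                                                  -- supply already exhausted
       end as allocatedQty
from demand as d
join available as a
  on a.prodID = d.prodID
order by d.prodID, d.priority;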