Join on foreign key in Kafka stream - apache-kafka

Let's say that I have three Kafka topics filled with events representing business events occurring in different aggregates (an event-sourcing application). These events allow building aggregates with the following attributes:
users: userId, name
modules of an application: moduleId, name
grants of users for modules of an application: grantId, userId, moduleId, scope
Now I want to create a stream of all grants with the names of the users and modules (instead of their ids).
My idea was to:
create a KTable for users by grouping events by userId. The KTable has userId as key. That works.
create a KTable for modules by grouping events by moduleId. The KTable has moduleId as key. That works.
create a stream from the stream of grants, joining on the two KTables.
That does not work. The problem is that joins seem to be possible only on primary keys, but the key of the grant stream is a technical identifier of the grant, and the keys of the users and modules tables know nothing about grants.
So how should I proceed?

Well, there is no direct support for foreign-key joins in Kafka Streams at the moment.
There is an open issue for this: https://issues.apache.org/jira/browse/KAFKA-3705
For now, there is a workaround: use a KStream-KTable join.
First, aggregate the user stream and the module stream into their respective KTables, holding a collection of events (or just the latest event) per key.
KTable<String, Object> userTable = userStream.groupBy(/* extract userId */).aggregate(/* build collection / keep latest event */);
KTable<String, Object> moduleTable = moduleStream.groupBy(/* extract moduleId */).aggregate(/* build collection / keep latest event */);
Now select moduleId as the key of the grants stream.
KStream<String, Object> grantRekeyedStream = grantStream.selectKey(/* extract moduleId */);
This changes the key to moduleId. Now you can perform a stream-table join with moduleTable: for each record on the (left) stream side, the matching record on the (right) table side is joined in. The resulting stream carries the grant and module data in one value, keyed by moduleId.
KStream<String, Object> grantModuleStream = grantRekeyedStream.join(moduleTable, /* combine grant and module */);
The next step is to join with userTable, so you need to rekey grantModuleStream again, this time by userId.
KStream<String, Object> grantModuleRekeyedStream = grantModuleStream.selectKey(/* extract userId */);
Now grantModuleRekeyedStream can be joined with userTable with a KStream-KTable join.
KStream<String, Object> grantModuleUserStream = grantModuleRekeyedStream.join(userTable, /* combine with user */);
The resulting stream is keyed by userId and contains the grant and module details for that user.
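Putting the steps together, a minimal sketch of the whole workaround might look like this (the Grant, Module and User POJOs, the GrantWithModule and EnrichedGrant value classes, the topic names and the serde configuration are assumptions for illustration; userTable and moduleTable are the aggregated KTables from the first step, here assumed to hold Module and User values):

```java
StreamsBuilder builder = new StreamsBuilder();

// Grant events, keyed by their technical grantId.
KStream<String, Grant> grantStream = builder.stream("grants");

// 1) Rekey by moduleId and join with the module table to pick up the module name.
KStream<String, GrantWithModule> grantModuleStream = grantStream
        .selectKey((grantId, grant) -> grant.getModuleId())
        .join(moduleTable, (grant, module) -> new GrantWithModule(grant, module.getName()));

// 2) Rekey by userId and join with the user table to pick up the user name.
KStream<String, EnrichedGrant> enrichedGrants = grantModuleStream
        .selectKey((moduleId, withModule) -> withModule.getGrant().getUserId())
        .join(userTable, (withModule, user) -> new EnrichedGrant(withModule, user.getName()));

// Each selectKey marks the stream for repartitioning, so Kafka Streams
// redistributes the records on the new key before each join.
enrichedGrants.to("grants-enriched");
```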

This feature was released as part of Kafka Streams 2.4.0.
Here's an official tutorial on using this feature.
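For reference, the foreign-key join added in 2.4.0 removes the manual rekeying: a KTable can be joined to another KTable on a key extracted from its value. A rough sketch, reusing the hypothetical POJOs from the answer above and assuming the grants are also materialized as a KTable keyed by grantId:

```java
// grantTable:  KTable<grantId, Grant>
// moduleTable: KTable<moduleId, Module>
// userTable:   KTable<userId, User>
KTable<String, GrantWithModule> grantWithModule = grantTable.join(
        moduleTable,
        grant -> grant.getModuleId(),                     // foreign-key extractor
        (grant, module) -> new GrantWithModule(grant, module.getName()));

KTable<String, EnrichedGrant> enrichedGrants = grantWithModule.join(
        userTable,
        withModule -> withModule.getGrant().getUserId(),
        (withModule, user) -> new EnrichedGrant(withModule, user.getName()));

// The result stays keyed by grantId; Kafka Streams handles the
// redistribution needed for the foreign-key join internally.
```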

Related

How do I set my hasura permission to only see the rows of my table corresponding to a user?

Here's the thing. I have the 3 tables depicted here:
People using my application can place orders. I want:
a user with rex permission to see all the orders table's rows
a user with delivery permission to only see the rows of the orders table that have the zip column set to the delivery user's zip
From the orders table, I can get for each order a zip. With the table zip_user, I can get a user_id out of a zip. Out of that user_id, I can get the delivery user from the users table.
While it is trivial to get the rex to see all of the orders table, I have not yet been able to configure the permissions for the delivery user. What do I need to do?
In other words, given the user performing a select on the orders table has x-hasura-user-id set to some user id and x-hasura-role set to delivery, how does that user get only the rows from the orders table that match with the zips associated with that user's user_id?
Hasura has the concept of relationships. If you have foreign keys, it creates the relationships automatically; if not, you can create them yourself in the UI. Once the relationships have been set up, you will be able to set deep permissions, so on the orders table you'll be able to use users.id.
Start here: https://hasura.io/docs/1.0/graphql/manual/schema/relationships/index.html

KSQL consistency

I am doing a PoC using .NET and KSQL.
https://github.com/pablocastilla/kafkiano/
The overall idea is to see if I can implement business logic using KSQL. In the example I add devices to the stock and place orders for them. The example consists of this:
Two main streams:
The inventory stream receives events that add items to the inventory.
The orders stream receives orders for products.
With those streams I create two tables:
ProductStock: it just adds the products to the stock
Orders: counts the orders by product
After those two tables I create another table with the difference between the orders and the products in the inventory, just to know if there are products left.
With a join between that last table and the order stream, I can get the stock left at the moment an order is processed.
I am introducing the events using the product name as the key. So far it works well on my machine, but my questions are:
Is this consistent in a big production environment? I would like to know under what conditions consistency breaks when a lot of events are received in parallel.
How can I know which queries are executed before others? I need to compute the difference between inventory and orders before I join that difference with the order stream.
Thanks
KSQL:
-- inventory stream
CREATE STREAM InventoryEventsStream (ProductName VARCHAR, Quantity INT)
  WITH (kafka_topic='INVENTORYEVENTS', key='ProductName', value_format='json');
-- table grouping by product
CREATE TABLE ProductsStock AS
  SELECT ProductName, SUM(Quantity) AS Stock FROM InventoryEventsStream GROUP BY ProductName;
-- orders stream
CREATE STREAM OrdersCreatedStream (ProductName VARCHAR, Quantity INT, OrderId VARCHAR, User VARCHAR)
  WITH (kafka_topic='ORDERSEVENTS', key='ProductName', value_format='json');
-- table grouping by product
CREATE TABLE ProductsOrdered AS
  SELECT ProductName AS ProductName, SUM(Quantity) AS Orders FROM OrdersCreatedStream GROUP BY ProductName;
-- join with the difference
CREATE TABLE StockByProductTable AS
  SELECT ps.ProductName AS ProductName, ps.Stock - op.Orders AS Stock
  FROM ProductsOrdered op JOIN ProductsStock ps ON op.ProductName = ps.ProductName;
-- logic: I want the stock left when I make an order
SELECT ocs.OrderId, ocs.User, sbpt.Stock
  FROM OrdersCreatedStream ocs JOIN StockByProductTable sbpt ON sbpt.ProductName = ocs.ProductName;
I copy and paste the GitHub answer I got from the Confluent team:
"I get your question. A minimal answer for your question is, it executes as soon as your message is available in the stream.
A good analogy would be a machinery which is always running.
Whenever a payload enters inside, it will just process it. Now it comes to you on the chaining part. Are you inserting some payload to a new record stream after processing? Then yes, you can call it 'chaining'. Once you run / execute CTAS/ CSAS statements you see something like 'Table/Stream created and Running' , that is exactly what it means.
You have ignited an always running query!"

How to handle many to many in DynamoDB

I am new to NoSQL and DynamoDB; I come from an RDBMS background.
My tables are being moved from MySQL to DynamoDB. I have these tables:
customer (columns: cid [PK], name, contact)
Hardware (columns: hid [PK], name, type)
Rent (columns: rid [PK], cid, hid, time) => this is the association between a customer and a hardware item.
One customer can have many hardware items, and one hardware item can be shared among many customers.
Requirements: it should be possible to retrieve separate lists of customers and hardware items.
Rent details: which customer borrowed which hardware item.
I referred to this - secondary index table. It is about keeping all columns in one table.
I thought of having 2 DynamoDB tables:
Customer - this has all the attributes as columns AND a set of hardware item hash keys. (My issue then is that when the Customer table is queried to retrieve only customers, all the hardware keys are also loaded.)
Any guidance on the table structure, please? How do I save, load, and even update?
Any Java samples, please? (I couldn't find any useful resource similar to my scenario.)
Have a look at DynamoDB's adjacency list design pattern:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-adjacency-graphs.html
In your case, based on the adjacency list design pattern, your schema can be designed as follows.
The prefixes of the partition key and the sort key indicate the type of record.
If the record is a customer, both the partition key and the sort key have the prefix 'customer-'.
If the record represents a customer renting a hardware item, the partition key's prefix is 'customer-' and the sort key's prefix is 'hardware-'.
base table
+------------+------------+-------------+
|PK |SK |Attributes |
|------------|------------|-------------|
|customer-cid|customer-cid|name, contact|
|hardware-hid|hardware-hid|name, type |
|customer-cid|hardware-hid|time |
+------------+------------+-------------+
Global Secondary Index Table
+------------+------------+----------+
|GSI-1-PK |GSI-1-SK |Attributes|
|------------|------------|----------|
|hardware-hid|customer-cid|time |
+------------+------------+----------+
Customer and hardware records are stored in the same table. A customer can refer to its hardware by using
SELECT * FROM base_table WHERE PK=customer-123 AND SK.startsWith('hardware-')
If you want to go from a hardware item back to its customers, use the GSI table:
SELECT * FROM GSI_table WHERE PK=hardware-333 AND SK.startsWith('customer-')
Note: the SQL above is just pseudocode, to give you the idea.
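Since the question asks for Java samples, here is roughly what the first pseudo-query looks like with the AWS SDK for Java v2 (the table name RentalData, the GSI name GSI-1 and the concrete ids are placeholders; the single-table layout above is assumed):

```java
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;

public class RentalQueries {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();

        // "All hardware rented by customer 123": PK = 'customer-123', SK begins_with 'hardware-'
        QueryRequest byCustomer = QueryRequest.builder()
                .tableName("RentalData")
                .keyConditionExpression("PK = :pk AND begins_with(SK, :sk)")
                .expressionAttributeValues(Map.of(
                        ":pk", AttributeValue.builder().s("customer-123").build(),
                        ":sk", AttributeValue.builder().s("hardware-").build()))
                .build();
        ddb.query(byCustomer).items().forEach(System.out::println);

        // The reverse direction ("all customers for hardware 333") is the same query
        // against the GSI: add .indexName("GSI-1") and swap the prefixes.
    }
}
```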
Take a look at this answer, as it covers many of the basics which are relevant to you.
DynamoDB does not support foreign keys as such. Each table is independent and there are no special tools for keeping two tables synchronised.
You would probably have an attribute in your customers table called hardwares. The attribute would be a list of hardware ids the customer has. If you wanted to see all hardware items belonging to a customer you would:
Perform GetItem on the customer id. Or use Query depending on how you are looking the customer up.
For each hardware id in the customer's hardware attribute, perform a GetItem on the Hardware table.
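A quick sketch of those two steps with the AWS SDK for Java v2 (the table names follow the question; the hardwares attribute is the assumption described above, stored here as a DynamoDB list of string ids):

```java
DynamoDbClient ddb = DynamoDbClient.create();

// 1) Load the customer item by its id.
GetItemRequest customerReq = GetItemRequest.builder()
        .tableName("Customer")
        .key(Map.of("cid", AttributeValue.builder().s("cust-1").build()))
        .build();
Map<String, AttributeValue> customer = ddb.getItem(customerReq).item();

// 2) Fetch every hardware item referenced in the customer's "hardwares" list.
for (AttributeValue hid : customer.get("hardwares").l()) {
    GetItemRequest hardwareReq = GetItemRequest.builder()
            .tableName("Hardware")
            .key(Map.of("hid", hid))
            .build();
    System.out.println(ddb.getItem(hardwareReq).item());
}
```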
With DynamoDB you generally end up doing more in the client application relative to an RDBMS solution. The benefits are that it's fast and simple. But you will probably find that you move a lot of your work from the database server to your application.

Filter and display database audit / changelog (activity stream)

I'm developing an application with SQLAlchemy and PostgreSQL. Users of the system modify data in 8 or so tables. Consider this contrived example schema:
I want to add visible logging to the system to record what has changed, but not necessarily how it has changed. For example: "User A modified product Foo", "User A added user B" or "User C purchased product Bar". So basically I want to store:
Who made the change
A message describing the change
Enough information to reference the object that changed, e.g. the product_id and customer_id when an order is placed, so the user can click through to that entity
I want to show each user a list of recent and relevant changes when they log in to the application (a bit like the main timeline in Facebook etc). And I want to store subscriptions, so that users can subscribe to changes, e.g. "tell me when product X is modified", or "tell me when any products in store S are modified".
I have seen the audit trigger recipe, but I'm not sure it's what I want. That audit trigger might do a good job of recording changes, but how can I quickly filter it to show recent, relevant changes to the user? Options that I'm considering:
Have one column per ID type in the log and subscription tables, with an index on each column
Use full text search, combining the ID types as a tsvector
Use an hstore or json column for the IDs, and index the contents somehow
Store references as URIs (strings) without an index, and walk over the logs in reverse date order, using application logic to filter by URI
Any insights appreciated :)
Edit: It seems that what I'm talking about is an activity stream. The suggestion in this answer to filter by time first sounds pretty good.
Since the objects all use uuid for the id field, I think I'll create the activity table like this:
Have a generic reference to the target object, with a uuid column with no foreign key, and an enum column specifying the type of object it refers to.
Have an array column that stores generic uuids (maybe as text[]) of the target object and its parents (e.g. parent categories, store and organisation), and search the array for matching subscriptions. That way a subscription for a parent category can match a child in one step (denormalised).
Put a btree index on the date column, and (maybe) a GIN index on the array UUID column.
I'll probably filter by time first to reduce the amount of searching required. Later, if needed, I'll look at using GIN to index the array column (this partially answers my question "Is there a trick for indexing an hstore in a flexible way?")
Update: this is working well. The SQL to fetch a timeline looks something like this:
SELECT *
FROM (
    SELECT DISTINCT ON (activity.created, activity.id)
        *
    FROM activity
    LEFT OUTER JOIN unnest(activity.object_ref) WITH ORDINALITY AS act_ref
        ON true
    LEFT OUTER JOIN subscription
        ON subscription.object_id = act_ref.act_ref
    WHERE activity.created BETWEEN :lower_date AND :upper_date
        AND subscription.user_id = :user_id
    ORDER BY activity.created DESC,
        activity.id,
        act_ref.ordinality DESC
) AS sub
WHERE sub.subscribed = true;
Joining with unnest(...) WITH ORDINALITY, ordering by ordinality, and selecting distinct on the activity ID filters out activities that have been unsubscribed from at a deeper level. If you don't need to do that, you can avoid the unnest and just use the array containment @> operator, with no subquery:
SELECT *
FROM activity
JOIN subscription ON activity.object_ref @> ARRAY[subscription.object_id]
WHERE subscription.user_id = :user_id
    AND activity.created BETWEEN :lower_date AND :upper_date
ORDER BY activity.created DESC;
You could also join with the other object tables to get the object titles - but instead, I decided to add a title column to the activity table. This is denormalised, but it doesn't require a complex join with many tables, and it tolerates objects being deleted (which might be the action that triggered the activity logging).

Efficient way to model azure table storage for social networking

I have tables like this in SQL Server
Users
UserId (Unique)
Name
Age
Friends
UserId
FriendId
Topics
UserId
Subject
There can be several thousand users, and there are several other properties in the tables.
I can query to get answers to the following:
Give me all the friends of user "Tom".
Give me all the topics created by "Tom".
Give me all the topics created by Tom's friends that contains "abc" in the subject.
If I were to do it in Azure table storage, how do I structure my tables?
I have gone through this and this. I would like someone with more experience modeling Azure Table storage to give some insights.
1 and 2 are pretty easy. You create two Azure tables - Friends and Topics indexed by user id (with user id in the key).
The 3rd one is much more difficult with Azure tables, especially the "contains 'abc' in the subject" part.
Azure tables don't support full-text search. Basically, it is only possible to efficiently retrieve values (or ranges of values) either by exact key or by a 'starts with' style query on the key: "give me all records where the key is equal to 'key value'", or "give me all records where the key is greater than 'key lower bound' and less than 'key upper bound'".
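For 1 and 2, a key-based lookup with the classic Azure Storage SDK for Java could look roughly like this (the Friends table layout, the FriendEntity class, and the connection string are assumptions; FriendEntity would extend TableServiceEntity):

```java
import com.microsoft.azure.storage.CloudStorageAccount;
import com.microsoft.azure.storage.table.CloudTable;
import com.microsoft.azure.storage.table.TableQuery;
import com.microsoft.azure.storage.table.TableQuery.QueryComparisons;

CloudStorageAccount account = CloudStorageAccount.parse(connectionString);
CloudTable friends = account.createCloudTableClient().getTableReference("Friends");

// "Give me all the friends of user Tom": PartitionKey = Tom's user id,
// one entity per friendship with the friend's id in the RowKey.
TableQuery<FriendEntity> byUser = TableQuery.from(FriendEntity.class)
        .where(TableQuery.generateFilterCondition(
                "PartitionKey", QueryComparisons.EQUAL, "tom-user-id"));

for (FriendEntity friend : friends.execute(byUser)) {
    System.out.println(friend.getRowKey());
}
```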
It is also possible to filter with 'starts with' on any non-key field of a record, but that involves a table scan and is not efficient. It's not possible to do similar filtering with 'contains'.
So I think you need something with full text search support here.