CDC and object transformation - Debezium

I have been reading about CDC to see how I can use it in my applications. I have four order-booking systems that store orders and order items in their internal databases. The tables and columns representing orders and order items in these systems are very different from one another.
I want to build a unified data store of all the orders from the order-booking systems so that I can make them available for querying by various downstream systems and also store them for analytics and ML requirements. All users who need order information should only interact with this unified order data store. All the orders from the order-booking systems will converge to a common order data model in the data store.
Let's say an order item was modified: an entry is created in the binlog, and a tool like Debezium is able to read the binlog for this change. So it only has the row for the changed order item. But I need to build a complete order object, since this represents the new state of the order after the order item update. Where would one build this complete order? I would then like to publish it as an order event to a Kafka topic and have it persisted to a time-series DB or a temporal RDBMS.
Can you guide me on where the complete order object should be built? Does a tool like Debezium let you pull data from the parent tables and help build a complete order object?

Related

Is BigQuery suitable for frequent updates of partial data?

I'm on GCP and have a use case where I want to ingest large-volume events streaming from remote machines.
To compose a final event, I need to ingest and "combine" an event of type X with events of types Y and Z.
event type X schema:
SrcPort
ProcessID
event type Y schema:
DstPort
ProcessID
event type Z schema:
ProcessID
ProcessName
I'm currently using Cloud SQL (PostgreSQL) to store most of my relational data.
I'm wondering whether I should use BigQuery for this use case, since I'm expecting large volume of these kind of events, and I may have future plans for running analysis on this data.
I'm also wondering about how to model these events.
What I care about is the "JOIN" between these events, so the "JOIN"ed event will be:
SrcPort, SrcProcessID, SrcProcessName, DstPort, DstProcessID, DstProcessName
When the "final event" is complete, I want to publish it to PubSub.
I can create a denormalized table and just update it partially as events arrive (how does BigQuery do in terms of update performance?), and then publish to Pub/Sub when the event is complete.
Or, I can store these as raw events in separate "tables", periodically JOIN them into complete events, and then publish to Pub/Sub.
I'm not sure how good PostgreSQL is in terms of storing and handling a large volume of events.
The thing that attracted me to BigQuery is the comfort of handling large volumes with ease.
If you already have this on Postgres, I advise you to see BigQuery as a complementary system that stores a duplicate of the data for analysis purposes.
BigQuery offers you different ways to reduce costs and improve query performance:
Read about partitioning and clustering; with this in mind, you "scan" only the partitions you are interested in to perform the "event completion".
You can use scheduled queries to run MERGE statements periodically to maintain a materialized table (you can schedule this as often as you want); see the sketch below.
You can use materialized views for some of these situations.
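As a rough sketch of the scheduled MERGE idea (dataset, table and column names are invented, and for simplicity I assume X, Y and Z can all be joined on ProcessID — the real join keys depend on how your events relate):

    -- Hypothetical raw tables events_x/y/z and a denormalized events_joined table.
    -- Run this as a scheduled query as often as you need "completed" events.
    MERGE `myproject.mydataset.events_joined` AS t
    USING (
      SELECT x.ProcessID, x.SrcPort, y.DstPort, z.ProcessName
      FROM `myproject.mydataset.events_x` AS x
      JOIN `myproject.mydataset.events_y` AS y ON y.ProcessID = x.ProcessID
      JOIN `myproject.mydataset.events_z` AS z ON z.ProcessID = x.ProcessID
    ) AS s
    ON t.ProcessID = s.ProcessID
    WHEN MATCHED THEN
      UPDATE SET SrcPort = s.SrcPort, DstPort = s.DstPort, ProcessName = s.ProcessName
    WHEN NOT MATCHED THEN
      INSERT (ProcessID, SrcPort, DstPort, ProcessName)
      VALUES (s.ProcessID, s.SrcPort, s.DstPort, s.ProcessName)

This keeps the raw events untouched while the joined table is rebuilt incrementally on whatever schedule you choose.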
BigQuery works well with bulk imports and frequent inserts, like HTTP logging. Inserting into BigQuery in batches of ~100 or ~1000 rows every few seconds works well.
Your idea of creating a final view will definitely help. Storing data in BigQuery is cheaper than processing it, so it won't hurt to keep a raw set of your data.
How you model or structure your events is up to you.

Kafka Connect and Custom Queries

I'm interested in using the Kafka JDBC source connector to publish to Kafka when an Invoice gets created. On the source end, it's broken up into 2 tables, Invoice and InvoiceLine.
Is this possible using custom queries? What would the query look like?
Also, since it's polling, could what gets published contain one or more invoices in a topic?
Thanks
Yes, you can use custom queries. From the docs:
Custom Query: The source connector supports using custom queries instead of copying whole tables. With a custom query, one of the other automatic update modes can be used as long as the necessary WHERE clause can be correctly appended to the query. Alternatively, the specified query may handle filtering to new updates itself; however, note that no offset tracking will be performed (unlike the automatic modes where incrementing and/or timestamp column values are recorded for each record), so the query must track offsets itself.
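To sketch what such a query could look like (table and column names here are invented; adapt them to your schema), you could join the two tables and let the connector append its own offset filter, e.g. with mode=incrementing and incrementing.column.name=InvoiceId:

    -- Wrapped in a subselect so the connector can safely append its own
    -- WHERE InvoiceId > ? ORDER BY InvoiceId clause to the query.
    SELECT * FROM (
        SELECT i.InvoiceId,
               i.CreatedAt,
               l.LineNo,
               l.Description,
               l.Amount
        FROM Invoice i
        JOIN InvoiceLine l ON l.InvoiceId = i.InvoiceId
    ) AS invoice_with_lines

Regarding your second question: each row of the result set becomes its own record in the topic, so one poll can publish several records for a single invoice (one per line) as well as records for several invoices, depending on what changed since the last recorded offset.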

UI5: Retrieve and display thousands of items in sap.m.Table

There is a relational database (MySQL 8) with tens of thousands of items in a table, which need to be displayed in sap.m.Table. The straightforward approach is to retrieve all the items with an SQL query and deliver them to the client side as JSON in an async way. The key drawbacks of this approach are performance and memory consumption on the client side. The whole table needs to be available on the client side to give the user the ability to conduct fast searches. This is crucial for the app.
Currently, there are two options:
Fetch the top 100 records and push them into the table. This way the user can search the latest 100 records immediately. At the same time, run an additional query in a web worker, which will take about 2…5 seconds and fetch all records except those 100. Then merge the two JSONs.
Keep the JSON on the application server as a cached variable and update it when the user adds or deletes a record. Then fetch the JSON, which is supposed to be much faster than querying the database.
How can I show thousands of items in OpenUI5's sap.m.Table?
My opinion:
You need to create an OData backend for your tables. The user can filter or search records with OData capabilities. You don't need to push all the data to the client; sap.m.Table automatically requests the rest of the data via the OData protocol while the user scrolls the table.
Quick answer: you can't.
Use sap.ui.table or provide a proper OData service with top/skip support, as shown here under 4.3 and 4.4.
Based on your backend code (Java, ABAP, Node), there are libs to help you.
The SAP recommendation says a maximum of 100 records for sap.m.Table. In practice, I would advise following the recommendation; even on a fast PC the rendering will be slowed down.
If you want to test more than 100 records, you need to set the size limit on your oModel, e.g. oModel.setSizeLimit(1000);
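A rough JavaScript sketch of the OData/growing approach (service URL, entity set and property names are made up):

    // Bind the table against an OData model; with growing enabled, sap.m.Table
    // loads further chunks via $top/$skip paging as the user scrolls.
    var oModel = new sap.ui.model.odata.v2.ODataModel("/my/odata/service/");
    oModel.setSizeLimit(1000); // only if you really want more than the default

    var oTable = new sap.m.Table({
        growing: true,
        growingThreshold: 100,
        columns: [ new sap.m.Column({ header: new sap.m.Text({ text: "Name" }) }) ]
    });
    oTable.setModel(oModel);
    oTable.bindItems({
        path: "/Items",
        template: new sap.m.ColumnListItem({
            cells: [ new sap.m.Text({ text: "{Name}" }) ]
        })
    });

Searching and filtering then become binding filters that the OData service evaluates, so the client never holds the full data set.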

Storing relational data in Apache Flink as State and querying by a property

I have a database with tables T1(id, name, age) and T2(id, subject).
Flink receives all updates from the database as an event stream using something like Debezium. The tables are related to each other, and the required data can be extracted by joining T1 with T2 on id. Currently the whole state of the database is stored in Flink MapState with id as the key. Now the problem is that I need to select rows from T1 based on name, without using id. It seems like I need an index on T1(name) to make this faster. Is there any way I can index it automatically, without having to manually create an index for each table? What is the recommended way of doing this? I know about SQL streaming on tables, but I require support for updates to the tables. By the way, I use Flink with Scala. Any pointers/suggestions would be appreciated.
My understanding is that you are connecting T1 and T2, and storing some representation (in MapState) of the data from these two streams in keyed state, keyed by id. It sounds like T1 and T2 are evolving over time, and you want to be able to interactively query the join at any time by specifying a name.
One idea would be to broadcast in the name(s) you want to select, and use a KeyedBroadcastProcessFunction to process them. In its processBroadcastElement method you could use ctx.applyToKeyedState to compute the results by extracting data from the MapState records (which would have to be held in this operator). I suspect you will want to use the names as the keys in these MapState records, so that you don't have to iterate over all of the entries in each map to find the items of interest.
You will find a somewhat similar example of this pattern in https://training.data-artisans.com/exercises/ongoingRides.html.
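A minimal Scala sketch of that pattern, assuming a hypothetical T1Row type and a String key; handling of updates/deletes coming from Debezium and the job wiring are omitted:

    import org.apache.flink.api.common.state.{MapState, MapStateDescriptor}
    import org.apache.flink.runtime.state.KeyedStateFunction
    import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction
    import org.apache.flink.util.Collector

    // Hypothetical representation of a T1 row coming from the CDC stream.
    case class T1Row(id: String, name: String, age: Int)

    class QueryByName extends KeyedBroadcastProcessFunction[String, T1Row, String, T1Row] {

      // Per id (the Flink key): a map keyed by name, as suggested above.
      private val byName =
        new MapStateDescriptor[String, T1Row]("t1-by-name", classOf[String], classOf[T1Row])

      // Keyed side: keep the per-id state current as T1 change events arrive.
      override def processElement(
          row: T1Row,
          ctx: KeyedBroadcastProcessFunction[String, T1Row, String, T1Row]#ReadOnlyContext,
          out: Collector[T1Row]): Unit =
        getRuntimeContext.getMapState(byName).put(row.name, row)

      // Broadcast side: a query arrives as a plain name string.
      override def processBroadcastElement(
          name: String,
          ctx: KeyedBroadcastProcessFunction[String, T1Row, String, T1Row]#Context,
          out: Collector[T1Row]): Unit =
        // Visit the MapState of every key (id) and emit the rows matching the name.
        ctx.applyToKeyedState(byName, new KeyedStateFunction[String, MapState[String, T1Row]] {
          override def process(id: String, state: MapState[String, T1Row]): Unit =
            if (state.contains(name)) out.collect(state.get(name))
        })
    }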

Talend Open Studio: run only created or modified records among 15k

I have a job in Talend Open Studio which is working fine; it connects a tMSSqlInput to a tMap and then a tMysqlOutput, very straightforward. My problem is that I need this job to run on a daily basis, but only process records that were newly created or modified... any help is highly appreciated!
It seems that you are searching for a Change Data Capture Tool for Talend.
Unfortunately it is only available in the licensed product.
To implement your requirement, you have several options. I want to show the most popular ones.
CDC from Talend
As Corentin correctly said, you could choose to use CDC (Change Data Capture) from Talend if you use the subscription version.
CDC of MSSQL
Alternatively, you can check whether you can activate or use CDC in your MSSQL server. This depends on your license. If it is possible, you can use this feature to identify new and changed records and process them.
Triggers
You can also create triggers on your database (if you have access to it). For example, creating a trigger for INSERT, UPDATE and DELETE would help you get the deltas. Then you could store those records, or just their IDs, separately.
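A hypothetical T-SQL sketch of that idea (table and column names are made up; adjust to your schema):

    -- Writes the IDs of changed rows into a delta table that the daily
    -- Talend job can read and then clear.
    CREATE TRIGGER trg_Orders_Delta
    ON dbo.Orders
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;
        INSERT INTO dbo.Orders_Delta (OrderId, ChangeType, ChangedAt)
        SELECT OrderId, 'INSERT_OR_UPDATE', SYSUTCDATETIME() FROM inserted
        UNION ALL
        SELECT OrderId, 'DELETE', SYSUTCDATETIME() FROM deleted
        WHERE OrderId NOT IN (SELECT OrderId FROM inserted);
    END;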
Software driven / API
If your database is connected to a piece of software and you have developers around, you could ask for a service which identifies records on insert/update/delete and exposes them to you. This could be done, e.g., via a REST interface.
Delta via ID
If the primary key is an ID and it is set to auto-increment, you could also check your MySQL table for the biggest ID and only SELECT those rows from the source which have a bigger ID than you already have. This of course depends on the database layout.
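For illustration (table names invented), the daily job would first ask the MySQL target for the highest ID it already has, and then pull only newer rows from the MSSQL source:

    -- On the MySQL target:
    SELECT COALESCE(MAX(id), 0) FROM target_orders;

    -- On the MSSQL source, with the value above bound as a parameter:
    SELECT * FROM dbo.Orders WHERE id > ? ORDER BY id;

Note that this only catches newly inserted rows; modified rows would additionally need a last-modified timestamp or one of the other approaches above.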