CDC-Debezium capture data in chain - debezium

CDC-Debezium captures event e.g. Insert, Update or Delete once such event occurs in source system e.g. postgres and it streams data and sends to destination system e.g. NoSQL or Apache-Kafka. I'm very new in this configuration and setup.
I would like to know if there's any way to capture chains of tables while any event triggers e.g. Suppose there's Table A Parent and B Child in source system. Now there's some change occurs in table B it's been successfully captured and flows to destination system. Now I would require some method or configuration in CDC-Debezium which will be able to capture that change has been made in table B and this table is dependent of Table A though no changes occurs on Table A or Vice-Versa.
Please let me know thoughts in this regards.

please check these links, they might be a good starter for the discussion
https://debezium.io/blog/2018/09/20/materializing-aggregate-views-with-hibernate-and-debezium/
https://debezium.io/blog/2018/03/08/creating-ddd-aggregates-with-debezium-and-kafka-streams/
https://debezium.io/blog/2018/03/08/creating-ddd-aggregates-with-debezium-and-kafka-streams/

Related

How to detect if tableview was changed

I have following idea: user can edit database and when he will press exit from database he will be asked if he want to save edited database or if he edited something and then turned it back he wont be asked.
I think i should compare created database and database after editing when user press exit, but don't know how.
This is my code for creating database
model = new QSqlRelationalTableModel(this, *db);
model->setTable("cv");
model->setFilter("cv_id = "+currentCV+"");
model->removeColumns(0,1);
model->select();
ui->tableView->show();
ui->tableView->setModel(model);
You have 2 options.
Create event triggers in the database. Event triggers work when users change database table structures or create tables or change columns. Thus, these triggers work when the user executes any DDL commands. (change table, add column, create index, drop table, etc.) You can insert these commands into your log tables using event triggers.
Your database structures (all tables, columns, indexes, etc.) are stored in information_shema. You can select the data of these tables and save it somewhere and then compare it with the data that has changed.
Take in mind that no change to database is performed before you're submitting the user changes. By default, sumbit is performed when changes occur, and this does not help you. But, you can set
model.setEditStrategy(QSqlTableModel.EditStrategy.OnManualSubmit)
At this point, changes are persisted only when the method model.submitAll() is called.
All you have to do, at this point, is exploiting dataChanged signal of your model and use a flag variable to check if changes have been operated:
data_changed = False
def on_data_changed()
data_changed = True
model.dataChanged.connect(on_data_changed)
Now you can adjust the logic of the code to serve your specific purpose

How to configure a Synapse Mapping Data Flow for INSERT/UPDATE/DELETE when the destination table does not yet exist

I am trying to build a generic Mapping Data Flow for some basic cleansing on tables in my Data Lake. I need it to be able to work both on an ongoing basis after data already exists in my cleansed tables as well as when new tables are added (it would detect them automatically and create and populate the destination). Both the Source and Destination tables with be Delta tables.
The approach I have taken is to have Sources configured to both my actual source and to the target and use either JOIN transformations or EXISTS transformations to identify the new, updated and removed rows.
This works fine for INSERTS and UPDATES, however my issues is dealing with DELETES when there is no data currently in the destination. Obviously there will be nothing to DELETE - that is as expected. However, because I reference the key column that will exist once data is loaded to the table I get an error on an initial run that states:
ERROR Dataflow AppManager: name=BatchJobListener.failed, opId=xxx, message=Job 'xxx failed due to reason: DF-SINK-007 at Sink 'cleansedTableWithDeletes': Sink results in 0 output columns. Please ensure at least one column is mapped.
The overall process looks as follows:
Has anyone developed a pattern that works for a generic flow (this one is parameter driven and ensures schema drift is accommodated) or a way for the Data Flow to think that there IS a column in the destination that it can refer to and get past this issue?
In Source options check Allow no files found.
You can also provide date dynamically in Filter by last modified option.
Refer - https://learn.microsoft.com/en-us/azure/data-factory/data-flow-sink#sink-settings

Cannot repopulate ElectrodeGroup datajoint table

I'm a researcher in Loren Frank's lab at UCSF using datajoint and files in the nwb format. I made some changes to our code for defining entries in our ElectrodeGroup table, and was hoping to test those by deleting an entry in the table and regenerating it with the new code. I was able to delete the entry, but cannot repopulate it. In particular, when I run ElectrodeGroup.populate() or ElectrodeGroup.populate({"nwb_file_name": my_file_name}), no changes are made to the table. I confirmed that the electrode group I deleted and am trying to regenerate is defined in the original nwb file. I am seeking input on why the populate command seems to not be working here. Thanks in advance for any help!
This user also contacted our team through another channel. Sharing the solution below for future users, in reference to this schema. In short, the populate process is reserved for unique upstream primary keys.
Since the ElectrodeGroup's only upstream table dependency is Session, the make method will only be called if there are no electrode groups for that session. This is because from the perspective of DataJoint, the only 'guaranteed' knowledge about what should exist for this table is defined solely by the presence/absence of related upstream records. Since the 'new' primary 'electrode_group_name' attribute is defined by the ElectrodeGroup table itself, DataJoint doesn't know how many copies will be created by make, and so simply invokes make 1 time per Session, expecting the single make invocation to fully define all possible electrode_group_name values the table will use. If there is one value for that session, no work needs to be done, so no make() invocation occurs.
There are a couple possible solutions:
Model the electrode group explicitly, with a table defines the existence of an electrode group (e.g., ElectrodeGroupConfiguration). This ElectrodeGroup would then inherit primary keys from both Session and ElectrodeGroupConfiguration. The ElectrodeGroup make function would be adjusted to load that unique keys across upstream tables.
Adjust the make function to handle the partial insert/update case, and call the make function directly with the desired primary key when these kinds of 'abnormal' updates need to occur.
Method #1 is 'cleanest' w/r/t to the DataJoint data model (explicitly modeled data dependencies using make/populate), whereas #2 is slightly 'escaping' the DataJoint data model in a controlled way to achieve a desired schema/data result.

A way to know if a Firebird table's data has changed without using a trigger

Is there a way of knowing that a table's data has changed (insert/update/delete) without using a trigger on that table? Perhaps a global trigger to indicate changes on a table?
If you want notification of changes, you will need to add a trigger yourself. Firebird 3 added a new feature to simplify identifying changed rows, the pseudo-column RDB$RECORD_VERSION. This pseudo-column contains the transaction that created the current version of a row.
Alternatively, you could try and use the trace facility to monitor for changes, but that is not an out of the box solution, as you will need to write the necessary logic to parse the trace output (and take things like transaction commit/rollback into account).

How do you manage concurrent access to forms?

We've got a set of forms in our web application that is managed by multiple staff members. The forms are common for all staff members. Right now, we've implemented a locking mechanism. But the issue is that there's no reliable way of knowing when a user has logged out of the system, so the form needs to be unlocked. I was wondering if there was a better way to manage concurrent users editing the same data.
You can use optimistic concurrency which is how the .Net data libraries are designed. Effectively you assume that usually no one will edit a row concurrently. When it occurs, you can either throw away the changes made, or try and create some nicer retry logic when you have two users edit the same row.
If you keep a copy of what was in the row when you started editing it and then write your update as:
Update Table set column = changedvalue
where column1 = column1prev
AND column2 = column2prev...
If this updates zero rows, then you know that the row changed during the edit and you can then deal with it, or simply throw an error and tell the user to try again.
You could also create some retry logic? Re-read the row from the database and check whether the change made by your user and the change made in the database are able to be safely combined, then do so automatically. Or you could present a choice to the user as to whether they still wish to make their change based on the values now in the database.
Do something similar to what is done in many version control systems. Allow anyone to edit the data. When the user submits the form, the database is checked for changes. If the record has not been changed prior to this submission, allow it as usual. If both changes are the same, ignore the incoming (now redundant) change.
If the second change is different from the first, the record is now in conflict. The user is presented with a new form, which indicates which fields were changed by the conflicting update. It is then the user's responsibility to resolve the conflict (by updating both sets of changes), or to allow the existing update to stand.
As Spence suggested, what you need is optimistic concurrency. A standard website that does no accounting for whether the data has changed uses what I call "last write wins". Simply put, whichever connection saves to the database last, that version of the data is the one that sticks. In optimistic concurrency, you use a "first write wins" logic such that if two connections try to save the same row at the same time, the first one that commits wins and the second is rejected.
There are two pieces to this mechanism:
The rules by which you fail the second commit
How the system or the user handles the rejected commit.
Determining whether to reject the commit
Two approaches:
Comparison column that changes each time a commit happens
Compare the data with its committed version in the database.
The first one entails using something like SQL Server's rowversion data type which is guaranteed to change each time the row changes. The upside is that it makes it simple to roll your own logic to determine if something has changed. When you get the data, you pull the rowversion column's value and when you commit, you compare that value with what is currently in the database. If they are different, the data has changed since you last retrieved it and you should reject the commit otherwise proceed to save the data.
The second one entails comparing the columns you pulled with their existing committed values in the database. As Spence suggested, if you attempt the update and no rows were updated, then clearly one of the criteria failed. This logic can get tricky when some of the values are null. Many object relational mappers and even .NET's DataTable and DataAdapter technology can help you handle this.
Handling the rejected commit
If you do not leave it up to the user, then the form would throw some message stating that the data has changed since they last edited and you would simply re-retrieve the data overwriting their changes. As you can imagine, users aren't particularly fond of this solution especially in a high volume system where it might happen frequently.
A more sophisticated (and also more complicated) approach is to show the user what has changed allow them to choose which items to try to re-commit, Behind the scenes you would retrieve the data again, overwrite the values picked by the user with their entries and try to commit again. In high volume system, this will still be problematic because by the time the user has tried to re-commit, the data may have changed yet again.
The checkout concept is effectively pessimistic concurrency where users "lock" rows. As you have discovered, it is difficult to implement in a stateless environment. Users are notorious for simply closing their browser while they have something checked out or using the Back button to return a set that was checked out and try to recommit it. IMO, it is more trouble than it is worth to try go this route in a web-based solution. Assuming you write the user name that last changed a given row, with optimistic concurrency, you can inform the user whose changes are rejected who saved the data before them.
I have seen this done two ways. The first is to have a "checked out" column in your database table associated with that data. Your service would have to look for this flag to see if it is being edited. You can have this expire after a time threshold is met (with a trigger) if the user doesn't commit changes. The second way is having a dedicated "checked out" table that stores id's and object names (probably the table name). It would work the same way and you would have less lookup time, theoretically. I see concurrency issues using the second method, however.
Why do you need to look for session timeout? Just synchronize access to your data (forms or whatever) and that's it.
UPDATE: If you mean you have "long transactions" where form is locked as soon as user opens editor (or whatever) and remains locked until user commits changes, then:
either use optimistic locking, implement it by versioning of forms data table
optimistic locking can cause loss of work, if user have been away for a long time, then tried to commit his changes and discovered that someone else already updated a form. In this case you may want to implement explicit "locking" of form, where user "locks" form as soon as he starts work on it. Other user will notice that form is "locked" and either communicate with lock owner to resolve issue, or he can "relock" form for himself, loosing all updates of first user in process.
We put in a very simple optimistic locking scheme that works like this:
every table has a last_update_date
field in it
when the form is created
the last_update_date for the record
is stored in a hidden input field
when the form is POSTED the server
checks the last_update_date in the
database against the date in the
hidden input field.
If they match,
then no one else has changed the
record since the form was created so
the system updates the data.
If they don't match, then someone else has
changed the record since the form was
created. The system sends the user back to the form edit page and tells the user that someone else edited the record and they must reapply their changes.
It is very simple and works well enough.
You can use "timestamp" column on your table. Refer: What is the mysterious 'timestamp' datatype in Sybase?
I understand that you want to avoid overwriting existing data with consecutively updates.
If so, when the user opens a screen you have to get last "timestamp" column to the client.
After changing data just before update, you should check the "timestamp" columns(yours and db) to make sure if anyone has changed tha data while he is editing.
If its changed you will alert an error and he has to startover. If it is not, update the data. Timestamp columns updated automatically.
The simplest method is to format your update statement to include the datetime when the record was last updated. For example:
UPDATE my_table SET my_column = new_val WHERE last_updated = <datetime when record was pulled from the db>
This way the update only succeeds if no one else has changed the record since the last read.
You can message to the user on conflict by checking if the update suceeded via a SELECT after the UPDATE.