Preventing update loops for multiple databases using CDC


We have a number of legacy systems that we're unable to make changes to - however, we want to start taking data changes from these systems and applying them automatically to other systems.
We're thinking of some form of service bus (no specific tech picked yet) sitting in the middle, and a set of bus adapters (one per legacy application) to translate between database specific concepts and general update messages.
One area I've been looking at is using Change Data Capture (CDC) to monitor update activity in the legacy databases, and use that information to construct appropriate messages. However, I have a concern - how best could I, as a consumer of CDC information, distinguish changes applied by the application vs changes applied by the bus adapter on receipt of messages - because otherwise, the first update that gets distributed by the bus will get re-distributed by every receiver when they apply that change to their own system.
If I was implementing "poor man's" CDC - i.e. triggers - then those triggers would execute within the context/transaction/connection of the original DML statements, so I could either design them to ignore one particular user (the user applying incoming updates from the bus), or set and detect a session property to similarly ignore certain updates.
Any ideas?

If I understand your question correctly, you're trying to define a message routing structure that works with a design you've already selected (using an enterprise service bus) and a message implementation that you can use to flow data off your legacy systems that only forward-ports changes to your newer systems.
The difficulty is you're trying to apply changes in such a way that they don't themselves generate a CDC message from the clients receiving the data bundle from your legacy systems. In fact, all you're concerned about is having your newer systems consume the data and not propagate messages back to your bus, creating unnecessary crosstalk that might exponentiate, overloading your infrastructure.
The secret is how MSSQL's CDC features reconcile changes as they propagate through the network. Specifically, note this caveat:
All the changes are logged in terms of LSN or Log Sequence Number. SQL
distinctly identifies each operation of DML via a Log Sequence Number.
Any committed modifications on any tables are recorded in the
transaction log of the database with a specific LSN provided by SQL
Server. The __$operation column values are: 1 = delete, 2 = insert, 3 =
update (values before update), 4 = update (values after update).
cdc.fn_cdc_get_net_changes_dbo_Employee gives us all the records net
changed falling between the LSN we provide in the function. We have
three records returned by the net_change function; there was a delete,
an insert, and two updates, but on the same record. In case of the
updated record, it simply shows the net changed value after both the
updates are complete.
For getting all the changes, execute
cdc.fn_cdc_get_all_changes_dbo_Employee; there are options either to
pass 'ALL' or 'ALL UPDATE OLD'. The 'ALL' option provides all the
changes, but for updates, it provides the after updated values. Hence
we find two records for updates. We have one record showing the first
update when Jason was updated to Nichole, and one record when Nichole
was updated to EMMA.
While this documentation is somewhat terse and difficult to understand, it appears that changes are logged and reconciled in LSN order. Competing changes should be discarded by this system, allowing your consistency model to work effectively.
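To make the difference between the two functions concrete, here is a minimal Python sketch of the "all changes" versus "net changes" idea described in the quote. It is a simplified model and not SQL Server's implementation: the `Change` record, the integer LSNs, the row keys and the names "Alice" and "Bob" are all assumptions made up for illustration; only the operation codes and the Jason/Nichole/EMMA updates come from the quoted text.

```python
from dataclasses import dataclass
from typing import Optional

# Operation codes as listed in the quoted passage.
DELETE, INSERT, UPDATE_BEFORE, UPDATE_AFTER = 1, 2, 3, 4


@dataclass(frozen=True)
class Change:
    lsn: int                # stand-in for the binary LSN; only ordering matters here
    operation: int
    key: int                # hypothetical primary key of the affected row
    name: Optional[str]


def all_changes(log):
    """Mimic the 'ALL' option: every change in LSN order, without before-images."""
    kept = [c for c in log if c.operation != UPDATE_BEFORE]
    return sorted(kept, key=lambda c: c.lsn)


def net_changes(log):
    """Collapse the log to at most one row per key, reflecting its final state."""
    first, last = {}, {}
    for change in all_changes(log):
        first.setdefault(change.key, change)
        last[change.key] = change
    result = []
    for key, final in last.items():
        created = first[key].operation == INSERT
        deleted = final.operation == DELETE
        if created and deleted:
            continue  # row appeared and vanished inside the window: no net change
        op = DELETE if deleted else INSERT if created else UPDATE_AFTER
        result.append(Change(final.lsn, op, key, final.name))
    return sorted(result, key=lambda c: c.lsn)


# One delete, one insert, and two updates on the same record (hypothetical data).
log = [
    Change(10, DELETE, 1, "Alice"),
    Change(20, INSERT, 2, "Bob"),
    Change(30, UPDATE_BEFORE, 3, "Jason"),
    Change(30, UPDATE_AFTER, 3, "Nichole"),
    Change(40, UPDATE_BEFORE, 3, "Nichole"),
    Change(40, UPDATE_AFTER, 3, "EMMA"),
]

for change in net_changes(log):
    print(change)
```

With this log, `net_changes` yields three rows (the delete, the insert, and a single update carrying the final value), while `all_changes` keeps both update rows.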
Note also:
CDC is by default disabled and must be enabled at the database level
followed by enabling on the table.
Option B then becomes obvious: institute CDC on your legacy systems, then use your service bus to translate these changes into updates that aren't bound to CDC (using, for example, raw transactional update statements). This should allow for the one-way flow of data that you seek from the design of your system.
For additional methods of reconciling changes, consider the concepts raised by this Wikipedia article on "eventual consistency". Best of luck with your internal database messaging system.

Related

What does the file Snapshot.scala do in Databricks?

I am running some streaming query jobs on a Databricks cluster, and when I look at the cluster/job logs, I see a lot of
first at Snapshot.scala:1
and
withNewExecutionId at TransactionalWriteEdge.scala:130
A quick search yielded this scala script https://github.com/delta-io/delta/blob/master/src/main/scala/org/apache/spark/sql/delta/Snapshot.scala
Can anyone explain what this does in layman's terms?

Internally this class manages the replay of actions stored in checkpoint or delta file
Generally, this "snapshotting" relies on delta encoding and indirectly allows snapshot isolation as well.
Practically, delta encoding remembers every side-effectful operation like INSERT, DELETE, UPDATE that you did since the last checkpoint. In the case of Delta Lake it would be SingleAction (source): AddFile (insert), RemoveFile (delete). Conceptually this approach is close to event sourcing - without it you'd have to literally store/broadcast the whole state (database or directory) on every update. It is also employed by many classic ACID databases with replication.
Overall it gives you:
ability to continuously replicate file-system/directory/database state (see SnapshotManagement.update). Basically that's why you see a lot of first at Snapshot.scala:1 - it's called in order to catch up with the log every time you start transaction, see DeltaLog.startTransaction. I couldn't find TransactionalWriteEdge sources, but I guess it's called around the same time.
ability to restore state by replaying every action since the last snapshot.
ability to isolate (and store) transactions by keeping their snapshots apart until commit (every SingleAction has txn in order to isolate). Delta-lake uses optimistic locking for that: transaction commits will fail if their logs are not mergeable, while readers don't see uncommitted actions.
P.S. You can see that the log is accessed in line val deltaData = load(files) and actions are stacked on top of previousSnapshot (val checkpointData = previousSnapshot.getOrElse(emptyActions); val allActions = checkpointData.union(deltaData))
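As a rough illustration of the replay idea (in Python rather than Scala, and not the actual Delta Lake code), the sketch below stacks a list of actions on top of a checkpoint to get the current set of live files. The `AddFile` and `RemoveFile` names mirror the actions mentioned above; the `replay` function and the file names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AddFile:
    path: str


@dataclass(frozen=True)
class RemoveFile:
    path: str


def replay(checkpoint_files, delta_actions):
    """Return the set of live files after applying actions on top of a checkpoint."""
    live = set(checkpoint_files)
    for action in delta_actions:
        if isinstance(action, AddFile):
            live.add(action.path)
        elif isinstance(action, RemoveFile):
            live.discard(action.path)
    return live


# Hypothetical state: two files in the last checkpoint, three actions logged since.
checkpoint = {"part-0000.parquet", "part-0001.parquet"}
deltas = [
    AddFile("part-0002.parquet"),
    RemoveFile("part-0000.parquet"),
    AddFile("part-0003.parquet"),
]

snapshot = replay(checkpoint, deltas)
print(sorted(snapshot))
```

Catching up with the log then amounts to running the same replay again with whatever actions were appended since the previous snapshot.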

CQRS and Passing Data

Suppose I have an aggregate containing some data and when it reaches a certain state, I'd like to take all that state and pass it to some outside service. For argument and simplicity's sake, lets just say it is an aggregate that has a list and when all items in that list are checked off, I'd like to send the entire state to some outside service. Now when I'm handling the command for checking off the last item in the list, I'll know that I'm at the end but it doesn't seem correct to send it to the outside system from the processing of the command. So given this scenario what is the recommended approach if the outside system requires all of the state of the aggregate. Should the outside system build its own copy of the data based on the aggregate events or is there some better approach?

Should the outside system build its own copy of the data based on the aggregate events.
Probably not -- it's almost never a good idea to share the responsibility of rehydrating an aggregate from its history. The service that owns the object should be responsible for rehydration.
First key idea to understand is when in the flow the call to the outside service should happen.
First, the domain model processes the command arguments, computing the update to the event history, including the ChecklistCompleted event.
The application takes that history, and saves it to the book of record
The transaction completes successfully.
At this point, the application knows that the operation was successful, but the caller doesn't. So the usual answer is to be thinking of an asynchronous operation that will do the rest of the work.
Possibility one: the application takes the history that it just saved, and uses that history to schedule a task to rehydrate a read-only copy of the aggregate state, and then send that state to the external service.
Possibility two: you ditch the copy of the history that you have now, and fire off an asynchronous task that has enough information to load its own copy of the history from the book of record.
There are at least three ways that you might do this. First, you could have the command schedule the task as before.
Second, you could have an event handler listening for ChecklistCompleted events in the book of record, and have that handler schedule the task.
Third, you could read the ChecklistCompleted event from the book of record, and publish a representation of that event to a shared bus, and let the handler in the external service call you back for a copy of the state.
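A minimal Python sketch of possibility two, with the command scheduling the task, might look like the following. Everything here is a stand-in chosen for illustration: the in-memory `event_store` plays the book of record, `task_queue` plays the asynchronous scheduler, `sent_to_external` plays the outside service, and the event names other than ChecklistCompleted are assumptions.

```python
from collections import defaultdict, deque

event_store = defaultdict(list)   # book of record: stream id -> list of events
task_queue = deque()              # stand-in for an asynchronous task scheduler
sent_to_external = []             # stand-in for the outside service


def rehydrate(stream_id):
    """Rebuild read-only checklist state from the saved history."""
    state = {"items": {}, "completed": False}
    for kind, payload in event_store[stream_id]:
        if kind == "ItemAdded":
            state["items"][payload] = False
        elif kind == "ItemChecked":
            state["items"][payload] = True
        elif kind == "ChecklistCompleted":
            state["completed"] = True
    return state


def check_item(stream_id, item):
    """Command handler: compute new events, save them, then schedule follow-up work."""
    state = rehydrate(stream_id)
    new_events = [("ItemChecked", item)]
    remaining = [i for i, done in state["items"].items() if not done and i != item]
    if not remaining:
        new_events.append(("ChecklistCompleted", None))
    event_store[stream_id].extend(new_events)  # the "transaction" completes here
    if ("ChecklistCompleted", None) in new_events:
        # Only the stream id is captured; the task loads its own copy of the history.
        task_queue.append(lambda: sent_to_external.append(rehydrate(stream_id)))


def run_pending_tasks():
    """Drain the queue; a real system would do this on another thread or process."""
    while task_queue:
        task_queue.popleft()()


event_store["list-1"] = [("ItemAdded", "milk"), ("ItemAdded", "eggs")]
check_item("list-1", "milk")
check_item("list-1", "eggs")
run_pending_tasks()
print(sent_to_external)
```

The point of the design is that the command handler never talks to the outside service itself; it only records the events and leaves a task behind, so a slow or unavailable external service cannot fail the command.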
I was under the impression that one bounded context should not reach out to get state from another bounded context but rather keep local copies of the data it needed.
From my experience, the key idea is that the services shouldn't block each other -- or more specifically, a call to service B should not block when service A is unavailable. Responding to events is fundamentally non blocking; does it really matter that we respond to an asynchronously delivered event by making an asynchronous blocking call?
What this buys you, however, is independent evolution of the two services - A broadcasts an event, B reacts to the event by calling A and asking for a representation of the aggregate that B understands, A -- being backwards compatible -- delivers the requested representation.
Compare this with requiring a new release of B every time the rehydration logic in A changes.
Udi Dahan raised a challenging idea - the notion that each piece of data belongs to a single technical authority. "Raw business data" should not be replicated between services.
A service is the technical authority for a specific business capability.
Any piece of data or rule must be owned by only one service.
So in Udi's approach, you'd start to investigate why B has any responsibility for data owned by A, and from there determine how to align that responsibility and the data into a single service. (Part of the trick: the physical view of a service can span process boundaries; in other words, a process may be composed from components that belong to more than one service).
Jeppe Cramon's series on microservices is nicely sourced, and touches on many of the points above.

You should never externalise your state. Reporting on that state is a function of the read side, as it produces reports and you'll need that data to call the service. The structure of your state is plastic, and you shouldn't have an external service that relies upon that structure, otherwise you'll have to update both in lockstep, which is a bad thing.
There is a blog that puts forward a strong argument that the process manager is the correct place to put this type of feature (calling an external service), because that's the appropriate place for orchestrating events.

Update/overwrite DNS record Google Cloud

Does anyone know what is a best practice to overwrite records under Google DNS Cloud, using API? https://cloud.google.com/dns/api/v1/changes/create does not help!
I could delete and create, but it is not nice ;) and could cause an outage.
Regards

The Cloud DNS API uses Changes objects to perform the update actions; you can create Changes but you don't ever delete them. In the Cloud DNS API, you never operate directly on the resource record sets. Instead, you create a Changes object with your desired additions and deletions and if that is created successfully, it applies those updates to the specified resource record sets in your managed DNS zone.
It's an unusual mental model, sort of like editing a file by specifying a diff to be applied, or appending to the commit history of a Git repository to change the contents of a file. Still, you can certainly achieve what you want to do using this API, and it is applied atomically at the authoritative servers (although the DNS system as a whole does not really do anything atomically, due to caching, so if you know you will be making changes, reduce your TTLs before you make the changes). The atomicity here is more about the updates themselves: if you have multiple applications making changes to your managed zones, and there are conflicts in changes to the particular record sets, the create operation will fail, and you will have to retry the change with modified deletions (rather than having changes be silently overwritten).
Anyhow, what you want to do is to create a Changes object with deletions that specifies the current resource record set, and additions that specifies your desired replacement. This can be rather verbose, especially if you have a domain name with a lot of records of the same type. For example, if you have four A records for mydomain.example (1.1.1.1, 2.2.2.2, 3.3.3.3, and 4.4.4.4) and want to change the 3.3.3.3 address to 5.5.5.5, you need to list all four original A records in deletions and then the new four (1.1.1.1, 2.2.2.2, 4.4.4.4, and 5.5.5.5) in additions.
The Cloud DNS documentation provides example code boilerplate that you can adapt to do what you want: https://cloud.google.com/dns/api/v1/changes/create#examples, you just need to set the deletions and additions for the Changes object you are creating.
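For the four-A-record example above, the request body could be assembled along these lines. This is only a sketch of the payload, not a call to the API: the helper name `build_replace_change` is made up, the TTL of 300 is an assumption (the deletions entry has to match whatever the zone currently holds), and the record name is written with a trailing dot as a fully qualified name.

```python
import json


def build_replace_change(name, record_type, ttl, current_rrdatas, desired_rrdatas):
    """Build a Changes body that swaps one resource record set for another."""

    def record_set(rrdatas):
        return {
            "name": name,
            "type": record_type,
            "ttl": ttl,
            "rrdatas": list(rrdatas),
        }

    return {
        # The existing record set, listed in full so it can be removed.
        "deletions": [record_set(current_rrdatas)],
        # The complete replacement record set.
        "additions": [record_set(desired_rrdatas)],
    }


change = build_replace_change(
    name="mydomain.example.",
    record_type="A",
    ttl=300,  # assumed; use the TTL the existing record set actually has
    current_rrdatas=["1.1.1.1", "2.2.2.2", "3.3.3.3", "4.4.4.4"],
    desired_rrdatas=["1.1.1.1", "2.2.2.2", "4.4.4.4", "5.5.5.5"],
)

print(json.dumps(change, indent=2))
```

The resulting dictionary is what you would pass as the body of the changes create call in the boilerplate from the documentation.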

I have never used APIs for this purpose, but if you use command line i.e. gcloud to update DNS records, it binds the change in a single transaction and both tasks of deleting the record and adding the updated record are executed as a single transaction. Since transactions are atomic in nature, it shouldn't cause any outage.
Personally, I never witnessed any outage while using gcloud for updating DNS settings for my domain.

Can I use Time as globally unique event version?

I have found time to be the best value to use as an event version.
I can perfectly merge independent events from different event sources on different servers whenever needed, without worrying about synchronizing event order on the read side. I know which event (from server 1) happened before the other (from server 2) without the need for a global sequential event id generator, which would make all read sides depend on it.
As long as time is a globally sequential event version, different teams in a company can act as distributed event sources or event readers, and everyone can always rely on the contract.
The world's simplest notification from a write side to subscribed read sides followed by a query pulling the recent changes from the underlying write side can simplify everything.
Are there any side effects I'm not aware of ?

Time is indeed increasing and you get a deterministic number; however, event versioning does not only serve the purpose of preventing conflicts. We always say that when we commit a new event to the event store, we send the new event version there as well, and it must match the expected version on the event store side, which must be the previous version plus exactly one. Whether there are a thousand or three million ticks between two events, I do not really care; this does not give me the information I need. And if I have missed an event along the way, it is critical to know. So I would not use anything other than an incremental counter, with events versioned per aggregate/stream.
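A small Python sketch of that check, assuming a toy in-memory stream (the `EventStream` and `VersionMismatch` names are made up for illustration and do not belong to any particular event store product):

```python
class VersionMismatch(Exception):
    """Raised when a writer's version is not the previous version plus one."""


class EventStream:
    """Toy per-aggregate stream with an incremental version counter."""

    def __init__(self):
        self.events = []

    @property
    def version(self):
        return len(self.events)

    def append(self, event, new_version):
        expected = self.version + 1
        if new_version != expected:
            # A gap means the writer missed an event; a repeat means a conflict.
            raise VersionMismatch(f"expected version {expected}, got {new_version}")
        self.events.append(event)


stream = EventStream()
stream.append("ItemAdded", new_version=1)
stream.append("ItemChecked", new_version=2)

try:
    stream.append("ItemChecked", new_version=4)  # version 3 was never seen
except VersionMismatch as error:
    print(error)
```

A timestamp cannot play this role: two valid consecutive events and two events with a missing one in between look the same, since both are just "later than the previous one".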

Last Updated Date: Antipattern?

I keep seeing questions floating through that make reference to a column in a database table named something like DateLastUpdated. I don't get it.
The only companion field I've ever seen is LastUpdateUserId or such. There's never an indicator about why the update took place; or even what the update was.
On top of that, this field is sometimes written from within a trigger, where even less context is available.
It certainly doesn't even come close to being an audit trail; so that can't be the justification. And if there is an audit trail somewhere in a log or whatever, this field would be redundant.
What am I missing? Why is this pattern so popular?

Such a field can be used to detect whether there are conflicting edits made by different processes. When you retrieve a record from the database, you get the previous DateLastUpdated field. After making changes to other fields, you submit the record back to the database layer. The database layer checks that the DateLastUpdated you submit matches the one still in the database. If it matches, then the update is performed (and DateLastUpdated is updated to the current time). However, if it does not match, then some other process has changed the record in the meantime and the current update can be aborted.
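Here is a minimal sketch of that check using Python's built-in sqlite3 module. The table, column names and timestamps are hypothetical, and the current time is passed in as a parameter purely to keep the example deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, address TEXT, date_last_updated TEXT)"
)
conn.execute("INSERT INTO customer VALUES (1, '1 Old Road', '2024-01-01T00:00:00')")
conn.commit()


def update_address(conn, customer_id, new_address, seen_timestamp, now):
    """Apply the update only if the row still carries the timestamp we read earlier."""
    cursor = conn.execute(
        "UPDATE customer SET address = ?, date_last_updated = ? "
        "WHERE id = ? AND date_last_updated = ?",
        (new_address, now, customer_id, seen_timestamp),
    )
    conn.commit()
    return cursor.rowcount == 1  # zero rows touched means someone else got there first


# Two processes both read the row while it was stamped 2024-01-01.
first = update_address(conn, 1, "2 New Street", "2024-01-01T00:00:00", "2024-02-01T00:00:00")
second = update_address(conn, 1, "3 Stale Lane", "2024-01-01T00:00:00", "2024-02-02T00:00:00")

print(first, second)
```

The first update succeeds and moves the timestamp forward, so the second one, still holding the old timestamp, matches no rows and can be reported back as a conflict.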

It depends on the exact circumstance, but a timestamp like that can be very useful for autogenerated data - you can figure out if something needs to be recalculated if a dependency has changed later on (this is how build systems calculate which files need to be recompiled).
Also, many websites will have data marking "Last changed" on a page, particularly news sites that may edit content. The exact reason isn't necessary (and there likely exist backups in case an audit trail is really necessary), but this data needs to be visible to the end user.

These sorts of things are typically used for business applications where user action is required to initiate the update. Typically, there will be some kind of business app (eg a CRM desktop application) and for most updates there tends to be only one way of making the update.
If you're looking at address data, that was done through the "Maintain Address" screen, etc.
Such database auditing is there to augment business-level auditing, not to replace it. Call centres will sometimes (or always in the case of financial services providers in Australia, as one example) record phone calls. That's part of the audit trail too but doesn't tend to be part of the IT solution as far as the desktop application (and related infrastructure) goes, although that is by no means a hard and fast rule.
Call centre staff will also typically have some sort of "Notes" or "Log" functionality where they can type freeform text as to why the customer called and what action was taken so the next operator can pick up where they left off when the customer rings back.
Triggers will often be used to record exactly what was changed (eg writing the old record to an audit table). The purpose of all this is that with all the information (the notes, recorded call, database audit trail and logs) the previous state of the data can be reconstructed as can the resulting action. This may be to find/resolve bugs in the system or simply as a conflict resolution process with the customer.

It is certainly popular - Rails, for example, has a shorthand for it, as well as a creation timestamp (:timestamps).
At the application level it's very useful, as the same pattern is very common in views - look at the questions here for example (answered 56 secs ago, etc).
It can also be used retrospectively in reporting to generate stats (e.g. what is the growth curve of the number of records in the DB).

There are a couple of scenarios.
Let's say you have an address table for your customers.
You have your CRM app, and the customer calls to say that his address changed a month ago; with the LastUpdate column you can see that this row for this customer hasn't been touched in 4 months.
Usually you use triggers to populate a history table so that you can see all the other history; if you see that the creation date and the updated date are the same, there is no point hitting the history table since you won't find anything.
You calculate indexes (stock market); you can easily see that it was recalculated just by looking at this column.
There are 2 DB servers; by comparing the date column you can find out if all the changes have been replicated or not, etc.

This is also very useful if you have to send feeds out to clients that are delta feeds, that is, only the records that have been changed or inserted since the date of the last feed are sent.
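As a sketch of how such a delta feed could be selected (the row layout, the `date_last_updated` key and the dates are all hypothetical):

```python
from datetime import datetime


def delta_feed(rows, last_feed_at):
    """Return only the rows changed or inserted after the previous feed was cut."""
    return [row for row in rows if row["date_last_updated"] > last_feed_at]


rows = [
    {"id": 1, "date_last_updated": datetime(2024, 1, 5)},
    {"id": 2, "date_last_updated": datetime(2024, 2, 10)},
    {"id": 3, "date_last_updated": datetime(2024, 2, 12)},
]

changed = delta_feed(rows, last_feed_at=datetime(2024, 2, 1))
print([row["id"] for row in changed])
```

In a database this would normally be a WHERE clause on the date column, with the cut-off stored after each successful feed.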
