Looking at JOliver's EventStore, I see that StreamRevision and CommitSequence are the same if you only ever commit one event per commit, and it is the StreamRevision that is used to select events.
Suppose I first created an aggregate that committed 1 event, and after that committed 10 more events, which would make my SQL database table look like this (simplified):
Revision  Items  Sequence
1         1      1
11        10     2
I have two questions that derive from this:
Is this the difference between StreamRevision and CommitSequence?
The store exposes a "GetFrom" method that takes a "minRevision" and a "maxRevision". With the data from above, how does this work if I request minRevision=4 and maxRevision=8? Shouldn't it have been "minSequence" and "maxSequence" instead?
Thanks.
Werner
Commits are a storage concept to prevent duplicates and facilitate optimistic concurrency in storage engines that don't have transactional support, such as CouchDB and MongoDB. StreamRevision, on the other hand, represents the number of events committed to the stream.
When you're working with a stream and you call GetFrom() with a min/max revision of 4-8, that means you want (per your example) all events from v4 through v8, which are encapsulated by commit #2.
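To make that concrete, here is a minimal sketch of the selection logic (hypothetical types and field names, not the actual EventStore schema): each commit spans a range of revisions, and GetFrom returns the events of every commit whose range overlaps the requested [minRevision, maxRevision], trimmed to that window.

// Hypothetical commit shape: a commit covers revisions startRevision..endRevision.
case class Commit(startRevision: Int, endRevision: Int, sequence: Int, events: Seq[String])

def getFrom(commits: Seq[Commit], minRevision: Int, maxRevision: Int): Seq[String] =
  commits
    .filter(c => c.endRevision >= minRevision && c.startRevision <= maxRevision) // overlapping commits
    .sortBy(_.sequence)
    .flatMap { c =>
      // Events inside a commit are numbered startRevision..endRevision; trim to the window.
      c.events.zipWithIndex.collect {
        case (e, i) if c.startRevision + i >= minRevision && c.startRevision + i <= maxRevision => e
      }
    }

// The question's data: commit #1 holds event v1; commit #2 holds events v2..v11.
val commits = Seq(
  Commit(1, 1, 1, Seq("v1")),
  Commit(2, 11, 2, (2 to 11).map(i => s"v$i")))
getFrom(commits, 4, 8) // => Seq("v4", "v5", "v6", "v7", "v8"), all served by commit #2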
I am running some streaming query jobs on a Databricks cluster, and when I look at the cluster/job logs, I see a lot of
first at Snapshot.scala:1
and
withNewExecutionId at TransactionalWriteEdge.scala:130
A quick search yielded this Scala source file: https://github.com/delta-io/delta/blob/master/src/main/scala/org/apache/spark/sql/delta/Snapshot.scala
Can anyone explain what this does in layman's terms?
Internally, this class manages the replay of actions stored in checkpoint or delta files.
Generally, this "snapshotting" relies on delta encoding and indirectly allows snapshot isolation as well.
Practically, delta encoding remembers every side-effectful operation (INSERT, DELETE, UPDATE) performed since the last checkpoint. In the case of Delta Lake these are SingleAction instances (see the source): AddFile (insert) and RemoveFile (delete). Conceptually this approach is close to event sourcing - without it you'd have to literally store/broadcast the whole state (database or directory) on every update. It is also employed by many classic ACID databases with replication.
Overall it gives you:
the ability to continuously replicate file-system/directory/database state (see SnapshotManagement.update). That's basically why you see a lot of first at Snapshot.scala:1 - it's called to catch up with the log every time you start a transaction; see DeltaLog.startTransaction. I couldn't find the TransactionalWriteEdge sources, but I'd guess it's invoked around the same time.
the ability to restore state by replaying every action since the last snapshot.
the ability to isolate (and store) transactions by keeping their snapshots apart until commit (every SingleAction carries a txn field for isolation). Delta Lake uses optimistic locking for this: a transaction's commit fails if its log cannot be merged, while readers never see uncommitted actions.
P.S. You can see that the log is read in the line val deltaData = load(files), and that actions are stacked on top of previousSnapshot (val checkpointData = previousSnapshot.getOrElse(emptyActions); val allActions = checkpointData.union(deltaData)).
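As a layman's-terms illustration of that replay, here is a toy sketch. The AddFile/RemoveFile names mirror the real action types, but everything else is simplified and hypothetical, not the Delta implementation:

sealed trait Action
case class AddFile(path: String) extends Action
case class RemoveFile(path: String) extends Action

// Start from the last checkpoint's file set and fold every action logged
// since then; the result is the current table state, obtained without ever
// rewriting or broadcasting the whole state.
def replay(checkpoint: Set[String], actionsSinceCheckpoint: Seq[Action]): Set[String] =
  actionsSinceCheckpoint.foldLeft(checkpoint) {
    case (files, AddFile(p))    => files + p // "insert"
    case (files, RemoveFile(p)) => files - p // "delete"
  }

replay(Set("part-0001.parquet"),
  Seq(AddFile("part-0002.parquet"), RemoveFile("part-0001.parquet")))
// => Set("part-0002.parquet")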
My challenge:
We receive files every day with about 200,000 records. We keep the files for approximately 1 year to support reprocessing, etc.
For the sake of discussion, assume it is some sort of long-running fulfilment process, with a provisioning-ID that correlates records.
We need to identify flexible patterns in these files and trigger events.
Typical questions are:
If record A is followed by record B, which is followed by record C, and all records occurred within 60 days, then trigger an event.
If record D or record E was found, but record F did NOT follow within 30 days, then trigger an event.
If both record D and record E were found (irrespective of the order), followed by ... within 24 hours, then trigger an event.
Some patterns require lookups in a DB/NoSQL store, or joins for additional information, either to select the record or to put data into the event.
"Selecting a record" can be a simple "field-A equals", but can also be "field-A in []", "field-A match ", or "func identify(field-A, field-B)".
"Days" might also be "hours" or "in the previous month", hence more flexible than just "days". Usually we have some date/timestamp in the record. The maximum is currently "within 6 months" (cancel within setup phase).
The created events (preferably JSON) need to contain data from all records that were part of the selection process.
We need an approach that allows us to flexibly change (add, modify, delete) the patterns, optionally reprocessing the input files.
Any thoughts on how to tackle the problem elegantly? Maybe some Python or Java framework, or do any of the public cloud platforms (AWS, GCP, Azure) address this problem space especially well?
Thanks a lot for your help.
After some discussion and reading, we'll first try Apache Flink with the FlinkCEP library. From the docs and blog entries it seems able to do the job, and it also appears to be AWS's choice, running on their EMR clusters. We didn't find a managed service on GCP or Azure providing this functionality, though of course we can always deploy and manage it ourselves. Unfortunately, we didn't find a Python framework.
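For illustration, here is a sketch of the first pattern ("A, then B, then C, all within 60 days") using FlinkCEP's Scala API. The Record class and its fields are hypothetical stand-ins for the file records, and event-time/watermark setup is omitted for brevity:

import org.apache.flink.cep.scala.CEP
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

// Hypothetical input record; in practice this comes from the daily files.
case class Record(provisioningId: String, recordType: String, ts: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
val records: DataStream[Record] = env.fromElements(
  Record("p1", "A", 1L), Record("p1", "B", 2L), Record("p1", "C", 3L))

// "A followed by B followed by C, all within 60 days", per provisioning-ID.
val pattern = Pattern.begin[Record]("a").where(_.recordType == "A")
  .followedBy("b").where(_.recordType == "B")
  .followedBy("c").where(_.recordType == "C")
  .within(Time.days(60))

val matches = CEP.pattern(records.keyBy(_.provisioningId), pattern)
  .select(m => s"triggered for ${m("a").head.provisioningId}") // build the JSON event payload here

matches.print()
env.execute("cep-sketch")

For the "record F did NOT follow within 30 days" case, FlinkCEP's notFollowedBy() is the relevant construct, with the caveat that a pattern generally cannot end with notFollowedBy, so it may need to be combined with a time window or a terminating condition.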
I'm developing a single-page web app that will use a NoSQL document database (like MongoDB), and I want to generate events when I make a change to my entities.
Since most of these databases support transactions only at the document level (MongoDB just added multi-document ACID support), there is no good way to store changes in one document and then store the events derived from those changes in other documents.
Let's say for example that I have a collection 'Events' and a collection 'Cards' like Trello does. When I make a change to the description of a card from the 'Cards' collection, an event 'CardDescriptionChanged' should be generated.
The problem is that if there is a crash or some error between saving the changes to the 'Cards' collection and adding the event to the 'Events' collection, the event will not be persisted, and I don't want that.
I've done some research on this issue, and most people suggest one of several approaches:
Do not use MongoDB; use a SQL database instead (I don't want that).
Use Event Sourcing. (This introduces complexity, and I want to clear older events at some point, so I don't want to keep all events stored. I know I can use snapshots and delete events older than the snapshot point, but there is complexity in this solution.)
Since errors of this nature probably won't happen too often, just ignore them and risk having events that won't be saved (I don't want that either).
Use an event/command/action processor. Store commands/actions like 'ChangeCardDescription' and use a processor that processes them and updates the entities.
I have considered option 4, but a couple of questions arise:
How do I manage concurrency?
I can queue all commands for the same entity (like a card or a board) and make sure they are processed sequentially, while commands for different entities (different cards) are processed in parallel; a sketch of this per-entity queueing follows below. Then I can use the processed commands as events. One problem here is that a change to an entity may generate several events that do not correspond to a single command, so I would have to break all user actions down into very fine-grained commands in order to translate them into events.
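A minimal in-process sketch of that per-entity queueing (hypothetical; a production system would use a durable queue partitioned by entity ID instead):

import java.util.concurrent.ConcurrentHashMap
import scala.concurrent.{ExecutionContext, Future}

// Commands for the same entity are chained onto a per-entity Future "tail",
// so they run strictly one after another, while commands for different
// entities run in parallel on the execution context.
class PerEntityQueue(implicit ec: ExecutionContext) {
  private val tails = new ConcurrentHashMap[String, Future[Unit]]()

  def submit(entityId: String)(command: () => Unit): Future[Unit] =
    tails.compute(entityId, (_, tail) => {
      val prev = if (tail == null) Future.unit else tail
      // A failed command must not stall the queue for its entity.
      prev.recover { case _ => () }.map(_ => command())
    })
}

// Usage: queue.submit("card-42")(() => applyChangeDescriptionCommand(...))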
Error reporting and error handling.
If this process is asynchronous, I have to manage error reporting to the client, and I also have to remove or mark the commands that failed.
I still have the problem of marking commands as processed, since there are no cross-document transactions. I know I have to make command processing idempotent to resolve this.
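One way to get that idempotency without cross-document transactions is to record the command ID inside the same single-document update that applies the change. A sketch with the MongoDB Scala driver (collection and field names are hypothetical):

import org.mongodb.scala._
import org.mongodb.scala.model.Filters.{and, equal, notEqual}
import org.mongodb.scala.model.Updates.{addToSet, combine, set}

val client = MongoClient() // assumes a local mongod
val cards = client.getDatabase("app").getCollection("Cards")

// The filter only matches if this commandId was never applied to the card,
// and recording the commandId happens in the SAME single-document (hence
// atomic) update as the change itself. Re-processing the command is a no-op.
def applyChangeDescription(cardId: String, commandId: String, text: String) =
  cards.updateOne(
    and(equal("_id", cardId), notEqual("appliedCommands", commandId)),
    combine(set("description", text), addToSet("appliedCommands", commandId))
  ).toFuture()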
Since Trello uses MongoDB and generates actions ('DeleteCardAction', 'CreateCardAction') alongside changes to entities (Cards, Boards, ...), I was wondering how they solve this problem.
Create a new collection called FutureUpdates. Write planned updates to the FutureUpdates collection with a single document defining the changes you plan to make to cards and the events you plan to generate. This insert will be atomic.
Now open a change stream on the FutureUpdates collection; this will give you the stream of updates you need to make. Take each doc from the change stream and apply the updates. Finally, update the doc in FutureUpdates to mark it as complete. Again, this update will be atomic.
When you apply the updates to Events and Cards, make sure to include the ObjectId of the FutureUpdates doc that produced the update.
Now, if the program crashes after inserting the update into FutureUpdates, you can check the Events and Cards collections for records containing that ObjectId. If they are not present, you can reapply the missing updates.
If the updates have been applied but the FutureUpdates doc is not marked as complete, recovery can simply mark it complete to finish the process.
Effectively, you are atomically updating one doc per change in FutureUpdates to track progress. Once an update is complete, you can archive the old docs or just delete them.
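A sketch of the whole cycle with the MongoDB Scala driver (all names hypothetical; change streams require a replica set, and error handling, resume tokens, and awaiting of futures are omitted for brevity):

import com.mongodb.client.model.changestream.ChangeStreamDocument
import org.mongodb.scala._
import org.mongodb.scala.model.Filters.equal
import org.mongodb.scala.model.Updates.{combine, set}
import scala.concurrent.ExecutionContext.Implicits.global

val client = MongoClient()
val db = client.getDatabase("app")
val futureUpdates = db.getCollection("FutureUpdates")
val cards = db.getCollection("Cards")
val events = db.getCollection("Events")

// 1) Plan the change atomically, in ONE document.
futureUpdates.insertOne(Document(
  "cardId" -> "card-42",
  "newDescription" -> "hello",
  "complete" -> false)).toFuture()

// 2) Consume the change stream and apply each planned update, tagging the
//    Cards/Events writes with the plan's _id so recovery can detect which
//    steps already ran; 3) mark the plan complete (single-doc atomic update).
futureUpdates.watch().subscribe((change: ChangeStreamDocument[Document]) =>
  Option(change.getFullDocument).foreach { plan =>
    val planId = plan("_id")
    for {
      _ <- cards.updateOne(equal("_id", plan("cardId")),
             combine(set("description", plan("newDescription")), set("planId", planId))).toFuture()
      _ <- events.insertOne(Document("type" -> "CardDescriptionChanged", "planId" -> planId)).toFuture()
      _ <- futureUpdates.updateOne(equal("_id", planId), set("complete", true)).toFuture()
    } yield ()
  })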
Consider raw events (alpha set in Druid parlance) of the form timestamp | compoundId | dimension 1 | dimension 2 | metric 1 | metric 2
Normally in Druid, data can be loaded onto realtime nodes and historical nodes based on rules, and these rules seem to be related to time ranges. E.g.:
load the last day of data on boxes A
load the last week (except last day) on boxes B
keep the rest in deep storage but don't load segments.
In contrast I want to support the use-case of:
load the last event for each given compoundId on boxes A, regardless of whether that last event arrived today or yesterday.
Is this possible?
Alternatively, if the above is not possible, I figured a workaround might be to create a beta set at the finest granularity, as follows:
Given an alpha set with the schema defined above, create a beta set so that:
all events for a given compoundId are rolled up.
metric1 and metric2 are set to the metrics from the last occurring (largest timestamp) event.
Any advice much appreciated.
I believe the first/last aggregators (e.g., doubleFirst and doubleLast) are what you are looking for.
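If you end up pre-computing the beta set outside Druid before ingestion, a minimal Spark sketch could look like this (paths and column names are hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("beta-set").getOrCreate()

// Hypothetical alpha-set location and schema:
// timestamp | compoundId | dimension1 | dimension2 | metric1 | metric2
val alpha = spark.read.parquet("/data/alpha")

// Keep only the last occurring (largest timestamp) event per compoundId.
val byIdLatestFirst = Window.partitionBy("compoundId").orderBy(col("timestamp").desc)
val beta = alpha
  .withColumn("rn", row_number().over(byIdLatestFirst))
  .where(col("rn") === 1)
  .drop("rn")

beta.write.mode("overwrite").parquet("/data/beta")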
We have a number of legacy systems that we're unable to make changes to - however, we want to start taking data changes from these systems and applying them automatically to other systems.
We're thinking of some form of service bus (no specific tech picked yet) sitting in the middle, and a set of bus adapters (one per legacy application) to translate between database specific concepts and general update messages.
One area I've been looking at is using Change Data Capture (CDC) to monitor update activity in the legacy databases, and use that information to construct appropriate messages. However, I have a concern - how best could I, as a consumer of CDC information, distinguish changes applied by the application vs changes applied by the bus adapter on receipt of messages - because otherwise, the first update that gets distributed by the bus will get re-distributed by every receiver when they apply that change to their own system.
If I were implementing "poor man's" CDC - i.e., triggers - those triggers would execute within the context/transaction/connection of the original DML statements, so I could either design them to ignore one particular user (the user applying incoming updates from the bus), or set and detect a session property to similarly ignore certain updates.
Any ideas?
If I understand your question correctly, you're trying to define a message-routing structure that works with a design you've already selected (an enterprise service bus) and a message implementation you can use to flow data off your legacy systems, forward-porting changes only to your newer systems.
The difficulty is that you're trying to apply changes in such a way that they don't themselves generate a CDC message from the clients receiving the data bundle from your legacy systems. In other words, you want your newer systems to consume the data without propagating messages back to your bus, creating unnecessary crosstalk that might grow exponentially and overload your infrastructure.
The secret is how MSSQL's CDC features reconcile changes as they propagate through the network. Specifically, note this caveat:
All the changes are logged in terms of LSN, or Log Sequence Number. SQL Server distinctly identifies each DML operation via a Log Sequence Number: any committed modification on any table is recorded in the transaction log of the database with a specific LSN provided by SQL Server. The __$operation column values are: 1 = delete, 2 = insert, 3 = update (values before update), 4 = update (values after update).

cdc.fn_cdc_get_net_changes_dbo_Employee gives us all the records net-changed between the LSNs we pass to the function. We have three records returned by the net-changes function: there was a delete, an insert, and two updates, but on the same record. In the case of the updated record, it simply shows the net changed value after both updates are complete.

For getting all the changes, execute cdc.fn_cdc_get_all_changes_dbo_Employee; there are options to pass either 'ALL' or 'ALL UPDATE OLD'. The 'ALL' option provides all the changes, but for updates it provides the after-update values. Hence we find two records for updates: one showing the first update, when Jason was updated to Nichole, and one when Nichole was updated to EMMA.
While this documentation is somewhat terse and difficult to understand, it appears that changes are logged and reconciled in LSN order. Competing changes should be discarded by this system, allowing your consistency model to work effectively.
Note also:
CDC is disabled by default and must be enabled at the database level, followed by enabling it on each table.
Option B then becomes obvious: institute CDC on your legacy systems, then use your service bus to translate these changes into updates that aren't bound to CDC (using, for example, raw transactional update statements). This should allow for the one-way flow of data that you seek from the design of your system.
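At the adapter level, the same idea can be expressed as origin tagging: every outgoing message records which system it came from, and each adapter drops its own echoes. A minimal sketch (all types hypothetical, independent of the bus technology eventually chosen; note that CDC tables themselves don't record the writing login, so the writerLogin check below mirrors the trigger-based approach from the question, while the origin tag works in either case):

case class ChangeMessage(originSystem: String, table: String, payload: Map[String, String])

class BusAdapter(
    systemId: String,
    publish: ChangeMessage => Unit,
    applyLocally: ChangeMessage => Unit) {

  // Writes performed by the adapter itself use a dedicated login, so captured
  // changes attributable to that login are never re-published.
  private val ownLogin = s"bus_adapter_$systemId"

  // Called for each captured local change (trigger-based capture exposes the
  // writing user; plain CDC does not, in which case only origin tagging applies).
  def onLocalChange(writerLogin: String, table: String, payload: Map[String, String]): Unit =
    if (writerLogin != ownLogin)
      publish(ChangeMessage(systemId, table, payload))

  // Called for each message received from the bus: ignore our own echoes.
  def onBusMessage(msg: ChangeMessage): Unit =
    if (msg.originSystem != systemId) applyLocally(msg)
}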
For additional methods of reconciling changes, consider the concepts raised by this Wikipedia article on "eventual consistency". Best of luck with your internal database messaging system.