what does the file Snapshot.scala in databricks? - scala

I am running some streaming query jobs on a databricks cluster, and when i look at the cluster/job logs, I see a lot of
first at Snapshot.scala:1
and
withNewExecutionId at TransactionalWriteEdge.scala:130
A quick search yielded this scala script https://github.com/delta-io/delta/blob/master/src/main/scala/org/apache/spark/sql/delta/Snapshot.scala
Any one can explain what this do in laymans term?

Internally this class manages the replay of actions stored in checkpoint or delta file
Generally, this "snapshotting" relies on delta encoding and indirectly allows snaphot isolation as well.
Practically delta-encoding remembers every side-effectful operation like INSERT DELETE UPDATE that you did since the last checkpoint. In case of delta lake it would be SingleAction (source): AddFile (insert) RemoveFile (delete). Conceptually this approach is close to event-sourcing - without it you'd have to literally store/broadcast whole state (database or directory) on every update. It also employed by many classic ACID databases with replication.
Overall it gives you:
ability to continuously replicate file-system/directory/database state (see SnapshotManagement.update). Basically that's why you see a lot of first at Snapshot.scala:1 - it's called in order to catch up with the log every time you start transaction, see DeltaLog.startTransaction. I couldn't find TransactionalWriteEdge sources, but I guess it's called around the same time.
ability to restore state by replaying every action since the last snapshot.
ability to isolate (and store) transactions by keeping their snapshots apart until commit (every SingleAction has txn in order to isolate). Delta-lake uses optimistic locking for that: transaction commits will fail if their logs are not mergeable, while readers don't see uncommitted actions.
P.S. You can see that the log is accessed in line val deltaData = load(files) and actions are stacked on top of previousSnapshot (val checkpointData = previousSnapshot.getOrElse(emptyActions); val allActions = checkpointData.union(deltaData))

Related

Data syncing with pouchdb-based systems client-side: is there a workaround to the 'deleted' flag?

I'm planning on using rxdb + hasura/postgresql in the backend. I'm reading this rxdb page for example, which off the bat requires sync-able entities to have a deleted flag.
Q1 (main question)
Is there ANY point at which I can finally hard-delete these entities? What conditions would have to be met - eg could I simply use "older than X months" and then force my app to only ever displays data for less than X months?
Is such a hard-delete, if possible, best carried out directly in the central db, since it will be the source of truth? Would there be any repercussions client-side that I'm not foreseeing/understanding?
I foresee the number of deleted's growing rapidly in my app and i don't want to have to store all this extra data forever.
Q2 (bonus / just curious)
What is the (algorithmic) basis for needing a 'deleted' flag? Is it that it's just faster to check a flag rather than to check for the omission of an object from, say, a very large list. I apologize if it's kind of a stupid question :(
Ultimately it comes down to a decision that's informed by your particular business/product with regards to how long you want to keep deleted entities in your system. For some applications it's important to always keep a history of deleted things or even individual revisions to records stored as a kind of ledger or history. You'll have to make a judgement call as to how long you want to keep your deleted entities.
I'd recommend that you also add a deleted_at column if you haven't already and then you could easily leverage something like Hasura's new Scheduled Triggers functionality to run a recurring job that fully deletes records older than whatever your threshold is.
You could also leverage Hasura's permissions system to ensure that rows that have been deleted aren't returned to the client. There is documentation and examples for ways to work with soft deletes and Hasura
For your second question it is definitely much faster to check for the deleted flag on records than to have to try and diff the entire dataset looking for things that are now missing.

How doest Trello store generated actions from updates to other documents (boards, cards) in MongoDB without atomic transactions?

I'm developing a single page web app that will use a NoSQL Document Database (like MongoDB) and I want to generate events when I make a change to my entities.
Since most of these databases support transactions only on a document level (MongoDB just added ASIC support) there is no good way to store changes in one document and then store events from those changes to other documents.
Let's say for example that I have a collection 'Events' and a collection 'Cards' like Trello does. When I make a change to the description of a card from the 'Cards' collection, an event 'CardDescriptionChanged' should be generated.
The problem is that if there is a crash or some error between saving the changes to the 'Cards' collection and adding the event in the 'Events' collection this event will not be persisted and I don't want that.
I've done some research on this issue and most people would suggest that one of several approaches can be used:
Do not use MongoDB, use SQL database instead (I don't want that)
Use Event Sourcing. (This introduces complexity and I want to clear older events at some point, so I don't want to keep all events stored. I now that I can use snapshots and delete older events from the snapshot point, but there is a complexity in this solution)
Since errors of this nature probably won't happen too often, just ignore them and risk having events that won't be saved (I don't want that too)
Use an event/command/action processor. Store commands/action like 'ChangeCardDescription' and use a Processor that will process them and update the entities.
I have considered option 4, but a couple of question occurs:
How do I manage concurrency?
I can queue all commands for the same entity (like a card or a board) and make sure that they are processed sequentially, while events for different entities (different cards) can be processed in parallel. Then I can use processed commands as events. One problem here is that changes to an entity may generate several events that may not correspond to a single command. I will have to break down to very fine-grained commands all user actions so I can then translate them to events.
Error reporting and error handling.
If this process is asynchronous, I have to manage error reporting to the client. And also I have to remove or mark commands that failed.
I still have the problem with marking the commands as processed, as there are no transactions. I know I have to make processing of commands idempotent to resolve this problem.
Since Trello used MongoDB and generates actions ('DeleteCardAction', 'CreateCardAction') with changes to entities (Cards, Boards..) I was wondering how do they solve this problem?
Create a new collection called FutureUpdates. Write planned updates to the FutureUpdates collection with a single document defining the changes you plan to make to cards and the events you plan to generate. This insert will be atomic.
Now take a [ChangeStream][1] of the FutureUpdates collection this will give you the stream of updates you need to make. Take each doc from the change stream and apply the updates. Finally, update the doc in FutureUpdates to mark it as complete. Again this update will be atomic.
When you apply the updates to Events and Cards make sure to include the objectID of the doc used to create the update in FutureUpdates.
Now if the program crashes after inserting the update in FutureUpdates you can check the Events and Cards collections for the existence of records containing the objectID of the update. If they are not present then you can reapply the missing updates.
If the updates have been applied but the FutureUpdate doc is not marked as complete we can update that during recovery to complete the process.
Effectively you are continuously atomically updating a doc for each change in FutureUpdates to track progress. Once an update is complete you can archive the old docs or just delete them.

Could I save Postgres transaction and continue work with db within it later

I know about prepared transaction in Postgres, but seems you can just commit or rollback it later. You cannot even view the transaction's db state before you've committed it. Is any way to save transaction for later use?
What I want to achieve actually is a preview (and correcting) of some changes in db (changes are imports from csv file, so user need to see preview before apply it). I want to make changes, add some changes later, see full state of db and apply it (certainly, commit transaction)
I cannot find a very good reference in docs, but I have a very strong feeling that the answer is: No, you cannot do that.
It would mean that when you "save" the transaction, the database would basically have to maintain all of its locks in place for an indefinite amount of time. Even if it was possible, it would mean horrible failure modes and trouble on all fronts.
For the pattern that you are describing, I would use two separate transactions. Import to a staging table and show that to user (or import to the main table but mark rows as "unapproved"). If user approves, in another transactions move or update these rows.
You can always end up in a situation where user can simply leave or crash without clicking "OK" or "Cancel". If what you're describing was possible, you would end up with a hung transaction holding all these resources. In my proposed solution you end up with wasteful rows in "staging" table that you may still show to user later or remove.
You may want to read up on persistence saga. This is actually a very simple example of a well known and researched problem.
To make the long story short, this pattern breaks down a long-running process like yours into smaller operations that are applied and persisted in some way in separate transactions. If any of them happens to fail (or does not occur as expected), you have compensating actions that usually undo what the steps executed so far have done (e.g. by throwing away stale/irrelevant data).
Here's a decent introduction:
https://blog.couchbase.com/saga-pattern-implement-business-transactions-using-microservices-part/#:~:text=The%20SAGA%20Pattern,completion%20of%20the%20previous%20one.
http://vasters.com/clemensv/2012/09/01/Sagas.aspx
This concept was formally introduced in the 80s, but is well alive and relevant today.

How triggers works internally in SQL Server

Please correct me if I am wrong.
What I know about triggers is that they are triggered by events (Insert, Update, Delete). So we can run a stored procedure etc.. in the trigger.
This will give the application a good responsiveness because the query that the user interacts with, is quite small and this "other" longer time taking stuffs are taken care by the server internally as a separate task.
But I do not know about how the the triggers are handled inside the server. What I exactly want to know is what would happen in scenarios as given below.
Take Insert after trigger. And take trigger is executing a longer stored procedure. Then in the middle of the trigger there can be another insert. What I want to know is what will happen to that second trigger. If possible can I make that second trigger ignore itself.
marc_s has given the correct answer. I will copy it for the sake of completeness.
TRIGGERS ARE SYNCHRONOUS
If you want to have a asynchronous functionality go for a SQL broker implimenation.
Triggers are triggered by events - and then they are executed - right now. Since you cannot control when and how often they are triggered, you should keep the processing in those triggers to an absolute minimum - I always try to make - at most - an entry into another table (an "Audit" table) or possibly put a "marker" row into a "command" table. But the actual processing of that info - running stored procedures etc. - should be left to an outside job - don't do extensive processing in a trigger! This will reliably KILL all your performance\responsiveness.

Preventing update loops for multiple databases using CDC

We have a number of legacy systems that we're unable to make changes to - however, we want to start taking data changes from these systems and applying them automatically to other systems.
We're thinking of some form of service bus (no specific tech picked yet) sitting in the middle, and a set of bus adapters (one per legacy application) to translate between database specific concepts and general update messages.
One area I've been looking at is using Change Data Capture (CDC) to monitor update activity in the legacy databases, and use that information to construct appropriate messages. However, I have a concern - how best could I, as a consumer of CDC information, distinguish changes applied by the application vs changes applied by the bus adapter on receipt of messages - because otherwise, the first update that gets distributed by the bus will get re-distributed by every receiver when they apply that change to their own system.
If I was implementing "poor mans" CDC - i.e. triggers, then those triggers execute within the context/transaction/connection of the original DML statements - so I could either design them to ignore one particular user (the user applying incoming updates from the bus), or set and detect a session property to similar ignore certain updates.
Any ideas?
If I understand your question correctly, you're trying to define a message routing structure that works with a design you've already selected (using an enterprise service bus) and a message implementation that you can use to flow data off your legacy systems that only forward-ports changes to your newer systems.
The difficulty is you're trying to apply changes in such a way that they don't themselves generate a CDC message from the clients receiving the data bundle from your legacy systems. In fact, all you're concerned about is having your newer systems consume the data and not propagate messages back to your bus, creating unnecessary crosstalk that might exponentiate, overloading your infrastructure.
The secret is how MSSQL's CDC features reconcile changes as they propagate through the network. Specifically, note this caveat:
All the changes are logged in terms of LSN or Log Sequence Number. SQL
distinctly identifies each operation of DML via a Log Sequence Number.
Any committed modifications on any tables are recorded in the
transaction log of the database with a specific LSN provided by SQL
Server. The __$operationcolumn values are: 1 = delete, 2 = insert, 3 =
update (values before update), 4 = update (values after update).
cdc.fn_cdc_get_net_changes_dbo_Employee gives us all the records net
changed falling between the LSN we provide in the function. We have
three records returned by the net_change function; there was a delete,
an insert, and two updates, but on the same record. In case of the
updated record, it simply shows the net changed value after both the
updates are complete.
For getting all the changes, execute
cdc.fn_cdc_get_all_changes_dbo_Employee; there are options either to
pass 'ALL' or 'ALL UPDATE OLD'. The 'ALL' option provides all the
changes, but for updates, it provides the after updated values. Hence
we find two records for updates. We have one record showing the first
update when Jason was updated to Nichole, and one record when Nichole
was updated to EMMA.
While this documentation is somewhat terse and difficult to understand, it appears that changes are logged and reconciled in LSN order. Competing changes should be discarded by this system, allowing your consistency model to work effectively.
Note also:
CDC is by default disabled and must be enabled at the database level
followed by enabling on the table.
Option B then becomes obvious: institute CDC on your legacy systems, then use your service bus to translate these changes into updates that aren't bound to CDC (using, for example, raw transactional update statements). This should allow for the one-way flow of data that you seek from the design of your system.
For additional methods of reconciling changes, consider the concepts raised by this Wikipedia article on "eventual consistency". Best of luck with your internal database messaging system.