How to ensure that the variable was modified from only one side - hash

I've got an application with a lot of stateless microservices, which passes their variable context one to another. I've got a case when I'm starting few chains of services with the same context in parallel and then waiting for them to finish. Each service can modify its variable context, but after all of chains is finished I have to merge their variable contexts and ensure there is no conflicts.
It's illustrated in the examples below:
It's possible to solve this problem by storing the whole history of variable modifications, but it's a huge data overhead which I'd like to avoid.
Another solution I see is to find some hashing function, which lets to calculate the hash of modification history by the existing hash and new data, and also lets to check if one history data is prefix of another history data by knowing their hashes only. But I'm unable to find such a function.
I'm looking for any applicable algorithm with has as less data overhead as possible.

What you need are Version clocks, an old idea that can be used to merge paralel data modifications and to detect conflicts.
It's possible to solve this problem by storing the whole history of variable modifications, but it's a huge data overhead which I'd like to avoid.
With vector clocks you don't keep the entire history, but a counter for each variable and node (so each variable has a vector of counters).

Storing the whole history of variable modifications doesn't sound too awful, actually. For example, you can put modification information onto a queue, then, have a service that will process that queue by batch of elements at a time and put the result into one single place.
This is a common approach, for example, in situations when there is huge parallel workload and you can't synchronize access to only one place with a lock.
Later you can even scale out workers that process the queue.

Related

Event Sourcing - How to query inside a command?

We would like to be able to read state inside a command use case.
We could get the state from event store for the specific aggregate, but what about querying aggregates by field(not id) or performing more complicated queries, that are not fitted for the event store?
The approach we were thinking was to use our read model for those cases as well and not only for query use cases.
This might be inconsistent, so a solution could be to have the latest version of the aggregate stored in both write/read models, in order to be able to tell if the state is correct or stale.
Does this make sense and if yes, if we need to get state by Id should we use event store or the read model?
If you want the absolute latest state of an event-sourced aggregate, you're going to have to read the latest snapshot (assuming that you are snapshotting) and then replay events since that snapshot from the event store. You can be aggressive about snapshotting (conceivably even saving a snapshot after every command), but you're giving away some write performance to make the read faster.
Updating the read model directly is conceivably possible, though that level of coupling is something that should be considered very carefully. Note also that you will very likely need some sort of two-phase commit to ensure that the read model is only updated when the write model is updated and vice versa. I strongly suggest considering why you're using CQRS/ES in this project, because you are quite possibly undermining that reason by doing this sort of thing.
In general, if you need a query for processing a particular command, it's likely that query will generally be the same, i.e. you don't need free-form query support. In that case, you can often have a read model that's tuned for exactly that query and which only cares about events which could affect that query: often a fairly small subset of the events. The finer-grained the read model, the easier it is to keep in sync (if it ignores 99% of events, for instance, it can't really fall that far behind).
Needing to make complex queries as part of command processing could also be a sign that your aggregate boundaries aren't right and could do with a re-examination.
Does this make sense
Maybe. Let's start with
This might be inconsistent
Yup, they might be. So what?
We typically respond to a query by sending an unlocked copy of the answer. In other words, it's possible that the actual information in the write model will change after this response is dispatched but before the response arrives at its destination. The client will be looking at a copy of the answer taken from the past.
So we might reasonably ask how much better it is to get information no more than one minute old compared to information no more than five minutes old. If the difference in value is pennies, then you should probably deploy the five minute version. If the difference is millions of dollars, then you're in a good position to negotiate a real budget to solve the problem.
For processing a command in our own write model, that kind of inconsistency isn't usually acceptable or wise. But neither of the two common answers require keeping the read and write models synchronized. The most common answer is to just work with the write model alone. The less common answer is to grab a snapshot out of a cache, and then apply any additional events to it to bring it up to date. The latter approach is "just" a performance optimization (first rule: don't.)
The variation that trips everyone up is trying to process a command somewhere else, enforcing a consistency rule on our data here. Once again, you need a really clear picture of how valuable the consistency is to the business. If it's really important, that may be a signal that the information in question shouldn't be split into two different piles - you may be working with the wrong underlying data model.
Possibly useful references
Pat Helland Data on the Outside Versus Data on the Inside
Udi Dahan Race Conditions Don't Exist

What is the fastest way to persist large complex data objects in Powershell for a short term period?

Case in point - I have a build which invokes a lot of REST API calls and processes the results. I would like to split the monolithic step that does all that into 3 steps:
initial data acquisition - gets data from REST Api. Plain objects, no reference loops or duplicate references
data massaging - enriches the data from (1) with all kinds of useful information. May result in duplicate references (the same object is referenced from multiple places) or reference loops.
data processing
The catch is that there is a lot of data and converting it to json takes too much time to my taste. I did not check the Export-CliXml function, but I think it would be slow too.
If I wrote the code in C# I would use some kind of binary serialization, which should be sophisticated enough to handle reference loops and duplicate references.
Please, note that serialization would write to the build staging directory and would be deserialized almost immediately as soon as the next step runs.
I wonder what are my options in Powershell?
EDIT 1
I would like to clarify what do I mean by steps. This is a build running on a CI build server. Each step runs in a separate shell and is reported individually on the build page. There is no memory sharing between the steps. The only way to communicate between the steps is either through build variables or file system. Of course, using databases is also possible, but it is an overkill.
Build variables are set using certain API and are exposed to the subsequent steps as environment variables. As such they are quite limited in length.
So I am talking about communicating through the file system. I am sacrificing performance here for the sake of build granularity - instead of having one monolithic step I want to have 3 smaller steps. This way the build is more transparent and communicates clearly what it is doing. But I have to temporarily persist payloads between steps. If it is possible to minimize the overhead, then the benefits worth it. If the performance is going to degrade significantly, then I will keep the monolithic step.

google dataprep (clouddataprep by trifacta) tip: jobs will not be able to run if they are to large

During my cloud dataprep adventures I have come across yet another very annoying bug.
The problem occurs when creating complex flow structures which need to be connected through reference datasets. If a certain limit is crossed in performing a number of unions or a joins with these sets, dataflow is unable to start a job.
I have had a lot of contact with support and they are working on the issue:
"Our Systems Engineer Team was able to determine the root cause resulting into the failed job. They mentioned that the job is too large. That means that the recipe (combined from all datasets) is too big, and Dataflow rejects it. Our engineering team is still investigating approaches to address this.
A workaround is to split the job into two smaller jobs. The first run the flow for the data enrichment, and then use the output as input in the other flow. While it is not ideal, this would be a working solution for the time being."
I ran into the same problem and have a fairly educated guess as to the answer. Keep in mind that DataPrep simply takes all your GUI based inputs and translates it into Apache Beam code. When you pass in a reference data set, it probably writes some AB code that turns the reference data set into a side-input (https://beam.apache.org/documentation/programming-guide/). DataFlow will perform a Parellel Do (ParDo) function where it takes each element from a PCollection, stuffs it into a worker node, and then applies the side-input data for transformation.
So I am pretty sure if the reference sets get too big (which can happen with Joins), the underlying code will take an element from dataset A, pass it to a function with side-input B...but if side-input B is very big, it won't be able to fit into the worker memory. Take a look at the Stackdriver logs for your job to investigate if this is the case. If you see 'GC (Allocation Failure)' in your logs this is a sign of not enough memory.
You can try doing this: suppose you have two CSV files to read in and process, file A is 4 GB and file B is also 4 GB. If you kick off a job to perform some type of Join, it will very quickly outgrow the worker memory and puke. If you CAN, see if you can pre-process in a way where one of the files is in the MB range and just grow the other file.
If your data structures don't lend themselves to that option, you could do what the Sys Engs suggested, split one file up into many small chunks and then feed it to the recipe iteratively against the other larger file.
Another option to test is specifying the compute type for the workers. You can iteratively grow the compute type larger and larger to see if it finally pushes through.
The other option is to code it all up yourself in Apache Beam, test locally, then port to Google Cloud DataFlow.
Hopefully these guys fix the problem soon, they don't make it easy to ask them questions, that's for sure.

Session-only memtable updates / merges in RocksDB

RocksDB newbie here.
In runtime, I only use RocksDB to read the data. Sometimes, I need to merge session-specific records from other sources.
I don't want them to be merged into the main database.
I want them to exist only during the session lifetime for that specific session.
I can, of course, use a regular std::vector or something and merge the RocksDB and the other sources, but that would duplicate the data.
I see a bunch of concepts like memtable and merge, which sound like they might be used or exploited. For example, if I can tell memtable to never commit, and just abandon the changes, that should work. Is it doable?
The easiest way is probably to separate them into different column families, and just dropping the one you don't want to persist, when you shut down your application. If you need per-entry lifetime, you will probably have to consider something custom like a RAII-holder class, if you are using c++, which inserts on construction and deletes on destruction. I would still go with a separate column family to have the data cleanly separated, in case of a crash-failure.

scala/akka/stm design for large shared state?

I am new to Scala and Akka and am considering using it to solve a problem. Suppose I have a calculation engine (that searches for a solution). I'd like to parallelize that search both across cpus and across nodes by giving each cpu on each node its own engine instance.
The engine inputs consist of a small number of scalar inputs and a very large hash table. Each engine instance would use its scalar inputs to make some small local change to the hash table, calculate a goodness, then discard its changes (they do not need to be committed/seen by any other engine instance). The goodness value would be returned to some coordinator that would choose among the results.
I was reading some about the STM TransactionalMap as a vehicle for shared state. This seems ideal, but I don't really see any complete examples using it as shared state.
Questions:
Does the actor/stm model seem right for this problem?
Can you show a specific example of how to distribute the shared state? (is it Ref[TransactionalMap[,]] as a message?
Is there anything different about distributing the shared state within a node as opposed to across different nodes?
Inquiring Minds Want to Know,
Allan
In terms of handling shared memory it doesn't sound like STM would be the right fit here because you don't want the changes made in engine instances to commit to the shared copy of the hash table.
Instead, an immutable HashMap might be a better fit. The elements that do not change in the map can be shared by the engine instances with only the differences in each map taking additional memory space.
The actor model would fit very well what you want to do. Set up one actor for each engine instance you want and pass it a message with the scalar values and the hashmap. Then have it return the results to the coordinator.