Taking a snapshot of a sequence/pattern query in Siddhi

I am currently working on a mechanism to extract the intermediate states of CEP queries, so that I can use this information to restore the query state at a later point in time (for example, after node failure).
Extracting intermediate state works fine for window-based queries (e.g., timeBatch) or aggregate-based queries (e.g., sum), given that I can create a custom window/aggregate function and extract the intermediate state (for example, the current events in the window or the current sum).
Even for a sequence/pattern query, I can see how extracting the intermediate state can be useful for restoration purposes. For example, if I have a pattern query where I look for A -> B -> C, and so far, A -> B has already been recognised, I can try extracting this sub-pattern as part of a query snapshot. Later, upon restoration, I only need to search for the events of type C to satisfy the pattern.
So far, I have not come across a viable approach to extracting this information. Does anyone have any ideas on how I can go about this issue? Thanks!

You can start by referring to SnapshotService, which can be used to get the current serialized states of queries.
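As a rough sketch, something like the following might work with the Siddhi Java API; the package names and the exact snapshot()/restore() surface vary between Siddhi releases, so treat this as an assumption to verify against your version rather than a definitive recipe:

import io.siddhi.core.SiddhiAppRuntime;
import io.siddhi.core.SiddhiManager;

public class PatternSnapshotSketch {
    public static void main(String[] args) throws Exception {
        // Pattern query: an event followed by a later event with a higher price.
        String app = "define stream InStream (symbol string, price float); " +
                     "from every e1=InStream -> e2=InStream[price > e1.price] " +
                     "select e1.symbol as symbol1, e2.symbol as symbol2 " +
                     "insert into OutStream;";

        SiddhiManager manager = new SiddhiManager();
        SiddhiAppRuntime runtime = manager.createSiddhiAppRuntime(app);
        runtime.start();

        // ... events flow in; a partial match (e.g. only e1 seen so far) may now exist ...

        // Serialized state of all queries, including partially matched patterns.
        byte[] state = runtime.snapshot();
        // Persist 'state' somewhere durable (file, database, ...).

        // Later, possibly on another node: rebuild the runtime and restore the state,
        // so only the remaining part of the pattern still has to be matched.
        SiddhiAppRuntime restored = manager.createSiddhiAppRuntime(app);
        restored.start();
        restored.restore(state);
    }
}

As far as I know, these calls delegate internally to the SnapshotService mentioned above.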

Related

DDD, Event Sourcing, and the shape of the Aggregate state

I'm having a hard time understanding the shape of the state that's derived by applying an entity's events vs. a projection of that entity's data.
Is an Aggregate's state ONLY used for determining whether or not a command can successfully be applied? Or should that state be usable in other ways?
An example - I have a Post entity for a standard blog post. I might have events like postCreated, postPublished, postUnpublished, etc. For the projections that I'll be persisting in my read tables, I need a projection for the base posts (which will include all posts, regardless of status, with lots of detail) as well as a published_posts projection (which will only represent posts that are currently published, with only the information necessary for rendering).
In the situation above, is my aggregate state ONLY supposed to be used to determine, for example, if a post can be published or unpublished, etc? If this is the case, is the shape of my state within the aggregate purely defined by what's required for these validations? For example, in my base post projection, I want to have a list of all users that have made a change to the post. In terms of validation for the aggregate/commands, I couldn't care less about the list of users that have made changes. Does that mean that this list should not be a part of my state within my aggregate?
TL;DR: yes - limit the "state" in the aggregate to that data that you choose to cache in support of data change.
In my aggregates, I distinguish two different ideas:
the history, aka the sequence of events that describes the changes in the lifetime of the aggregate
the cache, aka the data values we tuck away because querying the event history every time kind of sucks.
There's not a lot of value in caching results that we are never going to use.
One of the underlying lessons of CQRS is that we don't need aggregates everywhere:
An AGGREGATE is a cluster of associated objects that we treat as a unit for the purpose of data changes. -- Evans, 2003
If we aren't changing the data, then we can safely work directly with immutable copies of the data.
The only essential purpose of the aggregate is to determine what events, if any, need to be applied to bring the aggregate's state in line with a command (if the aggregate can be brought so in line). All state that's not needed for that purpose can be offloaded to a read-side, which can be thought of as a remix of the event stream (with each read-side only maintaining the state it needs).
That said, there are, in practice, reasons to use the aggregate state directly, the primary one being a desire for stronger consistency for the aggregate: CQRS is inherently eventually consistent. As with all questions of consistent updates, it's important to recognize that consistency isn't free and very often isn't even cheap; I tend to think of a project as having a consistency budget, and I'm pretty miserly about spending it.
In your case, there's probably no reason to include the list of users changing a post in the aggregate state, unless e.g. there's something like "no single user can modify a given post more than n times".
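As a rough, hypothetical sketch (the class and event names below are made up for illustration, not taken from any framework), the aggregate only folds events into the data it needs to validate commands, while richer views such as the list of editors are derived on the read side from the same event stream:

import java.util.ArrayList;
import java.util.List;

// Hypothetical events for the blog-post example.
interface PostEvent {}
record PostCreated(String postId) implements PostEvent {}
record PostPublished(String postId) implements PostEvent {}
record PostUnpublished(String postId) implements PostEvent {}

// The aggregate's cached "state" is only what command validation needs:
// whether the post exists and whether it is currently published.
// The list of users who edited the post is deliberately NOT kept here;
// a read-side projection derives that from the same events.
final class PostAggregate {
    private boolean exists;
    private boolean published;

    static PostAggregate replay(List<PostEvent> history) {
        PostAggregate a = new PostAggregate();
        history.forEach(a::apply);
        return a;
    }

    private void apply(PostEvent e) {
        if (e instanceof PostCreated) exists = true;
        else if (e instanceof PostPublished) published = true;
        else if (e instanceof PostUnpublished) published = false;
    }

    // Command handler: decide which events, if any, bring the aggregate in line.
    List<PostEvent> handlePublish(String postId) {
        if (!exists || published) {
            return new ArrayList<>();      // command rejected or already satisfied
        }
        return List.of(new PostPublished(postId));
    }
}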

Materialize partial set of results with EF Core 2.1

Let's say I have a large collection of tasks stored in the DB and I want to retrieve the latest one according to the requesting user's permissions. The permission-checking logic is complex and not related to the persistence layer, hence I can't put it in an SQL query. What I'm doing today is retrieving ALL tasks from the DB ordered by descending date, then filtering them by the permission set and taking the first one. Not a perfect solution: I retrieve thousands of objects when I need only one.
My question is: how can I materialize objects coming from the DB only until I find one that matches my criteria, and discard the rest of the results?
I thought about one solution, but couldn't find information regarding EF Core behavior in this case and don't know how to check it myself:
Build the IQueryable, cast it to IEnumerable, then iterate over it and take the first good task. I know that the IQueryable part will be executed on the server and the IEnumerable part on the client, but I don't know whether all tasks will be materialized before FilterByPermissions is applied, or whether they will be pulled on demand. I also don't like the synchronous nature of this solution.
// Build the deferred query (nothing is executed against the database yet).
IQueryable<MyTask> allTasksQuery = ...;
// Switch to client-side (LINQ to Objects) evaluation from here on.
IEnumerable<MyTask> allTasksEnumerable = allTasksQuery.AsEnumerable();
// Apply the permission filter implemented in application code.
IEnumerable<MyTask> filteredTasks = FilterByPermissions(allTasksEnumerable);
// Take the first task that passes the filter.
MyTask latestTask = filteredTasks.FirstOrDefault();
A workaround could be retrieving small pages of data (50 rows, for example) until one good task is found, but I don't like it.

Detecting concurrent data modification of document between read and write

I'm interested in a scenario where a document is fetched from the database, some computations are run based on some external conditions, one of the fields of the document gets updated and then the document gets saved, all in a system that might have concurrent threads accessing the DB.
To make it easier to understand, here's a very simplistic example. Suppose I have the following document:
{
  ...
  items_average: 1234,
  last_10_items: [10, 2187, 2133, ...]
  ...
}
Suppose a new item (X) comes in; five things will need to be done:
read the document from the DB
remove the first (oldest) item in the last_10_items
add X to the end of the array
re-compute the average* and save it in items_average.
write the document to the DB
* NOTE: the average computation was chosen as a very simple example, but the question should take into account more complex operations based on data existing in the document and on new data (i.e. not something solvable with the $inc operator)
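For concreteness, here is a minimal sketch of that read-modify-write cycle with the MongoDB Java driver (the collection and field names simply follow the example above; this is the naive, race-prone version):

import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.ArrayList;
import java.util.List;

import static com.mongodb.client.model.Filters.eq;

public class NaiveReadModifyWrite {

    // Hypothetical helper: performs steps 1-5 for a new item X.
    public static void addItem(MongoCollection<Document> stats, Object docId, int newItem) {
        // 1. read the document from the DB
        Document doc = stats.find(eq("_id", docId)).first();

        @SuppressWarnings("unchecked")
        List<Integer> last10 = new ArrayList<>((List<Integer>) doc.get("last_10_items"));

        // 2. remove the first (oldest) item, 3. add X to the end of the array
        if (!last10.isEmpty()) {
            last10.remove(0);
        }
        last10.add(newItem);

        // 4. re-compute the average and store it in items_average
        double avg = last10.stream().mapToInt(Integer::intValue).average().orElse(0);
        doc.put("last_10_items", last10);
        doc.put("items_average", avg);

        // 5. write the document back; between step 1 and this write, a concurrent
        //    writer's changes can be silently overwritten
        stats.replaceOne(eq("_id", docId), doc);
    }
}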
This is certainly easy to implement in a single-threaded system, but in a concurrent system, if two threads follow the above steps at the same time, inconsistencies might occur: both will update the last_10_items and items_average values without noticing, and thus overwriting, each other's changes.
So, my question is: how can such a scenario be handled? Is there a way to check, or react upon, the fact that the underlying document was changed between steps 1 and 5? Is there something like WATCH from Redis or a 'concurrent modification' error as in relational DBs?
Thanks
Database systems use an inspection-and-rollback scheme that is similar to transactional memory.
Briefly speaking, the system monitors the shared memory locations you specify and applies something like compare-and-swap, load-link/store-conditional, or test-and-set.
Therefore, if any of that memory content is changed during the transaction, it will abort and try again until there is no conflicting operation on that shared memory.
For example, GCC implements the following:
https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
type __sync_lock_test_and_set (type *ptr, type value, ...)
type __sync_val_compare_and_swap (type *ptr, type oldval, type newval, ...)
For more info about transactional memory, see:
http://en.wikipedia.org/wiki/Software_transactional_memory
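To make the abort-and-retry idea concrete in application code, here is a minimal Java sketch (Java 16+) using AtomicReference.compareAndSet; it is only an analogy for the document update above, not MongoDB-specific code:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Immutable snapshot of the two derived fields from the example document.
record Stats(List<Integer> last10, double average) {}

public class CasRetrySketch {
    private static final AtomicReference<Stats> SHARED =
            new AtomicReference<>(new Stats(List.of(10, 2187, 2133), 1234));

    static void addItem(int newItem) {
        while (true) {
            Stats current = SHARED.get();

            // Recompute everything from the value we actually read...
            List<Integer> updated = new ArrayList<>(current.last10());
            if (!updated.isEmpty()) {
                updated.remove(0);                       // drop the oldest item
            }
            updated.add(newItem);
            double avg = updated.stream().mapToInt(Integer::intValue).average().orElse(0);
            Stats next = new Stats(List.copyOf(updated), avg);

            // ...and only install it if nobody changed it in the meantime.
            if (SHARED.compareAndSet(current, next)) {
                return;                                  // no conflict: done
            }
            // Conflict detected: abort this attempt and retry with fresh data.
        }
    }
}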

How to handle application death and other mid-operation faults with Mongo DB

Since Mongo doesn't have transactions that can be used to ensure that nothing is committed to the database unless it's consistent (non-corrupt) data, if my application dies between making a write to one document and making a related write to another document, what techniques can I use to remove the corrupt data and/or recover in some way?
The greater idea behind NoSQL was to use a carefully modeled data structure for a specific problem, instead of hitting every problem with a hammer. That is also true for transactions, which should be referred to as 'short-lived transactions', because the typical RDBMS transaction hardly helps with 'real', long-lived transactions.
The kind of transaction supported by RDBMSs is often required only because the limited data model forces you to store the data across several tables, instead of using embedded arrays (think of the typical invoice / invoice items examples).
In MongoDB, try to use write-heavy, de-normalized data structures and keep data in a single document, which improves read speed and data locality and ensures consistency. Such a data model is also easier to scale, because a single read only hits a single server, instead of having to collect data from multiple sources.
However, there are cases where the data must be read in a variety of contexts and de-normalization becomes unfeasible. In that case, you might want to take a look at Two-Phase Commits or choose a completely different concurrency approach, such as MVCC (in a sentence, that's what the likes of svn, git, etc. do). The latter, however, is hardly a drop-in replacement for RDBMs, but exposes a completely different kind of concurrency to a higher level of the application, if not the user.
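For instance, the invoice / invoice-items case mentioned above can be modelled as one document, so the whole write is atomic (a sketch with the MongoDB Java driver; the collection and field names are just illustrative):

import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.List;

public class EmbeddedInvoice {
    public static void insertInvoice(MongoCollection<Document> invoices) {
        Document invoice = new Document("_id", "invoice-42")
                .append("customer", "ACME")
                .append("items", List.of(
                        new Document("sku", "A-1").append("qty", 2).append("price", 9.99),
                        new Document("sku", "B-7").append("qty", 1).append("price", 24.50)));

        // A single insert of a single document is atomic in MongoDB,
        // so the invoice and its items can never be half-written.
        invoices.insertOne(invoice);
    }
}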
Thinking about this myself, I want to identify some categories of effects:
Your operation has only one database save (saving data into one document)
Your operation has two database saves (updates, inserts, or deletions), A and B
They are independent
B is required for A to be valid
They are interdependent (A is required for B to be valid, and B is required for A to be valid)
Your operation has more than two database saves
I think this is a full list of the general possibilities. In case 1, you have no problem - one database save is atomic. In case 2.1, same thing, if they're independent, they might as well be two separate operations.
For case 2.2, if you do A first then B, at worst you will have some extra data (B data) that will take up space in your system, but otherwise be harmless. In case 2.3, you'll likely have some corrupt data in the event of a catastrophic failure. And case 3 is just a composition of case 2s.
Some examples for the different cases:
1.0. You change a car document's color to 'blue'
2.1. You change the car document's color to 'red' and the driver's hair color to 'red'
2.2. You create a new engine document and add its ID to the car document
2.3.a. You change your car's 'gasType' to 'diesel', which requires changing your engine to a 'diesel' type engine.
2.3.b. Another example of 2.3: You hitch car document A to another car document B, A getting the "towedBy" property set to B's ID, and B getting the "towing" property set to A's ID
3.0. I'll leave examples of this to your imagination
In many cases, it's possible to turn a 2.3 scenario into a 2.2 scenario. In the 2.3.a example, the car document and the engine are separate documents. Let's ignore the possibility of putting the engine inside the car document for this example. It's invalid both to have a diesel engine with non-diesel gas and to have a non-diesel engine with diesel gas. So they both have to change. But it may be valid to have no engine at all and have diesel gas. So you could add a step that keeps the whole thing valid at all points: first remove the engine, then replace the gas, then change the type of the engine, and lastly add the engine back onto the car.
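A sketch of that ordering with the MongoDB Java driver might look like this (the field and collection names are assumptions based on the example, with the engine kept in its own document):

import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.set;
import static com.mongodb.client.model.Updates.unset;

public class DieselConversion {
    public static void convert(MongoCollection<Document> cars,
                               MongoCollection<Document> engines,
                               String carId, String engineId) {
        // 1. Detach the engine: a car with no engine and any gasType is valid.
        cars.updateOne(eq("_id", carId), unset("engineId"));
        // 2. Switch the gas type while there is no engine to contradict it.
        cars.updateOne(eq("_id", carId), set("gasType", "diesel"));
        // 3. Change the engine type in its own document.
        engines.updateOne(eq("_id", engineId), set("type", "diesel"));
        // 4. Re-attach the (now diesel) engine to the car.
        cars.updateOne(eq("_id", carId), set("engineId", engineId));
    }
}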
If you will get corrupt data from a 2.3 scenario, you'll want a way to detect the corruption. In example 2.3.b, things might break if one document has the "towing" property, but the other document doesn't have a corresponding "towedBy" property. So this might be something to check after a catastrophic failure. Find all documents that have "towing" but the document with the id in that property doesn't have its "towedBy" set to the right ID. The choices there would be to delete the "towing" property or set the appropriate "towedBy" property. They both seem equally valid, but it might depend on your application.
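Such a consistency check could be scripted roughly like this (again with the MongoDB Java driver and made-up field names following example 2.3.b):

import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Filters.exists;

public class TowingRepairScan {
    public static void scan(MongoCollection<Document> cars) {
        // Find every car that claims to be towing something...
        for (Document tower : cars.find(exists("towing"))) {
            Object towedId = tower.get("towing");
            Document towed = cars.find(eq("_id", towedId)).first();
            // ...and check that the other side points back at it.
            if (towed == null || !tower.get("_id").equals(towed.get("towedBy"))) {
                System.out.println("Inconsistent towing link on car " + tower.get("_id"));
                // Repair policy is application-specific: either clear "towing"
                // on this car, or set "towedBy" on the other one.
            }
        }
    }
}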
In some situations, you might be able to find corrupt data like this, but you won't know what the data was before those things were set. In those cases, setting a default is probably better than nothing. Some types of corruption are better than others (particularly the kind that will cause errors in your application rather than simply incorrect display data).
If the above kind of code analysis or corruption repair becomes unfeasible, or if you want to avoid any data corruption at all, your last resort would be to take mnemosyn's suggestion and implement Two-Phase Commits, MVCC, or something similar that allows you to identify and roll back changes in an indeterminate state.

Couchbase: Synchronizing views with a bucket

I've got a question for you Couchbase pros: is it possible to synchronize a subset of documents (e.g. the documents within a view) with another bucket?
So that the other bucket documents are always a direct subset of the "master" bucket?
If so, isn't that too expensive in terms of performance? Or does Couchbase have any functionality to only create deep links to the documents instead of copying them?
Alternatively: is it possible to write views on views?
Thank you in advance!
--- EDIT ----
Let's say I want to have two sets (buckets) of documents, S1 and S2. S2 is a subset of S1. Each set contains the same views V1, V2 and V3, since I want to be able to query any of them with the same logic/interface. In my case, set S2 is built per user/company/store/whatever; in production there should be around 1000 such subsets S2. To stay abstract, let's call them S2a, S2b and S2c.
The selection of documents to be contained in any subset is done by a filtering instance (for example a view). Let's call these filtering instances F1 (filtering S1 down to S2), hence F1a, F1b and F1c.
So with my current knowledge of Couchbase, this results in the following design/view architecture: I've got the three "base" views V1, V2 and V3, and to realize S2a, S2b and S2c I must create the design views S2aV1, S2aV2, S2aV3, S2bV1, S2bV2, etc. (9 views).
One could say "Well, choose your keys wisely and you can avoid the sub-views", but in my opinion this isn't that easy, because of the following circumstances: in the worst case the filter parameters change every minute and contain many WHERE IN constraints, which (from my current point of view) cannot be handled efficiently by querying key/value lists.
This leads to the following thoughts and the question I initially asked: if I use the same views in every subset (defined by a filter), shouldn't it be possible to build an entity which helps me handle complex filtering? For example, a function which is called at runtime while generating the view output. This could look like /design/view?filter=F1 or something like that.
Or do you have any other ideas to solve this problem? Or should I use SQL since it's more capable of handling frequently changing filters?
Generally speaking, for most models you don't really need bucket "subsets". Is there a particular reason you are trying to do this, and why would you want that data broken out? You can also query your views, or, instead of a view on a view, you can just make a separate view that maps/filters further based on your needs (i.e. does the same job as a view on a view).
We are working on Elastic Search integration; that might be a better fit for your use case.
I think what you want to do is write a view on your original bucket, and then copy the key/values from that view, to be documents in a new bucket.
It shouldn't be hard to write an automated framework for managing this so that you can keep the derived data up to date in near real time.
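A rough sketch of that copy step with the Couchbase Java SDK 2.x could look like the following; the design document, view, and bucket names are placeholders, and the SDK 3.x API differs, so verify this against the version you actually use:

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.view.ViewQuery;
import com.couchbase.client.java.view.ViewResult;
import com.couchbase.client.java.view.ViewRow;

public class ViewToBucketCopy {
    public static void main(String[] args) {
        Cluster cluster = CouchbaseCluster.create("127.0.0.1");
        Bucket master = cluster.openBucket("master");
        Bucket subset = cluster.openBucket("subset");

        // Query the filtering view (F1) on the master bucket...
        ViewResult result = master.query(ViewQuery.from("filters", "f1"));
        for (ViewRow row : result.allRows()) {
            // ...and copy each matching document into the subset bucket.
            JsonDocument doc = row.document();
            if (doc != null) {
                subset.upsert(JsonDocument.create(doc.id(), doc.content()));
            }
        }

        cluster.disconnect();
    }
}

Running something like this periodically, or from a change feed, would keep the derived bucket a (slightly delayed) subset of the master bucket.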