Passing and Handling Large Objects in Workflows

I have to pass a fairly large object/file to a workflow when it starts (on the order of hundreds of MBs). I am using secondary storage to dump the object so that as little of it as possible is in RAM at any one time on the workflow side. Is there another, more efficient way to pass and handle the object? Does WF provide any built-in facility for such situations?

What about passing a URI to that object instead?
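A minimal sketch of that idea (sometimes called the claim-check pattern), assuming a WF4 CodeActivity and a hypothetical payload location: only the URI travels with the workflow instance, and the payload is streamed in small chunks so it is never fully materialized in RAM.

using System;
using System.Activities;
using System.IO;

// Hypothetical activity: receives a URI/path instead of the payload itself.
public sealed class ProcessLargePayload : CodeActivity
{
    public InArgument<string> PayloadUri { get; set; }

    protected override void Execute(CodeActivityContext context)
    {
        string uri = context.GetValue(PayloadUri);

        // Stream in 80 KB chunks so only one buffer is in RAM at a time.
        using (var stream = File.OpenRead(uri))
        {
            var buffer = new byte[81920];
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                // process buffer[0..read) here
            }
        }
    }
}

The workflow input is then just a short string (e.g. a PayloadUri argument passed via WorkflowInvoker.Invoke), regardless of how large the underlying file is.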

How to store sagas’ data?

From what I read, aggregates must only contain properties which are used to protect their invariants.
I also read that sagas can be aggregates, which makes sense to me.
Now I have modeled a registration process using a saga: on a RegistrationStarted event it sends a ReserveEmail command, which will trigger an EmailReserved or EmailReservationFailed event depending on whether the email is free or not. A listener will then either send a validation link or a message telling the user that an account already exists.
I would like to use data from the RegistrationStarted event in this listener (say the IP and user-agent). How should I do it?
Storing this data in the saga? But it's not used to protect invariants.
Pushing it through the ReserveEmail command and the resulting event? Sounds tedious.
Projecting the saga to the read model? What about eventual consistency?
Another way?
Rinat Abdullin wrote a good overview of sagas / process managers.
The usual answer is that the saga has copies of the events that it cares about, and uses the information in those events to compute the command messages to send.
List[Command] processManager(List[Event] events)
Pushing it through the ReserveEmail command and the resulting event?
Yes, that's the usual approach; we get a list [RegistrationStarted], and we use that to calculate the result [ReserveEmail]. Later on, we'll get [RegistrationStarted, EmailReserved], and we can use that to compute the next set of commands (if any).
Sounds tedious.
The data has to travel between the two capabilities somehow. So you are either copying the data from one message to another, or you are copying a correlation identifier from one message to another and then allowing the consumer to decide how to use the correlation identifier to fetch a copy of the data.
Storing this data in the saga? But it's not used to protect invariants.
You are typically going to be storing events in the sagas (to keep track of what has happened). That gives you a copy of the data provided in the event. You don't have an invariant to protect because you are just caching a copy of a decision made somewhere else. You won't usually have the process manager running queries to collect additional data.
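A rough sketch of that shape in C# (assuming C# 9 records; all message types here are hypothetical): the process manager folds over the events it has stored and copies the data it needs, such as the IP and user-agent, from RegistrationStarted into the commands it emits.

using System.Collections.Generic;
using System.Linq;

// Hypothetical message types for illustration.
public record RegistrationStarted(string Email, string Ip, string UserAgent);
public record EmailReserved(string Email);
public record ReserveEmail(string Email, string Ip, string UserAgent);
public record SendValidationLink(string Email, string Ip, string UserAgent);

public static class RegistrationProcess
{
    // The List[Command] processManager(List[Event] events) shape from above.
    public static List<object> Handle(IReadOnlyList<object> events)
    {
        var started = events.OfType<RegistrationStarted>().FirstOrDefault();
        if (started == null)
            return new List<object>();

        // The saga's "state" is just copies of past events, so the
        // IP and user-agent travel inside the messages themselves.
        if (!events.OfType<EmailReserved>().Any())
            return new List<object> { new ReserveEmail(started.Email, started.Ip, started.UserAgent) };

        return new List<object> { new SendValidationLink(started.Email, started.Ip, started.UserAgent) };
    }
}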
What about eventual consistency?
By their nature, sagas are always going to be "eventually consistent"; the "state" of an instance of a saga is just cached copies of data controlled elsewhere. The data is probably nanoseconds old by the time the saga sees it; there's no point in pretending that the data is "now".
If I understand correctly, I could model my saga as a Registration aggregate storing all the events whose correlation identifier is its own identifier?
Udi Dahan, writing about CQRS:
Here’s the strongest indication I can give you to know that you’re doing CQRS correctly: Your aggregate roots are sagas.

Drools 6 Fusion Notification

We are working on a very complex solution using Drools 6 (Fusion), and I would like your opinion about the best way to read the objects created as correlation results over time.
My first basic approach was to read working memory at regular intervals, looking for new objects and reporting them to an external service (REST).
An AgendaEventListener does not seem to be the "best" approach because I don't care about most of the objects being inserted into working memory, so maybe the best approach would be to inject the particular "object" into some sort of service inside the DRL. Is this a good approach?
You have quite a lot of options. In decreasing order of my preference:
AgendaEventListener is probably the solution requiring the smallest amount of LOC. It might be useful for other tasks as well; all you have on the negative side is one additional method call and a class test per inserted fact. Peanuts.
You can wrap the insert macro in a DRL function and collect inserted facts of class X in a global List. The problem you have here is that you'll have to pass the KieContext as a second parameter to the function call.
If the creation of a class X object is inevitably linked with its insertion into working memory, you could register new objects in a static List inside class X, in a factory method (or the constructor).
I'm putting your "basic approach" last because it requires many more cycles than the listener (#1) and tons of overhead for maintaining the set of X objects that have already been sent to the REST service.

EF - multiple includes to eager load hierarchical data. Bad practice?

I need to eager-load a hierarchical structure so that I can recursively iterate through it. The eager loading is necessary to prevent multiple DB queries while traversing the tree. It seems the consensus is that you can't eager-load an infinite number of levels of the tree, so I did something like
var item = db.ItemHierarchies
    .Include("Children.Children.Children.Children.Children")
    .Where(x => x.condition == condition);
to load five levels of children. This seems to get the job done. I'm wondering: what is the drawback to doing this? If there is none, could I theoretically add 50 levels of Includes here without slowing things down?
I recommend taking a look at the SQL that is generated as you add eager loading to your query.
var item = db.ItemHierarchies
    .Include("Children")
    .Include("Children.Children")
    .Include("Children.Children.Children")
    .Include("Children.Children.Children.Children")
    .Include("Children.Children.Children.Children.Children");

var sql = ((System.Data.Objects.ObjectQuery) item).ToTraceString();
// http://visualstudiomagazine.com/blogs/tool-tracker/2011/11/seeing-the-sql.aspx
You'll see that the SQL quickly gets very big and complicated and can potentially have serious performance implications. You'd do well to limit your eager loading to data that you are certain you will need and to consider using explicit loading for some of the related entities - especially if you're working with connected entities in which case you can explicitly load collection properties when they're needed.
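For comparison, a sketch of explicit loading with a DbContext (assuming an EF5/EF6-style model where each item has a Children collection): each Load() issues a separate, small query, and you only pay for the levels your traversal actually visits.

var root = db.ItemHierarchies
    .First(x => x.condition == condition);

// Load one level on demand instead of eager-loading five levels up front.
db.Entry(root).Collection(x => x.Children).Load();

foreach (var child in root.Children)
{
    // Descend only where the traversal actually goes.
    db.Entry(child).Collection(x => x.Children).Load();
}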
Also note that you may not need multiple separate Includes. For example, the following needs to be separate Includes because they're addressing separate properties (Widgets and Spanners) of the root.
var item = db.ItemHierarchies
    .Include("Widgets")
    .Include("Spanners.Flanges");
But the following isn't necessary:
var item = db.ItemHierarchies
    .Include("Widgets") // This isn't necessary.
    .Include("Widgets.Flanges"); // This loads both Widgets and Flanges.
Well, honestly, it's an extremely bad practice.
Let's assume you had 50 objects in your root and 50 per level.
You may end up retrieving on the order of 50^5 = 312,500,000 "capsules" of information.
Now, one might ask: "So what is wrong with that?!"
I mean, if that is what is required, then why not do it?
Rule #1: we develop software that is to be used by human beings.
And the fact is that no human is capable of glancing at 312,500,000 items of information at once and learning or concluding anything beneficial from it (except, perhaps, that looking at it doesn't help).
Rule #2: UI should be based on what is needed and not what is possible.
And since we have already established that showing 312,500,000 capsules of data is not needed, there is no reason to bring it all at once.
And now you might come forward and say: "But I don't care about the UI, really! All I need is to iterate over that data in order to process some information!"
In that case you would probably want to save your results somewhere for future reference, but that means it's a batch job. So why not apply batch-job rules to it, like processing it item by item? That may also give you the benefit of splitting the work between even more machines if needed.
So you see, no matter which path you choose, there should be no reason to do it.
(Which is the definition of a bad practice.)
Update:
After reading interesting concerns in the comments, I would like to update this answer with more analysis:
Deciding what is a bad practice must always be done in reference to what is to be achieved, or to the role of each part of the system. In the current situation (after reading the comments) it has been stated or implied that the data storage is actually a persistence medium for objects, as opposed to the different concept where the data is the 'heart' of the application.
We can define two types of data:
1) Data-centric: used in data-centric applications such as banks, CRM, ERP, websites or other service-based solutions.
2) Data-persistence medium: used as data to be saved for when the application is not active, for example any simple app's save file or any game's save file.
The main difference is that a data-persistence medium is accessed by only a single instance of the app at a single point in time, meaning the data is not designed to be shared by many instances. If the data is to be shared, we are dealing with a data-centric application.
If your app just needs a data-persistence medium, loading all the information cannot be considered a bad practice, but you still need to make sure you are not exploding the memory. And in that frame of work, SQL Server might not be what you need or the best tool to use.
In the other case, a data-centric application, my original answer remains: it is a bad practice to bring all the information per instance of the application.

Updating last accessed time when separating Commands and Queries

Consider a function: IsWalletValid(walletID). It returns true if the walletID exists in the database, and updates a 'last_accessed_time' field.
A task runs periodically to remove any wallets that have not been accessed for a set period of time.
Seems like an easy solution for what we want to do, but IsWalletValid() has a side effect because it writes to the database.
Should we add an additional 'UpdateLastAccessedTime(walletID)' function? Every time we call IsWalletValid() we will also need to remember to call UpdateLastAccessedTime(walletID).
Do verifying that a wallet is valid and updating its last_accessed_time field need to be transactionally consistent (ACID)? You could use eventual consistency here:
The method IsWalletValid publishes a WalletAccessed event, and an event handler then updates last_accessed_time asynchronously.
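A minimal sketch of that split, with hypothetical repository and bus abstractions: the query itself only reads, and the timestamp write happens in a handler off the request path.

using System;

public record WalletAccessed(Guid WalletId, DateTime AccessedAt);

public interface IWalletRepository
{
    bool Exists(Guid walletId);
    void UpdateLastAccessedTime(Guid walletId, DateTime when);
}

public interface IEventBus
{
    void Publish(object @event);
}

public class WalletService
{
    private readonly IWalletRepository wallets;
    private readonly IEventBus bus;

    public WalletService(IWalletRepository wallets, IEventBus bus)
    {
        this.wallets = wallets;
        this.bus = bus;
    }

    public bool IsWalletValid(Guid walletId)
    {
        bool exists = wallets.Exists(walletId);   // pure read

        if (exists)
            bus.Publish(new WalletAccessed(walletId, DateTime.UtcNow));

        return exists;
    }
}

// Subscribed elsewhere; runs asynchronously, so the write
// never blocks or fails the validity check itself.
public class WalletAccessedHandler
{
    private readonly IWalletRepository wallets;

    public WalletAccessedHandler(IWalletRepository wallets)
    {
        this.wallets = wallets;
    }

    public void Handle(WalletAccessed e) =>
        wallets.UpdateLastAccessedTime(e.WalletId, e.AccessedAt);
}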
If last_accessed_time is not read by domain logic to make decisions on any write handling, this could just be a facet of the read-only projection. This seems to be the same concern as other, more verbose read-audit concerns. Just because data is being written and maintained doesn't mean that it necessarily needs to be part of the write model of the system. If you did, however, want to implement this as part of the domain, and perhaps store it within the same event store, it could be considered a separate auditing context outside the boundary of the original aggregate being audited.

When to truncate strings longer than the storage location allows?

Let's say I have a function that inserts records into a database table with string fields of limited length. In general, at what point should I truncate strings that are too long for the storage location: in the insert function itself, or at every point in the code where it's called?
(I'm assuming here that truncation of strings that are too long is more desirable than having an exception thrown.)
I think it depends on where the function is and how accessible it is.
If it's a private function that just forms part of your own SQL library, then you can probably get away with truncating the string inside the function.
If it's in a library that, say, your whole team at work uses, then perhaps you need to at least validate the string before attempting to insert it.
If it's a public API, then you shouldn't be silently truncating anything; throw a meaningful exception instead.
This should sit in the insert function; it's specific to the database implementation rather than the calling application. If you later change your data structure, you don't want to have to go back through all the client code to ensure the full string is used.
As per Widor, but may I also add:
Your application should ideally be structured so that there is a distinct data layer that separates the rest of your code from the database and its implementation logic.
In high traffic systems you will ideally want to limit the amount of data passing back and forth between the database and your code, hence data validation should be performed by your data layer BEFORE passing it on to your database. It is here that you can raise a meaningful exception for your business logic to handle.
The object data presented by the data layer need bear no relation to what is actually stored in or by the database. For instance it may present a data object class that is actually a composite of data stored in several tables.
The data layer itself can be structured in such a way that it can handle different database implementations.
I have used a factory pattern in the past that has allowed me to switch between SQL, MySQL databases, XML file storage and compiled test data as required at runtime without the need for recompilation.
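A sketch of what such a factory can look like (all names here are hypothetical): the rest of the code depends only on the interface, so the backing store can be chosen at runtime, for example from configuration.

// Hypothetical abstraction over the storage implementations.
public interface IDataStore
{
    void Save(string key, string value);
}

public sealed class XmlFileDataStore : IDataStore
{
    public void Save(string key, string value) { /* write to an XML file */ }
}

public sealed class InMemoryTestDataStore : IDataStore
{
    public void Save(string key, string value) { /* keep in memory for tests */ }
}

public static class DataStoreFactory
{
    // Chosen at runtime (e.g. from configuration); no recompilation needed.
    public static IDataStore Create(string kind)
    {
        if (kind == "xml")
            return new XmlFileDataStore();
        return new InMemoryTestDataStore();
    }
}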
Edit:
Your application's data layer is the interface between your application code (e.g. business logic and GUI) and your database.
The business logic will trigger the data layer to write your string to the database.
In your example, the data layer contains your insert function.
You can validate the string, truncate it if it is too long, and then update the database (through a stored procedure call or a direct write, for instance) within that function if you wish.
In reality you'll have many strings that will have to be restricted to the same length, so it is advisable to have the validation performed by a separate function to avoid duplicating code.
Also you may wish to validate/truncate the string and notify the user/calling code of this without writing the data to the database.
Essentially, though, this is performed by your application's data-layer code, which may be encapsulated within a class library/DLL for instance, and not left to the database to handle, nor to the business logic (other than to react to any error event/response fed back).
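A sketch of that shape (assuming C# 7 and a hypothetical 50-character column): the length rule lives in one place in the data layer, and the caller can be told about truncation without any database round trip.

using System;

public class ItemDataLayer
{
    private const int NameColumnWidth = 50;   // assumed column width

    // Validation/truncation lives in one place, before any database call.
    private static string Fit(string value, int maxLength, out bool truncated)
    {
        truncated = value != null && value.Length > maxLength;
        return truncated ? value.Substring(0, maxLength) : value;
    }

    public void InsertRecord(string name)
    {
        string stored = Fit(name, NameColumnWidth, out bool truncated);

        if (truncated)
        {
            // Notify the user/calling code here, e.g. via a return value,
            // an event, or a warning log entry.
            Console.WriteLine("'name' was truncated to {0} characters.", NameColumnWidth);
        }

        // ... stored-procedure call or direct write using 'stored' ...
    }
}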