Trying to determine the best design for this workflow (C# 3.0)

Input server - files of type jpg, tif, raw, png, mov come in via FTP.
Each file needs to be watermarked, if applicable, and metadata added to the file.
Then each file needs to be moved to an orders directory, where an order file is generated; the order is then packaged as a zip file and moved to the processing server.
The file names are of the form [orderid_userid_guid].[jpg|tif|mov|png...].
As I expect the volume to grow, I don't want to work on one file at a time and move it through the workflow; I would prefer a multi-threaded/asynchronous approach if possible.

I might set up a message queuing and processing system for this.
One process/thread/service will monitor the FTP server, and when new files appear, will grab them and dump them into a queue (possibly MSMQ, or just a staging folder, etc.)
Another process monitors this queue, and when a file appears, it grabs it and does watermarking/metadata/etc., then drops it in another queue/folder.
Another process monitors this queue and grabs new files for zipping. After zipping, drop in another queue.
...and so on.
You can set up "work dispatchers" at the end of each queue to grab files and dispatch them to however many worker threads you want.
You don't necessarily have to split it out into this many separate processes and queues - that's up to you to decide. The "queues" can be implemented in a number of different ways as well. You could look at MSMQ as a start, but you might also consider just moving files between folders, etc. WCF and Windows Workflow Foundation might be good technologies to look at first.
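
If it helps to picture the shape of this, here is a rough in-process sketch of one such stage using in-memory queues instead of MSMQ. The folder path, worker count, and the WatermarkAndTag step are placeholders, and BlockingCollection/Task come from .NET 4 and later (newer than the C# 3.0 in the question); the same pattern can be built on Queue<T> plus Monitor in .NET 3.5:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class PipelineSketch
{
    // In-memory stand-ins for the queues between stages.
    static readonly BlockingCollection<string> IncomingFiles = new BlockingCollection<string>();
    static readonly BlockingCollection<string> FilesToZip = new BlockingCollection<string>();

    static void Main()
    {
        // Stage 1: watch the FTP drop folder and enqueue new files.
        // Note: a Created event can fire while the upload is still in progress;
        // a real implementation would wait for the file to settle first.
        var watcher = new FileSystemWatcher(@"C:\ftp\incoming");
        watcher.Created += (s, e) => IncomingFiles.Add(e.FullPath);
        watcher.EnableRaisingEvents = true;

        // Stage 2: a small pool of workers watermarks/tags files and hands them on.
        for (int i = 0; i < 4; i++)
        {
            Task.Run(() =>
            {
                foreach (var path in IncomingFiles.GetConsumingEnumerable())
                {
                    WatermarkAndTag(path);   // placeholder for the real work
                    FilesToZip.Add(path);    // hand off to the zipping stage
                }
            });
        }

        // Stage 3 (zipping, order-file generation, ...) would consume FilesToZip the same way.
        Console.ReadLine();
    }

    static void WatermarkAndTag(string path)
    {
        // Apply the watermark and write metadata here.
    }
}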


Eclipse BPMN2 Modeller - linking from one BPMN file to another?

I'm working on a project that uses an extremely complex BPMN file, so I've been tasked with seeing whether it can be split into multiple BPMN files, i.e. have it go from one BPMN file into another. We are using Eclipse's BPMN2 Modeler; are there any ways of doing this other than implementing a Sub-Process? And is there a way for it to happen as the user carries out tasks rather than right at the start, for instance when the user reaches a certain point in the sequence it jumps to another BPMN, and otherwise it does not?
You could use message events to signal to different lanes/flows of your original BPMN.
This would enable you to split the flow into sub-BPMN diagrams which can accept message events to start the sub-flow, and emit message events when they're complete to continue the wider process.
Subprocesses are the best way to split chunks of a process into separate units. Based on your question ("for instance when the user reaches a certain point in the sequence it jumps to another BPMN"), that is exactly where you would place a sub-process activity.
I wonder why you are discarding that approach.

Complicated job aggregate

I have a very complicated job process and it's not 100% clear to me where to handle what.
I don't want code; it's just a question of who is responsible for what.
Given is the following:
There is a root directory "C:\server"
Inside are two directories "ftp" and "backup"
Imagine the following process:
An external customer sends a file into the ftp directory.
An importer application gets the file, and now the fun starts.
A job aggregate has to be created for this file.
The command "CreateJob(string file)" is fired.
The file has to be moved from ftp to backup. Inside the CommandHandler, inside the Aggregate, or on the JobCreated event?
StartJob(Guid jobId) gets called. A third folder, "in-progress", has to be created, and the file has to be copied from backup to in-progress. Who does it?
So it's unclear to me where filesystem concerns have to be handled if the Aggregate cannot work correctly without the correct filesystem state.
My first approach was to do that inside an infrastructure layer/lib which listens to the events from the job layer, but that doesn't seem 100% correct.
And on top of this, what about replaying?
You can't replay things/files that were moved; you would have to somehow simulate the customer sending the file to the ftp folder...
Thankful for any answers.
The file has to be moved from ftp to backup. Inside the CommandHandler, inside the Aggregate, or on the JobCreated event?
In situations like this, I move the file to the destination folder in the Application service that sends the command to the Aggregate (or that calls a command-like method on the Aggregate, which is the same thing), before the command is sent to the Aggregate. In this way, if there are problems with the file system (not enough permissions, no space available, etc.), the command is not sent. These kinds of problems should not reach our Aggregate. We must protect it from the infrastructure. In fact we should keep the Aggregate isolated from anything else; it must contain only pure business logic that is used to decide what events get generated.
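As a rough illustration of that ordering (all names here - JobApplicationService, IJobRepository, Job.Create - are hypothetical, not from your code):

using System.IO;

// Hedged sketch: the Application service does the file-system work first and only then
// involves the Aggregate, so infrastructure failures never reach the domain model.
public interface IJobRepository { void Save(Job job); }

public class Job
{
    public string File { get; private set; }
    public static Job Create(string file) { return new Job { File = file }; }  // would record JobCreated in a real model
}

public class JobApplicationService
{
    private readonly IJobRepository _repository;
    public JobApplicationService(IJobRepository repository) { _repository = repository; }

    public void CreateJob(string file)
    {
        // Infrastructure concern, handled outside the Aggregate: if this throws
        // (missing permission, no disk space, ...), no command reaches the Aggregate at all.
        var backupPath = Path.Combine(@"C:\server\backup", Path.GetFileName(file));
        File.Move(file, backupPath);

        // Pure business logic from here on: the Aggregate only decides and records events.
        var job = Job.Create(backupPath);
        _repository.Save(job);
    }
}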
My first approach was to do that inside an infrastructure layer/lib which listens to the events from the job layer, but that doesn't seem 100% correct.
Indeed, this seems like over-engineering to me. You must KISS.
StartJob(Guid jobId) gets called. A third folder, "in-progress", has to be created, and the file has to be copied from backup to in-progress. Who does it?
Whoever calls StartJob could do the moving before StartJob gets called. Again, keep the Aggregate pure. In this case it depends on your framework/domain details.
And on top of this, what about replaying? You can't replay things/files that were moved; you would have to somehow simulate the customer sending the file to the ftp folder...
The events are loaded from the event store and replayed in two situations:
Before every command gets sent to the Aggregate, the Aggregate Repository loads all the events from the event store and applies every one of them to the Aggregate, probably by calling some applyThisEvent(TheEvent) method on the Aggregate. So these methods should have no side effects (they should be pure); otherwise you would change the outside world again and again at every command execution, and you don't want that. A small sketch of such a pure rehydration follows after this list.
The read-models (the projections, the query-models) that present data to the user listen to those events and update the database tables that hold the data the users see. The events are sent to those read-models after they are generated and every time the read-models are being recreated. When you invent a new read-model, you must pass it all the events that were previously generated by the aggregates in order to build the correct/complete state on it. If your read-model's event listeners had side effects, what do you think would happen when you replay those long-past events? The outside world would be modified again and again, and you don't want that! The read-models only interpret the events; they don't generate other events and they don't change the outside world.
There is a special third case, when events reach another type of model, a Saga. A Saga must receive an event only once! This is the case you were thinking of with "my first approach was to do that inside an infrastructure layer/lib which listens to the events from the job layer". You could do this in your case, but it is not KISS.
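To make the "pure apply" point concrete, here is a separate, hypothetical sketch of rehydrating a Job aggregate from stored events; every name is made up, and nothing in it touches the file system:

using System;
using System.Collections.Generic;

// Hedged sketch of rehydration: the repository replays stored events through pure
// Apply methods, so replaying any number of times has no effect on the outside world.
public class JobCreated { public Guid JobId; public string File; }
public class JobStarted { public Guid JobId; }

public class Job
{
    public Guid Id { get; private set; }
    public string File { get; private set; }
    public bool Started { get; private set; }

    // Pure state transitions: no I/O, no side effects, safe to replay.
    private void Apply(JobCreated e) { Id = e.JobId; File = e.File; }
    private void Apply(JobStarted e) { Started = true; }

    public static Job FromHistory(IEnumerable<object> events)
    {
        var job = new Job();
        foreach (var e in events)
        {
            if (e is JobCreated) job.Apply((JobCreated)e);
            else if (e is JobStarted) job.Apply((JobStarted)e);
        }
        return job;
    }
}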
I have a very complicated job process and it's not 100% clear to me where to handle what. I don't want code; it's just a question of who is responsible for what.
The usual answer is that the domain model -- aka the "aggregate" -- makes decisions and saves them. Observing those decisions, some event handler induces the side effects.
And on top of this, what about replaying? You can't replay things/files that were moved; you would have to somehow simulate the customer sending the file to the ftp folder...
You replay the events to the aggregate, so that it is restored to the state where it made the last decision. That's a separate concern from replaying the side effects -- which is part of the motivation for handling the side effects elsewhere.
Where possible, of course, you prefer to have the side effects be idempotent, so that a duplicated message doesn't create a problem. But notice that from the point of view of the model, it doesn't actually matter whether the side effect succeeds or not.
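A rough sketch of such an idempotent side effect, using a variant of the hypothetical JobStarted event from the earlier sketch (the folder path is a placeholder):

using System;
using System.IO;

// Hedged sketch (hypothetical names): the side effect lives in an event handler and is
// written to be idempotent, so a duplicated or replayed JobStarted message does no extra harm.
public class JobStarted { public Guid JobId; public string BackupPath; }

public class MoveFileWhenJobStarted
{
    private const string InProgressDir = @"C:\server\in-progress";

    public void Handle(JobStarted e)
    {
        Directory.CreateDirectory(InProgressDir);   // no-op if the folder already exists

        var target = Path.Combine(InProgressDir, Path.GetFileName(e.BackupPath));

        // If the copy already happened (duplicate delivery, or a crash after the copy),
        // skip the work instead of failing.
        if (!File.Exists(target))
        {
            File.Copy(e.BackupPath, target);
        }
    }
}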

Can watchman send why a file changed?

Is watchman capable of telling the configured command why it's sending a file to that command?
For example:
a file that is new to a folder would possibly carry a FILE_CREATE flag;
a file that is deleted would send the FILE_DELETE flag to the command;
a file that is modified would send a FILE_MOD flag, etc.
Perhaps even when a folder gets deleted (and therefore the files thereunder), it would send a FOLDER_DELETE parameter naming the folder, as well as a FILE_DELETE for the files thereunder / FOLDER_DELETE for the folders thereunder.
Is there such a thing?
No, it can't do that. The reasons why are pretty fundamental to its design.
The TL;DR is that it is a lot more complicated than you might think for a client to correctly process those individual events and in almost all cases you don't really want them.
Most file watching systems are abstractions that simply translate from the system-specific notification information into some common form. They don't deal, either very well or at all, with the notification queue overflowing, and they don't provide their clients with a way to reliably respond to that situation.
In addition to this, the filesystem can be subject to many and varied changes in a very short amount of time, and from multiple concurrent threads or processes. This makes this area extremely prone to TOCTOU issues that are difficult to manage. For example, creating and writing to a file typically results in a series of notifications about the file and its containing directory. If the file is removed immediately after this sequence (perhaps it was an intermediate file in a build step), by the time you see the notifications about the file creation there is a good chance that it has already been deleted.
Watchman takes the input stream of notifications and feeds it into its internal model of the filesystem: an ordered list of observed files. Each time a notification is received watchman treats it as a signal that it should go and look at the file that was reported as changed and then move the entry for that file to the most recent end of the ordered list.
When you ask Watchman for information about the filesystem it is possible or even likely that there may be pending notifications still due from the kernel. To minimize TOCTOU and ensure that its state is current, watchman generates a synchronization cookie and waits for that notification to be visible before it responds to your query.
The combination of the two things above means that watchman result data has two important properties:
You are guaranteed to have observed all notifications that happened before your query
You receive the most recent information for any given file only once in your query results (the change results are coalesced together)
Let's talk about the overflow case. If your system is unable to keep up with the rate at which files are changing (e.g. you have a big project, are very quickly creating and deleting files, and the system is heavily loaded), the OS can't fit all of the pending notifications in the buffer resources allocated to the watches. When that happens, it blows those buffers and sends an overflow signal. What that means is that the client of the watching API has missed some number of events and is no longer in synchronization with the state of the filesystem. If that client maintains state about the filesystem, that state is no longer valid.
Watchman addresses this situation by re-examining the watched tree and synthetically marking all of the files as being changed. This causes the next query from the client to see everything in the tree. We call this a fresh instance result set because it is the same view you'd get when you are querying for the first time. We set a flag in the result so that the client knows that this has happened and can take appropriate steps to repair its own state. You can configure this behavior through query parameters.
In these fresh instance result sets, we don't know whether any given file really changed or not (it's possible that it changed in such a way that we can't detect via lstat) and even if we can see that its metadata changed, we don't know the cause of that change.
There can be multiple events that contribute to why a given file appears in the results delivered by watchman. We don't record them individually because we can't track them with unbounded history; imagine a file that is incrementally written to once every second all day long. Do we keep 86400 change entries for it per day on hand and deliver those to our clients? What if there are hundreds of thousands of files like this? We'd have to truncate that data, and at that point the loss in the data reduces how well you can reason about it.
At the end of all of this, it is very rare for a client to do much more than try to read a file or look at its metadata, and generally speaking, they want to do that only when the file has stopped changing. For this use case, watchman-wait, watchman-make and trigger all have the concept of a settle period that causes the change notifications to be delayed in delivery until after the filesystem has stopped changing.
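To make the query side of this concrete, here is a rough client sketch (in C#, purely for illustration) that issues a since query through the watchman CLI - watchman -j reads a JSON command from stdin - and honours the is_fresh_instance flag described above; the root path and clock value are placeholders:

using System;
using System.Diagnostics;
using System.Text.Json;

// Rough client sketch: run "watchman -j", send a "since" query, and treat a fresh
// instance result as "rebuild state" rather than as a list of ordinary changes.
class WatchmanSinceQuery
{
    static void Main()
    {
        var psi = new ProcessStartInfo("watchman", "-j")
        {
            RedirectStandardInput = true,
            RedirectStandardOutput = true,
            UseShellExecute = false
        };

        using (var proc = Process.Start(psi))
        {
            // Ask for everything changed since the clock saved from the previous query.
            proc.StandardInput.Write(
                "[\"query\", \"/path/to/root\", {\"since\": \"c:123:45\", \"fields\": [\"name\", \"exists\", \"new\"]}]");
            proc.StandardInput.Close();

            var json = proc.StandardOutput.ReadToEnd();
            proc.WaitForExit();

            using (var doc = JsonDocument.Parse(json))
            {
                var root = doc.RootElement;

                bool fresh = root.TryGetProperty("is_fresh_instance", out var flag) && flag.GetBoolean();
                if (fresh)
                {
                    // Continuity was lost (or this is the first query): every file is listed,
                    // so rebuild local state rather than interpreting these as deltas.
                }

                foreach (var file in root.GetProperty("files").EnumerateArray())
                {
                    Console.WriteLine(file.GetProperty("name").GetString());
                }
                // Persist root.GetProperty("clock") and pass it as "since" next time.
            }
        }
    }
}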

How to add a file to ClearCase database, but not in source control?

On my project I have some files that are generated automatically, so you normally wouldn't put them in source control.
But since this generation process takes a long time and the files change fairly periodically, I'd rather keep them in the ClearCase database so as not to impose this process on everyone who wants to compile source that isn't directly related to these files.
So, is there a way I could add files to ClearCase UCM without creating a version tree?
More directly, I'd like to know if there is a way to keep only one version per branch, as if, when delivering this file to the main branch, it deleted the old version and replaced it with the new one.
I know that this is a bit unorthodox, but I ask because I'm not interested in the generated files' history and I'd like to save space on the server.
So, is there a way I could add files to ClearCase UCM without creating a version tree?
No.
Unless those files are radically different from one generation to the next (or are huge binaries), ClearCase would only record the deltas, which wouldn't consume too much space.
One trick would be to rename the stream in which the import of the newly generated sources is done, and create a new stream, in order not to end up with a huge version tree over time.

Using Mojo Event Loop for a long-running script processing huge text files?

I have a script implemented as a Mojo::Command.
It reads a huge text file and extracts data from it. The file contains simple tab-separated (TSV) records, one record per line.
How can I use the Mojo event loop to store those records in small files - one file per record - so that my script does not wait for each record to be stored but continues to the next record?
Here is a stripped down example:
package My::task;
use Mojo::Base 'Mojolicious::Command';
use Text::CSV;

# in My::task::run
# open and read the tab-separated file with Text::CSV
my $csv = Text::CSV->new({ binary => 1, sep_char => "\t" });
open my $fh, '<', $file or die "Cannot open $file: $!";   # $file is the huge input file
while ( !$csv->eof ) {
    my $row = $csv->getline($fh);
    do_something_time_consuming_and_store_the_record_somewhere($row);
}
I was thinking the Mojo event loop could be used to avoid forking/threading.
I have successfully used Parallel::Forker before, but I was thinking Mojo might have something to offer to speed up the execution.
Is that possible? How?
It depends on the nature of do_something_time_consuming. If it is something that keeps your process CPU-busy, then you're looking for parallelism, which an event loop doesn't try to give you. In that case you might want to feed each row to redis (via Mojo::Redis) and have worker processes consume, process, and store each record. Throughput then comes down to how many parallel workers you can run.
On the other hand, if do_something_time_consuming involves a lot of waiting, e.g. posting to a web service and waiting for results, then an event loop (including Mojo's) can be a big win and can handle the concurrency you want. It's hard to guess which of the non-blocking UserAgent examples is closest to your scenario, since you're short on detail. The gist is to create a callback that does what you want (e.g. store_the_record_somewhere) when it gets a response back from the remote service.