Streaming data Complex Event Processing over files and a rather long period - streaming

my challenge:
we receive files every day with about 200.000 records. We keep the files for approx 1 year, to support re-processing, etc..
For the sake of the discussion assume it is some sort of long lasting fulfilment process, with a provisioning-ID that correlates records.
we need to identify flexible patterns in these files, and trigger events
typical questions are:
if record A is followed by record B which is followed by record C, and all records occured within 60 days, then trigger an event
if record D or record E was found, but record F did NOT follow within 30 days, then trigger an event
if both records D and record E were found (irrespective of the order), followed by ... within 24 hours, then trigger an event
some pattern require lookups in a DB/NoSql or joins for additional information either to select the record, or to put into the event.
"Selecting a record" can be simple "field-A equals", but can also be "field-A in []" or "filed-A match " or "func identify(field-A, field-B)"
"days" might also be "hours" or "in previous month". Hence more flexible then just "days". Usually we have some date/timestamp in the record. The maximum is currently "within 6 months" (cancel within setup phase)
The created events (preferably JSON) needs to contain data from all records which were part of the selection process.
We need an approach that allows to flexibly change (add, modify, delete) the pattern, optionally re-processing the input files.
Any thoughts on how to tackle the problem elegantly? May be some python or java framework, or does any of the public cloud solutions (AWS, GCP, Azure) address the problem space especially well?
thanks a lot for your help

After some discussions and readings, we'll try first Apache Flink with the FlinkCEP library. From the docs and blog entries it seems to be able to do the job. It also seems AWS's choice, running on their EMR cluster. We didn't find any managed service on GCP nor Azure providing the functionalities. Of course we can always deploy and manage it ourselves. Unfortunately we didn't find a Python framework

Related

Data syncing with pouchdb-based systems client-side: is there a workaround to the 'deleted' flag?

I'm planning on using rxdb + hasura/postgresql in the backend. I'm reading this rxdb page for example, which off the bat requires sync-able entities to have a deleted flag.
Q1 (main question)
Is there ANY point at which I can finally hard-delete these entities? What conditions would have to be met - eg could I simply use "older than X months" and then force my app to only ever displays data for less than X months?
Is such a hard-delete, if possible, best carried out directly in the central db, since it will be the source of truth? Would there be any repercussions client-side that I'm not foreseeing/understanding?
I foresee the number of deleted's growing rapidly in my app and i don't want to have to store all this extra data forever.
Q2 (bonus / just curious)
What is the (algorithmic) basis for needing a 'deleted' flag? Is it that it's just faster to check a flag rather than to check for the omission of an object from, say, a very large list. I apologize if it's kind of a stupid question :(
Ultimately it comes down to a decision that's informed by your particular business/product with regards to how long you want to keep deleted entities in your system. For some applications it's important to always keep a history of deleted things or even individual revisions to records stored as a kind of ledger or history. You'll have to make a judgement call as to how long you want to keep your deleted entities.
I'd recommend that you also add a deleted_at column if you haven't already and then you could easily leverage something like Hasura's new Scheduled Triggers functionality to run a recurring job that fully deletes records older than whatever your threshold is.
You could also leverage Hasura's permissions system to ensure that rows that have been deleted aren't returned to the client. There is documentation and examples for ways to work with soft deletes and Hasura
For your second question it is definitely much faster to check for the deleted flag on records than to have to try and diff the entire dataset looking for things that are now missing.

How to mark a object for deletion in x days time?

I have a regional object store. I would like to be able to tell a particular object that I want you deleted in 5 days time from now.
How do you suggest I implement?
I don't really want to keep track of the object in a database, and based on time send delete commands as a separate process. Is there any tag that could be set to get deletion to occur at a later time (from now, not a specific time in the past)?
There's no functionality built into Google Cloud Storage to do this.
You can configure Lifecycle Management to delete objects according to a number of criteria (including age) - but deleting at a particular date in the future isn't one of the supported conditions and in fact there's no guarantee that a lifecycle condition will run the same day the condition becomes true. Instead you would have to implement this functionality yourself (e.g., in a Compute Engine or App Engine implementation).

PowerApps datasource to overcome 500 visible or searchable items limit

For PowerApps, what data source, other than SharePoint lists are accessible via Powershell?
There are actually two issues that I am dealing with. The first is dynamic updating and the second is the 500 item limit that SharePoint lists are subject to.
I need to dynamically update my data source, which I am currently doing with PowerShell. My data source is not static and updating records by hand is time-consuming and error prone. The driving force behind my question is that the SharePoint list view threshold is 5,000 records however you are limited to 500 visible and searchable records when using SharePoint lists in the Gallery View and my data source contains greater than 500 but less than 1000 records. If you have any items beyond the 500th record that should match the filter criteria, they will not be found. So SharePoint lists are not optional for me until that limitation is remediated
Reference: https://powerapps.microsoft.com/en-us/tutorials/function-filter-lookup/
To your first question, Powershell can be used for almost anything on the Microsoft stack. You could use SQL server, Dynamics 365, SP, Azure, and in the future there will be an SDK for the Common Data Service. There are a lot of connectors, and Powershell can work with a good majority of them.
Take note that working with these data structures through Powershell is independent from Powerapps. Powerapps just takes the data that the data connector gives it, and if you have something updating the data in the background (Powershell, cron job, etc.), In order to get a dynamic list of items, you can use a Timer control and a Refresh function on your data source to update the list every ~5-20 seconds.
To your second question about SharePoint, there is an article that came out around the time you asked this regarding working with large lists. I wouldn't say it completely solves your question, but this article seems to state using the "Filter" function on basic column types would possibly work for you:
...if you’d like to filter the set of items that you are showing in the gallery control, you will make use of a “Filter” expression, rather than the “Search” expression, which is the default that existing apps used. With our changes, SharePoint connector now supports “equals” type of queries on columns that support filtering (Single line of text, choice, numbers, dates and people), so make sure that the columns and the expressions you use are supported and watch for the same warning to avoid reverting back to the top 500 items.
It also notes that if you want to pull from a list larger than the 5k threshold, you would need to use indexes, I have not fully tested this yet but it seems that this could potentially solve your problem.

Billing by tag in Google Compute Engine

Google Compute Engine allows for a daily export of a project's itemized bill to a storage bucket (.csv or .json). In the daily file I can see X-number of seconds of N1-Highmem-8 VM usage. Is there a mechanism for further identifying costs, such as per tag or instance group, when a project has many of the same resource type deployed for different functional operations?
As an example, Qty:10 N1-Highmem-8 VM's are deployed to a region in a project. In the daily bill they just display as X-seconds of N1-Highmem-8.
Functionally:
2 VM's might run a database 24x7
3 VM's might run batch analytics operation averaging 2-5 hrs each night
5 VM's might perform a batch operation which runs in sporadic 10 minute intervals through the day
final operation writes data to a specific GS Buckets, other operations read/write to different buckets.
How might costs be broken out across these four operations each day?
The Usage Logs do not provide 'per-tag' granularity at this time and it can be a little tricky to work with the usage logs but here is what I recommend.
To further break down the usage logs and get better information out of em, I'd recommend trying to work like this:
Your usage logs provide the following fields:
Report Date
MeasurementId
Quantity
Unit
Resource URI
ResourceId
Location
If you look at the MeasurementID, you can choose to filter by the type of image you want to verify. For example VmimageN1Standard_1 is used to represent an n1-standard-1 machine type.
You can then use the MeasurementID in combination with the Resource URI to find out what your usage is on a more granular (per instance) scale. For example, the Resource URI for my test machine would be:
https://www.googleapis.com/compute/v1/projects/MY_PROJECT/zones/ZONE/instances/boyan-test-instance
*Note: I've replaced the "MY_PROJECT" and "ZONE" here, so that's that would be specific to your output along with the name of the instance.
If you look at the end of the URI, you can clearly see which instance that is for. You could then use this to look for a specific instance you're checking.
If you are better skilled with Excel or other spreadsheet/analysis software, you may be able to do even better as this is just an idea on how you could use the logs. At that point it becomes somewhat a question of creativity. I am sure you could find good ways to work with the data you gain from an export.
9/2017 update.
It is now possible to add user defined labels, then track usage and billing by these labels for Compute and GCS.
Additionally, by enabling the billing export to Big Query, it is then possible to create custom views or hit Big Query in a tool more friendly to finance people such as Google Docs, Data Studio, or anything which can connect to Big Query. Here is a great example of labels across multiple projects to split costs into something friendlier to organizations, in this case a Data Studio report.

Last Updated Date: Antipattern?

I keep seeing questions floating through that make reference to a column in a database table named something like DateLastUpdated. I don't get it.
The only companion field I've ever seen is LastUpdateUserId or such. There's never an indicator about why the update took place; or even what the update was.
On top of that, this field is sometimes written from within a trigger, where even less context is available.
It certainly doesn't even come close to being an audit trail; so that can't be the justification. And if there is and audit trail somewhere in a log or whatever, this field would be redundant.
What am I missing? Why is this pattern so popular?
Such a field can be used to detect whether there are conflicting edits made by different processes. When you retrieve a record from the database, you get the previous DateLastUpdated field. After making changes to other fields, you submit the record back to the database layer. The database layer checks that the DateLastUpdated you submit matches the one still in the database. If it matches, then the update is performed (and DateLastUpdated is updated to the current time). However, if it does not match, then some other process has changed the record in the meantime and the current update can be aborted.
It depends on the exact circumstance, but a timestamp like that can be very useful for autogenerated data - you can figure out if something needs to be recalculated if a depedency has changed later on (this is how build systems calculate which files need to be recompiled).
Also, many websites will have data marking "Last changed" on a page, particularly news sites that may edit content. The exact reason isn't necessary (and there likely exist backups in case an audit trail is really necessary), but this data needs to be visible to the end user.
These sorts of things are typically used for business applications where user action is required to initiate the update. Typically, there will be some kind of business app (eg a CRM desktop application) and for most updates there tends to be only one way of making the update.
If you're looking at address data, that was done through the "Maintain Address" screen, etc.
Such database auditing is there to augment business-level auditing, not to replace it. Call centres will sometimes (or always in the case of financial services providers in Australia, as one example) record phone calls. That's part of the audit trail too but doesn't tend to be part of the IT solution as far as the desktop application (and related infrastructure) goes, although that is by no means a hard and fast rule.
Call centre staff will also typically have some sort of "Notes" or "Log" functionality where they can type freeform text as to why the customer called and what action was taken so the next operator can pick up where they left off when the customer rings back.
Triggers will often be used to record exactly what was changed (eg writing the old record to an audit table). The purpose of all this is that with all the information (the notes, recorded call, database audit trail and logs) the previous state of the data can be reconstructed as can the resulting action. This may be to find/resolve bugs in the system or simply as a conflict resolution process with the customer.
It is certainly popular - rails for example has a shorthand for it, as well as a creation timestamp (:timestamps).
At the application level it's very useful, as the same pattern is very common in views - look at the questions here for example (answered 56 secs ago, etc).
It can also be used retrospectively in reporting to generate stats (e.g. what is the growth curve of the number of records in the DB).
there are a couple of scenarios
Let's say you have an address table for your customers
you have your CRM app, the customer calls that his address has changed a month ago, with the LastUpdate column you can see that this row for this customer hasn't been touched in 4 months
usually you use triggers to populate a history table so that you can see all the other history, if you see that the creationdate and updated date are the same there is no point hitting the history table since you won't find anything
you calculate indexes (stock market), you can easily see that it was recalculated just by looking at this column
there are 2 DB servers, by comparing the date column you can find out if all the changes have been replicated or not etc etc ect
This is also very useful if you have to send feeds out to clients that are delta feeds, that is only the records that have been changed or inserted since the data of the last feed are sent.