How to propagate PubSub metadata with Apache Beam? - google-cloud-storage

Context: I have a pipeline that listens to Pub/Sub; the Pub/Sub message is published by an object change notification from Google Cloud Storage. The pipeline processes the file using XmlIO, splitting it, so far so good.
The problem is: in the Pub/Sub message (and in the object stored in Google Cloud Storage) I have some metadata that I would like to merge with the data from XmlIO to compose the elements that the pipeline will process. How can I achieve this?

You can create a custom window and WindowFn that store the metadata from the Pub/Sub message that you want to use later to enrich the individual records.
Your pipeline will look as follows:
ReadFromPubsub -> Window.into(CopyMetadataToCustomWindowFn) -> ParDo(ExtractFilenameFromPubsubMessage) -> XmlIO -> ParDo(EnrichRecordsWithWindowMetadata) -> Window.into(FixedWindows.of(...))
To start, you'll want to create a subclass of IntervalWindow that stores the metadata that you need. After that, create a subclass of WindowFn where, in #assignWindows(...), you copy the metadata from the Pub/Sub message into the IntervalWindow subclass you created. Apply your new WindowFn using the Window.into(...) transform. Now each of the records that flows through the XmlIO transform will be within your custom window that contains the metadata.
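As a rough sketch of what that could look like (my own illustration, not tested code; the window length, the placeholder coder, and the assumption that you read with PubsubIO.readMessagesWithAttributes() so the attributes are available are all things you would adjust):
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
import org.apache.beam.sdk.transforms.windowing.NonMergingWindowFn;
import org.apache.beam.sdk.transforms.windowing.WindowFn;
import org.apache.beam.sdk.transforms.windowing.WindowMappingFn;
import org.joda.time.Duration;
import org.joda.time.Instant;

// A window that carries the Pub/Sub attributes alongside the usual interval bounds.
class MyCustomMetadataWindow extends IntervalWindow {
  private final Map<String, String> attributes;

  MyCustomMetadataWindow(Instant start, Instant end, Map<String, String> attributes) {
    super(start, end);
    this.attributes = attributes;
  }

  Map<String, String> getAttributes() {
    return attributes;
  }
}

// A WindowFn that copies each message's attributes onto the window it assigns.
class CopyMetadataToCustomWindowFn
    extends NonMergingWindowFn<PubsubMessage, MyCustomMetadataWindow> {

  @Override
  public Collection<MyCustomMetadataWindow> assignWindows(AssignContext c) {
    PubsubMessage message = c.element();
    return Collections.singletonList(
        new MyCustomMetadataWindow(
            c.timestamp(),
            c.timestamp().plus(Duration.standardMinutes(1)), // assumed window length
            message.getAttributeMap()));
  }

  @Override
  public boolean isCompatible(WindowFn<?, ?> other) {
    return other instanceof CopyMetadataToCustomWindowFn;
  }

  @Override
  public Coder<MyCustomMetadataWindow> windowCoder() {
    // A real implementation needs a custom coder that also serializes the attributes.
    throw new UnsupportedOperationException("TODO: coder for MyCustomMetadataWindow");
  }

  @Override
  public WindowMappingFn<MyCustomMetadataWindow> getDefaultWindowMappingFn() {
    throw new UnsupportedOperationException("Side inputs are not supported by this sketch");
  }
}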
For the second step, you'll need to extract the relevant filename from the Pub/Sub message to pass to the XmlIO transform as input.
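Since a GCS notification carries the bucket and object name as message attributes (bucketId and objectId), that extraction could be as simple as the sketch below (ExtractFilenameFromPubsubMessage is just the name from the diagram above; how you then feed the filenames into XmlIO, e.g. via FileIO.matchAll()/readMatches() plus XmlIO.readFiles(), depends on your Beam version):
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.DoFn;

class ExtractFilenameFromPubsubMessage extends DoFn<PubsubMessage, String> {
  @ProcessElement
  public void processElement(@Element PubsubMessage message, OutputReceiver<String> out) {
    // GCS notifications expose the bucket and object name as message attributes.
    out.output("gs://" + message.getAttribute("bucketId") + "/" + message.getAttribute("objectId"));
  }
}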
For the third step, you want to extract the custom metadata from the window in a ParDo/DoFn that comes after the XmlIO. The records output by XmlIO will preserve the windowing information that was passed through it (note that not all transforms do this, but almost all do). You can state that your DoFn needs the window to be passed to your @ProcessElement method, for example:
class EnrichRecordsWithWindowMetadata extends DoFn<...> {
  @ProcessElement
  public void processElement(@Element XmlRecord xmlRecord, MyCustomMetadataWindow metadataWindow) {
    // ... enrich record with metadata on window ...
  }
}
Finally, it is a good idea to revert to one of the standard WindowFns, such as FixedWindows, since the metadata on the window is no longer relevant.
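For example (the duration here is just a placeholder):
Window.into(FixedWindows.of(Duration.standardMinutes(1)))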

You can use Pub/Sub notifications from Google Cloud Storage directly instead of introducing Object Change Notification (OCN) in the middle; Google also recommends Pub/Sub notifications. When you receive the Pub/Sub notification, you can read the message attributes from it:
# Assuming a push subscription delivering notifications to an HTTP endpoint (e.g. a Flask handler):
data = request.get_json()
object_generation = data['message']['attributes']['objectGeneration']
bucket_name = data['message']['attributes']['bucketId']
object_name = data['message']['attributes']['objectId']

Related

Passing Cloud Storage custom metadata into Cloud Storage Notification

We have a Python script that copies/creates files in a GCS bucket.
# let me know if my setting of the custom-metadata is correct
blob.metadata = { "file_capture_time": some_timestamp_var }
blob.upload(...)
We want to configure the bucket such that it generates Cloud Storage notifications whenever an object is created. We also want the custom metadata above to be passed along with the Pub/Sub message to the topic, and to use that as an ordering key on the subscription side. How can we do this?
The recommended way to receive a notification when an event occurs on the intended GCS bucket is to create a Cloud Pub/Sub topic for new objects and to configure your GCS bucket to publish messages to that topic when new objects are created.
First, make sure you've activated the Cloud Pub/Sub API, and then use a gsutil command similar to the one below:
gsutil notification create -f json -e OBJECT_FINALIZE gs://example-bucket
The -e flag specifies that you're only interested in OBJECT_FINALIZE messages (objects being created).
The -f flag specifies that you want the payload of the messages to be the object metadata for the JSON API.
The -m flag specifies a key:value attribute that is appended to the set of attributes sent to Cloud Pub/Sub for all events associated with this notification config. You may specify this parameter multiple times to set multiple attributes.
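For example, to add static attributes to every notification from this config (the attribute names and bucket are placeholders):
gsutil notification create -f json -e OBJECT_FINALIZE -m environment:prod -m source:backup-script gs://example-bucket
Note that -m attributes are fixed per notification config; per-object custom metadata (such as blob.metadata above) arrives inside the JSON payload of the message (the object resource), not as -m attributes.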
There is also a full Firebase example that explains parsing the filename and other info from the event's context/data. Here is a good example with a similar context.

Accessing per-key state store in Apache Flink that changes dynamically

I have a stream of messages with different keys. For each key, I want to create an event time session window and do some processing on it only if:
a MIN_EVENTS number of events has accumulated in the window (essentially keyed state)
For each key, MIN_EVENTS is different and might change at runtime. I am having difficulty implementing this. In particular, I am implementing the logic like so:
inputStream.keyBy(key)
    .window(EventTimeSessionWindows.withGap(INACTIVITY_PERIOD))
    .trigger(new MyCustomCountTrigger())
    .apply(new MyProcessFn());
I am trying to create a custom MyCustomCountTrigger() that should be capable of reading from a state store such as MapState<String, Integer> stateStore that maps each key to its MIN_EVENTS parameter. I am aware that I can access a state store using the TriggerContext ctx object that is available to all Triggers.
How do I initialize this state store from outside the CountTrigger() class? I haven't been able to find examples to do so.
You can initialize the state based on parameters sent to the constructor of your Trigger class. But you can't access the state from outside that class.
If you need more flexibility, I suggest you use a process function instead of a window.
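As a rough sketch of the constructor approach (illustration only, not the original poster's code; MyEvent and its getKey() accessor are placeholders, the thresholds map must be serializable, and a trigger used with merging session windows also needs canMerge()/onMerge()):
import java.util.Map;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

// Placeholder event type for the sketch.
class MyEvent {
  String key;
  String getKey() { return key; }
}

class MyCustomCountTrigger extends Trigger<MyEvent, TimeWindow> {
  // Fixed at job submission time via the constructor; not updatable from outside afterwards.
  private final Map<String, Integer> minEventsPerKey;
  private final ValueStateDescriptor<Integer> countDescriptor =
      new ValueStateDescriptor<>("seen-count", Integer.class);

  MyCustomCountTrigger(Map<String, Integer> minEventsPerKey) {
    this.minEventsPerKey = minEventsPerKey;
  }

  @Override
  public TriggerResult onElement(MyEvent element, long timestamp, TimeWindow window, TriggerContext ctx)
      throws Exception {
    // Per-key, per-window element count kept in partitioned trigger state.
    ValueState<Integer> count = ctx.getPartitionedState(countDescriptor);
    int seen = (count.value() == null ? 0 : count.value()) + 1;
    count.update(seen);
    int minEvents = minEventsPerKey.getOrDefault(element.getKey(), Integer.MAX_VALUE);
    return seen >= minEvents ? TriggerResult.FIRE : TriggerResult.CONTINUE;
  }

  @Override
  public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
    return TriggerResult.CONTINUE;
  }

  @Override
  public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
    return TriggerResult.CONTINUE;
  }

  @Override
  public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
    ctx.getPartitionedState(countDescriptor).clear();
  }
}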

Custom events data in Firebase Events

I have a logging mechanism which logs custom events to Firebase.
Event logged. Event name, event params: Session, {
"_o" = app;
deviceId = "21957A5C-5344-4D93-BCFB-3D01EDCC8886";
type = "Manual Logout";
userId = 2;}
It successfully logs events, and I can see the keys of my custom data, but I can't see the values for those keys. For example, here is what I see for deviceId (screenshot omitted).
Unfortunately, Firebase presents aggregated data in most cases, and for custom events the parameter values will not be available in the dashboard for you to see.
Your best option on the free plan is to use a pre-existing event template and fit your events to it, such as:
static func levelChange(level: Int, gameIndex: Int) {
    Analytics.logEvent(AnalyticsEventLevelUp, parameters: [
        AnalyticsParameterLevel: level as NSNumber,
        AnalyticsParameterCharacter: String(gameIndex) as NSString
    ])
}
Here I have used the pre-existing event template, even though my event is not actually a level-up, but at least I am able to see the (aggregated) values in the dashboard.
Alternatively (and this is what I have now moved on to), you can activate the Blaze plan in Firebase, so that all your raw events are pushed to BigQuery, from where you can query whatever custom event parameters you've stored.

How to control data failures in Azure Data Factory Pipelines?

I receive an error from time to time due to data in my source data set that is incompatible with my target data set. I would like to control the action the pipeline takes based on the error type, for example outputting or dropping those particular rows while completing everything else. Is that possible? Furthermore, is there a simple way to get hold of the actual failing line(s) from Data Factory without accessing and searching the actual source data set?
Copy activity encountered a user error at Sink side: ErrorCode=UserErrorInvalidDataValue,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Column 'Timestamp' contains an invalid value '11667'. Cannot convert '11667' to type 'DateTimeOffset'.,Source=Microsoft.DataTransfer.Common,''Type=System.FormatException,Message=String was not recognized as a valid DateTime.,Source=mscorlib,'.
Thanks
I think you've hit a fairly common problem and limitation within ADF. Although the datasets you define with your JSON allow ADF to understand the structure of the data, that is all it gets: just the structure. The orchestration tool can't transform or manipulate the data as part of the activity processing.
To answer your question directly: it's certainly possible, but you need to break out the C# and use ADF's extensibility functionality to deal with your bad rows before passing the data to the final destination.
I suggest you expand your data factory to include a custom activity where you can build some lower-level cleaning processes to divert the bad rows as described.
This is an approach we often take, as not all data is perfect (I wish) and plain ETL or ELT doesn't work; I prefer the acronym ECLT, where the 'C' stands for clean (or cleanse, prepare, etc.). This certainly applies to ADF because the service doesn't have its own compute or SSIS-style data flow engine.
So...
In terms of how to do this. First I recommend you check out this blog post on creating ADF custom activities. Link:
https://www.purplefrogsystems.com/paul/2016/11/creating-azure-data-factory-custom-activities/
Then, within your C# class that implements IDotNetActivity, do something like the below.
public IDictionary<string, string> Execute(
    IEnumerable<LinkedService> linkedServices,
    IEnumerable<Dataset> datasets,
    Activity activity,
    IActivityLogger logger)
{
    // etc: resolve connection details/paths from the linked services and datasets
    using (StreamReader vReader = new StreamReader(YourSource))
    {
        using (StreamWriter vWriter = new StreamWriter(YourDestination))
        {
            while (!vReader.EndOfStream)
            {
                string vLine = vReader.ReadLine();
                // data transform logic: if bad row, divert/log it; otherwise write it out
                vWriter.WriteLine(vLine);
            }
        }
    }
    return new Dictionary<string, string>();
}
You get the idea. Build your own SSIS data flow!
Then write out your clean rows as an output dataset, which can be the input for your next ADF activity, either with multiple pipelines or as chained activities within a single pipeline.
This is the only way you will get ADF to deal with your bad data in the current service offerings.
Hope this helps

Using visjs manipulation to create workflow dependencies

We are currently using visjs version 3 to map the dependencies of our custom built workflow engine. This has been WONDERFUL because it helps us to visualize the flow and find invalid or missing dependencies. What we want to do next is simplify the process of building the dependencies using the visjs manipulation feature. The idea would be that we would display a large group of nodes and allow the user to order them correctly. We then want to be able to submit that json structure back to the server for processing.
Would this be possible?
Yes, this is possible.
Vis.js dispatches various events that relate to user interactions with the graph (e.g. manipulations or position changes), for which you can add handlers that modify or store the data on change. If you use DataSets to store the nodes and edges in your network, you can always use the DataSet get() function to retrieve all elements in your handler in JSON format. Then, in your handler, just use an AJAX request to transmit the JSON to your server to store the entire graph in your DB, or save the JSON as a file.
The opposite applies for loading the graph: simply query the JSON from your server and inject it into the node and edge DataSets using the set() method.
You can also store the network's current options using the network's getOptions() method, which returns all applied options as JSON.