How to access meta information of a job inside Spark application

I'd like to be notified when a Cloud Dataproc job finishes. Unfortunately, Cloud Dataproc does not seem to provide hooks or any other way to report a job's lifecycle events, so I want to implement the mechanism on my own.
I'm planning to publish to Pub/Sub from my Spark application when a job finishes. But how can I get information that identifies the job from inside the Spark application? If I could read the job's User Labels from within the application, I would assign a unique label on submit and then include the label key and value in the Pub/Sub message.
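One possible workaround, sketched here under assumptions rather than confirmed by Dataproc: pass the same unique key=value pair to the application as a program argument on submit, and build the Pub/Sub payload from that argument. The function names, the key=value argument format, and the message fields below are all hypothetical. Publishing the resulting bytes would need a Pub/Sub client, which is left out.

```python
import json


def parse_label(arg):
    """Split a 'key=value' argument into a (key, value) tuple."""
    key, sep, value = arg.partition("=")
    if not sep or not key or not value:
        raise ValueError("expected key=value, got %r" % (arg,))
    return key, value


def build_notification(label_key, label_value, status):
    """Encode the job label and final status as a UTF-8 JSON payload.

    The field names are hypothetical; pick whatever the subscriber expects.
    """
    message = {
        "label_key": label_key,
        "label_value": label_value,
        "status": status,
    }
    return json.dumps(message, sort_keys=True).encode("utf-8")
```

The application would call parse_label on its first argument at startup and build_notification just before it exits, so the subscriber can match the message to the label given on submit.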

Related

Configure email alert for any job/workflow failures in Hue/Oozie?

I would like to get email notifications if any job/workflow fails in Oozie. I am using Hue to monitor the workflows.
I don't want to add an email action to each and every workflow because I have around 60 workflows already running.
I am also aware of the sub-workflow approach, but even with it I would have to edit all 60 workflows and restart the coordinators for the change to take effect.
Is it possible in Oozie or Hue to get notification for any job failures without modifying the workflow? Can we configure something at Oozie/Hue level to get email notifications?

There is no option out of the box. Oozie SLA connected to your monitoring system is often used for this, but it would require updating the workflows.
In the future, an option could also be added to Hue to automatically add the email action on failure to any workflow, but this would need to be developed.
Without touching your workflows you would need to scrape the Oozie jobs API, but this is also kind of reinventing the wheel.
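As a rough sketch of what scraping the jobs API could look like: the helpers below build a query URL and pick failed jobs out of an already decoded job list. This assumes the Oozie web services endpoint /v1/jobs with its jobtype, filter and len parameters, and job entries carrying id, appName and status fields; check these against your Oozie version. The function names are hypothetical, and fetching the URL and sending the email are left out.

```python
from urllib.parse import urlencode

# Workflow statuses treated as failures in this sketch.
FAILED_STATUSES = {"KILLED", "FAILED"}


def jobs_url(base_url, status, length=50):
    """Build a jobs query URL that filters workflow jobs by status."""
    query = urlencode(
        {"jobtype": "wf", "filter": "status=" + status, "len": length}
    )
    return base_url.rstrip("/") + "/v1/jobs?" + query


def failed_jobs(jobs):
    """Return (id, appName) pairs for job dicts whose status is a failure."""
    return [
        (job.get("id"), job.get("appName"))
        for job in jobs
        if job.get("status") in FAILED_STATUSES
    ]
```

A cron job could poll this URL, remember which job ids it has already reported, and send one email per new failure.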

Run a Google Cloud Function for each file in a bucket

I have a Google Cloud Function triggered by a Google Cloud Storage object.finalize event. When I deploy a new version of this function, I would like to run it for every existing file in the bucket (which have already been processed by the previous version of the function). Processing all the existing files in the bucket is a long running task, hence I don't think a Google Cloud Function which will process all files in a row is an option.
The best option I can see for now is to make a Google Cloud Function that I can trigger via HTTP, which will list all the files in the bucket and publish one event per file via Google PubSub. I would then process each of these events with a slightly modified version of my initial Google Cloud Function, one that accepts a PubSub event in place of the object.finalize storage event.
I think it can work but I was wondering if there was an easier way to perform this operation.

If the operation you're trying to perform may take longer than the maximum time that a Cloud Function can run, you will need to split that operation into multiple steps. Your approach of using a PubSub trigger for each individual file sounds like a valid way to do that to me.

One option might be to write a small program that lists all of the objects in a bucket and, for each object, posts a message to Cloud Pub/Sub that triggers your function in the same way a GCS change would.
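A minimal sketch of that fan-out step, with the cloud clients kept out so it stays self-contained: object_names stands in for the result of listing the bucket and publish stands in for a Pub/Sub publisher call. Both, along with the function name and the bucket and name message fields, are assumptions for illustration.

```python
import json


def fan_out(bucket, object_names, publish):
    """Publish one JSON message per object and return how many were sent.

    `publish` is any callable that accepts the message bytes, for example
    a thin wrapper around a Pub/Sub publisher client (not shown here).
    """
    count = 0
    for name in object_names:
        payload = json.dumps({"bucket": bucket, "name": name}).encode("utf-8")
        publish(payload)
        count += 1
    return count
```

Keeping the per-file message close to what the storage-triggered function already reads means the Pub/Sub-triggered variant only has to decode the message before running the same processing code.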

Google Cloud Dataflow: while in PubSub streaming mode, TextIO.Read uses massive amounts of vCPU time

I'm using Google Cloud Platform to transfer data from an Azure server to a BigQuery table (working nice and smoothly, functionally speaking).
The pipeline looks like this:
[Image: Dataflow streaming pipeline]
The 'FetchMetadata' part of the pipeline is a simple TextIO.Read implementation where I read a 66-line .csv file with metadata from a GCP Storage bucket:
PCollection<String> metaLine = p.apply(TextIO.Read.named("FetchMetadata")
.from("gs://my-bucket"));
When I use my pipeline in Batch mode this works like a charm: first the metadata file is loaded in the pipeline in less than a second of vCPU time and then the data itself is loaded in the pipeline. Now when running in Streaming mode I would love to replicate that behaviour to some extent but when I just use the same code there is a problem: when running the pipeline for just 15 minutes (actual time) the TextIO.Read block uses a whopping 4 hours of vCPU time. For a pipeline that will be permanently running for a low budget project this is unacceptable.
So my question: is it possible to change the code so that the file is re-read periodically (if the file changes I want the pipeline to pick it up, so let's say hourly updates) and not continuously, as it is doing right now?
I've found some documentation that mentions TextIO.Read.Bound, which looks like a good place to start solving this issue, but as far as I know it does not solve my periodic update problem.

I was stuck in a similar situation. The way I solved this problem is a bit different. I would like the community's insights into this solution.
I had files being updated every hour in a GCS bucket. I followed the blog post about Scheduling Dataflow Jobs from App Engine or Google Cloud Function.
I configured an App Engine endpoint to receive object change notifications from the GCS bucket that contained the files to be processed. For every file that was created (an update is also a create operation in an object store), the App Engine application submitted a job to Google Dataflow. The job read the lines from the file (its name came in the HTTP request body) and published them to a Google PubSub topic.
A streaming pipeline subscribed to that Google PubSub topic then processed the lines and wrote the relevant rows to BigQuery. This way, the streaming pipeline ran at the minimum worker count when idle, the files were ingested through a batch pipeline, and the streaming pipeline scaled with the volume of messages published to the Google PubSub topic.
In the tutorial for submitting jobs to Google Dataflow, the jar is executed on the underlying terminal. I modified the code to submit the job using templates, which can be executed with parameters. This way, job submission becomes very lightweight while still creating a job for every new file uploaded to the GCS bucket. Please refer to the documentation on executing Google Dataflow job templates for details.
Note: Please mention in the comments if the answer needs to be modified for the code snippets of the dataflow job template and app engine application and I will update the answer accordingly.

Using Kafka instead of Redis for the queue purposes

I have a small project that uses Redis for the task queue purposes. Here is how it basically works.
I have two components in the system: a desktop client (there can be more than one) and a server-side app. The server-side app has a pool of tasks for the desktop client(s). When a client connects, it is given the first available task from the pool. As the task has an id, when the desktop client comes back with the results, the server-side app can recognize the task by its id. Basically, I do the following in Redis:
Keep all the tasks as objects.
Keep queue (pool) of tasks in several lists: queue, provided, processing.
When a task is being provided to the desktop client, I use RPOPLPUSH in Redis to move the id from the queue list to the provided list.
When I get a response from the desktop client, I use LREM for the given task ID from the provided list (if it fails, I got a task that was not provided or was already processed, or just never existed - so, I break the execution). Then I use LPUSH to add the task id to the processing list. Given that I have unique task ids (controlled on the level of my app), I avoid duplicates in the Redis lists.
When the task is finished (the result received from the desktop client has been processed and saved), I remove the task from the processing list and delete the task object from Redis.
If anything goes wrong on any step (i.e. the task gets stuck on the processing or provided list), I can move the task back to the queue list and re-process it.
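The state transitions in the steps above can be modeled in memory as follows. This is only an illustration of the queue, provided and processing lists, not Redis code: the class and method names are hypothetical, plain Python containers stand in for the Redis lists, and putting a recovered task back at the head of the queue is an assumption.

```python
from collections import deque


class TaskPool:
    """In-memory model of the queue / provided / processing lists."""

    def __init__(self, task_ids):
        self.queue = deque(task_ids)
        self.provided = []
        self.processing = []

    def provide(self):
        """Hand out the next task id (models RPOPLPUSH queue -> provided)."""
        if not self.queue:
            return None
        task_id = self.queue.pop()        # take from the tail, like RPOP
        self.provided.insert(0, task_id)  # push onto the head, like LPUSH
        return task_id

    def accept_result(self, task_id):
        """Move a provided task to processing (models LREM then LPUSH).

        Returns False when the id was never provided or was already handled.
        """
        if task_id not in self.provided:
            return False
        self.provided.remove(task_id)
        self.processing.insert(0, task_id)
        return True

    def finish(self, task_id):
        """Drop a fully processed task from the processing list."""
        self.processing.remove(task_id)

    def requeue(self, task_id):
        """Move a stuck task from provided or processing back to the queue."""
        for current in (self.provided, self.processing):
            if task_id in current:
                current.remove(task_id)
                self.queue.appendleft(task_id)
                return True
        return False
```

Because each id lives in exactly one list at a time, a second result for the same task is rejected by accept_result, which mirrors the failed LREM check described above.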
Now, the question: is it somehow possible to do something similar in Apache Kafka? I do not need exactly the same behavior as in Redis. All I need is to be able to provide a task to the desktop client (it shouldn't be possible to provide the same task twice) and to mark/change its state according to the actual processing status (new, provided, processing), so that I can control the process and restore tasks that were not processed due to some problem. If it's possible, could anyone please describe the applicable workflow?

It is possible for Kafka to act as a standard queue. Check the consumer group feature.
If the question is about appropriateness, please also refer to the question "Is Apache Kafka appropriate for use as a task queue?"
We are using Kafka as a task queue. One consideration in its favor was that it was already part of our application ecosystem, which we found easier than adding one more component.

Bluemix Auto Scaling API

Is there a way for me to programmatically get notified when Bluemix auto scaling has scaled up or down?
I'm reading streaming data from a queue and would like to make sure that the load is balanced across my instances and that the data is partitioned correctly.

At present this kind of notification service is not available; all you can do is query the instance scaling history in the web UI. I think this requirement is interesting and should be considered as a future feature for developers.

This kind of alert isn't available yet, but you can write a simple script that monitors the output of
cf app (appname)
It returns the number of running instances and the state of each one. With the right combination of awk and grep (or a Perl script, for example) you could build your own alerter while waiting for this kind of functionality.
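A small sketch of such a monitor, written in Python instead of awk and grep. It assumes that the per-instance lines of the cf app output start with "#", the instance index, and then the state (for example "#0   running ..."); that layout may differ between CLI versions, so treat the regular expression as an assumption. The function names are hypothetical.

```python
import re
import subprocess

# Assumed layout of a per-instance line: "#<index>   <state>   ..."
INSTANCE_LINE = re.compile(r"^#(\d+)\s+(\S+)", re.MULTILINE)


def read_cf_app(app_name):
    """Run `cf app <app_name>` and return its standard output as text."""
    result = subprocess.run(
        ["cf", "app", app_name], capture_output=True, text=True, check=True
    )
    return result.stdout


def instance_states(output):
    """Map each instance index to its reported state."""
    return {int(idx): state for idx, state in INSTANCE_LINE.findall(output)}


def running_count(output):
    """Count the instances currently reported as running."""
    states = instance_states(output)
    return sum(1 for state in states.values() if state == "running")
```

Calling running_count(read_cf_app(appname)) on a timer and comparing it with the previous value is enough to detect a scale-up or scale-down and raise your own alert.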
