Does draining a Dataflow job that uses the FILE_LOADS write method ensure that all elements are written? - dataflow

You are writing elements to BigQuery in the following way:
pcoll.apply(BigQueryIO.writeTableRows()
.to(destination)
.withSchema(tableSchema)
.withMethod(BigQueryIO.Write.Method.FILE_LOADS)
.withTriggeringFrequency(org.joda.time.Duration.standardMinutes(10))
.withNumFileShards(10)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
When you drain the job, either via the gcloud CLI tool or the Google Cloud console, it seems that the job is considered "drained" almost instantly, even if withTriggeringFrequency had just fired shortly before. Is the behaviour of the drain function such that it triggers all pending writes?

Yes, Dataflow immediately closes any in-process windows and fires all triggers.
When you issue the Drain command, the pipeline stops accepting new inputs, the input watermark is advanced to infinity, and elements already in the pipeline continue to be processed. Drained jobs can safely be cancelled.
For reference, see this doc from Google: Effects of draining a job.
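For what it's worth, the same Drain request can also be issued programmatically by setting the job's requested state through the Dataflow REST API. A minimal sketch with the google-api-services-dataflow client (project, region and job ID are placeholders, and the builder details may differ between client library versions):

import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.gson.GsonFactory;
import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.Job;
import com.google.auth.http.HttpCredentialsAdapter;
import com.google.auth.oauth2.GoogleCredentials;
import java.util.Collections;

public class DrainJobExample {
  public static void main(String[] args) throws Exception {
    GoogleCredentials credentials = GoogleCredentials.getApplicationDefault()
        .createScoped(Collections.singletonList("https://www.googleapis.com/auth/cloud-platform"));

    Dataflow dataflow = new Dataflow.Builder(
            GoogleNetHttpTransport.newTrustedTransport(),
            GsonFactory.getDefaultInstance(),
            new HttpCredentialsAdapter(credentials))
        .setApplicationName("drain-example")
        .build();

    // Requesting JOB_STATE_DRAINED is the API equivalent of Drain in the
    // console or "gcloud dataflow jobs drain".
    Job drainRequest = new Job().setRequestedState("JOB_STATE_DRAINED");
    dataflow.projects().locations().jobs()
        .update("my-project", "us-central1", "my-job-id", drainRequest)
        .execute();
  }
}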

Related

Retry a Google Cloud Data Fusion Pipeline if it fails

We have several Cloud Data Fusion pipelines and some fail for seemingly random reasons, albeit rarely.
Is there a way to automatically retry a pipeline if it fails, say 3 times?
You can automatically retry a pipeline when it fails by using triggers. When it fails, you need to STOP and START the pipeline.
To use triggers with Data Fusion, select the event that should fire the trigger; in this case it is the Fails event. You can see this documentation about creating triggers.
Next you need to STOP and START the pipeline. This is the documentation about stopping and starting pipelines.
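If you prefer to script the STOP/START step instead of doing it in the UI, the same operations are exposed through the instance's CDAP REST API. A rough sketch using Java 11's HttpClient; the endpoint URL, the token handling and the DataPipelineWorkflow path are assumptions you should verify against your own instance and the documentation above:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestartPipeline {
  public static void main(String[] args) throws Exception {
    // Placeholders: the instance API endpoint, access token and pipeline name
    // must be replaced with your own values.
    String endpoint = "https://<instance>-<project>-dot-<region>.datafusion.googleusercontent.com/api";
    String token = "<ACCESS_TOKEN>"; // e.g. from gcloud auth print-access-token
    String pipeline = "my_pipeline";

    HttpClient http = HttpClient.newHttpClient();
    for (String action : new String[] {"stop", "start"}) {
      HttpRequest request = HttpRequest.newBuilder()
          .uri(URI.create(endpoint + "/v3/namespaces/default/apps/" + pipeline
              + "/workflows/DataPipelineWorkflow/" + action))
          .header("Authorization", "Bearer " + token)
          .POST(HttpRequest.BodyPublishers.noBody())
          .build();
      HttpResponse<String> response =
          http.send(request, HttpResponse.BodyHandlers.ofString());
      System.out.println(action + " -> " + response.statusCode());
    }
  }
}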

Uber Cadence task list management

We are using Uber Cadence and periodically we run into issues on the production environment.
The setup is the following:
One Java 14 BE with Cadence client 2.7.5
Cadence service version 0.14.1 with Postgres DB
There are multiple domains, for all domains the single BE server is registered as a worker.
What is visible in the logs is that sometimes, during a query, Cadence seems to lose stickiness to the BE service:
"msg":"query direct through matching failed on sticky, clearing sticky before attempting on non-sticky","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
...
In the backend, meanwhile, nothing is visible. However, during this time, if I check the pollers in the Cadence web client, I see that the task list is there, but it is no longer considered a decision handler (http://localhost:8088/domains/mydomain/task-lists/mytasklist/pollers). Because of this, pretty much the whole environment is dead, since nothing can make progress with the decision. The only option is to restart the backend service and let it re-register as a worker.
At this point the investigation is stuck, so some help would be appreciated.
Does anyone know how a worker or task list can lose its ability to be a decision handler? Is it managed by Cadence, e.g. based on how many errors the worker generates? I was not able to find anything about this.
As I understand it, when the stickiness is lost, Cadence will look for another worker to replay the workflow and continue it (in my case this will be the same worker, as there is only one). Is it possible that replaying the flow is not possible (although I think that would generate something in the backend log from the Cadence client), or that at that point the worker is already removed from the list and that causes the timeout?
Any help would be more than welcome! Thanks!
Does anyone know about how a worker or task list can lose its ability to be a decision handler
This will happen when the worker stops polling for decision tasks. For example, if you configure the worker to poll only for activity tasks, it will show up like that. So apparently it will also happen if, for some reason, the worker stops polling for decision tasks.
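To illustrate what decides this on the worker side, here is a sketch against a recent cadence-java-client; class names such as WorkerFactory differ slightly between client versions (e.g. 2.7.x), and MyWorkflowImpl/MyActivitiesImpl are hypothetical:

import com.uber.cadence.client.WorkflowClient;
import com.uber.cadence.client.WorkflowClientOptions;
import com.uber.cadence.serviceclient.ClientOptions;
import com.uber.cadence.serviceclient.WorkflowServiceTChannel;
import com.uber.cadence.worker.Worker;
import com.uber.cadence.worker.WorkerFactory;

public class WorkerSetup {
  public static void main(String[] args) {
    WorkflowClient workflowClient = WorkflowClient.newInstance(
        new WorkflowServiceTChannel(ClientOptions.defaultInstance()),
        WorkflowClientOptions.newBuilder().setDomain("mydomain").build());

    WorkerFactory factory = WorkerFactory.newInstance(workflowClient);
    Worker worker = factory.newWorker("mytasklist");

    // Registering only activities makes the worker poll activity tasks only,
    // so it will not show up as a decision handler for the task list.
    worker.registerActivitiesImplementations(new MyActivitiesImpl());

    // Registering workflow implementations is what makes it poll for decision
    // (and therefore also query) tasks.
    worker.registerWorkflowImplementationTypes(MyWorkflowImpl.class);

    factory.start();
  }
}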
As I understand when the stickiness is lost, cadence will check for another worker to replay the workflow and continue
Yes, as long as there is another worker polling for decision tasks. Note that query tasks are considered one of the decision task types (this is a design flaw; we are working on separating them).
From your logs:
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
This means that Cadence dispatched the query task to a worker, and the worker accepted it, but didn't respond back within the timeout.
It's very likely that there is a bug in your query handler logic. The bug caused the decision worker to crash (which means the Cadence Java client also has a bug, since user code crashing shouldn't crash the worker). A query task then looped over all instances of your worker pool and finally crashed all your decision workers.
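One thing worth double-checking is the query handlers themselves; with the standard annotations they look roughly like this (MyWorkflow and currentState are hypothetical names):

import com.uber.cadence.workflow.QueryMethod;
import com.uber.cadence.workflow.WorkflowMethod;

public interface MyWorkflow {
  @WorkflowMethod
  void run();

  // Query handlers run on the decision worker, so they must be fast,
  // read-only and must not throw or block; a hang or crash here surfaces
  // as the "deadline-exceeded" errors shown in the logs above.
  @QueryMethod
  String currentState();
}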

How to add a new Job to a Quartz cluster that needs to start running during the rolling update of the cluster?

We have a clustered Quartz scheduler running on a couple of application nodes. The application nodes need to be updated, and for high-availability reasons the update is done as a rolling update.
Together with the update, we need to add a new job, and that job needs to start running immediately - i.e. it can't wait until all nodes have been updated. The problem is that I can't control which node will run the new job, and if one of the old nodes runs the job, the job instantiation will fail (with a ClassNotFoundException), the trigger will be set to the ERROR state and the job won't run again.
One solution to this problem would be to do two updates: one to add the class to all nodes, and one to add the trigger. The main reason against this approach is that our ops procedures don't support it.
So is there also a way to schedule the new job and make it run reliably with a single update?
I just tried it, and it turned out that Quartz gets a ClassCastException while trying to acquire the trigger. The exception is wrapped in a JobPersistenceException and the trigger is left in the WAITING state.
So, although this could cause an error log entry in one of the old nodes, Quartz doesn't leave the trigger in a non-working state.
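For context, the scheduling involved is just the ordinary Quartz 2.x builder API. A minimal sketch, where NewReportJob is a hypothetical job class that only the updated nodes have on their classpath:

import org.quartz.JobDetail;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.Trigger;
import static org.quartz.JobBuilder.newJob;
import static org.quartz.SimpleScheduleBuilder.simpleSchedule;
import static org.quartz.TriggerBuilder.newTrigger;

public class ScheduleNewJob {
  // sched is the clustered Scheduler obtained from your SchedulerFactory.
  public static void scheduleNewReportJob(Scheduler sched) throws SchedulerException {
    JobDetail job = newJob(NewReportJob.class)
        .withIdentity("newReportJob", "reports")
        .build();

    Trigger trigger = newTrigger()
        .withIdentity("newReportTrigger", "reports")
        .startNow()
        .withSchedule(simpleSchedule().withIntervalInMinutes(15).repeatForever())
        .build();

    // If an old node acquires the trigger first and cannot load NewReportJob,
    // the observation above is that the trigger is left in the WAITING state
    // rather than ERROR, so an updated node can still pick it up later.
    sched.scheduleJob(job, trigger);
  }
}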

Zookeeper priority queue

My problem description is as follows:
I have n state-based, infinitely running database crawlers.
Currently, this is how it works:
We are using a single machine for crawling.
We have a three-level priority queue: HIGH, MEDIUM and LOW.
At the start, all database jobs are put into the low-priority queue.
A worker reads a job from the queue and performs the operation.
After finishing a job, it reschedules it with a delay of 5 minutes.
Solution I found
For the priority queue I can use:
http://zookeeper.apache.org/doc/r3.2.2/recipes.html#sc_recipes_priorityQueues
Problems I am still searching a solution for:
How to reschedule a job in the queue with a future schedule time. Is there a way to do that in ZooKeeper? (See the sketch after this list.)
Cancelling an already started job. Suppose a user changes his database authentication details. I want to stop the already running job for that database and restart it with the new details.
What I thought is that, when a worker starts a job, it will subscribe to that job's znode changes, and if something happens, it will stop that job and reschedule it.
Infinite queue
What I thought is that after finishing a job, the worker will remove it from the queue and re-add it with a future schedule time. (Its implementation depends on point 1.)
Is this a correct way of doing this infinite task?
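Not part of the original question, but one way to get the "reschedule with a future time" behaviour of point 1 on top of ZooKeeper is Apache Curator's DistributedDelayQueue recipe. A rough sketch, where the connection string, queue path and the string job IDs are placeholder assumptions:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.queue.DistributedDelayQueue;
import org.apache.curator.framework.recipes.queue.QueueBuilder;
import org.apache.curator.framework.recipes.queue.QueueConsumer;
import org.apache.curator.framework.recipes.queue.QueueSerializer;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.retry.ExponentialBackoffRetry;
import java.nio.charset.StandardCharsets;

public class CrawlerQueueSketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk1:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();

    QueueSerializer<String> serializer = new QueueSerializer<String>() {
      @Override public byte[] serialize(String item) {
        return item.getBytes(StandardCharsets.UTF_8);
      }
      @Override public String deserialize(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
      }
    };

    QueueConsumer<String> consumer = new QueueConsumer<String>() {
      @Override public void consumeMessage(String crawlJobId) {
        // Run one crawl pass for this database, then re-enqueue it with a
        // delay of 5 minutes, as in the put() call below.
      }
      @Override public void stateChanged(CuratorFramework c, ConnectionState s) { }
    };

    DistributedDelayQueue<String> queue = QueueBuilder
        .builder(client, consumer, serializer, "/crawler/queue")
        .buildDelayQueue();
    queue.start();

    // Enqueue a job that becomes visible to workers 5 minutes from now.
    queue.put("database-42", System.currentTimeMillis() + 5 * 60 * 1000L);
  }
}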

Select node in Quartz cluster to execute a job

I have some questions about Quartz clustering, specifically about how triggers fire / jobs execute within the cluster.
Does Quartz give any preference to nodes when executing jobs? For example, always or never the node that executed the same job the last time, or is it simply whichever node gets to the job first?
Is it possible to specify the node which should execute the job?
The answer to this will be something of an "it depends".
For Quartz 1.x, the answer is that the execution of the job is always (only) on a more-or-less random node, where "randomness" is really based on whichever node gets to it first. For "busy" schedulers (where there are always a lot of jobs to run) this ends up giving a pretty balanced load across the cluster nodes. For non-busy schedulers (only an occasional job to fire) it may sometimes look like a single node is firing all the jobs, because the scheduler looks for the next job to fire whenever a job execution completes - so the node that just finished an execution tends to find the next job to execute.
With Quartz 2.0 (which is in beta) the answer is the same as above for standard Quartz. But the Terracotta folks have built an Enterprise Edition of their TerracottaJobStore which offers more complex clustering control: as you schedule jobs, you can specify which nodes of the cluster are valid for the execution of the job, or you can specify node characteristics/requisites, such as "a node with at least 100 MB RAM available". This also works along with Ehcache, such that you can specify that the job should run "on the node where the data keyed by X is local".
I solved this problem for my web application using Spring + AOP + memcached. My jobs know, from the data they traverse, whether the job has already been executed, so the only thing I need to avoid is two or more nodes running the same job at the same time.
You can read it here:
http://blog.oio.de/2013/07/03/cluster-job-synchronization-with-spring-aop-and-memcached/
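As a rough illustration of that add-as-lock pattern (a sketch under my own assumptions, not the code from the blog post): memcached's add only succeeds if the key does not already exist, so whichever node adds the lock key first gets to run the job.

import net.spy.memcached.MemcachedClient;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

@Aspect
public class ClusterJobLockAspect {

  private final MemcachedClient memcached;

  public ClusterJobLockAspect(MemcachedClient memcached) {
    this.memcached = memcached;
  }

  // The pointcut and lock key are placeholders; adapt them to your job classes.
  @Around("execution(* com.example.jobs.*.execute(..))")
  public Object runOnSingleNode(ProceedingJoinPoint pjp) throws Throwable {
    String lockKey = "job-lock:" + pjp.getSignature().toShortString();
    // add() succeeds on only one node; expire after 10 minutes as a safety net.
    boolean acquired = memcached.add(lockKey, 600, "locked").get();
    if (!acquired) {
      return null; // another node is (or recently was) running this job
    }
    try {
      return pjp.proceed();
    } finally {
      memcached.delete(lockKey);
    }
  }
}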