How to make BigQueryIO wait for some DoFn input - apache-beam

In Apache Beam, once you have some PCollection input you can do
input.apply(ParDo.of(new SomeDoFn()))
However, BigQueryIO.read() can only be applied to the Pipeline instance, so my question is:
how can I make BigQueryIO.read() wait until some other DoFn finishes or produces at least one output? Should I put BigQueryIO in a different pipeline, or can it be done within the same one?

I don't think it's possible to make BigQueryIO.read() wait for some input, since it creates a PTransform<PBegin, PCollection<T>>, where the PBegin input type means it is supposed to be applied at the beginning of your pipeline.
I also don't see any other "read" PTransforms implemented in the BigQueryIO connector that would accept an input PCollection.
So it will very likely be easier to run it as a separate pipeline and use something like Apache Airflow to orchestrate the two.
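As a minimal sketch, here is what the downstream pipeline could look like (project, dataset, and table names are illustrative); the orchestrator would start it only after the pipeline containing your DoFn has finished:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ReadAfterUpstreamJob {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // BigQueryIO.readTableRows() is a PTransform<PBegin, PCollection<TableRow>>,
    // so it can only be applied to the Pipeline itself, not to a PCollection.
    PCollection<TableRow> rows = p.apply("ReadFromBigQuery",
        BigQueryIO.readTableRows().from("my-project:my_dataset.my_table"));

    // ... apply your downstream transforms to `rows` here ...

    p.run().waitUntilFinish();
  }
}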

Related

spark iterative programming - exit condition without launching a job

When writing iterative programs, a common situation is that you need to define a condition at which the program will stop executing and return the result. This stop condition can be, for example, rdd.isEmpty. The problem is that this "condition test" is an action which triggers a job to be executed, and therefore incurs scheduling, serialization, and other costs on each iteration.
def iterate(layer: RDD[Long]): RDD[Long] = {
  layer.cache()
  if (layer.isEmpty) return null
  val nextLayer = process(layer) // contains hash joins, joins, filters, cache
  iterate(nextLayer)
}
The timeline will look like:
[isempty][------spacing----][isempty][------spacing----][isempty]
What is the best way to do iterative programming in such a situation? We should not be forced to launch a job on each iteration.
Is there a method to check for an empty RDD without executing an action?
Possible solution:
As you can see in the image below, isEmpty is now executed only every 5 iterations; each iteration is represented by a periodic triplet of blue rectangles. I did this by modifying the stop condition to the following:
if (layer.index % 5 == 0 && layer.isEmpty) return null
But as you can see in the figure below, I am still getting actions executed as "run at ThreadPoolExecutor.java". Some research shows that those actions happen because I am doing broadcast hash joins of small DataFrames with larger ones.
(screenshot: ThreadPoolExecutor reason)
(screenshot: timeline)
You can try using
layer.cache()
layer.isEmpty
This means the emptiness check will still trigger an action, but the RDD will be cached, so when you pass it to the process method, the work already done for isEmpty is effectively skipped.
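As a rough sketch of the same idea using the Java API (the question's code is Scala; the sample data and the filter standing in for process() are illustrative):

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class CacheBeforeIsEmpty {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("cache-before-isempty").master("local[*]").getOrCreate();
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

    JavaRDD<Long> layer = jsc.parallelize(Arrays.asList(1L, 2L, 3L, 4L));

    // Mark the RDD for caching before the emptiness check: isEmpty() still
    // triggers a job, but whatever it evaluates stays cached and is reused below.
    layer.cache();
    if (!layer.isEmpty()) {
      JavaRDD<Long> next = layer.filter(x -> x % 2 == 0); // stands in for process(layer)
      System.out.println(next.count());
    }

    spark.stop();
  }
}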

How to change a process variable value through a remote REST API call for the current human task in jBPM 6.5.0.Final

I have many human tasks. After starting the process, I want to update some process variable values with a REST API call related to the current task. If anyone knows how to do that, please comment below.
I tried with /execute, but that only starts the task; how do I update a process variable for an already started process instance?
Based on the documentation, here is how to update a process variable. Note that this updates the variable at the process instance level rather than only for that specific task:
server/containers/{id}/processes/instances/{pInstanceId}/variables - POST
If you want to update a process variable from a task, you should do it during task completion. However, this requires the task to have output variables; otherwise it won't take any effect.
server/containers/{id}/tasks/{tInstanceId}/states/completed - PUT
In any case, the full REST documentation can be viewed at
{localhost}:{port}/kie-server/docs
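As an illustration only, a call to the first endpoint could look like this from plain Java (the host, container id, process instance id, credentials, and variable names are all placeholders; check the exact payload against the Swagger docs above):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class UpdateProcessVariables {
  public static void main(String[] args) throws Exception {
    String baseUrl = "http://localhost:8080/kie-server/services/rest";
    String containerId = "my-container";   // placeholder
    long processInstanceId = 42L;          // placeholder

    // JSON map of variable names to their new values (placeholders)
    String body = "{\"approved\": true, \"comment\": \"updated via REST\"}";

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(baseUrl + "/server/containers/" + containerId
            + "/processes/instances/" + processInstanceId + "/variables"))
        .header("Content-Type", "application/json")
        .header("Authorization", "Basic "
            + Base64.getEncoder().encodeToString("user:password".getBytes()))
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println("HTTP " + response.statusCode());
  }
}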

Talend: how to fork the output of a parent job and call either of child job based on some condition

I am learning Talend. I have a scenario where I have to apply an if/else-if condition to the output of the parent job and, based on the outcome, call one of the child jobs. I have thought of a few options, such as using global variables or context variables. Is it possible to configure the child jobs to listen for the global/context variable change and run if the condition matches? I tried to configure this, but failed to understand where these configurations can be done.
I even tried taking the output from the parent job's components into a tJavaRow, where I can write Java code with if/else-if conditions. I was thinking of explicitly calling the child jobs from the if/else branching, but I am not able to make any headway. Can someone please point me to the right approach? Any new approach is also welcome.
NOTE: We are using the free version of Talend.
If I understand correctly, this can be achieved using "Run If" triggers, like this:
Inside tJava, you can write some logic to calculate your variables.
On the If trigger, you write a condition that determines whether or not the component after it is run.
In my example, I'm not actually making use of what's inside tJava; I'm just getting the number of lines output by tLogRow, so tJava can be left out and the "Run If" triggers connected directly to tLogRow.
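For instance (the component name tLogRow_1 and the flag name are illustrative; NB_LINE is the row count Talend publishes to globalMap), the tJava code could set a flag:

// tJava_1: decide which child job should run, based on the number of rows
// produced by the upstream tLogRow component
Integer rowCount = (Integer) globalMap.get("tLogRow_1_NB_LINE");
globalMap.put("runChildA", rowCount != null && rowCount > 0);

The "Run If" condition on the trigger leading to the first tRunJob would then be ((Boolean) globalMap.get("runChildA")), and the condition on the trigger leading to the other tRunJob would be !((Boolean) globalMap.get("runChildA")).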

JBPM 5.4 process variables

I am new to jBPM 5.4. I want to set a process variable after completing one human task. I know we can add process variables at startup using a map. Can anyone tell me how to set a process variable in the middle of the process?
Result mapping & data associations, like for any other type of task.
See
https://docs.jboss.org/jbpm/v5.4/userguide/ch.human-tasks.html#d0e4970
Result mapping: Allows copying the value of result parameters of the human task to a process variable. Upon completion of the human task, the values will be copied. A human task has a result variable "Result" that contains the data returned by the human actor. The variable "ActorId" contains the id of the actor that actually executed the task.
See also
https://github.com/droolsjbpm/jbpm/blob/master/jbpm-bpmn2/src/test/resources/BPMN2-DataOutputAssociations.bpmn2
for an example.

How to save file at some point without closing it?

OUTPUT TO "logfile.txt".
FOR EACH ...:
    ...
    PUT "Some log data".
    OUTPUT CLOSE.
    OUTPUT TO "logfile.txt" APPEND.
    ...
END.
I haven't found an appropriate statement to save the file at some point. I don't want to use UNBUFFERED APPEND because it is supposedly slower. Maybe there are built-in logging tools? Maybe STREAMS could help me? The problem with my solution is that I have to specify the log filename each time I open it with an OUTPUT TO statement, and a nested procedure may not have a clue about the filename.
The question as it stands is still ambiguous.
If you want a way to route the output through a standard "service" similar to what LOG-MANAGER does, you can do that by using:
static members of a class,
an API in a persistent procedure and PUBLISHing to it, or
an API in a session super-procedure and calling its API.
STREAMS will give you a way to segregate output for a single procedure or class to a single file and keep that output from getting mingled with the production output. However, a stream is limited to the current program, so it is not a general solution for an application-wide logging facility.
There is no "save" option.
However... you can force output to be flushed with:
put control null(0).
"Supposedly slower" is awfully vague. Yes, there is potentially more IO with unbuffered output. But whether or not that really matters depends heavily on what you are doing and how it will be used. It is very unlikely that it actually matters.
A STREAM would certainly help to keep things organized and make it so that you don't have to know the name of the file in nested procedures.
Yes, there are built in logging tools. Look at the LOG-MANAGER system handle.
The code in the question would be better written as:
define stream logStream.

output stream logStream to value( "log.txt" ) append unbuffered.

for each customer no-lock:
    put stream logStream custName skip.
    /* put stream logStream control null(0). */  /* if you want to try fooling with buffered output... */
end.

output stream logStream close.