Force Apache Flink to execute at a given point - scala

It is my understanding that Apache Flink does not actually run the operations that you ask it to until the result of those operations is needed for something. This makes it difficult to time exactly how long each operation takes, which is exactly what i am trying to do in order to compare its efficiency to Apache Spark. Is there a way to force it to run the operations when I want it to?

When running a Flink program one defines the topology and operators to be executed on a cluster. One triggers the job execution by calling env.execute where env is either an ExecutionEnvironment or a StreamExecutionEnvironment. There is one exception for batch jobs which are the API calls collect and print which trigger an eager execution.
You could use the web ui to extract the runtime of different operators. For each operator you see when it's deployed and when it finished execution.

Related

Flink StreamingEnvronment does not terminate when using Kafka as source

I am using Flink for a Streaming Application. When creating the stream from a Collection or a List, the application terminates and everything after the "env.execute" gets executed normally.
I need to use a different source for the stream. More precisely, I use Kafka as a source (env.addSource(...)). In this case, the program just block while reaching the end of the stream.
I created an appropriate Deserialization Schema for my stream, having an extra event that signals the end of the stream.
I know that the isEndOfStream() condition succeeds on that point (I have an appropriate message printed on the screen in this case).
At this point the program just stops and does nothing, and so the commands that follow the "execute" line aren't on my disposal.
I am using Flink 1.7.2 and the flink-connector-kafka_2.11, with Scala 2.11.12. I am executing using the IntelliJ environment and Maven.
While researching, I found a suggestion to throw an exception while getting at the end of the stream (using the Schema's capabilities). That does not support my goal because I also have more operators/commands within the execution of the environment that need to be executed (and do get executed correctly at this point). If i choose to disrupt the program by throwing an exception I would lose everything else.
After the execution line I use the .getNetRuntime() function to measure the running time of my operators within the stream.
I need to have StreamingEnvironment end like it does when using a List as a source. Is there a way to remove Kafka at this point for example?

Spring batch job is thread-safe?

I need to parallelize a single step of a batch spring job. Before the step to be parallelized, tasklets are run that put some results in the parameters of the job.
The results produced by the tasklets, are necessary to execute the Partitioner and the Items of the step to be parallelized.
A doubt is arising that I really can't solve. Since I can have the same job running simultaneously multiple times with different initial parameters, are the tasklets and step items safe thread-safe?
No, tasklets and chunk-oriented step components are not thread-safe. If they are shared between multiple job instances/executions running concurrently, you need to make them thread-safe.
You can achieve this by using JobScoped steps and StepScoped readers/writers. You can also use the SynchronizedItemStreamReader and the (upcoming) SynchronizedItemStreamWriter to make readers and writers thread-safe. All item readers and writers provided by Spring Batch have a mention about their thread-safety in the Javadoc.
You do not want to run multiple instances of the same job. It would be better to run multiple tasks or processes in the same step and or job. You might want to lookup job partitioning, and or Remote Chucking to do concurrent processing.
If it has to be isolated jobs then you might have your concurrent jobs write out to say a message que as their end (writer) step, and then have another job listen to read from that que.
https://docs.spring.io/spring-batch/2.1.x/cases/parallel.html

can spring batch be used as job framework for non batch jobs (regular job)

Is it possible to use spring batch as a regular job framework?
I want to create a device service (microservice) that has the responsibility
to get events and trigger jobs on devices. The devices are remote so it will take time for the job to be complete, but it is not a batch job (not periodically running or partitioning large data set).
I am wondering whether spring batch can still be used a job framework, or if it is only for batch processing. If the answer is no, what jobs framework (besides writing your own) are famous?
Job Description:
I need to execute against a specific device a job that will contain several steps. Each step will communicate with a device and wait for a device to confirm it executed the former command given to it.
I need retry, recovery and scheduling features (thought of combining spring batch with quartz)
Regarding read-process-write, I am basically getting a command request regarding a device, I do a little DB reads and then start long waiting periods that all need to pass in order for the job/task to be successful.
Also, I can choose (justify) relevant IMDG/DB. Concurrency is outside the scope (will be outside the job mechanism). An alternative that came to mind was akka actors. (job for a device will create children actors as steps)
As far as I know - not periodically running or partitioning large data set are not primary requirements for usage of Spring Batch.
Spring Batch is basically a read - process - write framework where reading & processing happens item by item and writing happens in chunks ( for chunk oriented processing ) .
So you can use Spring Batch if your job logic fits into - read - process - write paradigm and rest of the things seem secondary to me.
Also, with Spring Batch , you should also evaluate the part about Job Repository . Spring Batch needs a database ( either in memory or on disk ) to store job meta data and its not optional.
I think, you should put more explanation as why you need a Job Framework and what kind of logic you are running that you are calling it a Job so I will revise my answer accordingly.

Using futures in Spark-Streaming & Cassandra (Scala)

I am rather new to spark, and I wonder what is the best practice when using spark-streaming with Cassandra.
Usually, when performing IO, it is a good practice to execute it inside a Future (in Scala).
However, a lot of the spark-cassandra-connector seems to operate synchronously.
For example: saveToCassandra (com.datastax.spark.connector.RDDFunctions)
Is there a good reason why those functions are not async ?
should I wrap them with a Future?
While there are legitimate cases when you can benefit from asynchronous execution of the driver code it is not a general rule. You have to remember that the driver itself is not the place where actual work is performed and Spark execution is a subject of different types of constraints in particular:
scheduling constraints related to resource allocation and DAG topology
batch order in streaming applications
Moreover thinking about the actions like saveToCassandra as IO operation is a significant oversimplification. Spark actions are just entry points for Spark jobs where typically IO activity is just a tip of the iceberg.
If you perform multiple actions per batch and have enough resources to do it without negative impact on individual jobs or you want to perform some type of IO in the driver thread itself then async execution can be useful. Otherwise you probably wasting your time.

Total number of jobs in a Spark App

I already saw this question How to implement custom job listener/tracker in Spark? and checked the source code to find out how to get the number of stages per job but is there any way to track programatically the % of jobs that got completed in a Spark app?
I can probably get the number of finished jobs with the listeners but I'm missing the total number of jobs that will be run.
I want to track progress of the whole app and it creates quite a few jobs but I can't find to find it anywhere.
#Edit: I know there's a REST endpoint for getting all the jobs in an app but:
I would prefer not to use REST but to get it in the app itself (spark running on AWS EMR/Yarn - getting the address probably is doable but I'd prefer to not do it)
that REST endpoint seems to be returning only jobs that are running/finished/failed so not total number of jobs.
After going through the source code a bit I guess there's no way to see upfront how many jobs will there be since I couldn't find any place where Spark would be doing such analysis upfront (as jobs are submitted in each action independently Spark doesn't have a big picture of all the jobs from the start).
This kind of makes sense because of how Spark divides work into:
jobs - which are started whenever the code which is run on the driver node encounters an action (i.e. collect(), take() etc.) and are supposed to compute a value and return it to the driver
stages - which are composed of sequences of tasks between which no data shuffling is required
tasks - computations of the same type which can run in parallel on worker nodes
So we do need to know stages and tasks upfront for a single job to create the DAG but we don't necessarily need to create a DAG of jobs, we can just create them "as we go".