Flink - multiple instances of a Flink application deployed on Kubernetes

I need help with a Flink application deployment on Kubernetes.
We have 3 sources that send trigger conditions in the form of SQL queries. In total there are ~3-6k queries, which is effectively a heavy load for a single Flink instance. I tried to execute them, but it was very slow and took a long time to start.
Because of the high volume of queries, we decided to create multiple Flink application instances, one per source, so that each Flink instance executes only ~1-2k queries.
Example: the SQL query sources are A, B and C.
Flink instances:
App A --> responsible for source A queries only
App B --> responsible for source B queries only
App C --> responsible for source C queries only
I want to deploy these instances on Kubernetes.
Questions:
a) Is it possible to deploy a standalone Flink jar with the built-in mini cluster? I.e. just start the main method with java -cp, passing the source name (A/B/C) as a command-line argument.
b) If one pod or Flink instance goes down, how can we handle its work in another pod or Flink instance? Is it possible to hand the work over to another pod or Flink instance?
Sorry if I mixed up two or more things together :(
I appreciate your help. Thanks.

Leaving aside issues of exactly-once semantics, one way to handle this would be to have a parallel source function that emits the SQL queries (one per sub-task), and a downstream FlatMapFunction that executes the query (one per sub-task). Your source could then send out updates to the query without forcing you to restart the workflow.
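Here is a minimal sketch of that pattern, assuming the Java DataStream API; QueryStore, evaluate() and the round-robin query assignment are illustrative placeholders, not part of the original answer.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;
import org.apache.flink.util.Collector;

import java.util.Arrays;
import java.util.List;

public class QueryRunnerJob {

    public static void main(String[] args) throws Exception {
        String sourceName = args[0];   // "A", "B" or "C", passed on the command line

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.addSource(new QueryEmitter(sourceName))   // one slice of the queries per sub-task
           .flatMap(new QueryExecutor())              // each sub-task executes its own queries
           .print();

        env.execute("query-runner-" + sourceName);
    }

    // Emits the SQL query strings that belong to this sub-task.
    public static class QueryEmitter extends RichParallelSourceFunction<String> {
        private final String sourceName;
        private volatile boolean running = true;

        QueryEmitter(String sourceName) { this.sourceName = sourceName; }

        @Override
        public void run(SourceContext<String> ctx) throws Exception {
            int subtask = getRuntimeContext().getIndexOfThisSubtask();
            int parallelism = getRuntimeContext().getNumberOfParallelSubtasks();
            List<String> queries = QueryStore.queriesFor(sourceName);   // placeholder query store
            for (int i = subtask; i < queries.size() && running; i += parallelism) {
                ctx.collect(queries.get(i));   // round-robin assignment across sub-tasks
            }
            // Keep the source alive; this is also where query updates could be emitted
            // without restarting the workflow.
            while (running) { Thread.sleep(1000); }
        }

        @Override
        public void cancel() { running = false; }
    }

    // Executes one query per incoming element; evaluate() is a placeholder.
    public static class QueryExecutor extends RichFlatMapFunction<String, String> {
        @Override
        public void flatMap(String query, Collector<String> out) {
            out.collect(evaluate(query));
        }
        private String evaluate(String query) { return "result of: " + query; }
    }

    // Placeholder; in practice the ~1-2k queries would come from a DB, file or config.
    static class QueryStore {
        static List<String> queriesFor(String sourceName) {
            return Arrays.asList("SELECT 1", "SELECT 2");
        }
    }
}

Scaling is then just a matter of setting the job (or operator) parallelism; each sub-task ends up owning a slice of the queries for its source.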

Related

Is it feasible to use one Spring Cloud Dataflow UI for multiple batch applications?

In the world of microservices, we have multiple applications, each with its own independent datasource and a corresponding batch application in its bounded context. That said, since SCDF requires a datasource to be configured to bring it up and monitor batch jobs, is it possible to configure a single, central SCDF server and UI to monitor all the batch jobs of the different microservices (each with its own DB), while keeping the Spring Batch metadata tables alongside the applications' business tables? I ask because it seems very clumsy, untidy and unmaintainable to keep so many SCDF servers running in the environment (please correct me if this feeling is unfounded).
Please bring me some clarity on this query. Thanks in advance.
Yes.
Set up and install a single SCDF server with an associated DB. For each of your task/batch apps,
override the TaskConfigurer and BatchConfigurer to accept a second datasource that refers to the SCDF DB, as shown here: https://github.com/spring-cloud/spring-cloud-task/tree/main/spring-cloud-task-samples/multiple-datasources.
The batch and task apps will then report their state to the SCDF DB while still using their own DB for the actual work.
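A minimal sketch of that override, assuming a Spring Boot app in which two DataSource beans ("businessDataSource" for the app's own data and "scdfDataSource" for the SCDF metadata DB) are defined elsewhere, e.g. from separate property groups as in the linked multiple-datasources sample; the bean names are illustrative, and the linked sample remains the authoritative reference.

import javax.sql.DataSource;

import org.springframework.batch.core.configuration.annotation.BatchConfigurer;
import org.springframework.batch.core.configuration.annotation.DefaultBatchConfigurer;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.cloud.task.configuration.DefaultTaskConfigurer;
import org.springframework.cloud.task.configuration.TaskConfigurer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MetadataInScdfDbConfiguration {

    // Task execution metadata is written to the SCDF database.
    @Bean
    public TaskConfigurer taskConfigurer(@Qualifier("scdfDataSource") DataSource scdfDataSource) {
        return new DefaultTaskConfigurer(scdfDataSource);
    }

    // Spring Batch job/step metadata tables also live in the SCDF database.
    @Bean
    public BatchConfigurer batchConfigurer(@Qualifier("scdfDataSource") DataSource scdfDataSource) {
        return new DefaultBatchConfigurer(scdfDataSource);
    }
}

The app's readers, writers and business repositories keep using the business datasource, so only the metadata tables end up in the SCDF DB.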

Batch Processing on Kubernetes

Does anyone here have experience with batch processing (e.g. Spring Batch) on Kubernetes? Is it a good idea? How do you prevent batch processing from processing the same data if we use the Kubernetes auto-scaling feature? Thank you.
Does anyone here have experience with batch processing (e.g. Spring Batch) on Kubernetes? Is it a good idea?
For Spring Batch, we (the Spring Batch team) do have some experience on the matter which we share in the following talks:
Cloud Native Batch Processing on Kubernetes, by Michael Minella
Spring Batch on Kubernetes, by me.
Running batch jobs on kubernetes can be tricky:
pods may be re-scheduled by k8s on different nodes in the middle of processing
cron jobs might be triggered twice
etc
This requires additional non-trivial work on the developer's side to make sure the batch application is fault-tolerant (resilient to node failure, pod re-scheduling, etc) and safe against duplicate job execution in a clustered environment.
Spring Batch takes care of this additional work for you and can be a good choice to run batch workloads on k8s for several reasons:
Cost efficiency: Spring Batch jobs maintain their state in an external database, which makes it possible to restart them from the last save point in case of job/node failure or pod re-scheduling
Robustness: Safe against duplicate job executions thanks to a centralized job repository
Fault-tolerance: Retry/Skip failed items in case of transient errors like a call to a web service that might be temporarily down or being re-scheduled in a cloud environment
I wrote a blog post in which I explain all these aspects in detail with code examples. You can find it here: Spring Batch on Kubernetes: Efficient batch processing at scale.
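As a rough illustration of the restartability and retry/skip points above, here is a minimal chunk-oriented job in Spring Batch 4 Java config; the reader, writer and TransientServiceException are placeholders, not code from the blog post.

import java.util.Arrays;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.support.ListItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class ResilientJobConfiguration {

    @Bean
    public Job job(JobBuilderFactory jobs, StepBuilderFactory steps) {
        Step step = steps.get("step")
                .<String, String>chunk(100)                   // state is committed per chunk in the job repository
                .reader(reader())
                .writer(writer())
                .faultTolerant()
                .retryLimit(3)
                .retry(TransientServiceException.class)       // retry transient failures (e.g. a flaky web service)
                .skipLimit(10)
                .skip(TransientServiceException.class)        // skip items that still fail after retries
                .build();
        return jobs.get("resilientJob")
                .start(step)
                .build();      // restartable by default: a re-run resumes from the last committed chunk
    }

    @Bean
    public ItemReader<String> reader() {
        return new ListItemReader<>(Arrays.asList("item1", "item2", "item3"));   // placeholder data
    }

    @Bean
    public ItemWriter<String> writer() {
        return items -> items.forEach(System.out::println);                      // placeholder writer
    }

    // Placeholder for an exception thrown by a temporarily unavailable downstream service.
    public static class TransientServiceException extends RuntimeException {}
}

Because the job repository records every committed chunk, re-running the same job instance after a node failure or pod re-scheduling resumes from the last commit instead of reprocessing everything.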
How do you prevent batch processing from processing the same data if we use the Kubernetes auto-scaling feature?
Making each job process a different data set is the way to go (a job per file, for example). But there are other patterns that you might be interested in; see Job Patterns in the k8s docs.
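For the "job per file" idea, one possible sketch is to make the file an identifying job parameter, so each Kubernetes Job pod is launched with the file it owns and Spring Batch refuses to re-run a completed job instance for the same file; the parameter name and launcher wiring below are illustrative.

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class LaunchJobForOneFile {

    // inputFile would typically be passed to the pod as a container argument or env variable.
    public static void launch(JobLauncher jobLauncher, Job job, String inputFile) throws Exception {
        JobParameters params = new JobParametersBuilder()
                .addString("inputFile", inputFile)   // identifying parameter: one job instance per file
                .toJobParameters();
        jobLauncher.run(job, params);
    }
}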

Apache Spark standalone settings

I have an Apache Spark standalone setup.
I wish to start 3 workers to run in parallel:
I use the commands below.
./start-master.sh
SPARK_WORKER_INSTANCES=3 SPARK_WORKER_CORES=2 ./start-slaves.sh
I tried to run a few jobs; below are the Spark UI results.
Ignore the last three applications that failed. Below are my questions:
Why do I have just one worker displayed in the UI despite asking Spark to start 3, each with 2 cores?
I want to partition my input RDD for better performance. For the first two jobs, with no partitions specified, the time was 2.7 min. Here my Scala source code had the following:
val tweets = sc.textFile("/Users/soft/Downloads/tweets").map(parseTweet).persist()
In my third job (4.3 min) I had the following:
val tweets = sc.textFile("/Users/soft/Downloads/tweets",8).map(parseTweet).persist()
I expected a shorter time with more partitions (8). Why was the result the opposite of what I expected?
Apparently you have only one active worker; you need to investigate why the other workers are not reported, by checking the Spark logs.
More partitions doesn't always mean that the application runs faster. You need to check how you are creating the partitions from the source data, the amount of data being partitioned, how much data is being shuffled, and so on.
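As a quick sanity check (a sketch in the Java API, though the question uses Scala; the calls are equivalent), you can print what Spark actually created before assuming that more partitions means faster:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PartitionCheck {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("partition-check")
                .setIfMissing("spark.master", "local[*]");   // only used when not submitted to a cluster
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> defaultSplit = sc.textFile("/Users/soft/Downloads/tweets");
            JavaRDD<String> eightSplits = sc.textFile("/Users/soft/Downloads/tweets", 8);

            System.out.println("default partitions:  " + defaultSplit.getNumPartitions());
            System.out.println("requested 8, got:    " + eightSplits.getNumPartitions());
            System.out.println("default parallelism: " + sc.defaultParallelism());   // cores the work is split over
        }
    }
}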
In case you are running on a local machine, it is quite normal to just start a single worker with several CPUs, as shown in the output. It will still split your task over the available CPUs in the machine.
Partitioning your file will happen automatically depending on the amount of available resources, and it works quite well most of the time. Spark (and partitioning the files) comes with some overhead, so often, especially on a single machine, Spark adds so much overhead that it slows down your process. The added value comes with large amounts of data on a cluster of machines.
Assuming that you are starting a stand-alone cluster, I would suggest using the configuration files to set up the cluster and using start-all.sh to start it.
First, in spark/conf/slaves (copied from spark/conf/slaves.template), add the IPs (or server names) of your worker nodes.
Configure spark/conf/spark-defaults.conf (copied from spark/conf/spark-defaults.conf.template). At a minimum, set the master to the server that runs your master.
Use spark-env.sh (copied from spark-env.sh.template) to configure the cores per worker, memory, etc.:
export SPARK_WORKER_CORES="2"
export SPARK_WORKER_MEMORY="6g"
export SPARK_DRIVER_MEMORY="4g"
export SPARK_REPL_MEM="4g"
Since it is standalone (and not hosted on a Hadoop environment), you need to share (or copy) the configuration (or rather the complete Spark directory) to all nodes in your cluster. The data you are processing also needs to be available on all nodes, e.g. directly from a bucket or a shared drive.
As suggested by @skjagini, check the various log files in spark/logs/ to see what's going on. Each node writes its own log files.
See https://spark.apache.org/docs/latest/spark-standalone.html for all options.
(we have a setup like this running for several years and it works great!)

Vertx | Global state of Verticles in a cluster

Newbie alert.
I'm trying to write a simple module in Vert.x that polls the database (Postgres) every 10 seconds and pushes the results to the clients. I'm thinking of confining the blocking code (querying the database via JDBC) to a worker verticle, with the rest of the layers above it completely non-blocking and async.
This module will be packaged as a jar and distributed to different apps (typically webapps) which can subscribe to the event bus via the JavaScript bridge.
My question is: in a clustered environment where I have 5 processes of the webapp running with the Vert.x modules, how can I ensure that only one verticle is querying the database? I don't want all the verticles querying the database and adding more load. Or is there a different way to think about solving this problem? I'm using Vert.x version 3.4.1.
There are two ways your verticle can be multiplied:
if you instantiate multiple instances when you deploy your verticle
if you cluster your Vert.x instances in different JVMs or on different hosts
You could try to control the number of instances of the verticle which executes the query, i.e. ensure that this verticle exists in only one of your Vert.x instances and is deployed with only one instance.
But this has several drawbacks:
your deployment is not transparent, i.e. your cluster nodes differ in their deployment structure
if the cluster node where the query verticle is running dies, you have no fallback
So the best approach is to deploy the verticle on all instances and synchronize it.
I see 3 possibilities:
1) Use Hazelcast (the cluster manager of Vert.x) to synchronize: http://vertx.io/docs/apidocs/io/vertx/spi/cluster/hazelcast/HazelcastClusterManager.html#getLockWithTimeout-java.lang.String-long-io.vertx.core.Handler-. There are also data structures available which are synchronized over the cluster: http://vertx.io/docs/apidocs/io/vertx/spi/cluster/hazelcast/HazelcastClusterManager.html#getSyncMap-java.lang.String-. (A sketch of this option follows the list.)
2) Use your database as the synchronization point. You could add a simple table which stores the last execution time in millis. Each polling module first checks whether it is time to execute the next poll; the module that executes the poll also updates the time. This has to be done in one transaction, with an explicit lock on the time table.
3) Use Redis with the GETSET functionality (https://redis.io/commands/getset). You can store the time in millis in a key and ensure with GETSET that the update of the time is atomic, so only the polling module that managed to set the key in Redis will execute the poll.
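Here is a rough sketch of option 1), assuming Vert.x is started in clustered mode so that the lock obtained via vertx.sharedData() is cluster-wide (backed by Hazelcast); the lock name, event bus address and pollDatabase() are illustrative placeholders, not part of the original answer.

import io.vertx.core.AbstractVerticle;
import io.vertx.core.json.JsonObject;
import io.vertx.core.shareddata.Lock;

public class PollingVerticle extends AbstractVerticle {

    @Override
    public void start() {
        // Deploy this verticle on every node; the cluster-wide lock decides who actually polls.
        vertx.setPeriodic(10_000, timerId ->
            vertx.sharedData().getLockWithTimeout("db-poll-lock", 500, res -> {
                if (res.succeeded()) {
                    Lock lock = res.result();
                    try {
                        JsonObject rows = pollDatabase();                     // placeholder for the JDBC/async query
                        vertx.eventBus().publish("INTERESTED_PARTIES", rows);
                    } finally {
                        lock.release();
                    }
                }
                // If the lock could not be acquired within 500 ms, another node is polling this round.
            }));
    }

    private JsonObject pollDatabase() {
        return new JsonObject().put("todo", "query Postgres here");           // placeholder
    }
}

Note that the lock only prevents concurrent polls; if the timers on two nodes fire a few seconds apart, both can still poll within the same window, so in practice you would combine the lock with the last-execution-time check from option 2).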
I'm giving my naive solution here; I don't know whether it completely solves your problem, but here is my thought process.
1) For the polling bit: yes, you can have a worker verticle for the blocking calls (or, IMHO, you could go async here too, since there is already an async Postgres client) combined with the every-10-seconds part. A code snippet like this can help you:
vertx.setPeriodic(10000, id -> {
    // This handler will get called every 10 seconds.
    // fetchFromJdbc() and eventBus are placeholders for your JDBC call and vertx.eventBus().
    JsonObject jdbcObject = fetchFromJdbc();
    eventBus.publish("INTRESTED_PARTIES", jdbcObject);
});
2) For the listening part, all the other verticles can subscribe to the event bus and listen on that address; they will get the message whenever something happens.
3) For ensuring that not all running instances of your jar start polling the database, I think the best way is not to deploy the polling verticle inside any of the jars, but to run it standalone using the vertx command line, like:
vertx run DatabasePoller.java -cluster
And if you really want to be fancy, you could throw in Service Discovery to ensure that if the verticle's service is already registered, no other deployment triggers a registration.
But I want to give you a thumbs up for considering events here; it is a much better way of handling inter-system communication.

Launch spark job on-demand from code

What is the recommended way to launch a Spark job on demand from within an enterprise application (in Java or Scala)? There is a processing step which currently takes several minutes to complete. I would like to use a Spark cluster to reduce the processing time to, say, less than 15 seconds:
Rewrite the time-consuming process in Spark and Scala.
The parameters would be passed to the JAR as command-line arguments. The Spark job then acquires the source data from a database, does the processing and saves the output in a location readable by the enterprise application.
Question 1: How do I launch the Spark job on demand from within the enterprise application? The Spark cluster (standalone) is on the same LAN but separate from the servers on which the enterprise app is running.
Question 2: What is the recommended way to transmit the processing results back to the caller code?
Question 3: How do I notify the caller code about job completion (or failure, such as the Spark cluster being down, a job timeout, or an exception in the Spark code)?
You could try spark-jobserver. Upload your spark.jar to the server; from your application you can then call the job in your spark.jar using the REST interface. To know whether your job has completed, you can keep polling the REST interface. When your job completes, if the result is very small you can get it from the REST interface itself, but if the result is huge it is better to save it to some DB.
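For illustration, a hedged sketch of the trigger-and-poll flow against the spark-jobserver REST interface; the endpoint paths and status values follow the spark-jobserver README (adjust to the version you deploy), and the host, appName, classPath and job config below are placeholders. It uses the JDK 11 HttpClient.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class JobServerClient {

    private static final String JOBSERVER = "http://spark-jobserver:8090";   // placeholder host
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    public static void main(String[] args) throws Exception {
        // 1. Trigger the job (the jar is uploaded beforehand via POST /jars/<appName>).
        HttpRequest submit = HttpRequest.newBuilder()
                .uri(URI.create(JOBSERVER + "/jobs?appName=myApp&classPath=com.example.MyJob"))
                .POST(HttpRequest.BodyPublishers.ofString("input.param = 42"))   // job config, placeholder
                .build();
        String submitResponse = HTTP.send(submit, HttpResponse.BodyHandlers.ofString()).body();
        String jobId = extractJobId(submitResponse);

        // 2. Poll for completion; small results come back in the status JSON,
        //    large results should be written by the job itself to a DB or shared store.
        while (true) {
            HttpRequest status = HttpRequest.newBuilder()
                    .uri(URI.create(JOBSERVER + "/jobs/" + jobId))
                    .GET()
                    .build();
            String body = HTTP.send(status, HttpResponse.BodyHandlers.ofString()).body();
            if (body.contains("\"FINISHED\"") || body.contains("\"ERROR\"")) {
                System.out.println(body);   // contains the status and, for small jobs, the result
                break;
            }
            Thread.sleep(1_000);
        }
    }

    // Naive placeholder for pulling "jobId" out of the response; use a JSON library in real code.
    private static String extractJobId(String json) {
        int colon = json.indexOf(':', json.indexOf("\"jobId\""));
        int start = json.indexOf('"', colon) + 1;
        return json.substring(start, json.indexOf('"', start));
    }
}

An ERROR status (or a failed submission) covers question 3; for question 2, small results can be read from the status response and larger ones from whatever store the job writes to.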