How to configure parameters of a Flink job at the application level? - scala

Can we configure parameters such as taskmanager.numberOfTaskSlots and taskmanager.memory.process.size at the application level?
I know that parallelism can be configured at the application level. What about the other parameters, and how can this be done in Scala?

FLIP-169: DataStream API for Fine-Grained Resource Requirements will bring some new capabilities along these lines when it lands in Flink 1.14.
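For what is already possible today, here is a minimal Scala sketch: per-job parallelism can be set from application code, and a Configuration can be passed to an environment you create yourself. Note that taskmanager.* settings passed this way are only honored by local/mini-cluster environments; on a real cluster they come from flink-conf.yaml (the class and option values below are just example choices):

```scala
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._

object ConfigFromApplication {
  def main(args: Array[String]): Unit = {
    // Per-job (and per-operator) parallelism can be set in the application itself.
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4)

    // Cluster-level options such as taskmanager.numberOfTaskSlots can only be passed
    // to an environment you build yourself, e.g. a local one used for testing.
    val conf = new Configuration()
    conf.setString("taskmanager.numberOfTaskSlots", "4")
    conf.setString("taskmanager.memory.process.size", "2g")
    val localEnv = StreamExecutionEnvironment.createLocalEnvironment(4, conf)

    localEnv
      .fromElements(1, 2, 3)
      .map(_ * 2)
      .print()

    localEnv.execute("config-from-application")
  }
}
```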

Related

Kafka to MongoDB using Spring Cloud Dataflow

I'm working on a project where I have to process data coming from a Kafka cluster and send it to MongoDB. The application should be deployable on Pivotal Cloud Foundry. After doing some research, I found the Spring Cloud Data Flow toolkit interesting since it can be deployed on PCF. I'm wondering how we can use it to create our real-time streaming pipeline. For the moment, I'm thinking about using Kafka Streams and Spring Cloud Stream to process and transform the streams of topics, but I don't know how to integrate that into SCDF, or how to send those streams to MongoDB. I'm sorry if my question is not clear; I'm entirely new to these frameworks.
Thanks in advance
You could use the named-destination support in SCDF to directly consume events from Kafka or any other message broker implementation supported by Spring Cloud Stream.
Now, for the write portion, you can use the out-of-the-box MongoDB sink application that we build, maintain, and ship.
If you have to do some processing before you write to MongoDB, you can create a custom Spring Cloud Stream application with the desired binder implementation [see: dev-guide/docs].
To put this all together: assume you have events coming from a Kafka topic named Customers, a custom processor doing some transformation on each of the received payloads (let's call it CustomerTransformer), and finally the part that writes to MongoDB.
Here's what this streaming data pipeline could look like when designed from SCDF's Dashboard:
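The equivalent stream definition in SCDF's DSL would look roughly like this (a sketch only, using the assumed topic and processor names from above and the out-of-the-box mongodb sink; the custom processor has to be registered with SCDF under a name such as customer-transformer first):

```
dataflow:> stream create --name customer-pipeline --definition ":Customers > customer-transformer | mongodb" --deploy
```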

Can the Vert.x event bus replace the need for Kafka?

I am evaluating the Vert.x framework to see if I can reduce the Kafka-based communication between my microservices, which are developed using Spring Boot.
The question is: can I replace
Kafka with the Vert.x event bus, and
Spring Boot microservices with Vert.x-based verticles?
To answer quickly, I would say it depends on your needs.
Yes, the event bus can be a good way to natively handle communication between microservice verticles using an asynchronous and non-blocking paradigm.
But in some cases you could need:
to handle some common enterprise patterns like replay mechanisms, persistence of messages, or transactional reading
to be able to process some kinds of messages in chronological order
to handle communication between multiple kinds of microservices that aren't all written with the same framework/toolkit, or even the same programming language
to handle reliability, resilience and failure recovery even when all your consumers/microservices/verticles are down
to handle dynamic horizontal scalability and monitoring of your consumers/microservices/verticles
to be able to work with a single cluster deployed across multiple datacenters and regions
In those cases I'd prefer to choose Apache Kafka over the native event bus or an old-fashioned JMS-compliant system.
It's not forbidden to use both the event bus and Kafka in the same microservices architecture, according to your real needs. For example, you could have one Kafka consumer group reading a Kafka topic to handle scaling, monitoring, failure recovery and replay, and then handle communication between your sub-verticles through the event bus.
I'll clarify the scalability and monitoring part a little and explain why I think it's simpler to handle with Kafka than with the native event bus and Vert.x cluster mode: Kafka lets us know in real time (through JMX metrics and the kafka-consumer-groups --describe command):
the "lag" of a topic, which corresponds to the number of unread messages
the number of consumers in each group that are listening to a topic
the number of partitions of a topic assigned to each consumer
I/O metrics
So it's possible to use an Elastic Stack or Prometheus+Grafana setup to monitor those metrics and use them to drive dynamic scaling (for example, when you know there is a need to temporarily increase the number of consumers, based on the lag metric, the number of partitions, and the CPU/RAM/swap metrics of your hosts).
To answer the second question (Vert.x or Spring Boot), my answer won't be very objective, but I'd vote for Vert.x for its performance on the JVM and especially for its simplicity. I'm a little tired of the Spring factory and its big layers of abstraction that hide a lot of issues under a mountain of annotations triggering a mountain of AOP.
Moreover, in the Java microservices world, there are other alternatives to Spring Boot, like the different implementations of MicroProfile (the Thorntail project, for example).
The event bus is not persistent. You should use it for fast verticle-to-verticle communication, and more generally to dispatch events that you know you can afford to lose if something crashes.
Kafka topics are persistent, and you should send events there either because you want other (possibly non-Vert.x) applications to consume them, or because you want to ensure that these events are not lost in case of failure.
A reactive (read "scalable and fault-tolerant") Vert.x application typically uses a combination of both the event bus and some replicated messaging system like AMQP / Kafka / etc.
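As a rough illustration of that combination (a sketch only; the verticle, topic and event-bus address names are made up), a verticle can consume durable events from Kafka with vertx-kafka-client and fan them out to local verticles over the event bus:

```scala
import java.util.{HashMap => JHashMap}
import io.vertx.core.{AbstractVerticle, Vertx}
import io.vertx.kafka.client.consumer.KafkaConsumer

// Hypothetical bridge verticle: Kafka for durable ingestion, the event bus for
// fast, non-persistent dispatch to the other verticles of the same application.
class OrdersBridgeVerticle extends AbstractVerticle {
  override def start(): Unit = {
    val config = new JHashMap[String, String]()
    config.put("bootstrap.servers", "localhost:9092")
    config.put("group.id", "orders-bridge")
    config.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    config.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    config.put("auto.offset.reset", "earliest")

    val consumer = KafkaConsumer.create[String, String](vertx, config)
    consumer.handler { record =>
      // The Kafka record is durable; the event-bus message is not, which is fine
      // for in-process fan-out where a lost message can be re-read from the topic.
      vertx.eventBus().publish("orders.incoming", record.value())
    }
    consumer.subscribe("orders")
  }
}

object OrdersBridgeApp {
  def main(args: Array[String]): Unit =
    Vertx.vertx().deployVerticle(new OrdersBridgeVerticle)
}
```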
On the question:
Can I replace spring boot microservices with vert.x based verticles?
Yes, definitely, although the two have different programming models.
If you want a more progressive approach and use Spring for structuring your application while using Vert.x for resource efficiency over your I/O and event processing then you can mix them, see https://github.com/vert-x3/vertx-examples/tree/master/spring-examples for examples.
Take a look at the Quarkus framework: in the workshop section you'll find Vert.x and Apache Kafka combined!

What's the use of ClientQuotaCallback in kafka-clients?

I found this line in its Javadoc comment: "Quota callback interface for brokers that enables customization of client quota computation." But it doesn't have any implementing class. Why? I googled it but couldn't find an example.
In Kafka, it was decided to have all broker pluggable APIs as Java interfaces. For that reason, there are a few interfaces in kafka-clients that are not related to the clients. This is because the server side is actually written in Scala.
Anything under org.apache.kafka.server is a pluggable API for the brokers. These can be used to customize some behaviours on the broker side:
http://kafka.apache.org/20/javadoc/org/apache/kafka/server/policy/package-summary.html
http://kafka.apache.org/20/javadoc/org/apache/kafka/server/quota/package-summary.html
For example, ClientQuotaCallback allows you to customize the way quotas are calculated by Kafka brokers: you can build quotas for groups, or have quotas scale when topics/partitions are created. KIP-257 details exactly how this all works.
Of course, for this to work you need to build implementations of these interfaces and put them on the classpath of your brokers. It's not something that can be used by clients directly.
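To make that concrete, here is a minimal, hypothetical Scala sketch of such a callback (the method names follow KIP-257; verify them against the Javadoc of the Kafka version your brokers run). The compiled jar would go on the broker classpath, and the callback is enabled through the client.quota.callback.class broker property:

```scala
import java.util.{Map => JMap}
import scala.jdk.CollectionConverters._
import org.apache.kafka.common.Cluster
import org.apache.kafka.common.security.auth.KafkaPrincipal
import org.apache.kafka.server.quota.{ClientQuotaCallback, ClientQuotaEntity, ClientQuotaType}

// Hypothetical callback: a single fixed quota per principal, shared by all of
// that principal's client ids, instead of the default per-client-id quotas.
class PerPrincipalQuotaCallback extends ClientQuotaCallback {
  private val defaultLimit = 1048576.0 // bytes/sec, an arbitrary example value

  override def configure(configs: JMap[String, _]): Unit = ()

  // Group all clients of the same principal under one quota "bucket".
  override def quotaMetricTags(quotaType: ClientQuotaType,
                               principal: KafkaPrincipal,
                               clientId: String): JMap[String, String] =
    Map("user" -> principal.getName).asJava

  // Return the limit for a bucket (null would mean "no quota").
  override def quotaLimit(quotaType: ClientQuotaType,
                          metricTags: JMap[String, String]): java.lang.Double =
    java.lang.Double.valueOf(defaultLimit)

  // Called when quotas are changed via the admin tools; nothing to do for a fixed limit.
  override def updateQuota(quotaType: ClientQuotaType,
                           entity: ClientQuotaEntity,
                           newValue: Double): Unit = ()

  override def removeQuota(quotaType: ClientQuotaType, entity: ClientQuotaEntity): Unit = ()

  override def quotaResetRequired(quotaType: ClientQuotaType): Boolean = false

  // Return true if cluster metadata changes (e.g. new partitions) should trigger a quota update.
  override def updateClusterMetadata(cluster: Cluster): Boolean = false

  override def close(): Unit = ()
}
```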

Kafka connect throttling

I have a requirement to consume messages on behalf of a set of lazy consumers who just expose REST APIs. Therefore, I am planning to have sink connectors that fetch messages from Kafka topics and perform HTTP POST operations against the exposed APIs.
One of the key factors to consider is throttling. What mechanism do you suggest for throttling the sink tasks to meet the tier SLA of the APIs? I understand that Kafka has a client quota feature, but what is the optimal way to keep track of API requests per minute or second that would allow the client quota to be adjusted dynamically?
I think the best way to implement rate-limiting for your REST API would be in your connector code, by blocking if necessary in SinkTask.put(). You may want to think about whether rate-limiting at the level of individual SinkTasks is sufficient, or whether you need it to be global (more complex, since coordination is involved).
The advantage of the Kafka quotas you were considering is that the distributed aspect is handled for you; however, I believe those can currently only be configured in terms of bytes transferred.
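Here is a rough sketch of the blocking-in-put() approach (hypothetical class and property names; a library rate limiter such as Guava's RateLimiter would be tidier than this hand-rolled window, and a global limit across tasks would still need external coordination, as noted above):

```scala
import java.util.{Collection => JCollection, Map => JMap}
import org.apache.kafka.connect.sink.{SinkRecord, SinkTask}

// Hypothetical HTTP sink task with a crude per-task rate limit (N requests/sec).
// Blocking inside put() applies back-pressure: Connect simply pauses consumption.
class ThrottledHttpSinkTask extends SinkTask {
  private var maxRequestsPerSecond = 10
  private var windowStart = 0L
  private var sentInWindow = 0

  override def version(): String = "0.1.0-sketch"

  override def start(props: JMap[String, String]): Unit = {
    // "max.requests.per.second" is a made-up connector property for this sketch.
    maxRequestsPerSecond =
      Option(props.get("max.requests.per.second")).map(_.toInt).getOrElse(10)
  }

  override def put(records: JCollection[SinkRecord]): Unit = {
    records.forEach { record =>
      throttle()
      postToRestApi(record) // the actual HTTP POST, omitted here
    }
  }

  private def throttle(): Unit = {
    val now = System.currentTimeMillis()
    if (now - windowStart >= 1000L) { windowStart = now; sentInWindow = 0 }
    if (sentInWindow >= maxRequestsPerSecond) {
      Thread.sleep(1000L - (now - windowStart)) // block until the next one-second window
      windowStart = System.currentTimeMillis(); sentInWindow = 0
    }
    sentInWindow += 1
  }

  private def postToRestApi(record: SinkRecord): Unit = ()

  override def stop(): Unit = ()
}
```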

Spring Cloud Stream flow as one application

As far as I know, there is an option to combine a couple of Spring Cloud Stream components into one application by using AggregateApplication or AggregateApplicationBuilder.
From what I understood, Spring will not use a broker (Rabbit or Kafka) for communication between steps in this situation; it will just pass the result from the previous step as an argument to the next, almost directly. Am I right?
If I am, is there another way to run more components in one instance of an application while still using a broker? I'm aware that this is not a great architecture for Cloud Stream, but right now I don't have infrastructure on which I can run Data Flow, and I would also like to benefit from the durability of a broker.
In general, aggregation has been designed as a replacement for communication over a message broker, to reduce latency by avoiding an extra hop. That being said, it may make sense to add an option to have the channels bound for use cases like yours. Can you open a feature request on GitHub, please?