Number of runtime instances (pods) for stream deployments on Spring Cloud Data Flow - Kubernetes

We are currently busy with a new project where we want to introduce SCDF, but we are running into one major issue and were wondering whether you have faced a similar issue and how you solved it.
What we saw is that for every stream we create in SCDF, the deployment (on Kubernetes) creates separate instances of the microservices per stream. So if microservice A is used in 3 different streams, at runtime we have 3 instances of microservice A. Our solution has a lot of reusable microservices, but if SCDF instantiates these microservices per stream, we end up running roughly 400 instances (pods) in production, and if we scale on top of this, we are using an enormous amount of resources. We need to find a way to share pods (instances) across streams.
Did you face this issue? If yes, what was your approach to this?

There are a couple of ways to reduce the number of pods.
Use function composition. All of the prepackaged apps are now function-based, meaning you can combine functions into a single source, sink, or processor app. The SCDF stream definition requires at least a source and sink, but the out-of-the-box functions are designed to be reused in custom apps, which may apply the functions to implement intermediate steps as necessary (a rough sketch of what this looks like in a custom app follows at the end of this answer). Bear in mind that a composed function processes data in memory, eliminating the messaging middleware used to stream data between separate pods. This could make your app more susceptible to data loss. There are always trade-offs.
Use named destinations: You may share parts of a streaming pipeline using named destinations. This allows you to fan-in or fan-out. In this example, 3 stream definitions enable 2 sources to feed a shared processor and sink.
source1 > :my-named-destination
source2 > :my-named-destination
:my-named-destination > processor1 | sink1
The commercial edition of SCDF supports stream definitions using custom components that implement multiple inputs/outputs. This gives you options similar to the above, where custom routing logic is implemented internally.
You can deploy a custom task in place of a stream if appropriate for your use case. The task may incorporate out of the box functions and function composition as needed.
An important consideration when combining components is increased coupling and dependencies among pipeline steps. Simple linear processing creates more pods but is much simpler to implement, deploy, manage, and reason about.
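To make the first option above concrete, here is a rough sketch of function composition in a custom Spring Cloud Stream processor app. The function names, payload type, and property value below are made up for illustration; the point is that two logical processing steps run inside one deployable unit (one pod) instead of two.

import java.util.function.Function;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;

// Hypothetical custom processor that composes two functions into a single app,
// so the intermediate step does not need its own pod.
@SpringBootApplication
public class ComposedProcessorApplication {

    public static void main(String[] args) {
        SpringApplication.run(ComposedProcessorApplication.class, args);
    }

    @Bean
    public Function<String, String> enrich() {
        return payload -> payload + " [enriched]";   // made-up step 1
    }

    @Bean
    public Function<String, String> uppercase() {
        return String::toUpperCase;                  // made-up step 2
    }
}

With spring-cloud-stream and a binder on the classpath, a property such as spring.cloud.function.definition=enrich|uppercase composes the two beans into a single processing function, so the stream definition only has to reference this one app.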

Related

Kafka and parallel consumer: why order is important in a microservice architecture

I started to dive into the Kafka ecosystem.
I was surprised to find out that, by default, each consumer digests only one "event" at a time, sequentially!
This follows from offset acknowledgement, the unit of parallelism being at the partition level, and some other things... you can find nice details here.
If I need to consume received messages in parallel in my application node's thread pool, I have to put in some non-default development effort to get it.
On the other hand, several technologies have their own recipes for this: quarkus/smallrye, confluentinc's parallel-consumer, spring, ...
I was hoping to find a by-default code configuration to get it.
This suggests to me that perhaps some other technologies are more suitable for consuming messages straightforwardly...
Why is a parallel consumer not provided by default in the libraries?
Why is order important in a microservice architecture?
KafkaConsumer is a relatively low-level object that is basically capable of reading records from a given offset position, seeking to a particular offset, and reading and saving that offset in the existing Kafka store (called __consumer_offsets). Similarly, the receive API is fully synchronous, with its poll(Duration) signature.
If more custom, e.g. asynchronous, behaviour is desired, then you can use wrappers like parallel-consumer or spring-kafka.
When it comes to library design, it is very often preferable to do only one thing (basically the single responsibility principle, applied).
As an example, consider that if the "main" library were asynchronous, the library providers would need to define thread creation and maintenance semantics, what happens when there are no records (compare to spring-kafka's listeners), and so on. By exposing a low-level API, these concerns, which are not immediately relevant to Kafka itself, can be avoided.
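To make the "non-default development effort" from the question concrete, here is a minimal hand-rolled sketch: a single synchronous poll loop that fans records out to a thread pool. The broker address, group id, topic name and pool size are made-up illustration values, and offset management, per-partition ordering and error handling are deliberately ignored; those are exactly the concerns that wrappers like parallel-consumer and spring-kafka handle for you.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class HandRolledParallelConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");           // illustration value
        props.put("group.id", "demo-group");                        // illustration value
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        ExecutorService workers = Executors.newFixedThreadPool(8);  // arbitrary pool size

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("demo-topic"));              // illustration value
            while (true) {
                // poll() is synchronous: nothing happens in parallel here...
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // ...parallelism has to be added by hand, and doing so gives up
                    // per-partition ordering and simple offset/commit semantics.
                    workers.submit(() -> process(record));
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("partition=%d offset=%d value=%s%n",
                record.partition(), record.offset(), record.value());
    }
}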
Why is a parallel consumer not provided by default in the libraries?
Kafka clients are a largely pluggable ecosystem. The core developers are focused on optimizing the server code, and the built-in client libraries (and serializers) work "well-enough" (TM). Therefore, a "by-default code configuration" for parallel consumption doesn't exist.
Why is order important in a microservice architecture?
That completely depends on your app, but one example is payment-processing or handling any sort of ledger system (after all, Kafka is a sort of distributed ledger). You cannot withdraw money without first depositing a balance. This is not unique to microservices.

Kafka Streams multiple applications / streams on same node?

Hopefully this is a quick and easy question. Right now I have an application that has two distinct tasks in the same stream. When the entire application runs, partitions are not balanced across the two tasks, which is an issue because one of the tasks requires more resources (memory/CPU).
In order to solve this, I created two separate streams with different stream builders in my application and started them separately. By setting it up this way, the partitions were balanced in the way I expected.
kafkaStreams0 = new KafkaStreams(kafkaStreamsBuilder0.build(), streamsProperties0)
kafkaStreams1 = new KafkaStreams(kafkaStreamsBuilder1.build(), streamsProperties1)
kafkaStreams0.start()
kafkaStreams1.start()
I'm giving each of these their own application id in the stream properties. Something about this seems like a hack, but I can't find any documentation about whether this is a valid solution.
As a note: I'd like to avoid splitting these into two applications as I don't want to add the operational overhead.
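For what it's worth, the part that makes this setup work is that each KafkaStreams instance gets its own application.id, and therefore its own consumer group and its own rebalancing. A minimal sketch of what those two property sets might look like (the application ids and broker address below are made up, not taken from the question):

import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

public class StreamsProps {

    // Builds streams properties with a distinct application.id per stream.
    static Properties streamsProperties(String applicationId) {
        Properties props = new Properties();
        // Each application.id forms its own consumer group, so the two
        // KafkaStreams instances are rebalanced independently of each other.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, applicationId);
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustration value
        return props;
    }

    public static void main(String[] args) {
        Properties streamsProperties0 = streamsProperties("heavy-task-app");  // made-up id
        Properties streamsProperties1 = streamsProperties("light-task-app");  // made-up id
        // ... build the two topologies and pass these to the two KafkaStreams instances
    }
}

The flip side is that the two streams also get separate internal topics and state directories, so they behave like two logically independent applications even though they share a JVM.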

How to use DataFrames within SparkListener?

I've written a CustomListener (deriving from SparkListener, etc...) and it works fine, I can intercept the metrics.
The question is about using DataFrames within the listener itself, as that assumes the usage of the same SparkContext; however, as of 2.1.x there is only one context per JVM.
Suppose I want to write some metrics to disk as JSON. Doing it at ApplicationEnd is not possible, only at the last jobEnd (if you have several jobs, the last one).
Is that possible/feasible?
I'm trying to measure the performance of jobs/stages/tasks, record that, and then analyze it programmatically. Maybe that is not the best way? The web UI is good, but I need to make things presentable.
I can force the creation of DataFrames upon the jobEnd event, however a few errors are thrown (basically they refer to not being able to propagate events to the listener), and in general I would like to avoid unnecessary manipulations. I want a clean set of measurements that I can record and write to disk.
SparkListeners should be as fast as possible, as a slow SparkListener would block the others from receiving events. You could use separate threads to release the main event dispatcher thread, but you're still bound to the limitation of having a single SparkContext per JVM.
That limitation is, however, easy to overcome, since you can ask for the current SparkContext using SparkContext.getOrCreate.
I'd however not recommend that architecture. It puts too much pressure on the driver's JVM, which should rather "focus" on the application processing (not on collecting events, which it probably already does for the web UI and/or Spark History Server).
I'd rather use Kafka or Cassandra or some other persistent storage to store the events and have some other processing application consume them (just like the Spark History Server works).
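If you go that route, the listener itself can stay trivial: it only serializes a small record per event and hands it off to external storage, and a separate application does the aggregation. A rough sketch of that shape, assuming Kafka as the sink (the topic name and JSON layout are made up; the listener would be registered via spark.extraListeners or SparkContext.addSparkListener):

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerApplicationEnd;
import org.apache.spark.scheduler.SparkListenerJobEnd;

// Forwards one small JSON record per finished job to Kafka instead of building
// DataFrames on the driver; a separate consumer application persists/analyzes them.
public class MetricsForwardingListener extends SparkListener {

    private final KafkaProducer<String, String> producer;

    public MetricsForwardingListener() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // illustration value
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        this.producer = new KafkaProducer<>(props);
    }

    @Override
    public void onJobEnd(SparkListenerJobEnd jobEnd) {
        String json = String.format("{\"jobId\": %d, \"completionTime\": %d}",
                jobEnd.jobId(), jobEnd.time());
        // send() is asynchronous, so the event dispatcher thread is not blocked for long
        producer.send(new ProducerRecord<>("spark-job-metrics", json));   // made-up topic
    }

    @Override
    public void onApplicationEnd(SparkListenerApplicationEnd applicationEnd) {
        producer.close();   // flush any buffered records when the application stops
    }
}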

Storm dynamic topology

Does Storm support dynamic topologies? The functionality I want is to dynamically change the topology according to user requirements while the Storm topology is running. For example, when the user wants to know the top-10 words of a stream, I use the top-10 bolt to process it; when the user wants to know something else, I use another bolt to process the stream and 'unplug' the top-10 bolt.
I know it could be done by partitioning the stream, or by duplicating the stream, always running every functionality, and only showing the data we want, or we could shut down the stream and deploy another topology, but is there a 'hot plug-in' way to do that?
You can't dynamically change a Storm topology's structure, i.e. modify the spouts and bolts wiring. A Storm topology's wiring is always static.
However, you could implement the needed functionality in the other ways you already described. IMHO, the best and most logical way would be to run multiple topologies -- in case the data processing differs greatly. But if most of the processing is similar in both cases, just duplicate the source stream and process the data in different branches of the same topology.
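To illustrate that second suggestion, branching inside one topology just means wiring two bolts to the same upstream component; each branch then sees the full stream, and you choose which branch's output to surface. The spout and bolt classes below are hypothetical placeholders for your own components, not real Storm classes:

import org.apache.storm.topology.TopologyBuilder;

public class BranchingTopology {

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // WordSpout, TopWordsBolt and OtherAnalysisBolt are hypothetical placeholders.
        builder.setSpout("words", new WordSpout());

        // Both bolts subscribe to the same spout, so each branch receives every tuple.
        builder.setBolt("top-10", new TopWordsBolt()).shuffleGrouping("words");
        builder.setBolt("other", new OtherAnalysisBolt()).shuffleGrouping("words");

        // builder.createTopology() would then be submitted via StormSubmitter or LocalCluster.
    }
}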
It was added in STORM-561, on 03/Jun/15:
https://issues.apache.org/jira/browse/STORM-561
There is no built-in way to do this (switch out one bolt for another), but what you can do is write a bolt that executes arbitrary code based on the input it receives. As long as your input and output have the same structure in Storm (the same tuples emitted), you could theoretically execute whatever you wanted at run time in your bolt. This is especially easy if you build your bolt in Clojure, but it's possible in essentially every language you can use with Storm.
However, this probably doesn't make a lot of sense, as most computations you'll want to do involve more than one bolt and lend themselves to passing differently structured tuples around. As schiavuzzi already said in their answer, you're probably better off running multiple topologies if there are multiple, independent computations you'd like to perform on a stream.
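As a rough sketch of the bolt-that-dispatches-on-its-input idea (assuming the org.apache.storm packages; the "mode" and "word" fields are made-up names for whatever your tuples actually carry):

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// A single bolt whose behaviour is chosen at runtime from a field on the tuple,
// instead of wiring a dedicated bolt per computation.
public class DispatchingBolt extends BaseBasicBolt {

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String mode = input.getStringByField("mode");   // made-up control field
        String word = input.getStringByField("word");   // made-up payload field

        // Same tuple structure in and out; only the computation differs per mode.
        String result = "top10".equals(mode) ? trackTopWords(word) : word.toUpperCase();

        collector.emit(new Values(mode, result));
    }

    private String trackTopWords(String word) {
        // placeholder for whatever top-10-words bookkeeping you plug in
        return word;
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("mode", "result"));
    }
}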
For hot deployment there is a new streaming platform from eBay.
Jetstream: https://github.com/pulsarIO/jetstream.
It has a built-in config management tool, and your config sits in MongoDB. When a user modifies the config bean, the tool publishes a notification to ZooKeeper, the corresponding JetStream applications get notified, and they change their config dynamically.

Service with background jobs, how to ensure jobs only run periodically ONCE per cluster

I have a play framework based service that is stateless and intended to be deployed across many machines for horizontal scaling.
This service is handling HTTP JSON requests and responses, and is using CouchDB as its data store again for maximum scalability.
We have a small number of background jobs that need to be run every X seconds across the whole cluster. It is vital that the jobs do not execute concurrently on each machine.
To execute the jobs we're using Actors and the Akka Scheduler (since we're using Scala):
Akka.system().scheduler.schedule(
  Duration.create(0, TimeUnit.MILLISECONDS),
  Duration.create(10, TimeUnit.SECONDS),
  Akka.system().actorOf(LoggingJob.props),
  "tick")
(etc)

object LoggingJob {
  def props = Props[LoggingJob]
}

class LoggingJob extends UntypedActor {
  override def onReceive(message: Any) {
    Logger.info("Job executed! " + message.toString())
  }
}
Is there:
any built-in trickery in Akka/Actors/Play that I've missed that will do this for me?
OR a recognised algorithm that I can put on top of Couchbase (distributed mutex? not quite?) to do this?
I do not want to make any of the instances 'special' as it needs to be very simple to deploy and manage.
Check out Akka's Cluster Singleton Pattern.
For some use cases it is convenient and sometimes also mandatory to ensure that you have exactly one actor of a certain type running somewhere in the cluster.
Some examples:
single point of responsibility for certain cluster-wide consistent decisions, or coordination of actions across the cluster system
single entry point to an external system
single master, many workers
centralized naming service, or routing logic
Using a singleton should not be the first design choice. It has several drawbacks, such as single-point of bottleneck. Single-point of failure is also a relevant concern, but for some cases this feature takes care of that by making sure that another singleton instance will eventually be started.
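Wiring the question's LoggingJob into that pattern mostly means wrapping it in a ClusterSingletonManager and sending the ticks through a proxy. A rough sketch using Akka's Java API, assuming akka-cluster-tools on the classpath and an ActorSystem configured for clustering (the actor names below are arbitrary):

import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.PoisonPill;
import akka.actor.Props;
import akka.cluster.singleton.ClusterSingletonManager;
import akka.cluster.singleton.ClusterSingletonManagerSettings;
import akka.cluster.singleton.ClusterSingletonProxy;
import akka.cluster.singleton.ClusterSingletonProxySettings;

public class SingletonSetup {

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("jobs");   // must be configured for clustering

        // Exactly one LoggingJob runs somewhere in the cluster; the manager
        // hands the singleton over to another node if the current one leaves.
        system.actorOf(
                ClusterSingletonManager.props(
                        Props.create(LoggingJob.class),
                        PoisonPill.getInstance(),
                        ClusterSingletonManagerSettings.create(system)),
                "loggingJobManager");

        // The proxy can live on every node and routes messages to wherever
        // the singleton currently runs, so the local scheduler can keep ticking.
        ActorRef proxy = system.actorOf(
                ClusterSingletonProxy.props(
                        "/user/loggingJobManager",
                        ClusterSingletonProxySettings.create(system)),
                "loggingJobProxy");

        proxy.tell("tick", ActorRef.noSender());
    }
}

The scheduler can then keep firing on every node, but because it sends the "tick" to the proxy rather than to a locally created actor, only the single live LoggingJob instance actually executes the job.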