Hopefully this is a quick and easy question. Right now I have an application that has two unique tasks in the same stream. When the entire application runs, partitions are not balanced across the two tasks, which is an issue because one of the tasks requires more resources (memory / CPU).
In order to solve this I created two unique streams with different stream builders in my application and started them separately. By setting it up this way, the partitions were balanced in the way I expected.
kafkaStreams0 = new KafkaStreams(kafkaStreamsBuilder0.build(), streamsProperties0)
kafkaStreams1 = new KafkaStreams(kafkaStreamsBuilder1.build(), streamsProperties1)
kafkaStreams0.start()
kafkaStreams1.start()
I'm giving each of these their own application id in the stream properties. Something about this seems like a hack, but I can't find any documentation about whether this is a valid solution.
As a note: I'd like to avoid splitting these into two applications as I don't want to add the operational overhead.
We are currently busy with a new project where we want to introduce SCDF, but we are running into one major issue and were wondering whether you have faced a similar issue and how you solved it.
What we saw is that, for every stream we create in SCDF, the deployment (on Kubernetes) creates separate instances of the microservices per stream. So if microservice A is used in 3 different streams, at runtime we have 3 instances of microservice A. In our solution we have a lot of reusable microservices, but if SCDF instantiates these microservices per stream, we end up running almost 400 instances (pods) in production, and if we scale on top of this, we use an enormous amount of resources. We need to find a way to share pods (instances) across streams.
Did you face this issue? If yes, what was your approach to this?
There are a couple of ways to reduce the number of pods.
Use function composition. All of the prepackaged apps are now function based, meaning you can combine functions into a single source, sink or processor app. The SCDF stream definition requires at least a source and sink, but the out-of-the-box functions are designed to be reused in custom apps, which may apply the functions to implement intermediate steps as necessary. Bear in mind that composed functions process data in memory, eliminating the messaging middleware used to stream data between separate pods. This could make your app more susceptible to data loss. There are always trade-offs.
Use named destinations: You may share parts of a streaming pipeline using named destinations. This allows you to fan-in or fan-out. In this example, 3 stream definitions enable 2 sources to feed a shared processor and sink.
source1 > :my-named-destination
source2 > :my-named-destination
:my-named-destination > processor1 | sink1
The commercial edition of SCDF supports stream definitions using custom components that implement multiple inputs/outputs. This gives you options similar to the above, where custom routing logic is implemented internally.
You can deploy a custom task in place of a stream if appropriate for your use case. The task may incorporate out of the box functions and function composition as needed.
An important consideration when combining components is increased coupling and dependencies among pipeline steps. Simple linear processing creates more pods but is much simpler to implement, deploy, manage, and reason about.
I'm using a single kafka topic for different types of related messages. Topic name is: apiEvents. Events are of type:
ApiUpdateEvent
EndpointUpdateEvent
TemplateUpdateEvent
One of my applications consumes all of these events. Moreover, I want it to consume the same event differently (twice), in two unrelated use cases.
For example, two use cases for the same event (EndpointUpdateEvent):
(1) I'd like to create a time window of 500ms and respond to an aggregation of all the events that came in during this window - ONCE!
(2) For these same events as in (1), I want to respond to each one individually, by triggering some DB operation.
Thus, I want to write code that will be clean and maintainable and wouldn't want to throw all use cases in one big consumer with a lot of mess.
A solution I've thought about is to write a new kafka-consumer for each use case and to assign each consumer a different groupId (within the same application). That way, each business logic use case will have its own class which will handle the events in its own special way. Seems tidy enough.
Could any problems arise if I create too many consumer groups in one application?
Is there a better solution that will allow me to keep the code clean and separate the different business logic use cases?
It sounds like you are on the right track by using separate consumer groups for different business logic use cases that will have separately managed offsets to track the individual progress. This will also align more with a microservice style architecture where different business cases may be implemented in different components.
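To illustrate what "separately managed offsets" buys you, here is a minimal in-memory sketch in Python. It is not the Kafka client API: `TopicLog`, `poll` and the group names are hypothetical, and the only point is that each group keeps its own position, so both use cases see every event independently.

```python
class TopicLog:
    """Toy append-only log that tracks one read offset per consumer group."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # group_id -> index of the next record to read

    def append(self, record):
        self._records.append(record)

    def poll(self, group_id, max_records=10):
        """Return the next batch for this group and advance only its offset."""
        start = self._offsets.get(group_id, 0)
        batch = self._records[start:start + max_records]
        self._offsets[group_id] = start + len(batch)
        return batch


log = TopicLog()
for event in ["EndpointUpdateEvent#1", "EndpointUpdateEvent#2", "ApiUpdateEvent#1"]:
    log.append(event)

# Hypothetical group names, one per business use case.
aggregation_batch = log.poll("endpoint-aggregation")
db_batch = log.poll("endpoint-db-writer")
```

Both groups receive the same three events, and polling one group again returns nothing until new events are appended, regardless of how far the other group has read.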
One more consideration, which I cannot judge based on the information provided: I would also think about splitting your topic into one per event type. It is not a problem for a consumer group to be subscribed to multiple topics at the same time, whereas I believe it is less efficient to have consumers process and discard a large number of events that are irrelevant to them.
You can use the Kafka Streams Processor API to consume and act on individual messages as well as window them within a specific, rolling/hopping time period.
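As a rough picture of the windowing idea, the Python sketch below buckets events into fixed, non-overlapping 500ms windows (tumbling windows, which is a simplification of the rolling/hopping windows mentioned above). It does not use Kafka Streams; `tumbling_windows` and the sample events are assumptions made for illustration.

```python
from collections import defaultdict

WINDOW_MS = 500  # window size taken from the question


def tumbling_windows(events, window_ms=WINDOW_MS):
    """Group (timestamp_ms, payload) pairs by the start of their window."""
    windows = defaultdict(list)
    for timestamp_ms, payload in events:
        window_start = (timestamp_ms // window_ms) * window_ms
        windows[window_start].append(payload)
    return dict(sorted(windows.items()))


events = [(0, "a"), (120, "b"), (499, "c"), (500, "d"), (1320, "e")]
per_window = tumbling_windows(events)
# per_window == {0: ["a", "b", "c"], 500: ["d"], 1000: ["e"]}
```

Each key in `per_window` would trigger one aggregated response, while the per-event use case simply iterates over `events` directly.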
I'm building a solution where we'll have a (service-fabric) stateless service deployed to K instances. This service is tasked with some workload (like querying) and I want to split the workload between them as evenly as I can - and I want to make this a dynamic solution, which means if I decide to go from K instances to N instances tomorrow, I want the workload splitting to happen in a way that it will automatically distribute the load across N instances now. I don't have any partitions specified for this service.
As an example -
Let's say I'd like to query a database to retrieve a particular chunk of the records. I have 5 nodes. I want these 5 nodes to retrieve different 1/5th of the set of records. This can be achieved through some query logic like (row_id % N == K) where N is the total number of instances and K is the unique instance_number.
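The query logic above can be sketched in Python as follows. This is only an illustration of the `row_id % N == K` rule: `rows_for_instance` is a hypothetical helper, and instance numbers are assumed to run from 0 to N-1, which is what the modulo produces (a 1..N naming would need a `- 1` somewhere).

```python
def rows_for_instance(row_ids, instance_number, total_instances):
    """Return the row ids owned by one instance under row_id % N == K."""
    if not 0 <= instance_number < total_instances:
        raise ValueError("instance_number must be in [0, total_instances)")
    return [r for r in row_ids if r % total_instances == instance_number]


row_ids = list(range(1, 21))

# Five instances: every row lands on exactly one of them.
shards_5 = [rows_for_instance(row_ids, k, 5) for k in range(5)]

# Scaling to three instances only requires changing N; the rows are redistributed.
shards_3 = [rows_for_instance(row_ids, k, 3) for k in range(3)]
```

The split only stays free of duplicates if every instance agrees on N and no two instances share the same K, which is exactly the stable-identity requirement described next.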
I was hoping to leverage FabricRuntime.GetNodeContext().NodeId - but this returns a guid which is not overly useful.
I'm looking for a way where I can deterministically say it's instance number M out of N (I need to be able to name the instances 1..N) - so I can set my querying logic according to this. One of the requirements is that if an instance goes down / crashes etc... when SF automatically restarts it, it should still identify as the same instance id - so that 2 or more nodes don't query the same set of results.
What is the best way of solving this problem? Is there a solution which involves pure configuration through ApplicationManifest.xml or ServiceManifest.xml?
There is no out of the box solution for your problem, but it can be easily done in many different ways.
The simplest way is using the Queue-Based Load Leveling pattern in conjunction with Competing Consumers pattern.
It consists of creating a queue and adding the work to it; each instance then takes one message and processes that work. If an instance goes down before the message is processed, the message goes back to the queue and another instance picks it up.
This way you don't have to worry about the number of instances running, failures and so on.
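A minimal Python sketch of the competing consumers idea, using threads as stand-ins for service instances and an in-process `queue.Queue` as a stand-in for a real message queue. The function and worker names are hypothetical, and the redelivery of unprocessed messages after a crash is not modeled here; a real queue service would handle that part.

```python
import queue
import threading


def run_competing_consumers(work_items, worker_count):
    """Let worker_count workers drain one shared queue; return item -> worker."""
    work_queue = queue.Queue()
    for item in work_items:
        work_queue.put(item)

    processed = {}
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                item = work_queue.get_nowait()  # each item goes to exactly one worker
            except queue.Empty:
                return  # nothing left to do
            with lock:
                processed[item] = name

    threads = [
        threading.Thread(target=worker, args=(f"instance-{i}",))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed


assignments = run_competing_consumers(range(10), worker_count=3)
```

Which instance handles which item is not deterministic, and that is the point: no instance needs to know how many others exist.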
Regarding the work being put in the queue, it depends on whether you want to do batch processing or process item by item.
Item by item, you put one message in the queue for each item to be processed. This is a simple way to handle the work, and each instance processes one message at a time, or multiple messages in parallel.
In batch, you put a message that represents a list of items to be processed, and each instance processes that batch until it is completed. This is a bit trickier because you might have to track the progress of the work being done so that, in case of failure, the next attempt can continue from where it stopped.
The queue approach is a reactive design: the work needs to be put in the queue to trigger the processing. If you want a proactive approach and need to keep track of which work goes to whom, you would probably be better off using some other approach, like a leasing mechanism, where each instance acquires a lease that belongs to that instance until it releases the lease. This is more suitable when you work with partitioned data or another mechanism where you can easily split the load.
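The leasing idea can be sketched as follows. This is an in-memory Python illustration only: `LeaseTable`, its methods and the owner names are hypothetical, time is passed in explicitly to keep the example deterministic, and a real implementation would keep the leases in a shared store with atomic updates.

```python
class LeaseTable:
    """Toy lease table: each partition is owned by at most one instance at a time."""

    def __init__(self, partitions, lease_seconds):
        self._lease_seconds = lease_seconds
        # partition -> (owner, expires_at); owner None means never leased
        self._leases = {p: (None, 0.0) for p in partitions}

    def acquire(self, owner, now):
        """Renew the owner's live lease, else take the first free or expired one."""
        for partition, (holder, expires_at) in self._leases.items():
            if holder == owner and expires_at > now:
                self._leases[partition] = (owner, now + self._lease_seconds)
                return partition
        for partition, (holder, expires_at) in self._leases.items():
            if holder is None or expires_at <= now:
                self._leases[partition] = (owner, now + self._lease_seconds)
                return partition
        return None  # every partition is currently leased by someone else

    def release(self, owner):
        """Give up any lease held by owner so another instance can take it."""
        for partition, (holder, _) in self._leases.items():
            if holder == owner:
                self._leases[partition] = (None, 0.0)


table = LeaseTable(partitions=range(3), lease_seconds=30)
first = table.acquire("instance-A", now=0)     # takes partition 0
second = table.acquire("instance-B", now=0)    # takes partition 1
renewed = table.acquire("instance-A", now=10)  # keeps partition 0, extends the lease
```

An instance that restarts before its lease expires gets the same partition back, and a partition whose owner stays down becomes available to others once the lease runs out.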
Regarding the issue with the ID, an option would be the InstanceId of the replica you are on, which you can reach through StatelessService.Context.InstanceId. It is not a sequential ID; it is a random number. It is better than using the node ID, because you might have multiple partitions on the same node and the IDs would conflict with each other.
If you decide to use named partitions, you could use the order in the partition name instead, so each partition would have a sequential name.
It is worth mentioning that Service Fabric has a limitation that doesn't allow a service to have multiple replicas on the same node. You might have to design your services with this limitation in mind, otherwise you won't be able to scale out once the limit is reached. Also, the same thread has some discussion about approaches to processing multiple distributed items that might give you some ideas.
The doc states that it is possible to schedule multiple jobs from within one Spark session / context. Can anyone give an example of how to do that? Can I launch several jobs / actions within a Future? What execution context should I use? I'm not entirely sure how Spark manages that: how are the driver and the cluster aware of the many jobs being submitted from within the same driver? Is there anything that signals Spark about it? If someone has an example, that would be great.
Motivation: My data is key-value based, and I have the requirement that each group of records associated with a key must be processed as a batch. In particular, I need to use mapPartitions, because in each partition I need to instantiate a non-serializable object for processing my records.
(1) I could indeed group things using Scala collections directly within the partitions, and process each group as a batch.
(2) The other way, which I am exploring, would be to filter the data by key beforehand and launch actions/jobs for each filtered result (filtered collection). That way there is no need to group in each partition, and I can process the whole partition as a batch directly. I am assuming that the fair scheduler would do a good job of scheduling things evenly between the jobs. If the fair scheduler works well, I think this solution is more efficient. However, I need to test it, so I wonder if someone could help with how to achieve threading within a Spark session, and warn me if there is any downside to it.
Moreover, if anyone has had to make that choice/evaluation between the two approaches, what was the outcome?
Note: This is a streaming application. Each group of records associated with a key needs a specific configuration of an instantiated object to be processed (imperatively, as a batch). That object being non-serializable, it needs to be instantiated per partition.
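For reference, option (1) can be sketched in plain Python as a function that takes an iterator of (key, value) records and yields one result per key. `BatchProcessor` is a hypothetical stand-in for the non-serializable object, and no Spark is involved here, although the iterator-in, iterator-out shape is the one `mapPartitions` expects. Note that this sketch holds the whole partition in memory while grouping.

```python
from collections import defaultdict


class BatchProcessor:
    """Hypothetical stand-in for the non-serializable processing object."""

    def __init__(self):
        self.config = None

    def configure(self, key):
        self.config = f"config-for-{key}"

    def process(self, batch):
        return (self.config, len(batch))


def process_partition(records):
    """Group one partition's records by key and process each group as a batch."""
    processor = BatchProcessor()  # created once per partition, never serialized
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    for key, batch in groups.items():
        processor.configure(key)  # key-specific configuration before each batch
        yield key, processor.process(batch)


partition = iter([("a", 1), ("b", 2), ("a", 3)])
results = list(process_partition(partition))
# results == [("a", ("config-for-a", 2)), ("b", ("config-for-b", 1))]
```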
I want to use Lagom to build a data processing pipeline. The first step in this pipeline is a service using a Twitter client to subscribe to a stream of Twitter messages. For each new message I want to persist the message in Cassandra.
What I don't understand is this: if I model my aggregate root as a list of TwitterMessages, for example, then after running for some time this aggregate root will be several gigabytes in size. There is no need to store all the TwitterMessages in memory, since the goal of this one service is just to persist each incoming message and then publish the message out to Kafka for the next service to process.
How would I model my aggregate root as a Persistent Entity for a stream of messages without it consuming unlimited resources? Is there any example code showing this usage of Lagom?
Event sourcing is a good default go-to, but not the right solution for everything. In your case it may not be the right approach. Firstly, do you need the Tweets persisted, or is it ok to publish them directly to Kafka?
Assuming you need them persisted, aggregates should store in memory whatever they need to validate incoming commands and generate new events. From what you've described, your aggregate doesn't need any data to do that, so your aggregate would not be a list of Twitter messages; rather, it could just be NotUsed. Each time it gets a command it emits a new event for that Tweet. The thing here is, it's not really an aggregate, because you're not aggregating any state, you're just emitting events in response to commands with no invariants or anything. And so, you're not really using the Lagom persistent entity API for what it was made to be used for. Nevertheless, it may make sense to use it in this way anyway: it's a high level API that comes with a few useful things, including the streaming functionality. But there are also some gotchas that you should be aware of. If you put all your Tweets in one entity, you limit your throughput to what one core on one node can do sequentially at a time. So maybe you could expect to handle 20 tweets a second. If you ever expect it to be more than that, then you're using the wrong approach, and you'll need to at a minimum distribute your tweets across multiple entities.
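One simple way to distribute tweets across multiple entities is to derive the entity id from a stable hash of some attribute of the tweet. The Python sketch below is an assumption about how that could look, not Lagom API: `entity_id_for`, the `tweets-` prefix, and the choice of the user id as the hash input are all hypothetical.

```python
import hashlib


def entity_id_for(user_id, entity_count):
    """Map a user id to one of entity_count entity ids in a stable way."""
    if entity_count <= 0:
        raise ValueError("entity_count must be positive")
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % entity_count
    return f"tweets-{shard}"


# The same user always maps to the same entity; different users spread out.
first_lookup = entity_id_for("user-42", entity_count=16)
second_lookup = entity_id_for("user-42", entity_count=16)
```

A cryptographic hash is used here instead of Python's built-in `hash` because the built-in is randomized per process for strings, which would send the same user to different entities after a restart.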
The other approach would be to simply store the messages directly in Cassandra yourself, and then publish directly to Kafka after doing that. This would be a lot simpler, a lot less mechanics involved, and it should scale very nicely, just make sure you choose your partition key columns in Cassandra wisely - I'd probably partition by user id.