Storm dynamic topology

Storm dynamic topology - streaming

Does Storm support dynamic topology? The functionality I want from this is to dynamically change the topology according to the user requirement while the Storm topology is running. For example, when user want to know the top-10 words of a stream, I use the top-10 bolt to process it, when user want to know something else, I use the other bolt to process the stream and 'unplug' the top-10 bolt.
I know it could be done by partition the stream or duplicate the stream and alway running every functionalities and only demo the data we want, or we could shut down the stream and update another topology, but is there a 'hot plug-in' way to do that?

You can't dinamically change a Storm topology's structure, i.e. modify the spouts and bolts wiring. A Storm topology's wiring is always static.
However, you could implement the needed functionality in other ways you already described. IMHO, the best, most logical way would be by running multiple topologies -- in case the data processing differs greatly. But if most of the processing is similar in both cases, just duplicate the source stream and process the data in different branches of the same topology.

It was added on STORM-561, on 03/Jun/15:
https://issues.apache.org/jira/browse/STORM-561

There is no built in way to do this (switch out one bolt for another), but what you can do is write a bolt that executes arbitrary code based on the input it receives. So long as your input and output has the same structure in storm (same tuples emitted), you could theoretically execute whatever you wanted at run time in your bolt. This is especially easy if you build your bolt in Clojure, but it's possible in essentially every language you can use with Storm.
However, this probably doesn't make a lot of sense as most computations you'll want to do involve more than one bolt and lend themselves to passing differently structured tuples around. As schiavuzzi already said in their answer, you're probably better off running multiple topologies if there are multiple, independent computations you'd like to do to a stream.

For hot deployment there is a new streaming platform from eBay.
Jetstream: https://github.com/pulsarIO/jetstream.
It has a built in config management tool and your config sits in mongodb. When user modify the config bean, the tool will publish the notification to zookeeper, the corresponding JetStream applications will be get notified and change the config dynamically

Related

Event sourcing - why a dedicated event store?

I am trying to implement event sourcing/CQRS/DDD for the first time, mostly for learning purposes, where there is the idea of an event store and a message queue such as Apache Kafka, and you have events flowing from event store => Kafka Connect JDBC/Debezium CDC => Kafka.
I am wondering why there needs to be a separate event store when it sounds like its purpose can be fulfilled by Kafka itself with its main features and log compaction or configuring log retention for permanent storage. Should I store my events in a dedicated store like RDBMS to feed into Kafka or should I feed them straight into Kafka?

Much of the literature on event-sourcing and cqrs comes from the [domain driven design] community; in its earliest form, CQRS was called DDDD... Distributed domain driven design.
One of the common patterns in domain driven design is to have a domain model ensuring the integrity of the data in your durable storage, which is to say, ensuring that there are no internal contradictions...
I am wondering why there needs to be a separate event store when it sounds like its purpose can be fulfilled by Kafka itself with its main features and log compaction or configuring log retention for permanent storage.
So if we want an event stream with no internal contradictions, how do we achieve that? One way is to ensure that only a single process has permission to modify the stream. Unfortunately, that leaves you with a single point of failure -- the process dies, and everything comes to an end.
On the other hand, if you have multiple processes updating the same stream, then you have risk of concurrent writes, and data races, and contradictions being introduced because one writer couldn't yet see what the other one did.
With an RDBMS or an Event Store, we can solve this problem by using transactions, or compare and swap semantics; and attempt to extend the stream with new events is rejected if there has been a concurrent modification.
Furthermore, because of its DDD heritage, it is common for the durable store to be divided into many very fine grained partitions (aka "aggregates"). One single shopping cart might reasonably have four streams dedicated to it.
If Kafka lacks those capabilities, then it is going to be a lousy replacement for an event store. KAFKA-2260 has been open for more than four years now, so we seem to be lacking the first. From what I've been able to discern from the Kakfa literature, it isn't happy about fine grained streams either (although its been a while since I checked, perhaps things have changed).
See also: Jesper Hammarbäck writing about this 18 months ago, and reaching similar conclusions to those expressed here.

Kafka can be used as a DDD event store, but there are some complications if you do so due to the features it is missing.
Two key features that people use with event sourcing of aggregates are:
Load an aggregate, by reading the events for just that aggregate
When concurrently writing new events for an aggregate, ensure only one writer succeeds, to avoid corrupting the aggregate and breaking its invariants.
Kafka can't do either of these currently, since 1 fails since you generally need to have one stream per aggregate type (it doesn't scale to one stream per aggregate, and this wouldn't necessarily be desirable anyway), so there's no way to load just the events for one aggregate, and 2 fails since https://issues.apache.org/jira/browse/KAFKA-2260 has not been implemented.
So you have to write the system in such as way that capabilities 1 and 2 aren't needed. This can be done as follows:
Rather than invoking command handlers directly, write them to
streams. Have a command stream per aggregate type, sharded by
aggregate id (these don't need permanent retention). This ensures that you only ever process a single
command for a particular aggregate at a time.
Write snapshotting code for all your aggregate types
When processing a command message, do the following:
Load the aggregate snapshot
Validate the command against it
Write the new events (or return failure)
Apply the events to the aggregate
Save a new aggregate snapshot, including the current stream offset for the event stream
Return success to the client (via a reply message perhaps)
The only other problem is handling failures (such as the snapshotting failing). This can be handled during startup of a particular command processing partition - it simply needs to replay any events since the last snapshot succeeded, and update the corresponding snapshots before resuming command processing.
Kafka Streams appears to have the features to make this very simple - you have a KStream of commands that you transform into a KTable (containing snapshots, keyed by aggregate id) and a KStream of events (and possibly another stream containing responses). Kafka allows all this to work transactionally, so there is no risk of failing to update the snapshot. It will also handle migrating partitions to new servers, etc. (automatically loading the snapshot KTable into a local RocksDB when this happens).

there is the idea of an event store and a message queue such as Apache Kafka, and you have events flowing from event store => Kafka Connect JDBC/Debezium CDC => Kafka
In the essence of DDD-flavoured event sourcing, there's no place for message queues as such. One of the DDD tactical patterns is the aggregate pattern, which serves as a transactional boundary. DDD doesn't care how the aggregate state is persisted, and usually, people use state-based persistence with relational or document databases. When applying events-based persistence, we need to store new events as one transaction to the event store in a way that we can retrieve those events later in order to reconstruct the aggregate state. Thus, to support DDD-style event sourcing, the store needs to be able to index events by the aggregate id and we usually refer to the concept of the event stream, where such a stream is uniquely identified by the aggregate identifier, and where all events are stored in order, so the stream represents a single aggregate.
Because we rarely can live with a database that only allows us to retrieve a single entity by its id, we need to have some place where we can project those events into, so we can have a queryable store. That is what your diagram shows on the right side, as materialised views. More often, it is called the read side and models there are called read-models. That kind of store doesn't have to keep snapshots of aggregates. Quite the opposite, read-models serve the purpose to represent the system state in a way that can be directly consumed by the UI/API and often it doesn't match with the domain model as such.
As mentioned in one of the answers here, the typical command handler flow is:
Load one aggregate state by id, by reading all events for that aggregate. It already requires for the event store to support that kind of load, which Kafka cannot do.
Call the domain model (aggregate root method) to perform some action.
Store new events to the aggregate stream, all or none.
If you now start to write events to the store and publish them somewhere else, you get a two-phase commit issue, which is hard to solve. So, we usually prefer using products like EventStore, which has the ability to create a catch-up subscription for all written events. Kafka supports that too. It is also beneficial to have the ability to create new event indexes in the store, linking to existing events, especially if you have several systems using one store. In EventStore it can be done using internal projections, you can also do it with Kafka streams.
I would argue that indeed you don't need any messaging system between write and read sides. The write side should allow you to subscribe to the event feed, starting from any position in the event log, so you can build your read-models.
However, Kafka only works in systems that don't use the aggregate pattern, because it is essential to be able to use events, not a snapshot, as the source of truth, although it is of course discussable. I would look at the possibility to change the way how events are changing the entity state (fixing a bug, for example) and when you use events to reconstruct the entity state, you will be just fine, snapshots will stay the same and you'll need to apply correction events to fix all the snapshots.
I personally also prefer not to be tightly coupled to any infrastructure in my domain model. In fact, my domain models have zero dependencies on the infrastructure. By bringing the snapshotting logic to Kafka streams builder, I would be immediately coupled and from my point of view it is not the best solution.

Theoretically you can use Kafka for Event Store but as many people mentioned above that you will have several restrictions, biggest of those, only able to read event with the offset in the Kafka but no other criteria.
For this reason they are Frameworks there dealing with the Event Sourcing and CQRS part of the problem.
Kafka is only part of the toolchain which provides you the capability of replaying events and back pressure mechanism that are protecting you from overload.
If you want to see how all fits together, I have a blog about it

Can I attach multiple transformers/processors to a single stream in Apache Kafka

In all example I see a simple single transformer/processor topology for Kafka. My doubt is whether we can modularise application logic by breaking down in to multiple transformers/processors applying sequentially to a single input stream.
Please find use case below :
Current application configuration is a single processor containing all processing logic tasks like filtering, validation, application logic, delaying(Kafka is too fast for dbs) and invoke SP/push to down stream.
But we are now planning to decouple all these operations by breaking down each task into separate processors/transformers of Kstream.
Since we are relatively new to Kafka, we are not sure of the pros and cons of this approach especially with respect to Kafka internals like state store/ task scheduling/ multithreading model.
Please share your expert opinions and experiences
Please note that we do not have control over topic, no new topic can be created for this design. The design must be feasible for the existing topic alone.

Kafka Streams allows you to split your logic into multiple processors. Internally, Kafka Streams implements a "depth-first" execution strategy. Thus, each time you call "forward" the output tuple is immediately processed by the downstream processor and "forward" return after downstream processing finished (note, that writing data into a topic and reading it back "breaks" the in-memory pipeline -- thus, when data is written to a topic, there is no guarantee when downstream processor will read and process those records).
If you have state that is shared between multiple processor, you would need to attach the store to all processor that need to access to store. The execution on the store will be single threaded and thus, there should be no performance difference.
As long as you connect processor directly (and not via topics) all processor will be part of the same tasks. Thus, there shouldn't be a performance difference.

Understanding Persistent Entities with streams of data

I want to use Lagom to build a data processing pipeline. The first step in this pipeline is a service using a Twitter client to supscribe to a stream of Twitter messages. For each new message I want to persist the message in Cassandra.
What I dont understand is given I model my Aggregare root as a List of TwitterMessages for example, after running for some time this aggregare root will be several gigabytes in size. There is no need to store all the TwitterMessages in memory since the goal of this one service is just to persist each incomming message and then publish the message out to Kafka for the next service to process.
How would I model my aggregate root as Persistent Entitie for a stream of messages without it consuming unlimited resources? Are there any example code showing this usage if Lagom?

Event sourcing is a good default go to, but not the right solution for everything. In your case it may not be the right approach. Firstly, do you need the Tweets persisted, or is it ok to publish them directly to Kafka?
Assuming you need them persisted, aggregates should store in memory whatever they need to validate incoming commands and generate new events. From what you've described, your aggregate doesn't need any data to do that, so your aggregate would not be a list of Twitter messages, rather, it could just be NotUsed. Each time it gets a command it emits a new event for that Tweet. The thing here is, it's not really an aggregate, because you're not aggregating any state, you're just emitting events in response to commands with no invariants or anything. And so, you're not really using the Lagom persistent entity API for what it was made to be used for. Nevertheless, it may make sense to use it in this way anyway, it's a high level API that comes with a few useful things, including the streaming functionality. But there are also some gotchas that you should be aware of, you put all your Tweets in one entity, you limit your throughput to what one core on one node can do sequentially at a time. So maybe you could expect to handle 20 tweets a second, if you ever expect it to ever be more than that, then you're using the wrong approach, and you'll need to at a minimum distribute your tweets across multiple entities.
The other approach would be to simply store the messages directly in Cassandra yourself, and then publish directly to Kafka after doing that. This would be a lot simpler, a lot less mechanics involved, and it should scale very nicely, just make sure you choose your partition key columns in Cassandra wisely - I'd probably partition by user id.

Is Event Sourcing applicable for batch inputs?

I have a use case where the inputs to the application comes in batches of XML files. For example, a nightly batch of bank transactions. I am trying to see if I can use event sourcing to create a log of events. Based on what I read so far, the examples seems to be based on user driven input (click streams, updates from a user interface etc.,). Is event sourcing using a distributed log mechanism(like Kafka) a valid approach for batch/file based inputs?
Below is the approach I would like to take:
Accept input as a batch in file/xml
Run some basic validations in the memory.
Convert the batch input into a series of events
Write the event log to a Kafka topic(s).
Use the event log to store the data into the database, send the events
to a search engine, update caches, run spark jobs to do aggregations
etc.,
Repeat the process for other incoming batches.
If this approach is not efficient, what other options are available for distributed processing of such inputs?

Are your inputs coming from something that looks like an event storage? I.e. a database that acts as an immutable source of truth, of append only events.
If that is the case, you have the foundation to use event sourcing, and additionally CQRS. (They're not the same thing)
What you would have to realize is that the so called write side / command side... has already been done for you.
The incoming batch of XML files with transactions... each transaction is an event already. It doesn't sound like you need to convert these to events, to then put these into Kafka. You can just map these to something you can put into Kafka, and then all subscribers of the topics can do stuff accordingly.
Effectively you would be implementing the read side of Event Sourcing + CQRS.
In practical terms, unless you are going to be doing things on the write side (where the xml files are generated / where user input is received)... I wouldn't worry too much about the subtleties of event sourcing as it relates to DDD and CQRS. I would simply think of what you're doing as a way to distribute your data to multiple services.
And make sure to consider how caches, search engines, etc. will only be updated whenever you get those XML files.

If each individual event in these xml files has a timestamp then you can think of the output to Kafka as just a steam of late arriving events. Kafka allows you to set the event time on these messages to be the timestamp of the event rather than the time it was ingested to Kafka. In that way, any downstream processing apps like Kafka Streams can put the event into the right temporal context and aggregate into the proper time windows or session windows or even join with other realtime inputs

Apache Flink: changing state parameters at runtime from outside

i'm currently working on a streaming ML pipeline and need exactly once event processing. I was interested by Flink but i'm wondering if there is any way to alter/update the execution state from outside.
The ml algorithm state is kept by Flink and that's ok, but considering that i'd like to change some execution parameters at runtime, i cannot find a viable solution. Basically an external webapp (in GO) is used to tune the parameters and changes should reflect in Flink for the subsequent events.
I thought about:
a shared Redis with pub/sub (as polling for each event would kill throughput)
writing a custom solution in Go :D
...
The state would be kept by key, related to the source of one of the multiple event streams coming in from Kafka.
Thanks

You could use a CoMapFunction/CoFlatMapFunction to achieve what you described. One of the inputs is the normal data input and on the other input you receive state changing commands. This could be easiest ingested via a dedicated Kafka topic.