I'm trying to see what toolkits/frameworks are available to achieve the following.
A toolkit where a developer typically should configure the data flow (which is a series of steps) to form a data processing pipeline. A declarative approach with zero or very minimal coding.
The underlying messaging infrastructure should be Kafka - ie the toolkit should support Kafka straight out of the box (when the right dependencies are included).
Very intuitive to visualise, deploy, debug the flows.
Aggregation capabilities (group by) etc on streaming data.
I'm seeing Spring Cloud Data Flow as something that could (possibly) be tried out as a candidate?
Is this what it is meant for (from people using it in production)?
Are there any free/opensource alternatives too?
I will attempt to unpack a few topics in the context of Spring Cloud Data Flow (SCDF).
A toolkit where a developer typically should configure the data flow (which is a series of steps) to form a data processing pipeline. A declarative approach with zero or very minimal coding.
There are ~70 data integration applications that we maintain and ship. They should cover the most common use-cases. Each of them is a Spring Cloud Stream application, and the business logic in them can work as-is with a variety of message brokers that the framework supports, including Kafka and Kafka Streams.
However, when you have a custom data processing requirement and there's no application to address that need, you will have to build a custom source, processor, or sink style of app. If you don't want to use Java, polyglot workloads are possible as well.
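For a rough sense of what a custom app involves, here is a minimal sketch of a processor built with Spring Cloud Stream's functional programming model. The class and function names are made up for the example; the Kafka binding itself comes from configuration and the binder dependency, not from code.

```java
import java.util.function.Function;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;

@SpringBootApplication
public class UppercaseProcessorApplication {

    public static void main(String[] args) {
        SpringApplication.run(UppercaseProcessorApplication.class, args);
    }

    // Registered as a message processor; the framework binds its input and output
    // to the configured middleware (Kafka, RabbitMQ, ...) via the chosen binder.
    @Bean
    public Function<String, String> uppercase() {
        return payload -> payload.toUpperCase();
    }
}
```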
SCDF allows you to assemble the applications into a coherent streaming data pipeline [see streams developer guide]. SCDF then orchestrates the deployment of the apps in the data pipeline to targeted platforms like Kubernetes as native resources.
Because these applications are connected with one another through persistent pub/sub brokers (e.g., Kafka), SCDF also provides the primitives to CI/CD, rolling-upgrade, and rolling-rollback the individual applications in the streaming data pipeline without causing upstream or downstream impacts. Data ordering and delivery guarantees are also preserved because we rely upon, and delegate to, the underlying message broker.
The underlying messaging infrastructure should be Kafka - ie the toolkit should support Kafka straight out of the box (when the right dependencies are included).
This is already covered in the previous answer. The point to note here, though, is that if in the future you want to switch from Kafka to, say, Azure Event Hubs, there's absolutely zero code change required in the business logic. The Spring Cloud Stream workload is portable, and you're not locking yourself into a single technology like Kafka.
Very intuitive to visualise, deploy, debug the flows
SCDF supports a drag+drop interface, integration with observability tooling such as Prometheus+Grafana, and metrics-based auto-scaling of applications in the data pipeline.
All of the above is also possible to accomplish by directly using SCDF's APIs, Java DSL (programmatic creation of data pipelines — critical for CI/CD automation), or Shell/CLI.
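As a rough sketch of the programmatic route, the Java DSL can create and deploy a stream definition against a running SCDF server. The server URL and stream definition below are placeholders, and the exact builder methods may differ slightly between SCDF versions.

```java
import java.net.URI;

import org.springframework.cloud.dataflow.rest.client.DataFlowOperations;
import org.springframework.cloud.dataflow.rest.client.DataFlowTemplate;
import org.springframework.cloud.dataflow.rest.client.dsl.Stream;

public class PipelineAutomation {

    public static void main(String[] args) {
        // Placeholder URL of a running SCDF server.
        DataFlowOperations dataFlow = new DataFlowTemplate(URI.create("http://localhost:9393"));

        // Declarative pipeline: an HTTP source piped to a log sink.
        Stream stream = Stream.builder(dataFlow)
                .name("http-ingest")
                .definition("http | log")
                .create()
                .deploy();

        // The returned handle can later be used to inspect or undeploy the stream.
        System.out.println("Stream 'http-ingest' created and deployed");
    }
}
```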
Aggregation capabilities (group by) etc on streaming data
When using Kafka Streams binder implementation, you can build comprehensive joins, aggregations, and stateful analytics — see samples.
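As an illustrative sketch (not taken from the official samples), a group-by-key count with the Kafka Streams binder in the functional style could look like the following; topic bindings and serdes come from configuration, and the names are assumptions for the example.

```java
import java.util.function.Function;

import org.apache.kafka.streams.kstream.KStream;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;

@SpringBootApplication
public class EventCountApplication {

    public static void main(String[] args) {
        SpringApplication.run(EventCountApplication.class, args);
    }

    // Counts incoming events per key; the binder wires the input and output topics
    // from configuration, and the result is emitted as a changelog-style stream.
    @Bean
    public Function<KStream<String, String>, KStream<String, Long>> countByKey() {
        return events -> events
                .groupByKey()
                .count()
                .toStream();
    }
}
```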
I am curious to know whether the Asynchronous I/O Flink API for External Data Access is also available for Beam. I could not find anything about it specifically. I know that Beam can use Flink as an execution framework, so async I/O may be possible for Beam as well, but I am not sure whether the API has been defined yet.
I didn't find anything related in Beam's Flink runner codebase, and I believe it's not easy to do (I'm not sure it's even possible), since all Beam IOs are implemented in a unified way so that they can run on different runners. So any runner-specific data processing feature or optimisation has to be applied by the runner used to execute the Beam pipeline (Flink in your case) during the pipeline translation stage, while still staying inside the limits of the Beam model.
I am unsure how to make use of event-driven architecture in real-world scenarios. Let's say there is a route planning platform consisting of the following back-end services:
user-service (manages user data and roles)
map-data-service (roads & addresses, only modified by admins)
planning-tasks-service (accepts new route planning tasks, keeps track of background tasks, stores results)
The public website will usually request data from all 3 of those services. map-data-service needs information about user-roles on a data change request. planning-tasks-service needs information about users, as well as about map-data to validate new tasks.
Right now those services would just make a sync request to each other to get the needed data. What would be the best way to translate this basic structure into an event-driven architecture? Can dependencies be reduced by making use of events? How will the public website get the needed data?
Cosmin is 100% correct in that you need something to do some orchestration.
One approach to take, if you have a client that needs data from multiple services, is the Experience API approach.
Clients call the experience API, which performs the orchestration - pulling data from different sources and providing it back to the client. The design of the experience API is heavily, and deliberately, biased towards what the client needs.
Based on the details you've said so far, I can't see anything that cries out for event-based architecture. The communication between the client and ExpAPI can be a mix of sync and async, as can the ExpAPI to [Services] communication.
And for what it's worth, putting all of that on an API gateway is not a bad idea, in that they are designed to host APIs and therefore provide the desirable controls and observability for managing them.
Update based on OP Comment
I was really interested in how an event-driven architecture could reduce dependencies between my microservices, as it is often stated
Having components (or systems) talk via events is sort-of the asynchronous equivalent of Inversion of Control, in that the event consumers are not tightly-coupled to the thing that emits the events. That's how the dependencies are reduced.
One thing you could do would be to do a little side-project just as a learning exercise - take a snapshot of your code and do a rough-n-ready conversion to event-based and just see how that went - not so much as an attempt to event-a-cise your solution but to see what putting events into a real-world solution looks like. If you have the time, of course.
The missing piece in your architecture is the API Gateway, which should be the only entry-point in your system, used by the public website directly.
The API Gateway would play the role of an orchestrator: it decides which services to route the request to, and it assembles the final response needed by the frontend.
For scalability purposes, the communication between the API Gateway and individual microservices should be done asynchronously through an event-bus (or message queue).
However, the most important step in creating a scalable event-driven architecture which leverages microservices, is to properly define the bounded contexts of your system and understand the boundaries of each functionality.
More details about this architecture can be found here
Event storming is the first thing you need to do to identify domain events (a change in state in your system), for example 'userCreated', 'userModified', 'locationCreated', 'routeCreated', 'routeCompleted', etc. Then you can define topics that manage these events. Interested parties can consume these events by subscribing to the published events (via topics/channels) and then act accordingly. An implementation of an event-driven architecture is often composed of loosely coupled microservices that communicate asynchronously through a message broker like Apache Kafka. The free EDA book is an excellent resource for learning most of the things in EDA.
Tutorial: Event-driven architecture pattern
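To make the domain-event idea concrete, here is an illustrative sketch of publishing a 'userCreated' event to a Kafka topic with the plain Java client. The topic name, payload, and broker address are assumptions for the example, not part of the architecture above.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class UserEventPublisher {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key is the user id, so all events for a user stay in order on one partition.
            String eventJson = "{\"type\":\"userCreated\",\"userId\":\"42\",\"role\":\"admin\"}";
            producer.send(new ProducerRecord<>("user-events", "42", eventJson));
        }
        // Services such as map-data-service and planning-tasks-service would subscribe to
        // "user-events" and update their own view of user data, instead of calling
        // user-service synchronously.
    }
}
```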
How is temporal.io related to cadenceworkflow.io? What should be used if starting a new project depending on the cadence workflow service?
Disclaimer: I'm the original co-founder and tech lead of the Cadence project and currently co-founder/CEO of Temporal Technologies.
temporal.io is a fork of the Cadence project by its original founders and tech leads, Maxim Fateev and Samar Abbas. The fork is fully open source under the same MIT license as Cadence (with some SDKs under Apache 2.0). We started Temporal Technologies and received VC funding because we believe that the programming model we pioneered through AWS Simple Workflow, the Durable Task Framework, and the Cadence project has potential that goes far beyond a single company. Having a commercial entity to drive the project forward is essential for the longevity of the project.
The temporal.io fork has all the features of Cadence, as it constantly merges from it. It has also implemented multiple new features.
Here are some of the technical differences between Cadence and Temporal as of the initial release of the Temporal fork.
All thrift structures are replaced by protobuf ones
All public APIs of Cadence rely on Thrift. Thrift objects are also stored in the DB in serialized form.
Temporal converted all these structures to Protocol Buffers. This includes objects stored in the DB.
Communication protocol switched from TChannel to gRPC
Cadence relies on TChannel, a TCP-based multiplexing protocol developed at Uber. TChannel has a lot of limitations, like not supporting any security and having a very limited number of language bindings. It is essentially deprecated even at Uber.
Temporal uses gRPC for all interprocess communication.
TLS Support
Cadence doesn't support any communication security as it is a limitation of TChannel.
Temporal has support for mutual TLS and is going to support more advanced authentication and authorization features in the future.
Simplified configuration
Temporal has reworked the service configuration. Some of the most confusing parts of it are removed. For example, the need to configure membership seeds is eliminated: in Temporal, each host registers itself with the database upon startup and uses the list from the database as the seed list.
Release pipelines
Cadence doesn't test any publicly released artifacts, including Docker images, as its internal release pipeline ensures the quality of internally built artifacts only. It also doesn't perform any release testing for dependencies that are not used within Uber. For example, MySQL integration is not tested beyond rather incomplete unit tests. The same applies to the CLI and other components.
Temporal is making a heavy investment in the release process. All artifacts, including a fully supported matrix of dependencies, will go through a full release pipeline that includes multi-day stress runs.
The other important part of the release process is the ability to generate patches for production issues. The ability to ensure quality of such patches and produce all the necessary artifacts in a timely manner is important for anyone running Temporal in production.
Payload Metadata
Cadence stores activity inputs and outputs and other payloads as binary blobs without any associated metadata.
Temporal allows associating metadata with every payload. It enables features like dynamically pluggable serialization mechanisms, seamless compression, and encryption.
Failure Propagation
In Cadence, activity and workflow failures are modeled as a single binary payload and a string reason field. Only the Java client supports chaining exceptions across workflow and activity boundaries, but this chaining relies on fragile GSON serialization and doesn't work with other languages.
Temporal activity and workflow failures are modeled as protobufs and can be chained across components implemented in different SDKs. For example, a single failure trace can contain a chain caused by an exception that originates in an activity written in Python, propagates through a Go child workflow up to a Java workflow, and later reaches the client.
Go SDK
Temporal implemented the following improvements over Cadence Go client:
Protobuf & gRPC
No global registration of activity and workflow types
Ability to register an activity structure instance with the worker, which greatly simplifies passing external dependencies to the activities.
Workflow and activity interceptors which allow implementing features like configuring timeouts through external config files.
Activity and workflow type names do not include package names. This makes code refactoring without breaking changes much simpler.
Most of the timeouts which were required by Cadence are optional now.
workflow.Await method
Java SDK
Temporal implemented the following improvements over Cadence Java client:
Workflow and activity annotations, which allow workflow and activity implementation objects to also implement non-workflow/activity interfaces. This is important for playing nicely with AOP frameworks like Spring (see the sketch after this list).
Polymorphic workflow and activity interfaces. This allows having a common interface among multiple activity and workflow types.
Dynamic registration of signal and query handlers.
Workflow and activity interceptors which allow implementing features like configuring timeouts through external config files.
Activity and workflow type name generation improved
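As a hedged sketch of the annotation-based model mentioned above: the interface and method names below are invented for illustration, and the annotations reflect the Temporal Java SDK around its initial releases.

```java
import java.time.Duration;

import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;

@ActivityInterface
interface GreetingActivities {
    String composeGreeting(String name);
}

@WorkflowInterface
interface GreetingWorkflow {
    @WorkflowMethod
    String greet(String name);
}

class GreetingWorkflowImpl implements GreetingWorkflow {

    // The activity stub proxies calls through the Temporal service with the given timeouts;
    // because the implementation is a plain class, it can also implement unrelated interfaces.
    private final GreetingActivities activities = Workflow.newActivityStub(
            GreetingActivities.class,
            ActivityOptions.newBuilder()
                    .setStartToCloseTimeout(Duration.ofSeconds(30))
                    .build());

    @Override
    public String greet(String name) {
        return activities.composeGreeting(name);
    }
}
```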
SDKs not supported by Cadence
Typescript SDK, Python SDK, PHP SDK
SDKs under active development
.NET SDK, Ruby SDK
Temporal Cloud
Temporal Technologies monetizes the project by providing a hosted version of the Temporal service. There are dozens of companies (including SNAP) already using it in production.
Other
We have a lot of other features and client SDKs for other languages planned. You can find us at Temporal Community Forum.
Overview
Using iWF will let you switch between Cadence & Temporal easily.
In addition, iWF provides a nice abstraction on top of both and makes your life a lot easier.
The fact is that both Cadence and Temporal are under active development. You can see they have some different focuses if you look at their roadmaps. The two projects share the same vision of letting everyone rethink the programming model for long-running business logic.
Tasks across domain+clusters
If you have multiple Cadence clusters, this allows starting child workflows across different clusters and domains.
Support for both Thrift & gRPC
gRPC support is completely done on the server side. Internal traffic all uses gRPC, and we are working on letting users migrate from Thrift to gRPC.
Authorization
Permissions are based on domains but can be extended. Unlike in Temporal, the permission policy can be stored within Cadence's domain data storage, so you don't have to build another service/storage to manage it.
Note that the whole proposal was developed by a community member.
Workflow Shadower
Workflow Shadower is built on top of Workflow Replayer to address this problem. The basic idea of shadowing is: scan workflows based on the filters you defined, fetch the history for each workflow in the scan result from the Cadence server, and run the replay test. It can be run either as a test for local development or as a workflow in your worker to continuously replay production workflows.
Graceful domain failover
This allows XDC (multiple-cluster) mode to reduce the pain of rerunning some tasks during failover.
NoSQL plugin model
This allows implementing different NoSQL persistence options with minimal effort. At the time of writing this post, Temporal hadn't started working on it.
MongoDB support
On top of the NoSQL interfaces, MongoDB support is WIP.
Using multiple SQL instances as sharded SQL
This allows users to have a Cadence cluster at a much larger scale (and then to use XDC to add more DB instances).
Configuration Storage for Dynamic config
This enables us to change the dynamic configuration (e.g., for rate limiting) without making any deployment; a single CLI command can control the behavior of the system.
It's experimental and still a WIP for production readiness.
Workflow notification
A WIP ecosystem project to allow getting notifications from Cadence. This is a benefit of Cadence using Kafka to deliver visibility messages. Temporal doesn't use Kafka, which will make this feature super difficult to support.
Periodic health checker (canary), benchmark tool, and benchmark setup docs
More Documentation
Seamless Cluster Migration guidance
Dashboard/Monitoring
...
...
Other small improvements that Temporal is missing
TerminateIfRunning IDReusePolicy
All domain API forwarding policy
Better & cleaner XDC configuration
Tooling to deserialize database blob data
...
...
I'm from the Cadence team at Uber, and I wanted to let you know that Cadence continues to be developed actively by our team. Below is a section of the update that we shared with the Cadence community recently:
We want to reinforce that Uber's Cadence team is committed to the growth and open source development of the Cadence project. Today, Cadence powers 100+ different use cases within Uber and that number grows quickly. Collectively, there are 50M+ ongoing executions at any moment on average, and our customers finish 3B+ executions per month. Outside of Uber, we also know that many engineering teams at various companies have already adopted Cadence for their business-critical workflows. We are excited to continue evolving Cadence as an open-source project in a backward-compatible way with an increased focus on reliability, scalability, and maintainability in the near term.
It's probably too early to compare Cadence and Temporal. Still, I have a few ideas around how we can systematically shed light on Cadence's roadmap to ensure all the necessary information is out there to enable such comparisons going forward. I'll update this post with links when we create a page with information about the roadmap.
In the meantime, please let me know if you need further information about Cadence that would be helpful in this context.
Temporal.io is a company that has forked the Cadence project and is now building on top of it, naming it Temporal.
It was founded by the authors of Cadence.
I would suggest using temporal.io, as it is under active development.
I need to develop a REST API which can read published messages from a Kafka cluster into a data warehouse application.
Materials available over the internet say to use POST/GET commands, but I don't think this is for production use; rather, it is useful for testing purposes.
How can I implement it in Scala/Java?
Materials available over the internet say to use POST/GET commands, but I don't think this is for production use; rather, it is useful for testing purposes
Please link to where you read this... All production web services operate over (more than) these two HTTP methods, hundreds of thousands of times a day...
If you really want to use Kafka for throughput, though, you wouldn't "hide it" behind a REST interface. You would distribute SSL certs plus usernames+passwords to remote clients, for example.
Need to develop a REST API which can read published messages from Kafka
REST is not meant to keep an open connection, primarily because it is stateless (it shouldn't maintain where you are reading from in Kafka)... It would make more sense to forward a websocket from a Kafka consumer, which is different from a REST API.
how can I implement it in Scala/Java
The Confluent REST Proxy is already written in Java, and it is open-source (and used in Production at several companies, I believe). If you need inspiration, then you can start there. Otherwise, you can find examples of Spring and Vert.x, for example, with their Kafka integrations in their respective documentations, but you'll be re-implementing a lot of the existing functionality.
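If you do end up writing your own consumer-side service (behind a websocket or any custom endpoint), the plain Java client is a small amount of code. This is a minimal sketch with placeholder topic, group id, and broker address; the "push to the warehouse" step is left as a comment.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class WarehouseFeedConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "warehouse-loader");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Push each record to the data warehouse (or a websocket session) here.
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```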
We are building a Spring Boot based web application which consists of a central API server and multiple remote workers. We tried to embrace the idea of microservices, so each component is built as a separate Spring Boot app. But they do share a lot of the same data schema for the objects being handled across components, so currently the JPA model definitions in code are duplicated in each component's project. Every time something changes, we need to remember to change it everywhere, and this results in poor compatibility between different versions across components.
So I'm wondering if this is the best we can do here, or is there better ways to manage the components code in such scenario?
To be more specific, we use both MySQL and Redis for data storage and both are accessed by all components. Redis is actually serving as one tool of data communication between components.