Asynchronous I/O Beam API for External Data Access - apache-beam

I am curious to know whether Flink's Asynchronous I/O API for External Data Access is also available in Beam. I could not find anything specific. I know that Beam can use Flink as its execution framework, so async I/O may be possible for Beam as well, but I am not sure whether such an API has been defined yet.

I didn't find anything related in Beam's Flink runner codebase, and I believe it would not be easy to do (I'm not sure it's even possible), since all Beam IOs are implemented in a unified way so that they can run on different runners. Any runner-specific processing features or optimisations have to be applied by the runner that executes the Beam pipeline (Flink in your case) during the pipeline translation stage, and they still have to stay within the limits of the Beam model.

Related

Temporal workflow vs Cadence workflow

How is temporal.io related to cadenceworkflow.io? What should be used if starting a new project depending on the cadence workflow service?
Disclaimer: I'm the original co-founder and tech lead of the Cadence project and currently co-founder/CEO of Temporal Technologies.
temporal.io is a fork of the Cadence project by its original founders and tech leads, Maxim Fateev and Samar Abbas. The fork is fully open source under the same MIT license as Cadence (with some SDKs under Apache 2.0). We started Temporal Technologies and received VC funding because we believe that the programming model we pioneered through AWS Simple Workflow, the Durable Task Framework, and the Cadence project has potential that goes far beyond a single company. Having a commercial entity to drive the project forward is essential for the longevity of the project.
The temporal.io fork has all the features of Cadence, as it constantly merges from it. It has also implemented multiple new features.
Here are some of the technical differences between Cadence and Temporal as of the initial release of the Temporal fork.
All thrift structures are replaced by protobuf ones
All public APIs of Cadence rely on Thrift. Thrift objects are also stored in the DB in serialized form.
Temporal converted all these structures to Protocol Buffers. This includes objects stored in the DB.
Communication protocol switched from TChannel to gRPC
Cadence relies on TChannel, a TCP-based multiplexing protocol developed at Uber. TChannel has a lot of limitations, like not supporting any security and having a very limited number of language bindings. It is essentially deprecated even at Uber.
Temporal uses gRPC for all interprocess communication.
TLS Support
Cadence doesn't support any communication security, as this is a limitation of TChannel.
Temporal has support for mutual TLS and is going to support more advanced authentication and authorization features in the future.
Simplified configuration
Temporal has reworked the service configuration. Some of its most confusing parts are removed. For example, the need to configure membership seeds is eliminated: in Temporal, each host registers itself with the database upon startup and uses the list from the database as the seed list.
Release pipelines
Cadence doesn't test any publicly released artifacts, including Docker images, as its internal release pipeline ensures the quality of internally built artifacts only. It also doesn't perform any release testing for dependencies that are not used within Uber. For example, MySQL integration is not tested beyond rather incomplete unit tests. The same applies to the CLI and other components.
Temporal is investing heavily in the release process. All artifacts, including a fully supported matrix of dependencies, are going to go through a full release pipeline that includes multi-day stress runs.
The other important part of the release process is the ability to generate patches for production issues. The ability to ensure the quality of such patches and produce all the necessary artifacts in a timely manner is important for anyone running Temporal in production.
Payload Metadata
Cadence stores activity inputs and outputs and other payloads as binary blobs without any associated metadata.
Temporal allows associating metadata with every payload. It enables features like dynamically pluggable serialization mechanisms, seamless compression, and encryption.
Failure Propagation
In Cadence, activity and workflow failures are modeled as a single binary payload and a string reason field. Only the Java client supports chaining exceptions across workflow and activity boundaries, but this chaining relies on fragile GSON serialization and doesn't work with other languages.
Temporal activity and workflow failures are modeled as protobufs and can be chained across components implemented in different SDKs. For example, a single failure trace can contain a chain caused by an exception that originates in an activity written in Python, is propagated through a Go child workflow up to a Java workflow, and later to the client.
Go SDK
Temporal implemented the following improvements over the Cadence Go client:
Protobuf & gRPC
No global registration of activity and workflow types
Ability to register an activity struct instance with the worker. This greatly simplifies passing external dependencies to the activities.
Workflow and activity interceptors which allow implementing features like configuring timeouts through external config files.
Activity and workflow type names do not include package names. This makes code refactoring without breaking changes much simpler.
Most of the timeouts which were required by Cadence are optional now.
workflow.Await method
Java SDK
Temporal implemented the following improvements over the Cadence Java client:
Workflow and activity annotations that allow workflow and activity implementation objects to also implement interfaces unrelated to workflows and activities. This is important for playing nicely with AOP frameworks like Spring.
Polymorphic workflow and activity interfaces. This allows having a common interface among multiple activity and workflow types.
Dynamic registration of signal and query handlers.
Workflow and activity interceptors which allow implementing features like configuring timeouts through external config files.
Improved activity and workflow type name generation
SDKs not supported by Cadence
Typescript SDK, Python SDK, PHP SDK
SDKs under active development
.NET SDK, Ruby SDK
Temporal Cloud
Temporal Technologies monetizes the project by providing a hosted version of the Temporal service. There are dozens of companies (including SNAP) already using it in production.
Other
We have a lot of other features and client SDKs for other languages planned. You can find us at Temporal Community Forum.
Overview
Using iWF will let you switch between Cadence & Temporal easily.
In addition, iWF provides a nice abstraction on top of both and makes your life a lot better.
The fact is that both Cadence & Temporal are under active development. You can see they have somewhat different focuses if you look at their roadmaps. The two projects share the same vision of letting everyone rethink the programming model for long-running business logic.
Tasks across domain+clusters
If you have multiple Cadence clusters,
this allows starting child workflows across different clusters and domains.
Support for both Thrift & gRPC
gRPC support is complete on the server side. Internal traffic all uses gRPC, and we are working on letting users migrate from Thrift to gRPC.
Authorization
Permissions are domain-based but can be extended. Unlike Temporal, the permission policy can be stored within Cadence's domain data storage, so you don't have to build another service or storage layer to manage it.
Note that the whole proposal was developed by a community member.
Workflow Shadower
Workflow Shadower is built on top of Workflow Replayer to address the problem of verifying workflow code changes against existing workflow histories. The basic idea of shadowing is: scan workflows based on the filters you define, fetch the history for each workflow in the scan result from the Cadence server, and run the replay test. It can be run either as a test for local development or as a workflow in your worker to continuously replay production workflows.
Graceful domain failover
This allows XDC (multi-cluster) mode to reduce the pain of rerunning some tasks during failover.
NoSQL plugin model
This allows implementing different NoSQL persistence backends with minimal effort. At the time of writing this post, Temporal hadn't started working on it.
MongoDB support
On top of the NoSQL interfaces, MongoDB support is WIP.
Using multiple SQL instances as sharded SQL
This allows users to run a Cadence cluster at a much larger scale (and then use XDC to add even more DB instances).
Configuration Storage for Dynamic config
This enables changing dynamic configuration (e.g., rate limiting) without any deployment; a single CLI command can control the behavior of the system.
It's experimental and still a work in progress for production readiness.
Workflow notification
A WIP ecosystem project that allows getting notifications from Cadence. This is a benefit of Cadence using Kafka to deliver visibility messages. Temporal doesn't use Kafka, which makes this feature much harder to support there.
Periodic health checker (Canary), benchmark tool, and benchmark setup docs
More Documentation
Seamless Cluster Migration guidance
Dashboard/Monitoring
...
...
Other small improvements that Temporal is missing
TerminateIfRunning IDReusePolicy
All domain API forwarding policy
Better & cleaner XDC configuration
Tooling to deserialize database blob data
...
...
I'm from the Cadence team at Uber, and I wanted to let you know that Cadence continues to be developed actively by our team. Below is a section of the update that we shared with the Cadence community recently:
We want to reinforce that Uber's Cadence team is committed to the
growth and open source development of the Cadence project. Today,
Cadence powers 100+ different use cases within Uber and that number
grows quickly. Collectively, there are 50M+ ongoing executions at any
moment on average and our customers finish 3B+ executions per month.
Outside of Uber, we also know that many engineering teams at various
companies have already adopted Cadence for their business-critical
workflows. We are excited to continue evolving Cadence as an
open-source project in a backward-compatible way with an increased
focus on reliability, scalability, and maintainability in the near
term.
It's probably too early to compare Cadence and Temporal. Still, I have a few ideas around how we can systematically shed light on Cadence's roadmap to ensure all the necessary information is out there to enable such comparisons going forward. I'll update this post with links when we create a page with information about the roadmap.
In the meantime, please let me know if you need further information about Cadence that would be helpful in this context.
Temporal.io is a company that has forked the Cadence project and is now building on top of it, naming it Temporal.
It was founded by the authors of Cadence.
I would suggest using temporal.io, as it is under active development.

A dynamic message orchestration flow engine for Kafka

I'm trying to see what toolkits/frameworks are available to achieve the following.
A toolkit where a developer typically should configure the data flow (which is a series of steps) to form a data processing pipeline. A declarative approach with zero or very minimal coding.
The underlying messaging infrastructure should be Kafka, i.e. the toolkit should support Kafka straight out of the box (when the right dependencies are included).
Very intuitive to visualise, deploy, debug the flows.
Aggregation capabilities (group by) etc on streaming data.
I'm seeing Spring Cloud Data Flow as something that could (possibly) be tried out as a candidate.
Is this what it is meant for (from people using it in production)?
Are there any free/opensource alternatives too?
I will attempt to unpack a few topics in the context of Spring Cloud Data Flow (SCDF).
A toolkit where a developer typically should configure the data flow (which is a series of steps) to form a data processing pipeline. A declarative approach with zero or very minimal coding.
There are ~70 data integration applications that we maintain and ship. They should cover the most common use-cases. Each of them is a Spring Cloud Stream application, and the business logic in them can work as-is with a variety of message brokers that the framework supports, including Kafka and Kafka Streams.
However, when you have a custom data processing requirement and there's no application to address that need, you will have to build custom source, processor, or sink style apps (see the sketch below). If you don't want to use Java, polyglot workloads are possible as well.
SCDF allows you to assemble the applications into a coherent streaming data pipeline [see streams developer guide]. SCDF then orchestrates the deployment of the apps in the data pipeline to targeted platforms like Kubernetes as native resources.
Because these applications are connected with one another through persistent pub/sub brokers (e.g. Kafka), SCDF also provides the primitives to CI/CD, rolling-upgrade, and rolling-rollback the individual applications in the streaming data pipeline without causing upstream or downstream impacts. Data ordering and delivery guarantees are also preserved because we rely upon and delegate that to the underlying message broker.
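To make the "custom processor" idea concrete, here is a minimal sketch of a Spring Cloud Stream function-style processor written in Scala (on top of the Java APIs). The application name, the "enrich" function, and the stream definition in the comment are illustrative assumptions, not something SCDF ships out of the box.

    // Hypothetical custom processor app; Spring Cloud Stream binds the Function bean
    // to input/output destinations (Kafka topics when the Kafka binder is on the classpath).
    // Once registered with SCDF, it could be composed in a stream definition such as:
    //   http | enrich | log
    import java.util.function.{Function => JFunction}

    import org.springframework.boot.SpringApplication
    import org.springframework.boot.autoconfigure.SpringBootApplication
    import org.springframework.context.annotation.Bean

    @SpringBootApplication
    class EnrichProcessorApplication {

      // The business logic stays broker-agnostic: it's just a function from payload to payload.
      @Bean
      def enrich(): JFunction[String, String] =
        (payload: String) => s"enriched:$payload"
    }

    object EnrichProcessorApplication extends App {
      SpringApplication.run(classOf[EnrichProcessorApplication], args: _*)
    }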
The underlying messaging infrastructure should be Kafka - ie the toolkit should support Kafka straight out of the box (when the right dependencies are included).
This is already covered in the previous answer. The point to note here, though, is that in the future, if you want to switch from Kafka to, say, Azure Event Hubs, there's absolutely zero code change required in the business logic. Spring Cloud Stream workloads are portable, and you're not locking yourself into a single technology like Kafka.
Very intuitive to visualise, deploy, debug the flows
SCDF supports a drag-and-drop interface, integration with observability tooling such as Prometheus + Grafana, and metrics-based auto-scaling of applications in the data pipeline.
All of the above is also possible to accomplish by directly using SCDF's APIs, Java DSL (programmatic creation of data pipelines — critical for CI/CD automation), or Shell/CLI.
Aggregation capabilities (group by) etc on streaming data
When using the Kafka Streams binder implementation, you can build comprehensive joins, aggregations, and stateful analytics (see the samples and the sketch below).
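As a rough illustration of the kind of group-by aggregation meant here, the sketch below uses the plain Kafka Streams Scala DSL (the binder wires an equivalent topology up for you from a function bean). The topic names and key/value types are assumptions.

    // Counts records per region: a stateful "group by" over a stream, backed by a state store.
    import org.apache.kafka.streams.scala.ImplicitConversions._
    import org.apache.kafka.streams.scala.StreamsBuilder
    import org.apache.kafka.streams.scala.serialization.Serdes._

    object OrderCounts {
      def topology(builder: StreamsBuilder): Unit =
        builder
          .stream[String, String]("orders")   // key: orderId, value: region (assumed)
          .groupBy((_, region) => region)     // group by the region value
          .count()                            // stateful aggregation
          .toStream
          .to("order-counts-by-region")       // emit the running counts downstream
    }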

Use AWS SQS in Scala Play 2.4 reactive project

I want to use AWS SQS in my Play 2.4 project.
There are two options at the moment:
There is an SQS wrapper, https://github.com/kifi/franz, which supports a reactive way of using SQS. But it doesn't seem very popular on GitHub, so I'm not sure how mature it is or whether the developers will continue to maintain it.
There is the Java SQS SDK, but it doesn't support Scala Futures (the reactive way). If I want to make it non-blocking, could I use Akka, or something else?
I haven't used SQS yet, but I did have a similar need for DynamoDB. I tried using a third-party Scala-specific library -- bad idea (buggy, bad API, missing features). In the end I wrapped the AWS Java SDK. Making it 'reactive' is pretty easy and is similar to most things you would do in Scala: call the AWS functions within a Future body (Future {}) and use functional constructs to process the result (map, reduce, etc.).
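A minimal sketch of that wrapping approach for SQS, assuming the AWS Java SDK v1; the queue URL and execution context are placeholders, and in a Play app you would usually supply a dedicated ExecutionContext for blocking I/O rather than the global one.

    import com.amazonaws.services.sqs.AmazonSQSClientBuilder
    import com.amazonaws.services.sqs.model.{Message, SendMessageResult}

    import scala.collection.JavaConverters._
    import scala.concurrent.{ExecutionContext, Future}

    // Wraps the blocking SQS client so each call returns a Scala Future.
    class SqsClient(queueUrl: String)(implicit ec: ExecutionContext) {
      private val sqs = AmazonSQSClientBuilder.defaultClient()

      def send(body: String): Future[SendMessageResult] =
        Future(sqs.sendMessage(queueUrl, body))

      def receive(): Future[Seq[Message]] =
        Future(sqs.receiveMessage(queueUrl).getMessages.asScala.toList)
    }

    // Usage: process the result with the usual functional combinators, e.g.
    //   client.receive().map(_.foreach(m => println(m.getBody)))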

build workflow engine with Akka

In our Scala/Play application we use Activiti (and are also experimenting with Camunda). Users can create workflows (like the one shown in this picture: http://camunda.com/ ). All calls to these external workflow engines are wrapped in Scala Futures (the Activiti and Camunda APIs are all blocking Java APIs).
Is there any library to implement workflows entirely using Akka/actors, avoiding heavy toolkits like Activiti/Camunda? Or any ideas on how to best use Akka with Activiti/Camunda?
You could try using the Akka FSM DSL to do the same, bypassing Activiti and the blocking APIs. See http://doc.akka.io/docs/akka/snapshot/scala/fsm.html
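For a flavour of what that looks like, here is a small sketch of a two-step "workflow" using the classic Akka FSM DSL; the states, messages, and business steps are made up for illustration.

    import akka.actor.{ActorSystem, FSM, Props}

    sealed trait State
    case object Idle extends State
    case object AwaitingApproval extends State
    case object Done extends State

    sealed trait Data
    case object Empty extends Data
    final case class Order(id: String) extends Data

    final case class Submit(orderId: String)
    case object Approve

    class OrderWorkflow extends FSM[State, Data] {
      startWith(Idle, Empty)

      when(Idle) {
        case Event(Submit(id), Empty) =>
          goto(AwaitingApproval) using Order(id)
      }

      when(AwaitingApproval) {
        case Event(Approve, order: Order) =>
          // side effects (e.g. calling an external service) would be kicked off here
          goto(Done) using order
      }

      when(Done)(FSM.NullFunction)

      initialize()
    }

    // val system = ActorSystem("workflows")
    // val wf = system.actorOf(Props[OrderWorkflow](), "order-1")
    // wf ! Submit("order-42"); wf ! Approve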
Note that Camunda has very powerful asynchronous continuation features which allow you to delegate any long-running processing to background threads. This allows very flexible configuration of "how much work" is done synchronously in the client (possibly HTTP) thread. This can give you a good balance between performance and fault tolerance.
I know of the existence of the Catify BPMN Engine, built using Akka (Java). I do not have any experience with it, nor do I know for sure whether API calls are asynchronous, but I would expect so. Since it is written in Akka it should combine well with Play!.

Batch processing and functional programming

As a Java developer, I'm used to using Spring Batch for batch processing, generally using a streaming library to export large XML files with StAX, for example.
I'm now developing a Scala application and wonder if there's any framework, tool, or guideline for batch processing.
My Scala application uses the Cake Pattern and I'm not sure how I could integrate this with Spring Batch. Also, I'd like to follow the guidelines described in Functional Programming in Scala and try to keep functional purity, using stuff like the IO monad...
I know this is kind of an open question, but I never read anything about this...
Has anyone here already done functional batch processing? How did it work out? Am I supposed to have a main that creates a batch processing operation in an IO monad and runs it? Is there any tool or guideline to help with monitoring or restartability, like we use Spring Batch for in Java?
Do you use Spring Batch in Scala?
How do you handle the integration part, for example waiting for a JMS/AMQP message to start the processing that produces an XML file?
Any feedback on the subject is welcome.
You don't mention what kind of app you are developing with Scala, so I'm going to take a wild guess here and suppose you are doing a server-side one. Going further with the wild guessing, let's say you are using Akka... because you are using it, aren't you? :)
In that case, I guess what you are looking for is Akka Quartz Scheduler, the official Quartz Extension and utilities for cron-style scheduling in Akka. I haven't tried it myself, but from your requirements it seems that Akka + this module would be a good fit. Take into account that Akka already provides hooks to handle restartability of failed actors, and I don't think that it would be that difficult to add monitoring of batch processes leveraging the lifecycle callbacks built into actors.
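As noted, I haven't tried it myself, but based on the akka-quartz-scheduler README the wiring would look roughly like this; the schedule name, actor, message, and cron expression are all assumptions.

    // The cron schedule is expected to be defined in application.conf, e.g.:
    //   akka.quartz.schedules.EveryNight.expression = "0 0 2 * * ?"
    import akka.actor.{Actor, ActorSystem, Props}
    import com.typesafe.akka.extension.quartz.QuartzSchedulerExtension

    case object RunBatch

    class BatchRunner extends Actor {
      def receive: Receive = {
        case RunBatch =>
          // kick off the batch / XML export here (ideally delegating to non-blocking code)
          println("running nightly export")
      }
    }

    object BatchApp extends App {
      val system = ActorSystem("batch")
      val runner = system.actorOf(Props[BatchRunner](), "batch-runner")

      // Sends RunBatch to the runner according to the cron schedule named "EveryNight".
      QuartzSchedulerExtension(system).schedule("EveryNight", runner, RunBatch)
    }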
Regarding interaction with JMS/AMQP messaging, you could use the Akka Camel module, which provides support for sending and receiving messages through a lot of protocols, including JMS. Using this module you could have a consumer actor receiving messages from some JMS endpoint and fire whatever process you want from there, probably by forwarding or sending a new message to the actor responsible for that process. Whether the process is fired by a cron-style timer or by an incoming message, you can reuse the same actor to accomplish the task.
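A rough sketch of such a consumer actor with akka-camel is shown below; the endpoint URI, message type, and target actor are assumptions, and the camel-jms component still has to be registered with a JMS connection factory for the "jms:" endpoint to work.

    import akka.actor.ActorRef
    import akka.camel.{CamelMessage, Consumer}

    final case class StartExport(payload: String)

    // Listens on a JMS queue and forwards each request to the actor that owns the batch/export process.
    class JmsTrigger(exporter: ActorRef) extends Consumer {
      def endpointUri: String = "jms:queue:export-requests"

      def receive: Receive = {
        case msg: CamelMessage =>
          exporter ! StartExport(msg.body.toString)
      }
    }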