What is Apache Beam? [closed] - apache-beam

I was going through the Apache posts and found a new term called Beam. Can anybody explain what exactly Apache Beam is?
I tried to Google it but was unable to get a clear answer.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.
History: The model behind Beam evolved from a number of internal Google data processing projects, including MapReduce, FlumeJava, and MillWheel. This model was originally known as the “Dataflow Model” and was first implemented as Google Cloud Dataflow -- including a Java SDK on GitHub for writing pipelines and a fully managed service for executing them on Google Cloud Platform. Others in the community began writing extensions, including a Spark Runner, Flink Runner, and Scala SDK. In January 2016, Google and a number of partners submitted the Dataflow Programming Model and SDKs portion as an Apache Incubator Proposal, under the name Apache Beam (unified Batch + strEAM processing). Apache Beam graduated from incubation in December 2016.
Additional resources for learning the Beam Model:
The Apache Beam website
The VLDB 2015 paper (using the original naming Dataflow model)
Streaming 101 and Streaming 102 posts on O’Reilly’s Radar site
A Beam podcast on Software Engineering Radio

Apache Beam (Batch + strEAM) is a model and set of APIs for doing both batch and streaming data processing. It was open-sourced by Google (with Cloudera and PayPal) in 2016 via an Apache incubator project.
The page Dataflow/Beam & Spark: A Programming Model Comparison - Cloud Dataflow contrasts the Beam API with Apache Spark, which has been hugely successful at bringing a modern, flexible API and set of optimization techniques for both batch and streaming to the Hadoop world and beyond.
Beam tries to take all that a step further via a model that makes it easy to describe the various aspects of the out-of-order processing that often is an issue when combining batch and streaming processing, as described in that Programming Model Comparison.
In particular, to quote from the comparison, the Dataflow model is designed to address, elegantly and in a way that is more modular, robust, and easier to maintain:
... the four critical questions all data processing practitioners must attempt to answer when building their pipelines:
What results are calculated? Sums, joins, histograms, machine learning models?
Where in event time are results calculated? Does the time each event originally occurred affect results? Are results aggregated in fixed windows, sessions, or a single global window?
When in processing time are results materialized? Does the time each event is observed within the system affect results? When are results emitted? Speculatively, as data evolve? When data arrive late and results must be revised? Some combination of these?
How do refinements of results relate? If additional data arrive and results change, are they independent and distinct, do they build upon one another, etc.?
The pipelines described in Beam can in turn be run on Spark, Flink, Google's Dataflow offering in the cloud, and other "runtimes", including a "Direct" local machine option.
A variety of languages are supported by the architecture. The Java SDK is available now. A Dataflow Python SDK is nearing release, and others are envisioned for Scala etc.
See the source at Mirror of Apache Beam
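To make the model concrete, here is a minimal sketch of a windowed pipeline using the Java SDK. It assumes the post-incubation org.apache.beam packages; the input elements and timestamps are illustrative:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class MinimalWindowedCount {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // "What" is computed: a count per element.
    // "Where" in event time: fixed one-minute windows.
    PCollection<String> events = p.apply(Create.timestamped(
        TimestampedValue.of("click", new Instant(0L)),
        TimestampedValue.of("click", new Instant(30_000L)),
        TimestampedValue.of("click", new Instant(90_000L)))); // lands in the second window

    events
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply(Count.perElement());
    // A real pipeline would add a sink here (e.g. TextIO) and could add
    // triggers to control "when" results are emitted and "how" they refine.

    p.run().waitUntilFinish();
  }
}
```

The same pipeline runs unchanged on the Direct runner, Dataflow, Spark, or Flink; only the runner configuration changes.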

Related

When to use a workflow engine - use and misuse

THE CONTEXT:
I need to develop software to calculate billing for a lot of customers.
The software will be used by different local administrations, each one with its own rules for calculating the billing for its citizens.
At first I thought of a workflow engine in order to "design" different calculation flows and apply them to the customers.
In the past I had a little experience with a workflow manager product (I worked a little with IBM BPM), and I had a lot of difficulty debugging what happened when something went wrong, and I found a lot of performance issues (compared to a simple OOP application).
Maybe these difficulties were caused by my poor knowledge of the tool, or maybe IBM BPM is not as good as IBM says.
Anyway, with respect to my objective (producing custom billing, and making it as flexible as possible in terms of configuration and process), is a workflow engine a suitable product?
Any suggestions about tools, frameworks, and above all how to approach the problem are welcome.
My initial idea of the architecture is to develop the main application in C# (where I'm more confident) and use a workflow engine (like jBPM) as a black box, invoking previously configured flows in the BPM.
I would recommend using Cadence Workflow for your use case. It records all events that are related to your workflow in an execution history. It makes troubleshooting of production issues very straightforward.
As a workflow is essentially a program in Java (or Go), you have unlimited flexibility in the implementation. Performance is also not an issue, as Cadence was built from the ground up for high scalability. It has been tested to over a hundred million open workflows and tens of thousands of events per second.
See the presentation that goes over the Cadence programming model.
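As a hedged sketch (assuming the com.uber.cadence Java client; the billing-specific names are illustrative, not from the question), a workflow is just annotated Java:

```java
import com.uber.cadence.activity.ActivityMethod;
import com.uber.cadence.activity.ActivityOptions;
import com.uber.cadence.workflow.Workflow;
import com.uber.cadence.workflow.WorkflowMethod;
import java.time.Duration;

interface BillingWorkflow {
  @WorkflowMethod
  void calculateBilling(String customerId);
}

interface BillingActivities {
  @ActivityMethod
  double computeCharges(String customerId);

  @ActivityMethod
  void issueInvoice(String customerId, double amount);
}

public class BillingWorkflowImpl implements BillingWorkflow {
  // Each activity call is recorded in the workflow's execution history,
  // so a failed run can be inspected and replayed step by step.
  private final BillingActivities activities = Workflow.newActivityStub(
      BillingActivities.class,
      new ActivityOptions.Builder()
          .setScheduleToCloseTimeout(Duration.ofMinutes(5))
          .build());

  @Override
  public void calculateBilling(String customerId) {
    double amount = activities.computeCharges(customerId);
    activities.issueInvoice(customerId, amount);
  }
}
```

Each local administration's rules could live in its own activity implementation, so the flow stays one plain, debuggable program rather than an opaque BPM artifact.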

Scala Redis driver for high performance

I need to develop a Scala-based application that will write/read to/from managed AWS Redis at a very high rate. On the official Redis page they mention several clients, without comparison. For my project every microsecond matters. I saw similar questions here on SO, but they are all outdated.
Please advise which client has better performance.
As another answer pointed out, you can use Jedis: https://github.com/xetorthio/jedis/blob/master/src/main/java/redis/clients/jedis/JedisPool.java
The latency may depend more on requesting within the same AZ/VPC (avoiding external networks) and on using Redis pipelines, which batch transactions together and reduce the number of requests. See pipeline usage examples here:
https://github.com/xetorthio/jedis/wiki/AdvancedUsage
Here is another example combining AWS client libraries with Jedis:
https://github.com/fishercoder1534/AmazonElastiCacheExample/blob/master/src/main/java/AmazonElastiCacheExample.java
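For illustration, here is a minimal Jedis pipelining sketch (the endpoint and key names are placeholders). It queues several commands locally and flushes them in a single round trip, which is usually what matters most for microsecond budgets. Jedis is a Java client, so it can be called from Scala directly:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;
import redis.clients.jedis.Response;

public class PipelineExample {
  public static void main(String[] args) {
    // Connect within the same AZ/VPC to keep network latency down.
    try (Jedis jedis = new Jedis("my-elasticache-endpoint.example.com", 6379)) {
      Pipeline pipe = jedis.pipelined();
      pipe.set("user:1:name", "alice");                 // queued locally
      pipe.incr("user:1:visits");                       // queued locally
      Response<String> name = pipe.get("user:1:name");  // queued locally
      pipe.sync();                                      // one round trip for the whole batch
      System.out.println("name = " + name.get());
    }
  }
}
```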

Best approach to construct a real-time rule engine for our streaming events

We are at the beginning of building an IoT cloud platform project. There are certain well-known pieces needed to achieve a complete IoT platform solution. One of them is a real-time rule processing/engine system, which is needed to determine whether streaming events match any rules defined dynamically by end users in a readable format (SQL, Drools if/when/then, etc.).
I am confused because there are lots of products and projects (Storm, Spark, Flink, Drools, EsperTech, etc.) out there, so, considering we have a 3-person development team (a junior, a mid-level, and a senior), what would be the best choice?
Choosing one of the streaming projects such as Apache Flink and learning it well?
Choosing one of the complete solutions (AWS, Azure, etc.)?
A BRMS (Business Rule Management System) like Drools is mainly built for quickly adapting to changes in business logic, and is more mature and stable compared to stream processing engines like Apache Storm, Spark Streaming, and Flink. Stream processing engines are built for high throughput and low latency. A BRMS may not be suitable for serving hundreds of millions of events in IoT scenarios and may have difficulty with event-time-based window calculations.
All these solutions can be used on IaaS providers. On AWS you may also want to take a look at AWS EMR and Kinesis/Kinesis Analytics.
Some use cases I've seen:
Stream data directly to FlinkCEP (see the sketch below).
Use rule engines to respond quickly with low latency, while at the same time streaming data to Spark for analysis and machine learning.
You can also run Drools in Spark and Flink to hot-deploy user-defined rules.
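As a hedged sketch of the FlinkCEP option (assuming a Flink 1.3+ style API; the SensorEvent type and the 90-degree threshold are made up for illustration), a simple rule looks like this:

```java
import java.util.List;
import java.util.Map;
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternSelectFunction;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TemperatureRule {
  // A simple POJO event; in an IoT platform this would come from Kafka/Kinesis.
  public static class SensorEvent {
    public String deviceId;
    public double temperature;
    public SensorEvent() {}
    public SensorEvent(String deviceId, double temperature) {
      this.deviceId = deviceId;
      this.temperature = temperature;
    }
  }

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    DataStream<SensorEvent> events = env.fromElements(
        new SensorEvent("dev-1", 20.5),
        new SensorEvent("dev-1", 95.0));

    // Rule: fire whenever a single reading exceeds 90 degrees.
    Pattern<SensorEvent, ?> overheat = Pattern.<SensorEvent>begin("overheat")
        .where(new SimpleCondition<SensorEvent>() {
          @Override
          public boolean filter(SensorEvent e) {
            return e.temperature > 90.0;
          }
        });

    PatternStream<SensorEvent> matches = CEP.pattern(events, overheat);
    matches.select(new PatternSelectFunction<SensorEvent, String>() {
      @Override
      public String select(Map<String, List<SensorEvent>> match) {
        return "ALERT: " + match.get("overheat").get(0).deviceId;
      }
    }).print();

    env.execute("cep-rule");
  }
}
```

The trade-off versus Drools is that patterns like this are compiled code, so rules defined dynamically by end users would need a translation layer on top.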
Disclaimer: I work for them. But you should check out Losant. It's developer-friendly and super easy to get started with. We also have a workflow engine, where you can build custom logic/rules for your application.
Check out the Waylay rules engine, built specifically for real-time IoT data streams.
In the beginning phase, go for a cloud-based IoT platform like Predix, AWS, SAP, or Watson for rapid product development and initial learning.

EventStore vs. MongoDb [closed]

I would like to know what advantages there are to using EventStore (http://geteventstore.com) over implementing event sourcing yourself in a MongoDb.
The reason I ask, is that our company has a number of people that work with MongoDb daily. They don't work with Event Sourcing though. While they are not completely in the dark about the subject, they aren't about to start implementing it anywhere either.
I am about to start a project, that is perfectly suited for Event Sourcing. There are about 16 very well defined events, and about 7 well defined projections. I say "about" because I know there will be demand for more projections and events once they see the product in use.
The approach is going to be API first, with a REST Api that other parts of our organisation are going to consume.
While I have read a lot about Event Sourcing the way Greg Young defines it, I have never actually implemented an Event Sourcing solution.
This is a greenfield project. There are no technology restrictions, since we are going to expose everything as a REST interface. So if anyone has working experience with EventStore or Event Sourcing with MongoDb, please enlighten me.
Also an almost totally non related question about Event Sourcing:
Do you ever query the event store directly? Or would you always create new projections and replay events to populate those projections?
Disclaimer: I am Greg Young (if you can't read my name :))
I am going to answer this question even though I believe it will likely get deleted anyway. This question alone is a bit odd for me, but the answers are fairly bizarre. I won't take the time to answer each reply individually but will instead put all of my comments in this reply.
1) There is a comment that we only run on a custom version of Mono, which is a detail, but... this is not the case (and has not been for over a year). We were waiting for critical patches we made to Mono (for example, threadpool.c) to hit their master. This has happened.
2) EventStore is 3-clause BSD licensed. Not sure how you could claim we are not Open Source. We also have a company behind it and provide commercial support.
3) Someone mentioned us going on to version 3 in Sept. Version 1 was released 2 years ago. Version 2 added Clustering (obviously some breaking changes vs single node). Version 3 is adding a ton of stuff including ability to have competing consumers. Very little has changed in terms of the actual client protocol over this time (especially for those using the HTTP API).
What is really disturbing for me in the recommendations, however, is that they don't seem to understand what they are comparing. It would be roughly the equivalent of me asking "Which should I use, neo4j or leveldb?". You could build yourself a graph database on top of leveldb, but that would be quite a bit of work.
Mongo in this case would be a storage engine on the event store the OP would have to write him/herself. The writing of a production quality event store is a non-trivial exercise on top of a storage engine if you want to have even the most basic operations.
I wrote this in response to the mailing list equivalent of this question:
How will you do the following with Mongo?:
Write and read events to/from streams with ordering/optimistic concurrency/etc
Then:
Your projections don't want to read from streams in the same way they were written; projections are normally interested in event types and want all events of type T, regardless of the stream they were written to, in the proper order.
You probably also want, for instance, the ability to switch live from pushed event notifications to handling pulled information (e.g. polling), etc.
It would make more sense if Kafka, datomic, and Event Store were being compared.
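To illustrate just the first requirement above, here is a hedged, purely in-memory sketch in Java (deliberately not EventStore's or MongoDB's API) of the expected-version check a stream append needs for optimistic concurrency:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InMemoryEventStore {
  private final Map<String, List<Object>> streams = new HashMap<>();

  public synchronized void append(String stream, long expectedVersion, Object event) {
    List<Object> events = streams.computeIfAbsent(stream, s -> new ArrayList<>());
    long currentVersion = events.size() - 1; // -1 means "no events yet"
    if (currentVersion != expectedVersion) {
      // A concurrent writer got in first; the caller must re-read and retry.
      throw new IllegalStateException(
          "expected version " + expectedVersion + " but stream is at " + currentVersion);
    }
    events.add(event);
  }
}
```

Everything else on the list (type-ordered projections, switching live between push and pull) has to be layered on top of this, which is the non-trivial part.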
Seeing as the other replies don't talk about the tooling or benefits of EventStore and only refer to the benefits of MongoDB, I'll chime in. But note that my experience is limited.
I'll start with the cons...
There are a lot of check-ins, which can force you to decide which version you are going to actively support yourself. While the team has been solidifying their releases, the fact that they arrived at version 3 not even 18 months after release should be an indicator that you will have to move the version you are supporting up to a more recent one (which can also impact the platform you choose to deploy to).
It's not going to easily work on every platform (especially if you're trying to move to a cloud environment or a Docker-based LXC container). Some of this is due to the community surrounding other DBs such as Mongo. But the team seems to have been working their butts off on read/write performance while maintaining cross-platform stability. As time presses on, I've found that you don't want to deviate too far from a bare-metal OS implementation, which in this day and age is not attractive.
It uses a special version of Mono. Finding support for older versions of Mono only serves to make the process more of a root canal.
To make the most of the performance of EventStore, you really need to think about your architecture. EventStore outputs to flat files, and event data can grow pretty quickly. What's the failure rate of the disks you are persisting your data to? How are things compressed? Archived? etc. You have a lot of control, and the control is geared towards storing your data as events. However, while I'm sure Greg Young himself could quote me to my grave the features that optimize and save your disks in the long term, I'm more likely to find a mature Mongo community that has experience running into similar cases.
And the Pros...
RESTful - It's AtomPub. Is your stream not specific enough? Create another and do HTTP GETs to your heart's content. Concerned about routing? Do an HTTP forward. Concerned about security? Put an HTTP proxy in front. Simple! (A sketch of reading a stream over HTTP follows this list.)
You have a nice suite of tools and UI for testing out and building your projections as your events start to generate new data (e.g. use the Chrome browser as a way to debug your projections... yeah, they're written in JavaScript).
Read performance - Since the application outputs to a flat file, you can get kernel-level caching and expose the data via HTTP at the drop of a hat. Also, indexes run across your streams for querying projections against larger data sets (but I really get the feeling index performance will creep up on you over time).
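To make the AtomPub point concrete, here is a hedged sketch of reading a stream over the HTTP API. It assumes a local node on the default HTTP port 2113; the stream name is illustrative:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ReadStream {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://127.0.0.1:2113/streams/billing-events");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("Accept", "application/vnd.eventstore.atom+json");
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // Atom feed of the most recent events
      }
    }
  }
}
```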
I personally would not use this for a core / mission-critical / growing application! However, if you have a side case for keeping your evented environment interesting, then I'd give it a go! I personally have to stick with Mongo for now.

Why to use workflow system and what are my options? [closed]

I was reading the following post:
Why use Windows Workflow?
Now the wf concept looks like promising technology for lowering company costs on business process implementation. MS WF looks like it does not have everything needed for fast implementation. What are the other wf/BPM options for fast implementation?
I have been working with workflow engines/systems (OpenText, K2.net, Metastorm, MS Workflow Foundation, ...) for the past 10 years, and I can say that wf technology can be very useful; however, it's not suitable for solving all types of problems.
Basically, it's meant to solve process-oriented tasks. You might ask yourself what this means. Well, a process is any entity that has a start, a duration, and an end. If you look at the typical company, it is built up of processes. Obviously, storing some final reports in such a system would not be the goal... The power shows up when those tasks need to be processed in a controlled manner or a new process route is required. A classic implementation would require a developer to write additional code; a good wf system will let you implement a route change in a second without a line of code, and process versioning is not a problem. This is just one of the benefits.
You should look at a wf system as a platform for fast process development, monitoring, optimization, and versioning. It should give you all the tools needed for the BPM life cycle. Here you can find what I am talking about: http://en.wikipedia.org/wiki/Business_process_management
In my professional career I have developed one wf engine and one full wf system based on MS .NET technology. If you are interested in the details, please visit my web site:
http://www.gp-solutions.si/business/Product.aspx?s=pro&id=1&cat=2 With this system you can develop a new process, with all the forms, monitoring, security, documents, ... in less than 10 minutes. You cannot do this with the traditional way of development. Saving time and money is the name of the game here.
If you're looking for a commercial alternative for fast BPM implementation, I have worked with two .NET-based platforms in the past: K2.net and PNMSoft.
I personally like PNMSoft (http://www.pnmsoft.com/) since it is native .NET, it supports WF and other technologies, and it is extremely fast and easy to use.
If you're looking for open-source alternatives, there are some, like Bonita (http://sourceforge.net/projects/bonita/), but don't expect them to be as quick and easy...
Nowadays there are several open source BPMSs available under convenient license models.
For instance, the Eclipse Process Manager "Stardust" (http://www.eclipse.org/stardust/) is a comprehensive and mature Java open source BPMS. Its commercial version is used in several products for different industries, also in combination with .NET. Key features:
browser-based or Eclipse-based process modeler
process engine in Spring or EJB mode, e.g. on Tomcat
Web Service and Java APIs (SOAP, REST)
OOTB portal for workflow execution, business control center, and administration
user interface mashup feature to include arbitrary UI technologies in workflow steps
embedded DMS
strong OOTB system integration capabilities (JMS, WS, Camel, ...)
Amazon Web Services Stardust image available (http://wiki.eclipse.org/Stardust/Knowledge_Base/Getting_Started/RTE_on_AWS)
commercial support and SaaS on-demand offering available
suitable e.g. for human-centric workflow, ETL, low-latency and high-volume message processing, document-centric workflow and document management, ...
Best regards
Rob