Does Apache Beam for Python support the Flink runner commercially? - apache-beam

Does Apache Beam for Python support the Flink runner right now? Or even the portable runner? And is Beam for Java supported by the Flink runner commercially?

Yes, both Python and Java are supported by the Apache Flink runner.
It is important to understand that the Flink Runner comes in two flavors:
A legacy Runner which supports only Java (and other JVM-based languages)
A portable Runner which supports Java/Python/Go
Ref: https://beam.apache.org/documentation/runners/flink/
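
For a sense of what this looks like in practice, here is a minimal sketch of a Beam Java pipeline configured for the Flink Runner. The Flink master address and file paths are placeholders, and it assumes the beam-runners-flink artifact for your Beam version is on the classpath.

```java
// Minimal Beam Java pipeline targeting the Flink Runner.
import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class FlinkRunnerExample {
  public static void main(String[] args) {
    FlinkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
    options.setRunner(FlinkRunner.class);
    options.setFlinkMaster("localhost:8081"); // placeholder: your Flink master address

    Pipeline p = Pipeline.create(options);
    p.apply("ReadLines", TextIO.read().from("/tmp/input.txt"))  // placeholder input
     .apply("WriteLines", TextIO.write().to("/tmp/output"));    // placeholder output
    p.run().waitUntilFinish();
  }
}
```

The portable runner follows the same model from Python or Go, except the pipeline is submitted to a Flink job server rather than run via the JVM-native runner.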

I think you would have to define what "commercially supported" means:
Do you want to run the Flink Runner as a service, similarly to what Google Cloud Dataflow provides? If so, the closest to this is Amazon's Kinesis Data Analytics, but it is really just a managed Flink cluster.
Many companies use the Flink Runner and contribute back to the Beam project, e.g. Lyft, Yelp, Alibaba, Ververica. This could be seen as a form of commercial support. There are also various consultancies, e.g. BigDataInstitute, Ververica, which could help you manage your Flink Runner applications.

Related

Which Runners support kafkaIO in apache beam?

I am working with Apache Beam. My task is to pull data from a Kafka topic and process it in Dataflow.
Does Dataflow support KafkaIO?
Which runners are supported for KafkaIO?
Yes. KafkaIO is supported by Dataflow and other major Beam runners.
We have Kafka transforms for both Java and Python SDKs.
Note that the Python versions are cross-language transforms, and Dataflow requires Runner v2 to use them.
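
For illustration, here is a minimal Java sketch of the KafkaIO read transform; the bootstrap servers and topic name are placeholders.

```java
// Minimal KafkaIO read: consumes (key, value) string pairs from one topic.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Values;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaReadExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadFromKafka",
            KafkaIO.<String, String>read()
                .withBootstrapServers("broker-1:9092") // placeholder broker list
                .withTopic("my-topic")                 // placeholder topic
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata())                    // KafkaRecord -> KV<String, String>
     .apply("TakeValues", Values.create());            // keep just the message payloads

    p.run().waitUntilFinish();
  }
}
```

The same pipeline runs on Dataflow, Flink, or another major runner by switching the --runner pipeline option.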

Why does Flink use YARN?

I am taking a deep look inside Flink to see how I can use it on a project and had a question for the creators / high-level thinkers... why does Flink use YARN as the default resource manager?
Was Kubernetes considered? Or is it one of those things where we started on YARN, it works pretty well...
I have come across many projects and articles that allow Kubernetes and YARN to work together, including the Myriad project that allows YARN to be deployed on Mesos (but I am on Kubernetes...).
I have a very large compute cluster (2,000 or so nodes) that I use, and I want to use the super cool CEP features of Flink feeding off a Kafka infrastructure (also deployed onto this Kubernetes environment).
I am looking to understand the reasons behind using YARN as the resource manager underneath Flink and whether it would be possible (with some effort and contribution to the project) to make Kubernetes an option alongside YARN.
Please note: I am new to YARN, just reading up on it. I'm also new to Flink and learning about the deployment and scale-out architecture.
Flink is not tied to YARN. It can also run on Apache Mesos, and some users run it on Kubernetes. In the current version (Flink 1.4.1), there are a few things to consider when running Flink on Kubernetes (see this talk by Patrick Lucas).
The Flink community is also currently working on improving Flink's support for container setups. The effort is called FLIP-6 and will be included in the next release (Flink 1.5.0).

Does Flink 1.2 support python in streaming programming?

Referring to the Python programming guide online:
https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/batch/python.html
I didn't see any material related to streaming programming, like a Kafka connector and so on.
Also, from the GitHub examples (https://github.com/apache/flink/tree/master/flink-examples/flink-examples-streaming), I didn't see any Python code either.
If Flink does support Python for streaming programming, could you show me some examples to start with?
Thanks!
Flink 1.4 may support Python for streaming; see the recent PR for FLINK-5886 (Python API for streaming applications).
No, Flink 1.2 does not support Python for streaming.
Flink 1.3 doesn't support it either.

Can Eclipse/IntelliJ IDEA be used to execute code on the cluster?

Production system: HDP-2.5.0.0 using Ambari 2.4.0.1
Plenty of demands are coming in for executing a range of code (Java MR, Scala, Spark, R, etc.) atop HDP, but from an IDE on a desktop Windows machine.
For Spark and R, we have RStudio set up.
The challenge lies with Java, Scala, and so on; also, people use a range of IDEs from Eclipse to IntelliJ IDEA.
I am aware that the Eclipse Hadoop plugin is NOT actively maintained and also has plenty of bugs when working with the latest versions of Hadoop; for IntelliJ IDEA, I couldn't find reliable information on the official website.
I believe the Hive and HBase client APIs are a reliable way to connect from Eclipse etc., but I am skeptical about executing MR or other custom Java/Scala code.
I referred to several threads like this and this; however, I still have the question: does any IDE like Eclipse/IntelliJ IDEA have official support for Hadoop? Even Spring Data for Hadoop seems to have lost traction; it didn't work as expected two years ago anyway ;)
As a realistic alternative, which tool/plugin/library should be used to test MR and other Java/Scala code 'locally', i.e. on the desktop machine, using a standalone version of the cluster?
Note: I do not wish to work against/in the sandbox; it's about connecting to the production cluster directly.
I don't think there is a general solution that would work for all Hadoop services equally. Each has its own development, testing, and deployment scenarios, as they are different standalone products. For the MR case you can use MRUnit to simulate your work locally from the IDE. Another option is LocalJobRunner. They both allow you to check your MR logic directly from the IDE. For Storm you can use the backtype.storm.Testing library to simulate a topology's workflow. But these are all used from the IDE without direct cluster communication, unlike the Spark and RStudio integration.
As for the MR recommendation, your job should ideally pass through the following lifecycle: write the job and test it locally using MRUnit, then run it on a development cluster with some test data (see MiniCluster as an option), and then run it on the real cluster with some custom counters that will help you locate malformed data and properly maintain the job.
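
To make the local-testing path concrete, here is a minimal MRUnit sketch of a mapper test. It assumes a hypothetical WordCountMapper that emits (word, 1) for each token; the class and test names are illustrative, not from the original post.

```java
// Minimal MRUnit mapper test, runnable entirely in-memory from the IDE.
// WordCountMapper is a hypothetical Mapper<LongWritable, Text, Text, IntWritable>.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {
  private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

  @Before
  public void setUp() {
    // Wrap the mapper under test in an MRUnit driver; no cluster is involved.
    mapDriver = MapDriver.newMapDriver(new WordCountMapper());
  }

  @Test
  public void emitsOneCountPerToken() throws Exception {
    mapDriver
        .withInput(new LongWritable(0), new Text("hello hello"))
        .withOutput(new Text("hello"), new IntWritable(1))
        .withOutput(new Text("hello"), new IntWritable(1))
        .runTest(); // fails the test if actual output differs from expected
  }
}
```

From there, the same job can graduate to a MiniCluster run with test data before touching the production cluster.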

Apache Kafka and supported platforms

Basic question, which platforms and languages does Apache Kafka currently support?
Kafka is written in Scala, which means it runs on the JVM, so you can effectively run it on any OS that supports the JVM. However, the brokers extract a huge performance boost by using the OS's kernel buffer cache; I'm not sure how well this works on a non-Unix system like Windows. The Kafka source code base provides first-class support for Scala and Java clients. You can also find producer and consumer clients in languages like PHP, C++, Python, etc. under the contrib directory.
Apache Kafka runs well and is most stable and performant on Linux (bare-metal Linux, Linux VMs in private or public clouds, or Linux-based Docker containers). Kafka has been known to run on Windows, but most vendors that commercially support Kafka do not extend their support to Windows for production servers, so it is "community supported" by the Kafka community. Kafka also runs quite well on macOS for development.
The Apache Kafka distribution includes support for Java and Scala clients only, but the larger Kafka community has created a long list of clients for other languages. A good list of the available client options is on the Apache Kafka wiki here: https://cwiki.apache.org/confluence/display/KAFKA/Clients
You will find that for some languages (like C#/.NET, Python, or Go) there are two, three, or even more client libraries to choose from. Some are up to date with the newest Kafka wire protocol changes, such as exactly-once semantics and message headers (added in Apache Kafka 0.11), timestamps (added in 0.10), or the security enhancements and new consumer API (added in 0.9); others are not. Some have the full set of functions/methods provided in Java (like seek(), consumer group management, or interceptors), but others do not. Some are written purely in the target language, and others are wrappers around the librdkafka C/C++ library. Some are commercially supported by a vendor and others are not, so choose based on your needs in terms of functionality, stability, execution environment, and supportability.
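
As a concrete starting point, here is a minimal sketch using the Java consumer API that ships with Apache Kafka. The broker address, group id, and topic are placeholders, and poll(Duration) requires kafka-clients 2.0 or newer.

```java
// Minimal Kafka consumer loop using the Java client's consumer API.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker-1:9092"); // placeholder broker list
    props.put("group.id", "example-group");          // placeholder consumer group
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic
      while (true) {
        // Fetch whatever records arrived since the last poll.
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
          System.out.printf("offset=%d key=%s value=%s%n",
              record.offset(), record.key(), record.value());
        }
      }
    }
  }
}
```

Whichever client library you evaluate, checking that it supports this consumer-group model is a good first compatibility test.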