I am working with Apache Beam. My task is to pull data from a Kafka topic and process it in Dataflow.
Does Dataflow support KafkaIO?
Which runners are supported for KafkaIO?
Yes. KafkaIO is supported by Dataflow and the other major Beam runners.
There are Kafka transforms for both the Java and Python SDKs.
Note that the Python transforms are cross-language transforms, and Dataflow requires Runner v2 to run them.
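As a sketch of what this looks like in the Python SDK: ReadFromKafka is the cross-language transform, and Runner v2 is requested via a pipeline experiment. The broker address and topic name below are placeholders, not real endpoints; actually running this requires apache_beam and a reachable Kafka cluster, so the Beam imports are deferred into the build function.

```python
# Plain-data configuration passed to the Kafka consumer.
consumer_config = {
    "bootstrap.servers": "broker-1:9092",  # placeholder broker address
    "auto.offset.reset": "earliest",
}

# Cross-language transforms on Dataflow need Runner v2.
pipeline_args = [
    "--runner=DataflowRunner",
    "--experiments=use_runner_v2",
]

def build_pipeline():
    # Beam imports are deferred so this sketch stays loadable without Beam.
    import apache_beam as beam
    from apache_beam.io.kafka import ReadFromKafka
    from apache_beam.options.pipeline_options import PipelineOptions

    pipeline = beam.Pipeline(options=PipelineOptions(pipeline_args))
    (
        pipeline
        | "ReadKafka" >> ReadFromKafka(
            consumer_config=consumer_config,
            topics=["my-topic"],  # placeholder topic name
        )
        | "Process" >> beam.Map(print)  # replace with real processing
    )
    return pipeline
```

Submitting `build_pipeline().run()` with the usual Dataflow options (project, region, temp location) would launch the job; those flags are omitted here.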
Related
Does Apache Beam for Python support the Flink runner right now, or even the portable runner? And is Beam for Java supported on the Flink runner commercially?
Yes, both Python and Java are supported by the Apache Flink runner.
It is important to understand that the Flink Runner comes in two flavors:
A legacy Runner which supports only Java (and other JVM-based languages)
A portable Runner which supports Java/Python/Go
Ref: https://beam.apache.org/documentation/runners/flink/
I think you would have to define what "commercially supported" means:
Do you want to run the Flink Runner as a service, similarly to what Google Cloud Dataflow provides? If so, the closest to this is Amazon's Kinesis Data Analytics, but it is really just a managed Flink cluster.
Many companies use the Flink Runner and contribute back to the Beam project, e.g. Lyft, Yelp, Alibaba, Ververica. This could be seen as a form of commercial support. There are also various consultancies, e.g. BigDataInstitute, Ververica, which could help you manage your Flink Runner applications.
I have a Kafka broker upgraded from 0.8 to 0.11, and now I am trying to upgrade the Spark Streaming job code to be compatible with the new Kafka (I am using Spark 1.6.2).
I searched a lot for steps to follow for this upgrade but didn't find any article, official or unofficial.
The only article I found useful is this one; however, it covers Spark 2.2 and Kafka 0.10, and it contains a line saying:
However, because the newer integration uses the new Kafka consumer API instead of the simple API, there are notable differences in usage. This version of the integration is marked as experimental, so the API is potentially subject to change
Has anyone tried to integrate Spark Streaming 1.6 with Kafka 0.11, or is it better to upgrade Spark to 2.x first, given the lack of information about and support for this mix of Spark Streaming and Kafka versions?
After a lot of investigation, I found no way to make this move, as Spark Streaming only supports Kafka versions up to 0.10 (which has major differences from Kafka 0.11 and 1.0.x).
That's why I decided to move from Spark Streaming to the new Kafka Streams API, and it was simply awesome: simple to use, very flexible, and the big advantage is that IT IS A LIBRARY you can simply add to your project, not a framework that wraps your code.
The Kafka Streams API supports almost all the functionality provided by Spark (aggregation, windowing, filtering, map/reduce).
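As a framework-neutral illustration of one item on that list: tumbling (fixed, non-overlapping) windowing boils down to aligning each record's timestamp to a window start and aggregating per (window, key). The sketch below is plain Python to show the idea, not the Kafka Streams API itself.

```python
from collections import defaultdict

def tumbling_window_counts(records, window_ms):
    """Count records per (window start, key) using fixed-size windows.

    records: iterable of (timestamp_ms, key) pairs.
    """
    counts = defaultdict(int)
    for ts, key in records:
        window_start = (ts // window_ms) * window_ms  # align to window boundary
        counts[(window_start, key)] += 1
    return dict(counts)

# Three "clicks" events fall in the first 10 s window, one in the second.
events = [(1_000, "clicks"), (4_500, "clicks"), (9_999, "clicks"), (12_000, "clicks")]
print(tumbling_window_counts(events, 10_000))
# → {(0, 'clicks'): 3, (10000, 'clicks'): 1}
```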
Referring to the Python programming guide online:
https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/batch/python.html
I didn't see any material related to streaming programming, such as a Kafka connector and so on.
Also, in the GitHub examples (https://github.com/apache/flink/tree/master/flink-examples/flink-examples-streaming), I didn't see any Python code either.
If Flink does support Python for streaming programming, could you show me some examples to start with?
Thanks!
Maybe Flink 1.4 will support Python for streaming; you can see the recent pull request for FLINK-5886 - Python API for streaming applications.
No, Flink 1.2 does not support Python for streaming.
Flink 1.3 doesn't support it either.
Is there any open-source tool to monitor Confluent Kafka? Most of the open-source tools available are specific to Apache Kafka, not Confluent Kafka.
We want to monitor at least the connectors, streams, and cluster health.
The Kafka that is distributed in the Confluent Platform is Apache Kafka. There really is no such thing as "Confluent Kafka". Any tools that work with the latest version of Apache Kafka (including Kafka Connect and Kafka Streams) will work with the same versions of Kafka included with Confluent Open Source.
Confluent 3.3 includes Apache Kafka 0.11
Confluent 3.2 includes Apache Kafka 0.10.2
Confluent 3.1 includes Apache Kafka 0.10.1
Confluent 3.0 includes Apache Kafka 0.10.0
Confluent 2.0 includes Apache Kafka 0.9
Confluent 1.0 includes Apache Kafka 0.8.2
Note: Confluent Enterprise includes its own monitoring and management GUI called Control Center. Control Center is a separate process, so the Apache Kafka underneath is still the same as the open-source version.
You can use the updated version of KafkaOffsetMonitor. It supports SSL/TLS and Kerberos, and it uses the Kafka 1.1.0 library.
You should be able to use kafka-monitor for monitoring your cluster's health, as well as Burrow and KafkaOffsetMonitor for monitoring your consumer applications' lag. You should also definitely use something like jmxtrans for collecting your Kafka broker metrics.
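For context, the lag figure that Burrow and KafkaOffsetMonitor report is just the log-end offset minus the consumer group's committed offset, per partition. A minimal sketch of that computation (the offsets below are made-up numbers, not fetched from a real cluster):

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition consumer lag: log-end offset minus committed offset."""
    return {
        partition: end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in end_offsets
    }

# Hypothetical offsets for one topic's partitions 0-2.
end_offsets = {0: 1_500, 1: 900, 2: 2_000}
committed = {0: 1_480, 1: 900, 2: 1_750}
print(consumer_lag(end_offsets, committed))
# → {0: 20, 1: 0, 2: 250}
```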
I'm trying to install Confluent over HDP for Kafka Streams, which I think may not be possible. Could you suggest what I should do?
It seems like you are trying to install Confluent Platform using Ambari. If that's the case, then you want to use a custom service install, or you will need to wait for HDP to support Kafka 0.10, which includes Kafka Streams. The alternative route is to install Confluent Platform in parallel with HDP and simply not activate the Kafka version that ships with HDP. This will require that you monitor and manage the Confluent Platform independently.
for Kafka Streams
Kafka Streams is a client library that you can use with any Kafka cluster at version 0.10 or above. You do not specifically require the Confluent Platform.
That being said, if the version of Kafka that HDP/Cloudera provides is not useful for you, then you should provision external infrastructure for it.