I am taking a deep look inside Flink to see how I can use it on a project and had a question for the creators / high level thinkers... why does Flink use Yarn as the default resource manager?
Was Kubernetes considered? Or is it one of those things where we started on Yarn, it works pretty well...
I have come across many projects and articles that allow Kubernetes and Yarn to work together in cluding the Myraid project that allows yarn to be deployed with Mesos (but I am on Kubernetes...)
I have a very large compute cluster 2000 or so nodes that I use and I want to use the super cool CEP features of Flink feeding off a Kafka infrastructure (also deployed on to this kubernetes environment).
I am looking to understand the reasons behind using Yarn as the resource manager underneath Flink and if would be possible (with some effort and contribution to the project) to make Kubernetes an option alongside Yarn.
Please note - I am new to Yarn - just reading up about it. Also new to Flink and learning about the deployment and scale-out architecture.
Flink is not tied to YARN. It can also run on Apache Mesos and there are also users running it on Kubernetes. In the current version (Flink 1.4.1), there are a few things to consider when running Flink in Kubernetes (see this talk by Patrick Lucas).
The Flink community is also currently working on improving Flink's support for container setups. The effort is called FLIP-6 and will be included in the next release (Flink 1.5.0).
Related
Looking for UI tools to connect to Strimzi kafka cluster to get visibility to kafka topics, read messages within topics, broker and partition details and ability to connect with or without SSL/SASL connections.I have already tried using kafka tool and facing issues with it hence looking for an alternative). Kindly suggest some UI tools for the same ? ( similar to confluent center/kafka tool) which are either free or with minimal cost.
There are no UI tools for now but a new issue was started about it. I would follow it https://github.com/strimzi/strimzi-kafka-operator/issues/3287
There is a project in early stage of developement: https://github.com/strimzi/strimzi-ui
Strimzi UI provides a way for managing Strimzi and Kafka clusters (+
other components) deployed by it using a graphical user interface.
But unforunately in the moment of writing:
This UI is currently not in a state where it can be used. Is it still
early on in it's development but we hope to have something usable very
soon!
So, keep an eye on it.
I already saw a similar question in SO, but not clearly answer my doubts.
We have different Kafka clusters and lot of exploitation operational habits around it. We have our way to start/stop the cluster, lots of exploit scripts that help maintain the cluster etc..
Now we would like to use Kafka connect connectors for new needs, but from what I saw, Kafka connect is extremely coupled to confluent-hub.
It's like I can't even use the connectors without having to install a full operational confluent-hub.
This makes it very difficult for us to use Kafka connect connectors, I understand that confluent-hub might be a framework that help running those connectors, but it's like we can't even use a dissociated Kafka cluster ( a one not exploited by confluent-hub..).
But maybe I miss something..
Do you know if there is any way to use properly Kafka connectors on a already existing Kafka cluster ( completely independent from confluent-hub) ?
EDITED :
It's more a question regarding the high coupled behaviour between confluent-hub and Kafka-connect. All the features that comes with Kafka connect ( distributed workers to handle different fail over scenarios, etc..) are not usable without confluent-hub, thus a "need" to have Kafka cluster running exclusively via confluent-hub, which is not an easy task when you already have an existing big Kafka cluster with lots of OPS habits on it.
Kafka Connect is part of Apache Kafka. It's a pluggable framework for streaming integration between systems in and out of Kafka.
To use Kafka Connect you need connectors for the specific technology with which you want to integrate. For example, S3 sink, Elasticsearch sink, JDBC source or sink, and so on.
The connector API is part of Apache Kafka, and available for anyone who wants to develop a connector.
Connectors are written by various people and organisations, and available in various different ways. How you obtain a connector depends on which connector you want, how its licensed, and how the author has made it available for distribution. It could be you go to github, clone the repo and build the JAR. It could be you can download the JAR directly.
All that Confluent Hub does is make lots of these connectors available for you in one place, easily searchable, and with an optional CLI tool that will install them for you.
Do you have to use Confluent Hub? No, not at all. Might it make your life easier in locating connectors that you want to use, and make it easier to install them? Hopefully :)
Disclaimer: I work for Confluent.
Currently, at my company we are migrating from Kafka 0.8 to 0.11, brokers migration steps and clearly stated in kafka documentation here
What I am stuck in is, upgrading the kafka clients (producers, consumers, spark-streaming), I don't find any documentation/ articles listing out clearly what are the required changes or steps to follow to upgarde the client, all what I found is the java doc Producer Client
What I did so far is to change the kafka client version in my gradle to kafka-clients-0.11.0.0, and everything from the compilation point of view went fine with no code changes at all.
What I seek help with is, is there any expected problems I should take care of, any pointers for client changes other than the kafka-client version?
I went through lots of experiments to get this done.
For the consumers and producers, I just used the kafka consumers and producers 0.11.0.
The trick part was replacing spark-streaming, spark-streaming latest version only support upto kafka 0.10.X, which doesn't contains any updates related to the new broker.
What I recommend here, if you are about to write an application from scratch and your main goal is realtime streaming go for kafka-streaming API, it is just AWESOME!, if you already have spark streaming app (which was my case), you should either judge which is more important than the other wether to get stuck with the kafka-broker version 10.X and spark-streaming which was [experimental][1] btw.
The benefits of having the streaming inside kafka not spark the following:
Kafka streaming is a normal jar that can be injected in any java application, so you don't care that much about deployment, and environment
Auto-scaling is so easy when using kafka-streaming using any scaleset provided by any cloud service provider, unlike scaling a HDP cluster.
Monitoring using something like prometheus would be much easier.
Production system : HDP-2.5.0.0 using Ambari 2.4.0.1
Aplenty demands coming in for executing a range of code(Java MR etc., Scala, Spark, R) atop the HDP but from a desktop Windows machine IDE.
For Spark and R, we have R-Studio set-up.
The challenge lies with Java, Scala and so on, also, people use a range of IDEs from Eclipse to IntelliJ Idea.
I am aware that the Eclipse Hadoop plugin is NOT actively maintained and also has aplenty bugs when working with latest versions of Hadoop, IntelliJ Idea I couldn't find reliable inputs from the official website.
I believe the Hive and HBase client API is a reliable way to connect from Eclipse etc. but I am skeptical about executing MR or other custom Java/Scala code.
I referred several threads like this and this, however, I still have the question that is any IDE like Eclipse/Intellij Idea having an official support for Hadoop ? Even the Spring Data for Hadoop seems to lost traction, it anyways didn't work as expected 2 years ago ;)
As a realistic alternative, which tool/plugin/library should be used to test the MR and other Java/Scala code 'locally' i.e on the desktop machine using a standalone version of the cluster ?
Note : I do not wish to work against/in the sandbox, its about connecting to the prod. cluster directly.
I don't think that there is a genereal solution which would work for all Hadoop services equally. Each solution has it's own development, testing and deployment scenarios as they are different standalone products. For MR case you can use MRUnit to simulate your work locally from IDE. Another option is LocalJobRunner. They both allow you to check your MR logic directly from IDE. For Storm you can use backtype.storm.Testing library to simalate topology's workflow. But they all are used from IDE without direct cluster communications like in case wuth Spark and RStudio integration.
As for the MR recommendation your job should ideally pass the following lifecycle - writing the job and testing it locally, using MRUnit, then you should run it on some development cluster with some test data (see MiniCluster as an option) and then running in on real cluster with some custom counters which would help you to locate your malformed data and to properly maintaine the job.
I'm playing with Kafka-Connect. I've got the HDFS connector working both in stand-alone mode and distributed mode.
They advertise that the workers (which are responsible for running the connectors) can be managed via YARN However, I haven't seen any documentation that describes how to achieve this goal.
How do I go about getting YARN to execute workers? If there is no specific approach, are there generic how-to's as to how to get an application to run within YARN?
I've used YARN with SPARK using spark-submit however, I cannot figure out how to get the connector to run in YARN.
You can theoretically run anything on YARN, even a simple hello world program. Which is why saying Kafka-Connect runs on YARN is technically correct. The caveat is that getting Kafka-Connect to run on YARN will take a fair amount of elbow grease at the moment. There are two ways to do it:
Directly talk to the YARN API to acquire a container, deploy the Kafka-Connect binaries and launch Kafka-Connect.
Use the separate Slider project https://slider.incubator.apache.org/docs/getting_started.html that Stephane has already mentioned in the comments.
Slider
You'll have to read quite a bit of documentation to get it working but the idea behind Slider is that you can get any program to run on YARN without dealing with the YARN API and writing a YARN app master by doing the following:
Create a slider package out of your program
Define a configuration for you package
Use the slider cli to deploy your application onto YARN
Slider handles container deployment and recovery of failed containers for you, which is nice. Also Slider is becoming a native part of YARN when YARN 3.0 is released.
Alternatives
Also as a side note, getting Kafka-Connect to deploy on Kubernetes or Mesos / Marathon is probably going to be easier. The basic workflow to do that would be:
Create a Kafka-Connect docker container or just use confluent's docker container
Create a deployment config for Kubernetes or Marathon
Click a button / run a command
Tutorials
A good Mesos / Marathon tutorial can be found here
Kubernetes tutorial here
Confluent Kubernetes Helm Charts here