How does ZooKeeper authorization work? - apache-zookeeper

I am new to ZooKeeper and am wondering: I understand ZooKeeper can be used as configuration storage, but what if one ZooKeeper client should not have access to certain configurations? How do I restrict that access?
Scenario: I want to use it as a configuration service from which my application retrieves its configuration, database endpoint lists, etc. Can I do that with ZooKeeper? If I can, how do I restrict access so that one application doesn't read another application's configuration?

ZooKeeper is a distributed coordination service for managing a large set of hosts. Coordinating and managing a service in a distributed environment is a complicated process; ZooKeeper solves this with its simple architecture and API, allowing developers to focus on core application logic without worrying about the distributed nature of the application.
The ZooKeeper framework was originally built at Yahoo! for accessing their applications in an easy and robust manner. Later, Apache ZooKeeper became a standard coordination service used by Hadoop, HBase, and other distributed frameworks. For example, Apache HBase uses ZooKeeper to track the status of distributed data.
It's not a key-value store.
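As for the restriction part of the question: ZooKeeper supports per-znode ACLs, so each application's configuration subtree can be protected with its own credentials, for example via the digest scheme. Here is a minimal Java sketch, assuming a ZooKeeper server at localhost:2181 and hypothetical app1:secret credentials:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.ACL;
import org.apache.zookeeper.data.Id;
import org.apache.zookeeper.server.auth.DigestAuthenticationProvider;
import java.util.Collections;

public class AclExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical server address and credentials; adjust for your setup.
        // (A production client would wait for the connection event before proceeding.)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> { });
        zk.addAuthInfo("digest", "app1:secret".getBytes());

        // ACL granting all permissions only to the app1 digest identity.
        String digest = DigestAuthenticationProvider.generateDigest("app1:secret");
        ACL onlyApp1 = new ACL(ZooDefs.Perms.ALL, new Id("digest", digest));

        // Any client that has not called addAuthInfo with app1's credentials
        // gets a NoAuthException when it tries to read or write this znode.
        zk.create("/app1-config", "db=jdbc:postgresql://db1:5432".getBytes(),
                Collections.singletonList(onlyApp1), CreateMode.PERSISTENT);
        zk.close();
    }
}
```

Other ACL schemes exist as well (world, ip, sasl), and the same ACLs can be set interactively from zkCli with the setAcl command.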


Do you need multiple ZooKeeper instances to run a multiple-broker Kafka?

I'm new to Kafka.
Kafka is supposed to be used as a distributed service, but the tutorials and blog posts I found online never mention whether there should be one or several ZooKeeper nodes.
The tutorials just spin up one ZooKeeper instance and then multiple Kafka brokers.
Is that how it is supposed to be done?
ZooKeeper is a centralized coordination service for distributed systems, used by clusters to maintain their distributed state. It achieves distributed synchronization via metadata such as configuration information, naming, and so on.
In typical architectures, a Kafka cluster is served by 3 ZooKeeper nodes. If the deployment is huge, this can be ramped up to 5 ZooKeeper nodes, but that in turn adds load, because all metadata-related activity is handled by ZooKeeper and every node must stay in sync.
Also note that, as an improvement, newer Kafka releases reduce the dependency on ZooKeeper in order to make metadata more scalable, reduce the complexity of maintaining metadata in an external component, and speed up recovery from unexpected shutdowns; with the new approach, controller failover is almost instantaneous. This is the Kafka Raft Metadata mode, termed 'KRaft', which runs Kafka without ZooKeeper by merging all the responsibilities handled by ZooKeeper into a service inside the Kafka cluster itself, operating on the event-based mechanism used by the KRaft protocol.
Tutorials generally keep things nice and simple, so one ZooKeeper (often one Kafka broker too). Useful for getting started; useless for any kind of resilience :)
In practice, you are going to need three ZooKeeper nodes minimum.
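The reason three is the practical minimum is majority quorum: an ensemble of 3 tolerates the loss of 1 node, and an ensemble of 5 tolerates the loss of 2. Clients (Kafka brokers included) list the whole ensemble in their connection string so they can fail over between servers. A small illustrative Java sketch, with hypothetical hostnames zk1/zk2/zk3:

```java
import org.apache.zookeeper.ZooKeeper;

public class EnsembleConnect {
    public static void main(String[] args) throws Exception {
        // Hypothetical three-node ensemble: the client picks one host and
        // transparently fails over to another if its server goes down.
        // The ensemble stays writable as long as a majority (2 of 3) is up.
        String ensemble = "zk1:2181,zk2:2181,zk3:2181";
        ZooKeeper zk = new ZooKeeper(ensemble, 15000, event -> { });
        System.out.println("Session state: " + zk.getState());
        zk.close();
    }
}
```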
If it helps, here is an enterprise reference architecture whitepaper for the deployment of Apache Kafka
Disclaimer: I work for Confluent, who publish the above whitepaper.

Alternative of Confluent REST Proxy

We have some applications which want to communicate with Kafka using REST API calls, both to consume and to produce messages. If we do not want to use the Confluent REST Proxy, what are the options?
One possible alternative is the Strimzi Kafka Bridge (https://github.com/strimzi/strimzi-kafka-bridge).
It's part of the broader Strimzi project for running Kafka on Kubernetes, but it also works standalone (when your Kafka cluster is on bare metal).
Of course it's open source and Apache 2.0 licensed.
the reason [not to use it] is monetary
You can use the Confluent REST Proxy with no software/licensing costs.
We are thinking of not buying any additional hardware for this new request and using our existing configuration to meet the requirement. I am mostly interested to know if consumers/producers can be created to meet this requirement.
You don't need extra hardware.
Pick an existing server with at least 2GB of available memory, run kafka-rest-start, and see how well it works.
if we can create REST API calls which will be used by other applications to consume data from Kafka and push data to Kafka
That's the main purpose of REST Proxy, yes.
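To make the produce side concrete: REST Proxy exposes plain HTTP endpoints, so any application that can issue HTTP requests can publish to a topic. A minimal Java 11+ sketch against the v2 JSON API, assuming a REST Proxy listening on localhost:8082 and a hypothetical topic named orders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestProxyProduce {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // One JSON-encoded record; the v2 API wraps records in a "records" array.
        String body = "{\"records\":[{\"value\":{\"id\":1,\"item\":\"book\"}}]}";
        HttpRequest request = HttpRequest.newBuilder()
                // Hypothetical proxy address and topic name.
                .uri(URI.create("http://localhost:8082/topics/orders"))
                .header("Content-Type", "application/vnd.kafka.json.v2+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // The proxy replies with the partition and offset of each written record.
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Consuming over REST works too, but it is stateful: you first create a consumer instance under /consumers/{group}, subscribe it to topics, and then repeatedly GET its records endpoint.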

KSQL Server Elastic Scaling in Kubernetes

In the context of Kubernetes or otherwise, does it make sense to have one KSQL Server per application? Reading the capacity planning guide for KSQL Server, it seems the basic settings assume running multiple queries on one server.
However, I feel that to get better control over scaling up and down with Kubernetes, it would make more sense to fix the number of threads per query and launch a server configured in Kubernetes with, say, 1 CPU, where only one application would run. However, I am not sure how heavy a KSQL Server is, and whether that actually makes sense or not.
Any recommendations?
First of all, what you have described is clearly doable. You can run KSQL Server with Docker, so you could have a container orchestrator such as Kubernetes or Swarm maintaining and scheduling those KSQL Server instances.
So you know how this would play out:
- Each KSQL instance will join a group of other KSQL instances with the same KSQL_SERVICE_ID that use the same Kafka cluster defined by KSQL_KSQL_STREAMS_BOOTSTRAP_SERVERS.
- You can create several KSQL Server clusters, e.g. for different applications, simply by using different KSQL_SERVICE_IDs with the same Kafka cluster.
As a result, you now have:
- Multiple containerized KSQL Server instances managed by a container orchestrator such as Kubernetes.
- All of the KSQL instances connected to the same Kafka cluster (you can also have different Kafka clusters for different KSQL_SERVICE_IDs).
- KSQL Server instances grouped into different applications (different KSQL_SERVICE_IDs) to achieve separation of concerns, so that scalability, security, and availability can be better maintained.
Regarding the coexistence of several KSQL Server instances (maybe with different KSQL_SERVICE_IDs) on the same server: the available machine resources can be monopolized by a greedy instance, causing problems for the less greedy ones. With Kubernetes you could set resource limits on your Pods to avoid this, but a greedy instance will then be throttled and slowed down.
Confluent's advice regarding multi-tenancy:
We recommend against using KSQL in a multi-tenant fashion. For example, if you have two KSQL applications running on the same node, and one is greedy, you're likely to encounter resource issues related to multi-tenancy. We recommend using a single pool of KSQL Server instances per use case. You should deploy separate applications onto separate KSQL nodes, because it becomes easier to reason about scaling and resource utilization. Also, deploying per use case makes it easier to reason about failovers and replication.
A possible drawback is the overhead you'll have if you run multiple KSQL Server instances (each with a Java application footprint) in the same pool while there is no work for them to do (i.e. no schedulable tasks, due to a lack of partitions on your topic(s)) or simply because the workload is very small. You might be able to do the same job with fewer instances, avoiding idle or nearly idle ones.
Of course, stuffing all stream processing, maybe for completely different use cases or projects, onto a single KSQL Server or pool of KSQL Servers may bring its own internal concurrency issues, development-cycle complexities, management overhead, etc.
I guess something in the middle will work fine: use a pool of KSQL Server instances for a single project or use case, which in turn might translate to a pipeline consisting of a topology of several sources, processors, and sinks, implemented by a number of KSQL queries.
Also, don't forget about the scaling mechanisms of Kafka, Kafka Streams, and KSQL (built on top of Kafka Streams) discussed in the previous question you posted.
All of these mechanisms are covered here:
https://docs.confluent.io/current/ksql/docs/capacity-planning.html
https://docs.confluent.io/current/ksql/docs/concepts/ksql-architecture.html
https://docs.confluent.io/current/ksql/docs/installation/install-ksql-with-docker.html

Monitoring UI for Apache Kafka - Kafka Manager vs Kafka Monitor [closed]

I am new to Kafka. We want to monitor and manage Kafka topics, and we tried different open-source monitoring tools:
kafka-monitor
kafka-manager
Both tools are good, but we are unable to decide which one should be included in our deployment stack. Which one is better, why, and in which scenario?
'Kafka Manager' from Yahoo looks like the older one, and 'Kafka Monitor' from LinkedIn is the newer one.
Lenses
Lenses (formerly Landoop) enhances Kafka with a user interface, a streaming SQL engine, and cluster monitoring. It enables faster monitoring of Kafka data pipelines.
They provide a free all-in-one Docker image (Lenses Box) which can serve a single broker for up to 25M messages. Note that it is recommended for development environments.
Cloudera SMM
Streams Messaging Manager is the solution for monitoring and managing clusters running Cloudera or Hortonworks Kafka. It also comes with replication capability.
Confluent
Another option is Confluent Enterprise, a Kafka distribution for production environments. It includes Control Center, a management system for Apache Kafka that enables cluster monitoring and management from a user interface.
Yahoo CMAK (Cluster Manager for Apache Kafka, previously known as Kafka Manager)
Kafka Manager, or CMAK, is a tool for monitoring Kafka that offers less functionality compared to the aforementioned tools.
KafDrop
KafDrop is a UI for monitoring Apache Kafka clusters. The tool displays information such as brokers, topics, partitions, and even lets you view messages. It is a lightweight application that runs on Spring Boot and requires very little configuration.
LinkedIn Burrow
Burrow is a monitoring companion for Apache Kafka that provides consumer lag checking as a service without the need for specifying thresholds. It monitors committed offsets for all consumers and calculates the status of those consumers on demand. An HTTP endpoint is provided to request status on demand, as well as provide other Kafka cluster information. There are also configurable notifiers that can send status out via email or HTTP calls to another service.
Kafka Tool
Kafka Tool is a GUI application for managing and using Apache Kafka clusters. It provides an intuitive UI that allows one to quickly view objects within a Kafka cluster as well as the messages stored in the topics of the cluster. It contains features geared towards both developers and administrators.
If you cannot afford licenses, then go for Yahoo Kafka Manager, LinkedIn Burrow or KafDrop. Confluent's and Landoop's products are the best out there, but unfortunately, they require licensing.
For more details, you can refer to my blog post Overview of UI Monitoring tools for Apache Kafka Clusters.
If you want to pay for licensing and Kafka cluster support, then you can use Confluent Control Center
Alternatively, the free route would be to use JMX exporters with Datadog and/or Prometheus/InfluxDB (with Grafana dashboards) to get overall system health checks (CPU, network, memory, etc.). That is much more information than what you get by monitoring only the Kafka processes with Kafka-specific tools.
At my company, we used the Yahoo product, we investigated the LinkedIn product, and several others mentioned. My company ultimately chose to use Prometheus+Grafana. Everyone loves it and I'd highly recommend it.
There are two big advantages to Prometheus+Grafana. First, the pair does full-featured Kafka metrics ingestion, visualization, and alerting, but it's not limited to Kafka: while our initial need was just to monitor Kafka, we also wanted metrics on HTTP servers and traffic, server utilization (CPU/RAM/disk), and custom application-level metrics, and Prometheus handles all of the above. Second, Prometheus and Grafana are high quality, well designed, and easy to use, whereas a lot of other products in this space are old and complicated to work with. Both are very customizable and polished: Grafana has a flashy, functional JavaScript interface that lets you build exactly the dashboards you want, and Prometheus has a very solid metric collection engine, storage engine, query language, and alerting system. Something like Yahoo Kafka Manager has much more limited functionality in all of these categories.
If you want to try Prometheus, you need to do two things:
1) Install and configure the JMX-to-Prometheus exporter on your Kafka brokers:
https://github.com/prometheus/jmx_exporter
2) Set up a Prometheus server to collect the metrics, and set up a Grafana dashboard to display the graphs that you want.
I'd also say that this is just for monitoring+dashboards+alerting. For management functions, you still need other tools.
The kafka-monitor is (despite the name) a load generation and reporting tool. Yahoo's kafka-manager is an overall monitoring tool.

Apache Kafka consumer groups and microservices running on Kubernetes, are they compatible?

So far, I have been using Spring Boot apps (with Spring Cloud Stream) and Kafka running without any supporting infrastructure (PaaS).
Since our corporate platform runs on Kubernetes, we need to move those Spring Boot apps into K8s to allow the apps to scale, and so on. Obviously there will be more than one instance of every application, so we will define a consumer group per application to ensure unique delivery and processing of every message.
Kafka will be running outside Kubernetes.
Now my doubt is: since the apps deployed on K8s are accessed through the K8s service that abstracts the underlying pods, and individual application pods can't be accessed directly from outside the K8s cluster, Kafka won't know how to call individual instances of the consumer group to deliver the messages, will it?
How can I make them work together?
Kafka brokers do not push data to clients. Rather, clients poll() and pull data from the brokers. As long as the consumers can connect to the bootstrap servers, and the Kafka brokers advertise an IP and port that the clients can connect to for their poll() calls, it will all work fine.
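In other words, the only connection that has to work is pod-to-broker; the broker never opens a connection back into the cluster. A minimal sketch of such a polling consumer, assuming a hypothetical bootstrap address kafka.example.com:9092 and a topic named my-topic:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class PollingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker address; only the advertised listeners of the
        // brokers must be reachable from inside the pod.
        props.put("bootstrap.servers", "kafka.example.com:9092");
        props.put("group.id", "my-app"); // one consumer group per application
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                // The client initiates every fetch; the broker never connects
                // back to the pod.
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```

Run several replicas of this with the same group.id and the group coordinator divides the topic's partitions among them, so scaling the Deployment up or down rebalances consumption automatically.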
Can Spring Cloud Data Flow solve your requirement to control the number of instances deployed?
Also, there is a community-released Spring Cloud Data Flow server for OpenShift:
https://github.com/donovanmuller/spring-cloud-dataflow-server-openshift