Is Apache Nifi ready to use with Kubernetes in production?


I am planning to set up Apache NiFi on Kubernetes and take it to production. While researching, I didn't find anyone using this combination in a production setup.
Is it a good idea to choose this combination? Please share your thoughts and experience.
https://community.cloudera.com/t5/Support-Questions/NiFi-on-Kubernetes/td-p/203864

As mentioned in the comments, work has been done on NiFi on Kubernetes, but it is not currently generally available.
It is good to know that there will be dataflow offerings where NiFi and Kubernetes meet in some shape or form during the coming year.* So I would recommend keeping an eye out for this and discussing it with your local contacts before trying to build it from scratch.
*Disclaimer: Though I am an employee of Cloudera, the main driving force behind NiFi, I am not qualified to make promises and this is purely my own view.

I would like to invite you to try the Helm chart at https://github.com/Datenworks/apache-nifi-helm
We've been running a 5-node NiFi cluster on GKE (Google Kubernetes Engine) in a production environment without major issues, and it performs well. Please let me know if you run into any issues running this chart in your environment.

Regarding any high-volume setup on k8s: be sure to tune your Linux kernel, primarily the settings related to the Linux connection tracker (conntrack). You should also expect to see non-zero TCP timeouts, retries, out-of-window ACKs, et al. Depending on which container networking implementation is used, additional configuration changes may be required.
I will assume you are using containerd and NOT using Docker networking (except, obviously, for the container(s) within a pod).
The issue applies to ANY heavy-IO pod: Kafka, NiFi, MySQL, PostgreSQL, you name it.
The incidence increases when:
"high" volumes of cross-pod (especially cross-node) TCP connections occur
messages are large (a megabyte or more), which produces additional errors
Be aware of any other components using either the Pod or VM TCP stack (e.g. PVC software supporting NiFi persistent storage).
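
One way to see whether the connection tracker is under pressure is to compare the current number of tracked connections with the table limit. The following is only a sketch: it assumes a Linux node with the nf_conntrack module loaded, which exposes nf_conntrack_count and nf_conntrack_max under /proc/sys/net/netfilter. The 80% warning threshold and the function names are illustrative assumptions, not values from the answer above.

```python
from pathlib import Path

# Standard procfs location when the nf_conntrack module is loaded.
CONNTRACK_DIR = Path("/proc/sys/net/netfilter")


def conntrack_usage(count: int, maximum: int) -> float:
    """Return the fraction of the conntrack table in use (0.0 to 1.0)."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return count / maximum


def read_conntrack(base: Path = CONNTRACK_DIR) -> tuple[int, int]:
    """Read (current entries, table limit) from procfs."""
    count = int((base / "nf_conntrack_count").read_text().strip())
    maximum = int((base / "nf_conntrack_max").read_text().strip())
    return count, maximum


def is_under_pressure(count: int, maximum: int, threshold: float = 0.8) -> bool:
    """True when table usage reaches the (assumed) warning threshold."""
    return conntrack_usage(count, maximum) >= threshold


if __name__ == "__main__":
    try:
        count, maximum = read_conntrack()
    except OSError:
        print("conntrack counters not available on this host")
    else:
        print(f"conntrack: {count}/{maximum} ({conntrack_usage(count, maximum):.1%})")
        if is_under_pressure(count, maximum):
            print("warning: consider raising net.netfilter.nf_conntrack_max")
```

Run it on the node itself (or in a privileged pod that sees the host's /proc), since the counters belong to the node's kernel rather than to an individual container.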

Related

Can Kubernetes work like a compute farm and route one request per pod

I've dockerized a legacy desktop app. This app does resource-intensive graphical rendering from a command line interface.
I'd like to offer this rendering as a service in a "compute farm", and I wondered if Kubernetes could be used for this purpose.
If so, how in Kubernetes would I ensure that each pod only serves one request at a time (this app is resource-intensive and likely not thread-safe)? Should I write a single-threaded wrapper/invoker app in the container and thus serialize requests? Would K8s then be smart enough to route subsequent requests to idle pods rather than letting them pile up on an overloaded pod?

Interesting question.
The built-in Service object, together with kube-proxy, does spread requests across different pods, but it picks a backend without regard to how busy each pod is (round-robin or random, depending on the proxy mode), which does not fit your use case.
Your use case would require changes to the kube-proxy setup when the cluster is created. This approach is tedious and requires running your own cluster (managed cloud services do not support it).
Your best bet would be to set up a service mesh like Istio, which provides this kind of routing with little configuration, along with a lot of other useful functionality.
See if this helps.
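
If you do go with the wrapper idea from the question, one option is for each pod to accept a single job and immediately refuse anything else while it is busy, so that the caller (or a proxy in front of the pods) can retry against another pod. The sketch below is not part of the answer above; `SingleSlotRunner` and `try_run` are hypothetical names, and the actual rendering call is left to the caller.

```python
import threading
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


class SingleSlotRunner:
    """Runs at most one job at a time; rejects new jobs while busy."""

    def __init__(self) -> None:
        self._slot = threading.Lock()

    def try_run(self, job: Callable[[], T]) -> Tuple[bool, Optional[T]]:
        # Non-blocking acquire: return at once if another job holds the slot.
        if not self._slot.acquire(blocking=False):
            return False, None
        try:
            return True, job()
        finally:
            # Free the slot even if the job raised.
            self._slot.release()

    @property
    def busy(self) -> bool:
        return self._slot.locked()
```

In an HTTP front end, a rejected call would typically be mapped to a 503 response, and `busy` could feed a readiness probe so that the pod is taken out of rotation while it renders. Both of those are design suggestions under the same assumptions, not something Kubernetes does for you automatically.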

How can I make a Kubernetes multi-cluster Redis solution?

This week I deployed a Redis instance using Bitnami's Helm chart into a GKE (Google Kubernetes Engine) cluster. That part went fine, but the challenge now is to build a failover and disaster recovery strategy that replicates the data to another Redis instance in another GKE cluster (in the same GCP project). How can I do this? I tried Persistent Volume Claims, but they are only visible inside the cluster.

Redis Enterprise does have a WAN (multi-geo) replication mode, however I've never used it and it looks like it dramatically limits which features of Redis you can use to things that are compatible with a CRDT model. Basically first you need to answer how you would do this by hand, and then investigate automating that. WAN failover is a very complex thing, and generally you wouldn't even want to do it since you wouldn't fail over at the Redis level. Instead you would fail over your entire DC (or whatever you want to call your failure zones). Distributed database modelling and management is very tricky, here be dragons.

Terraform, Kubernetes, Mesos etc - how are they connected

I have been reading a lot on the internet, but the information is unclear or mixed up, so I thought I would ask the question here.
I am trying to understand how Terraform is the same as or different from container orchestration tools like Kubernetes, Mesos, etc.
Can Terraform work independently of Kube and Mesos? How is it connected to Docker containers?
Can someone please shed some light?
Thanks!!!

I don't know as much about Mesos as I would like, but I do know about Kubernetes and Terraform. Although I'm not an expert, the basic point is that these tools have different purposes. While Terraform deals with provisioning infrastructure in the cloud by using the providers' APIs, Kubernetes deals with the administration and orchestration of containers on the underlying infrastructure by using the API of the container daemon, such as the Docker daemon.
So, generally speaking, Terraform's main point is to make the creation of cloud infrastructure transparent: you write down what you want to have (servers, networks, security policies, some PaaS services), whereas Kubernetes is the orchestrator of containers.
Hope this helps. If anyone sees a mistake, please point it out so we all improve.

Terraform is a tool for building your infrastructure, an open-source project from HashiCorp. If you are familiar with AWS and have heard of CloudFormation, both work in a similar manner, but Terraform has some better features: you can write your whole infrastructure as code, bring it up in one click, and decommission it in one click.
For more, visit the site: https://www.terraform.io
Kubernetes (an open-source project started by Google) and Apache Mesos (or DC/OS), a project of the Apache Software Foundation, are both used for container orchestration (and I'm purposely avoiding the word Docker). Container orchestration is not for everyone and does not answer every need.
Mesos was launched first, but Mesos networking was really hard to manage at the time. Then, in 2014, the first release of Kubernetes came out.
DC/OS (the Distributed Cloud Operating System) is an open-source, distributed operating system based on the Apache Mesos distributed systems kernel.
It is in a race with Kubernetes.
I would suggest going through this article to get a better understanding of Kubernetes vs Mesos: https://logz.io/blog/kubernetes-vs-mesos/
And yes, they are not related to Terraform at all.
Thanks

Will the master know the data on workers/nodes in k8s

I am trying to deploy a k8s cluster in the cloud. There are two options: the masters are entrusted to the cloud provider, or I maintain them myself.
So I wonder whether entrusting the masters to the provider could leak the data on the workers.
In short, will the master know the data on workers/nodes?

The abstractions in Kubernetes are very well defined with clear boundaries. You have to understand the concept of Volumes first. As defined here,
A Kubernetes volume is essentially a directory accessible to all
containers running in a pod. In contrast to the container-local
filesystem, the data in volumes is preserved across container
restarts.
Volumes are attached to the containers in a pod, and there are several types of volumes.
You can see the layers of abstraction source
Master to Cluster communication
There are two primary communication paths from the master (apiserver) to the cluster. The first is from the apiserver to the kubelet process which runs on each node in the cluster. The second is from the apiserver to any node, pod, or service through the apiserver’s proxy functionality.
Also, you should check the CCM - The cloud controller manager (CCM) concept (not to be confused with the binary) was originally created to allow cloud specific vendor code and the Kubernetes core to evolve independent of one another. The cloud controller manager runs alongside other master components such as the Kubernetes controller manager, the API server, and scheduler. It can also be started as a Kubernetes addon, in which case it runs on top of Kubernetes.
Hope this answers all your questions related to Master accessing the data on Workers.
If you are still looking for more secure ways, check 11 Ways (Not) to Get Hacked

Short answer: yes the control plane can access all of your data.
Longer and more realistic answer: probably don't worry about it. An attack that succeeds against a managed control plane would most likely be just as successful against one you run yourself. The exact internal details of GKE/AKS/EKS are a bit fuzzy, but all three providers have a lot of experience running multi-tenant systems, and it wouldn't be negligent to trust that they have enough protections in place against lateral escalations between tenants on the control plane.

Clusters and nodes formation in Kubernetes

I am trying to deploy my Docker images using Kubernetes. While reading about Kubernetes, I went through the documentation and many YouTube tutorials. They only cover creating pods and services and writing the .yml files for them. I have some doubts, which I am listing below:
When using Kubernetes, how can I create clusters and nodes?
Can I deploy my current docker-compose image directly using pods only? Why do I need to create a service .yml file?
I am new to the world of containerization, Docker, and Kubernetes.

My favorite way to create clusters is kubespray because I find ansible very easy to read and troubleshoot, unlike more monolithic "run this binary" mechanisms for creating clusters. The kubespray repo has a vagrant configuration file, so you can even try out a full cluster on your local machine, to see what it will do "for real"
But with the popularity of kubernetes, I'd bet if you ask 5 people you'll get 10 answers to that question, so ultimately pick the one you find easiest to reason about, because almost without fail you will need to debug those mechanisms when something inevitably goes wrong
The short version, as Hitesh said, is "yes," but the long version is that one will need to be careful because local docker containers and kubernetes clusters are trying to solve different problems, and (as a general rule) one could not easily swap one in place of the other.
As for the second part of your question, a Service in kubernetes is designed to decouple the current provider of some networked functionality from the long-lived "promise" that such functionality will exist and work. That's because in kubernetes, the Pods (and Nodes, for that matter) are disposable and subject to termination at almost any time. It would be severely problematic if the consumer of a networked service needed to constantly update its IP address/ports/etc to account for the coming-and-going of Pods. This is actually the exact same problem that AWS's Elastic Load Balancers are trying to solve, and kubernetes will cheerfully provision an ELB to represent a Service if you indicate that is what you would like (and similar behavior for other cloud providers)
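
To make that decoupling concrete, here is a rough sketch that builds a minimal Service manifest as a Python dict and prints it as JSON, a format `kubectl apply -f` accepts alongside YAML. The Service name `web`, the label `app: web`, and the ports 80 and 8080 are assumptions chosen for illustration; the selector has to match whatever labels your own Pods carry.

```python
import json


def service_manifest(name: str, selector: dict, port: int, target_port: int) -> dict:
    """Build a minimal Service that forwards to Pods matching the selector."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            # Any Pod carrying these labels becomes a backend, whenever it appears.
            "selector": selector,
            # Clients connect to `port`; traffic is sent to `targetPort` on the Pod.
            "ports": [{"port": port, "targetPort": target_port}],
        },
    }


if __name__ == "__main__":
    print(json.dumps(service_manifest("web", {"app": "web"}, 80, 8080), indent=2))
```

Consumers address the Service by its stable name, while the set of Pods behind the selector can change freely.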
If you are not yet comfortable with containers and docker as concepts, then I would strongly recommend starting with those topics, and moving on to understanding how kubernetes interacts with those two things after you have a solid foundation. Else, a lot of the terminology -- and even the problems kubernetes is trying to solve -- may continue to seem opaque