Is possible for a container to send kafka event when finishes? - apache-kafka

We just migrated to a kubernetes cluster, I was wondering if it is possible to send a kafka event when a container/pod finishes automatically with the stdout as message. Right now we are using fluentd with elastic search but the output of a pod is used as input for the next one, we need to poll constantly elastic search for when the output is ready and that causes performance issues on overall execution

I'm not sure of your current setup but my first thought would jump to:
Use something such as fluentd or Logstash on it's own pod per node
Configure volume access to Kubernetes log folder /var/log/containers/*
Use the Kafka output for either fluentd or Logstash with file input (tail) on the logging folder
This approach would require the configuration above on each node however but requires minimal configuration of logging locations etc..
It's not something I've personally configured but have considered it for the future.
More info here

Related

How to read in file paths into a queue that is in a Kubernetes cluster?

I want to read file paths from a persistent volume and store these file paths into a persistent queue of sorts. This would probably be done with an application contained within a pod. This persistent volume will be updated constantly with new files. This means that I will need to constantly update the queue with new file paths. What if this application that is adding items to the queue crashes? Kubernetes would be able to reboot the application, but I do not want to add in file paths that are already in the queue. The app would need to know what exists in the queue before adding in files, at least I would think. I was leaning on RabbitMQ, but apparently you cannot search a queue for specific items with this tool. What can I do to account for this issue? I am running this cluster on Google Kubernetes Engine, so this would be on the Google Cloud Platform.
What if this application that is adding items to the queue crashes?
Kubernetes would be able to reboot the application, but I do not want
to add in file paths that are already in the queue. The app would need
to know what exists in the queue before adding in files
if you are looking for searching option also i would suggest using the Redis instead of Queue Running rabbitMQ on K8s i have pretty good experience when it's come to scaling and elasticity however there is HA helm chart of RabbitMQ you can use it.
i would Recomand checking out Redis and using it as backend to store the data, if you looking forward to create queue still you can use Bull : https://github.com/OptimalBits/bull
it uses the Redis as background to store the data and you can create the queue using this library.
As in Redis you will be taking continuous dump at every second or so...! there is less chances to miss data however in RabbitMQ you can keep persistent messaging plus it provide option for acknowledgment and all.
it's about the actual requirement that you want to implement. If your application wants to order in the list you can not use the Redis in that case RabbitMQ would be best.
Have you ever heard about KubeMQ? There is a KubeMQ community where you can refer to with the guides and help.
As an alternative solution you can find useful guide on official Kubernetes documentation on creating working queue with Redis

How to send logs from Google Stackdriver to Kafka

I see many docs and posts about how to send logs to Stackdriver but almost no information about how to do the opposite - send logs from the Stackdriver to Kafka.
In my case, our Ops want to collect the logs from our web servers using Google's stackdriver agents and pushing them to stackdriver ... However, for my stream processing needs I want to get the logs into Kafka to use it's unparalleled abilities to retain and reprocess data by any number of consumers, something that I cannot do with PubSub.
So, what are the options for doing this? I only saw a couple of possible avenues - neither sounds too good:
based on this post: (https://powerspace.tech/how-to-stream-data-from-google-pubsub-to-kafka-with-kafka-connect-dbef1c340a76) push data into PubSub first, and then read from it using either Kafka connector or write my own Kafka consumer. I hate the thought of adding yet another hop (serialize/deserialize/ack/etc.) between the source of data and Kafka ....
I noticed a brief mentioning in passing on adding a plugin to Google's version of Fluentd (which is what stackdriver log collection agent is based on) here: https://powerspace.tech/how-to-stream-data-from-google-pubsub-to-kafka-with-kafka-connect-dbef1c340a76 . Not many details - so hard to tell how involved this approach is ...
Any other options?
Thank you!
Enter in to the Kafka console and add certain elements in the console. Once you have added the elements in the Kafka console you need to check if these elements are reflected successfully in the cloud shell. For this you will run the command > $ gcloud pubsub subscriptions pull from-kafka — auto-ack — limit=10 < . Once you run this command it will take some time to sync with the Kafka console. You will get the results after running this command a couple of times.
You will run the commands in the Cloud Shell and see the output in the Kafka VM SSH.
***Image1
Now you will be verifying the exact opposite procedure where in you will be running the command in the Kafka VM and seeing the output in the Cloud Shell. It will take some time for the output to be reflected and you may have to run the command > $ gcloud pubsub subscriptions pull from-kafka — auto-ack — limit=10 < a couple of times to see the output. Your output will look like this
*** image2
The Kafka plugin is deprecated. For more information, refer to https://cloud.google.com/stackdriver/docs/deprecations
Note: This functionality is only available for agents running on Linux. It is not available on Windows.
Kafka is monitored via JMX. Monitoring supports monitoring Kafka version 0.8.2 and higher.
On your VM instance, download kafka-082.conf from the GitHub configuration repository and place it in the directory /etc/stackdriver/collectd.d/:
(cd /etc/stackdriver/collectd.d/ && sudo curl -O https://raw.githubusercontent.com/Stackdriver/stackdriver-agent-service-configs/master/etc/collectd.d/kafka-082.conf)
The downloaded plugin configuration file assumes that your Kafka server is configured to accept JMX connections on port 9999. If you have configured Kafka with a different JMX port, as root, edit the file and follow the instructions to change the JMX port settings.
After adding the configuration file, restart the Monitoring agent by running the following command:
sudo service stackdriver-agent restart
What is monitored:
https://cloud.google.com/monitoring/api/metrics_agent#agent-kafka

How can I get log rotation working inside a kubernetes container/pod?

Our setup:
We are using kubernetes in GCP.
We have pods that write logs to a shared volume, with a sidecar container that sucks up our logs for our logging system.
We cannot just use stdout instead for this process.
Some of these pods are long lived and are filling up disk space because of no log rotation.
Question:
What is the easiest way to prevent the disk space from filling up here (without scheduling pod restarts)?
I have been attempting to install logrotate using: RUN apt-get install -y logrotate in our Dockerfile and placing a logrotate config file in /etc/logrotate.d/dynamicproxy but it doesnt seem to get run. /var/lib/logrotate/status never gets generated.
I feel like I am barking up the wrong tree or missing something integral to getting this working. Any help would be appreciated.
We ended up writing our own daemonset to properly collect the logs from the nodes instead of the container level. We then stopped writing to shared volumes from the containers and logged to stdout only.
We used fluentd to the logs around.
https://github.com/splunk/splunk-connect-for-kubernetes/tree/master/helm-chart/splunk-kubernetes-logging
In general, you should write logs to stdout and configure log collection tool like ELK stack. This is the best practice.
However, if you want to run logrotate as a separate process in your container - you may use Supervisor, which serves as a very simple init system and allows you to run as many parallel process in container as you want.
Simple example for using Supervisor for rotating Nginx logs can be found here: https://github.com/misho-kr/docker-appliances/tree/master/nginx-nodejs
If you write to the filesystem the application creating the logs should be responsible for rotation. If you are running a java application with logback or log4j it is simple configuration change. For other languages/frameworks it is usually similar.
If that is not an option you could use a specialized tool to handle the rotation and piping the output to it. One example would be http://cr.yp.to/daemontools/multilog.html
As method of last resort you could investigate to log into a named pipe (FIFO) instead of a real file and have some other process handling the retrieval and writing of the data - including the rotation.

Logging Kubernetes with an external ELK stack

Is there any documentation out there on sending logs from containers in K8s to an external ELK cluster running on EC2 instances?
We're in the process of trying to Kubernetes set up and I'm trying to figure out how to get the logging to work correctly. We already have an ELK stack setup on EC2 for current versions of the application but most of the documentation out there seems to be referring to ELK as it's deployed to the K8s cluster.
I am also working on the same cause.
First you should know what driver is being used by your docker containers to manage the logs (json driver/ journald etc - read here).
After that you should use some log collector in your architecture to send the logs to the Logstash endpoint. You can use filebeat/fluent bit. They are light weight alternatives to logstash/fluentd respectively. You must use one of them and not directly send your logs to logstash via syslog since these log shippers have a special functionality of enriching your logs with kubernetes metadata of the respective containers.
There might be lot of challenges after that. Parsing log data (multiline logs for example) etc. For an efficient pipeline, it’s better to do most of the work (i.e. extracting the date object from the logs etc) at the log sender side, than using the common logstash for this purpose that might be a bottle-neck.
Note that in case the container logs are not sent to stdout/stderr but written else-where, you might need to run filebeat/fluent-bit as side-car with your containers.
As for the links for documentation are concerned, I myself didn’t find anything documented in a single place on this, but the keywords that I mentioned over, reading about them I got to know many things.
Hope this helps.

Application Logging/Alerting/Metering in Kubernetes

Before looking at Kubernetes, we are writing all our logs to stdout(according to 12-factor-app) and using logspout to collect the logs to Logstash. And in Logstash we then route logs to different targets:
InfluxDB+Grafana: to monitor application metrics(e.g., how long does a certain calculation takes)
Riemann: to alert if some performance thresholds are crossed
How these things can be done in Kubernetes?
I know that with Heapster you can see JVM level graphs(memory usages, etc) or even maybe Heapster can send events to Riemann in order to alert some system level statistics(e.g., disk is full). But for stuff on the application level, what would be the right approach then?
Heapster should be grabbing the stdout from the containers as well and can send the data to different backends (sinks). It would essentially be an API call with the data. Check out: https://github.com/kubernetes/heapster/blob/master/docs/sink-configuration.md
I'm not 100% sure on stdout being the only method for a 12fa, but we use a in-house logging lib that also streams the stdout to our logging engine (graylog). That happens inside the app so that the log messages are preserved as a full 'event' vs heapster or other stdout scrapings treating each line as an event.