Is Swap Space Needed for a Kafka Node? - apache-kafka

We are in the midst of doing a Kafka POC between our enterprise and Google Cloud, and we were told that Google Cloud VMs don't provision swap space by default. Does anyone in the Kafka community who has implemented Kafka know whether Kafka needs swap space?

The brokers themselves should not require a substantial amount of memory and as such do not require swap space. Ideally you will run your brokers on dedicated VMs, allowing the broker to take full advantage of the OS's buffer cache. In order to hit the expected latency levels, the OS should have an abundant amount of 'free' memory. If you reach the point where pages need to be swapped to disk, you have already ventured into bad territory.

You only need swap space if Kafka is running out of memory, and in practice I haven't seen Kafka be a huge memory hog. So just be sure your VM is provisioned with enough memory, and the swap space should not matter.
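Even if the machine does end up with swap configured, the Kafka documentation suggests keeping the kernel's `vm.swappiness` very low (e.g. 1) so the kernel strongly prefers dropping page-cache pages over swapping out the broker process. A quick sketch for checking and setting it on Linux:

```shell
# Check the current swappiness (0-100); Kafka's OS tuning advice is to
# keep this very low so the broker process is never swapped out.
swappiness=$(cat /proc/sys/vm/swappiness)
echo "vm.swappiness=${swappiness}"

# To change it persistently (requires root; run manually):
#   echo 'vm.swappiness=1' | sudo tee /etc/sysctl.d/99-kafka.conf
#   sudo sysctl -p /etc/sysctl.d/99-kafka.conf
```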

Related

Kafka broker auto scaling

I am looking for suggestions on scaling a Kafka broker cluster up and down automatically based on load.
Let us say we have an e-commerce site and we are capturing certain activities or events, and these events are sent to Kafka. Since site traffic will be higher during peak hours/days, keeping a fixed number of brokers sized for the peak is not ideal, so we want to scale the number of brokers up when site traffic is high and back down when traffic is low.
How do people solve this kind of issue? I am not able to find any resources on this topic. Any help will be greatly appreciated.
Kafka doesn't really work that way. Adding/removing brokers from the cluster is a very hands-on process, and it creates a lot of additional load/overhead on the cluster, so you wouldn't want the cluster to be automatically scaling up or down by itself.
The main reason it creates so much additional overhead is that adding or removing brokers requires lots of data copying across the cluster, on top of the normal traffic. Basically, all the data from a dead broker needs to be copied somewhere else, to keep the same replication factor for the topic/partitions; or if it's a new broker, data needs to be shuffled into it from the other brokers, so that the load on the cluster as a whole is reduced.
All this data being copied around creates lots of IO/CPU load on the cluster, and it might be enough to cause significant problems.
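To illustrate how hands-on this is: you typically have to spell out the data movement yourself in a reassignment plan and feed it to the `kafka-reassign-partitions.sh` tool that ships with Kafka. A minimal sketch (the topic name, broker ids, and address are all hypothetical):

```shell
# Write a reassignment plan moving two partitions of a hypothetical
# "events" topic onto a newly added broker (id 4).
cat > reassign.json <<'EOF'
{"version": 1, "partitions": [
  {"topic": "events", "partition": 0, "replicas": [1, 4]},
  {"topic": "events", "partition": 1, "replicas": [2, 4]}
]}
EOF
echo "plan written to reassign.json"

# Against a live cluster you would then run (manually):
#   bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
#     --reassignment-json-file reassign.json --execute
# and later check progress with --verify. The inter-broker copying this
# triggers is exactly the overhead described above.
```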
The best way to handle this scenario is to do performance testing and optimization with 2x or even 3x the traffic you'd expect during peak hours, and build out the cluster accordingly. This way, you'll have plenty of headroom if there are sudden spikes, and you won't have to scale-out/scale-in.
Kafka is extremely performant, even for traffic of millions of messages per second, so you will probably find that the cluster size your application/system requires is not as large/expensive as you initially thought.

Running Kafka cluster in Docker containers?

From a performance perspective, is it a good choice to run Kafka in Docker containers? Are there things one should watch out for, tune specifically, etc.?
There is a good research paper from IBM on this topic - it is a bit dated by now, but I am sure the basic statements still hold true and have only been improved upon. The gist is that the overhead introduced by Docker is quite small when it comes to CPU and memory, but for IO-heavy applications you need to be a bit more careful. Depending on the workload, I'd put Kafka squarely in the IO-heavy group, so it is probably not a no-brainer.
Kafka benefits a lot from fast disk access, so if you run your containers in some sort of distributed platform with storage attached over a SAN or NFS share or something like that, I'd assume that you will notice a difference. But if you only chose containers to ease deployment and run them on one physical machine, I'd assume the difference to be negligible.
But as with all performance questions, it is hard to say in general; you'll have to test your specific use case and environment to be sure.
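If you do containerize, the disk concern above is largely about where the data actually lands. As a sketch (the image name and container data path are assumptions; check your image's documentation), bind-mounting a fast local disk for Kafka's log dirs avoids both the container's copy-on-write filesystem and network-attached storage:

```yaml
# docker-compose sketch; image and data path are assumptions
services:
  kafka:
    image: apache/kafka:latest
    volumes:
      # keep the Kafka log dirs on a fast local disk instead of the
      # container's overlay filesystem or a SAN/NFS mount
      - /mnt/local-disk/kafka-logs:/var/lib/kafka/data
```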
I believe the performance would largely be affected by the type of machine you use. LinkedIn and other large users of Kafka often recommend using spinning disks rather than SSDs because of the predominantly linear reads and writes, along with the use of zero-copy transfer (sendfile) in the Kafka protocol. On a machine hosting many containers, you'd lose all the advantages that spinning disks give Kafka.

Strategy for distributed-computing inside microservices architecture?

I am looking for advice for the following problem:
I am working with other people on a microservices architecture where the microservices are distributed on different machines. Resources on the machines are very limited.
Currently, communication runs through a message broker.
In my use case, one microservice occasionally needs to run some heavy computation. I would like to perform the computation on a machine with low CPU usage and enough available memory space.
My first idea is that every machine installs a microservice which publishes its CPU usage and available memory space to the message broker. Each microservice that needs to distribute its workload looks for the fittest machines and installs "worker" microservices on the fly. Results are published to the message broker. Since resources are limited, worker microservices are uninstalled when no longer needed.
I haven't found a similar use case yet. Do you guys know of a better existing solution?
I am quite new to the topic of microservices and distributed computing, so I would appreciate some advice and help.

Can Kafka be configured to not retain at all?

I am very much new to Kafka, and I am researching whether Kafka can be used as a real-time message broker rather than one that retains messages and then sends them. In other words, can it just do the basic pub/sub broker job without retaining at all?
Is this configurable in the Kafka server configuration?
I don't think it's possible to accomplish this. One of the key differences between Kafka and other messaging systems is that Kafka uses the underlying OS's file system to handle storage.
Another unconventional choice that we made is to avoid explicitly caching messages in memory at the Kafka layer. Instead, we rely on the underlying file system page cache. — Whitepaper
Kafka automatically writes messages to disk, so it retains them by default. This is a conscious decision the designers of Kafka made because they believe it is worth the tradeoffs.
If you're asking this because you're worried that writing to disk may be slower than keeping things in memory, the whitepaper addresses that too:
We have found that both the production and the consumption have consistent performance linear to the data size, up to many terabytes of data. — Whitepaper
So the size of the data that you've retained doesn't impact how fast the system is.
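While retention can't be disabled outright, it can be made very short. A sketch of the relevant broker settings in `server.properties` (values are illustrative; note that retention is only enforced on closed log segments, so rolling segments frequently matters too):

```properties
# Keep data only ~1 minute (cannot be zero/disabled)
log.retention.ms=60000
# Check for deletable segments frequently
log.retention.check.interval.ms=30000
# Roll segments often so old data becomes deletable sooner
log.roll.ms=60000
```

The per-topic equivalents (`retention.ms`, `segment.ms`) can be set with `kafka-configs.sh --alter --add-config` if you only want this behavior on specific topics.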

Memcached and virtual memory

According to this thread (not very reliable, I know) memcached does not use the disk, not even virtual memory.
My questions are:
Is this true?
If so, how does memcached ensure that the memory it gets assigned never overflows to disk?
memcached avoids going to swap through two mechanisms:
1. Informing the system administrators that the machines should never go to swap. This allows the admins to maybe not configure swap space for the machine (seems like a bad idea to me) or configure the memory limits of the running applications to ensure that nothing ever goes into swap. (Not just memcached, but all applications.)
2. The mlockall(2) system call can be used (-k) to ensure that all the process's memory is always locked in memory. This is mediated via the setrlimit(2) RLIMIT_MEMLOCK control, so admins would need to modify e.g. /etc/security/limits.conf to allow the memcached user account to lock a lot more memory than is normal. (Locked memory is limited to prevent untrusted user accounts from starving the rest of the system of free memory.)
Both these steps are fair assuming the point of the machine is to run memcached and perhaps very little else. This is often a fair assumption, as larger deployments will dedicate several (or many) machines to memcached.
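For the second mechanism, the admin side looks roughly like this (the account name and the 2 GB figure are just examples); limits.conf takes the memlock limit in kilobytes:

```
# /etc/security/limits.conf — allow the memcached account to lock up to
# 2 GB of memory (values are in KB)
memcached  soft  memlock  2097152
memcached  hard  memlock  2097152
```

With that in place, starting memcached with -k makes it call mlockall(2), so its pages can never be swapped out.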
You configure memcached to use a fixed amount of memory. When that memory is full, memcached evicts the least recently used data to stay under the limit. It is that simple.