Is there a way to add variables to Kafka server.properties? - apache-kafka

I don't have any experience with Kafka yet and need to automate a task. Is there a way that I can use env variables in the configuration file?
To be more specific:
advertised.listeners=INSIDE://:9092,OUTSIDE://<hostname>:29092
I'd like to extract and use the hostname from my env variables.

Property files offer no variable interpolation.
If you start Kafka via Docker, or write your own shell script that generates the property file prior to starting the broker, then you can inject values that way.
Some examples of tools that help with this include confd, consul-template, and dockerize.
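For instance, a minimal wrapper script (a sketch only, assuming a server.properties.template file that contains a literal ${HOSTNAME} placeholder, e.g. advertised.listeners=INSIDE://:9092,OUTSIDE://${HOSTNAME}:29092, and the gettext envsubst utility) could render the real file just before starting the broker:

    #!/usr/bin/env bash
    # Render server.properties from a template, substituting only ${HOSTNAME},
    # then start the broker. All paths here are placeholders for illustration.
    set -euo pipefail
    : "${HOSTNAME:?HOSTNAME must be set}"

    envsubst '${HOSTNAME}' \
      < /opt/kafka/config/server.properties.template \
      > /opt/kafka/config/server.properties

    exec /opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties

Any of the templating tools above can replace the envsubst step; the idea is the same either way: the file is generated from the environment before startup, because the broker itself will not do it.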

Related

Programmatically configure Kafka Binder configurations through Beans instead of application.yml file

I'm looking for a way to set Kafka Binder configurations programmatically, especially the brokers, topic info, and a few Kafka consumer properties.
Why? I don't want to build and deploy the application every time there is a change in config (due to DR or otherwise).
I was looking into the samples and couldn't find a relevant one for this.
I tried creating a Bean of org.springframework.cloud.stream.binder.kafka.streams.properties.KafkaStreamsBinderConfigurationProperties, but there are so many other properties to configure in that class, and I don't know how to get org.springframework.cloud.stream.binder.kafka.streams.KafkaStreamsBinderSupportAutoConfiguration to use the new instance instead.
I am looking to create a config similar to the one here, programmatically. Is this even possible right now? Any help regarding this is much appreciated.
Hmm... you don't need to build and deploy the application when changing properties. What made you think that?
You can provide application configuration properties as standard command-line options via the -D flag (e.g., -Dspring.cloud.function.definition=blah).
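As a rough illustration (the jar name, broker list, and function definition below are placeholders, not values from the question), the binder settings can be supplied at launch time rather than baked into application.yml:

    # Hypothetical launch command: override binder settings without rebuilding.
    java \
      -Dspring.cloud.stream.kafka.binder.brokers=broker1:9092,broker2:9092 \
      -Dspring.cloud.function.definition=process \
      -jar app.jar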

Kafka broker.id: env variable vs config file precedence

I'm setting up a Kafka cluster in which I'm setting broker.id=-1 so that broker ids are automatically generated, but in some cases I want to set them using environment variables (i.e. KAFKA_BROKER_ID).
If I do so, will the nodes with the KAFKA_BROKER_ID env variable use it, or auto-generate their ids?
It depends on how you are deploying your Kafka installation.
Out of the box, Kafka does not use system properties to configure the broker id, so you need to put the value into the .properties file.
(Among other evidence: grepping for KAFKA_BROKER_ID in the Kafka source returns nothing.)
KAFKA_BROKER_ID appears to be added by various Docker images; you'd need to check with the author of the one you are using.
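For context, a simplified sketch of what such image entrypoint scripts typically do (illustrative only, not the actual script of any particular image) is to map KAFKA_* environment variables onto server.properties before starting the broker:

    # If KAFKA_BROKER_ID is set, write it into server.properties; otherwise
    # leave broker.id=-1 in place so the id is auto-generated.
    if [ -n "${KAFKA_BROKER_ID:-}" ]; then
      sed -i "s/^broker\.id=.*/broker.id=${KAFKA_BROKER_ID}/" \
        /opt/kafka/config/server.properties
    fi
    exec /opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties

So whether the env variable wins depends entirely on the image's entrypoint, not on Kafka itself.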

Kafka-connect. Add environment variable to custom converter config

I'm using kafka-connect-elasticsearch with a custom converter, which extends the standard org.apache.kafka.connect.json.JsonConverter.
In my custom converter I need to access an environment variable.
Let's assume, I need to append to every message the name of the cluster, which is written to environment variable CLUSTER.
How can I access my environment variable in the converter?
Maybe I should read it during the converter configuration phase (in the configure(Map<String, ?> configs) method)?
How can I forward CLUSTER env variable value to this configs map?
You can't get it from that map; you would need to use System.getenv("CLUSTER").
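One practical note implied by that answer: System.getenv only sees variables present in the environment of the Connect worker JVM, so CLUSTER has to be exported by whatever starts the worker. A minimal sketch, assuming a standard Kafka installation under /opt/kafka (paths are placeholders):

    # Export the variable in the shell, systemd unit, or Docker -e flag that
    # launches the Connect worker; the converter can then read it via
    # System.getenv("CLUSTER").
    export CLUSTER=my-cluster-name
    exec /opt/kafka/bin/connect-distributed.sh \
      /opt/kafka/config/connect-distributed.properties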

spark-jobserver - managing multiple EMR clusters

I have a production environment that consists of several (persistent and ad-hoc) EMR Spark clusters.
I would like to use one instance of spark-jobserver to manage the job JARs for this environment in general, and to be able to specify the intended master right when I POST /jobs, rather than permanently in the config file (via the master = "local[4]" configuration key).
Obviously I would prefer to have spark-jobserver running on a standalone machine, and not on any of the masters.
Is this somehow possible?
You can write a SparkMasterProvider
https://github.com/spark-jobserver/spark-jobserver/blob/master/job-server/src/spark.jobserver/util/SparkMasterProvider.scala
A complex example is here https://github.com/spark-jobserver/jobserver-cassandra/blob/master/src/main/scala/spark.jobserver/masterLocators/dse/DseSparkMasterProvider.scala
I think all you have to do is write one that returns the config input as the Spark master; that way you can pass it as part of the job config.
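A rough sketch of how a submission could then look from the client side (the config key name, cluster address, app name, and class path are hypothetical; the key is whatever your custom SparkMasterProvider decides to read from the job config):

    # Hypothetical job submission: the HOCON body carries the intended master,
    # and the custom SparkMasterProvider picks it up from the job config.
    curl -X POST \
      -d 'spark.master = "spark://emr-master-1.internal:7077"' \
      'http://jobserver-host:8090/jobs?appName=my-app&classPath=com.example.MyJob'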

Using Ansible to automatically configure AWS autoscaling group instances

I'm using Amazon Web Services to create an autoscaling group of application instances behind an Elastic Load Balancer. I'm using a CloudFormation template to create the autoscaling group + load balancer and have been using Ansible to configure other instances.
I'm having trouble wrapping my head around how to design things such that when new autoscaling instances come up, they can automatically be provisioned by Ansible (that is, without me needing to find out the new instance's hostname and run Ansible for it). I've looked into Ansible's ansible-pull feature but I'm not quite sure I understand how to use it. It requires a central git repository which it pulls from, but how do you deal with sensitive information which you wouldn't want to commit?
Also, the current way I'm using Ansible with AWS is to create the stack using a CloudFormation template, then get the hostnames as output from the stack, and then generate a hosts file for Ansible to use. This doesn't feel quite right. Is there a "best practice" for this?
Yes. Another way is simply to run your playbooks locally once the instance starts. For example, you can create an EC2 AMI for your deployment whose rc.local file (on Linux) calls ansible-playbook -i <inventory-only-with-localhost-file> <your-playbook>.yml; rc.local is almost the last script run at startup.
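A sketch of what such an /etc/rc.local could look like (paths, inventory, and playbook names are placeholders):

    #!/bin/sh -e
    # Provision this instance against itself at boot. The inventory file
    # /opt/provisioning/localhost.ini would contain a single line such as:
    #   localhost ansible_connection=local
    ansible-playbook \
      -i /opt/provisioning/localhost.ini \
      /opt/provisioning/site.yml >> /var/log/ansible-boot.log 2>&1
    exit 0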
You could just store that sensitive information in your EC2 AMI, but this is a very broad topic and really depends on what kind of sensitive information it is. (You can also use private git repositories to store sensitive data.)
If, for example, your playbooks get updated regularly, you can create a cron entry in your AMI that runs every so often and actually runs your playbook, so that your instance configuration is always up to date. That avoids having to "push" from a remote workstation.
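For instance, a cron entry along these lines (the interval and paths are placeholders) would keep re-applying the playbook:

    # Hypothetical /etc/cron.d entry: re-apply the local playbook every 30 minutes.
    */30 * * * * root ansible-playbook -i /opt/provisioning/localhost.ini /opt/provisioning/site.yml >> /var/log/ansible-cron.log 2>&1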
This is just one approach; there could be many others, and it depends on what kind of service you are running, what kind of data you are using, and so on.
I don't think you should use Ansible to configure new auto-scaled instances. Instead, use Ansible to configure a new image, create an AMI (Amazon Machine Image) from it, and have AWS autoscaling launch from that instead.
On top of this, you should also use Ansible to easily update your existing running instances whenever you change your playbook.
Alternatives
There are a few ways to do this. First, I wanted to cover some alternative ways.
One option is to use Ansible Tower. This creates a dependency though: your Ansible Tower server needs to be up and running at the time autoscaling or similar happens.
The other option is to use something like packer.io and build fully-functioning server AMIs. You can install all your code into these using Ansible. This doesn't have any non-AWS dependencies, and has the advantage that servers start up fast. Generally speaking, building AMIs is the recommended approach for autoscaling.
Ansible Config in S3 Buckets
The alternative route is a bit more complex, but has worked well for us when running a large site (millions of users). It's "serverless" and only depends on AWS services. It also supports multiple Availability Zones well, and doesn't depend on running any central server.
I've put together a GitHub repo that contains a fully-working example with Cloudformation. I also put together a presentation for the London Ansible meetup.
Overall, it works as follows:
Create S3 buckets for storing the pieces that you're going to need to bootstrap your servers.
Save your Ansible playbook and roles etc in one of those S3 buckets.
Have your Autoscaling process run a small shell script. This script fetches things from your S3 buckets and uses them to "bootstrap" Ansible.
Ansible then does everything else.
All secret values such as Database passwords are stored in CloudFormation Parameter values. The 'bootstrap' shell script copies these into an Ansible fact file.
So that you're not dependent on external services being up, you also need to save any build dependencies (e.g. any .deb files, package install files or similar) in an S3 bucket. You want this because you don't want to require ansible.com or similar to be up and running for your Autoscale bootstrap script to be able to run. Generally speaking, I've tried to depend only on Amazon services like S3.
In our case, we then also use AWS CodeDeploy to actually install the Rails application itself.
The key bits of the config relating to the above are:
S3 Bucket Creation
Script that copies things to S3
Script to copy and bootstrap Ansible. This is the core of the process. It also writes the Ansible fact files based on the CloudFormation parameters (a rough sketch of such a script follows this list).
Use the Facts in the template.
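To make the flow concrete, here is a rough sketch of the kind of bootstrap script described above. The bucket name, paths, fact layout, and the DB_PASSWORD/ENVIRONMENT variables are placeholders for illustration, not the actual code from the linked repo:

    #!/usr/bin/env bash
    # Runs on each new instance (e.g. from its user data) when autoscaling
    # launches it.
    set -euo pipefail

    # 1. Pull the playbook, roles and offline package dependencies from S3.
    aws s3 sync s3://my-bootstrap-bucket/ansible /opt/bootstrap/ansible

    # 2. Write CloudFormation parameters (assumed here to arrive as environment
    #    variables via user data) out as Ansible local facts for the playbook.
    mkdir -p /etc/ansible/facts.d
    printf '{"db_password": "%s", "environment": "%s"}\n' \
      "$DB_PASSWORD" "$ENVIRONMENT" > /etc/ansible/facts.d/cfn.fact

    # 3. Run the playbook against the local machine; Ansible does everything else.
    ansible-playbook -i localhost, -c local /opt/bootstrap/ansible/site.yml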