What are the different ways to get Kafka Cluster Audit log to GCP Logging? - apache-kafka

Can anyone share more information on how I can achieve this?
Thank you!

Assuming you have access to the necessary topic (from what I understand, the audit topic is not stored on your own cluster), you need a consumer to get the data out of Kafka. The consumer can be written in any language.
To get the data into Cloud Logging, you need to use its API.
In other words, you can use any pair of Kafka client and Cloud Logging client that you are comfortable with.
For example, you could write or find a Kafka Connect Sink connector that wraps the Java Cloud Logging client.
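To make the Cloud Logging half more concrete, here is a minimal sketch that wraps one Kafka record value in a request for the Cloud Logging REST method entries.write, using only the JDK. It calls the REST API directly rather than the Java Cloud Logging client mentioned above. The class name AuditLogForwarder, the "global" resource type, and the use of textPayload are assumptions for illustration; the Kafka consumer loop and obtaining the OAuth 2.0 access token are left out, and the log ID is assumed to need no URL-encoding.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Sketch (hypothetical helper): turn one Kafka record value into an entries.write request. */
final class AuditLogForwarder {

    // Cloud Logging REST endpoint for writing log entries.
    static final String WRITE_URL = "https://logging.googleapis.com/v2/entries:write";

    /** Escapes a string so it can be embedded inside a JSON string literal. */
    static String jsonEscape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
                    else out.append(c);
            }
        }
        return out.toString();
    }

    /** Builds the request body, carrying the record value as a textPayload entry. */
    static String writeBody(String projectId, String logId, String recordValue) {
        return "{\"logName\":\"projects/" + jsonEscape(projectId)
             + "/logs/" + jsonEscape(logId) + "\","
             + "\"resource\":{\"type\":\"global\"},"  // assumption: generic resource type
             + "\"entries\":[{\"textPayload\":\"" + jsonEscape(recordValue) + "\"}]}";
    }

    /** Builds (but does not send) the HTTP request; accessToken is an OAuth 2.0 bearer token. */
    static HttpRequest writeRequest(String accessToken, String body) {
        return HttpRequest.newBuilder()
                .uri(URI.create(WRITE_URL))
                .header("Authorization", "Bearer " + accessToken)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }
}
```

In a real forwarder, each value returned by the Kafka consumer's poll loop would be passed to writeBody, and the resulting request would be sent with java.net.http.HttpClient. Batching several entries per request would likely be worthwhile for anything beyond a trickle of audit events.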

Related

Spring Cloud Data Flow Kafka Source

I am new to Spring Cloud Data Flow and need to listen for messages on a topic from an external Kafka cluster. This external Kafka topic in Confluent Cloud would be my Source, whose messages I need to pass on to my Sink application.
I am also using Kafka as my underlying message broker, which is a separate Kafka instance deployed on Kubernetes. I'm just not sure what the best approach is to connect to this external Kafka instance. Is there an existing Kafka Source app that I can use, or do I need to create my own Source application to connect to it? Or is it just some kind of configuration that I need to set up to get connected?
Any examples would be helpful. Thanks in advance!

Kafka to BigQuery, best way to consume messages

I need to load messages into my BigQuery tables and I want to know the best way to consume those messages.
My Kafka servers run on AWS and produce Avro messages, and from what I have seen, Dataflow needs to receive messages in JSON format. I googled and found an article explaining how to send messages to Pub/Sub, but in the architectures I have seen, a Kafka VM is created on GCP to produce the messages.
What I need to know is:
Is it possible to receive Avro messages in Pub/Sub from external Kafka servers, deserialize them using my schema, send them to Dataflow, and finally write them to BigQuery tables?
Or do I need to create a Kafka VM and use it to consume messages from the external servers?
This might seem a bit confusing, but that is how I feel right now. The main goal is to get messages from Kafka (Avro format) on AWS into BigQuery tables. Any suggestions are very welcome.
Thanks a lot in advance.
The Kafka Connect BigQuery Connector may be exactly what you need. It is a Kafka sink connector that exports messages from Kafka directly to BigQuery. Its README provides detailed configuration instructions, including how to point the connector at your Kafka topics and how to specify the destination BigQuery table. The connector should be able to retrieve the Avro schema automatically from your Kafka setup.

Kafka design questions - Kafka Connect vs. own consumer/producer

I need to understand when to use Kafka Connect vs. a consumer/producer written by our own developers. We are getting Confluent Platform. Also, to achieve a fault-tolerant design, do we have to run the consumer/producer code (jar file) from all the brokers?
Kafka Connect is typically used to connect external systems to Kafka, i.e. to move data from external sources into Kafka or from Kafka into external sinks.
Anything that you can do with a connector can also be done with a Producer + Consumer.
Readily available connectors simply make it easier to connect external systems to Kafka without requiring the developer to write the low-level code.
Some points to remember:
If the source and sink are both the same Kafka cluster, a connector doesn't make sense.
If you are doing change data capture (CDC) from a database and pushing the changes to Kafka, you can use a database source connector.
Resource constraints: Kafka Connect is a separate process, so double-check the trade-off between resources and ease of development.
Writing your own connector is fine, unless someone has already written one. If you are using third-party connectors, you need to check how well they are maintained and whether support is available.
do we have to run the consumer/producer code (jar file) from all the brokers?
Don't run client code on the brokers. Let all memory and disk access be reserved for the broker process.
when to use Kafka Connect vs. own consumer/producer
In my experience, these factors should be taken into consideration:
You're planning on deploying and monitoring Kafka Connect anyway, and have the available resources to do so. Again, these don't run on the broker machines.
You don't plan on changing the connector code very often, because you must restart the whole Connect JVM, which may be running other connectors that don't need to be restarted.
You aren't able to integrate your own producer/consumer code into your existing applications, or would simply rather have a simpler produce/consume loop.
Having structured data that is not tied to a particular binary format is preferred.
Your own connector, or the community connector you use, is well tested and configurable for your use cases.
Connect has limited options for fault tolerance compared to the raw producer/consumer APIs, which in turn have the drawback of requiring more code, depending on the other libraries being used.
Note: Confluent Platform is still the same Apache Kafka.
Kafka Connect:
Kafka Connect is an open-source framework with two types of connectors: Source and Sink. It is used to move data from a database into Kafka or from Kafka into a database, and more generally helps integrate various other systems with Kafka. It also helps in tracking changes from databases into Kafka (change data capture (CDC), as mentioned in one of the other answers). The framework maintains offsets so that it can resume reading or writing from a particular offset.
For more details, you can refer to https://docs.confluent.io/current/connect/index.html
The Producer/Consumer:
Producers and consumers are just client applications that write records to and read records from Kafka topics. They are used where we want to deliver the data to various consumers, for example the members of a consumer group. Kafka also tracks the offsets and lag of each consumer group.
No, you don't need to run any producer/consumer while running Kafka Connect. If you want to check that there is no data loss, you can run a consumer while running source connectors. In the case of sink connectors, the already produced data can be verified in your database by running the appropriate select queries.

Ingest Streaming Data to Kafka via http

I am very new to Kafka and streaming data in general. What I am trying to do is ingest data that is sent via HTTP into Kafka. My research has brought me to the Confluent REST Proxy, but I can't get it to work.
What I currently have is Kafka running as a single node with a single broker, plus kafkamanager, in Docker containers.
Unfortunately, I can't run the full Confluent Platform with Docker since I don't have enough memory available on my machine.
In essence, my question is: how do I set up a development environment where data is ingested into Kafka through HTTP?
Any help is highly appreciated!
You don't need the "full Confluent Platform" (KSQL, Control Center, etc.).
ZooKeeper, Kafka, the REST Proxy, and optionally the Schema Registry should together take up to 4 GB of RAM at most. If you don't even have that, then you'll need to go buy more RAM.
Note that Zookeeper and Kafka do not need to be running on the same machines as the Schema Registry or REST proxy, so if you have multiple machines, then you can save some resources that way as well.
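Once the REST Proxy is running, producing a JSON message is a single HTTP POST to /topics/{topic} with the v2 JSON content type. The sketch below only builds that request with the JDK's HTTP client classes; the class name RestProxyProduce, the base URL, and the topic name are placeholders, and 8082 is assumed as the proxy port because it is the usual default.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Sketch (hypothetical helper): build a produce request for the REST Proxy v2 API. */
final class RestProxyProduce {

    /** Wraps an already-serialized JSON value in the "records" envelope the proxy expects. */
    static String body(String jsonValue) {
        return "{\"records\":[{\"value\":" + jsonValue + "}]}";
    }

    /** Builds (but does not send) POST {proxyBaseUrl}/topics/{topic}. */
    static HttpRequest produceRequest(String proxyBaseUrl, String topic, String jsonValue) {
        return HttpRequest.newBuilder()
                .uri(URI.create(proxyBaseUrl + "/topics/" + topic))
                .header("Content-Type", "application/vnd.kafka.json.v2+json")
                .POST(HttpRequest.BodyPublishers.ofString(body(jsonValue)))
                .build();
    }
}
```

For example, produceRequest("http://localhost:8082", "events", "{\"id\":1}") can be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()); the response body should report the partition and offset of each record, or an error code.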
To run one Kafka broker, ZooKeeper, and the Schema Registry, 1 GB is usually enough (in dev).
If for some reason you do not want to use the Confluent REST Proxy, you can write your own. It's quite straightforward: "on request, parse your incoming JSON, validate data, construct your message (in Avro?) and produce it to Kafka".
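A rough sketch of such a home-grown endpoint, using only the HTTP server bundled with the JDK, might look like the following. The class name HttpIngest, the /ingest path, and the status codes are illustrative choices, the validation is a deliberately naive placeholder, and the sink parameter is a stand-in: in a real service it would wrap a Kafka producer's send call.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

/** Sketch (hypothetical): a tiny HTTP endpoint that forwards request bodies to a sink. */
final class HttpIngest {

    /** Validates a request body and forwards it; returns the HTTP status to send back. */
    static int ingest(String body, Consumer<String> sink) {
        String trimmed = body == null ? "" : body.trim();
        // Placeholder validation: only checks that the body looks like a JSON object.
        if (!trimmed.startsWith("{") || !trimmed.endsWith("}")) {
            return 400;
        }
        sink.accept(trimmed); // real code: hand the message to a Kafka producer here
        return 202;
    }

    /** Starts a server that accepts POST /ingest and hands valid bodies to the sink. */
    static HttpServer start(int port, Consumer<String> sink) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/ingest", exchange -> {
            try {
                int status = 405; // anything other than POST is rejected
                if ("POST".equals(exchange.getRequestMethod())) {
                    String body = new String(
                            exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
                    status = ingest(body, sink);
                }
                exchange.sendResponseHeaders(status, -1); // -1: no response body
            } finally {
                exchange.close();
            }
        });
        server.start();
        return server;
    }
}
```

Keeping the validate-and-forward step in a plain method (ingest) makes it easy to unit test without starting the server or a broker.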
In this article, you'll find some configuration to limit the heap memory used by Kafka and ZooKeeper: https://medium.com/#saabeilin/kafka-hands-on-part-i-development-environment-fc1b70955152
Here you can read how to produce/consume messages with Python:
https://medium.com/#saabeilin/kafka-hands-on-part-ii-producing-and-consuming-messages-in-python-44d5416f582e
Hope these help!

Listen to a topic continuously, fetch data, perform some basic cleansing

I need to build a Java-based Kafka streaming application that will listen to a topic X continuously, fetch data, perform some basic cleansing, and write to an Oracle database. The Kafka cluster is outside my domain, and I have no ability to deploy any code or configurations in it.
What is the best way to design such a solution? I came across Kafka Streams but was confused as to whether it can be used for 'Topic > Process > Topic' scenarios?
I came across Kafka Streams but was confused as to whether it can be used for 'Topic > Process > Topic' scenarios?
Absolutely.
For example, excluding the "process" step, it's two lines outside of the configuration setup.
final StreamsBuilder builder = new StreamsBuilder();
builder.stream("streams-plaintext-input").to("streams-pipe-output");
This code is straight from the documentation.
If you want to write to any database, you should first check if there is a Kafka Connect plugin to do that for you. Kafka Streams shouldn't really be used to read/write from/to external systems outside of Kafka, as it is latency-sensitive.
In your case, the JDBC Sink Connector would work well.
The Kafka cluster is outside my domain, and I have no ability to deploy any code or configurations in it.
Using either solution above, you don't need to, but you will need some machine with Java installed to run a continuous Kafka Streams application and/or Kafka Connect worker.