We are feeding events (logs) from Logstash to Apache Cassandra using the PerimeterX Cassandra Logstash output plugin. The plugin's throughput tops out at about 8K because it opens only 2 connections to Cassandra, whereas Cassandra itself can consume data at a much higher rate, and we expect the actual system to need a throughput of 30K or higher.
Here, throughput means the capacity to consume the incoming events, measured in units/sec.
Hence we plan to introduce Kafka in the middle, since the Logstash Kafka output reaches a throughput of 45K.
We have been following this Stack Overflow post for help. We were able to set up the connector JAR as described in the documentation, but there is no proper guide, and the current documentation is confusing and goes in circles about the configuration requirements. We don't see the plugin being called when Kafka is running with the target topic.
Some help with the correct configuration, or pointers to documentation on Cassandra keyspaces, would be appreciated.
After placing the JAR as described in the documentation, you need to run Kafka Connect, which will show all the configured connectors.
To start Kafka Connect in distributed mode, run the command below:
bin/connect-distributed.sh config/connect-distributed.properties
Kafka Connect exposes a REST API at http://localhost:8083.
You can configure your connectors through this REST API.
To register the connector, use the endpoint below:
POST /connectors – creates a new connector; the request body should be a JSON object containing a string name field and an object config field with the connector configuration parameters
A sample JSON for registering the connector is included in the kafka-connect-cassandra-sink-1.4.0.tar.gz file.
The official documentation provides a list of all endpoints.
More info available here
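As a rough illustration of the POST /connectors call described above, here is a minimal Python sketch. Only the shape of the body (a string name field plus a config object) and the default address come from the answer. The connector name, the topic name, and the connector class below are placeholders, not values from the sample JSON in the tarball, so substitute the ones from that sample. The keys connector.class, tasks.max and topics are the usual Kafka Connect sink settings.

```python
import json
import urllib.request

CONNECT_URL = "http://localhost:8083"  # default Kafka Connect REST address


def build_connector_request(name, connector_class, topics, extra_config=None):
    """Build the JSON body expected by POST /connectors."""
    config = {
        "connector.class": connector_class,
        "tasks.max": "1",
        "topics": topics,
    }
    # Connector-specific settings (contact points, mappings, ...) go here.
    config.update(extra_config or {})
    return {"name": name, "config": config}


def register_connector(body, base_url=CONNECT_URL):
    """Send the body to POST /connectors and return the decoded JSON reply."""
    request = urllib.request.Request(
        url=f"{base_url}/connectors",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# All three values are placeholders (assumptions), not taken from the sample JSON.
body = build_connector_request(
    name="cassandra-sink-example",
    connector_class="com.example.PlaceholderSinkConnector",
    topics="logstash-events",
)
print(json.dumps(body, indent=2))
```

Calling register_connector(body) against a running Connect worker would submit the request. The sketch only prints the body so that it can be compared with the sample JSON first.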
I am running Kafka Connect in distributed mode on Kubernetes with 3 sink connectors, Kafka -> S3.
When data flows into Kafka and at least one of the connectors has data to read, everything works fine.
But during periods when there is no data to read, for a few hours for example, and none of the connectors needs to read any data, all the connectors stop (the /connectors endpoint of the REST API shows an empty list). So when new data eventually comes in, it is not read unless the connectors are started manually.
Is this common behavior or am I missing something? I can add additional information about the setup if needed.
Based on the comments, your config.storage.topic was not created with cleanup.policy=compact, so Kafka is deleting the stored connector configurations once they have been idle long enough; it is not stopping idle connector tasks. Once the configs are deleted from the topic, the REST API no longer returns them from /connectors.
Refer to the documentation for the appropriate configuration of the internal Connect topics:
https://kafka.apache.org/documentation/#connect
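One way to avoid the problem is to create the config topic yourself, with compaction enabled, before starting the workers. The sketch below only assembles the kafka-topics.sh command line so it can be reviewed before running. The topic name connect-configs, the replication factor, and the broker address are assumptions, so use the value of config.storage.topic from your worker properties. The --bootstrap-server flag applies to newer Kafka releases; older ones take --zookeeper instead.

```python
def compacted_topic_command(topic, partitions, replication_factor,
                            bootstrap_server="localhost:9092"):
    """Return the argv list for creating a compacted topic with kafka-topics.sh."""
    return [
        "bin/kafka-topics.sh", "--create",
        "--bootstrap-server", bootstrap_server,
        "--topic", topic,
        "--partitions", str(partitions),
        "--replication-factor", str(replication_factor),
        # Compaction keeps the latest record per key instead of expiring by age.
        "--config", "cleanup.policy=compact",
    ]


# Assumed names and sizes; the Connect config topic uses a single partition.
command = compacted_topic_command("connect-configs", partitions=1, replication_factor=3)
print(" ".join(command))
```

The offset and status topics (offset.storage.topic and status.storage.topic) should be compacted as well, and can use more partitions.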
I have been testing Kafka Connect, but for every connector I have to read that connector's documentation to understand the configuration it needs. In the Kafka Connect API documentation I have found two APIs that return connector-related data:
GET /connector-plugins - return a list of connector plugins installed in the Kafka Connect cluster. Note that the API only checks for connectors on the worker that handles the request, which means you may see inconsistent results, especially during a rolling upgrade if you add new connector jars.
PUT /connector-plugins/{connector-type}/config/validate - validate the provided configuration values against the configuration definition. This API performs per config validation, returns suggested values and error messages during validation.
The rest of the APIs relate to connectors that have already been created. Is there any way to get the configuration for the required connectors?
Is there any way to get the configuration for the required connectors
The validate endpoint does exactly that, and is what the Landoop Kafka Connect UI uses to provide errors for missing/misconfigured properties.
How a property becomes required depends on the Importance level in the connector's configuration definition; for any configs that are not high importance, referring to the documentation or the source code (if available) would be best.
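To show how the validate response can be turned into a list of required and failing properties, here is a small Python sketch. It assumes the response shape used by the Connect REST API (a configs list whose entries have a definition object and a value object). The sample response is hand-written for illustration, with a placeholder connector class and a hypothetical example.endpoint property; it is not output from a real connector.

```python
def summarize_validation(response):
    """Collect required config names and per-config errors from a validate response."""
    required = []
    errors = {}
    for entry in response.get("configs", []):
        definition = entry.get("definition", {})
        value = entry.get("value", {})
        name = definition.get("name") or value.get("name")
        if definition.get("required"):
            required.append(name)
        if value.get("errors"):
            errors[name] = list(value["errors"])
    return {
        "error_count": response.get("error_count", 0),
        "required": required,
        "errors": errors,
    }


# Hand-written sample (assumption), trimmed to the fields the function reads.
sample_response = {
    "name": "com.example.PlaceholderSinkConnector",
    "error_count": 1,
    "configs": [
        {
            "definition": {"name": "topics", "required": False, "importance": "HIGH"},
            "value": {"name": "topics", "value": "logstash-events", "errors": []},
        },
        {
            "definition": {"name": "example.endpoint", "required": True, "importance": "HIGH"},
            "value": {"name": "example.endpoint", "value": None,
                      "errors": ["Missing required configuration."]},
        },
    ],
}

summary = summarize_validation(sample_response)
print(summary)
```

Submitting a config that contains only connector.class to the validate endpoint and passing the reply through a function like this is a quick way to see which properties a plugin insists on.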
We are currently on HDF (Hortonworks Dataflow) 3.3.1 which bundles Kafka 2.0.0 and are trying to use Kafka Connect in distributed mode to launch a Google Cloud PubSub Sink connector.
We are planning on sending back some metadata into a Kafka Topic and need to integrate a Kafka producer into the flush() function of the Sink task java code.
Would this have a negative impact on the process where Kafka Connect commits the offsets back to Kafka (as we would be adding the overhead of running a Kafka producer before the flush)?
Also, how does Kafka Connect get the Bootstrap servers list from the configuration when it is not specified in the Connector Properties for either the sink or the source? I need to use the same Bootstrap server list to start the producer.
Currently I am changing the config for the sink connector, adding bootstrap server list as a property and parsing it in the Java code for the connector. I would like to use bootstrap server list from the Kafka Connect worker properties if that is possible.
Kindly help on this.
Thanks in advance.
need to integrate a Kafka producer into the flush() function of the Sink task java code
There is no producer instance exposed in the SinkTask API...
Would this have a negative impact on the process where Kafka Connect commits the offsets back to Kafka (as we would be adding the overhead of running a Kafka producer before the flush)?
I mean, you can add whatever code you want. As far as negative impacts go, that's up to you to benchmark on your own infrastructure. Obviously, adding more blocking code makes the other processes slower overall.
how does Kafka Connect get the Bootstrap servers list from the configuration when it is not specified in the Connector Properties for either the sink or the source?
Sinks and sources are not workers. Look at connect-distributed.properties
I would like to use bootstrap server list from the Kafka Connect worker properties if that is possible
It's not possible. Adding extra properties to the sink/source configs is the only way. (Feel free to open a Kafka JIRA requesting a feature that exposes the worker configs, though.)
Using Telegraf plugins, there is a way to read data from InfluxDB and publish it to a Kafka topic.
But is there a way to read the data on demand and place it on a Kafka topic, i.e. driven by a query?
I can do a query-based read through the REST API (curl GET).
There are HTTP Listener plugins, but these only handle POST requests.
There is none for the GET method that would let me query a subset of data from InfluxDB and place it on a Kafka topic. In this case, Kafka would be the output plugin.
You can achieve this using Kapacitor's Kafka event handler. Kapacitor can be configured in either batch mode or streaming mode. In streaming mode, if the processing condition is met, the Kapacitor event handler processes the record immediately and sends it to the Kafka cluster. Please refer here for more details.
I am very new to Kafka and streaming data in general. What I am trying to do is ingest data that is sent to Kafka via HTTP. My research has brought me to the Confluent REST Proxy, but I can't get it to work.
What I currently have is Kafka running as a single node with a single broker, plus kafkamanager, in Docker containers.
Unfortunately I can't run the full Confluent Platform with Docker since I don't have enough memory available on my machine.
In essence my question is: how do I set up a development environment where data is ingested by Kafka through HTTP?
Any help is highly appreciated!
You don't need the "full Confluent Platform" (with KSQL, Control Center, and so on).
Zookeeper, Kafka, the REST proxy, and optionally the Schema Registry should together take up at most 4 GB of RAM. If you don't even have that, then you'll need to go buy more RAM.
Note that Zookeeper and Kafka do not need to be running on the same machines as the Schema Registry or REST proxy, so if you have multiple machines, then you can save some resources that way as well.
To run one Kafka broker, Zookeeper and the Schema Registry, 1 GB is usually enough (in dev).
If for some reason you do not want to use the Confluent REST proxy, you can write your own. It's quite straightforward: "on request, parse your incoming JSON, validate data, construct your message (in Avro?) and produce it to Kafka".
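The "parse, validate, construct, produce" flow can be sketched with the Python standard library alone. This is only an outline under stated assumptions: the required fields, the topic name events, and the port are hypothetical, and produce stands in for whatever Kafka client you use (here it is a stub that records the messages, so nothing is actually sent to Kafka). Messages are encoded as JSON rather than Avro to keep the sketch self-contained.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUIRED_FIELDS = ("sensor", "value")  # hypothetical schema for incoming events
TOPIC = "events"                       # hypothetical topic name


def handle_payload(raw_body, produce):
    """Parse and validate one JSON payload, then hand it to the producer callable."""
    try:
        record = json.loads(raw_body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400, "body is not valid JSON"
    if not isinstance(record, dict):
        return 400, "body must be a JSON object"
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        return 422, "missing fields: " + ", ".join(missing)
    produce(TOPIC, json.dumps(record).encode("utf-8"))
    return 202, "accepted"


def make_handler(produce):
    """Build a request handler class bound to the given producer callable."""
    class IngestHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            status, message = handle_payload(self.rfile.read(length), produce)
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(message.encode("utf-8"))
    return IngestHandler


# Stub producer (assumption): replace with a call into a real Kafka client.
sent = []


def fake_produce(topic, value):
    sent.append((topic, value))


print(handle_payload(b'{"sensor": "a1", "value": 3}', fake_produce))
print(handle_payload(b"not json", fake_produce))

# To serve requests (blocks forever), with a real producer plugged in:
# HTTPServer(("localhost", 8080), make_handler(fake_produce)).serve_forever()
```

Keeping the validation in a plain function, separate from the HTTP handler, makes it easy to test without starting a server or a broker.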
In this article, you'll find some configuration for limiting the heap memory used by Kafka and ZK: https://medium.com/#saabeilin/kafka-hands-on-part-i-development-environment-fc1b70955152
Here you can read how to produce/consume messages with Python:
https://medium.com/#saabeilin/kafka-hands-on-part-ii-producing-and-consuming-messages-in-python-44d5416f582e
Hope these help!