How to connect druid ( standalone Not cluster ) to superset and explore data on the dashboard? - druid

I have superset up and running and I also have a druid service and Kafka broker running. I want to know how to connect druid to superset so I can explore and query the data that I have in druid. I feel like you can only connect a druid cluster, not a standalone or I am just not understanding all the configurations when I try to add a druid cluster on the superset UI such as broker host and port. Any Idea?

Follow following steps:
Step 1: On Superset web ui, go to Sources -> Druid Clusters, add a new druid cluster and druid coordinator and broker information.
Step 2: Ingest the data to druid, following steps at druid.apache.org/docs/latest/tutorials/tutorial-kafka.html
Step 3: After ingesting the data ( both supervisor spec and data ingestion to druid using Kafka), Go to Sources -> Refresh Druid Metadata, you will find the data source you are trying to add. Select that data source for visualization.

i was really struggling with this one... in my case, i am running on my mac. druid is installed directly and superset is running in docker (using the docker-compose setup).
the key is that localhost in the superset docker container doesn't point at the host, but instead at the docker container.
when i changed the superset datasource to this, it worked...
druid://host.docker.internal:8082/druid/v2/sql/

Related

Running Source Connector on Demand and Not Based on poll.interval.ms

I have a table that is updated once / twice a day, but I want the data to be pushed to Kafka immediately after the table is updated. Is it possible to avoid running the connector every poll.interval.ms, but rather to run it only after the table is updated (sync on demand or trigger the sync in some other way after the table update)
I apologize if this question is stupid... Can sink connector be running on one Kafka cluster, but pull messages from another Kafka cluster and insert them into Postgres. I'm not talking about replicating messages from Cluster A to Cluster B and then inserting messages from Cluster B to Postgres. I'm talking about Connector running on Cluster B but pulling messages from Cluster A and writing them to Postgres.
Thanks!
If you use log-based change data capture (Debezium, etc) then you capture changes as soon as they are there, without needing to re-query the database. If you use query-based CDC then you do have to query the database on a polling interval. For query-based vs log-based CDC see this blog or talk.
One option would be to use the Kafka Connect REST API to control the connector - but you're kind of going against the streaming paradigm here and will start to find awkward edges in doing this. For example, when do you decide to pause the connector? How do you determine that it's ingested all the changes? etc.
Using log-based CDC is low-impact on the source system and commonly the route that people go.
Kafka Connect does not run on your Kafka cluster. Kafka Connect runs as its own cluster. Physically, it can be co-located for purposes of dev/sandbox environment (this ref arch is useful for production). See also this talk "Running Kafka Connect".
So in your example, "Cluster B" is actually a Kafka Connect cluster - and it would be configured to read from Kafka cluster "A", and that is fine.

DC replication via Kafka Connect for Apache Ignite

We are trying to implement the replication of data between two Ignite data clusters.
For this purpose, we are using Kafka Connect.
We have followed the things mentioned in this document -> https://dzone.com/articles/linking-apache-ignite-and-apache-kafka-for-highly
Everything is working fine till we use one cache and PUT operation.
But when I use the same for REMOVED operation, in the consumer thread of the connector, I can see the CacheEvent record, but the data is not being removed from the Sink Cluster nodes.
Can someone please help with this case?
It might be an issue with the Ignite Kafka integration. Try to collect all the relevant details and report to the Ignite community via the user list.
In the meantime, if the issue is not solved you can consider other replication options:
Replication via the certified GridGain Kafka Connect integration.
Datacenter replication feature by GridGain.

Configuring Kafka connect Postgress Debezium CDC plugin

I am trying to use kafka connect to read changes in postgress DB.
I have Kafka running on my local system and i want to use the Kafka connect API in standalone mode to read the postgress server DB changes.
connect-standalone.sh connect-standalone.properties dbezium.properties
i would appreciate if someone can help me with setting up configuration properties for CDC postgress debezium connector
https://www.confluent.io/connector/debezium-postgresql-cdc-connector/
I am following the below to construct the properties
https://debezium.io/docs/connectors/postgresql/#how-the-postgresql-connector-works
The name of the Kafka topics takes by default the form
serverName.schemaName.tableName, where serverName is the logical name
of the connector as specified with the database.server.name
configuration property
and here is what i have come up with for dbezium.properties
name=cdc_demo
connector.class=io.debezium.connector.postgresql.PostgresConnector
tasks.max=1
plugin.name=wal2json
slot.name=debezium
slot.drop_on_stop=false
database.hostname=localhost
database.port=5432
database.user=postgress
database.password=postgress
database.dbname=test
time.precision.mode=adaptive
database.sslmode=disable
Lets say i create a PG schema name as demo and table name as suppliers
So i need to create a topic with name as test.demo.suppliers so that this plugin can push the data to?
Also can someone suggest a docker image which has the postgress server + with suitable replication plugin such as wal2json etc? i am having hard time configuring postgress and the CDC plugin myself.
Check out the tutorial with associated Docker Compose and sample config.
The topic you've come up with sounds correct, but if you have your Kafka broker configured to auto-create topics (which is the default behaviour IIRC) then it will get created for you and you don't need to pre-create it.

Kafka connect cluster setup or launching connect workers

I am going through kafka connect, and i am trying to get the concepts.
Let us say I have kafka cluster (nodes k1, k2 and k3) setup and it is running, now i want to run kafka connect workers in different nodes say c1 and c2 in distributed mode.
Few questions.
1) To run or launch kafka connect in distributed mode I need to use command ../bin/connect-distributed.sh, which is available in kakfa cluster nodes, so I need to launch kafka connect from any one of the kafka cluster nodes? or any node from where I launch kafka connect needs to have kafka binaries so that i will be able to use ../bin/connect-distributed.sh
2) I need to copy the my connector plugins to any kafka cluster node( or to all cluster nodes?) from where I do the step 1?
3) how does kafka copies these connector plugins to worker node before starting jvm process on the worker node? because the plugin is the one which has my task code and it needs to be copied to worker in order to start the process in worker.
4) Do i need to install anything in connect cluster nodes c1 and c2, like need to install java or any kafka connect related?
5) In some places it says use confluent platform but i would like to start it with apache kafka connect alone first.
can some one please through some light or even pointer to some resources would also help.
Thank you.
1) In order to have a highly available kafka-connect service you need to run at least two instances of connect-distributed.sh on two distinct machines that have the same group.id. You can find more details regarding the configuration of each worker here. For improved performance, Connect should be ran independently of the broker and Zookeeper machines.
2) Yes, you need to place all your connectors under plugin.path (normally under /usr/share/java/) on every machine that you are planning to run kafka-connect.
3) kafka-connect will load the connectors on startup. You don't need to handle this. Note that if your kafka-connect instance is running and a new connector is added, you need to restart the service.
4) You need to have Java installed on all your machines. For Confluent Platform particularly:
Java 1.7 and 1.8 are supported in this version of Confluent Platform
(Java 1.9 is currently not supported). You should run with the
Garbage-First (G1) garbage collector. For more information, see the
Supported Versions and Interoperability.
5) It depends. Confluent was founded by the original creators of Apache Kafka and it comes as a more complete distribution adding schema management, connectors and clients. It also comes with KSQL which is quite useful if you need to act on certain events. Confluent simply adds on top of the Apache Kafka distribution, it's not a modified version.
Answer given by Giorgos is correct. I ran few connectors and now I understand it better.
I am just trying to put it differently.
In Kafka connect there are two things involved one is Worker and second is connector.Below is on details about running distributed Kafka connect.
Kafka connect Worker is a Java process on which the connector/connect task will run. So first thing is we need to launch worker, to run/launch a worker we need java installed on that machine then we need Kafka connect related sh/bat files to launch worker and kafka libs which will be used by kafka connect worker, for this we will just simply copy/install Kafka in the worker machine, also we need to copy all the connector and connect-task related jars/dependencies in "plugin.path" as defined in the below worker properties file, now worker machine is ready, to start worker we need to invoke ./bin/connect-distributed.sh ./config/connect-distributed.properties, here connect-distributed.properties will have configuration for worker. The same thing has to be repeated in each machine where we need to run Kafka connect.
Now the worker java process is running in all machines, the woker config will have group.id property, the workers which have this same property value will be forming a group/cluster of workers.
Each worker process will expose rest endpoint (default http://localhost:8083/connectors), to launch/start a connector on the running workers, we need do http-post a connector config json, based on the given config the worker will start the connector and the number of tasks in the above group/cluster workers.
Example: Connect post,
curl -X POST -H "Content-Type: application/json" --data '{"name": "local-file-sink", "config": {"connector.class":"FileStreamSinkConnector", "tasks.max":"3", "file":"test.sink.txt", "topics":"connect-test" }}' http://localhost:8083/connectors

Monitoring of dynamic(scale up/down) kafka cluster

We are using kafka cluster and we want monitor this cluster.
Current our approach:
collect jmx metrics (telegraf jolokia plugin or jmxtrans)
store into influx
render via grafana
But we faced with problem of static configuration of kafka broker's list.
We can configure only static list via Jmxtrans or Telegraf jolokia plugin.
We would like dynamic list (example configure only kafka cluster zookeeper connection for getting brokers list in each iteration of collecting metrics) for case scaling up/down of kafka.
May be do exist another instruments for monitoring dynamic kafka cluster?
My finally solution next:
Custom bash input plugin (because i don't want build own telegraf with one custom go input plugin and I know not very well go language now:) )
In bash plugin zkCli is used for all kafka nodes discovery
Next bash script post bulk jolokia requet to each kafka node, aggregates and format to influx plugin.