Meltano Kafka Monk.io integration - apache-kafka

I'm playing a bit with Monk.io and Kafka - Meltano integration.
So, I would like to create a Monk.io Kafka cluster and provision a new connection on Meltano.
I'm using https://github.com/lensesio/fast-data-dev for Kafka env.
What would be the best approach, and which makes the most sense?
I've planned to do it this way:
Create runnables for Kafka and Meltano and create Monk actions for the Meltano template.
Those actions would have a custom Meltano loader that would provide a pipe to Kafka.

I haven't tested the integration with Kafka, but it should be similar to Postgres in terms of execution in actions, i.e. adding a loader or extractor if those Kafka plugins exist for Meltano.
actions:
  defines: actions
  add:
    arguments:
      resource-type:
        type: string
      resource-name:
        type: string
    code: exec("meltano", "/bin/sh", "-c", `cd elt && meltano add ${args["resource-type"]} ${args["resource-name"]}`)
The workflow and template are abstracted here: https://github.com/monk-io/monk-dataops, so it's more a question of whether Meltano has a good Kafka plugin.
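For reference, here is a minimal sketch of what that action effectively executes inside the Meltano container. The plugin names are assumptions (tap-kafka and target-postgres appear on MeltanoHub, but check what fits your pipeline), and the helper function is hypothetical:

import subprocess

# Rough equivalent of the Monk action's exec() call; plugin names are assumptions.
def meltano_add(resource_type: str, resource_name: str) -> None:
    subprocess.run(
        ["meltano", "add", resource_type, resource_name],
        cwd="elt",  # the action cd's into the elt project first
        check=True,
    )

meltano_add("extractor", "tap-kafka")      # hypothetical Kafka extractor
meltano_add("loader", "target-postgres")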

Related

Redis Pub/Sub - ETL -> Postgres

I have a simple task:
Subscribe to messages on Redis channel
Transform message, e.g.
HASH: '<user_id>|<user_type>|<event_type>|...'
with items:
{ 'param_1': 'param_1_value', 'param_2': 'param_2_value', ... }
into tabular form:
user_id   | event_type   | param_1                | param_2                | ...
<user_id> | <event_type> | cleaned(param_1_value) | cleaned(param_2_value) | ...
Append to an existing table in Postgres
Additional context:
The scale of events is rather small
Refreshes must happen at most every ~15 minutes
Solution must be deployable on premises
Using something else as a queue than Redis is not an option
The best solution I came up with is to use Kafka, with the Kafka Redis Source Connector (https://github.com/jaredpetersen/kafka-connect-redis) and then the Kafka Postgres Sink Connector (https://github.com/ibm-messaging/kafka-connect-jdbc-sink). It seems reasonable, but the task looks like generic Redis-to-Postgres ETL, and I'm wondering if there really is no easier out-of-the-box solution out there.
You could just write a script and execute it via cron. But take a look at the Benthos project: you can easily run it on-prem, and what you describe can be done entirely via configuration for Redis -> Postgres.
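If you go the plain-script route, a minimal sketch could look like the following. It assumes redis-py and psycopg2 are available, that the pub/sub message carries the HASH key, and that the target table is called user_events; the channel name, connection settings, and cleaning logic are all placeholders:

import psycopg2
import redis

r = redis.Redis(host="localhost", port=6379)
pg = psycopg2.connect("dbname=events user=etl password=secret host=localhost")

def clean(value):
    # Placeholder for whatever cleaning the real pipeline needs.
    return value.strip() if isinstance(value, str) else value

def transform(hash_key, items):
    # '<user_id>|<user_type>|<event_type>|...' -> one row per message
    user_id, _user_type, event_type, *_ = hash_key.split("|")
    return (user_id, event_type, clean(items.get("param_1")), clean(items.get("param_2")))

pubsub = r.pubsub()
pubsub.subscribe("events")  # channel name is an assumption

for message in pubsub.listen():
    if message["type"] != "message":
        continue
    hash_key = message["data"].decode()
    items = {k.decode(): v.decode() for k, v in r.hgetall(hash_key).items()}
    with pg, pg.cursor() as cur:
        cur.execute(
            "INSERT INTO user_events (user_id, event_type, param_1, param_2) "
            "VALUES (%s, %s, %s, %s)",
            transform(hash_key, items),
        )

Note that pubsub.listen() blocks and consumes continuously; if you prefer the cron-style batch approach, you would instead drain a list or stream on each run.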

Debezium io with pulsar

I want to understand how pulsar uses debezium io connect for CDC.
While creating the source using pulsar-admin source create, how can I pass the broker URL and authentication params for the client, similar to what we do when using localrun?
The command I run:
bin/pulsar-admin source localrun --sourceConfigFile debezium-mysql-source-config.yaml --client-auth-plugin --client-auth-params --broker-service-url
Now I want to replace this to create a connector which runs in cluster mode.
Localrun is a special mode that simplifies debugging; it runs outside of the normal cluster and needs extra parameters to create the client for the local runtime.
In cluster mode, the connector gets its client from the Pulsar connectors runtime, through the function worker configuration. All you need to do is use "bin/pulsar-admin source create ...".

Dynamic creation of Kafka Connectors

I have deployed a Kafka cluster and a Kafka Connect cluster in kubernetes, using Strimzi and AKS. And I wanted to start reading from RSS resources to feed my Kafka cluster, so I created a connector instance of "org.kaliy.kafka.connect.rss.RssSourceConnector" which reads from a specific RSS feed, given an url, and writes to a specific topic. But my whole intention with this is to eventually have a Kafka Connect cluster able to manage a lot of external requests of new RSSs to read from; and here is where all my doubts come in:
Should I create an instance of the Kaliy RSS connector for each RSS feed? Or would it be better to implement my own connector, so that I create only one instance of it and, each time I want to read a new RSS feed, create a new Task in the connector?
Who should be responsible for ensuring the Kafka Connect cluster state is the desired one? I mean, if a Connector (in the case of 1 RSS feed : 1 Connector instance) stopped working, who should try to start it again? An external client via the Kafka Connect REST API? Kubernetes itself?
Right now, I think my best option is to rely on the Kafka Connect REST API, making the external client responsible for managing the state of the set of connectors, but I don't know if it was designed to receive as many requests as would be the case. Maybe it could be scaled by provisioning several listeners in the Kafka Connect REST API configuration, but I do not know.
Thanks a lot!
One of the main benefits in using Kafka Connect is to leverage a configuration-driven approach, so you will lose this by implementing your own Connector. In my opinion, the best strategy is to have one Connector instance for each RSS feed. Reducing the number of instances could make sense when having a single data source system, to avoid overloading it.
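As a sketch of the one-connector-per-feed approach, connectors can be created (or updated) through the Connect REST API; the kaliy connector's property names below are assumptions, so check its documentation, and the feed URLs and Connect address are placeholders:

import requests

CONNECT_URL = "http://kafka-connect:8083"

feeds = {
    "feed-a": "https://example.org/a.rss",
    "feed-b": "https://example.org/b.rss",
}

# One connector instance per RSS feed, registered via PUT /connectors/<name>/config.
for name, url in feeds.items():
    config = {
        "connector.class": "org.kaliy.kafka.connect.rss.RssSourceConnector",
        "rss.urls": url,         # property name is an assumption, check the connector docs
        "topic": f"rss-{name}",  # likewise
        "tasks.max": "1",
    }
    resp = requests.put(f"{CONNECT_URL}/connectors/rss-{name}/config", json=config)
    resp.raise_for_status()

PUT to /connectors/<name>/config creates the connector if it does not exist and updates it otherwise, so the script is safe to re-run.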
Using Strimzi Operator, Kafka Connect cluster would be monitored and it will try to restore the desired cluster state when needed. This does not include the single Connector instances and their tasks, but you may leverage the K8s API for monitoring the Connector custom resource (CR) status, instead of the REST API.
Example:
$ kubectl get kafkaconnector amq-sink -o yaml
apiVersion: kafka.strimzi.io/v1alpha1
kind: KafkaConnector
# ...
status:
  conditions:
  - lastTransitionTime: "2020-12-07T10:30:28.349Z"
    status: "True"
    type: Ready
  connectorStatus:
    connector:
      state: RUNNING
      worker_id: 10.116.0.66:8083
    name: amq-sink
    tasks:
    - id: 0
      state: RUNNING
      worker_id: 10.116.0.66:8083
    type: sink
  observedGeneration: 1
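The same status can be read programmatically through the Kubernetes API, for example with the Python client; the namespace and connector name below are placeholders:

from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Read the KafkaConnector custom resource and inspect its reported state.
cr = api.get_namespaced_custom_object(
    group="kafka.strimzi.io",
    version="v1alpha1",
    namespace="kafka",          # placeholder namespace
    plural="kafkaconnectors",
    name="amq-sink",
)
state = cr["status"]["connectorStatus"]["connector"]["state"]
print(state)  # e.g. RUNNING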
It might be late, but it could help anyone who passes by this question: it is also worth looking at the Kafka Connect CRs (Custom Resources) that are part of Confluent for Kubernetes (CFK), which introduce a clear-cut declarative way to manage and monitor connectors, with health checks and auto-healing.
https://www.confluent.io/blog/declarative-connectors-with-confluent-for-kubernetes/

ApacheBeam on FlinkRunner doesn't read from Kafka

I'm trying to run Apache Beam backed by a local Flink cluster in order to consume from a Kafka Topic, as described in the Documentation for ReadFromKafka.
The code is basically this pipeline and some other setup as described in the Beam Examples:
with beam.Pipeline() as p:
    lines = p | ReadFromKafka(
        consumer_config={'bootstrap.servers': bootstrap_servers},
        topics=[topic],
    ) | beam.WindowInto(beam.window.FixedWindows(1))
    output = lines | beam.FlatMap(lambda x: print(x))
    output | WriteToText(output)
Since I'm attempting to run on Flink, I followed this doc for Beam on Flink and did the following:
--> I downloaded the binaries for Flink 1.10 and followed these instructions to properly set up the cluster.
I checked the logs for the server and the task instance. Both were properly initialized.
--> Started Kafka using Docker, exposing it on port 9092.
--> Executed the following in the terminal
python example_1.py --runner FlinkRunner --topic myTopic --bootstrap_servers localhost:9092 --flink_master localhost:8081 --output output_folder
The terminal outputs
2.23.0: Pulling from apache/beam_java_sdk Digest: sha256:3450c7953f8472c2312148a2a8324b0115fd71b3a7a01a3b017f6de69d89dfe1 Status: Image is up to date for apache/beam_java_sdk:2.23.0 docker.io/apache/beam_java_sdk:2.23.0
But then, after writing some messages to myTopic, the terminal remains frozen and I don't see anything in the output folder. I checked flink-conf.yml, and given these two lines
jobmanager.rpc.address: localhost
jobmanager.rpc.port: 6123
I assumed that the port for the jobs would be 6123 instead of 8081 as specified in beam documentation, but the behaviour for both ports is the same.
I'm very new to Beam/Flink, so I'm not quite sure what the problem could be. I have two hypotheses as of now, but can't quite figure out how to investigate them:
1. Something related to the port Beam uses to communicate with Flink in order to send the jobs.
2. The Expansion Service for the Python SDK mentioned in the apache.beam.io.external.ReadFromKafka docs:
Note: To use these transforms, you need to start a Java Expansion Service. Please refer to the portability documentation on how to do that. Flink Users can use the built-in Expansion Service of the Flink Runner’s Job Server. The expansion service address has to be provided when instantiating the transforms.
But reading the portability documentation, it refers me back to the same doc for Beam on Flink.
Could someone, please, help me out?
Edit: I was writing to the topic using the Debezium Source Connector for PostgreSQL and seeing the behavior mentioned above. But when I tried writing to the topic manually, the application crashed with the following:
RuntimeError: org.apache.beam.sdk.util.UserCodeException: org.apache.beam.sdk.coders.CoderException: cannot encode a null byte[]
at org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:36)
You are doing everything correctly; the Java Expansion Service no longer needs to be started manually (see the latest docs). Also, Flink serves the web UI at 8081, but accepts job submission there just as well, so either port works fine.
It looks like you may be running into the issue that Python's TextIO does not yet support streaming.
Additionally, there is the complication that when running Python pipelines on Flink, the actual code runs in a docker image, and so if you are trying to write to a "local" file it will be a file inside the image, not on your machine.
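As a rough sketch of a pipeline that sidesteps the TextIO limitation by simply printing records (output goes to the SDK worker logs rather than a local file), with the Flink runner options passed explicitly; the option values and import path are based on recent Beam releases and your local setup, so adjust as needed:

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=FlinkRunner",
    "--flink_master=localhost:8081",
    "--streaming",
])

with beam.Pipeline(options=options) as p:
    (
        p
        | ReadFromKafka(
            consumer_config={"bootstrap.servers": "localhost:9092"},
            topics=["myTopic"],
        )
        | beam.Map(print)  # print instead of writing with TextIO
    )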

Safely give secret/token to Kafka Connector?

We are using Kafka Connectors (JDBC and others) and configuring them using the REST API (using curl in shell scripts). Right now, when testing/developing, we are including secrets (for the JDBC connector: database user/password) directly in the request. This is obviously bad, as those are then readily available for everybody to see when reading them out with a GET request.
Is there a good way to give secrets to the connectors? We can bring them in safely using environment variables or config files (injected from OpenShift), but is there a syntax available for that when starting a connector via the REST API?
EDIT: This is for the distributed mode of connectors; i.e., configuration by REST API, not connector config files...
A pluggable interface for this was implemented in Apache Kafka 2.0 through KIP-297. You can see more details in the documented example here.
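A sketch of how that can look in practice, assuming the Connect workers are configured with the FileConfigProvider from KIP-297 (worker properties shown in the comment) and the connector is registered over the REST API; the connector class, secrets file path, and property keys are placeholders:

import requests

# Assumed worker configuration (connect-distributed.properties):
#   config.providers=file
#   config.providers.file.class=org.apache.kafka.common.config.provider.FileConfigProvider
# The connector config then references keys in a secrets file instead of literal values.
connector_config = {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "connection.url": "jdbc:postgresql://db:5432/app",
    "connection.user": "${file:/opt/kafka/secrets/db.properties:user}",
    "connection.password": "${file:/opt/kafka/secrets/db.properties:password}",
    "topics": "my-topic",
}

resp = requests.put(
    "http://kafka-connect:8083/connectors/jdbc-sink/config",
    json=connector_config,
)
resp.raise_for_status()
# A GET on /connectors/jdbc-sink/config should now return the ${file:...}
# placeholders rather than the resolved secrets; resolution happens inside the worker.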