PDI transformation does not send messages to Kafka server

I have a transformation in Pentaho Data Integration (PDI) that queries NetSuite, builds a JSON string for each row, and finally sends those strings to Kafka. This is the transformation:
When I test the transformation against my local Kafka broker, it works like a charm, as you can see below:
The problem comes when I replace the connection parameters with those of an AWS EC2 instance where I also have Kafka running. The transformation reports no errors, but the messages never reach Kafka, as can be seen here:
This is the configuration of the Kafka Producer step of the transformation:
The strange thing is that, although no messages are delivered, the step does seem to connect to the server, because the dropdown is populated with the names of the topics I have:
In addition, this error is observed in the PDI terminal:
ERROR [NamedClusterServiceLocatorImpl] Could not find service for interface org.pentaho.hadoop.shim.api.jaas.JaasConfigService associated with named cluster null
This doesn't make sense to me, because I'm using a direct connection and not a connection to a Hadoop cluster.
So I wanted to ask this community whether anyone has used PDI to send messages to Kafka, and whether any configuration was needed in PDI or in Kafka to make it work, since I can't think of what could be happening.
Thanks in advance for any ideas or comments to help me solve this!

Related

Kafka to BigQuery, best way to consume messages

I need to get messages into my BigQuery tables, and I want to know the best way to consume them.
My Kafka servers run on AWS and produce Avro messages, and from what I saw, Dataflow needs to receive messages in JSON format. So I googled and found an article explaining how to get messages into Pub/Sub, but in every example of that architecture I've seen, a Kafka VM is created on GCP to produce the messages.
What I need to know is:
Is it possible to receive Avro messages in Pub/Sub from external Kafka servers, deserialize them using my schema, send them through Dataflow, and finally write them to BigQuery tables?
Or do I need to create a Kafka VM and use it to consume messages from external servers?
This might seem a bit confusing, but it is where I am right now. The main goal is to get messages from Kafka (Avro format) on AWS into BigQuery tables. If you have any suggestions, they are very welcome.
Thanks a lot in advance
The Kafka Connect BigQuery Connector may be exactly what you need. It is a Kafka sink connector that exports messages from Kafka directly to BigQuery. The README page provides detailed configuration instructions, including how to point the connector at your Kafka topics and how to enter the information for the destination BigQuery table. This connector should also be able to retrieve the Avro schema for your records automatically.
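As a rough sketch only (I'm assuming the wepay/Confluent BigQuery sink connector and a Schema Registry holding the Avro schemas; property names vary between connector versions, and every name, path and URL below is a placeholder), the connector config posted to the Connect REST API would look something like this:

POST http://localhost:8083/connectors
{
  "name": "bigquery-sink",
  "config": {
    "connector.class": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
    "tasks.max": "1",
    "topics": "my-avro-topic",
    "project": "my-gcp-project",
    "defaultDataset": "my_dataset",
    "keyfile": "/path/to/service-account.json",
    "autoCreateTables": "true",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://schema-registry:8081",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}

With the AvroConverter in place, the connector receives the records already deserialized against the registered schema, so no separate JSON conversion step is needed before BigQuery.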

Flink doesn't consume Data from Kafka publisher

What I have: http://prntscr.com/szmkn4
That's the most bare-bones version of it. More is going to come later, but for now the data is arriving properly in my consumer in the form of a JSON string.
I want to put it into a Flink table, which I create with this statement: http://prntscr.com/szmll3
I then check that it got created, just to be sure, and get this: http://prntscr.com/szmn79
Next I want to turn it on and check my data with "SELECT * FROM RawData", and I get the following error:
[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.kafka.shaded.org.apache.kafka.common.config.ConfigException: No resolvable bootstrap urls given in bootstrap.servers
I assume it's an issue with how I created my table, but am honestly not sure where/what/how.
My publisher's properties in NiFi are:
https://prnt.sc/szoe6z
and
http://prntscr.com/szoeka
If you need any additional information from me, feel free to ask.
Thanks in advance,
Psy
[ERROR] Could not execute SQL statement. Reason: org.apache.flink.kafka.shaded.org.apache.kafka.common.config.ConfigException: No resolvable bootstrap urls given in bootstrap.servers
That likely means that the Kafka bootstrap servers you've specified to Flink cannot have their hostnames resolved on the Flink servers. You'd know if it's a NiFi issue because you'd see errors in the NiFi flow saying it couldn't produce to Kafka. It might be producing to the wrong topic or even the wrong set of brokers if you have multiple Kafka clusters, but the error you posted isn't a NiFi issue.
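As an illustration only (the exact options depend on your Flink version, and the columns, topic and address below are made-up placeholders), a JSON-over-Kafka source table in Flink SQL 1.11+ looks roughly like this, and the value of properties.bootstrap.servers must be resolvable and reachable from every Flink node:

CREATE TABLE RawData (
  sensor_id STRING,
  reading DOUBLE,
  event_time TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'raw-data',
  'properties.bootstrap.servers' = '10.0.0.12:9092',
  'properties.group.id' = 'flink-rawdata',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

If the brokers advertise a hostname that only NiFi can resolve, either add that hostname on the Flink machines or point the table at an address that resolves everywhere.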

Kafka Connector - distributed - load balancing tasks

I am running a development environment for Confluent Kafka Community edition on Windows, version 3.0.1-2.11.
I am trying to achieve load balancing of tasks between two instances of a connector. I am running Kafka's ZooKeeper, the broker, the REST services, and two instances of Connect distributed, all on the same machine.
The only difference between the workers' properties files is the REST port, since they run on the same machine.
I don't create the topics for connector offsets, config, and status. Should I?
I have custom code for the sink connector.
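For reference, a minimal sketch of the two distributed worker configs as described (the values are typical placeholders, not my actual settings); both share the same group.id so that they form one Connect cluster:

# connect-distributed-1.properties
bootstrap.servers=localhost:9092
group.id=my-connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.topic=connect-offsets
config.storage.topic=connect-configs
status.storage.topic=connect-status
rest.port=8083

# connect-distributed-2.properties is identical except for:
rest.port=8084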
When I create a worker for my sink connector, I do it by executing a POST request
POST http://localhost:8083/connectors
against either of the running Connect instances. Checking whether the worker is loaded is done at
GET http://localhost:8083/connectors
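The body of that POST looks roughly like this (the connector class, topic and task count are placeholders, not my exact values):

POST http://localhost:8083/connectors
{
  "name": "my-sink",
  "config": {
    "connector.class": "com.example.MySinkConnector",
    "tasks.max": "2",
    "topics": "my-topic"
  }
}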
My sink connector has System.out.println() lines in its code, so I can follow its output in the console log.
When my worker is running, I can see that only one connector instance is executing the code. If I terminate that instance, the other one takes over the worker and execution resumes. However, this is not what I want.
My goal is for both connector instances to run worker code so that they share the load between them.
I've gone over some open-source connectors to see whether there is anything specific about how connector code should be written, but without success.
I've made several different attempts to tackle this problem, also without success.
I could rewrite my business code to work around this, but I'm pretty sure I'm missing something that isn't obvious to me.
Recently I commented on Robin Moffatt's answer to this question.
From the sounds of it, your custom code is not correctly spawning the number of tasks that you are expecting.
Make sure that you've set tasks.max > 1 in your connector config.
Make sure that your connector's taskConfigs() method actually returns that number of task configurations; a sketch follows below.
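A minimal sketch of what that looks like in a custom sink connector (class names and the task.id key are illustrative, not taken from your code): taskConfigs(maxTasks) receives the tasks.max value, and the Connect runtime then distributes the returned task configs across all workers in the group.

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.sink.SinkConnector;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class MySinkConnector extends SinkConnector {
    private Map<String, String> configProps;

    @Override
    public void start(Map<String, String> props) {
        configProps = props;
    }

    @Override
    public Class<? extends Task> taskClass() {
        return MySinkTask.class;
    }

    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        // maxTasks is the tasks.max value from the connector config.
        // Returning a single map here is the classic reason only one worker does any work:
        // one config map == one task, and each task runs on exactly one worker.
        List<Map<String, String>> configs = new ArrayList<>();
        for (int i = 0; i < maxTasks; i++) {
            Map<String, String> taskConfig = new HashMap<>(configProps);
            taskConfig.put("task.id", Integer.toString(i)); // optional per-task setting
            configs.add(taskConfig);
        }
        return configs;
    }

    @Override
    public void stop() {
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public String version() {
        return "1.0";
    }

    // Trivial task so the sketch is self-contained; real logic goes in put().
    public static class MySinkTask extends SinkTask {
        @Override
        public void start(Map<String, String> props) {
        }

        @Override
        public void put(Collection<SinkRecord> records) {
            System.out.println("Got " + records.size() + " records");
        }

        @Override
        public void stop() {
        }

        @Override
        public String version() {
            return "1.0";
        }
    }
}

With tasks.max set to 2 and two workers in the same group, the runtime should assign one task to each worker, which gives the load sharing you're after.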
References:
https://opencredo.com/blogs/kafka-connect-source-connectors-a-detailed-guide-to-connecting-to-what-you-love/
https://docs.confluent.io/current/connect/devguide.html
https://enfuse.io/a-diy-guide-to-kafka-connectors/

How can I manage Kafka connect schema errors?

I'm using Kafka Connect (Confluent distribution) to connect an MQTT broker to a Kafka topic (https://docs.lenses.io/connectors/source/mqtt.html), but when a message arrives that doesn't conform to the expected schema, the connector stops!
How can I prevent this from happening?
I'd also like to handle the error and, for example, keep track of it!
If you are using a ready-made connector, the incoming messages need to satisfy the expected schema; any schema error will stop the connector. So the best approach is to identify the schema error from the error message.
If the existing connector can't be made to work, write your own connector that satisfies your needs.

In a Kafka connector, how do I get the bootstrap-server address my Kafka Connect worker is currently using?

I'm developing a Kafka sink connector on my own. My deserializer is JSONConverter. However, when someone sends bad JSON data to my connector's topic, I want to skip that record and send it to a specific topic of my company's.
My confusion is: I can't find any API that lets me get my Connect worker's bootstrap.servers. (I know it's in Confluent's etc directory, but it's not a good idea to hard-code the path to "connect-distributed.properties" just to read bootstrap.servers.)
So the question is: is there another way to conveniently get the value of bootstrap.servers in my connector code?
Instead of trying to send the "bad" records from your SinkTask to Kafka yourself, use the dead letter queue feature that was added in Kafka Connect 2.0.
You can configure the Connect runtime to automatically dump records that failed to be processed to a configured topic acting as a DLQ.
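A rough sketch of the relevant settings (the connector class and topic names are placeholders): adding the errors.* properties to the sink connector's config makes the runtime tolerate conversion failures, optionally log them, and write the failing records to the DLQ topic instead of killing the task:

{
  "name": "my-sink",
  "config": {
    "connector.class": "com.example.MySinkConnector",
    "topics": "input-topic",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "errors.tolerance": "all",
    "errors.log.enable": "true",
    "errors.log.include.messages": "true",
    "errors.deadletterqueue.topic.name": "my-sink-dlq",
    "errors.deadletterqueue.topic.replication.factor": "1",
    "errors.deadletterqueue.context.headers.enable": "true"
  }
}

With the context headers enabled, each DLQ record also carries headers describing the original topic, partition, offset and the exception, so the bad JSON records end up in a topic of your choosing without your connector ever needing to look up bootstrap.servers itself.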
For more details, see the KIP that added this feature.