How to stream messages from Databricks to a Kafka client using Azure Event Hubs [closed] - apache-kafka

I have a process that reads from a Kafka queue and writes into a DWH. The Kafka queue currently receives data from a Java application that reads from local storage and writes into the Kafka queue.
We need to implement the following:
replace the local storage with an Azure Storage Account (DONE)
replace the Kafka queue with Azure Event Hubs
replace the Java application with a simple Databricks job that does a readStream using Autoloader from the Azure Data Lake and writes into Azure Event Hubs
Constraint: the Kafka client consumer cannot be changed, other than its connection string.
Now, the good news is that Azure Event Hubs is Kafka-compatible (let's assume that the JSON body of each message is smaller than 10 KB), so my question is how to configure this architecture. More specifically:
A) how should Azure Event Hubs be configured to be Kafka-compatible towards its consumer?
B) should I also use the Kafka protocol from Databricks to SEND the messages, or can I use the Event Hubs interface, trusting that it exposes itself with a Kafka interface to the consumer and with an Event Hubs interface to the sender?
C) where can I retrieve the Kafka endpoint to be used by the consumer, and what else should I take care of besides the new connection string? In the listen policy I find Primary Key, Connection String and SAS Policy ARM ID, but I'm not sure how to convert them to a Kafka endpoint.

To use Event Hubs through the Kafka protocol you just need to configure the Kafka options correctly. You need the following:
we need a Shared Access Signature (SAS) to authenticate to the Event Hubs topic - it should look like Endpoint=sb://<....>.windows.net/;?... and will be used as the password. For security reasons it's recommended to put it into a Databricks secret scope (update the secret_scope and secret_name variables with your actual values).
we need to form the correct string (the eh_sasl variable) for SASL (Simple Authentication and Security Layer) authentication - as the user name we're using the static value $ConnectionString, and the Event Hubs SAS is used as the password. The SASL string looks a bit different on Databricks - instead of org.apache.kafka.common.security.plain.PlainLoginModule... it should be prefixed with kafkashaded., as the original Java package is shaded to avoid conflicts with other packages.
you need to provide the name of the Event Hubs namespace and the topic from which to read data in the eh_namespace_name and topic_name variables.
secret_scope = "scope"
secret_name = "eventhub_sas"
topic_name = "topic1"
eh_namespace_name = "<eh-ns-name>"

readConnectionString = dbutils.secrets.get(secret_scope, secret_name)
eh_sasl = 'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule' \
    + f' required username="$ConnectionString" password="{readConnectionString}";'
bootstrap_servers = f"{eh_namespace_name}.servicebus.windows.net:9093"
kafka_options = {
    "kafka.bootstrap.servers": bootstrap_servers,
    "kafka.sasl.mechanism": "PLAIN",
    "kafka.security.protocol": "SASL_SSL",
    "kafka.request.timeout.ms": "60000",
    "kafka.session.timeout.ms": "30000",
    "startingOffsets": "earliest",
    "kafka.sasl.jaas.config": eh_sasl,
    "subscribe": topic_name,
}

df = spark.readStream.format("kafka") \
    .options(**kafka_options).load()
Writing is done with a similar configuration.
from pyspark.sql.functions import struct, to_json

# work with your dataframe

kafka_options = {
    "kafka.bootstrap.servers": bootstrap_servers,
    "kafka.sasl.mechanism": "PLAIN",
    "kafka.security.protocol": "SASL_SSL",
    "kafka.sasl.jaas.config": eh_sasl,
    "topic": topic_name,
}

df.select(to_json(struct("*")).alias("value")) \
    .write.format("kafka").options(**kafka_options).save()
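Note that .write/.save() applies to a batch DataFrame. If df is the streaming DataFrame produced by readStream (the Autoloader scenario from the question), Spark will reject .write; you would use .writeStream with a checkpoint location instead. A minimal sketch, reusing the kafka_options above and assuming a hypothetical checkpoint path /mnt/checkpoints/eh-topic1:

# Streaming variant: write the readStream/Autoloader DataFrame to Event Hubs via Kafka.
# The checkpoint path is a placeholder - replace it with a location you control.
query = df.select(to_json(struct("*")).alias("value")) \
    .writeStream \
    .format("kafka") \
    .options(**kafka_options) \
    .option("checkpointLocation", "/mnt/checkpoints/eh-topic1") \
    .start()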
See more details about Spark + Kafka in the Spark & Databricks documentation.

Related

Connection to external Kafka Server using confluent-kafka-dotnet fails

I need to read Kafka messages with .NET from an external server. As the first step, I installed Kafka on my local machine and then wrote the .NET code. It worked as expected. Then, I moved to the cloud but the code did not work. Here is the setup that I have.
I have a Kafka Server deployed on a Windows VM (VM1: 10.0.0.4) on Azure. It is up and running. I have created a test topic and produced some messages with cmd. To test that everything is working I have opened a consumer with cmd and received the generated messages.
Then I have deployed another Windows VM (VM2, 10.0.0.5) with Visual Studio. Both of the VMs are deployed on the same virtual network so that I do not have to worry about opening ports or any other network configuration.
Then, I copied my Visual Studio project code and changed the IP address of the bootstrap server to point to the Kafka server. It did not work. Then I read that I have to change the server configuration of Kafka, so I opened server.properties and modified the listeners property to listeners=PLAINTEXT://10.0.0.4:9092. It still does not work.
I have searched online and tried many of the tips, but it does not work. I think that, first of all, I need to provide credentials to the external server (VM1), and probably some other configuration. Unfortunately, the official Confluent documentation is very short, with very few examples. There is also no example for my case on the official GitHub. I have played with the "Sasl" properties in the ConsumerConfig class, but also with no success.
the error message is:
%3|1622220986.498|FAIL|rdkafka#consumer-1| [thrd:10.0.0.4:9092/bootstrap]: 10.0.0.4:9092/bootstrap: Connect to ipv4#10.0.0.4:9092 failed: Unknown error (after 21038ms in state CONNECT)
Error: 10.0.0.4:9092/bootstrap: Connect to ipv4#10.0.0.4:9092 failed: Unknown error (after 21038ms in state CONNECT)
Error: 1/1 brokers are down
Here is my .Net core code:
static void Main(string[] args)
{
    string topic = "AzureTopic";
    var config = new ConsumerConfig
    {
        BootstrapServers = "10.0.0.4:9092",
        GroupId = "test",
        //SecurityProtocol = SecurityProtocol.SaslPlaintext,
        //SaslMechanism = SaslMechanism.Plain,
        //SaslUsername = "[User]",
        //SaslPassword = "[Password]",
        AutoOffsetReset = AutoOffsetReset.Latest,
        //EnableAutoCommit = false
    };
    int x = 0;
    using (var consumer = new ConsumerBuilder<Ignore, string>(config)
        .SetErrorHandler((_, e) => Console.WriteLine($"Error: {e.Reason}"))
        .Build())
    {
        consumer.Subscribe(topic);
        var cancelToken = new CancellationTokenSource();
        while (true)
        {
            // some tasks
        }
        consumer.Close();
    }
}
If you set listeners to a hard-coded IP, the server will only bind to and accept traffic on that IP.
And your listener isn't defined as SASL, so I'm not sure why you've tried using that in the client. While using credentials is strongly encouraged when sending data to cloud resources, they're not required to fix a network connectivity problem. You definitely shouldn't send credentials over plaintext, however.
Start with these settings
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://10.0.0.4:9092
That alone should work within the VM shared network. You can use the console tools included with Kafka to test it.
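For instance, a quick smoke test from the broker VM with the bundled console scripts might look like this (the topic name test is a placeholder; on Windows the scripts live under bin\windows as .bat files, and older Kafka versions expect --broker-list instead of --bootstrap-server for the producer):

bin/kafka-console-producer.sh --bootstrap-server 10.0.0.4:9092 --topic test
bin/kafka-console-consumer.sh --bootstrap-server 10.0.0.4:9092 --topic test --from-beginning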
And if that still doesn't work from your local client, it's because the 10.0.0.0/8 address space is considered a private network, so you must advertise the VM's public IP and allow TCP traffic on port 9092 through the Azure firewall. It would also make sense to expose separate listeners for the internal Azure network and for external, forwarded traffic (see the sketch below).
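A minimal sketch of what such a dual-listener setup could look like in server.properties; the listener names (INTERNAL, EXTERNAL), the second port, and the public IP 203.0.113.10 are placeholders, not values from the question:

# Bind both listeners on all interfaces (placeholder names and ports)
listeners=INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:9094
# Advertise the private IP internally and the VM's public IP externally
advertised.listeners=INTERNAL://10.0.0.4:9092,EXTERNAL://203.0.113.10:9094
listener.security.protocol.map=INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
inter.broker.listener.name=INTERNAL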
Details here discuss AWS and Docker, but the basics still apply
Overall, I think it would be easier to set up Azure Event Hubs with Kafka support.

Connect Cadence to Azure Cosmos Cassandra API (Cadence workflow)

I am running Cadence with Cassandra running externally, using docker run -e CASSANDRA_SEEDS=10.x.x.x e ubercadence/server:. and it's running successfully.
Azure Cosmos says that any system running on Cassandra can use Azure Cosmos through the provided Cosmos Cassandra API by modifying the client connection creation code, for example this Go app sample code:
func GetSession(cosmosCassandraContactPoint, cosmosCassandraPort, cosmosCassandraUser, cosmosCassandraPassword string) *gocql.Session {
    clusterConfig := gocql.NewCluster(cosmosCassandraContactPoint)
    port, err := strconv.Atoi(cosmosCassandraPort)
    clusterConfig.Authenticator = gocql.PasswordAuthenticator{Username: cosmosCassandraUser, Password: cosmosCassandraPassword}
    clusterConfig.Port = port
    clusterConfig.SslOpts = &gocql.SslOptions{Config: &tls.Config{MinVersion: tls.VersionTLS12}}
    clusterConfig.ProtoVersion = 4
    session, err := clusterConfig.CreateSession()
    ...
    return session
}
From my end, I can connect the external Cassandra's cqlsh (which Cadence is using for persistence) to Azure Cosmos and can create a keyspace and table in the Azure Cosmos DB. However, when I run the Cadence server, all new tables are still created on the local Cassandra itself (instead of Azure Cosmos); maybe Cadence is connected only to Cassandra.
So there are basically 2 questions, shared below:
1. Since Cadence is written in Go, can we modify the source code to establish a connection to Azure Cosmos DB? or
2. Can we pass the Cosmos Cassandra host, port, username, and password while running Cassandra and Cadence separately (docker run -e CASSANDRA_SEEDS=10.x.x.x e ubercadence/server:)?
cosmosCassandraContactPoint : xyz.cassandra.cosmos.azure.com
cosmosCassandraPort : 10350
cosmosCassandraUser : xyz
cosmosCassandraPassword : xyz
It's really exciting that Azure Cosmos Cassandra now supports LWT!
I took a quick look at the docs.
I think it may not work directly, because it doesn't support LoggedBatch.
But Cadence is using logged batches.
I think it's probably okay to use unlogged batches in Cadence,
because all the operations are always in a single partition.
From the Datastax docs:
Single partition batch operations are atomic automatically, while multiple partition batch operations require the use of a batchlog to ensure atomicity
This means using an unlogged batch should behave the same for Cadence
(though I believe Cadence chose to use logged batches to be safe).
It should work if we change the code slightly in the Cadence Cassandra plugin.

How to connect Ksql with ibm-cloud event-stream?

We created a project with IBM Functions and Event Streams in IBM Cloud.
Now I am trying to connect KSQL with IBM Cloud Event Streams, and I am following along with the documentation to get basic ideas of the integration.
By following the instructions, I created a file called ksql-server.properties and modified bootstrap.servers, username, and password according to my credentials. Then I ran ksql http://localhost:8088 --config-file ksql-server.properties with the KSQL local CLI. I assume everything runs correctly so far, since the ksql> prompt shows at the front of every new line...
Then I decided to check if KSQL connected with my IBM Cloud by running SHOW topics;
It turns out some error lines appear:
`Error issuing POST to KSQL server. path:ksql'`
`Caused by: com.fasterxml.jackson.databind.JsonMappingException: Failed to set 'ssl.protocol' to 'TLSv1.2' (through reference chain: io.confluent.ksql.rest.entity.KsqlRequest["streamsProperties"])`
`Caused by: Failed to set 'ssl.protocol' to 'TLSv1.2' (through reference chain: io.confluent.ksql.rest.entity.KsqlRequest["streamsProperties"])`
`Caused by: Failed to set 'ssl.protocol' to 'TLSv1.2'`
`Caused by: Cannot override property 'ssl.protocol'`
Also, I am quite lost at step 4 when it tells me to:
`Then start DataGen twice as follows:
i. With bootstrap-server=HOSTNAME:PORTNUMBER quickstart=users format=json topic=users maxInterval=10000 to start creating users events.
ii. With bootstrap-server=HOSTNAME:PORTNUMBER quickstart=pageviews format=delimited topic=pageviews maxInterval=10000 to start creating pageviews events.`
Has anyone done this before or would love to help me out? Thank you very much!
The IBM document is very out of date. KSQL runs as a client/server. The server needs to be run with the details of the broker, and then you can connect to it with a client, including the CLI, REST API, or web interface provided by Confluent Control Center.
So you need to run the KSQL server using your properties file:
./bin/ksql-server-start ksql-server.properties
and then connect to it with the CLI (for example):
./bin/ksql http://localhost:8088
See https://docs.confluent.io/current/ksql/docs/installation/installing.html for more information.
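For reference, a rough sketch of what the connection-related part of ksql-server.properties might look like for a SASL-secured cluster such as IBM Event Streams. The broker hosts and API key are placeholders, and the username "token" is an assumption about how IBM Event Streams typically expects Kafka clients to authenticate with an API key; verify it against your own service credentials.

# Placeholder broker list taken from your Event Streams service credentials
bootstrap.servers=<BROKER_1>:9093,<BROKER_2>:9093
# Standard Kafka client security settings, passed through by the KSQL server
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="token" \
  password="<YOUR_API_KEY>";
# Where the KSQL server itself listens for clients such as the CLI
listeners=http://localhost:8088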

How do I get notified when an object is uploaded to my GCS bucket?

I have an app that uploads photos regularly to a GCS bucket. When those photos are uploaded, I need to add thumbnails and do some analysis. How do I set up notifications for the bucket?
The way to do this is to create a Cloud Pub/Sub topic for new objects and to configure your GCS bucket to publish messages to that topic when new objects are created.
First, let's create a bucket PHOTOBUCKET:
$ gsutil mb gs://PHOTOBUCKET
Now, make sure you've activated the Cloud Pub/Sub API.
Next, let's create a Cloud Pub/Sub topic and wire it to our GCS bucket with gsutil:
$ gsutil notification create \
-t uploadedphotos -f json \
-e OBJECT_FINALIZE gs://PHOTOBUCKET
The -t specifies the Pub/Sub topic. If the topic doesn't already exist, gsutil will create it for you.
The -e specifies that you're only interested in OBJECT_FINALIZE messages (objects being created). Otherwise you'll get every kind of message in your topic.
The -f specifies that you want the payload of the messages to be the object metadata for the JSON API.
Note that this requires a recent version of gsutil, so be sure to update to the latest version of gcloud, or run gsutil update if you use a standalone gsutil.
Now we have notifications configured and pumping, but we'll want to see them. Let's create a Pub/Sub subscription:
$ gcloud beta pubsub subscriptions create processphotos --topic=uploadedphotos
Now we just need to read these messages. Here's a Python example of doing just that. Here are the relevant bits:
from google.cloud import pubsub

def poll_notifications(subscription_id):
    client = pubsub.Client()
    subscription = pubsub.subscription.Subscription(
        subscription_id, client=client)
    while True:
        pulled = subscription.pull(max_messages=100)
        for ack_id, message in pulled:
            print('Received message {0}:\n{1}'.format(
                message.message_id, summarize(message)))
            subscription.acknowledge([ack_id])

def summarize(message):
    data = message.data
    attributes = message.attributes
    event_type = attributes['eventType']
    bucket_id = attributes['bucketId']
    object_id = attributes['objectId']
    return "A user uploaded %s, we should do something here." % object_id
Here is some more reading on how this system works:
https://cloud.google.com/storage/docs/reporting-changes
https://cloud.google.com/storage/docs/pubsub-notifications
GCP also offers an earlier version of the Pub/Sub cloud storage change notifications called Object Change Notification. This feature will directly POST to your desired endpoint(s) when an object in that bucket changes. Google recommends the Pub/Sub approach.
https://cloud.google.com/storage/docs/object-change-notification
While using this example, keep in mind two things:
1) The code has been upgraded to Python 3.6 and pubsub_v1, so it might not run on Python 2.7.
2) When calling poll_notifications(projectid, subscriptionname), pass your GCP project id (e.g. bold-idad) and subscription name (e.g. asrtopic).
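For the newer google-cloud-pubsub client mentioned above, a rough sketch of the same polling loop might look like the following; the project id is the example value from the comment above and the subscription is the one created earlier, and the exact surface has shifted a bit across library versions, so treat this as an illustration rather than the original sample.

# Sketch using the updated API (google-cloud-pubsub pubsub_v1), not the original sample.
from google.cloud import pubsub_v1

def poll_notifications(project_id, subscription_name):
    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(project_id, subscription_name)

    def callback(message):
        # GCS notification details arrive as message attributes.
        print("Received {} for object {}".format(
            message.attributes.get("eventType"),
            message.attributes.get("objectId")))
        message.ack()

    # subscribe() returns a future; block on it to keep pulling messages.
    future = subscriber.subscribe(subscription_path, callback=callback)
    future.result()

# Example: poll_notifications("bold-idad", "processphotos")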

How to pass PcapPackets into Kafka queue

With the code below that passes PcapPackets to a queue, is it possible to pass them into a Kafka queue so that a Kafka consumer can pull PcapPackets as such from the Kafka producer?
StringBuilder errbuf = new StringBuilder();
Pcap pcap = Pcap.openOffline("tests/test-afs.pcap", errbuf);

PcapPacketHandler<Queue<PcapPacket>> handler = new PcapPacketHandler<Queue<PcapPacket>>() {
    public void nextPacket(PcapPacket packet, Queue<PcapPacket> queue) {
        // Make a deep copy, since the scanner reuses the packet buffer.
        PcapPacket permanent = new PcapPacket(packet);
        queue.offer(permanent);
    }
};

// ArrayBlockingQueue requires a capacity; 10 matches the loop count below.
Queue<PcapPacket> queue = new ArrayBlockingQueue<PcapPacket>(10);
pcap.loop(10, handler, queue);
System.out.println("we have " + queue.size() + " packets in our queue");
pcap.close();
Kafka supports storing arbitrary binary data as messages. In your case, you just need to provide a binary serializer for the PcapPacket class (and a deserializer for reading).
See Kafka: writing custom serializer for an example.
Though I am late to the party, I share my tool, Pcap Processor (GitHub URL), here in case anyone with similar requirements finds it useful. I developed a tool in Python for my research to read raw pcap files, process them, and feed them to my stream processor. Since I tried various streaming protocols, I implemented all of them in this tool.
Currently supported sinks:
CSV file
Apache Kafka (encoded into JSON string)
HTTP REST (JSON)
gRPC
Console (just print to the terminal)
For example, to read input.pcap and send it to a Kafka topic, you need to adjust the bootstrap endpoint and topic name in kafka_sink.py. Then, executing the following command from the parent directory will read the file and send the packets to the Kafka queue.
python3 -m pcap_processor --sink kafka input.pcap
For more details and installation instructions, please check the GitHub readme and feel free to open GitHub issues if you encounter any problems.