With the code below, which passes PcapPackets to a queue, is it possible to push these packets to a Kafka queue so that a Kafka consumer can pull them as PcapPackets from the Kafka producer?
StringBuilder errbuf = new StringBuilder();
Pcap pcap = Pcap.openOffline("tests/test-afs.pcap", errbuf);
PcapPacketHandler<Queue<PcapPacket>> handler = new PcapPacketHandler<Queue<PcapPacket>>() {
    public void nextPacket(PcapPacket packet, Queue<PcapPacket> queue) {
        // Copy into a permanent packet first; the packet passed to the handler
        // is only valid for the duration of this callback.
        PcapPacket permanent = new PcapPacket(packet);
        queue.offer(permanent);
    }
};
Queue<PcapPacket> queue = new ArrayBlockingQueue<PcapPacket>(10); // ArrayBlockingQueue requires a capacity
pcap.loop(10, handler, queue);
System.out.println("we have " + queue.size() + " packets in our queue");
pcap.close();
Kafka supports storing arbitrary binary data as messages. In your case you just need to provide a binary serializer for the PcapPacket class (and a deserializer for reading).
See Kafka: writing custom serializer for an example.
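As an illustration, here is a minimal sketch of such a serializer/deserializer pair. The class names are made up for the example, and it assumes jNetPcap's PcapPacket exposes getTotalSize(), transferStateAndDataTo(byte[]) and a byte[] constructor; check the API of your jNetPcap version before relying on it.

import java.util.Map;

import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serializer;
import org.jnetpcap.packet.PcapPacket;

// Hypothetical serializer: copies the packet's state and data into a byte[]
// so it can be sent as the value of a Kafka message.
public class PcapPacketSerializer implements Serializer<PcapPacket> {

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) { }

    @Override
    public byte[] serialize(String topic, PcapPacket packet) {
        if (packet == null) {
            return null;
        }
        // Assumes jNetPcap's getTotalSize()/transferStateAndDataTo() API.
        byte[] buffer = new byte[packet.getTotalSize()];
        packet.transferStateAndDataTo(buffer);
        return buffer;
    }

    @Override
    public void close() { }
}

// Hypothetical deserializer for the consumer side.
class PcapPacketDeserializer implements Deserializer<PcapPacket> {

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) { }

    @Override
    public PcapPacket deserialize(String topic, byte[] data) {
        if (data == null) {
            return null;
        }
        // Assumes a PcapPacket(byte[]) constructor that restores state and data.
        return new PcapPacket(data);
    }

    @Override
    public void close() { }
}

You would then set value.serializer to PcapPacketSerializer on the producer (and value.deserializer to PcapPacketDeserializer on the consumer) and call producer.send(...) from inside nextPacket(), instead of, or in addition to, offering the packet to the in-memory queue.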
Though I am late to the party, I am sharing my tool, Pcap Processor (GitHub URL), here in case anyone with similar requirements finds it useful. I developed this tool in Python for my research to read raw pcap files, process them, and feed them to my stream processor. Since I tried various stream protocols, I implemented all of them in this tool.
Currently supported sinks:
CSV file
Apache Kafka (encoded into JSON string)
HTTP REST (JSON)
gRPC
Console (just print to the terminal)
For example, to read input.pcap and send it to a Kafka topic, you need to adjust the bootstrap endpoint and topic name in kafka_sink.py. Then, executing the following command from the parent directory will read the file and send the packets to the Kafka queue.
python3 -m pcap_processor --sink kafka input.pcap
For more details and installation instructions, please check the GitHub readme and feel free to open GitHub issues if you encounter any problems.
I have a process that reads from a Kafka queue and writes into a DWH. The Kafka queue currently receives data from a Java application that reads from local storage and writes into the Kafka queue.
We need to implement the following:
replace the local storage with an Azure Storage Account (DONE)
replace the Kafka queue with Azure Event Hubs
replace the Java application with a simple Databricks job that does a readStream using Auto Loader from the Azure Data Lake and writes into Azure Event Hubs
Constraint: the Kafka client consumer cannot be changed, other than its connection string.
Now, the good news is that Azure Event Hubs is Kafka-compatible (let's assume that the JSON body of each message is smaller than 10 KB), so my question is how to configure this architecture. More specifically:
A) how should Azure EH be configured to be kafka-compliant towards its consumer?
B) should I use the Kafka protocol from Databricks to SEND the messages as well, or can I use the Azure Event Hubs interface, trusting the fact that it exposes itself with a Kafka interface to the consumer and with an Event Hubs interface to the sender?
C) where can I retrieve the Kafka endpoint to be used by the consumer, and what should I take care of in addition to the new connection string? In the listen policy I find Primary Key, Connection String and SAS Policy ARM ID, but I'm not sure how to convert them into a Kafka endpoint.
To use Event Hubs via the Kafka protocol you just need to configure the Kafka options correctly. You need the following:
we need a Shared Access Signature (SAS) to authenticate to the Event Hubs topic - it should look like Endpoint=sb://<....>.windows.net/;?... and will be used as the password. For security reasons it's recommended to put it into a Databricks secret scope (update the secret_scope and secret_name variables with your actual values).
we need to form the correct string (the eh_sasl variable) for SASL (Simple Authentication and Security Layer) authentication - as the user name we use the static value $ConnectionString, and the Event Hubs SAS is used as the password. The SASL string looks a bit different on Databricks - instead of org.apache.kafka.common.security.plain.PlainLoginModule... it should be prefixed with kafkashaded. because the original Java package is shaded to avoid conflicts with other packages.
you need to provide the name of the Event Hubs namespace and the topic from which to read data in the eh_namespace_name and topic_name variables.
secret_scope = "scope"
secret_name = "eventhub_sas"
topic_name = "topic1"
eh_namespace_name = "<eh-ns-name>"
readConnectionString = dbutils.secrets.get(secret_scope, secret_name)
eh_sasl = 'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule' \
+ f' required username="$ConnectionString" password="{readConnectionString}";'
bootstrap_servers = f"{eh_namespace_name}.servicebus.windows.net:9093"
kafka_options = {
"kafka.bootstrap.servers": bootstrap_servers,
"kafka.sasl.mechanism": "PLAIN",
"kafka.security.protocol": "SASL_SSL",
"kafka.request.timeout.ms": "60000",
"kafka.session.timeout.ms": "30000",
"startingOffsets": "earliest",
"kafka.sasl.jaas.config": eh_sasl,
"subscribe": topic_name,
}
df = spark.readStream.format("kafka") \
.options(**kafka_options).load()
Writing is done with a similar configuration.
from pyspark.sql.functions import struct, to_json
# work with your dataframe
kafka_options = {
"kafka.bootstrap.servers": bootstrap_servers,
"kafka.sasl.mechanism": "PLAIN",
"kafka.security.protocol": "SASL_SSL",
"kafka.sasl.jaas.config": eh_sasl,
"topic": topic_name,
}
df.select(to_json(struct("*")).alias("value")) \
.write.format("kafka").options(**kafka_options).save()
See more details about Spark + Kafka in the Spark & Databricks documentation.
I have a requirement to read OPC UA data via Apache PLC4X and push it to an Apache Kafka server. I have configured the OPC UA simulator (Prosys OPC Simulator) and configured my Kafka cluster in a virtual machine.
#The PLC4X connection string to be used. Examples for each protocol are included on the PLC4X website.
sources.machineA.connectionString=opcua:tcp://192.168.29.246:53530
#The source 'poll' method should return control to Kafka Connect every so often.
#This value controls how often it returns when no messages are received.
sources.machineA.pollReturnInterval=5000
#There is an internal buffer between the PLC4X scraper and Kafka Connect.
#This is the size of that buffer.
sources.machineA.bufferSize=1000
#A list of jobs associated with this source.
#sources.machineA.jobReferences=simulated-dashboard,simulated-heartbeat
sources.machineA.jobReferences=simulated-dashboard
#The Kafka topic to use to produce to. The default topic will be used if this isn't specified.
#sources.machineA.jobReferences.simulated-heartbeat.topic=simulated-heartbeat-topic
#A list of jobs specified in the following section.
#jobs=simulated-dashboard,simulated-heartbeat
jobs=simulated-dashboard
#The poll rate for this job. the PLC4X scraper will request data every interval (ms).
jobs.simulated-dashboard.interval=1000
#A list of fields. Each field is a map between an alias and a PLC4X address.
#The address formats for each protocol can be found on the PLC4X website.
jobs.simulated-dashboard.fields=Counter
jobs.simulated-dashboard.fields.Counter=3:1001:Integer
#jobs.simulated-dashboard.fields=running,conveyorEntry,load,unload,transferLeft,transferRight,conveyorLeft,conveyorRight,numLargeBoxes,numSmallBoxes,testString
#jobs.simulated-dashboard.fields.running=RANDOM/Running:Boolean
#jobs.simulated-dashboard.fields.conveyorEntry=RANDOM/ConveryEntry:Boolean
#jobs.simulated-dashboard.fields.load=RANDOM/Load:Boolean
#jobs.simulated-dashboard.fields.unload=RANDOM/Unload:Boolean
#jobs.simulated-dashboard.fields.transferLeft=RANDOM/TransferLeft:Boolean
#jobs.simulated-dashboard.fields.transferRight=RANDOM/TransferRight:Boolean
#jobs.simulated-dashboard.fields.conveyorLeft=RANDOM/ConveyorLeft:Boolean
#jobs.simulated-dashboard.fields.conveyorRight=RANDOM/ConveyorRight:Boolean
#jobs.simulated-dashboard.fields.numLargeBoxes=RANDOM/NumLargeBoxes:Integer
#jobs.simulated-dashboard.fields.numSmallBoxes=RANDOM/NumSmallBoxes:Integer
#jobs.simulated-dashboard.fields.testString=RANDOM/TestString:STRING
Please help me solve this issue.
The error message indicates that PLC4X is trying to match its connection string against the endpoint string returned by the Prosys Simulation Server.
The Prosys endpoint string can be found on the Status tab of the Simulation Server; it looks like opc.tcp://192.168.29.246:53530/OPCUA/SimulationServer. Converting this to the PLC4X connection string format, the connection string should be:
opcua:tcp://192.168.29.246:53530/OPCUA/SimulationServer
Just looking at the config file, there also seems to be an issue with the address:
jobs.simulated-dashboard.fields.Counter=3:1001:Integer
This should probably be:
jobs.simulated-dashboard.fields.Counter=ns=3;i=1001
I need to read Kafka messages with .NET from an external server. As the first step, I installed Kafka on my local machine and then wrote the .NET code. It worked as expected. Then I moved to the cloud, but the code did not work. Here is the setup that I have.
I have a Kafka Server deployed on a Windows VM (VM1: 10.0.0.4) on Azure. It is up and running. I have created a test topic and produced some messages with cmd. To test that everything is working I have opened a consumer with cmd and received the generated messages.
Then I have deployed another Windows VM (VM2, 10.0.0.5) with Visual Studio. Both of the VMs are deployed on the same virtual network so that I do not have to worry about opening ports or any other network configuration.
Then I copied my Visual Studio project code and changed the IP address of the bootstrap server to point to the Kafka server. It did not work. Then I read that I have to change the server configuration of Kafka, so I opened server.properties and modified the listeners property to listeners=PLAINTEXT://10.0.0.4:9092. It still does not work.
I have searched online and tried many of the tips, but it does not work. I think I first of all need to provide credentials for the external server (VM1), and probably some other configuration. Unfortunately, the official Confluent documentation is very short, with very few examples. There is also no example for my case in the official GitHub repository. I have played with the "Sasl" properties in the ConsumerConfig class, but with no success.
The error message is:
%3|1622220986.498|FAIL|rdkafka#consumer-1| [thrd:10.0.0.4:9092/bootstrap]: 10.0.0.4:9092/bootstrap: Connect to ipv4#10.0.0.4:9092 failed: Unknown error (after 21038ms in state CONNECT)
Error: 10.0.0.4:9092/bootstrap: Connect to ipv4#10.0.0.4:9092 failed: Unknown error (after 21038ms in state CONNECT)
Error: 1/1 brokers are down
Here is my .Net core code:
static void Main(string[] args)
{
    string topic = "AzureTopic";
    var config = new ConsumerConfig
    {
        BootstrapServers = "10.0.0.4:9092",
        GroupId = "test",
        //SecurityProtocol = SecurityProtocol.SaslPlaintext,
        //SaslMechanism = SaslMechanism.Plain,
        //SaslUsername = "[User]",
        //SaslPassword = "[Password]",
        AutoOffsetReset = AutoOffsetReset.Latest,
        //EnableAutoCommit = false
    };
    int x = 0;
    using (var consumer = new ConsumerBuilder<Ignore, string>(config)
        .SetErrorHandler((_, e) => Console.WriteLine($"Error: {e.Reason}"))
        .Build())
    {
        consumer.Subscribe(topic);
        var cancelToken = new CancellationTokenSource();
        while (true)
        {
            // some tasks
        }
        consumer.Close();
    }
}
If you set listeners to a hard-coded IP, the server will only bind to and accept traffic on that IP.
And your listener isn't defined as SASL, so I'm not sure why you've tried using that in the client. While using credentials is strongly encouraged when sending data to cloud resources, it's not required to fix a network connectivity problem. You definitely shouldn't send credentials over plaintext, however.
Start with these settings:
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://10.0.0.4:9092
That alone should work within the VM shared network. You can use the console tools included with Kafka to test it.
And if that still doesn't work from your local client, it's because the 10.0.0.0/8 address space is a private network; you must advertise the VM's public IP and allow TCP traffic on port 9092 through the Azure firewall. It would also make sense to expose separate listeners for internal Azure network traffic and for external, forwarded traffic, as sketched below.
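A minimal server.properties sketch for such a dual-listener setup could look like the following; the listener names and the <vm-public-ip> placeholder are illustrative, and the external listener is shown as PLAINTEXT only for brevity (it should be secured, e.g. with SASL_SSL, before production use).
# Bind both listeners on all interfaces; advertise different addresses to internal and external clients
listeners=INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:9093
advertised.listeners=INTERNAL://10.0.0.4:9092,EXTERNAL://<vm-public-ip>:9093
listener.security.protocol.map=INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
inter.broker.listener.name=INTERNAL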
Details here discuss AWS and Docker, but the basics still apply.
Overall, though, I think it would be easier to set up Azure Event Hubs with Kafka support.
This is related to the embedded Kafka server provided by Spring. On running my test case, the embedded Kafka instantiation (shown below) fails.
public static KafkaEmbedded embeddedKafka = new KafkaEmbedded(1, true, KAFKA_TOPIC);
The instantiation fails with the following errors.
1> C:\Users\r2dev\AppData\Local\Temp\kafka-1587343850239557903\version-2\log.1: The process cannot access the file because it is being used by another process
2> C:\Users\r2dev\AppData\Local\Temp\kafka-7315008084340411800.lock: The process cannot access the file because it is being used by another process.
I am using "1.2.0.RELEASE" spring kafka version and Java 8.
Any one faced this issue and were able to fix this issue. Please let me know
Thanks in advance.
According to the documentation there are two ways to send log information to the SwisscomDev ELK service.
Standard way via STDOUT: Every output to stdout is sent to Logstash
Directly send to Logstash
I'm asking about way 2: how is this achieved, and in particular, what input format is expected?
We're using Monolog in our PHP-buildpack-based application, and its stdout_handler works fine.
I tried the GelfHandler (connection refused) and the SyslogUdpHandler (no error, but no result), both configured to use the VCAP_SERVICES logstashHost and logstashPort as the endpoint/host to send logs to.
Binding works and the environment variables are set, but I have no idea how to send log data from our application in a format that the SwisscomDev ELK service's Logstash endpoint accepts.
Logstash is configured with a tcp input, which is reachable via logstashHost:logstashPort. The tcp input is configured with its default codec, which is the line codec (source code; not the plain codec as stated in the documentation).
The payload of the log event should be encoded in JSON so that the fields are automatically recognized by Elasticsearch. If this is the case, the whole log event is forwarded without further processing to Elasticsearch.
If the payload is not JSON, the whole log line will end up in the field message.
For your use case with Monolog, I suggest using the SocketHandler (pointing it at logstashHost:logstashPort) in combination with the LogstashFormatter, which will take care of the JSON encoding, with the log events being line-delimited.