Azure Databricks only gets Event Hub data sent while it's running - pyspark

I'm trying to read Azure Event Hub data using Databricks.
I have a producer running in Node.js, as well as a consumer for testing (on a different consumer group), and all seems to be running fine.
I am using the following pyspark code in Databricks to get the data:
# Initialize event hub config dictionary with connectionString
connectionString = "Endpoint=sb://XXX.servicebus.windows.net/;SharedAccessKeyName=test;SharedAccessKey=XXX;EntityPath=XXX"
ehConf = {}
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
# Add consumer group to the ehConf dictionary
ehConf['eventhubs.consumerGroup'] = "databricks"
# Read events from the Event Hub
messages = spark.readStream.format("eventhubs").options(**ehConf).load()
# Visualize the Dataframe in realtime
display(messages)
The issue is that it only reads data from the stream if it is sent while the notebook is running. If I produce data and then run the notebook, it does not appear.
What am I missing? I want to use this to collect data from the stream every hour or so and save it.
Config:
Databricks Runtime: 7.3LTS (Spark 3.0.1, Scala 2.12)
Azure eventhub library: com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.21

You have two problems:
First, display by default uses a temporary checkpoint, so the next time you run it, it doesn't know where to continue and starts over again and again. If you keep using display, add a checkpointLocation="some_path" argument to the display call (see docs).
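For example (the checkpoint path is a placeholder):
# Reuse the same checkpoint path on every run so display() knows where to resume
display(messages, checkpointLocation="/mnt/checkpoints/eventhub-display")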
Second, by default the EventHubs connector reads only new data (that's why it consumes data only while running). If you want to consume data from the beginning (only on the initial call), you need to add the option eventhubs.startingPosition, which encodes the starting position (see doc). To start reading from the beginning of the event hub, assign the following to this option:
import json
startingEventPosition = {
    "offset": -1,
    "seqNo": -1,           # not in use
    "enqueuedTime": None,  # not in use
    "isInclusive": True
}
ehConf["eventhubs.startingPosition"] = json.dumps(startingEventPosition)

Related

Spark Structured Stream - Kinesis as Data Source

I am trying to consume Kinesis data stream records using a PySpark structured stream.
I am trying to run this code in an AWS Glue batch job. My goal is to use checkpoints and save the checkpoints and data to S3. I am able to consume the data, but it gives only a few records for every trigger, whereas the Kinesis data stream has a lot of records. I am using TRIM_HORIZON, which is an alias for earliest, and trigger writeStream once so that it executes once and shuts down the cluster. When I run the job again, it picks up the latest offset from the checkpoint and runs.
kinesis = spark.readStream.format('kinesis') \
.option('streamName', kinesis_stream_name) \
.option('endpointUrl', 'blaablaa')\
.option('region', region) \
.option('startingPosition', 'TRIM_HORIZON')\
.option('maxOffsetsPerTrigger',100000)\
.load()
# do some transformation here
TargetKinesisData = stream_data.writeStream.format("parquet").outputMode('append') \
    .option("path", s3_target).option("checkpointLocation", checkpoint_location) \
    .trigger(once=True).start().awaitTermination()

Can you append to a file using the alpakka HDFS connector?

I'm trying to use this connector to pull messages from Kafka and write them to HDFS. Works fine as long as the file doesn't already exist, but if it does then it throws a FileAlreadyExistsException. Is there a way to append to an already-existing file using this connector? I'm using an HdfsFlow.dataWithPassThrough flow, and it takes an HdfsWritingSettings, but that only allows you to set an overwrite boolean; there's no append option.

How to use Flume's Kafka Channel without specifying a source

I have an existing Kafka topic and a Flume agent that reads from there and writes to HDFS. I want to reconfigure my Flume agent to move away from the existing setup (a Kafka source and file channel feeding an HDFS sink) and use a Kafka channel instead.
I read in the Cloudera documentation that it is possible to achieve this using only a Kafka channel and an HDFS sink (without a Flume source), unless I have got the wrong end of the stick. So I tried to create this configuration, but it isn't working. It's not even starting the Flume process on the box.
# Test
test.channels = kafka-channel
test.sinks = hdfs-sink
test.channels.kafka-channel.type = org.apache.flume.channel.kafka.KafkaChannel
test.channels.kafka-channel.kafka.bootstrap.servers = localhost:9092
test.channels.kafka-channel.kafka.topic = test
test.channels.kafka-channel.parseAsFlumeEvent = false
test.sinks.hdfs-sink.channel = kafka-channel
test.sinks.hdfs-sink.type = hdfs
test.sinks.hdfs-sink.hdfs.path = hdfs://localhost:8082/data/test/
I'm using:
HDP Quickstart VM 2.6.3
Flume version 1.5.2
The HDFS directory does exist
ps -ef | grep flume only returns a process once I add a kafka-source, but that can't be right, because doing so creates an infinite loop for any messages published to the topic.
Is it possible to use only a Kafka channel and an HDFS sink, or do I need to use a kafka-source and change some other configuration to prevent the infinite loop of messages?
Kafka-source -> kafka-channel -> HDFS Sink - This doesn't seem right to me.
After digging around a bit, I noticed that Ambari didn't create any Flume conf files for the specified agent. Ambari seems to only create/update the Flume config if I specify test.sources = kafka-source. Once I added this to the Flume config (via Ambari), the config was created on the box and the Flume agent started successfully.
The final flume config looked like this:
test.sources=kafka-source
test.channels = kafka-channel
test.sinks = hdfs-sink
test.channels.kafka-channel.type = org.apache.flume.channel.kafka.KafkaChannel
test.channels.kafka-channel.kafka.bootstrap.servers = localhost:9092
test.channels.kafka-channel.kafka.topic = test
test.channels.kafka-channel.parseAsFlumeEvent = false
test.sinks.hdfs-sink.channel = kafka-channel
test.sinks.hdfs-sink.type = hdfs
test.sinks.hdfs-sink.hdfs.path = hdfs:///data/test
Notice that I didn't set any of the properties on the source (that would cause the infinite loop issue I mentioned in my question); it just needs to be declared so that Ambari creates the Flume config and starts the agent.
This doesn't directly answer your question about Flume, but in general, since you're already using Apache Kafka, this pattern is best solved using Kafka Connect (which is part of Apache Kafka).
There is a Kafka Connect HDFS connector which is simple to use, per this guide.
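As a rough sketch of what that looks like (the connector name, topic, and HDFS URL below are placeholders; the config keys are those used by Confluent's kafka-connect-hdfs connector), you can register the sink through the Kafka Connect REST API:
import requests

connector = {
    "name": "hdfs-sink-test",                 # hypothetical connector name
    "config": {
        "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
        "tasks.max": "1",
        "topics": "test",                     # the existing Kafka topic
        "hdfs.url": "hdfs://localhost:8020",  # hypothetical NameNode URL
        "flush.size": "1000"                  # records per written file
    }
}

# Kafka Connect's REST API listens on port 8083 by default
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()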

Structured Streaming with Flume

Hi, can anyone tell me how to read a Flume stream using Spark's new Structured Streaming API?
Example:
val lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
As of Spark 2.1, Spark supports only the File, Kafka and Socket sources. The Socket source is meant for debugging and development and shouldn't be used in production. That leaves the File and Kafka sources.
So, the only options you have are:
a) take data from Flume and dump it into S3 files. Spark can get the data from the S3 files. The way the File source works is that it watches a folder, and when a new file appears, it loads it as a micro-batch.
b) funnel your events into a Kafka instance (a PySpark sketch of the Kafka source follows below).
With the older DStream-based API you would use
val flumeStream = FlumeUtils.createStream(streamingContext, [chosen machine's hostname], [chosen port]) for the push-based approach, and
val flumeStream = FlumeUtils.createPollingStream(streamingContext, [sink machine hostname], [sink port]) for the pull-based approach.
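If you go with option b), reading the Kafka topic from Structured Streaming is straightforward. A minimal PySpark sketch, assuming the events land in a topic called flume-events on localhost:9092 (both placeholders) and the spark-sql-kafka package is on the classpath:
# Kafka source for Structured Streaming
df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder brokers
    .option("subscribe", "flume-events")                  # placeholder topic
    .option("startingOffsets", "earliest")
    .load())

# The Kafka value column is binary; cast it to a string before further processing
events = df.selectExpr("CAST(value AS STRING) AS value")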

use of spark job server

I am new to Spark.
I want to know the use of a Spark job server like Ooyala's spark-jobserver and Livy.
Is it that Spark code cannot communicate with HDFS directly?
Do we need a medium like a web service to send and receive data?
Thanks and Regards,
ooyala and Livy
Whether with a persistent Spark context or not, they help you run Spark like an HTTP service: you can create or run a Spark job directly from an HTTP call instead of launching it from the command line or from a cron job, for instance.
Is it that Spark code cannot communicate with HDFS directly?
No. It can communicate with HDFS directly.
Do we need a medium like a web service to send and receive data?
Actually, Ooyala's spark-jobserver and Livy return the result as JSON, just as when you call an API.
So whether you build such a medium or not depends on your needs.
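Putting that together, a minimal sketch of driving Livy over HTTP from Python (the host, port, and HDFS path are placeholders; /sessions and /statements are Livy's standard REST endpoints):
import requests

livy = "http://localhost:8998"  # placeholder Livy endpoint

# Create an interactive PySpark session (in practice, poll until its state is "idle")
session = requests.post(f"{livy}/sessions", json={"kind": "pyspark"}).json()

# Submit code to run inside the session; the Spark code itself talks to HDFS directly
stmt = requests.post(
    f"{livy}/sessions/{session['id']}/statements",
    json={"code": "spark.read.text('hdfs:///data/test').count()"},
).json()

# Poll GET {livy}/sessions/{id}/statements/{stmt_id}; the result comes back as JSON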