I need to send data from a Databricks Delta table into Azure Event Hubs.
The data will be selected with a SQL select:
spark.sql("SELECT [columns] FROM table WHERE [where clause]")
This select will return a very large number of rows, and after it I will apply some transformations (mainly to conform to the Event Hubs event data message format).
At the end I will send it to Event Hubs.
As far as I can tell, at the moment of writing, I need to use "writeStream", but is this enough? How can I control how many messages are sent per batch? Do I even need to care about that, or does the library handle it?
Another question I have is that from the moment I use "writeStream", the command stays in a running/streaming state indefinitely. Is this correct, or am I just not being patient enough? If I'm correct, how can I stop it (in a non-manual way) after all the data has been sent?
Notes:
This will be running in a job that is to be triggered manually
The library I use for the Event Hubs connection is com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.14.1
Once you have the final records that you want to send to Event Hubs, you need to call the .start() method after your write command; that is what actually starts the stream writing data back to Event Hubs.
Also, if your job fails, you need to stop your SparkContext using sc.stop() or spark.sparkContext.stop().
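For reference, a minimal PySpark sketch of what that can look like end to end. This is a sketch under assumptions rather than the asker's exact setup: the Delta table is read back as a stream (instead of via the static spark.sql select), the table path, checkpoint path and connection string are placeholders, and the rows are serialized into the single "body" column the Event Hubs connector expects. Depending on the connector version, the connection string may also need to be encrypted with EventHubsUtils.encrypt. The processAllAvailable()/stop() calls at the end are one way, not mentioned in the answer above, to end the query once the backlog is drained instead of letting it run forever:

from pyspark.sql.functions import to_json, struct

eh_conf = {"eventhubs.connectionString": "<EVENT_HUBS_CONNECTION_STRING>"}  # placeholder

# Read the Delta table as a stream and shape each row into a JSON "body" column
sdf = (spark.readStream.format("delta").load("/path/to/delta/table")  # placeholder path
           .select(to_json(struct("*")).alias("body")))

query = (sdf.writeStream
            .format("eventhubs")
            .options(**eh_conf)
            .option("checkpointLocation", "/tmp/eh_checkpoint")  # placeholder
            .start())  # .start() is what actually launches the stream

query.processAllAvailable()  # block until everything currently available has been written
query.stop()                 # then stop, so the command does not stay "running" forever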
Related
I am using change streams from DocumentDB to read time-sequenced events with a Lambda function, and EventBridge triggers the Lambda every 10 minutes to archive the data to S3. Is there a way to scale the reads from the change stream using a resume token and a polling model? If a single Lambda reads from the change stream to do the archiving, my process falls way behind. As our application writes a couple of million records during peak periods, my archival process manages to archive at most 500k records to S3. Is there a way to scale this process? Running parallel Lambdas might not work, as this would lead to a race condition.
Can't you use Step Functions? Your EventBridge rule would fire a Step Functions workflow instead of a bare Lambda, and the workflow can keep the state while archiving the records.
I am not certain about DocumentDB, but I believe in MongoDB you can create a change stream with a filter. This way you can have multiple change streams, each acting on a portion (filter) of the data, which allows them to work concurrently on one cluster.
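To make the filtered-stream idea concrete, here is a minimal pymongo sketch. It assumes MongoDB semantics, and the database, collection, field names and the archive_to_s3 helper are hypothetical. Each worker opens its own stream over a disjoint filter and can resume from a persisted token:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
coll = client["mydb"]["events"]                    # hypothetical database/collection

# Only watch documents whose key falls in this worker's range (hypothetical field/range)
pipeline = [{"$match": {"fullDocument.customerId": {"$gte": 0, "$lt": 1000}}}]

resume_token = None  # load the last persisted token here to resume after a restart
with coll.watch(pipeline=pipeline, resume_after=resume_token,
                full_document="updateLookup") as stream:
    for change in stream:
        archive_to_s3(change)               # hypothetical archiving helper
        resume_token = stream.resume_token  # persist this so the next run can resume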
I am currently developing a Beam pipeline using the FlinkRunner and Apache Beam 2.29.
As per the suggestion in Beam File processing patterns, I have an unbounded pipeline listening to a Kafka topic for a CSV filename; once a filename is received, the file is processed through TextIO.readFiles().
We end up with two PCollections, one from the file being processed and the other from a lookup against an external datastore. The PCollections are joined using the Join extension, which forces us to set up some triggering on these two PCollections. So I have defined something like the below for each PCollection, in the hope that the end result following the join would produce some new output every time a new filename arrives from the Kafka topic we are monitoring.
PCollection<KV<String, Map<String, AttributeValue>>> lookupTable = LookupTable.getPspLookupData(p, lookupTableName, lookupTableRegionFilter)
.apply("WindowB", Window.<KV<String, Map<String, AttributeValue>>>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(15))))
.withAllowedLateness(Duration.standardSeconds(5))
.discardingFiredPanes()
);
But it simply does not work more than once. It seems that if I send one or more Kafka messages before the 15 seconds defined in plusDelayOf() have elapsed, the data gets processed, but anything sent past those 15 seconds (from pipeline startup) is never processed and the pipeline is simply "stuck", despite having defined a trigger of Repeatedly.forever...
I have tried numerous combinations and I simply can't get it to work. I would welcome any ideas or suggestions to get this working. It feels like I am missing something basic, but I have been at this for hours.
Thanks,
Serge
I want to make something that monitors a Kafka topic continuously and then executes a batch job when a message comes in (hitting some REST API and storing the response). I set something up with KafkaItemReader; however, it shuts down if it doesn't receive a message for 30 seconds, based on pollTimeout. How can I make it run indefinitely? Since this is not an obvious option, I'm wondering if I am using the right tool for the job.
Likely answer: you are not supposed to do this.
That's correct. Batch processing is about processing finite data sets. If your data source is an infinite stream of records and you want to monitor it continuously, then a streaming solution is more appropriate for your use case.
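As an illustration of the "streaming solution" point, a continuously running consumer loop is only a few lines. This is a hedged sketch using kafka-python rather than the Spring Batch/KafkaItemReader stack from the question, and the topic name, brokers, API URL and store_response helper are all assumptions:

import requests
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "my-topic",                            # hypothetical topic
    bootstrap_servers=["localhost:9092"],  # placeholder brokers
    group_id="monitor-job",
)

# Iterating the consumer blocks indefinitely; there is no poll timeout that shuts it down.
for message in consumer:
    response = requests.get("https://example.com/api", params={"key": message.value})  # hypothetical API call
    store_response(response.json())        # hypothetical persistence step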
I am looking for a way to trigger my Databricks notebook once to process a Kinesis stream, and I am using the following pattern:
import org.apache.spark.sql.streaming.Trigger
// Load your Streaming DataFrame
val sdf = spark.readStream.format("json").schema(my_schema).load("/in/path")
// Perform transformations and then write…
sdf.writeStream.trigger(Trigger.Once).format("delta").start("/out/path")
It looks like this isn't possible with AWS Kinesis, and that's what the Databricks documentation suggests as well. My question is: what else can we do to achieve that?
As you mentioned in the question, Trigger.Once isn't supported for Kinesis.
But you can achieve what you need by adding Kinesis Data Firehose into the picture: it will write the data from Kinesis into an S3 bucket (you can select the format you need, such as Parquet or ORC, or just leave it as JSON), and then you can point the streaming job at that bucket and use Trigger.Once for it, since it is then a normal streaming source (for efficiency it's better to use Auto Loader, which is available on Databricks). Also, to keep costs under control, you can set up a retention policy on the S3 destination to remove or archive files after some period of time, such as a week or a month.
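A rough PySpark sketch of the Databricks half of that setup, assuming the Firehose bucket/prefix and checkpoint location are placeholders and reusing my_schema from the question: Auto Loader (the cloudFiles source) reads whatever Firehose has delivered to S3 so far, and Trigger.Once processes it in a single run and then stops on its own.

# Read whatever Firehose has delivered to S3 so far, using Auto Loader (cloudFiles)
sdf = (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .schema(my_schema)                     # same schema as in the question
            .load("s3://my-firehose-bucket/in/"))  # placeholder bucket/prefix

# Trigger.Once: process everything available in a single run, then stop
(sdf.writeStream
    .trigger(once=True)
    .option("checkpointLocation", "s3://my-firehose-bucket/checkpoints/")  # placeholder
    .format("delta")
    .start("/out/path"))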
A workaround is to stop after X micro-batches, without a trigger. This guarantees a fixed maximum number of rows per run.
The only issue is that if you have millions of rows waiting in the queue, you have no guarantee of processing all of them in a single run.
In Scala you can add a streaming query listener; in Python you can count the number of micro-batches, as in the snippet below.
from time import sleep

s = sdf.writeStream.format("delta").start("/out/path")

# By default spark.sql.streaming.numRecentProgressUpdates=100, so recentProgress keeps up to 100 entries.
# Stop after 10 micro-batches: maxRecordsPerFetch is 10,000 by default, so we consume
# at most 10 x 10,000 = 100,000 messages per run.
while len(s.recentProgress) < 10:
    print("Batches #: " + str(len(s.recentProgress)))
    sleep(10)
s.stop()
You can have more advanced logic that counts the number of messages processed per batch and stops when the queue is empty (the throughput should drop once everything has been consumed, as you will then only get the "real-time" flow, not the history).
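For example, a sketch of that check, reusing the query s started above: it reads numInputRows from the most recent entry in recentProgress (a standard field of the streaming progress report) and stops once a micro-batch has read no new rows, i.e. the backlog has been drained. The zero threshold is an assumption; with a constant "real-time" trickle you may want a small non-zero threshold instead.

while True:
    sleep(10)
    progress = s.recentProgress
    # Stop once at least one micro-batch has completed and the latest one read no new rows
    if progress and progress[-1]["numInputRows"] == 0:
        break
s.stop()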
We are using Debezium + PostgreSQL.
Notice that we get four types of events, for create, read, update and delete: c, r, u and d.
The read type of event is unused by our application. Actually, I could not think of a use case for the 'r' events unless we are doing auditing or mirroring the activities of a transaction.
We are facing difficulties scaling, and I suspect it is because the network is getting hogged by the read-type events.
How do we filter out those events in PostgreSQL itself?
I got a clue from one of the contributors to use snapshot.mode. I guess it is something that has to be set for when Debezium creates a snapshot, but I am unable to figure out how to do that.
It is likely that your database has existed for some time and contains data and changes that have been purged from the logical decoding logs. If you then start using the Debezium PostgreSQL connector to start capturing changes into Kafka, the question becomes what a consumer of the events in Kafka should be able to see.
One scenario is that a consumer should be able to see events for all rows in the database, even those that existed prior to the start of CDC. For example, this allows a consumer to completely reproduce/replicate all of the existing data and keep that data in sync over time. To accomplish this, the Debezium PostgreSQL connector can, on startup, begin by creating a snapshot of the database contents before it starts capturing the changes. This is done atomically, so that even if the snapshot process takes a while to run, the connector will still see all of the events that occurred since the snapshot process was started. These events are represented as "read" events, since in effect the connector is simply reading the existing rows. However, they are identical to "insert" events, so any application could treat reads and inserts in the same way.
On the other hand, if consumers of the events in Kafka do not need to see events for all existing rows, then the connector can be configured to avoid the snapshot and to instead begin by capturing the changes. This may be useful in some scenarios where the entire database state need not be found in Kafka, but instead the goal is to simply capture the changes that are occurring.
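For example, here is a hedged sketch of registering the connector with the initial snapshot disabled, posted to the Kafka Connect REST API from Python (the Connect host, connector name, and database settings are placeholders). With snapshot.mode set to never, no snapshot is taken, so no read ("r") events are produced and only ongoing changes are captured:

import json
import requests

connector = {
    "name": "inventory-connector",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",  # placeholder connection settings
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "dbz",
        "database.dbname": "inventory",
        "database.server.name": "dbserver1",
        # Skip the initial snapshot: only changes from now on are captured,
        # so no read ("r") events are emitted.
        "snapshot.mode": "never",
    },
}

resp = requests.post(
    "http://connect:8083/connectors",  # placeholder Kafka Connect host
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()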
The Debezium PostgreSQL connector will work either way, so you should use the approach that works for how you're consuming the events.