I am looking for a way to trigger my Databricks notebook once to process a Kinesis stream, using the following pattern:
import org.apache.spark.sql.streaming.Trigger
// Load your Streaming DataFrame
val sdf = spark.readStream.format("json").schema(my_schema).load("/in/path")
// Perform transformations and then write…
sdf.writeStream.trigger(Trigger.Once).format("delta").start("/out/path")
It looks like this isn't possible with AWS Kinesis, and that's what the Databricks documentation suggests as well. My question is: what else can we do to achieve that?
As you mentioned in the question, Trigger.Once isn't supported for Kinesis.
But you can achieve what you need by adding Kinesis Data Firehose into the picture: it will write the data from Kinesis into an S3 bucket (you can select the format you need, such as Parquet or ORC, or just leave it as JSON). You can then point the streaming job at that bucket and use Trigger.Once for it, since it's a normal file-based streaming source (for efficiency it's better to use Auto Loader, which is available on Databricks). Also, to keep costs under control, you can set up a retention policy on your S3 destination to remove or archive files after some period of time, such as a week or a month.
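For illustration, here is a minimal sketch of that second stage in PySpark, assuming Firehose delivers JSON files to a landing prefix; the bucket names, checkpoint path, and my_schema are placeholders rather than anything from the original setup.
# Read the Firehose output with Auto Loader and drain it with Trigger.Once
df = (spark.readStream
      .format("cloudFiles")                      # Databricks Auto Loader
      .option("cloudFiles.format", "json")
      .schema(my_schema)
      .load("s3://my-bucket/firehose-landing/"))
(df.writeStream
   .trigger(once=True)                           # process everything available, then stop
   .option("checkpointLocation", "s3://my-bucket/checkpoints/firehose-landing")
   .format("delta")
   .start("s3://my-bucket/delta/out"))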
A workaround is to stop the query after X micro-batches, without a trigger. It guarantees a fixed maximum number of records per run.
The only issue is that if you have millions of rows waiting in the queue, you have no guarantee of processing all of them.
In Scala you can add a streaming query listener; in Python you can count the number of batches.
from time import sleep
s = sdf.writeStream.format("delta").start("/out/path")
# By default spark.sql.streaming.numRecentProgressUpdates=100 keeps the last 100 progress updates in the list. Stop after 10 micro-batches.
# maxRecordsPerFetch is 10,000 by default, so we will consume at most 10 x 10,000 = 100,000 messages per run.
while len(s.recentProgress) < 10:
    print("Batches #: " + str(len(s.recentProgress)))
    sleep(10)
s.stop()
You can have more advanced logic that counts the number of messages processed per batch and stops when the queue is empty (the throughput should drop once it's all consumed, as you'll only get the "real-time" flow, not the history).
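A rough sketch of that idea (the 10-second sleep is an arbitrary choice): inspect the latest progress update and stop once a micro-batch processes no input rows.
from time import sleep
s = sdf.writeStream.format("delta").start("/out/path")
while True:
    sleep(10)
    p = s.lastProgress                  # dict describing the most recent micro-batch, or None
    if p is not None and p["numInputRows"] == 0:
        break                           # nothing left in the queue, we've caught up
s.stop()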
I am utilizing change streams from DocumentDB to read timely sequenced events, with EventBridge triggering a Lambda every 10 minutes to archive the data to S3. Is there a way to scale reading from the change stream using a resume token and a polling model? If a single Lambda tries to read from the change stream and archive, my process falls way behind. As our application writes a couple of million records during peak periods, my archival process manages to archive at most 500k records to S3. Is there a way to scale this process? Running parallel Lambdas might not work, as this will lead to race conditions.
Can't you use Step Functions? Your EventBridge rule starts a Step Functions state machine (rather than a bare Lambda), and it can keep the state while archiving the records.
I am not certain about DocumentDB, but I believe in MongoDB you can create a change stream with a filter. That way you can have multiple change streams, each acting on a portion (filter) of the data, which allows them to work concurrently on one cluster.
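For illustration, a hedged sketch of a filtered change stream using pymongo (the MongoDB driver; DocumentDB compatibility, the connection string, and the field used to partition the data are all assumptions):
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")          # placeholder connection string
collection = client["mydb"]["events"]
# Each worker watches only its own slice of the data; the partitionKey field is hypothetical.
pipeline = [
    {"$match": {"operationType": "insert",
                "fullDocument.partitionKey": {"$in": [0, 1]}}}
]
with collection.watch(pipeline, full_document="updateLookup") as stream:
    for change in stream:
        # archive change["fullDocument"] to S3 here; printing is just a stand-in
        print(change["fullDocument"]["_id"])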
I want to make something that monitors a Kafka topic continuously and then executes some batch job when a message comes in (hitting some REST API and storing the response). I set something up with KafkaItemReader; however, it shuts down if it doesn't receive a message for 30 seconds, based on pollTimeout. How can I make it run indefinitely? Since this is not an obvious option, I'm wondering whether I am using the right tool for the job.
Likely answer: you are not supposed to do this.
That's correct. Batch processing is about processing finite data sets. If your data source is an infinite stream of records and you want to monitor it continuously, then a streaming solution is more appropriate for your use case.
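To make that concrete, here is one shape such a long-running consumer could take, sketched with kafka-python and requests rather than Spring Batch; the topic, group id, and URL are placeholders.
import json
import requests
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "my-topic",
    bootstrap_servers="localhost:9092",
    group_id="rest-forwarder",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:                 # blocks indefinitely waiting for new records
    response = requests.post("https://example.com/api", json=message.value)
    # store the response somewhere durable; printing is just a stand-in
    print(message.offset, response.status_code)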
I have a requirement to skip records that are created earlier than 10 seconds or later than 20 seconds after a gap occurs in the incoming data.
(A gap is said to occur when event-time1 - event-time2 > 3 seconds.)
The resulting data is used to calculate an average or median within a time window.
Is this possible with Kinesis Analytics, Dataflow, the Flink API, or some other solution that works?
If I understand correctly, you want to find the median and average of records that are created between 10 and 20 seconds after a gap of at least 3 seconds.
Using Flink (or Kinesis Analytics, which is a managed Flink service), you could do that with session windows, or with a ProcessFunction. Process functions are more flexible, and are capable of handling pretty much anything you might need. However, in this case, session windows are probably simpler, especially if you are willing to wait until a session ends (i.e., until the next gap) to get the results. You could avoid this delay by implementing a custom window Trigger.
window tutorial
process function tutorial
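This is not Flink code, but to make the requirement concrete, here is the gap/window logic expressed in plain Python; the record layout and sample data are made up.
from statistics import mean, median

def sessions_with_gap(records, gap_seconds=3.0):
    """Split (event_time, value) pairs, sorted by event_time, into sessions separated by gaps."""
    session = []
    for event_time, value in records:
        if session and event_time - session[-1][0] > gap_seconds:
            yield session
            session = []
        session.append((event_time, value))
    if session:
        yield session

def stats_10_to_20s_after_gap(records):
    results = []
    for session in sessions_with_gap(records):
        start = session[0][0]                       # first event after the gap
        window = [v for t, v in session if 10 <= t - start <= 20]
        if window:
            results.append((mean(window), median(window)))
    return results

# Two sessions separated by a 15-second gap; only values 10-20s into each session are kept.
records = [(t, float(t)) for t in range(0, 26)] + [(t, float(t)) for t in range(40, 66)]
print(stats_10_to_20s_after_gap(records))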
I need to send data from a Databricks Delta table into Azure Event Hubs.
The data will be selected with a SQL select:
spark.sql("SELECT [columns] FROM table WHERE [where clause]")
This select will return many, many rows, and after it I will apply some transformations (mainly so the data conforms to the Event Hubs event data message).
At the end I will send it to Event Hubs.
As far as I can tell, at the moment of writing I need to use "writeStream", but is this enough? How can I control how many messages are sent per batch? Do I even need to care about that, or does the library handle it?
Another question I have: from the moment I use "writeStream", the command hangs in a running/streaming state forever. Is this correct, or am I just not being patient enough? If I'm correct, how can I stop it (in a non-manual way) after sending all the data?
Notes:
This will be running in a job that is to be triggered manually
The library I use for the Event Hubs connection is com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.14.1
Once you have the final records that you want to save to Event Hubs, then after your write command you need to call the .start() method, which will enable your stream to write the data out to Event Hubs.
Also, if your job fails, you need to stop your SparkContext using sc.stop() or spark.sparkContext.stop()
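Since the data comes from a plain spark.sql(...) DataFrame rather than a stream, another option (not necessarily what this answer intended) is a batch write with the same connector, which returns once everything has been sent instead of running forever. A minimal sketch in PySpark, assuming the azure-eventhubs-spark connector is attached to the cluster; the connection string and query below are placeholders.
from pyspark.sql.functions import to_json, struct

connection_string = "Endpoint=sb://<namespace>.servicebus.windows.net/;..."   # placeholder
# Newer connector versions expect the string to be encrypted with EventHubsUtils.encrypt;
# older ones (such as 2.3.14.1) accept it as-is.
eh_conf = {"eventhubs.connectionString": connection_string}

df = spark.sql("SELECT * FROM my_table WHERE some_column = 'x'")               # hypothetical query
# The connector expects a single string (or binary) column named "body".
(df.select(to_json(struct(*df.columns)).alias("body"))
   .write
   .format("eventhubs")
   .options(**eh_conf)
   .save())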
I am new to AWS. I have implemented some functionality in AWS using Java. My requirement is to insert a 50 MB CSV into an RDS PostgreSQL instance in one go.
I tried the AWS Lambda service, but Lambda stops after 5 minutes, so I dropped that approach (a limitation of Lambda functions).
As a second step, I wrote a Java Lambda function triggered by an S3 event, which reads the CSV file that lands on S3 and writes it to a Kinesis stream using the PutRecord command. According to my understanding, Kinesis is capable of reading the CSV file record by record. This Kinesis stream invokes a second Lambda function which saves the data to PostgreSQL.
Everything was fine, but my confusion is that only 32,000 records are being inserted, while I have 50,000 records in my CSV. As I understand it, the Kinesis stream reads each row as a record, so it should invoke the Lambda separately each time, right? So why is it not saving everything?
One more question: my Kinesis stream is configured like below.
Also, in my Lambda I configured Kinesis as
Is this the correct configuration for my requirement? If I set the batch size to 1, will my function insert all of the records? Please let me know what you know about this. It would be a great help. Thanks in advance!
You are exceeding your limits for a single shard.
Review the following document:
Amazon Kinesis Data Streams Limits
Make sure that your code is checking for errors on each AWS call.
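For example, here is a hedged sketch (boto3/Python rather than the Java SDK the question uses) of checking for per-record failures: a PutRecords call can succeed as a whole while individual records are throttled with ProvisionedThroughputExceededException, which silently drops rows unless you retry them.
import time
import boto3

kinesis = boto3.client("kinesis")

def put_with_retry(records, stream_name="my-stream", max_attempts=5):   # stream name is a placeholder
    attempt = 0
    while records and attempt < max_attempts:
        resp = kinesis.put_records(StreamName=stream_name, Records=records)
        if resp["FailedRecordCount"] == 0:
            return
        # Keep only the records that failed and retry them after a backoff.
        records = [rec for rec, result in zip(records, resp["Records"])
                   if "ErrorCode" in result]
        attempt += 1
        time.sleep(2 ** attempt)
    if records:
        raise RuntimeError(str(len(records)) + " records could not be written")

put_with_retry([{"Data": b"row-1,foo", "PartitionKey": "pk-1"},
                {"Data": b"row-2,bar", "PartitionKey": "pk-2"}])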