AWS Kinesis Stream In Detail Review - PostgreSQL

I am new to AWS. I have implemented some functionality in AWS using Java. My requirement is to insert a 50 MB CSV file into an RDS PostgreSQL instance in one go.
I first tried the AWS Lambda service, but a Lambda function is stopped after 5 minutes, so I dropped that approach (a limitation of Lambda).
As a second approach, I wrote a Java Lambda function for an S3 event, which reads the CSV file that lands on S3 and writes it to a Kinesis stream using the PutRecord call. As I understand it, Kinesis can read the CSV file record by record. The Kinesis stream then invokes a second Lambda function, which saves the data to PostgreSQL.
Everything was fine, but my confusion is that only 32,000 records are inserted, while my CSV has 50,000 records. Since the Kinesis stream reads each row as a record, it should invoke the Lambda for each of them, right? So why is it not saving everything?
One more question: my Kinesis stream is configured as below.
Also, in my Lambda I configured the Kinesis trigger as follows.
Is this the correct configuration for my requirement? If I set the batch size to 1, will my function insert all the records? Please share what you know about this; it would be a great help. Thanks in advance!
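For reference, a minimal Java sketch (class and helper names are illustrative, not from the question) of the consumer Lambda described above. The relevant detail for the batch-size question is that a Kinesis trigger hands the Lambda a batch of records per invocation, up to the configured batch size, so the handler must loop over the whole batch:

import java.nio.charset.StandardCharsets;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.KinesisEvent;

public class KinesisToPostgresHandler implements RequestHandler<KinesisEvent, Void> {
    @Override
    public Void handleRequest(KinesisEvent event, Context context) {
        // The event contains up to "batch size" records, not a single record.
        for (KinesisEvent.KinesisEventRecord record : event.getRecords()) {
            String csvRow = StandardCharsets.UTF_8
                    .decode(record.getKinesis().getData())
                    .toString();
            insertIntoPostgres(csvRow); // hypothetical JDBC insert into the RDS table
        }
        return null;
    }

    private void insertIntoPostgres(String csvRow) {
        // JDBC insert into the RDS PostgreSQL table goes here (omitted).
    }
}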

You are exceeding your limits for a single shard.
Review the following document:
Amazon Kinesis Data Streams Limits
Make sure that your code is checking for errors on each AWS call.
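To make the error-checking advice concrete: a single shard accepts at most 1,000 records or 1 MB per second for writes, and a throttled PutRecord call throws ProvisionedThroughputExceededException, so those rows are lost unless the producer Lambda handles the error. A rough sketch, assuming the AWS SDK for Java v1 and a hypothetical helper:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.ProvisionedThroughputExceededException;
import com.amazonaws.services.kinesis.model.PutRecordRequest;

public class CsvRowProducer {
    private static final AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

    // Hypothetical helper: one CSV row -> one Kinesis record, retried on throttling.
    static void putRow(String streamName, String partitionKey, String csvRow) throws InterruptedException {
        PutRecordRequest request = new PutRecordRequest()
                .withStreamName(streamName)
                .withPartitionKey(partitionKey)
                .withData(ByteBuffer.wrap(csvRow.getBytes(StandardCharsets.UTF_8)));
        for (int attempt = 0; attempt < 5; attempt++) {
            try {
                kinesis.putRecord(request);
                return; // success
            } catch (ProvisionedThroughputExceededException e) {
                // Throttled by the shard limit; back off and retry instead of dropping the row.
                Thread.sleep((1L << attempt) * 100);
            }
        }
        throw new RuntimeException("Record still throttled after retries: " + csvRow);
    }
}

If throttled rows are being dropped silently, that could explain ending up with 32,000 of the 50,000 records.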

Related

Scale read from change streams in DocumentDB

I am using DocumentDB change streams to read time-sequenced events with Lambda; an EventBridge rule fires every 10 minutes to invoke the Lambda and archive the data to S3. Is there a way to scale the reads from the change stream using a resume token and a polling model? With a single Lambda reading from the change stream to archive, my process falls way behind. Our application writes a couple of million records during peak periods, but my archival process archives at most 500k records to S3. Is there a way to scale this process? Running parallel Lambdas might not work, as it would lead to race conditions.
Can't you use Step Functions? Your EventBridge rule fires a Step Functions state machine (which runs the Lambda), and the state machine can keep the state while archiving the records.
I am not certain about DocumentDB, but I believe that in MongoDB you can create a change stream with a filter. That way you can have multiple change streams, each acting on a portion (filter) of the data, which allows multiple change streams to work concurrently on one cluster.
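A rough sketch of that filtered-change-stream idea with the MongoDB Java driver (the URI, collection, and filter field are made up; check what DocumentDB actually supports before relying on it):

import java.util.Arrays;
import java.util.List;
import org.bson.Document;
import org.bson.conversions.Bson;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.changestream.ChangeStreamDocument;

public class FilteredChangeStreamWorker {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) { // placeholder URI
            MongoCollection<Document> events = client.getDatabase("app").getCollection("events"); // placeholder names

            // Each worker watches only its own slice of the data (here a hypothetical
            // "partition" field), so several change streams can run concurrently
            // without two workers archiving the same documents.
            List<Bson> pipeline = Arrays.asList(Aggregates.match(Filters.eq("fullDocument.partition", 0)));

            for (ChangeStreamDocument<Document> change : events.watch(pipeline)) {
                // archive change.getFullDocument() to S3 here (omitted)
            }
        }
    }
}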

Can the Kafka Connect JDBC Sink dump raw data?

Partly for testing and debugging, but also to work around an issue we are seeing in a topic where we are unable to change the producer, I would like to be able to store the value as a string in a CLOB in a database table.
I have this working as a Java-based consumer, but I am looking at whether this could be achieved using Kafka Connect.
Everything I have read says you need a schema, the reasoning being that otherwise the sink would not know how to map the data into columns (which makes sense). But I don't want to do any processing of the data (which could be JSON but might just be text); I just want to treat the whole value as a string and load it into one column.
Is there any way this can be done within the Connect config, or am I looking at adding extra processing to update the message (in which case the Java client is probably going to end up being simpler)?
No, the JDBC Sink connector requires a schema to work. You could modify the source code to add in this behaviour.
I would personally try to stick with Kafka Connect for streaming data to a database, since it does all the difficult stuff (scale-out, restarts, etc.) very well. Depending on the processing you're talking about, it could well be that a Single Message Transform would be very applicable, since SMTs fit into the Kafka Connect pipeline. For more complex processing, there are Kafka Streams or ksqlDB.
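For comparison, the plain-consumer route the question mentions is short enough to sketch here: read the value as an opaque string and insert it into a single CLOB-style column over JDBC. The broker, topic, table, and connection details below are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class RawValueToClob {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "raw-dump");                 // placeholder
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/db", "user", "pwd"); // placeholder URL
             PreparedStatement insert = conn.prepareStatement("INSERT INTO raw_events (raw_value) VALUES (?)")) { // placeholder table
            consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    insert.setString(1, record.value()); // whole value as one string, no parsing into columns
                    insert.executeUpdate();
                }
            }
        }
    }
}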

Calling Trigger once in Databricks to process Kinesis Stream

I am looking for a way to trigger my Databricks notebook once to process a Kinesis stream, using the following pattern:
import org.apache.spark.sql.streaming.Trigger
// Load your Streaming DataFrame
val sdf = spark.readStream.format("json").schema(my_schema).load("/in/path")
// Perform transformations and then write…
sdf.writeStream.trigger(Trigger.Once).format("delta").start("/out/path")
It looks like this is not possible with AWS Kinesis, and that's what the Databricks documentation suggests as well. My question is: what else can we do to achieve this?
As you mentioned in the question, Trigger.Once isn't supported for Kinesis.
But you can achieve what you need by adding Kinesis Data Firehose to the picture: it will write the data from Kinesis into an S3 bucket (you can choose the format you need, such as Parquet or ORC, or just leave it as JSON). You can then point the streaming job at that bucket and use Trigger.Once on it, as it's a normal file-based streaming source (for efficiency it's better to use Auto Loader, which is available on Databricks). Also, to keep costs under control, you can set a retention policy on the S3 destination to remove or archive files after some period of time, such as a week or a month.
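A rough sketch of that Firehose-to-S3 variant, shown in Java to match the other sketches here (the bucket paths and schema are placeholders; on Databricks you could swap the "json" source for Auto Loader's "cloudFiles"):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.Trigger;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class FirehoseBatchIngest {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().getOrCreate();
        StructType schema = new StructType().add("id", DataTypes.StringType); // placeholder schema

        // Read whatever Firehose has delivered to the bucket so far.
        Dataset<Row> sdf = spark.readStream()
                .format("json")
                .schema(schema)
                .load("s3://my-firehose-bucket/in/"); // placeholder path

        // Trigger.Once processes the available files and then stops; this works because
        // the S3 file source (unlike Kinesis) supports it.
        sdf.writeStream()
                .trigger(Trigger.Once())
                .format("delta")
                .option("checkpointLocation", "s3://my-firehose-bucket/_checkpoint/") // placeholder
                .start("s3://my-firehose-bucket/out/")
                .awaitTermination();
    }
}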
A workaround is to stop after X runs, without a trigger. It guarantees a fixed maximum number of rows per run.
The only issue is that if you have millions of rows waiting in the queue, there is no guarantee you will process all of them.
In Scala you can add an event listener; in Python, count the number of batches.
from time import sleep

s = sdf.writeStream.format("delta").start("/out/path")

# By default spark.sql.streaming.numRecentProgressUpdates=100, so recentProgress keeps the last 100 updates.
# Stop after 10 micro-batches. maxRecordsPerFetch is 10,000 by default, so we consume
# at most 10 x 10,000 = 100,000 messages per run.
while len(s.recentProgress) < 10:
    print("Batches #: " + str(len(s.recentProgress)))
    sleep(10)
s.stop()
You can use more advanced logic that counts the number of messages processed per batch and stops when the queue is empty (the throughput should drop once the backlog is consumed, as you'll then only get the "real-time" flow, not the history).
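The event-listener idea mentioned for Scala could look roughly like this (written in Java to match the other sketches; the Scala API is the same). It stops the query once a micro-batch reports zero new input rows, i.e. once the backlog is drained; treat it as a sketch with placeholder paths and schema, not production code:

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryListener;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class StopWhenDrained {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().getOrCreate();
        StructType schema = new StructType().add("id", DataTypes.StringType); // placeholder schema

        StreamingQuery query = spark.readStream().format("json").schema(schema).load("/in/path")
                .writeStream().format("delta")
                .option("checkpointLocation", "/out/path/_checkpoint") // placeholder
                .start("/out/path");

        spark.streams().addListener(new StreamingQueryListener() {
            @Override
            public void onQueryStarted(QueryStartedEvent event) { }

            @Override
            public void onQueryProgress(QueryProgressEvent event) {
                // Zero new rows in a micro-batch means the backlog has been consumed
                // and only the "real-time" trickle is left, so end this run.
                if (event.progress().numInputRows() == 0) {
                    try {
                        query.stop();
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }
            }

            @Override
            public void onQueryTerminated(QueryTerminatedEvent event) { }
        });

        query.awaitTermination();
    }
}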

Databricks Scala: How to stream result from sql select

I need to send data from a Databricks delta table into Azure Event Hubs.
The data will be selected with a SQL select:
spark.sql("SELECT [columns] FROM table WHERE [where clause]")
This select will return many, many rows, and afterwards I will apply some transformations (mainly so the data conforms to the Event Hubs event data message format).
At the end I will send it to Event Hubs.
As far as I can tell, at the moment of writing I need to use "writeStream", but is this enough? How can I control how many messages are sent per batch? Do I even need to care about it, or does the library handle it?
Another question I have: from the moment I use "writeStream", the command hangs in a running/streaming state forever. Is this correct, or am I just not being patient enough? If I am right, how can I stop it (in a non-manual way) after all the data has been sent?
Notes:
This will be running in a job that is to be triggered manually
The library I use for the Event Hubs connection is com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.14.1
Once you have the final records you want to save to Event Hubs, after your write command you need to call the .start() method, which actually starts the stream writing data to Event Hubs.
Also, if your job fails, you need to stop the SparkContext using sc.stop() or spark.sparkContext.stop().
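Since the data comes from a plain spark.sql select (a batch DataFrame), one option worth considering is the connector's batch write rather than writeStream: the job then ends by itself once all rows are sent, and batching to Event Hubs is handled by the connector. A rough sketch in Java; the table, columns, and connection string are placeholders, and option names can differ between connector versions, so check the library's documentation:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.struct;
import static org.apache.spark.sql.functions.to_json;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DeltaToEventHubs {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().getOrCreate();

        // Placeholder select; stands in for the [columns]/[where clause] query in the question.
        Dataset<Row> rows = spark.sql("SELECT id, payload FROM my_table");

        // The connector expects the event payload in a single column named "body".
        Dataset<Row> events = rows.select(to_json(struct(col("id"), col("payload"))).alias("body"));

        // Batch write: finishes when every row has been sent, so nothing stays
        // hanging in a running/streaming state.
        events.write()
                .format("eventhubs")
                .option("eventhubs.connectionString", "<connection string>") // placeholder; some versions expect it encrypted
                .save();
    }
}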

Data streamed from Kafka to Postgres and missing seconds later

I am trying to save data from a local Kafka instance to local Postgres with Spark Streaming. I have configured all connections and parameters, and data actually gets to the database. However, it stays there only for a couple of seconds; after that, the table simply becomes empty. If I stop the app as soon as there is some data in Postgres, the data persists, so I suppose I have missed some parameter for streaming in Spark or something in the Kafka configuration files. The code is in Java, not Scala, so there is a Dataset instead of a DataFrame.
I tried setting spark.driver.allowMultipleContexts to true, but this has nothing to do with contexts. When I run a count on the database while the complete data set is streaming in the background, there are always about 1,700 records, which suggests there might be some parameter controlling the batch size.
censusRecordJavaDStream.map(e -> {
    Row row = RowFactory.create(e.getAllValues());
    return row;
}).foreachRDD(rdd -> {
    Dataset<Row> censusDataSet = spark.createDataFrame(rdd, CensusRecord.getStructType());
    censusDataSet
        .write()
        .mode(SaveMode.Overwrite)
        .jdbc("jdbc:postgresql:postgres", "census.census", connectionProperties);
});
My goal is to stream data from Kafka and save it to Postgres. Each record has a unique ID, which is used as the key in Kafka, so there should be no conflicts regarding primary keys or duplicate entries. For current testing purposes, I am using a small subset of about 100 records; the complete dataset is over 300 MB.
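One thing worth checking in the snippet above: SaveMode.Overwrite re-creates the target table on every foreachRDD call, so each micro-batch (including an empty one) replaces whatever the previous batch wrote, which would match both the ~1,700-record counts and the table ending up empty. A minimal variant of the same write using Append instead (identifiers as in the question) keeps accumulating rows:

censusRecordJavaDStream.map(e -> RowFactory.create(e.getAllValues()))
        .foreachRDD(rdd -> {
            Dataset<Row> censusDataSet = spark.createDataFrame(rdd, CensusRecord.getStructType());
            censusDataSet
                .write()
                .mode(SaveMode.Append) // append each micro-batch instead of replacing the table
                .jdbc("jdbc:postgresql:postgres", "census.census", connectionProperties);
        });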