I'm pretty new to AWS, and I'm trying to find a way to reliably transfer data from a Kinesis stream to an AWS RDS PostgreSQL table. The records will need small transformations on the way in: filtering (not all records will be inserted, depending on a key) and parsing for the insert into Postgres. Currently, the data from the Kinesis stream is being dumped by Firehose into S3 buckets as Parquet.
I'm a bit lost among the many possible ways of doing this, like maybe:
Kinesis streams -> Firehose -> Lambda -> RDS
Kinesis streams -> Firehose -> S3 -> Data Pipeline ETL job -> RDS
Database migration for S3 -> RDS?
AWS Glue?
others...?
In a non-serverless world, I would run a cron job every hour, say, which would take the files in the most recent S3 partition (partitions are year/month/day/hour, so the latest hour), filter out the records not needed in RDS, and bulk insert the rest into RDS. I don't want to have an EC2 instance that sits idle 95% of the time to do this. Any advice?
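For illustration, the "latest hour" partition described above could be located with a small helper like the one below. This is only a sketch: the function name and the `base` prefix are hypothetical, and it assumes the partitions are written in UTC with a zero-padded year/month/day/hour layout.

```python
from datetime import datetime, timedelta, timezone


def previous_hour_prefix(now: datetime, base: str = "") -> str:
    """Return the S3 key prefix of the most recent complete hourly partition.

    Assumes keys look like <base>YYYY/MM/DD/HH/... in UTC (an assumption,
    check how your delivery stream is actually configured).
    """
    # Step back one hour so we only read a partition that is no longer being written.
    last_hour = now.astimezone(timezone.utc) - timedelta(hours=1)
    return f"{base}{last_hour:%Y/%m/%d/%H}/"


if __name__ == "__main__":
    print(previous_hour_prefix(datetime.now(timezone.utc), base="events/"))
```

A scheduled job, whatever runs it, would list the objects under this prefix, filter them, and bulk insert the remainder.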
Thanks for the clarification. Doing it the traditional ETL way with servers has some drawbacks. Either you need to keep a machine idle most of the time, or you need to wait each time for a machine to be created on demand, exactly as you say.
As for Firehose, IMO it's interesting when you have a lot of real-time data to ingest. As for AWS Glue, to me it's more like a "managed" Apache Spark, so it can be interesting if you have data processing logic to run over a large amount of batch data. But according to your description, that's not the case, right?
To sum up, if you think the amount of inserted data will always stay at a few MB at a time, for me the simplest solution is the best, i.e. Kinesis -> Lambda -> RDS, with maybe another Lambda to back up the data to S3 (Kinesis retention is limited: 24 hours by default, and extended retention costs extra). It's especially interesting from the pricing point of view: apparently you don't have a lot of data, and Lambda is executed on demand, for instance on batches of 1000 Kinesis records, so it's a good occasion to save some money. Otherwise, if you expect to have more and more data, the "Firehose -> Lambda" version seems to be a better fit to me, because you don't load the database with a big amount of data at once.
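A minimal sketch of what the Kinesis -> Lambda -> RDS function might look like. The table name `events`, the `id` and `type` fields, and the `WANTED_TYPES` filter are all hypothetical, and the payloads are assumed to be JSON. The insert helper expects a DB-API connection that uses `%s` placeholders, such as one opened with psycopg2.

```python
import base64
import json

WANTED_TYPES = {"order", "payment"}  # hypothetical filter values


def extract_rows(event, wanted=WANTED_TYPES):
    """Decode a Kinesis batch and keep only the records we want to insert."""
    rows = []
    for record in event.get("Records", []):
        # Lambda delivers each Kinesis payload base64-encoded under kinesis.data.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload.get("type") in wanted:
            rows.append((payload["id"], payload["type"], json.dumps(payload)))
    return rows


def insert_rows(conn, rows):
    """Insert the batch in one transaction using a DB-API connection."""
    if not rows:
        return
    with conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO events (id, type, body) VALUES (%s, %s, %s)", rows
        )
    conn.commit()
```

The Lambda handler itself would then just call `insert_rows(conn, extract_rows(event))`, with the connection created outside the handler so it can be reused between invocations.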
The goal:
Real time CDC from Oracle and PostgreSQL to Kinesis on a single thread/process without much time lag and no record drop.
The system:
We have a system where we are doing a real time CDC from Oracle and PostgreSQL to Kinesis using AWS DMS.
The problem with doing real-time CDC with only one thread is that it takes many hours to replicate the changes to Kinesis when the data grows large (MBs).
Alternate approach:
The approach we took was to pull the real time changes from Oracle and PostgreSQL using multiple threads and push to Kinesis while still using DMS.
The challenge:
We noticed that while pulling data in real time using multiple threads, some records from Oracle and PostgreSQL are dropped. This happens for roughly 1 in 3 million records.
We tried different solutions on the Oracle and PostgreSQL side and talked to AWS, but nothing has worked.
Notes:
We are using LogMiner or Binary Reader on the Oracle and PostgreSQL side.
Is there a solution to this or has anybody tried to build this kind of system? Please let me know.
As part of our architecture, we have Kinesis streams that will send data to Redshift. The current plan is to create an external schema on top of the Kinesis streams and then use materialized views to persist the data with minimal transformations as needed.
However, in addition to this, we need to perform a series of quite complex transformations on this data and load it into target tables. I was thinking about using stored procedures to perform these transformations and load the results into a target table. So the flow is: Kinesis Streams -> External View (real time) -> Batch Processing (materialized view and stored procedure).
To call the stored procedures (SPs) on a regular schedule, the idea was to schedule the SQL queries. Being fairly new to AWS Redshift and Kinesis streaming ingestion, and still exploring the available options, I would like to get thoughts on the above approach.
I also understand that stored procedures have limitations (https://docs.aws.amazon.com/redshift/latest/dg/stored-procedure-constraints.html) and that scheduling SQL queries might not allow a good orchestration flow. Hence I would like to understand what alternative methods are available to implement the above, specifically once the data is available in the external view.
I have an RDS PostgreSQL instance and my own S3 bucket. In my project I will upload some Excel files to S3. When a file lands in S3, I need to read the Excel file and store its contents in the DB.
I created a Lambda function in Java and added an S3 trigger to it. When a file lands in S3, my Lambda function is invoked automatically, reads the file, and saves it into the DB.
But the problem is that in my case a file can be more than 100 MB. The Lambda runs for only 5 minutes.
So I cannot save the whole file to the DB. I have heard of Kinesis in AWS. As a newbie to AWS I don't know how to use it. Is there anyone who can help me with this?
Kinesis is probably not the solution to your problem.
The limiting factor in your case is the execution time limit of the Lambda function, which tops out at 5 minutes.
To get past that execution time limit, what I would do is use an SQS queue and an EC2 instance. You configure your bucket to publish events to an SQS queue, so that an event is sent whenever a file lands in the bucket. See examples here on how to do that.
The EC2 instance periodically polls the SQS queue for new events. When a new event is put in the queue, the EC2 instance retrieves the event, and copies the S3 file's data into your DB.
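As a rough sketch of the polling side, assuming the bucket sends standard S3 event notifications straight to the queue: the helper below pulls the bucket and key out of a message body, and `poll_forever` shows one way the EC2 worker could drive it with boto3. The function names and the `handle_object` callback are hypothetical; the callback stands for whatever code reads the file and writes it to the DB.

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_message(body: str):
    """Return (bucket, key) pairs from an S3 event notification body."""
    notification = json.loads(body)
    return [
        # Object keys arrive URL-encoded in S3 notifications (spaces become '+').
        (r["s3"]["bucket"]["name"], unquote_plus(r["s3"]["object"]["key"]))
        for r in notification.get("Records", [])
    ]


def poll_forever(queue_url, handle_object):
    """Long-poll the queue and hand each uploaded object to handle_object."""
    import boto3  # third-party; imported here so the parser above has no dependencies

    sqs = boto3.client("sqs")
    while True:
        resp = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            for bucket, key in s3_objects_from_message(msg["Body"]):
                handle_object(bucket, key)
            # Delete only after processing, so a crash leaves the message to be retried.
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```

Because the message is deleted only after `handle_object` returns, a file can be processed more than once after a failure, so the DB load should be safe to repeat.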
EDIT: Lambda now stops at 15 minutes of execution instead of 5.
I am planning to use Amazon Kinesis to capture a stream from lots of sources and, after some data transformation, direct the stream to a table schema in a Redshift cluster. I am not sure whether this is the right way to do it.
From the Kinesis documentation I found that there is a direct connector to Redshift. However, I have also found that Redshift does better with bulk uploads, since a data warehouse system needs indexing. So the recommendation was to store the whole stream in S3 and then use the COPY command to bulk load it into Redshift. Could someone please share some more views on this?
When you use the connector library for Kinesis you will be pushing data into Redshift, both through S3 and in batch.
It is true that calling INSERT INTO on Redshift is not efficient, as you are sending all the data through a single leader node instead of using the parallel power of Redshift that you get when running COPY from S3.
Since Kinesis is designed to handle thousands of events per second, running a COPY every few seconds or minutes will already batch many thousands of records.
If you want to squeeze the juice from Kinesis and Redshift, you can calculate exactly how many shards you need, how many nodes in Redshift you need, and how many temporary files in S3 you need to accumulate from Kinesis before calling the COPY command to Redshift.
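As an example of the shard part of that calculation, here is a small sketch. It assumes the per-shard write limits of 1,000 records per second and 1 MB per second (taken here as 1,000,000 bytes), which you should check against the current Kinesis quotas; the function name is made up for the example.

```python
import math


def shards_needed(records_per_sec, avg_record_bytes,
                  max_records=1000, max_bytes=1_000_000):
    """Estimate the shard count from the ingest rate and average record size."""
    by_count = math.ceil(records_per_sec / max_records)
    by_bytes = math.ceil(records_per_sec * avg_record_bytes / max_bytes)
    # Whichever limit is hit first decides the number of shards.
    return max(1, by_count, by_bytes)


print(shards_needed(5000, 500))     # 5: limited by record count
print(shards_needed(100, 50_000))   # 5: limited by throughput
```

This only covers the write side; the number of Redshift nodes and S3 files per COPY depends on your cluster and is not modeled here.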
Currently, we have MySQL-based analytics in place. We read our logs every 15 minutes, process them, and add them to a MySQL database.
As our data is growing (in one case, 9 million rows have been added so far and 0.5 million rows are added each month), we are planning to move analytics to a NoSQL database.
From what I have read, Hadoop seems to be a better fit, as we need to process the logs and it can handle very large data sets.
However, it would be great if I could get some suggestions from experts.
I agree with the other answers and comments. But if you want to evaluate the Hadoop option, then one solution could be the following.
Apache Flume with Avro for log collection and aggregation. Flume can ingest data into the Hadoop file system (HDFS).
Then you can have HBase as a distributed, scalable data store.
With Cloudera Impala on top of HBase you can have a near-real-time (streaming) query engine. Impala uses SQL as its query language, so it will be beneficial for you.
This is just one option. There can be multiple alternatives, e.g. Flume + HDFS + Hive.
This is probably not a good question for this forum, but I would say that 9 million rows plus 0.5 million per month hardly seems like a good reason to go to NoSQL. This is a very small database, and your best action would be to scale up the server a little (more RAM, more disks, a move to SSDs, etc.).
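To put those numbers in perspective, here is a quick projection. It assumes growth stays linear at the rate given in the question, which may not hold.

```python
def projected_rows(months, current=9_000_000, per_month=500_000):
    """Row count after the given number of months of linear growth."""
    return current + per_month * months


print(projected_rows(60))  # 39000000 rows after five years
```

Even five years out, that is tens of millions of rows.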