AWS DMS real time replication CDC from Oracle and PostgreSQL to Kinesis on a single thread is taking a lot of time - postgresql

The goal:
Real time CDC from Oracle and PostgreSQL to Kinesis on a single thread/process without much time lag and no record drop.
The system:
We have a system where we are doing a real time CDC from Oracle and PostgreSQL to Kinesis using AWS DMS.
The problem with doing a real time CDC with only one thread is that it takes many hours to replicate the changes to Kinesis when the data grows big(MBs).
Alternate approach:
The approach we took was to pull the real time changes from Oracle and PostgreSQL using multiple threads and push to Kinesis while still using DMS.
The challenge:
We noticed that while pulling data in real time using multiple threads, there is a drop in some records from Oracle and PostgreSQL. This happens in like 1 in 3 million records.
Tried different solutions on the Oracle and PostgreSQL side, talked to AWS and nothing works.
Notes:
We are using Logminner or Binary leader on Oracle and PostgreSQL side.
Is there a solution to this or has anybody tried to build this kind of system? Please let me know.

Related

Transfer data from Kinesis (or s3) to RDS postgres chron job

I'm pretty new to AWS, and I'm trying to find a way to reliably transfer data from a Kinesis stream to an AWS RDS postgres database table. The records will need to undergo small transformations on the way in, like filter (not all records will be inserted, depending on a key), and parsed for an insert into postgres. Currently, the data from the Kinesis stream is being dumped by firehose into S3 buckets as parquet.
I'm a bit lost in the many possible ways there seems to be of doing this, like maybe:
Kinesis streams -> Firehose -> Lambda -> RDS
Kinesis streams -> Firehose -> S3 -> Data Pipeline ETL job -> RDS
Database migration for S3 -> RDS?
AWS Glue?
others...?
In a non serverless world, I would run a chron job every, say, one hour which would take the files in the most recent S3 bucket partition (which is year/month/day/hour), so the latest hour, and filter out the records not needed in RDS, and do a bulk insert the rest into the RDS. I don't want to have a EC2 instance that sits idle 95% of the time to do this. Any advice?
Thanks for the clarification. Doing it in traditional ETL way with servers has some drawbacks. Either you'll need to keep a machine idle most of the time or you'll need to wait every time before the machine is created on demand - exactly as you're saying.
For Firehose, IMO it's interesting when you have a lot of real-time data to ingest. Regarding to AWS Glue, for me it's more like a "managed" Apache Spark, hence if you have some data processing logic to implement in a big amount of batch data, it can be interesting. But according to your description, it's not the case right ?
To sum up, if you think the amount of inserted data will always be still a few mb at a time, for me the simplest solution is the best, i.e. Kinesis -> Lambda -> RDS with maybe another Lambda to backup data on S3 (Kinesis retention period is limited to 7 days). It's especially interesting from the pricing point of view - apparently you have not a lot data, Lambda is executed at demand, for instance by batching 1000 Kinesis records, so it's a good occasion to save some money. Otherwise, if you expect to have more and more data, the use of "Firehose -> Lambda" version seems t be a better fit for me because you don't load the database with a big amount of data at once.

Integration of Kafka in Web Application

I have a java based web application which is using 2 backend database servers of Microsoft SQL (1 server is live database as it is transactional and the other one is reporting database). Lag between transactional and reporting databases is of around 30 minutes and incremental data is loaded using a SQL job which runs every 30 minutes and takes around 20-25 minutes in execution. This job is executing an SSIS package and using this package, data from reporting database is further processed and is stored in HDFS and HBase which is eventually used for analytics.
Now, I want to reduce this lag and to do this, I am thinking of implementing a messaging framework. After doing some research, I learned that Kafka could solve my purpose since Kafka can also work as an ETL tool apart from being a messaging framework.
How should I proceed? should I create topics similar to the table structures in SQL server and perform operations on that? Should I redirect my application to write any change happening in Kafka first and then in Transactional database? Please advise on usage of Kafka considering the mentioned use case.
There's a couple ways to do this that require minimal code, and then there's always the option to write your own code.
(Some coworkers just got finished looking at this, with SQL Server and Oracle, so I know a little about this here)
If you're using the enterprise version of SQL Server you could use Change Data Capture and Confluent Kakfa Connect to read all the changes to the data. This (seems to) require both a Enterprise license and may include some other additional cost (I was fuzzy on the details here. This may have been because we're using an older version of SQL Server or because we have many database servers ).
If you're not / can't use the CDC stuff, Kafka Connect's JDBC support also has a mode where it polls the database for changes. This works best if your records have some kind of timestamp column, but usually this is the case.
A poll only mode without CDC means you won't get every change - ie if you poll every 30 seconds and the record changes twice, you won't get individual messages about this change, but you'll get one message with those two changes, if that makes sense. This is Probably acceptable for your business domain, but something to be aware of.
Anyway, Kafka Connect is pretty cool - it will auto create Kafka topics for you based on your table names, including posting the Avro schemas to Schema Registry. (The topic names are knowable, so if you're in an environment with auto topic creation = false, well you can create the topics manually yourself based on the table names). Starting from no Kafka Connect knowledge it took me maybe 2 hours to figure out enough of the configuration to dump a large SQL Server database to Kafka.
I found additional documentation in a Github repository of a Confluent employee describing all this, with documentation of the settings, etc.
There's always the option of having your web app be a Kafka producer itself, and ignore the lower level database stuff. This may be a better solution, like if a request creates a number of records across the data store, but really it's one related event (an Order may spawn off some LineItem records in your relational database, but the downstream database only cares that an order was made).
On the consumer end (ie "next to" your other database) you could either use Kafka Connect on the other end to pick up changes, maybe even writing a custom plugin if required, or write your own Kafka consumer microservice to put the changes into the other database.

Can i use Amazon Kinesis to connect to amazon redshift for data load in every couple of mins

From lots of sources i am planning to use Amazon kinesis to catch the stream and after certain level of data transformation i want to direct the stream to Redshift Cluster in some table schema. Here i am not sure as is it right way to do this or not ?
From the Kineis documentation i have found that they have direct connector to redshift. However i have also found that Redshift looks better if we take bulk upload as data ware house system needs indexing. So the recommendation was to store all stream to S3 and then COPY command to make bulk push on redshift . Could someone please add some more view ?
When you use the connector library for Kinesis you will be pushing data into Redshift, both through S3 and in batch.
It is true that calling INSERT INTO Redshift is not efficient as you are sending all the data through a single leader node instead of using the parallel power for Redshift that you get when running COPY from S3.
Since Kinesis is designed to handle thousands of events per second, running a COPY every few seconds or minutes will already batch many thousands of records.
If you want to squeeze the juice from Kinesis and Redshift, you can calculate exactly how many shards you need, how many nodes in Redshift you need and how many temporary files in S3 you need to accumulate from Kinisis, before calling the COPY command to Redshift.

Analytics implementation in hadoop

Currently, we have mysql based analytics in place. We read our logs after every 15 mins, process them & add to mysql database.
As our data is growing(In one case, 9 million rows added till now & 0.5 million rows are adding in each month), we are planning to move analytics to no sql database.
As per my study, Hadoop seems to be better fit as we need to process the logs & it can handle very large data set.
However, it would be great if I can get some suggests from experts.
I agree with the other answers and comments. But if you want to evaluate Hadoop option then one solution can be following.
Apache Flume with Avro for log collection, agregation. Flume can ingest data into Hadoop File System (HDFS)
Then you can have Hbase as distributed scalable data store.
with Cloudera Impala on top of hbase you can have a near to real time (streaming) query engine. Impala uses SQL as its query language so it will be beneficial for you.
This is just one option. There can be multiple alternatives e.g. flume + hdfs + hive.
This is probably not a good q. for this forum but I would say that 9 million row and 0.5m per month hardly seems like a good reason to go to noSQL. This is a very small database and your best action would be to scale up the server a little (RAM, more disks, move to SSDs etc.)

CDC (Change Data Capture) in Google Cloud SQL

Any way to capture data changes in a Google Cloud SQL database that can trigger external scripts, for real time data replication (e.g. Cloud SQL instance that replicates changes to a MySQL instance in an office, and viceversa).
A poll solution would work but it wouldn't be "real time" ... if I poll the Cloud SQL instance say every 10 seconds, I would have a max latency of 9 seconds and a huge Cloud SQL invoice for 8,640 reads per day ;).
Thanks!
M
You can now create read-only replicas of your Google Cloud SQL instances in other hosting environments, see these docs.