DB2 throughput when PaaS is offline?

I'm curious why a number of statements per hour are reported when the PaaS is offline and nothing is actually accessing the Cloud DB2 database. Does that mean anything below 120 statements/min or 250k rows read/min should be subtracted to get the real value?
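Some of that baseline is likely Db2's own monitoring and maintenance activity rather than your application. One way to check is to look at per-connection metrics while the PaaS is offline; below is a minimal sketch (not an official recipe) using the ibm_db Python driver and the MON_GET_CONNECTION table function, with placeholder connection details.

```python
# Sketch: list per-connection activity on the Cloud DB2 database while the
# application is offline, to see which clients account for the baseline
# statements and rows read. Connection details are placeholders.
import ibm_db

conn = ibm_db.connect(
    "DATABASE=BLUDB;HOSTNAME=<host>;PORT=50000;PROTOCOL=TCPIP;"
    "UID=<user>;PWD=<password>;",
    "", "")

sql = """
SELECT application_handle, application_name, client_hostname, rows_read
FROM TABLE(MON_GET_CONNECTION(NULL, -2))
ORDER BY rows_read DESC
"""
stmt = ibm_db.exec_immediate(conn, sql)
row = ibm_db.fetch_assoc(stmt)
while row:
    print(row)
    row = ibm_db.fetch_assoc(stmt)
ibm_db.close(conn)
```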

Related

AWS DMS real-time CDC replication from Oracle and PostgreSQL to Kinesis on a single thread is taking a lot of time

The goal:
Real-time CDC from Oracle and PostgreSQL to Kinesis on a single thread/process, with minimal lag and no dropped records.
The system:
We have a system where we are doing real-time CDC from Oracle and PostgreSQL to Kinesis using AWS DMS.
The problem with doing real-time CDC with only one thread is that it takes many hours to replicate the changes to Kinesis once the data volume grows large (MBs).
Alternate approach:
The approach we took was to pull the real-time changes from Oracle and PostgreSQL using multiple threads and push them to Kinesis, while still using DMS.
The challenge:
We noticed that while pulling data in real time using multiple threads, some records from Oracle and PostgreSQL are dropped. This happens for roughly 1 in 3 million records.
We tried different solutions on the Oracle and PostgreSQL side and talked to AWS, but nothing has worked.
Notes:
We are using LogMiner or Binary Reader on the Oracle and PostgreSQL side.
Is there a solution to this or has anybody tried to build this kind of system? Please let me know.
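Not a fix for DMS itself, but relevant to the custom multi-threaded path: Kinesis PutRecords can succeed as an API call while rejecting individual records, and those rejects have to be retried by the caller, which is one common way records silently disappear at roughly these rates. Below is a hedged Python sketch (stream name, record shape, and thread count are assumptions, not the original pipeline) of multi-threaded producers that keep per-row ordering via the partition key and retry partial failures.

```python
# Sketch: multi-threaded Kinesis producers that preserve per-row ordering via
# the partition key and retry partial PutRecords failures. Stream name, record
# shape and thread count are assumptions, not the original DMS pipeline.
import json
import queue
import threading

import boto3

kinesis = boto3.client("kinesis")
changes = queue.Queue()  # filled elsewhere with dicts: {"table", "pk", "op", "data"}

def worker():
    while True:
        batch = [changes.get()]                 # block until at least one change
        while len(batch) < 500:                 # PutRecords accepts up to 500 records
            try:
                batch.append(changes.get_nowait())
            except queue.Empty:
                break
        entries = [{
            "Data": json.dumps(rec).encode(),
            # Same table + primary key -> same shard, so per-row order is kept.
            "PartitionKey": f'{rec["table"]}:{rec["pk"]}',
        } for rec in batch]
        while entries:
            resp = kinesis.put_records(StreamName="cdc-stream", Records=entries)
            if resp["FailedRecordCount"] == 0:
                break
            # Resend only the entries that were rejected; dropping them here
            # is exactly how occasional records go missing.
            entries = [e for e, r in zip(entries, resp["Records"])
                       if "ErrorCode" in r]

for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()
```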

How can I limit the amount of data in my db2 warehouse on cloud entry instance?

I have an entry plan instance of DB2 Warehouse on Cloud that I'm looking to use for development of a streaming application.
If I keep the data to <= 1 GB, it will cost me $50/month. I'm worried that I could easily fill the database up to 20 GB, at which point the cost jumps to $1,000/month.
Is there a way that I can limit the amount of data in my DB2 Warehouse on Cloud to < 1GB?
As per this link: Db2 Warehouse pricing plans
You will not be charged anything as long as your data usage does not exceed 1 GB. From 1 GB to 20 GB the price will vary based on the data used.
You should be able to see the current % of usage at any time in your console. Other than that, I am not aware of any method to automatically restrict the usage to less than 1 GB at this time.
One complicating factor is data compression, which determines the actual amount of data stored and varies with the type of data stored.
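To keep an eye on usage from a script rather than the console, the per-table physical size can be queried from SYSIBMADM.ADMINTABINFO. A rough sketch, assuming the ibm_db Python driver and placeholder credentials:

```python
# Sketch: report physical size per table (data + index, in MB) from
# SYSIBMADM.ADMINTABINFO so usage can be watched from a script.
# Connection string values are placeholders.
import ibm_db

conn = ibm_db.connect(
    "DATABASE=BLUDB;HOSTNAME=<host>;PORT=50000;PROTOCOL=TCPIP;"
    "UID=<user>;PWD=<password>;",
    "", "")

sql = """
SELECT tabschema, tabname,
       (data_object_p_size + index_object_p_size) / 1024 AS size_mb
FROM SYSIBMADM.ADMINTABINFO
WHERE tabschema NOT LIKE 'SYS%'
ORDER BY size_mb DESC
"""
stmt = ibm_db.exec_immediate(conn, sql)
total_mb = 0
row = ibm_db.fetch_assoc(stmt)
while row:
    total_mb += row["SIZE_MB"]
    print(row["TABSCHEMA"], row["TABNAME"], row["SIZE_MB"], "MB")
    row = ibm_db.fetch_assoc(stmt)
print("Total:", total_mb, "MB")
ibm_db.close(conn)
```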
Hope this helps.
Regards
Murali

Loading data to Postgres RDS is still slow after tuning parameters

We have created an RDS PostgreSQL instance (m4.xlarge) with 200 GB of storage (Provisioned IOPS). We are trying to upload data from a company data mart to 23 tables in RDS using DataStage. However, the uploads are quite slow; it takes about 6 hours to load 400K records.
Then I started tuning the following parameters according to Best Practices for Working with PostgreSQL:
autovacuum 0
checkpoint_completion_target 0.9
checkpoint_timeout 3600
maintenance_work_mem {DBInstanceClassMemory/16384}
max_wal_size 3145728
synchronous_commit off
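For reference, on RDS these settings live in a custom DB parameter group rather than in postgresql.conf; a hedged boto3 sketch of applying them (the group name is a placeholder, and the group must already be attached to the instance):

```python
# Sketch: apply the tuning values above to a custom RDS DB parameter group
# with boto3. "pg-bulkload" is a placeholder group name; all of these
# parameters are dynamic, so "immediate" works (static parameters would need
# "pending-reboot").
import boto3

rds = boto3.client("rds")
rds.modify_db_parameter_group(
    DBParameterGroupName="pg-bulkload",
    Parameters=[
        {"ParameterName": "autovacuum", "ParameterValue": "0", "ApplyMethod": "immediate"},
        {"ParameterName": "checkpoint_completion_target", "ParameterValue": "0.9", "ApplyMethod": "immediate"},
        {"ParameterName": "checkpoint_timeout", "ParameterValue": "3600", "ApplyMethod": "immediate"},
        {"ParameterName": "maintenance_work_mem", "ParameterValue": "{DBInstanceClassMemory/16384}", "ApplyMethod": "immediate"},
        {"ParameterName": "max_wal_size", "ParameterValue": "3145728", "ApplyMethod": "immediate"},
        {"ParameterName": "synchronous_commit", "ParameterValue": "off", "ApplyMethod": "immediate"},
    ],
)
```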
Other than these, I also turned off Multi-AZ and backups. SSL is enabled, though; I'm not sure whether this changes anything. However, after all the changes there is still not much improvement. DataStage is already uploading data in parallel (~12 threads). Write IOPS is around 40/sec. Is this value normal? Is there anything else I can do to speed up the data transfer?
In PostgreSQL, you're going to wait one full network round trip (latency) for each INSERT statement issued. This latency is the latency between the database and the machine the data is being loaded from.
In AWS you have many options to improve performance.
For starters, you can load your raw data onto an EC2 instance and import from there; however, you will likely not be able to use your DataStage tool unless it can be installed directly on the EC2 instance.
You can also configure DataStage to use batch processing, where each INSERT statement actually contains many rows; generally, the more rows per statement, the faster.
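If the tool's own batching is limited, the same effect can be reproduced with multi-row inserts from a small loader script. A minimal psycopg2 sketch, with placeholder table, columns, and connection details:

```python
# Sketch: multi-row inserts with psycopg2 so each network round trip carries
# many rows instead of one. Table name, columns and connection details are
# placeholders; page_size controls how many rows each INSERT carries.
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect(host="<rds-endpoint>", dbname="<db>",
                        user="<user>", password="<password>")
rows = [(i, f"name-{i}") for i in range(100_000)]   # stand-in for the extract

with conn, conn.cursor() as cur:
    execute_values(
        cur,
        "INSERT INTO staging_table (id, name) VALUES %s",
        rows,
        page_size=1000,   # 1,000 rows per INSERT statement
    )
conn.close()
```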
Finally, disable data compression and make sure you've done everything you can to minimize latency between the two endpoints.

Analytics implementation in hadoop

Currently, we have MySQL-based analytics in place. We read our logs every 15 minutes, process them, and add them to a MySQL database.
As our data is growing (in one case, 9 million rows so far, with 0.5 million rows added each month), we are planning to move the analytics to a NoSQL database.
From my study, Hadoop seems to be a better fit, as we need to process the logs and it can handle very large data sets.
However, it would be great if I could get some suggestions from experts.
I agree with the other answers and comments. But if you want to evaluate the Hadoop option, then one solution can be the following.
Apache Flume with Avro for log collection and aggregation. Flume can ingest data into the Hadoop Distributed File System (HDFS).
Then you can have HBase as a distributed, scalable data store.
With Cloudera Impala on top of HBase you can have a near-real-time (streaming) query engine. Impala uses SQL as its query language, so it will be beneficial for you.
This is just one option. There can be multiple alternatives, e.g. Flume + HDFS + Hive.
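Once the logs are in HBase or HDFS, Impala queries are plain SQL. A small sketch using the impyla client (host, port, and table/column names are placeholder assumptions):

```python
# Sketch: querying the log data through Impala with plain SQL via the impyla
# client. Host, port and table/column names are placeholders.
from impala.dbapi import connect

conn = connect(host="impala-daemon.example.com", port=21050)
cur = conn.cursor()
cur.execute("""
    SELECT status, COUNT(*) AS hits
    FROM web_logs                      -- e.g. an HBase- or HDFS-backed table
    WHERE log_date = '2016-01-15'
    GROUP BY status
    ORDER BY hits DESC
""")
for status, hits in cur.fetchall():
    print(status, hits)
```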
This is probably not a good question for this forum, but I would say that 9 million rows plus 0.5 million per month hardly seems like a good reason to go to NoSQL. This is a very small database, and your best option would be to scale up the server a little (more RAM, more disks, a move to SSDs, etc.).

CDC (Change Data Capture) in Google Cloud SQL

Is there any way to capture data changes in a Google Cloud SQL database that can trigger external scripts, for real-time data replication (e.g. a Cloud SQL instance that replicates changes to a MySQL instance in an office, and vice versa)?
A polling solution would work, but it wouldn't be "real time"... if I poll the Cloud SQL instance, say, every 10 seconds, I would have a max latency of 9 seconds and a huge Cloud SQL invoice for 8,640 reads per day ;).
Thanks!
M
You can now create read-only replicas of your Google Cloud SQL instances in other hosting environments; see these docs.
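For completeness, the poll-based fallback described in the question usually keys off an indexed last-modified column. A rough PyMySQL sketch with placeholder table, column, and connection details (and, as noted above, this is near-real-time at best):

```python
# Sketch of the poll-based fallback from the question: every 10 seconds,
# fetch rows whose updated_at is newer than the last row seen and hand them
# to a replication callback. Table, column and connection details are
# placeholders; this assumes every change bumps an indexed updated_at column.
import time

import pymysql

conn = pymysql.connect(host="<cloudsql-ip>", user="<user>",
                       password="<password>", database="<db>",
                       autocommit=True)   # each poll must see fresh data
last_seen = "1970-01-01 00:00:00"

def replicate(rows):
    print("replicating", len(rows), "rows")   # placeholder for the real sync

while True:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, updated_at FROM orders "
            "WHERE updated_at > %s ORDER BY updated_at",
            (last_seen,),
        )
        rows = cur.fetchall()
    if rows:
        replicate(rows)
        last_seen = rows[-1][1]
    time.sleep(10)   # ~8,640 polls per table per day at this interval
```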