How can I limit the amount of data in my db2 warehouse on cloud entry instance? - db2

I have an entry plan instance of DB2 Warehouse on Cloud that I'm looking to use for development of a streaming application.
If I keep the data to <= 1GB, it will cost me $50/month. I'm worried that I could easily fill the database up with 20GB and the cost jumps up to $1000/month.
Is there a way that I can limit the amount of data in my DB2 Warehouse on Cloud to < 1GB?

As per the Db2 Warehouse pricing plans page:
You will not be charged anything as long as your data usage does not exceed 1 GB. From 1 GB to 20 GB the price will vary based on the data used.
You should be able to see the current % of usage at any time in your console. Other than that, I am not aware of any method to automatically restrict usage to less than 1 GB at this time.
One complication is data compression: it determines the actual amount of data stored, and the compression ratio varies with the type of data.
Hope this helps.
Regards
Murali

Related

slow query fetch time

I am using GCP Cloud SQL (MySQL 8.0.18) and I am trying to execute a query that returns only 5,000 rows:
SELECT * FROM client_1079.c_crmleads ORDER BY LeadID DESC LIMIT 5000;
but fetching the data seems to take a long time.
Here are the timing details:
Affected rows: 0 Found rows: 5,000 Warnings: 0 Duration for 1 query: 0.797 sec. (+ 117.609 sec. network)
Instance configuration is vCPU: 8 , RAM: 20 GB, SSD: 410GB
screenshot of gcp cloud sql instance
I am also seeing a high table_open_cache value and high RAM utilization.
How do I reduce table_open_cache, and how can I increase instance performance?
It looks like the result set is quite large: the query itself ran in 0.797 sec, while transferring the data from the SQL instance to your app took 117.609 sec. The network transfer is the source of the latency you observed.
You may want to review your use case and retrieve less data, try to parallelize queries, or improve the SQL instance's I/O performance (which is tied to the DB disk size).

AWS RDS Storage increasing unexpectedly

I’m using RDS PostgreSQL with the free tier configuration (t2.micro instance, 20 GB of general purpose database storage (SSD) and 20 GB of backups storage).
I’ve created a database with a few tables and the current usage is below 200 MB.
The issue is that the storage has been increasing day by day at a rate of 1 GB per day since I’ve created the database. As a workaround I’ve tried to stop this behavior by decreasing the backup retention period from 7 to 0 days. From yesterday on, no more backups were stored. However, the storage kept on growing at almost the same rate.
I have no clue what's going on and I don't know what to do to stop the consumption and avoid exceeding the free tier storage.
Apart from that, I also don't know which storage is the one that has been increasing (data or backup storage) because the AWS platform only says that the increase is in the RDS Storage (it doesn't distinguish between both types of storages). As I’m not storing a lot of data in the database I suspect the problem is in the backups and snapshots storage and not in the data storage itself.
Thanks in advance!
Update 2022-09-13
The curious fact is that I have a budget threshold of 8 GB/month and the threshold is exceeded on the same day every month (in my case, the 13th). But at the end of the month it kinda resets by itself (it returns to zero). I don't know what is happening. Does anyone know where that temporary GB usage comes from?

DB2 throughput when PaaS is offline?

I'm curious why there are a number of statements per hour when the PaaS is offline and nothing is actually accessing the Cloud DB2 database? Does that mean anything below 120/min or 250k/min rows read should be subtracted to get the real value?
db2 throughput

How can I choose the right key-value store for my use case?

I will describe the data and case.
record {
customerId: "id", <---- indexed
binaryData: "data" <---- not indexed
}
Expectations:
customerId is random 10 digit number
Average size of binary record data - 1-2 kilobytes
There may be up to 100 records per one customerId
Overall number of records - 500M
Write pattern #1: insert one record at a time
Write pattern #2: batch, maybe in parallel, at a rate of at least 20M records per hour
Search pattern #1: find all records by customerId
Search pattern #2: find all records for a group of customerIds, in parallel, at a rate of at least 10M customerIds per hour
Data is not too important, we can trade some aspects of reliability for speed
We plan to run in AWS or GCP, and ideally the key-value store is managed by the cloud provider
We want to spend no more than 1K USD per month on cloud costs for this solution
What we have tried:
We have this approach implemented in a relational database, AWS RDS MariaDB. The server has 32GB RAM, 2TB GP2 SSD, 8 CPUs. I found that IOPS usage was high and insert speed was not satisfactory. After investigation I concluded that, due to the random nature of customerId, there is a high rate of scattered writes to the index. After this I did the following:
Input data is sorted by customerId ASC.
An additional trade-off was made to reduce index size at the cost of a small degradation in single-record read speed. For this I introduced buckets, where records 1111111185 and 1111111186 go to the same "bucket" 11111111. This way a bucket can't contain more than 100 customerIds, so read speed will be OK, and write speed improves.
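A minimal sketch of the bucketing described above, assuming the bucket key is simply the customerId with its last two digits dropped. The function name `bucket_for` and the `bucket_width` parameter are hypothetical, not taken from the original schema.

```python
def bucket_for(customer_id: int, bucket_width: int = 100) -> int:
    """Map a 10-digit customerId to its bucket by dropping the last two digits.

    With bucket_width=100, at most 100 distinct customerIds share a bucket.
    """
    return customer_id // bucket_width


# The two example ids from the question land in the same bucket.
first = bucket_for(1111111185)
second = bucket_for(1111111186)
print(first, second)
```

The index is then built on the bucket value rather than on the full customerId, and a lookup scans one bucket and filters by the exact id.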
Even like this, I could not make more than 1-3M record writes per hour. Different write concurrencies were tested, current value is 4 concurrent writers. After all modifications it's not clear what else we can improve:
IOPS usage is not at its limit (~4K per second),
CPU use is not high,
Network is not fully utilized,
Write and read throughputs are not capped.
Apparently, ACID guarantees are holding us back. I am looking for a horizontally scalable key-value store and would be glad to hear any ideas and rough estimates.
So if I understand you...
2 KB * 500M records ≈ 1 TB of data
20M writes/hr ≈ 5.6k writes/sec
That's quite doable in NoSQL.
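The back-of-the-envelope sizing above can be reproduced with a few lines of Python. This is only a sketch under the question's stated upper bounds (2 KB per record, decimal units); the constant names are made up for illustration.

```python
RECORDS = 500_000_000          # overall number of records
RECORD_BYTES = 2_000           # upper end of the 1-2 kilobyte estimate
WRITES_PER_HOUR = 20_000_000   # batch write target
LOOKUPS_PER_HOUR = 10_000_000  # customerId lookups target

total_tb = RECORDS * RECORD_BYTES / 1e12
writes_per_sec = WRITES_PER_HOUR / 3600
lookups_per_sec = LOOKUPS_PER_HOUR / 3600

print(f"raw data: {total_tb:.1f} TB")
print(f"writes:   {writes_per_sec:,.0f} per second")
print(f"lookups:  {lookups_per_sec:,.0f} per second")
```

These are raw figures before replication, compression, or compaction overhead.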
The scale is not the issue. It's your cost.
$1k a month for 1 TB of data sounds like a reasonable goal. I just don't think that the public clouds are quite there yet.
Let me give an example with my recommendation: Scylla Cloud and Scylla Open Source. (Disclosure: I work for ScyllaDB.)
I will caution you that your $1k/month cap on costs might force you to make some tradeoffs.
As is typical in high availability deployments, to ensure data redundancy in case of node failure, you could use 3x i3.2xlarge instances on AWS (can store 1.9 TB per instance).
You want the extra capacity to run compactions. We use incremental compaction, which saves on space amplification, but you don't want to go with the i3.xlarge (0.9 TB each), which is under your 1 TB of data, unless you are really pressed for costs. In that case you'll have to do some sort of data eviction (like a TTL) to keep your data under roughly 600 GB.
Even with annual reserved pricing for Scylla Cloud (see here: https://www.scylladb.com/product/scylla-cloud/#pricing) of $764.60/server, to run the three i3.2xlarge would be $2,293.80/month. More than twice your budget.
Now, if you eschew managed services and want to run self-service, you could go with Scylla Open Source and just look at the on-demand instance pricing (see here: https://aws.amazon.com/ec2/pricing/on-demand/). For 3x i3.2xlarge, you are running each at $0.624/hour. That's a raw on-demand cost of $449.28 each per month (720 hours), which doesn't include incidentals like backups, data transfer, etc. But you could get three instances for $1,347.84. Open Source. Not managed.
Still over your budget, but closer. If you could get reserved pricing, that might just make it.
Edit: Found the reserve pricing:
3x i3.2xlarge is going to cost you
At monthly pricing $312.44 x 3 = $937.32, or
1 year up-front $3,482 annual/12 = $290.17/month/server x 3 = $870.50.
So, again, backups, monitoring, and other costs are above that. But you should be able to bring the raw server cost <$1,000 to meet your needs using Scylla Open Source.
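To make the comparison easier to re-run, here is a small sketch that recomputes the three-node totals from the per-node prices quoted in this answer and checks them against the $1k budget. The prices are the ones stated above and may be out of date; the 720-hour month is an assumption that reproduces the $449.28 figure, and the dictionary keys are just labels.

```python
NODES = 3
BUDGET_USD = 1000.0
HOURS_PER_MONTH = 720  # assumption: 30-day month

# Per-node monthly cost in USD, taken from the figures quoted in the answer.
per_node_monthly = {
    "Scylla Cloud, annual reserved": 764.60,
    "EC2 on-demand": 0.624 * HOURS_PER_MONTH,
    "EC2 reserved, monthly": 312.44,
    "EC2 reserved, 1 year up-front": 3482 / 12,
}

cluster_monthly = {
    option: round(price * NODES, 2) for option, price in per_node_monthly.items()
}

for option, total in cluster_monthly.items():
    verdict = "within" if total <= BUDGET_USD else "over"
    print(f"{option}: ${total:,.2f}/month ({verdict} budget)")
```

Only the two reserved EC2 options come in under the budget, and neither total includes backups, monitoring, or data transfer.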
But the admin burden is on your team (and their time isn't exactly zero cost).
For example, if you want monitoring on your system, you'll need to set up something like Prometheus, Grafana or Datadog. That will be other servers or services, and they aren't free. (The cost of backups and monitoring by our team are covered with Scylla Cloud. Part of the premium for the service.)
Another way to save money is to only do 2x replication. Which puts your data in a real risky place in case you lose a server. It is not recommended.
All of this was based on maximal assumptions of your data. That your records are all around 2k (not 1k). That you're not getting much utility out of data compression, which ScyllaDB has built in – see part one (https://www.scylladb.com/2019/10/04/compression-in-scylla-part-one/) and part two (https://www.scylladb.com/2019/10/07/compression-in-scylla-part-two/).
To my mind, you should be able to squeak through with your $1k/month budget if you go reserved pricing and open source. Though adding on monitoring and backups and other incidental costs (which I haven't calculated here) may end you up back over that number again.
Otherwise, $2.3k/month in a fully-managed-cloud enterprise package and you can sleep easy at night.

Google Cloud Storage Usage Pricing from byte_hours

I recently set up Google Cloud Storage access logs and storage data. The logs are being written, but I see 4 log objects for the same hour.
For example:
usage_2017_02_14_07_00_00_00b564_v0
usage_2017_02_14_07_00_00_01b564_v0
usage_2017_02_14_07_00_00_02b564_v0
usage_2017_02_14_07_00_00_03b564_v0
So there are 4 usage logs written for every hour. What's the difference between them?
I connected all the logs to BigQuery to query the table, and all 4 of them have different values.
Also, when analysing the storage logs, I see a storage_byte_hours value of 43423002260.
How do I calculate the cost from storage_byte_hours?
It is normal for GCS to sometimes produce more than one log file for the same hour. From Downloading logs (emphasis mine):
Note:
Any log processing of usage logs should take into account the possibility that they may be delivered later than 15 minutes after the end of an hour.
Usually, hourly usage log object(s) contain records for all usage that occurred during that hour. Occasionally, an hourly usage log object contains records for an earlier hour, but never for a later hour.
Cloud Storage may write multiple log objects for the same hour.
Occasionally, a single record may appear twice in the usage logs. While we make our best effort to remove duplicate records, your log processing should be able to remove them if it is critical to your log analysis. You can use the s_request_id field to detect duplicates.
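As a rough illustration of the deduplication the note recommends, the sketch below drops repeated usage records by their s_request_id after the log objects for an hour have been merged. It assumes each record has already been parsed into a dict keyed by field name; the function name and the sample values are made up.

```python
from typing import Dict, Iterable, Iterator


def dedupe_usage_records(records: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield each usage record once, keyed on its s_request_id field."""
    seen = set()
    for record in records:
        request_id = record.get("s_request_id")
        if request_id is None:
            # No id to compare on, so keep the record rather than guess.
            yield record
            continue
        if request_id in seen:
            continue
        seen.add(request_id)
        yield record


# Made-up sample: the same request appears in two log objects for one hour.
merged = [
    {"s_request_id": "req-1", "cs_object": "a.txt"},
    {"s_request_id": "req-2", "cs_object": "b.txt"},
    {"s_request_id": "req-1", "cs_object": "a.txt"},
]
unique = list(dedupe_usage_records(merged))
print(len(unique))
```

If the logs are already loaded into BigQuery, the same idea can be expressed there by keeping one row per s_request_id.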
You calculate the bucket size from storage_byte_hours. From Access and storage log format:
Storage data fields:
Field | Type | Description
storage_byte_hours | integer | Average size in byte-hours over a 24 hour period of the bucket. To get the total size of the bucket, divide byte-hours by 24.
In your case 43423002260 byte-hours / 24 hours = 1809291760 bytes
You can use the bucket size to estimate the cost for the storage itself:
1809291760 bytes = 1809291760 / 2^30 GB ~= 1.685 GB
Assuming Multi-Regional Storage at $0.026 per GB per month, your storage cost would be:
1.685 GB x $0.026 = $0.04381 / month ~= $0.00146 / day (with a 30-day month)
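The same arithmetic as a short Python sketch, in case you want to run it over every row of the storage log. The $0.026 per GB per month price is the Multi-Regional assumption used above, not a value read from the logs, and the variable names are illustrative.

```python
STORAGE_BYTE_HOURS = 43_423_002_260  # value from the storage log
PRICE_PER_GB_MONTH = 0.026           # assumed Multi-Regional price, USD
DAYS_PER_MONTH = 30                  # assumption, as in the estimate above

avg_bytes = STORAGE_BYTE_HOURS / 24  # average bucket size over the day
avg_gb = avg_bytes / 2**30           # bytes to GB (binary, 2^30 bytes)
monthly_cost = avg_gb * PRICE_PER_GB_MONTH
daily_cost = monthly_cost / DAYS_PER_MONTH

print(f"average size: {avg_gb:.3f} GB")
print(f"storage cost: ${monthly_cost:.5f}/month, ${daily_cost:.5f}/day")
```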
But a pile of other data (network, ops, etc) is needed to compute additional related costs, see Google Cloud Storage Pricing.