Can we know the date and time of an object's last read and write in Ceph?

Is there any function or class from which the last read or write of an object can be determined in Ceph?
Can we also see the last usage over a month or a year?
Thanks.

The rados_read_op_stat function of the librados API can be used to query the last modification time and the size of an object.
The stat sub-command in rados can conveniently be used to test it:
$ rados --pool rbd put FOO /etc/group
$ rados --pool rbd stat FOO
rbd/FOO mtime 2015-03-24 15:04:47.000000, size 1253
Ceph does not collect usage statistics on objects, except for the cache tier, which needs to know which objects have not been modified in the past hours so it can demote them to a slower / less expensive pool. That is, however, probably not what you're looking for.
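For completeness, here is a minimal sketch of the same stat query through the librados Python bindings (a sketch only, assuming the default config path and the rbd pool from the example above):
import rados

# Connect using the default Ceph configuration (assumed path) and open the example pool.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')
    try:
        size, mtime = ioctx.stat('FOO')  # returns (size in bytes, last modification time)
        print('size: %d bytes, mtime: %s' % (size, mtime))
    finally:
        ioctx.close()
finally:
    cluster.shutdown()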

Related

GCS - Cost migrating data to Archive Bucket

I would like to know how much I would save by transferring 1 TB of data from a standard regional bucket to an Archive bucket located in the same region (and within the same project).
I understand that the cost can be split in Data Storage, Network Usage and Operations Usage.
For the Data Storage:
The cost of storing 1 TB in a Standard bucket per month : 1024 * 0.020 $ = 20.48 $
The cost of storing 1 TB in an Archive bucket per month : 1024 * 0.0012 $ = 1.2288 $
Which means that I would save 19.2512 $ per month.
For the Network Usage:
I assume that the cost for the transfer will be 0 because the data will move within the same region.
For the Operations Usage:
Retrieval cost from the Standard bucket : 0.004 $
It should need less than 10000 Class B operations to gather all the files.
Insertion cost in the Archive bucket : 0.50 $
It should need around 1024 * 1024 / 128 = 8192 operations of Class A. (1 per directory, 1 per file, and for each file larger than 128MB 1 per additional 128MB.)
So in total, I would have to pay 0.504$ once to transfer all the files to the Archive bucket and the bucket will cost me 1.2288 $ instead of 20.48 $.
Is my calculation correct, or did I miss something?
Regards,
According to the documentation on Cloud Storage Pricing, your estimates seem to be correct. Moreover, the amount of data you would like to transfer is quite small, so the charges would be low as well.
Keep in mind that the Archive storage class implies that reads, writes and early deletions are charged accordingly as shown here, so if you intend to access that data often or overwrite the files therein it might be better to stay with the Standard storage class.
Lastly, there is also a pricing calculator for making this kind of estimate, which can be found here.
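For reference, here is a quick sketch of the arithmetic above (the prices and operation counts are the ones quoted in the question, so double-check them against the current pricing page):
# Prices as quoted in the question; verify them against the current Cloud Storage pricing page.
STANDARD_PER_GB_MONTH = 0.020
ARCHIVE_PER_GB_MONTH = 0.0012
CLASS_B_COST = 0.004   # reads from the Standard bucket (under 10,000 Class B operations)
CLASS_A_COST = 0.50    # writes into the Archive bucket (under 10,000 Class A operations)

data_gb = 1024  # 1 TB

monthly_saving = data_gb * (STANDARD_PER_GB_MONTH - ARCHIVE_PER_GB_MONTH)
one_time_ops_cost = CLASS_B_COST + CLASS_A_COST

print('monthly saving: $%.4f' % monthly_saving)          # 19.2512
print('one-time operations: $%.3f' % one_time_ops_cost)  # 0.504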

Is db.stats() a blocking call for MongoDB?

While researching how to check the size of a MongoDB, I found this comment:
Be warned that dbstats blocks your database while it runs, so it's not suitable in production. https://jira.mongodb.org/browse/SERVER-5714
Looking at the linked bug report (which is still open), it quotes the Mongo docs as saying:
Command takes some time to run, typically a few seconds unless the .ns file is very large (via use of --nssize). While running, other operations may be blocked.
However, when I check the current Mongo docs, I don't find that text. Instead, they say:
The time required to run the command depends on the total size of the database. Because the command must touch all data files, the command may take several seconds to run.
For MongoDB instances using the WiredTiger storage engine, after an unclean shutdown, statistics on size and count may be off by up to 1000 documents as reported by collStats, dbStats, count. To restore the correct statistics for the collection, run validate on the collection.
Does this mean the WiredTiger storage engine changed this to a non-blocking call by keeping ongoing stats?
A bit late to the game, but I found this question while looking for the answer, and the answer is: yes, until 3.6.12 / 4.0.5 it acquired a "shared" lock ("R"), which blocks all write requests during execution. Since then it takes an "intent shared" lock ("r"), which doesn't block write requests. Read requests were not impacted.
Source: https://jira.mongodb.org/browse/SERVER-36437
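If you want to see it yourself, here is a small sketch with PyMongo (the URI and database name are placeholders); on a version before the fix you would see writes queue up behind the command, on a fixed version they proceed:
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')  # placeholder URI
db = client['test']                                # placeholder database name

# The same command the shell helper db.stats() runs; scale reports sizes in MiB.
stats = db.command('dbStats', scale=1024 ** 2)
print(stats['objects'], 'documents,', stats['dataSize'], 'MiB of data')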

File system snapshots moongodb for backup [closed]

I have some questions about backing up with filesystem snapshots.
I wonder whether just taking a snapshot is equal to a backup. Since a snapshot only holds pointers to disk blocks (copy on write), the data can't be restored if the original disk is broken.
So the question is: if we want to back up, is a filesystem snapshot alone enough, or do we need to implement something in addition?
A filesystem snapshot by itself is surely not enough to back up your data: a snapshot does not help you with corrupted disks and the like, so you need to move the data in the snapshot somewhere else.
When you create a snapshot, the changes applied to the original data files will fill up the remaining space (in the case of an LVM snapshot, only the space explicitly allocated to it). So you need to over-provision your disks by at least the amount of data you write during the time it takes to process the backup and store it in a remote location.
Note: In the following description, most security considerations are set aside for the sake of brevity. You must apply the security measures appropriate for your data, for example using stunnel to encrypt the data transfer when using the first approach.
One approach
Here is what I tend to do to keep the need for over-provisioning your disk size, CPU and RAM as low as possible while still being able to create snapshots and use them properly. Note that this is sort of a poor man's solution (for when you are really tight on budget) and you really should think carefully about whether it suits you.
I usually have a cheap VM as a backup system, with some big, cheap, slow(ish) storage attached. On that VM, I open a listening netcat, piping its output through some sort of compressor. I use my own snap for that, which is a crude implementation utilizing the snappy compression algorithm. The reason is that snappy is optimized for speed rather than size, and that is what I am interested in here to reduce the time needed to get the snapshot off the MongoDB server.
nc -l <internalIp> <somePort> | snap > /mnt/datastore/backup-$(date +%Y-%m-%dT%T%Z).tar.sz
Make sure you only listen on an internal IP, or on one to which access is limited via a firewall.
Next, I mount the snapshot (how to do that varies) on the MongoDB server I took it on. Then I change to the directory I mounted the snapshot into (the one containing the data folder) and run tar over it, piping the output to another netcat, which sends the tarred data to the netcat listening on the backup server:
tar -cv <directory> | nc <internalIpOfBackupServer> <listeningPort>
Another approach
This is basically the same, though it utilizes ssh instead of nc:
tar -cv <directory> | ssh user@backupserver "snap > /mnt/datastore/backup-$(date +%Y-%m-%dT%T%Z).tar.sz"
We are trading speed for security here. This approach will take considerably longer and needs more resources on the machine you take the backup from.
Some thoughts
You really want the snapshot to be moved and archived on another server as fast as possible, so that you can destroy the snapshot and return to normal operations. This way, you can reduce the necessary over-provisioning of your resources to a bare minimum. Since RAM is a very precious resource on any MongoDB deployment, you want the compression to be done on a different machine.
If space is an issue on your backup machine, you can either use gzip or bzip2 instead of snap right away (reducing the speed with which the backup is done), or you can do the following on the backup machine after the backup is finished:
snap -c -u yourBackup.tar.sz | bzip2 -v9 > yourBackup.tar.bz2
rm yourBackup.tar.sz

Fastest way to get Google Storage bucket size?

I'm currently doing this, but it's VERY slow since I have several terabytes of data in the bucket:
gsutil du -sh gs://my-bucket-1/
And the same for a sub-folder:
gsutil du -sh gs://my-bucket-1/folder
Is it possible to somehow obtain the total size of a complete bucket (or a sub-folder) elsewhere or in some other fashion which is much faster?
The visibility for Google Storage here is pretty poor.
The fastest way is actually to pull the Stackdriver metrics and look at the total size in bytes.
Unfortunately there is practically no filtering you can do in Stackdriver. You can't wildcard the bucket name, and the almost useless bucket resource labels are NOT aggregate-able in Stackdriver metrics.
Also, this is bucket level only, not per prefix.
The Stackdriver metrics are updated daily, so unless you can wait a day you can't use this to get the current size right now.
UPDATE: Stackdriver metrics now support user metadata labels, so you can label your GCS buckets and aggregate those metrics by the custom labels you apply.
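If it helps, here is a rough sketch of pulling that metric with the Cloud Monitoring Python client (the project ID is a placeholder, and since points are only written about once a day the query window has to be wide):
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = 'projects/your-project-id'  # placeholder

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        'end_time': {'seconds': now},
        'start_time': {'seconds': now - 2 * 24 * 3600},  # look two days back, points are daily
    }
)

results = client.list_time_series(
    request={
        'name': project_name,
        'filter': 'metric.type = "storage.googleapis.com/storage/total_bytes"',
        'interval': interval,
        'view': monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

# Each bucket comes back as its own time series, with the bucket name as a resource label.
for series in results:
    bucket = series.resource.labels['bucket_name']
    latest_bytes = series.points[0].value.double_value  # newest point first
    print('%s: %.1f GiB' % (bucket, latest_bytes / 1024 ** 3))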
Edit
I want to add a word of warning if you are creating monitors off of this metric. There is a really crappy bug with this metric right now.
GCP occasionally has platform issues that cause this metric to stop getting written. And I think it's tenant specific (maybe?), so you also won't see it on their public health status pages. It also seems poorly documented for their internal support staff, because every time we open a ticket to complain they seem to think we are lying, and it takes some back and forth before they even acknowledge it's broken.
I think this happens if you have many buckets and something crashes on their end and stops writing metrics to your projects. While it does not happen all the time, we see it several times a year.
For example, it just happened to us again: right now the metric is simply not being written across any of our projects.
Response from GCP support
Just adding the last response we got from GCP support during this most recent metric outage. I'll add that all our buckets were accessible; it was just that this metric was not being written:
The product team concluded their investigation stating that this was indeed a widespread issue, not tied to your projects only. This internal issue caused unavailability for some GCS buckets, which was affecting the metering systems directly, thus the reason why the "GCS Bucket Total Bytes" metric was not available.
Unfortunately, no. If you need to know what size the bucket is right now, there's no faster way than what you're doing.
If you need to check on this regularly, you can enable bucket logging. Google Cloud Storage will generate a daily storage log that you can use to check the size of the bucket. If that would be useful, you can read more about it here: https://cloud.google.com/storage/docs/accesslogs#delivery
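If you do enable the logs, the daily storage log is a small CSV with a storage_byte_hours column per bucket; here is a sketch of turning it into an average size (the field names follow the access-logs documentation linked above, and the file name is only an example):
import csv

# Example file name; real storage logs are named like <prefix>_storage_<timestamp>_v0
with open('my-bucket-1_storage_2015_03_24_08_00_00_0000_v0', newline='') as f:
    for row in csv.DictReader(f):
        # storage_byte_hours covers a 24-hour window, so dividing gives an average size in bytes
        avg_bytes = int(row['storage_byte_hours']) / 24
        print('%s: %.2f GiB (daily average)' % (row['bucket'], avg_bytes / 1024 ** 3))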
If the daily storage log you get from enabling bucket logging (per Brandon's suggestion) won't work for you, one thing you could do to speed things up is to shard the du request. For example, you could do something like:
gsutil du -s gs://my-bucket-1/a* > a.size &
gsutil du -s gs://my-bucket-1/b* > b.size &
...
gsutil du -s gs://my-bucket-1/z* > z.size &
wait
awk '{sum+=$1} END {print sum}' *.size
(assuming your subfolders are named starting with letters of the English alphabet; if not, you'd need to adjust the commands above).
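A rough Python equivalent of the same sharding idea, using the storage client and a thread pool (bucket name and prefixes are placeholders, and it still has to list every object, so it only helps by running the listings in parallel):
import string
from concurrent.futures import ThreadPoolExecutor

from google.cloud import storage

client = storage.Client()
BUCKET = 'my-bucket-1'  # placeholder bucket name

def prefix_size(prefix):
    # Sum the sizes of all objects under one prefix (one 'shard' of the du).
    return sum(blob.size for blob in client.list_blobs(BUCKET, prefix=prefix))

with ThreadPoolExecutor(max_workers=8) as pool:
    total_bytes = sum(pool.map(prefix_size, string.ascii_lowercase))

print(total_bytes, 'bytes')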
Use the built in dashboard
Operations -> Monitoring -> Dashboards -> Cloud Storage
The graph at the bottom shows the bucket size for all buckets, or you can select an individual bucket to drill down.
Note that the metric is only updated once per day.
With Python you can get the size of your bucket as follows:
from google.cloud import storage

storage_client = storage.Client()
blobs = storage_client.list_blobs(bucket_or_name='name_of_your_bucket')

blobs_total_size = 0
for blob in blobs:
    blobs_total_size += blob.size  # size in bytes

print(blobs_total_size / (1024 ** 3))  # size in GB
Google Console
Platform -> Monitoring -> Dashboard -> Select the bucket
Scroll down to see the object size for that bucket.
I found that using the CLI it was frequently timing out, but that may be because I was reviewing Coldline storage.
For a GUI solution, look at CloudBerry Explorer, which gives a GUI view of the storage.
For me, the following command helped:
gsutil ls -l gs://{bucket_name}
It then gives output like this after listing all files:
TOTAL: 6442 objects, 143992287936 bytes (134.1 GiB)

CLUSTER USING for Postgres in Django (table defragmenting / packing)

Let's say that I'm building a stack exchange clone, and every time I examine a question, I also load each and every answer. The table might look like:
id integer
question_id FOREIGN KEY
answer bool
date timestamp
How can I tell Django to tell Postgres to keep all the answers together for fast access? Postgres has the underlying feature CLUSTER USING.
(CLUSTER USING is a 'defragmenting' feature for tables. It works especially well for small records, since they may all end up in the same disk block, which greatly reduces load time. The defragmenting is typically done as a batch job at times of low load.)
As far as I can tell, you can't. But you can treat this as a database administration task, and do it from the psql command line:
# CLUSTER table USING index_name;
# ANALYZE VERBOSE table;
# CLUSTER VERBOSE;
The clustering choice is remembered: each time you run a bare CLUSTER VERBOSE, it will lock all previously clustered tables and re-sort their data. All your answers (in the example above) will be gathered together on disk. This makes sense even for solid-state storage, since the eventual database read will cover fewer sectors, meaning fewer I/O operations to retrieve the group.
Obviously you must pick your index well: the wrong choice can scatter the data you actually access. The performance benefit is best for sparse datasets, and becomes less relevant if most of the data is frequently accessed.
A better name for the CLUSTER feature might be "DEFRAG", as this operation is analogous to defragmenting a filesystem.
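If you would rather keep this next to your Django code instead of running it by hand, one option is a data migration that issues the same SQL. Here is a rough sketch (the app, table and index names below are hypothetical; use the names Django actually generated, e.g. as shown by ./manage.py sqlmigrate or \d in psql):
from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ('myapp', '0001_initial'),  # hypothetical app / migration names
    ]

    operations = [
        # CLUSTER is remembered per table; re-run CLUSTER VERBOSE periodically to re-sort.
        migrations.RunSQL(
            sql='CLUSTER myapp_answer USING myapp_answer_question_id_idx;',  # hypothetical names
            reverse_sql='ALTER TABLE myapp_answer SET WITHOUT CLUSTER;',
        ),
    ]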