Google Cloud warnings that conflict with each other - command line

TL;DR
Creating a gcloud compute instance with a boot disk under 200 GB produces one warning, but setting the disk to 200 GB produces a different warning.
In detail
When we create an instance template, as in command 01 below, we get this warning if the boot disk size is set to 10 GB:
WARNING: You have selected a disk size of under [200GB]. This may result in poor I/O performance. For more information, see: https://developers.google.com/compute/docs/disks#performance
But when we do set it to 200 GB, we get this warning when creating an instance from that template:
WARNING: Some requests generated warnings:
Disk size: '200 GB' is larger than image size: '10 GB'. You might need to resize the root repartition manually if the operating system does not support automatic resizing. See https://cloud.google.com/compute/docs/disks/persistent-disks#repartitionrootpd for details.
So why does gcloud warn us about the 200 GB in the first place?
And what should we do to set things up correctly with no warnings?
Commands used
command 01 - create instance template
gcloud compute instance-templates create TEMPLATE-NAME \
--image='projects/ubuntu-os-cloud/global/images/ubuntu-1604-xenial-v20190212' \
--custom-cpu=1 --custom-memory=2 \
--boot-disk-size=200 --boot-disk-type=pd-standard
command 02 - create compute instance from instance template
gcloud compute instances create INSTANCE_NAME \
--source-instance-template TEMPLATE_NAME \
--project=$GC_PROJECT --zone=$zone
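
As an aside, the second warning is informational. A minimal sketch of how one might verify, and if necessary grow, the root partition after the instance boots is shown below; it assumes the boot disk is /dev/sda with an ext4 root filesystem on /dev/sda1 and that growpart is available (on recent Ubuntu images cloud-init usually does this automatically on first boot):
# check whether the root filesystem already spans the full 200 GB disk
df -h /
lsblk
# if it does not, grow the partition and then the filesystem (device names assumed)
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1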

Related

Perf collection on Kubernetes pods

I am trying to find performance bottlenecks by using the perf tool on a Kubernetes pod. I have already set the following on the instance hosting the pod:
"kernel.kptr_restrict" = "0"
"kernel.perf_event_paranoid" = "0"
However, I have two problems.
When I collect samples with perf record -a -F 99 -g -p <PID> --call-graph dwarf and feed them into speedscope or, similarly, into a flame graph (a rough sketch of that pipeline is shown after the questions below), I still see question marks (???). For the C++ process whose CPU usage breakdown I want, the ??? frames sit at the top of the stack and the system calls fall below them; the main process is the one surrounded by ???.
I tried running perf top and it says
Failed to mmap with 1 (Operation not permitted)
My questions are:
To collect data with perf top, what permissions do I need to change on the host instance of the pod?
Which other settings do I need to change at the instance level so that no more ??? show up in perf's output? I would like to see the function call stack of the process, not just the system calls. See the following stack:
The host OS is Ubuntu.
Zooming in on the first system call, you would see this, but this only gives me a fraction of the CPU time spent and only the system calls.
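A rough sketch of that record-to-flame-graph pipeline (assuming Brendan Gregg's FlameGraph scripts are checked out under ./FlameGraph; the paths and the 30-second duration are placeholders):
# record with DWARF call graphs, then fold the stacks and render an SVG
perf record -a -F 99 -g -p <PID> --call-graph dwarf -- sleep 30
perf script > out.perf
./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
./FlameGraph/flamegraph.pl out.folded > flame.svg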
UPDATE/ANSWER:
I was able to run perf top by setting
"kernel.perf_event_paranoid" = "-1". However, as seen in the image below, the process I'm trying to profile (I've blacked out the name) shows only addresses rather than function names. I tried running the addresses through addr2line, but it says addr2line: 'a.out': No such file.
How can I get the addresses resolved to function names on the pod? Is that even possible?
I was also able to fix the address-to-function mapping in perf top. The cause was that I was running perf from a different container than the one the process was running in (same pod, different container). There may be a way to feed in the extra symbol information, but simply moving perf into the container running the process fixed it.
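Putting the update together, a minimal sketch of the working setup described above (pod, container, and PID names are placeholders):
# on the node hosting the pod
sudo sysctl -w kernel.kptr_restrict=0
sudo sysctl -w kernel.perf_event_paranoid=-1
# run perf from the same container as the target process, so its binaries and symbols are visible
kubectl exec -it <POD> -c <APP_CONTAINER> -- perf top -p <PID>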

Ulimits on AWS ECS Fargate

The default ULIMIT "NOFILE" is set to 1024 for containers launched using Fargate. So if I have a cluster of let's say 10 services with two or three tasks each (all running on Fargate), what are the implications if I set them all to use a huge NOFILE number such as 900000?
More specifically, do we need to care about the host machine? My assumption is that if I were using the EC2 launch type and set all my tasks to effectively use as many files as they wanted, the hosting EC2 instance(s) could easily get overwhelmed. Or maybe the hosts wouldn't get overwhelmed, but the containers registered on them would get file handles on a first-come, first-served basis, potentially leading to one service starving another. But since with Fargate we don't manage the instances, what's the harm in setting the ULIMIT as high as possible for all services? Do our containers sit side by side on a host and therefore share the host's resource limits, or do we get a host per service / per task?
Of course it's possible my assumptions are wrong about how this all works.
The maximum nofile limit on Fargate is 4096:
Amazon ECS tasks hosted on Fargate use the default resource limit values set by the operating system with the exception of the nofile resource limit parameter which Fargate overrides. The nofile resource limit sets a restriction on the number of open files that a container can use. The default nofile soft limit is 1024 and hard limit is 4096.
https://docs.aws.amazon.com/AmazonECS/latest/userguide/task_definition_parameters.html
A slight correction to this answer. As the linked documentation states, these are the DEFAULT soft and hard limits for the nofile ulimit. You can override them by updating your ECS task definition. The ulimit settings go under the containerDefinitions section of the definition.
I've successfully set the soft and hard limits for nofile on some of my AWS Fargate Tasks using this method.
So while you cannot use the Linux "ulimit -n" command to change this on the fly, you can alter it via the ECS Task Definition.
EDIT:
I've done some testing and, for my setup (AWS ECS Fargate running a Python Bullseye-based image), I was able to max out at NOFILE = 1024 x 1024 = 1048576 files.
{
  "ulimits": [
    {
      "name": "nofile",
      "softLimit": 1048576,
      "hardLimit": 1048576
    }
  ]
}
Any integer multiple beyond this (1024 x 1024 x INT) caused ECS to report an error when trying to start the Fargate task:
CannotStartContainerError: ResourceInitializationError: failed to create new container runtime task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container
Hope this helps someone.
Please refer to:
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-ecs-taskdefinition-containerdefinitions-ulimit.html
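To double-check the effective limit from inside a running Fargate task, one option (assuming ECS Exec is enabled for the service; the cluster, task, and container names are placeholders) is roughly:
aws ecs execute-command \
  --cluster my-cluster \
  --task <TASK_ID> \
  --container my-container \
  --interactive \
  --command "sh -c 'ulimit -n; cat /proc/1/limits'"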

GCloud SQL - unexpected disk consumption

I've got a MySQL 5.7 Cloud SQL instance in the europe-west1 zone with 1 vCPU and 4 GB RAM, without redundancy enabled.
My DB was never above 30 GB, but the storage suddenly grew to about 11 TB.
I logged in to the instance and noticed that the DB size is only 15.8 GB.
Can somebody explain what happened, and how I can reduce the space usage, which is costing me $70/day?
If you don't see an abnormal increase in binary logs when running the command SHOW BINARY LOGS; then try checking the size of the temporary tablespace with the following query:
SELECT FILE_NAME, TABLESPACE_NAME, ENGINE, INITIAL_SIZE, TOTAL_EXTENTS*EXTENT_SIZE
AS TotalSizeBytes, DATA_FREE, MAXIMUM_SIZE FROM INFORMATION_SCHEMA.FILES
WHERE TABLESPACE_NAME = 'innodb_temporary'\G
You can also check the size of the general log, if you have it enabled, by running:
SELECT ROUND(SUM(LENGTH(argument))/POW(1024,3),2) AS GB FROM mysql.general_log;
If you want to solve this issue quickly and avoid further charges, you can try restarting your instance (if temporary logs are filling the disk, this will delete them), export your database to a new instance with a smaller disk, and then delete the old instance.
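A rough sketch of that export-and-recreate path with gcloud (instance, bucket, and database names are placeholders, the tier and region are only illustrative, and the instance's service account needs write access to the bucket):
# dump the database to Cloud Storage
gcloud sql export sql old-instance gs://my-bucket/dump.sql --database=mydb
# create a smaller instance and import the dump
gcloud sql instances create new-instance --database-version=MYSQL_5_7 \
  --tier=db-n1-standard-1 --region=europe-west1 --storage-size=30GB
gcloud sql import sql new-instance gs://my-bucket/dump.sql --database=mydb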
You can also contact Google Cloud support if you need further help with your Cloud SQL instance.

AWS EB should create a new instance once my Docker container reaches its maximum memory limit

I have deployed my Dockerized microservices to AWS using Elastic Beanstalk; the services are written in Scala with Akka HTTP (https://github.com/theiterators/akka-http-microservice).
I have allocated 512 MB of memory for each Docker container and I am seeing performance problems. CPU usage increases as the server gets more requests (20%, 23%, 45%... depending on load) and then automatically comes back down to a normal level (0.88%). But memory usage keeps increasing with every request and is not released even after CPU usage returns to normal; when it reaches 100%, the Docker container is killed and restarted.
I have also enabled the auto scaling feature in EB to handle a large number of requests, but it only creates another instance after the CPU usage of the running instance reaches its maximum.
How can I set up auto scaling to create another instance once memory usage reaches its maximum limit (i.e. 500 MB out of 512 MB)?
Please provide a solution/way to resolve these problems as soon as possible, as this is a very critical problem for us.
CloudWatch doesn't natively report memory statistics. But there are some scripts that Amazon provides (usually just referred to as the "CloudWatch Monitoring Scripts for Linux") that will get the statistics into CloudWatch so you can use those metrics to build a scaling policy.
The Elastic Beanstalk documentation provides some information on installing the scripts on the Linux platform at http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/customize-containers-cw.html.
However, this will come with another caveat in that you cannot use the native Docker deployment JSON as it won't pick up the .ebextensions folder (see Where to put ebextensions config in AWS Elastic Beanstalk Docker deploy with dockerrun source bundle?). The solution here would be to create a zip of your application that includes the JSON file and .ebextensions folder and use that as the deployment artifact.
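As a rough illustration of what those scripts end up doing (the install path and schedule here are assumptions, not what the linked guide prescribes), they are typically run from cron to publish a memory metric to CloudWatch:
# crontab entry: publish memory utilization every 5 minutes
*/5 * * * * /opt/aws-scripts-mon/mon-put-instance-data.pl --mem-util --from-cron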
There is also one thing I am unclear on, and that is whether these metrics will be available to choose from under the Configuration -> Scaling section of the application. You may need to create another .ebextensions config file to set the custom metric, such as:
option_settings:
  aws:elasticbeanstalk:customoption:
    BreachDuration: 3
    LowerBreachScaleIncrement: -1
    MeasureName: MemoryUtilization
    Period: 60
    Statistic: Average
    Threshold: 90
    UpperBreachScaleIncrement: 2
Now, even if this works: if the application does not lower its memory usage after scaling out and the load goes down, the scaling policy will just keep triggering and eventually reach the maximum number of instances.
I'd first see if you can get some garbage collection statistics for the JVM and maybe tune the JVM to do garbage collection more often to help bring memory down faster after application load goes down.
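On the JVM side, a hedged example of capping the heap and picking a collector for a 512 MB container (the flag values are illustrative, not tuned for this workload) would be passing something like this to the service:
# e.g. via the Dockerfile or the container's environment
export JAVA_OPTS="-Xms128m -Xmx384m -XX:+UseG1GC -XX:MaxGCPauseMillis=200"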

Deleting files in Ceph does not free up space

I am using Ceph, uploading many files through radosgw. Afterwards, I want to delete the files. I am trying to do that in Python, like this:
# conn is an existing boto S3 connection pointed at the radosgw endpoint
bucket = conn.get_bucket(BUCKET)
for key in bucket.list():
    bucket.delete_key(key)
Afterwards, I use bucket.list() to list files in the bucket, and this says that the bucket is now empty, as I intended.
However, when I run ceph df on the mon, it shows that the OSDs still have high utilization (e.g. %RAW USED 90.91). If I continue writing (thinking that the status data just hasn't caught up with the state yet), Ceph essentially locks up (100% utilization).
What's going on?
Note: I do have these standing out in ceph status:
health HEALTH_WARN
3 near full osd(s)
too many PGs per OSD (2168 > max 300)
pool default.rgw.buckets.data has many more objects per pg than average (too few pgs?)
From what I gather online, this wouldn't cause my particular issue. But I'm new to Ceph and could be wrong.
I have one mon and 3 OSDs. This is just for testing.
You can check whether the objects are really deleted with rados -p $pool ls.
I know that for CephFS, when you delete a file, the operation returns OK as soon as the MDS marks it as deleted in local memory; the real delete happens later, by sending delete messages to the related OSDs.
Maybe radosgw uses the same design to speed up deletes.
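If radosgw does defer deletes like that, a few commands to check (the pool name comes from the health warning above; the radosgw-admin gc commands assume radosgw's usual garbage-collection behaviour) would be roughly:
# are the objects really gone from the data pool?
rados -p default.rgw.buckets.data ls | head
# per-pool usage
ceph df detail
# objects still waiting for deferred deletion in radosgw's garbage collector
radosgw-admin gc list --include-all
radosgw-admin gc process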