AWS Container Insights on ECS has a big delay (~2-3 min) - amazon-ecs

I've set up Container Insights on an ECS cluster running Fargate.
I'm experiencing quite a long delay before metrics show up in Container Insights.
When looking at the performance log group /aws/ecs/containerinsights/{cluster_name}/performance in Logs Insights:
I can see a delay of 130s to 170s between the @ingestionTime and the @timestamp.
I also see a delay of about 60s between the advertised @ingestionTime and the time the corresponding log appears in the Logs Insights query results.
This apparently also impacts auto-scaling, making it very slow to react.
The metrics are 60s apart, taken at the start of every minute.
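For reference, a rough sketch of how that lag can be measured with boto3, by comparing each event's ingestionTime with its timestamp in the performance log group (the cluster name below is a placeholder):

import boto3

logs = boto3.client("logs")
group = "/aws/ecs/containerinsights/my-cluster/performance"

# Each returned event carries both the metric timestamp and the time
# CloudWatch actually ingested it, so the difference is the ingestion lag.
resp = logs.filter_log_events(logGroupName=group, limit=50)
for event in resp["events"]:
    lag_s = (event["ingestionTime"] - event["timestamp"]) / 1000
    print(f"{event['logStreamName']}: ingestion lag {lag_s:.0f}s")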
Anyone experienced this, or know how to tune it?

Related

GridGain server deployment/Statefulset Termination grace period

I deployed a GridGain cluster in Google Kubernetes Engine following [1]. I enabled native persistence using a StatefulSet. In the statefulset.yaml in [2], terminationGracePeriodSeconds is set to 60000. What is the purpose of this large timeout?
When deleting a pod with the kubectl delete pod command, it takes a very long time. What is a suitable value for terminationGracePeriodSeconds that avoids losing any data?
[1]. https://www.gridgain.com/docs/latest/installation-guide/kubernetes/gke-deployment
[2]. https://www.gridgain.com/docs/latest/installation-guide/kubernetes/gke-deployment#creating-pod-configuration
I believe the reasoning behind setting it to 60000 was: do not rely on it. Prior to Ignite 2.9 there was an issue with the startup script that didn't pass the termination signal through to the underlying Java app, making it impossible to perform a graceful shutdown.
If a node is being restarted gracefully and IGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN is enabled, Ignite will ensure that the node leaving won't lead to data loss. Sometimes a rebalance might take a while.
Keeping the above in mind: the hang issue might happen for Apache Ignite 2.8 and below; keeping the recommended terminationGracePeriodSeconds should be fine, and in a normal flow it will never actually be used.
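If a routine delete really does take too long, the grace period can also be overridden for a single delete, which is the equivalent of kubectl delete pod --grace-period=N. A minimal sketch with the official Kubernetes Python client, assuming a pod named gridgain-0 in the default namespace; note that cutting the grace period short can interrupt the graceful shutdown and rebalance described above:

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Override the StatefulSet's default 60000s grace period for this delete only;
# the kubelet sends SIGKILL once this many seconds have elapsed.
v1.delete_namespaced_pod(
    name="gridgain-0",
    namespace="default",
    grace_period_seconds=300,
)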

Cassandra pod is taking more bootstrap time than expected

I am running Cassandra as a Kubernetes pod. Each pod has one Cassandra container. We are running Cassandra version 3.11.4 with auto_bootstrap set to true. I have 5 nodes in production holding 20GB of data.
Because of maintenance activity, if I restart any Cassandra pod it takes 30 minutes to boot before it comes back to the UP and Normal state. In production, 30 minutes is a huge amount of time.
How can I reduce the boot-up time for the Cassandra pod?
Thank you!!
If you're restarting an existing node and the data is still there, then it's not a bootstrap of the node - it's just a restart.
One potential problem is that you're not draining the node before the restart, so all commit logs need to be replayed on start, and this can take a lot of time if you have a lot of data in the commit log (you can check system.log to see what Cassandra is doing at that time). So the solution could be to execute nodetool drain before stopping the node.
If the node is restarted because of a crash or something similar, you can think in the direction of regularly flushing data from the memtable, for example via nodetool flush, or by configuring the busiest tables with a periodic flush via the memtable_flush_period_in_ms option. But be careful with that approach, as it may create a lot of small SSTables, and this will add more load to the compaction process.
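As an illustration of both suggestions, a rough sketch assuming nodetool is on the PATH inside the container and the Python cassandra-driver is installed; the keyspace and table names are placeholders, and the drain step would typically run from a preStop hook:

import subprocess
from cassandra.cluster import Cluster

def drain_before_restart():
    # Flush memtables and stop accepting writes so the commit log is empty
    # when the process stops; replay on the next start is then minimal.
    subprocess.run(["nodetool", "drain"], check=True)

def enable_periodic_flush(host="127.0.0.1", keyspace="my_ks", table="busy_table"):
    # Flush this table's memtable every hour (3600000 ms) so a crash leaves
    # less commit log to replay; may create many small SSTables.
    cluster = Cluster([host])
    session = cluster.connect()
    session.execute(
        f"ALTER TABLE {keyspace}.{table} WITH memtable_flush_period_in_ms = 3600000"
    )
    cluster.shutdown()

if __name__ == "__main__":
    drain_before_restart()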

ELK (Elastic Stack) Visualization of running processes of a server

I am monitoring an application with Metricbeat only. Everything works fine. But I have no idea how to show the running processes at a specific time (or period of time) with a Kibana visualization. I need to know which processes were started upon a request, how long they were running, and how many resources they consumed. I was thinking about a line chart, but I have no idea how to configure it. Any ideas?
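For a sense of the data such a visualization would sit on top of, here is a rough sketch that aggregates per-process CPU from Metricbeat's system.process metricset with the Python Elasticsearch client. The field names (event.dataset, process.name, system.process.cpu.total.pct) and the index pattern assume a recent Metricbeat/ECS mapping and may differ in your version; the host is a placeholder:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Average CPU per process name over the last 15 minutes.
resp = es.search(
    index="metricbeat-*",
    size=0,
    query={"bool": {"filter": [
        {"term": {"event.dataset": "system.process"}},
        {"range": {"@timestamp": {"gte": "now-15m"}}},
    ]}},
    aggs={"per_process": {
        "terms": {"field": "process.name", "size": 20},
        "aggs": {"avg_cpu": {"avg": {"field": "system.process.cpu.total.pct"}}},
    }},
)

for bucket in resp["aggregations"]["per_process"]["buckets"]:
    print(bucket["key"], round(bucket["avg_cpu"]["value"] or 0, 4))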

Using MySQL Workbench causes errors on Galera Cluster

I have a 5-node MariaDB/Galera cluster running in a production environment.
I also have a monitor which checks every 20 seconds for cluster size changes. One of our other engineers has been running queries using MySQL Workbench, and when that application is running, I start seeing alerts from my monitor reporting a cluster size of 1. It does recover back to the correct size of 5 within a few seconds, but it's disconcerting that this client app is causing issues on the cluster. I've asked everyone on our team not to use this app... however, I wonder if anyone else has seen this, or knows what it is doing to the cluster.
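For context, a minimal sketch of the kind of monitor described above, assuming the PyMySQL client and placeholder credentials; it polls the Galera wsrep_cluster_size status variable every 20 seconds:

import time
import pymysql

EXPECTED_SIZE = 5

def cluster_size(host="db1.example.com", user="monitor", password="secret"):
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor() as cur:
            # Galera reports the number of nodes in the primary component here.
            cur.execute("SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'")
            _, value = cur.fetchone()
            return int(value)
    finally:
        conn.close()

while True:
    size = cluster_size()
    if size != EXPECTED_SIZE:
        print(f"ALERT: cluster size is {size}, expected {EXPECTED_SIZE}")
    time.sleep(20)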

High CPU Utilisation on AWS RDS - Postgres

I attempted to migrate my production environment from a native Postgres setup (hosted on AWS EC2) to RDS Postgres (9.4.4), but it failed miserably. The CPU utilisation of the RDS Postgres instances shot up drastically compared to that of the native Postgres instances.
My environment details are as follows:
Master: db.m3.2xlarge instance
Slave1: db.m3.2xlarge instance
Slave2: db.m3.2xlarge instance
Slave3: db.m3.xlarge instance
Slave4: db.m3.xlarge instance
[Note: All the slaves were at Level 1 replication]
I had configured the Master to receive only write requests, and this instance was all fine. The write count was 50 to 80 per second and the CPU utilisation was around 20 to 30%.
But apart from this instance, all my slaves performed very badly. The slaves were configured to receive only read requests, and I assume all the writes happening on them were due to replication.
Provisioned IOPS on these boxes was 1000.
On average there were 5 to 7 read requests hitting each slave, and the CPU utilisation was 60%.
Whereas on native Postgres, we stay well within 30% for this traffic.
I couldn't figure out what's going wrong with the RDS setup, and AWS support is not able to provide good leads.
Has anyone faced similar issues with RDS Postgres?
There are lots of factors that can drive CPU utilization up on PostgreSQL, like:
Free disk space
CPU usage
I/O usage, etc.
I came across the same issue a few days ago. For me, the reason was that some transactions were getting stuck and had been running for a long time, which drove CPU utilization up. I found this by running a PostgreSQL monitoring query:
SELECT max(now() - xact_start) FROM pg_stat_activity
WHERE state IN ('idle in transaction', 'active');
This query shows how long the oldest matching transaction has been running. That time should not be greater than one hour. So killing the transactions that had been running for a long time, or were stuck at some point, worked for me. I followed this post for monitoring and solving my issue; it includes lots of useful commands to monitor this situation.
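As a rough illustration of that approach, a sketch that lists transactions open for more than an hour and terminates them with pg_terminate_backend, using psycopg2; the connection string is a placeholder, and you should review what is being killed before running anything like this in production:

import psycopg2

conn = psycopg2.connect("host=my-rds-endpoint dbname=mydb user=admin password=secret")
conn.autocommit = True

with conn.cursor() as cur:
    # Find backends whose transaction has been open for more than an hour.
    cur.execute("""
        SELECT pid, now() - xact_start AS duration, query
        FROM pg_stat_activity
        WHERE state IN ('idle in transaction', 'active')
          AND xact_start < now() - interval '1 hour'
    """)
    for pid, duration, query in cur.fetchall():
        print(f"terminating pid {pid}, open for {duration}: {(query or '')[:80]}")
        cur.execute("SELECT pg_terminate_backend(%s)", (pid,))

conn.close()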
I would suggest increasing your work_mem value, as it might be too low, and doing normal query optimization research to see if you're using queries without proper indexes.
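On the indexing point, a quick way to spot candidates is to look for tables that are read mostly via sequential scans; a short sketch, again with psycopg2 and a placeholder connection string:

import psycopg2

conn = psycopg2.connect("host=my-rds-endpoint dbname=mydb user=admin password=secret")
with conn.cursor() as cur:
    # Tables with far more sequential scans than index scans often need an index.
    cur.execute("""
        SELECT relname, seq_scan, idx_scan
        FROM pg_stat_user_tables
        WHERE seq_scan > COALESCE(idx_scan, 0)
        ORDER BY seq_scan DESC
        LIMIT 10
    """)
    for relname, seq_scan, idx_scan in cur.fetchall():
        print(f"{relname}: {seq_scan} seq scans vs {idx_scan} index scans")
conn.close()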