Elastic Cloud APM Server - Queue is full - apm

I have many Java microservices running in a Kubernetes Cluster. All of them are APM agents sending data to an APM server in our Elastic Cloud Cluster.
Everything was working fine but suddenly every microservice received the error below showed in the logs.
I tried to restart the cluster, increase the hardware power and I tried to follow the hints but no success.
Obs: The disk is almost empty and the memory usage is ok.
Everything is in 7.5.2 version

I deleted all the indexes related to APM and everything worked after some minutes.

for better performance u can fine tune these fields in apm-server.yml file
internal queue size increase queue.mem.events=output.elasticsearch.worker * output.elasticsearch.bulk_max_size
default is 4096
output.elasticsearch.worker (increase) default is 1
output.elasticsearch.bulk_max_size (increase) default is 50 very less
Example : for my use case i have used following stats for 2 apm-server nodes and 3 es nodes (1 master 2 data nodes )
queue.mem.events=40000
output.elasticsearch.worker=4
output.elasticsearch.bulk_max_size=10000

Related

First 10 long running transactions

I have a fairly small cluster of 6 nodes, 3 client, and 3 server nodes. Important configurations,
storeKeepBinary = true,
cacheMode = Partitioned (some caches's about 5-8, out of 25 have this as TRANSACTIONAL)
AtomicityMode = Atomic
backups = 1
readFromBackups = false
no persistence
When I run the app for some load/performance test on-prem on 2 large boxes, 3 clients on one box, and 3 servers on another box, all within docker containers, I get a decent performance.
However, when I move them over to AWS and run them in EKS, the only change I make is to change the cluster discovery from standard TCP (default) to Kubernetes-based discovery and run the same test.
But now the performance is very bad, I keep getting,
WARN [sys-#145%test%] - [org.apache.ignite] First 10 long-running transactions [total=3]
Here the transactions are running more than a min long.
In other cases, I am getting,
WARN [sys-#196%test-2%] - [org.apache.ignite] First 10 long-running cache futures [total=1]
Here the associated future has been running for > 3 min.
Most of the places 'google search' has taken me, talks flaky/inconsistent n/w as the cause.
The app and the test seem to be ok since on a local on-prem this works just fine and the performance is decent as well.
Wanted to check if others have faced this or when running on Kubernetes in the public cloud something else needs to be done. Like somewhere I read nodes need to be pinned to the host in a cloud/virtual environment, but it's not mandatory.
TIA

Using MySQL Workbench causes errors on Galera Cluster

I have a 5 cluster MariaDB/Galera cluster running in production environment.
I also have a monitor which checks every 20 seconds for cluster size changes. One of our other engineers has been running queries using MySQL Workbench, and when that application is running, I start seeing alerts coming from my monitor where cluster size is 1. It does recover in a few seconds back to the correct size of 5, however it's disconcerting that this client app is causing issues on the cluster. I've requested everyone on our team to not use this app... however I wonder if anyone else has seen this, or knows what it is doing to the cluster.

AWS EB should create new instance once my docker reached its maximum memory limit

I have deployed my dockerized micro services in AWS server using Elastic Beanstalk which is written using Akka-HTTP(https://github.com/theiterators/akka-http-microservice) and Scala.
I have allocated 512mb memory size for each docker and performance problems. I have noticed that the CPU usage increased when server getting more number of requests(like 20%, 23%, 45%...) & depends on load, then it automatically came down to the normal state (0.88%). But Memory usage keeps on increasing for every request and it failed to release unused memory even after CPU usage came to the normal stage and it reached 100% and docker killed by itself and restarted again.
I have also enabled auto scaling feature in EB to handle a huge number of requests. So it created another duplicate instance only after CPU usage of the running instance is reached its maximum.
How can I setup auto-scaling to create another instance once memory usage is reached its maximum limit(i.e 500mb out of 512mb)?
Please provide us a solution/way to resolve these problems as soon as possible as it is a very critical problem for us?
CloudWatch doesn't natively report memory statistics. But there are some scripts that Amazon provides (usually just referred to as the "CloudWatch Monitoring Scripts for Linux) that will get the statistics into CloudWatch so you can use those metrics to build a scaling policy.
The Elastic Beanstalk documentation provides some information on installing the scripts on the Linux platform at http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/customize-containers-cw.html.
However, this will come with another caveat in that you cannot use the native Docker deployment JSON as it won't pick up the .ebextensions folder (see Where to put ebextensions config in AWS Elastic Beanstalk Docker deploy with dockerrun source bundle?). The solution here would be to create a zip of your application that includes the JSON file and .ebextensions folder and use that as the deployment artifact.
There is also one thing I am unclear on and that is if these metrics will be available to choose from under the Configuration -> Scaling section of the application. You may need to create another .ebextensions config file to set the custom metric such as:
option_settings:
aws:elasticbeanstalk:customoption:
BreachDuration: 3
LowerBreachScaleIncrement: -1
MeasureName: MemoryUtilization
Period: 60
Statistic: Average
Threshold: 90
UpperBreachScaleIncrement: 2
Now, even if this works, if the application will not lower its memory usage after scaling and load goes down then the scaling policy would just continue to trigger and reach max instances eventually.
I'd first see if you can get some garbage collection statistics for the JVM and maybe tune the JVM to do garbage collection more often to help bring memory down faster after application load goes down.

High CPU Utilisation on AWS RDS - Postgres

Attempted to migrate my production environment from Native Postgres environment (hosted on AWS EC2) to RDS Postgres (9.4.4) but it failed miserably. The CPU utilisation of RDS Postgres instances shooted up drastically when compared to that of Native Postgres instances.
My environment details goes here
Master: db.m3.2xlarge instance
Slave1: db.m3.2xlarge instance
Slave2: db.m3.2xlarge instance
Slave3: db.m3.xlarge instance
Slave4: db.m3.xlarge instance
[Note: All the slaves were at Level 1 replication]
I had configured Master to receive only write request and this instance was all fine. The write count was 50 to 80 per second and they CPU utilisation was around 20 to 30%
But apart from this instance, all my slaves performed very bad. The Slaves were configured only to receive Read requests and I assume all writes that were happening was due to replication.
Provisioned IOPS on these boxes were 1000
And on an average there were 5 to 7 Read request hitting each slave and the CPU utilisation was 60%.
Where as in Native Postgres, we stay well with in 30% for this traffic.
Couldn't figure whats going wrong on RDS setup and AWS support is not able to provide good leads.
Did anyone face similar things with RDS Postgres?
There are lots of factors, that maximize the CPU utilization on PostgreSQL like:
Free disk space
CPU Usage
I/O usage etc.
I came across with the same issue few days ago. For me the reason was that some transactions was getting stuck and running since long time. Hence forth CPU utilization got inceased. I came to know about this, by running some postgreSql monitoring command:
SELECT max(now() - xact_start) FROM pg_stat_activity
WHERE state IN ('idle in transaction', 'active');
This command shows the time from which a transaction is running. This time should not be greater than one hour. So killing the transaction which was running from long time or that was stuck at any point, worked for me. I followed this post for monitoring and solving my issue. Post includes lots of useful commands to monitor this situation.
I would suggest increasing your work_mem value, as it might be too low, and doing normal query optimization research to see if you're using queries without proper indexes.

ElasticSearch architecture and hosting

I am looking to use Elastic Search with MongoDB to support our full text search requirements. I am struggling to find information on the architecture and hosting and would like some help. I am planning to host ES on premise rather than in the cloud. We currently have MongoDB running in replica set with three nodes.
How many servers are required to run ElasticSearch for high availability?
What is the recommended server specification. Currently my thoughts are 2 x CPU, 4GB RAM, C drive: 40GB , D drive: 40GB
How does ES support failover
Thanks
Tariq
How many servers are required to run ElasticSearch for high availability?
At least 2
What is the recommended server specification. Currently my thoughts are 2 x CPU, 4GB RAM, C drive: 40GB , D drive: 40GB
It really depends on the amount of data you're indexing, but that amount of RAM and (I'm assuming a decent dual core CPU) should be enough to get you started
How does ES support failover
you set up a clustering with multiple nodes in such a way that each node has a replica of another
So in a simple example your cluster would consist of two servers, each with one node on them.
You'd set replicas to 1 so that the shards in your node, would have a backup copy stored on the other node and vice versa.
So if a node goes down, elasticsearch will detect the failure and route the requests for that node to its replica on another node, until you fix the problem.
Of course you could make this even more robust by having 4 servers with one node each and 2 replicas, as an example. What you must understand is that elasticsearch will optimize that distribution of replicas and primary shards based on the number of shards you have.
so with the 2 nodes and 1 replica example above, say you added 2 extra servers/nodes (1 node/server is recommended), Elasticsearch would move the replicas off the nodes and to their own node, so that you'd have 2 nodes with 1 primary shard(s) and nothing else then 2 other nodes with 1 copy of those shards (replicas) each.
How many servers are required to run ElasticSearch for high
availability?
I recommend 3 servers with 3 replication factor index. It will be more stable in case of one server goes down, plus it's better for highload, cause of queries can be distributed through cluster.
What is the recommended server specification. Currently my thoughts
are 2 x CPU, 4GB RAM, C drive: 40GB , D drive: 40GB
I strongly recommend more RAM. We have 72Gb on each machine in the cluster and ES works perfectly smooth (and we still never fall in garbage collector issues)
How does ES support failover
In our case at http://indexisto.com we had a lot of test and some production cluster server fails. Starting from 3 server no any issues in case server goes down. More servers in cluster - less impact of one server fail.