Keepalived vrrp_script does not failover - CentOS

I have 2 nodes running keepalived and haproxy services (CentOS 7).
If I shut down one node, everything works fine. But I also want the VIPs to fail over when haproxy is down.
This is the 1st node's config:
vrrp_script ha_check {
    script "/etc/keepalived/haproxy_check"
    interval 2
    weight 21
}
vrrp_instance VI_1 {
    state MASTER
    interface eno16777984
    virtual_router_id 151
    priority 101
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 11111
    }
    virtual_ipaddress {
        10.0.100.233
    }
    smtp_alert
    track_script {
        ha_check
    }
}
2nd node:
vrrp_script ha_check {
    script "/etc/keepalived/haproxy_check"
    interval 2
    fall 2
    rise 2
    timeout 1
    weight 2
}
vrrp_instance VI_1 {
    state BACKUP
    interface eno16777984
    virtual_router_id 151
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 11111
    }
    virtual_ipaddress {
        10.0.100.233
    }
    smtp_alert
    track_script {
        ha_check
    }
}
cat /etc/keepalived/haproxy_check
systemctl status haproxy | grep "inactive"
When I stop haproxy, the VIPs still do not fail over to the other host.
[root@cks-hatest1 keepalived]# tail /var/log/messages
Nov 30 10:35:24 cks-hatest1 Keepalived_vrrp[5891]: VRRP_Script(ha_check) failed
Nov 30 10:35:33 cks-hatest1 systemd: Started HAProxy Load Balancer.
Nov 30 10:35:45 cks-hatest1 systemd: Stopping HAProxy Load Balancer...
Nov 30 10:35:45 cks-hatest1 systemd: Stopped HAProxy Load Balancer.
Nov 30 10:35:46 cks-hatest1 Keepalived_vrrp[5891]: VRRP_Script(ha_check) succeeded
What am I doing wrong? Thank you in advance!

In your script you are checking whether the output of
systemctl status haproxy
contains the keyword "inactive". Is that the value you actually get when you stop the haproxy service manually?
Also, according to your logs, the haproxy service gets started again as soon as it is stopped. Can you verify that?
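One thing to keep in mind: keepalived only acts on the exit status of the check script, and grep exits 0 when it finds a match. So if systemctl prints "inactive" once haproxy is stopped, the check succeeds exactly when haproxy is down, which is the opposite of what you want (your log shows the script failing while haproxy was up and succeeding right after it was stopped). A quick sketch to confirm this on one node:

# while haproxy is running:
systemctl status haproxy | grep "inactive"; echo $?   # no match -> exit 1 -> keepalived: script failed
# after stopping it:
systemctl stop haproxy
systemctl status haproxy | grep "inactive"; echo $?   # "Active: inactive (dead)" matches -> exit 0 -> script succeeded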
Also, try replacing the script with
script "killall -0 haproxy"

It's easy. Try this for example:
vrrp_script check_haproxy {
    script "pidof haproxy"
    interval 2
    weight 2
}
At the end of the config you should add the following part too:
track_script {
    check_haproxy
}
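As a sanity check on the numbers: assuming keepalived's documented behaviour that a positive weight is added to a node's priority while its check script succeeds, with your priorities of 101/100 and a weight of 2 the master advertises 103 and the backup 102 while both haproxy instances are up. If haproxy dies on the master, its priority drops back to 101, the backup's 102 wins, and the VIP moves. So failover only works if the check script really does return non-zero when haproxy is down.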

Related

How to make keepalived respond on failover on HAProxy?

I am using HAProxy as the load balancer for my application, and to make it highly available I am using the keepalived service with a floating IP address. But whenever my primary load balancer server goes down (by removing it from the network or turning it off), all my services go down instead of the secondary load balancer server taking over.
My keepalived.conf for master server is,
global_defs
{
    # Keepalived process identifier
    lvs_id haproxy_DH
}
# Script used to check if HAProxy is running
vrrp_script check_haproxy
{
    script "pidof haproxy"
    interval 2
    weight 2
}
# Virtual interface
vrrp_instance VI_01
{
    state MASTER
    interface eno16777984    # name of the network interface
    virtual_router_id 51
    priority 101
    # The virtual ip address shared between the two loadbalancers
    virtual_ipaddress {
        172.16.231.162
    }
    track_script {
        check_haproxy
    }
}
For the backup server it is:
global_defs
{
    # Keepalived process identifier
    lvs_id haproxy_DH_passive
}
# Script used to check if HAProxy is running
vrrp_script check_haproxy
{
    script "pidof haproxy"
    interval 2
    weight 2
}
# Virtual interface
vrrp_instance VI_01
{
    state BACKUP
    interface eno16777984    # name of the network interface
    virtual_router_id 51
    priority 100
    # The virtual ip address shared between the two loadbalancers
    virtual_ipaddress {
        172.16.231.162
    }
    track_script {
        check_haproxy
    }
}
The virtual IP address is assigned and working when both load balancers are up, but whenever a machine goes down, my service also goes down. I am using CentOS 7. Please help.
Use this,
global_defs {
    router_id ovp_vrrp
}
vrrp_script haproxy_check {
    script "killall -0 haproxy"
    interval 2
    weight 2
}
vrrp_instance OCP_EXT {
    interface ens192
    virtual_router_id 51
    priority 100
    state MASTER
    virtual_ipaddress {
        10.19.114.231 dev ens192
    }
    track_script {
        haproxy_check
    }
    authentication {
        auth_type PASS
        auth_pass 1cee4b6e-2cdc-48bf-83b2-01a96d1593e4
    }
}
More info: https://www.openshift.com/blog/haproxy-highly-available-keepalived
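Once both nodes run a config like this, a quick way to verify the failover (a sketch, using the interface and VIP from the example above) is to stop haproxy on the current master and watch where the VIP ends up:

# on the current master
systemctl stop haproxy
ip addr show dev ens192 | grep 10.19.114.231    # the VIP should disappear here within a few seconds
# on the backup node
ip addr show dev ens192 | grep 10.19.114.231    # ...and show up here
journalctl -u keepalived -n 20                  # look for the BACKUP -> MASTER transition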

EKS kube-system deployments CrashLoopBackOff

I am trying to deploy Kube State Metrics into the kube-system namespace in my EKS Cluster (eks.4) running Kubernetes v1.14.
Kubernetes Connection
provider "kubernetes" {
host = var.cluster.endpoint
token = data.aws_eks_cluster_auth.cluster_auth.token
cluster_ca_certificate = base64decode(var.cluster.certificate)
load_config_file = true
}
Deployment Manifest (as .tf)
resource "kubernetes_deployment" "kube_state_metrics" {
metadata {
name = "kube-state-metrics"
namespace = "kube-system"
labels = {
k8s-app = "kube-state-metrics"
}
}
spec {
replicas = 1
selector {
match_labels = {
k8s-app = "kube-state-metrics"
}
}
template {
metadata {
labels = {
k8s-app = "kube-state-metrics"
}
}
spec {
container {
name = "kube-state-metrics"
image = "quay.io/coreos/kube-state-metrics:v1.7.2"
port {
name = "http-metrics"
container_port = 8080
}
port {
name = "telemetry"
container_port = 8081
}
liveness_probe {
http_get {
path = "/healthz"
port = "8080"
}
initial_delay_seconds = 5
timeout_seconds = 5
}
readiness_probe {
http_get {
path = "/"
port = "8080"
}
initial_delay_seconds = 5
timeout_seconds = 5
}
}
service_account_name = "kube-state-metrics"
}
}
}
}
I have deployed all the required RBAC manifests from https://github.com/kubernetes/kube-state-metrics/tree/master/kubernetes as well - redacted here for brevity.
When I run terraform apply on the deployment above, the Terraform output is as follows:
kubernetes_deployment.kube_state_metrics: Still creating... [6m50s elapsed]
Eventually timing out at 10m.
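While Terraform sits in Still creating..., the pod itself can be inspected directly with kubectl (a sketch, assuming your kubeconfig points at the EKS cluster):

kubectl -n kube-system get pods -l k8s-app=kube-state-metrics
kubectl -n kube-system describe pod -l k8s-app=kube-state-metrics
kubectl -n kube-system logs -l k8s-app=kube-state-metrics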
Here is the log output from the kube-state-metrics pod:
I0910 23:41:19.412496 1 main.go:140] metric white-blacklisting: blacklisting the following items:
W0910 23:41:19.412535 1 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
W0910 23:41:19.412565 1 client_config.go:546] error creating inClusterConfig, falling back to default config: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
F0910 23:41:19.412782 1 main.go:148] Failed to create client: invalid configuration: no configuration has been provided
Adding the following to the pod spec got me to a successful deployment.
automount_service_account_token = true
For posterity:
resource "kubernetes_deployment" "kube_state_metrics" {
metadata {
name = "kube-state-metrics"
namespace = "kube-system"
labels = {
k8s-app = "kube-state-metrics"
}
}
spec {
replicas = 1
selector {
match_labels = {
k8s-app = "kube-state-metrics"
}
}
template {
metadata {
labels = {
k8s-app = "kube-state-metrics"
}
}
spec {
automount_service_account_token = true
container {
name = "kube-state-metrics"
image = "quay.io/coreos/kube-state-metrics:v1.7.2"
port {
name = "http-metrics"
container_port = 8080
}
port {
name = "telemetry"
container_port = 8081
}
liveness_probe {
http_get {
path = "/healthz"
port = "8080"
}
initial_delay_seconds = 5
timeout_seconds = 5
}
readiness_probe {
http_get {
path = "/"
port = "8080"
}
initial_delay_seconds = 5
timeout_seconds = 5
}
}
service_account_name = "kube-state-metrics"
}
}
}
}
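To confirm that the token is actually mounted afterwards, you can list the standard in-cluster service account path that the error message referred to (a sketch):

POD=$(kubectl -n kube-system get pods -l k8s-app=kube-state-metrics -o jsonpath='{.items[0].metadata.name}')
kubectl -n kube-system exec "$POD" -- ls /var/run/secrets/kubernetes.io/serviceaccount/
# expected: ca.crt  namespace  token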
I didn't try it with Terraform.
I just ran this deployment locally and got the same error.
Please run your deployment locally to see the state of your deployment and pods.
I0910 13:25:49.632847 1 main.go:140] metric white-blacklisting: blacklisting the following items:
W0910 13:25:49.632871 1 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
and finally:
I0910 13:25:49.634748 1 main.go:185] Testing communication with server
I0910 13:25:49.650994 1 main.go:190] Running with Kubernetes cluster version: v1.12+. git version: v1.12.8-gke.10. git tree state: clean. commit: f53039cc1e5295eed20969a4f10fb6ad99461e37. platform: linux/amd64
I0910 13:25:49.651028 1 main.go:192] Communication with server successful
I0910 13:25:49.651598 1 builder.go:126] Active collectors: certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,limitranges,namespaces,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses
I0910 13:25:49.651607 1 main.go:226] Starting metrics server: 0.0.0.0:8080
I0910 13:25:49.652149 1 main.go:201] Starting kube-state-metrics self metrics server: 0.0.0.0:8081
verification:
Connected to kube-state-metrics (xx.xx.xx.xx) port 8080 (#0)
GET /metrics HTTP/1.1
Host: kube-state-metrics:8080
User-Agent: curl/7.58.0
Accept: */*
HTTP/1.1 200 OK
Content-Type: text/plain; version=0.0.4
Date: Tue, 10 Sep 2019 13:39:52 GMT
Transfer-Encoding: chunked
[49027 bytes data]
# HELP kube_certificatesigningrequest_labels Kubernetes labels converted to Prometheus labels.
If you are building your own image, please follow the issues on GitHub and the docs.
Update: just to clarify.
As mentioned in my answer, I didn't try with Terraform, but the original question seemed to describe only one problem: W0910 13:25:49.632871 1 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
So I suggested running this deployment locally and verifying all the errors from the logs. It later turned out that the problem was with automount_service_account_token; this important error wasn't part of the original question.
So please follow the Terraform issues on GitHub to see how you can solve this problem.
As per the description on GitHub:
I spent hours trying to figure out why a service account and deployment wasn't working in Terraform, but worked with no issues in kubectl - it was the AutomountServiceAccountToken being hardcoded to False in the deployment resource.
At a minimum this should be documented in the Terraform docs for the resource with something noting the resource does not behave like kubectl does.
I hope this explains the problem.

Connect two machines in AKKA remotely ,connection refused

I'm new to Akka and wanted to connect two PCs using Akka remoting, just to run some code on both (as 2 actors). I tried the example in the Akka docs. All I really did was add the 2 IP addresses to the config file, but I always get this error.
The first machine gives me this error:
[info] [ERROR] [11/20/2018 13:58:48.833]
[ClusterSystem-akka.remote.default-remote-dispatcher-6]
[akka.remote.artery.Association(akka://ClusterSystem)] Outbound
control stream to [akka://ClusterSystem@192.168.1.2:2552] failed.
Restarting it. Handshake with [akka://ClusterSystem@192.168.1.2:2552]
did not complete within 20000 ms
(akka.remote.artery.OutboundHandshake$HandshakeTimeoutException:
Handshake with [akka://ClusterSystem@192.168.1.2:2552] did not
complete within 20000 ms)
And the second machine:
Exception in thread "main"
akka.remote.RemoteTransportException: Failed to bind TCP to
[192.168.1.3:2552] due to: Bind failed because of
java.net.BindException: Cannot assign requested address: bind
Config file content:
akka {
  actor {
    provider = cluster
  }
  remote {
    artery {
      enabled = on
      transport = tcp
      canonical.hostname = "192.168.1.3"
      canonical.port = 0
    }
  }
  cluster {
    seed-nodes = [
      "akka://ClusterSystem@192.168.1.3:2552",
      "akka://ClusterSystem@192.168.1.2:2552"]
    # auto downing is NOT safe for production deployments.
    # you may want to use it during development, read more about it in the docs.
    auto-down-unreachable-after = 120s
  }
}
# Enable metrics extension in akka-cluster-metrics.
akka.extensions=["akka.cluster.metrics.ClusterMetricsExtension"]
# Sigar native library extract location during tests.
# Note: use per-jvm-instance folder when running multiple jvm on one host.
akka.cluster.metrics.native-library-extract-folder=${user.dir}/target/native
First of all, you don't need cluster configuration for Akka remoting. Both PCs (nodes) should have remoting enabled with a concrete port instead of "0"; that way you know which port to connect to.
Use the configurations below.
PC1
akka {
  actor {
    provider = remote
  }
  remote {
    artery {
      enabled = on
      transport = tcp
      canonical.hostname = "192.168.1.3"
      canonical.port = 19000
    }
  }
}
PC2
akka {
  actor {
    provider = remote
  }
  remote {
    artery {
      enabled = on
      transport = tcp
      canonical.hostname = "192.168.1.4"
      canonical.port = 18000
    }
  }
}
Use the below actor path to connect to any actor on PC2 from PC1:
akka://<PC2-ActorSystem>@192.168.1.4:18000/user/<actor deployed in PC2>
Use the below actor path to connect from PC2 to PC1:
akka://<PC1-ActorSystem>@192.168.1.3:19000/user/<actor deployed in PC1>
The port numbers and IP addresses are samples.
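Given the BindException on the second machine, it is also worth checking that canonical.hostname is an address actually assigned to that machine and that the chosen ports are reachable between the two PCs. A quick sketch with standard Linux tools (adjust the addresses and ports to your setup):

ip addr show                # the IP used in canonical.hostname must be listed here
ss -tlnp | grep 19000       # after starting the ActorSystem: is artery listening on the port?
nc -vz 192.168.1.4 18000    # from PC1: can PC2's remoting port be reached?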

Google container engine cluster showing large number of dns errors in logs

I am using Google Container Engine and getting tons of DNS errors in the logs, like:
10:33:11.000 I0720 17:33:11.547023 1 dns.go:439] Received DNS Request:kubernetes.default.svc.cluster.local., exact:false
And:
10:46:11.000 I0720 17:46:11.546237 1 dns.go:539] records:[0xc8203153b0], retval:[{10.71.240.1 0 10 10 false 30 0 /skydns/local/cluster/svc/default/kubernetes/3465623435313164}], path:[local cluster svc default kubernetes]
This is the payload.
{
  metadata: {
    severity: "ERROR"
    serviceName: "container.googleapis.com"
    zone: "us-central1-f"
    labels: {
      container.googleapis.com/cluster_name: "some-name"
      compute.googleapis.com/resource_type: "instance"
      compute.googleapis.com/resource_name: "fluentd-cloud-logging-gke-master-cluster-default-pool-f5547509-"
      container.googleapis.com/instance_id: "instanceid"
      container.googleapis.com/pod_name: "fdsa"
      compute.googleapis.com/resource_id: "someid"
      container.googleapis.com/stream: "stderr"
      container.googleapis.com/namespace_name: "kube-system"
      container.googleapis.com/container_name: "kubedns"
    }
    timestamp: "2016-07-20T17:33:11.000Z"
    projectNumber: ""
  }
  textPayload: "I0720 17:33:11.547023 1 dns.go:439] Received DNS Request:kubernetes.default.svc.cluster.local., exact:false"
  log: "kubedns"
}
Everything is working; the logs are just polluted with errors. Any ideas on why this is happening, or whether I should be concerned?
Thanks for the question, Aaron. Those error messages are actually just tracing/debugging output from the container and don't indicate that anything is wrong. The fact that they get written out as error messages has been fixed in Kubernetes at head and will be better in the next release of Kubernetes.

Mesos DCOS doesn't install Kafka

I'm trying to install Kafka on Mesos. Installation seems to have succeeded.
vagrant@DevNode:/dcos$ dcos package install kafka
This will install Apache Kafka DCOS Service.
Continue installing? [yes/no] yes
Installing Marathon app for package [kafka] version [0.9.4.0]
Installing CLI subcommand for package [kafka] version [0.9.4.0]
New command available: dcos kafka
The Apache Kafka DCOS Service is installed:
docs - https://github.com/mesos/kafka
issues - https://github.com/mesos/kafka/issues
vagrant@DevNode:/dcos$ dcos package list
NAME VERSION APP COMMAND DESCRIPTION
kafka 0.9.4.0 /kafka kafka Apache Kafka running on top of Apache Mesos
But the Kafka task is not started.
vagrant@DevNode:/dcos$ dcos kafka
Error: Kafka is not running
vagrant@DevNode:/dcos$
The Marathon UI says the service is waiting. It looks like it is not accepting the resources allocated to it. More logs here:
Mar 23 03:52:59 ip-10-0-4-194.ec2.internal java[1425]: [2016-03-23 03:52:59,335] INFO Offer ID: [54f71504-b37a-4954-b082-e1f2a04b7fa4-O77]. Considered resources with roles: [*]. Not all basic resources satisfied: cpu not in offer, disk SATISFIED (0.0 <= 0.0), mem not in offer (mesosphere.mesos.ResourceMatcher$:marathon-akka.actor.default-dispatcher-11)
Mar 23 03:52:59 ip-10-0-4-194.ec2.internal java[1425]: [2016-03-23 03:52:59,370] INFO Offer [54f71504-b37a-4954-b082-e1f2a04b7fa4-O77]. Insufficient resources for [/kafka] (need cpus=0.5, mem=307.0, disk=0.0, ports=(1 dynamic), available in offer: [id { value: "54f71504-b37a-4954-b082-e1f2a04b7fa4-O77" } framework_id { value: "54f71504-b37a-4954-b082-e1f2a04b7fa4-0000" } slave_id { value: "54f71504-b37a-4954-b082-e1f2a04b7fa4-S1" } hostname: "10.0.4.190" resources { name: "ports" type: RANGES ranges { range { begin: 1 end: 21 } range { begin: 23 end: 5050 } range { begin: 5052 end: 32000 } } role: "slave_public" } resources { name: "cpus" type: SCALAR scalar { value: 4.0 } role: "slave_public" } resources { name: "mem" type: SCALAR scalar { value: 14019.0 } role: "slave_public" } resources { name: "disk" type: SCALAR scalar { value: 32541.0 } role: "slave_public" } attributes { name: "public_ip" type: TEXT text { value: "true" } } url { scheme: "http" address { hostname: "10.0.4.190" ip: "10.0.4.190" port: 5051 } path: "/slave(1)" }] (mesosphere.mesos.TaskBuilder:marathon-akka.actor.default-dispatcher-11)
Mesos master logs:
Mar 23 15:38:22 ip-10-0-4-194.ec2.internal mesos-master[1371]: I0323 15:38:22.339759 1376 master.cpp:5350] Sending 2 offers to framework 54f71504-b37a-4954-b082-e1f2a04b7fa4-0000 (marathon) at scheduler-f86b567c-de59-4891-916b-fb00c7959a09@10.0.4.194:60450
Mar 23 15:38:22 ip-10-0-4-194.ec2.internal mesos-master[1371]: I0323 15:38:22.341790 1381 master.cpp:3673] Processing DECLINE call for offers: [ 54f71504-b37a-4954-b082-e1f2a04b7fa4-O373 ] for framework 54f71504-b37a-4954-b082-e1f2a04b7fa4-0000 (marathon) at scheduler-f86b567c-de59-4891-916b-fb00c7959a09@10.0.4.194:60450
Mar 23 15:38:22 ip-10-0-4-194.ec2.internal mesos-master[1371]: I0323 15:38:22.342041 1381 master.cpp:3673] Processing DECLINE call for offers: [ 54f71504-b37a-4954-b082-e1f2a04b7fa4-O374 ] for framework 54f71504-b37a-4954-b082-e1f2a04b7fa4-0000 (marathon) at scheduler-f86b567c-de59-4891-916b-fb00c7959a09@10.0.4.194:60450
Not sure why Marathon didn't like that offer. I'm fairly sure there are enough resources.
Marathon is waiting for an Offer with enough resources for the Kafka scheduler. The offers it is rejecting appear not to have any cpu or memory. The Offer you see which does have sufficient resources is already statically reserved for the Role "slave_public".
The Kafka scheduler will be running with the Role *. Your cluster lacks sufficient resources in the default Role *, which is the role associated with private agents.
You should look at the Mesos master's /state endpoint and check the available resources on agents with the "*" Role.
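For example, a quick way to inspect that (a sketch; it assumes the Mesos master API is reachable on port 5050, that jq is installed, and <mesos-master> is a placeholder for your master's address):

# list each agent with its total resources; depending on the Mesos version,
# reserved_resources shows what is set aside for roles such as slave_public
curl -s http://<mesos-master>:5050/master/state.json | jq '.slaves[] | {hostname, resources, reserved_resources}'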