Dangling disks after cluster removal - Kubernetes

As part of a university course I had to deploy an application to IBM Cloud Kubernetes Service.
I have a pay-as-you-go account with my credit card attached to it.
I deployed the application to the cluster (the paid tier with a public IP), and after a few days and the demonstration the cluster was no longer needed.
The cluster was configured to use dynamic provisioning of persistent storage via the ibmcloud-block-storage-plugin.
The problem is that the cluster provisioned tens of disks, and when I removed it through the IBM Cloud UI (with the option to remove all persistent volumes checked), the disks are still displayed as active.
Result of invoking ibmcloud sl block volume-list:
77394321 SL02SEL1854117-1 dal13 endurance_block_storage 20 - 161.26.114.100 0 1
78180815 SL02SEL1854117-2 dal10 endurance_block_storage 20 - 161.26.98.107 0 1
78180817 SL02SEL1854117-3 dal10 endurance_block_storage 20 - 161.26.98.107 1 1
78180827 SL02SEL1854117-4 dal10 endurance_block_storage 20 - 161.26.98.106 3 1
78180829 SL02SEL1854117-5 dal10 endurance_block_storage 20 - 161.26.98.108 2 1
78184235 SL02SEL1854117-6 dal10 endurance_block_storage 20 - 161.26.98.88 4 1
78184249 SL02SEL1854117-7 dal10 endurance_block_storage 20 - 161.26.98.86 5 1
78184285 SL02SEL1854117-8 dal10 endurance_block_storage 20 - 161.26.98.107 6 1
78184289 SL02SEL1854117-9 dal10 endurance_block_storage 20 - 161.26.98.105 7 1
78184457 SL02SEL1854117-10 dal10 endurance_block_storage 20 - 161.26.98.85 9 1
78184465 SL02SEL1854117-11 dal10 endurance_block_storage 20 - 161.26.98.88 8 1
78184485 SL02SEL1854117-12 dal10 endurance_block_storage 20 - 161.26.98.86 10 1
78184521 SL02SEL1854117-13 dal10 endurance_block_storage 20 - 161.26.98.106 0 1
78184605 SL02SEL1854117-14 dal10 endurance_block_storage 20 - 161.26.98.87 1 1
78184643 SL02SEL1854117-15 dal10 endurance_block_storage 20 - 161.26.98.85 2 1
78184689 SL02SEL1854117-16 dal10 endurance_block_storage 20 - 161.26.98.87 3 1
78184725 SL02SEL1854117-17 dal10 endurance_block_storage 20 - 161.26.98.108 11 1
[ ... more entries there ... ]
All of those disks were created by the default IBM bronze block storage class for Kubernetes clusters and have the standard Delete reclaim policy set (so they should have been deleted automatically).
When I try to delete any of them with ibmcloud sl block volume-cancel --immediate --force 77394321 I get:
Failed to cancel block volume: 77394321.
No billing item is found to cancel.
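A bulk variant of the same attempt over every listed volume could look roughly like this; the --output json flag and the jq dependency are assumptions on my part, so adjust the parsing if your CLI version differs:
#!/bin/sh
# Try to cancel every remaining classic block volume on the account.
# Assumes the IBM Cloud CLI is logged in and that volume-list can emit JSON.
for id in $(ibmcloud sl block volume-list --output json | jq -r '.[].id'); do
    echo "Cancelling volume $id ..."
    ibmcloud sl block volume-cancel --immediate --force "$id"
done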
What's more, the IBM Cloud console displays those disks as active and there's no option to delete them (the option in the menu is grayed out).
I don't want to be billed for more than 40 x 20 GB disks, as the cluster doesn't even need that many resources (the fault was in badly defined Kubernetes configs).
What is the correct way to remove the disks, or is this only a delay on IBM Cloud's side and everything will be fine with my billing (my billing shows only around $19 for the cluster's public IP, nothing more)?
Edit
It seems that after some time the problem was resolved (I created a ticket, but I don't know whether the sales team solved it; probably it was enough to just wait, as @Sandip Amin suggested in the comments).

Opening a support case would probably be the best course of action here, as we'll likely need some account info from you to figure out what happened (or rather, why the expected actions didn't happen).
Log in to IBM Cloud and visit https://cloud.ibm.com/unifiedsupport/supportcenter (or click the Support link in the masthead of the page). If you comment back here with your case number, I'll help follow up on it.

Related

Nagios event handler ignoring check interval

I have recently created an event handler for a service check which will restart Tomcat on 3 different boxes.
The check settings are:
5 checks
2-minute checks when OK
5-minute checks otherwise
In the event handler script I have:
# What state is the iOS PN in?
case "$1" in
OK)
    # The service is ok, so don't do anything...
    ;;
WARNING)
    # Is this a "soft" or a "hard" state?
    case "$2" in
    SOFT)
        case "$3" in
        # Check number
        2)
            echo "`date` Restarting Tomcat on Node 1 for iOS PN (2nd soft warning state)..." >> /tmp/iOSPN.log
            ;;
        3)
            echo "`date` Restarting Tomcat on Node 2 for iOS PN (3rd soft warning state)..." >> /tmp/iOSPN.log
            ;;
        4)
            echo "`date` Restarting Tomcat on Node 3 for iOS PN (4th soft warning state)..." >> /tmp/iOSPN.log
            ;;
        esac
        ;;
    HARD)
        # Do nothing, let Nagios send alert
        ;;
    esac
    ;;
CRITICAL)
    # In theory nothing should reach this point...
    ;;
esac
exit 0
So the event handler should restart Tomcat on node 1 after the 2nd soft warning check, wait 5 minutes before checking again, restart Tomcat on node 2 if it is still an issue, then wait another 5 minutes, check again, and restart Tomcat on node 3 if it is still an issue.
However when I check the log file I can see the following:
Thu Apr 18 15:09:13 2019 Restarting Tomcat on Node 1 for iOS PN (2nd soft warning state)...
Thu Apr 18 15:09:23 2019 Restarting Tomcat on Node 2 for iOS PN (3rd soft warning state)...
Thu Apr 18 15:09:33 2019 Restarting Tomcat on Node 3 for iOS PN (4th soft warning state)...
As you can see, it would have restarted each box after only 10 seconds, not 5 minutes. (I have removed the lines which actually restart Tomcat, as a restart cannot complete in that short amount of time.)
I cannot see anything in the Nagios logs explaining why it ran the next check so quickly afterwards, so any help would be appreciated.
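For what it's worth, the handler itself could enforce a minimum gap between restart attempts regardless of how often Nagios invokes it. This is only a sketch; the 300-second threshold and the /tmp/iOSPN.last state file are arbitrary choices for the example:
# Bail out if fewer than 300 seconds have passed since the last restart attempt.
# /tmp/iOSPN.last and the 300s threshold are arbitrary values for this sketch.
STAMP=/tmp/iOSPN.last
NOW=`date +%s`
LAST=`cat "$STAMP" 2>/dev/null`
LAST=${LAST:-0}
if [ $((NOW - LAST)) -lt 300 ]; then
    echo "`date` Skipping restart, last attempt was $((NOW - LAST))s ago" >> /tmp/iOSPN.log
    exit 0
fi
echo "$NOW" > "$STAMP"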
Additional:
This is the service definition:
define service{
    use                  5check-service
    host_name            ACTIVEMQ1
    contact_groups       tyrell-admins-non-critical
    service_description  ActiveMQ - iOS PushNotification Queue Pending Items
    event_handler        restartRemote_Tomcat!$SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
    check_command        check_activemq_queue_item2!http://activemq1:8161/admin/xml/queues.jsp!IosPushNotificationQueue!100!300
}
define service{
    name                            5check-service  ; The 'name' of this service template
    active_checks_enabled           1               ; Active service checks are enabled
    passive_checks_enabled          1               ; Passive service checks are enabled/accepted
    parallelize_check               1               ; Active service checks should be parallelized (disabling this can lead to major performance problems)
    obsess_over_service             1               ; We should obsess over this service (if necessary)
    check_freshness                 0               ; Default is to NOT check service 'freshness'
    notifications_enabled           1               ; Service notifications are enabled
    event_handler_enabled           1               ; Service event handler is enabled
    flap_detection_enabled          1               ; Flap detection is enabled
    failure_prediction_enabled      1               ; Failure prediction is enabled
    process_perf_data               1               ; Process performance data
    retain_status_information       1               ; Retain status information across program restarts
    retain_nonstatus_information    1               ; Retain non-status information across program restarts
    is_volatile                     0               ; The service is not volatile
    check_period                    24x7            ; The service can be checked at any time of the day
    max_check_attempts              5               ; Re-check the service up to 5 times in order to determine its final (hard) state
    normal_check_interval           2               ; Check the service every 2 minutes under normal conditions
    retry_check_interval            5               ; Re-check the service every 5 minutes until a hard state can be determined
    contact_groups                  support         ; Notifications get sent out to everyone in the 'support' group
    notification_options            w,u,c,r         ; Send notifications about warning, unknown, critical, and recovery events
    notification_interval           5               ; Re-notify about service problems every 5 minutes
    notification_period             24x7            ; Notifications can be sent out at any time
    register                        0               ; DON'T REGISTER THIS DEFINITION - IT'S NOT A REAL SERVICE, JUST A TEMPLATE!
}
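One thing worth double-checking (an assumption about the setup, not something shown above) is the interval_length setting in nagios.cfg: normal_check_interval and retry_check_interval are multiplied by that unit, which defaults to 60 seconds, so retry_check_interval 5 only means 5 minutes if interval_length is still 60.
# Confirm the time unit the *_check_interval values are multiplied by.
# The nagios.cfg path is a guess; adjust it for your installation.
grep '^interval_length' /etc/nagios/nagios.cfg
# With the default of 60, the template above gives:
#   normal_check_interval 2  ->  2 * 60s = 120s between checks while OK
#   retry_check_interval  5  ->  5 * 60s = 300s between retries in a SOFT state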

Error opening app when deploying; using the GitHub Watson Voice Bot tutorial

https://github.com/IBM/watson-voice-bot
I am fairly new to using Watson Assistant and the IBM Cloud CLI, but I am trying to link a Watson Assistant to IBM's Speech to Text plugin/API. There's a great tutorial provided on GitHub, but I have run into problems trying to deploy the app, and have been unable to get any assistance so far (from people on GitHub).
0 of 1 instances running, 1 starting
0 of 1 instances running, 1 starting
0 of 1 instances running, 1 starting
0 of 1 instances running, 1 starting
0 of 1 instances running, 1 starting
0 of 1 instances running, 1 starting
0 of 1 instances running, 1 starting
0 of 1 instances running, 1 starting
0 of 1 instances running, 1 starting
0 of 1 instances running, 1 crashed
FAILED
Error restarting application: Start unsuccessful
TIP: use 'cf logs watson-voice-bot-20181121030328640 --recent' for more information
Finished: FAILED
This is what happens when I try to deploy it. What should I do?
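For anyone hitting the same failure, the crash reason has to come out of the application logs, as the TIP above already suggests. A rough sketch of the usual next steps; the 180-second start timeout is only an example value, not something from the tutorial:
# Show the most recent staging and runtime logs for the crashed app
cf logs watson-voice-bot-20181121030328640 --recent

# If the app just needs longer to start, retry the push with a larger
# startup timeout (in seconds; 180 is only an example)
cf push watson-voice-bot-20181121030328640 -t 180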

How to bring up one master in one AZ in kops?

I deployed a cluster across 3 AZs in AWS and I want to run one master in each AZ. Everything else works, except that I cannot start the master in one of the AZs.
Here is my validation:
INSTANCE GROUPS
NAME ROLE MACHINETYPE MIN MAX SUBNETS
bastions Bastion t2.micro 1 1 utility-us-east-1a,utility-us-east-1c,utility-us-east-1d
master-us-east-1a Master m3.medium 1 1 us-east-1a
master-us-east-1c Master m3.medium 2 2 us-east-1c
master-us-east-1d Master m3.medium 1 1 us-east-1d
nodes Node m4.xlarge 3 3 us-east-1a,us-east-1c,us-east-1d
workers Node m4.2xlarge 2 2 us-east-1a,us-east-1c,us-east-1d
NODE STATUS
NAME ROLE READY
ip-10-0-100-34.ec2.internal node True
ip-10-0-107-127.ec2.internal master True
ip-10-0-120-160.ec2.internal node True
ip-10-0-35-184.ec2.internal node True
ip-10-0-39-224.ec2.internal master True
ip-10-0-59-109.ec2.internal node True
ip-10-0-87-169.ec2.internal node True
VALIDATION ERRORS
KIND NAME MESSAGE
InstanceGroup master-us-east-1c InstanceGroup "master-us-east-1c" did not have enough nodes 0 vs 2
Validation Failed
And if I run a rolling update, it shows that one master has not started:
NAME STATUS NEEDUPDATE READY MIN MAX NODES
bastions Ready 0 1 1 1 0
master-us-east-1a Ready 0 1 1 1 1
master-us-east-1c Ready 0 0 1 1 0
master-us-east-1d Ready 0 1 1 1 1
nodes Ready 0 3 3 3 3
workers Ready 0 2 2 2 2
What should I do to bring that machine up?
I solved this problem: the m3.medium instance type (the kops default for masters) is no longer available in that AZ. Changing it to m4.large makes it work.
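For anyone with the same issue, the change can be applied roughly like this; it assumes the cluster name is exported as $CLUSTER_NAME and that KOPS_STATE_STORE is already set:
# Edit the instance group that cannot launch and set machineType: m4.large
kops edit ig master-us-east-1c --name $CLUSTER_NAME

# Push the change to AWS and roll the affected instances
kops update cluster --name $CLUSTER_NAME --yes
kops rolling-update cluster --name $CLUSTER_NAME --yes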

What is the use of the health-check-type attribute?

I have deployed one app to both Bluemix and Pivotal. Below is the manifest file:
---
applications:
- name: test
  memory: 128M
  instances: 1
  no-route: true
  health-check-type: none  # Why do we have to use this?
In Bluemix, the app starts even without the health-check-type attribute. But in Pivotal I continuously get the message below, and the app ends up crashing:
0 of 1 instances starting
0 of 1 instances starting
0 of 1 instances starting
0 of 1 instances starting
0 of 1 instances starting
0 of 1 instances starting
0 of 1 instances starting
FAILED
After adding health-check-type: none to manifest.yml (on Pivotal), the app starts without any issues.
So can someone tell me whether it is mandatory to use the health-check-type attribute?
IBM Bluemix is on the older "DEA" architecture, while Pivotal is on the current "Diego" architecture. You can see how the two differ when it comes to the no-route option here.
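If you would rather keep some health check instead of disabling it entirely, the check type can also be set per app from the CLI. A sketch below, where process is one of the supported types alongside port and http, and test is the app name from the manifest above:
# Use a process-based health check instead of the default port check,
# so an app with no route is considered healthy as long as its process runs
cf set-health-check test process
cf restart test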

What does the Mellanox interrupt mlx4-async#pci:0000 ... mean?

I'm using an InfiniBand Mellanox card [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] with OFED version 4-1.0.0 on Ubuntu (kernel 3.13.0), running on an x86_64 computer with 4 cores.
Here is the result of ibstat on my computer:
CA 'mlx4_0'
        CA type: MT26428
        Number of ports: 1
        Firmware version: 2.8.600
        Hardware version: b0
        Node GUID: 0x0002c903004d58ee
        System image GUID: 0x0002c903004d58f1
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0x02510868
                Port GUID: 0x0002c903004d58ef
                Link layer: InfiniBand
and my /proc/interrupts looks like this:
67: 17923 4654 0 0 PCI-MSI-edge mlx4-async#pci:0000:01:00.0
68: 26696 0 54 0 PCI-MSI-edge mlx4_0-0
69: 0 34 23 0 PCI-MSI-edge mlx4_0-1
70: 0 0 0 0 PCI-MSI-edge mlx4_0-2
71: 0 0 0 0 PCI-MSI-edge mlx4_0-3
I read that each of the mlx4_0-x interrupts is associated with one CPU. My question is: what does the first interrupt, mlx4-async#pci:0000:01:00.0, mean? I have observed that when the opensm daemon is not yet running, this interrupt fires every 5 minutes.
mlx4-async is used for asynchronous events other than completion events, e.g. link events, catastrophic errors, CQ overruns, etc.
The interrupt is handled by the adapter driver and, depending on the event, different modules are activated, such as link event notifications or cleanups due to asynchronous errors.
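If you want to see which of these counters is actually moving, for example while opensm is stopped, something along these lines works; the one-second refresh interval is arbitrary:
# Watch the per-CPU counters of the mlx4 interrupt lines update live
watch -n 1 "grep mlx4 /proc/interrupts"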