Retry instruction in salt after fail - deployment

I'm using salt for my deployment issues and have the following question.
Is there any mechanism to retry a command?
For instance, I have something like this:
platform_deps_git:
  git.latest:
    - name: ...
    - rev: master
    - target: ...
    - user: ...
    - identity: ...
But sometimes the network may fail. Is there any way to retry the platform_deps_git state?

The next version of Salt (2014.7.0) will have an "onfail" requisite. This will allow you to take another action if something fails.
The docs are here:
http://docs.saltstack.com/en/latest/ref/states/requisites.html#onfail
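For what it's worth, here is a minimal sketch of how onfail could hang off the state from the question. The notify_clone_failure ID, the log path and the echo command are made up for illustration; only the onfail requisite itself comes from the docs linked above. Note that this runs a different action when the checkout fails; it does not retry the checkout itself.

```yaml
# Assumes the platform_deps_git state from the question is part of the same run.
notify_clone_failure:          # hypothetical state ID
  cmd.run:
    # hypothetical command and log path, shown only to have something to trigger
    - name: echo "platform_deps_git failed" >> /tmp/deploy_failures.log
    - onfail:
      - git: platform_deps_git
```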

What I do is grep through salt output whenever I run a highstate and if it sees any failures I rerun the highstate.

There's a first-class retry mechanism for states that was added in 2017:
platform_deps_git:
  git.latest:
    - name: ...
    - rev: master
    - target: ...
    - user: ...
    - identity: ...
    - retry:
        attempts: 5
        until: True
        interval: 60
        splay: 10
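The retry option takes four parameters: attempts (how many times to try, default 2), until (the result to wait for, default True), interval (seconds between tries, default 30) and splay (upper bound in seconds of a random extra delay, default 0).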


How do I modify existing jobs to switch owner?

I installed Rundeck v3.3.5 (on CentOS 7 via RPM) to replace an old Rundeck instance that was decommissioned. I did the export/import of projects (which worked brilliantly) while connected to the new server as the default admin user. The imported jobs run properly on the correct schedule. I subsequently configured the new server to use LDAP authentication and configured ACLs for users/roles. That also works properly.
However, I see an error like this in the service.log:
ERROR services.NotificationService - Error sending notification email to foo@bar.com for Execution 9358 Error executing tag <g:render>: could not initialize proxy [rundeck.Workflow#9468] - no Session
My thought is to switch job owners from admin to a user that exists in LDAP. I mean, I would like to switch job owners regardless, but I'm also hoping it addresses the error.
Is there a way in the web interface or using rd that I can bulk-modify jobs to switch the owner?
It turns out that the error in the log was caused by notification settings in an included job. I didn't realize that notifications were configured on the parameterized shared job definition, but they were. Removing the notification settings stopped the error from being added to /var/log/rundeck/service.log.
To illustrate the problem, here are chunks of YAML I've edited to show just the important parts. Here's the common job:
- description: Do the actual work with arguments passed
  group: jobs/common
  id: a618ceb6-f966-49cf-96c5-03a0c2efb9d8
  name: do_the_work
  notification:
    onstart:
      email:
        attachType: file
        recipients: ops@company.com
        subject: Actual work being started
  notifyAvgDurationThreshold: null
  options:
  - enforced: true
    name: do_the_job
    required: true
    values:
    - yes
    - no
    valuesListDelimiter: ','
  - enforced: true
    name: fail_a_lot
    required: true
    values:
    - yes
    - no
    valuesListDelimiter: ','
  scheduleEnabled: false
  sequence:
    commands:
    - description: The actual work
      script: |-
        #!/bin/bash
        echo ${RD_OPTION_DO_THE_JOB} ${RD_OPTION_FAIL_A_LOT}
    keepgoing: false
    strategy: node-first
  timeout: '60'
  uuid: a618ceb6-f966-49cf-96c5-03a0c2efb9d8
And here's the job that calls it (the one that is scheduled and causes an error to show up in the log when it runs):
- description: Do the job
  group: jobs/individual
  name: do_the_job
  ...
  notification:
    onfailure:
      email:
        recipients: ops@company.com
        subject: '[Rundeck] Failure of ${job.name}'
  notifyAvgDurationThreshold: null
  ...
  sequence:
    commands:
    - description: Call the job that does the work
      jobref:
        args: -do_the_job yes -fail_a_lot no
        group: jobs/common
        name: do_the_work
If I remove the notification settings from the common job, the error in the log goes away. I'm not sure whether sending notifications from an included job is supported at all. It would be useful to me if it were, so I could keep notification settings in a single location. However, I can understand why it presents a problem for the scheduler/executor.

run ansible task only if tag is NOT specified

Say I want to run a task only when a specific tag is NOT in the list of tags supplied on the command line, even if other tags are specified. Of these, only the last one will work as I expect in all situations:
- hosts: all
  tasks:
    - debug:
        msg: "not TAG (won't work if other tags specified)"
      tags: not TAG
    - debug:
        msg: "always, but not if TAG specified (doesn't work; always runs)"
      tags: always,not TAG
    - debug:
        msg: 'ALWAYS, but not if TAG in ansible_run_tags'
      when: "'TAG' not in ansible_run_tags"
      tags: always
Try it with different CLI options and you'll hopefully see why I find this a bit perplexing:
ansible-playbook tags-test.yml -l HOST
ansible-playbook tags-test.yml -l HOST -t TAG
ansible-playbook tags-test.yml -l HOST -t OTHERTAG
Questions: (a) is that expected behavior? and (b) is there a better way or some logic I'm missing?
I'm surprised I had to dig into the (undocumented, AFAICT) variable ansible_run_tags.
Amendment: It was suggested that I post my actual use case. I'm using ansible to drive system updates on Debian family systems. I'm trying to notify at the end if a reboot is required unless the tag reboot was supplied, in which case cause a reboot (and wait for system to come back up). Here is the relevant snippet:
- name: check and perhaps reboot
  block:
    - name: Check if a reboot is required
      stat:
        path: /var/run/reboot-required
        get_md5: no
      register: reboot
      tags: always,reboot
    - name: Alert if a reboot is required
      fail:
        msg: "NOTE: a reboot is required to finish updates."
      when:
        - ('reboot' not in ansible_run_tags)
        - reboot.stat.exists
      tags: always
    - name: Reboot the server
      reboot:
        msg: rebooting after Ansible applied system updates
      when: reboot.stat.exists or ('force-reboot' in ansible_run_tags)
      tags: never,reboot,force-reboot
I think my original question(s) still have merit, but I'm also willing to accept alternative methods of accomplishing this same functionality.
For completeness, and since only @paul-sweeney has offered any alternative solution, I'll answer my own question with my current best solution and let people pick / up-vote their favorite:
---
- name: run only if 'TAG' not specified
  debug:
    msg: 'ALWAYS, but not if TAG in ansible_run_tags'
  when: "'TAG' not in ansible_run_tags"
  tags: always
I know it's an old(ish) question, but I had a similar requirement.
It's probably something best implemented another way ... but ... sometimes it can be useful.
I'd achieve it by setting a fact if the tag IS specified, then outputting the message only if the fact is not set, something like:
---
- name: "test task runs only if tag missing"
  hosts: all
  tasks:
    - name: "suppress message if tag given"
      set_fact: suppress_message=yes
      tags: reboot,never
    - name: "message"
      debug:
        msg: "You didn't say 'reboot'"
      when: suppress_message is not defined
I think we have three dimensions here: states for controlling a service (for example started, restarted, stopped), states for installing it (present, absent) and components (webserver, db, ...).
Ansible lacks a clean separation of those three dimensions, and mixing them in a single tag system leads to confusion.
For example, if you have a 'webserver' tag and a 'DB' tag, you may want to restart the DB but not the webserver using a 'restart' tag.
But that won't work if the restart tasks for the DB and the webserver are in the same tasks file with the same 'restart' tag, because the 'restart' tag will restart both the DB and the webserver.
So you will probably have to split the webserver and DB tasks into two separate files and apply the tag at the level of the include.
Using tags means that you have a tree of options, not a matrix of options.
I like the tag concept, but the fact that tags cannot be used in conditional expressions makes it less appealing.
What I recommend is to declare tags in a role but map them to variables in a first task. The 'restart' and 'db' tags would then become boolean variables in my role, and tasks would use when: instead of tags:.
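A rough sketch of that mapping, leaning on the ansible_run_tags variable used earlier in this thread. The variable names do_restart and target_db and the postgresql service name are placeholders I made up, so adjust them to your role:

```yaml
# First task of the role: turn the tags given with -t into booleans.
- name: Map command-line tags to variables
  ansible.builtin.set_fact:
    do_restart: "{{ 'restart' in ansible_run_tags }}"   # hypothetical variable name
    target_db: "{{ 'db' in ansible_run_tags }}"         # hypothetical variable name
  tags: always

# Later tasks are selected with when: rather than with their own tags.
- name: Restart the database only when both tags were given
  ansible.builtin.service:
    name: postgresql        # hypothetical service name
    state: restarted
  when: do_restart | bool and target_db | bool
  tags: always
```

With this layout, -t restart,db would restart only the database, and a webserver task guarded by its own boolean would be skipped.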
ansible-playbook has a --skip-tags option. The example from the docs is:
ansible-playbook example.yml --skip-tags "packages"
https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_tags.html

How to collect more than 22 event ids with winlogbeat?

I have been tasked with collecting more than 500 event IDs from a DC with winlogbeat, but Windows limits a query to 22 event IDs. I'm using version 6.1.2. I've tried using processors like this:
winlogbeat.event_logs:
  - name: Security
    processors:
      - drop_event.when.not.or:
        - equals.event_id: 4618
        ...
but with these settings the client doesn't work and nothing appears in the logs. If I run it from the exe file, it just starts and stops with no error.
If I try to do it as described in the official manual:
winlogbeat.event_logs:
  - name: Security
    event_id: ...
    processors:
      - drop_event.when.not.or:
        - equals.event_id: 4618
        ...
the client just crashes with "invalid event log key processors found". I've also tried creating a new custom view and taking events from there, but apparently it has the same 22-event query limit.

Saltstack - Schedule to ensure service is running not working

I'm trying to set up a Saltstack schedule that will check to ensure that a service is running on the minion. However, it doesn't seem like service.running is working as a function on the schedule.
Here's my run.sls file:
test-service-sched:
  schedule.present:
    - name: test-service-sched
    - function: service.running
    - seconds: 60
    - job_kwargs:
        name: test-service
    - persist: True
    - enabled: True
    - run_on_start: True
And I execute the following: salt 'service*' state.apply run
This ends up with the following error on the minion:
2017-03-28 02:47:11,493 [salt.utils.schedule ][ERROR ][6172] Unhandled exception running service.running
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/salt/utils/schedule.py", line 826, in handle_func
    message=self.functions.missing_fun_string(func))
  File "/usr/lib/python2.6/site-packages/salt/utils/error.py", line 36, in raise_error
    raise ex(message)
Exception: 'service.running' is not available.
I haven't seen anything in the documentation that says I can't run service.running from a schedule. Is it a known limitation of Salt? Or am I just doing it wrong?
I can use cmd.run, but it ends up spamming the logs with errors if the service is already running.
So, I was pointed in the right direction on the Salt Google Group. There's a difference between execution modules and state modules. Since service.running is a state module function, and the scheduler only runs execution module functions, I had to reference it indirectly through state.apply. I used 2 files:
schedule.sls:
service_schedule:
  schedule.present:
    - function: state.apply
    - minutes: 1
    - job_args:
      - running
running.sls:
service_running:
  service.running:
    - name: test_service
Now, running salt 'service*' state.apply schedule did exactly what I wanted it to.

Ansible - repeat a task while wait_for

Still learning #Ansible. Trying to automate a MongoDB restore.
I have three servers which run MongoDB. After the restore, the status of the MongoDB servers can be outputted with a shell command (see below).
What I want Ansible to do is to perform a task when the string 'lastHeartbeatMessage' is present after 10 min in the output.
- name: Register MongoDB sync status
  shell: mongo --eval "printjson(rs.status())"
  register: mongoReplInfo
- debug: var=mongoReplInfo
- name: Copy rs.status to local log
  local_action: copy content={{ mongoReplInfo }} dest=/tmp/mongoStatus
- name: Copy rs.status to server
  copy: src=/tmp/mongoStatus dest=/tmp/mongoStatus
- name: Check if slave is still syncing
  wait_for: path=/tmp/mongoStatus search_regex=lastHeartbeatMessage
- name: Successful sync
  shell: 'run_succesfull_command'
  when: lastHeartbeatMessage is absent after 10 min
- name: Failed sync
  shell: 'run_succesfull_command'
  when: lastHeartbeatMessage is present after 10 min
Right now I'm using wait_for, but the status is written to the file only once and is never updated. Which module should I use to repeat the tasks that write the rs.status output to the server?
Or am I approaching this playbook the wrong way entirely?
That's a use case for a do-until loop rather than wait_for.
The following will register mongoReplInfo twice: immediately and after 600 seconds. Then you can check the value for your condition.
- name: Register MongoDB sync status
  shell: mongo --eval "printjson(rs.status())"
  register: mongoReplInfo
  until: false
  retries: 2
  delay: 600
But you should rather increase the number of retries and check for the condition in the until parameter, so that the loop exits as soon as the condition is met. Just like in the linked doc chapter.
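Something along these lines might do it. This is only a sketch: it assumes the sync counts as finished once lastHeartbeatMessage no longer appears in the command output, and run_successful_command / run_failed_command are placeholders for whatever you actually want to run. With 20 retries 30 seconds apart, it gives up after roughly ten minutes.

```yaml
- name: Poll rs.status() until lastHeartbeatMessage disappears
  ansible.builtin.shell: mongo --eval "printjson(rs.status())"
  register: mongoReplInfo
  until: "'lastHeartbeatMessage' not in mongoReplInfo.stdout"
  retries: 20            # assumption: about 10 minutes in total
  delay: 30
  changed_when: false    # polling does not change anything on the host
  ignore_errors: true    # keep going so the failure branch below can run

- name: Successful sync
  ansible.builtin.shell: run_successful_command   # placeholder command
  when: mongoReplInfo is succeeded

- name: Failed sync
  ansible.builtin.shell: run_failed_command       # placeholder command
  when: mongoReplInfo is failed
```

Because the command is re-run on every retry, there is no need to copy the status into a file and watch it with wait_for.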