Letting Concourse retry a build which failed because of a flaky issue - concourse

According to the Concourse documentation:
If any step in the build plan fails, the build will fail and subsequent steps will not be executed
That makes sense. However, I'm wondering how I could deal with flaky steps.
For instance, if I have a pipeline with:
a get step with trigger: true
and then a task which performs several operations, including an HTTP call to an external service.
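For illustration, such a pipeline might look roughly like this (the resource and file names here are made up):

resources:
- name: my-repo
  type: git
  source: {uri: https://example.com/my-repo.git}

jobs:
- name: build-and-call
  plan:
  - get: my-repo
    trigger: true
  - task: build
    file: my-repo/ci/build.yml  # this task performs several operations, including the HTTP call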
If the HTTP call fails because of a temporary network error, it makes sense that Concourse fails the build. But I would also appreciate a way to tell Concourse that this type of error does not mean the current version is corrupted, and that it should automatically retry the build after some time.
I've looked for it in the Concourse documentation but couldn't find such a feature. Is it possible?

Check out the attempts step modifier; here is the example from the docs:
plan:
- get: foo
- task: unit
  file: foo/unit.yml
  attempts: 10
It will attempt to run the task up to 10 times before declaring it failed.

Using attempts as explained in the other answer can be an option. But, before going that route, I would think more about the possible consequences and alternatives.
Attempts has two potential problems:
it cannot know whether the failure is due to a flake or to a real error. If it is a real error, it will keep retrying the task for, say, 10 attempts, potentially consuming compute resources (depending on how heavy the task is).
it will work as expected only if the task is as focused as possible and idempotent. For example, if the flaky HTTP request you mention comes after other operations that change the external world, then you must ensure (when designing the task) that re-running those operations because of a flaky HTTP request is safe.
If you know that your task is not subject to these kinds of problems, then attempts can make sense.
On the other hand, this discussion makes us realize that maybe we can restructure the pipeline to be more Concourse idiomatic.
Since you mention an HTTP request, another option is to proxy that HTTP request via a Concourse resource (see https://concourse-ci.org/implementing-resource-types.html). Once done, the side-effect is visible in the pipeline (instead of being hidden in the task) and its success could be made optional with try or another hook modifier (see https://concourse-ci.org/try-step.html and https://concourse-ci.org/modifier-and-hook-steps.html).
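For illustration, a rough sketch of what that could look like, assuming a hypothetical external-service resource type that performs the HTTP call as a put step:

plan:
- get: foo
  trigger: true
- task: unit
  file: foo/unit.yml
- try:
    put: external-service        # hypothetical resource wrapping the HTTP call
    params:
      payload: foo/request.json  # whatever input the resource needs

With try, a failure of the put step no longer fails the build; alternatively, attempts could be applied to just that step if you still want the build to fail after a few retries.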
The trade-off in this case is the time needed to write your own Concourse resource (in case you don't find a community-provided one). Only you are in a position to make that decision. What I can say is that writing a resource is not that complicated once you get familiar with the concept. For some tricks on quick iteration during development that apply to any Concourse resource, you can have a look at https://github.com/Pix4D/cogito/blob/master/CONTRIBUTING.md#quick-iterations-during-development.

Related

VSTS Test fails but vstest.console passes; the assert executes before the code for some reason?

Well, the system we have has a bunch of dependencies, but I'll try to summarize what's going on without divulging too many details.
Test assembly in the form of a .dll is the one being executed. A lot of these tests call an API.
In the problematic method, there are two API calls with an await on them: one to write a record to that external interface, and another to extract all records and then read the last one, both via the API. The test simply checks whether writing the last record was successful in an end-to-end context, which is why there is both a write and then a read.
If we execute the test in Visual Studio, everything works as expected. I also tested it manually by running vstest.console.exe from the command line, and the expected results always come out as well.
However, when it comes to the VS Test task in VSTS, it fails for some reason. We've been trying to figure it out, and eventually we reached the point where we printed the list from the 'read' part. It turns out the last record we inserted isn't in the data we pulled, but if we check the external interface via a different method, we can confirm that the write actually happened. What gives? Why is VSTest getting what looks like an outdated set of records?
We also noticed two things:
1.) For the tests that passed, none of the Console.WriteLine outputs appear in the logs. They only appear for failed tests.
2.) Even though our Data.Should.Be call is at the very end of the TestMethod, the logs report the failure BEFORE the lines are printed! And even then, the printing should happen after reading the list of records, yet when the prints do happen we're still missing the record we just wrote.
Is there some bottom-to-top thing we're missing here? It really seems as if VSTS's vstest is executing the assert before the actual code. The TestMethods do run in the right order, though (the 4th test written top-to-bottom in the code is executed 4th rather than 4th from last), and we need them to run in the right order because some of the later tests depend on the earlier tests succeeding.
Anything we're missing here? I'd post the source code, but there are a bunch of things I'd need to scrub first.
Turns out we were sorely misunderstanding what 'await' does. We're now using .Wait() on the culprit calls instead, and will also go back through the other tests to check their quality.

Recursive Workflow in Powershell

I'm trying to automate a lengthy process that can be broken down into several steps. (say Steps 1-5)
I have written a script that separates these into functions and call them sequentially.
However, we now have the additional requirement of making the script restartable. That is, if it fails in any one of the steps, rerunning the script would cause it to skip all completed steps and retry from the failed one.
Is this at all possible without referencing an external log file?
I've tried using workflows but it seems like recursion isn't supported.
Any ideas?
Some options, aside from using a log file:
Use the registry
You can set a registry value to a number depending on what step you stopped on. This removes the need for a log file, but it is somewhat similar in terms of relying on 'external' storage (see the sketch below).
Check the task status on each run
Depending on the tasks, you could have the script 'test' whether, for example, step 3 has already been completed, then check steps 4, 5, etc. until it encounters one it still needs to run, and continue from there. This may be impossible, or it may require a lot of overhead code for not much payoff.
Allow the user to continue from within the script.
This is probably the best way of doing it (aside from just using a log file): run the script in blocks, and when an error is encountered, prompt the user to fix the issue before pressing 'Enter' to re-run the previous script block. This also makes it easy to provide information about what failed.
The main thing here is that once a script quits, it needs an external source of information to know what happened in its last run, or it has to handle the problem in some other way.
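As a rough sketch of the registry option (the registry path and the Step1..Step5 function names are made up, and the steps are assumed to throw terminating errors on failure):

$ErrorActionPreference = 'Stop'                      # make failures stop the script
$regPath = 'HKCU:\Software\MyScripts\LongProcess'    # hypothetical key used to track progress
if (-not (Test-Path $regPath)) { New-Item -Path $regPath -Force | Out-Null }

# How many steps completed on the previous run (0 on a fresh run)
$done = (Get-ItemProperty -Path $regPath -Name LastCompletedStep -ErrorAction SilentlyContinue).LastCompletedStep
if (-not $done) { $done = 0 }

# The step functions, in order (Step1..Step5 stand in for your own functions)
$steps = @({ Step1 }, { Step2 }, { Step3 }, { Step4 }, { Step5 })

for ($i = $done; $i -lt $steps.Count; $i++) {
    & $steps[$i]                                     # if this throws, the script stops here
    Set-ItemProperty -Path $regPath -Name LastCompletedStep -Value ($i + 1)
}

# All steps succeeded: clear the marker so the next run starts from step 1
Remove-ItemProperty -Path $regPath -Name LastCompletedStep -ErrorAction SilentlyContinue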

how to debug in simpy

I have a general question about how to debug in SimPy. Normal debugging tools don't seem to work, since everything runs on the event loop, and you can't step through the code line by line and inspect what exists at any point in time.
Primarily, I'm interested in finding what kinds of processes and callbacks are in existence at a particular time, and how to remove them at the appropriate point. Are there any best practices surrounding debugging in discrete event simulation generally?
I would just use a bunch of print()s.
One thing you might find useful is the state that primitives such as resources expose. For example, you can ask a resource how many users it currently has or how big the queue to use the resource is with:
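(a minimal sketch; the environment and resource here are only illustrative)

import simpy

env = simpy.Environment()
res = simpy.Resource(env, capacity=2)

print(res.count)       # number of processes currently using the resource
print(len(res.queue))  # number of queued requests still waiting for the resource
print(res.capacity)    # total capacity of the resource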
All of these attributes can be found in the documentation; here is the resource example: https://simpy.readthedocs.io/en/latest/api_reference/simpy.resources.html

RESTful APIs: what to return when updating an entity produces side-effects

One of our APIs has a tasks resource. Consumers of the API might create, delete and update a given task as they wish.
If a task is completed (i.e., its status is changed via PUT /tasks/<id>), a new task might be created automatically as a result.
We are trying to keep it RESTful. What would be the correct way to tell the calling user that a new task has been created? The following solutions came to my mind, but all of them have weaknesses in my opinion:
Include an additional field on the PUT response which contains information about an eventual new task.
Return only the updated task, and expect the user to call GET /tasks in order to check if any new tasks have been created.
Option 1 breaks the RESTful-ness in my opinion, since the API is expected to return only information regarding the updated entity. Option 2 expects the user to do extra work, but if they don't, then no one will realize that a new task was created.
Thank you.
UPDATE: the PUT call returns an HTTP 200 code along with the full JSON representation of the updated task.
@tophallen suggests having a task tree so that (if I got it right) the returned entity in option 2 contains the new task as a direct child.
You really have two options with a 200-status PUT: you can use headers (and if you do, check out this post). That's certainly not a bad option, but you would want to make sure it was normalized site-wide, well documented, and that you didn't have anything such as firewalls/F5s/etc. rewriting your headers.
Something like this would be a fair option though:
HTTP/1.1 200 OK
Related-Tasks: /tasks/11;/tasks/12
{ ...task response... }
Or you have to give some indication to the client in the response body. You could have a task structure that supports child tasks on it, or you could normalize all responses to include room for "meta" stuff, e.g.:
HTTP/1.1 200 OK
{
"data": { ...the task },
"related_tasks": [],
"aggregate_status": "PartiallyComplete"
}
Something like this used everywhere (a bit of work, as it sounds like you aren't just starting this project) can be very useful, as you can also use it for scenarios like paging.
Personally, I think it might be best if the related_tasks property just contained either routes to call for the child tasks or IDs to call: lighter responses, since the client might not always care to check on said child task immediately anyway.
EDIT:
Actually, the more I think about it, the more headers make sense in your situation. Since a client can update a task at any point during its processing, there may or may not be a child task in play, so modifying the data structure for the off-chance that the client updates a task when a child task has started seems like more work than benefit. A header would allow you to easily add a child task and notify the user at any point; you could apply the same thing to a POST of a task that happens to finish immediately and kicks off a child task, etc. It can easily support more than one task. I think this also keeps it the most RESTful and reduces server calls, and the client would always be able to know what is going on in the process chain. The details of the header are yours to define, but I believe it is more traditional in a scenario like this to have it point to a resource rather than to a key within a resource.
If there are other options though, I'm very interested to hear them.
It looks like you're very concerned about being RESTful, but you're not using HATEOAS, which is contradictory. If you use HATEOAS, the related entity is just another link and the client can follow them as they please. What you have is a non-problem in REST. If this sounds new to you, read this: http://roy.gbiv.com/untangled/2008/rest-apis-must-be-hypertext-driven
Option 1 breaks the RESTful-ness in my opinion, since the API is
expected to return only information regarding the updated entity.
This is not true. The API is expected to return whatever is documented as the information available for that media type. If you documented that a task has a field for related side-effect tasks, there's nothing wrong with it.
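To make the HATEOAS point concrete, the response to the PUT could simply include links to any tasks created as a side effect (a sketch only; the link-relation names and JSON layout below are illustrative, not something the question's API defines):
HTTP/1.1 200 OK
{
  "id": 42,
  "status": "completed",
  "links": [
    { "rel": "self", "href": "/tasks/42" },
    { "rel": "spawned", "href": "/tasks/43" }
  ]
}
A client that understands the media type just follows the spawned link when it cares about the new task.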

pytest: are pytest_sessionstart() and pytest_sessionfinish() valid hooks?

Are pytest_sessionstart(session) and pytest_sessionfinish(session) valid hooks? They are not described in the dev hook docs or the latest hook docs.
What is the difference between them and pytest_configure(config)/pytest_unconfigure(config)?
In the docs it is said:
pytest_configure(config): called after command line options have been parsed and all plugins and initial conftest files have been loaded.
and
pytest_unconfigure(config): called before the test process is exited.
Session is the same, right?
Thanks!
The bad news is that the situation with sessionstart/configure is not very well specified. sessionstart in particular is not much documented, because its semantics differ depending on whether you are in the xdist/distributed case or not. One can distinguish these situations, but it's all a bit too complicated.
The good news is that pytest-2.3 should make things easier. If you define a @pytest.fixture with scope="session", you can implement a fixture that is called once per process within which tests execute.
For distributed testing, this means once per test slave. For single-process testing, it means once for the whole test run. In either case, if you do a "--collectonly" run, or "-h" or other options that do not involve the running of tests, then fixture functions will not execute at all.
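For example, a session-scoped fixture in a conftest.py might look like this (a minimal sketch for current pytest versions; the fixture name and its body are placeholders):

import pytest

@pytest.fixture(scope="session")
def shared_setup():
    # runs once per test process (once per worker when distributing with pytest-xdist)
    state = {"ready": True}   # stand-in for whatever expensive setup you need
    yield state
    # teardown below runs once, after the last test of the session
    state.clear()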
Hope this clarifies.