How to debug when hypothesis produces flaky test error? - pytest

I am using the hypothesis python package for stateful testing.
I am getting the following error when I run my tests:
hypothesis.errors.Flaky: Unreliable assumption: An example which satisfied assumptions on the first run now fails it.
I understand what a flaky error means from a similar post: I have a test which failed the first time but passed the second time. From the log I can tell which test led to this failure. Hypothesis tries the same test sequence 4 times during the overall test run; 2 of those runs pass and 2 fail.
I have tried the failing test individually without hypothesis and it does not fail. I am trying to understand what leads to the flaky error. Is it possibly a bug in Hypothesis as given in the post below:
What does Flaky: Hypothesis test produces unreliable results mean?
How do I get around this? Please find the log file of the test run at the link:
https://github.com/aparnasbose/hypothesis/blob/master/flaky%20test

The problem is almost certainly that your test is not deterministic for all inputs; there are some arguments or sequences of actions that Hypothesis can find which sometimes pass and sometimes fail. Hypothesis considers this a bug in your test, and raises the Flaky error.
To diagnose this in more detail I'd need to see your actual source code.
FYI verbose verbosity is much more useful here than debug (which dumps too much internal state). You may also want to upgrade to Hypothesis >= 4.41.1 for improved statistics.
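One way to check for this kind of nondeterminism outside Hypothesis is to replay the same input several times and compare the outcomes. The sketch below is only an illustration: `outcomes`, `Counter`, `leaky_check` and `pure_check` are hypothetical names, not part of Hypothesis, and the class attribute that survives between runs is just one assumed cause of flakiness.

```python
def outcomes(check, value, runs=20):
    """Run check(value) repeatedly and collect the distinct outcomes."""
    seen = set()
    for _ in range(runs):
        try:
            check(value)
            seen.add("pass")
        except AssertionError:
            seen.add("fail")
    return seen


class Counter:
    """Hypothetical system under test whose state leaks between runs."""
    calls = 0  # class-level, so it is never reset between executions


def leaky_check(value):
    Counter.calls += 1
    # Passes on odd calls and fails on even ones: same input, different result.
    assert Counter.calls % 2 == 1


def pure_check(value):
    # Depends only on its argument, so every replay gives the same result.
    assert value + 0 == value


flaky = outcomes(leaky_check, 3)
stable = outcomes(pure_check, 3)
print(sorted(flaky), sorted(stable))
```

If a real test shows both outcomes for a single input, the usual suspects are state that is not reset between runs (module-level objects, caches, external services), time, or randomness. In Hypothesis itself, verbose output can be requested with `settings(verbosity=Verbosity.verbose)`.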

Related

VSTS Test fails but vstest.console passes; the assert executes before the code for some reason?

Well the system we have has a bunch of dependencies, but I'll try to summarize what's going on without divulging too much details.
Test assembly in the form of a .dll is the one being executed. A lot of these tests call an API.
In the problematic method, there's 2 API calls that have an await on them: one to write a record to that external interface, and another to extract all records and then read the last one in that external interface, both via API. The test is simply to check if writing the last record was successful in an end-to-end context, that's why there's both a write and then a read.
If we execute the test in Visual Studio, everything works as expected. I also tested it manually via command lining vstest.console.exe, and the expected results always come out as well.
However, when it comes to VS Test task in VSTS, it fails for some reason. We've been trying to figure it out, and eventually we reached the point where we printed the list from the 'read' part. It turns out the last record we inserted isn't in the data we pulled, but if we check the external interface via a different method, we confirmed that the write process actually happened. What gives? Why is VSTest getting like an outdated set of records?
We also noticed two things:
1.) For the tests that passed, none of the Console.WriteLine outputs appear in the logs. Only on Failed test do they do so.
2.) Even if our Data.Should.Be call is at the very end of the TestMethod, the logs report the fail BEFORE it prints out the lines! And even then, the printing should happen after reading the list of records, and yet when the prints do happen we're still missing the record we just wrote.
Is there like a bottom-to-top thing we're missing here? It really seems to me like VSTS vstest is executing the assert before the actual code. The TestMethods do run in the right order, though (the 4th test written top-to-bottom in the code is executed 4th rather than 4th to last), and we need them to happen in the right order because some of the later tests depend on the former tests succeeding.
Anything we're missing here? I'd put a source code but there's a bunch of things I need to scrub first if so.
Turns out we were sorely misunderstanding what 'await' does. We're using .Wait() instead for the culprit and will also go back through the other tests to check for quality.
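The post does not show the code, so the following is only an analogy in Python's asyncio rather than the original C#: it assumes the write was started but not actually waited for before the read. `write_record`, `read_records` and the in-memory `store` are hypothetical stand-ins for the external API.

```python
import asyncio

store = []  # stands in for the external interface


async def write_record(record):
    await asyncio.sleep(0.01)  # simulated network latency
    store.append(record)


async def read_records():
    return list(store)


async def racy():
    """Start the write without waiting for it, then read immediately."""
    store.clear()
    task = asyncio.create_task(write_record("r1"))  # scheduled, not awaited
    seen = await read_records()                     # runs before the write lands
    await task                                      # the write does finish later
    return seen, list(store)


async def ordered():
    """Wait for the write to complete before reading."""
    store.clear()
    await write_record("r1")
    return await read_records()


racy_seen, racy_final = asyncio.run(racy())
ordered_seen = asyncio.run(ordered())
print(racy_seen, racy_final, ordered_seen)
```

In the racy version the read misses the record even though the write succeeds shortly afterwards, which matches the symptom of the record being visible when checked by other means.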

Scala Test: Auto-skip tests that exceed timeout

As I have a collection of scala tests that connect with remote services (some of which may not be available at the time of test execution), I would like to have a way of indicating Scala tests that should be ignored, if the time-out exceeds a desired threshold.
Indeed, I could enclose the body of a test in a future and have it auto-pass, if the time-out is exceeded but having slow tests silently pass strikes me as risky. It would be better if it were explicitly skipped during the test run. So, what I would really like is something like the following:
ignorePast(10 seconds) should "execute a service that is sometimes unavailable" in {
invokeServiceThatIsSometimesUnavailable()
....
}
Looking at the ScalaTest documentation, I don't see this feature supported directly but suspect that there might be a way to add this capability? Indeed, I could just add a "tag" to "slow" tests and tell the runner not to execute them, but I would rather the tests be automatically skipped when the timeout is exceeded.
I believe that's not something your test framework should be responsible for.
Wrap your invokeServiceThatIsSometimesUnavailable() in an exception handling block and you'll be fine.
try {
invokeServiceThatIsSometimesUnavailable()
} catch {
case e : YourServiceTimeoutException => reportTheIgnoredTest()
}
I agree with Maciej that exceptions are probably the best way to go, since the timeout happens within your test itself.
There's also assume (see here), which lets you cancel a test if some prerequisite fails. I think you could also use it within a single test.
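The same idea, catching the timeout and reporting the test as skipped rather than passed, can be sketched with Python's unittest. This is an analogy and not ScalaTest code: `ServiceTimeout`, `invoke_service` and `call_or_skip` are hypothetical names, and the `available` flag merely simulates a remote service being up or down.

```python
import unittest


class ServiceTimeout(Exception):
    """Hypothetical error raised when the remote service does not answer."""


def invoke_service(available):
    # Stand-in for a remote call; a real one would time out on its own.
    if not available:
        raise ServiceTimeout("no response within 10 seconds")
    return "ok"


class ServiceTest(unittest.TestCase):
    def call_or_skip(self, available):
        try:
            return invoke_service(available)
        except ServiceTimeout as exc:
            # Recorded as skipped, so it neither passes silently nor fails.
            self.skipTest(f"service unavailable: {exc}")

    def test_service_up(self):
        self.assertEqual(self.call_or_skip(True), "ok")

    def test_service_down(self):
        self.call_or_skip(False)
        self.fail("not reached: the test is skipped above")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ServiceTest)
result = unittest.TestResult()
suite.run(result)
print(result.testsRun, len(result.skipped), result.wasSuccessful())
```

The skipped test shows up explicitly in the result, which addresses the concern about slow tests passing silently.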

may the compiler optimize based on assert(...) expressions/contracts?

http://dlang.org/expression.html#AssertExpression
Regarding assert(0): "The optimization and code generation phases of compilation may assume that it is unreachable code."
The same documentation claims assert(0) is a 'special case', but there are several reasons that follow.
Can the D compiler optimize based on general assert-ions made in contracts and elsewhere?
(as if I needed another reason to enjoy the in{} and out{} constructs, but it certainly would make me feel a little more giddy to know that writing them could make things go fwoosh-ier)
In theory, yes, in practice, I don't think it does, especially since the asserts are killed before even getting to the optimizer on dmd -release. I'm not sure about gdc and ldc, but I think they share this portion of the code.
The spec's special case reference btw is that assert(0) is still present, in some form, with the -release compile flag. It is translated into an illegal instruction there (asm {hlt;} - non-kernel programs on x86 aren't allowed to use that so it will segfault upon hitting it), whereas all other asserts are simply left out of the code entirely in -release mode.
GDC certainly does optimise based on asserts. The if conditions make for much better code, even causing unnecessary code to disappear. However, unfortunately at the moment the way it is implemented is that the entire assert can disappear in release build mode so then the compiler never sees the beneficial if-condition info and actually generates worse code in release than in debug mode! Ironic.
I have to admit that I've only looked at this effect with if conditions in asserts in the body, I haven't checked what effect in and out blocks have. The in- and out- etc contract blocks can be turned off based on a command line switch iirc, so they are not even compiled, I think this possibly means the compiler doesn't even look at them. So this is another thing that might possibly affect code generation, I haven't looked at it.
But there is a feature here that I would very much like to see, that the if condition truth values in the assert conditions (checking that there is no side-effect code in the expression for the assert cond) can always be injected into the compiler as an assumption, just as if there had been an if statement even in release mode. It would involve pretending you had just seen an if ( xxx ) but with the actual code generation for the test suppressed in release mode, and with subsequent code feeling the beneficial effects of say known truth values, value-range limits and so on.
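The "asserts are removed before the optimizer ever sees them" behaviour has a loose parallel in CPython, shown here purely as an analogy and not as a statement about any D compiler: compiling with `optimize=1` (what `python -O` does) drops assert statements entirely. The function name `positive` and the helper `build` are hypothetical.

```python
SOURCE = (
    "def positive(x):\n"
    "    assert x > 0, 'x must be positive'\n"
    "    return x\n"
)


def build(optimize):
    """Compile SOURCE at the given optimization level and return the function."""
    namespace = {}
    exec(compile(SOURCE, "<demo>", "exec", optimize=optimize), namespace)
    return namespace["positive"]


checked = build(0)   # asserts kept, like a debug build
stripped = build(1)  # asserts removed, like a release build

try:
    checked(-1)
    checked_raised = False
except AssertionError:
    checked_raised = True

stripped_result = stripped(-1)  # no check left, the bad value passes through
print(checked_raised, stripped_result)
```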

How should I deal with failing tests for bugs that will not be fixed

I have a complex set of integration tests that uses Perl's WWW::Mechanize to drive a web app and check the results based on specific combinations of data. There are over 20 subroutines that make up the logic of the tests, loop through data, etc. Each test runs several of the test subroutines on a different dataset.
The web app is not perfect, so sometimes bugs cause the tests to fail with very specific combinations of data. But these combinations are rare enough that our team will not bother to fix the bugs for a long time; building many other new features takes priority.
So what should I do with the failing tests? It's just a few tests out of several dozen per combination of data.
1) I can't let it fail because then the whole test suite would fail.
2) If we comment them out, that means we miss out on making that test for all the other datasets.
3) I could add a flag in the specific dataset that fails, and have the test not run if that flag is set, but then I'm passing extra flags all over the place in my test subroutines.
What's the cleanest and easiest way to do this?
Or are clean and easy mutually exclusive?
That's what TODO is for.
With a todo block, the tests inside are expected to fail. Test::More will run the tests normally, but print out special flags indicating they are "todo". Test::Harness will interpret failures as being ok. Should anything succeed, it will report it as an unexpected success. You then know the thing you had todo is done and can remove the TODO flag.
The nice part about todo tests, as opposed to simply commenting out a block of tests, is it's like having a programmatic todo list. You know how much work is left to be done, you're aware of what bugs there are, and you'll know immediately when they're fixed.
Once a todo test starts succeeding, simply move it outside the block. When the block is empty, delete it.
I see two major options:
disable the test (commenting it out), with a reference to your bugtracking system (i.e. a bug ID), possibly keeping a note in the bug as well that there is a test ready for this bug
move the failing tests into a separate test suite. You could even reverse the failing assertion so you can run the suite and while it is green the bug is still there and if it becomes red either the bug is gone or something else is fishy. Of course a link to the bugtracking system and bug is still a good thing to have.
If you actually use Test::More in conjunction with WWW::Mechanize, case closed (see comment from @daxim). If not, think of a similar approach:
# In your testing module
our $TODO;
# ...
if (defined $TODO) {
# only print warnings
};
# in a test script
local $My::Test::TODO = "This bug is delayed until iteration 42";

Why do I need to know how many tests I will be running with Test::More?

Am I a bad person if I use use Test::More qw(no_plan)?
The Test::More POD says
Before anything else, you need a testing plan. This basically declares how many tests your script is going to run to protect against premature failure...
use Test::More tests => 23;
There are rare cases when you will not know beforehand how many tests your script is going to run. In this case, you can declare that you have no plan. (Try to avoid using this as it weakens your test.)
use Test::More qw(no_plan);
But premature failure can be easily seen when there are no results printed at the end of a test run. It just doesn't seem that helpful.
So I have 3 questions:
What is the reasoning behind requiring a test plan by default?
Has anyone found this a useful and time saving feature in the long run?
Do other test suites for other languages support this kind of thing?
What is the reason for requiring a test plan by default?
ysth's answer links to a great discussion of this issue which includes comments by Michael Schwern and Ovid who are the Test::More and Test::Most maintainers respectively. Apparently this comes up every once in a while on the perl-qa list and is a bit of a contentious issue. Here are the highlights:
Reasons to not use a test plan
It's annoying and takes time.
It's not worth the time because test scripts won't die without the test harness noticing except in some rare cases.
Test::More can count tests as they happen
If you use a test plan and need to skip tests, then you have the additional pain of needing a SKIP{} block.
Reasons to use a test plan
It only takes a few seconds to do. If it takes longer, your test logic is too complex.
If there is an exit(0) in the code somewhere, your test will complete successfully without running the remaining test cases. An observant human may notice the screen output doesn't look right, but in an automated test suite it could go unnoticed.
A developer might accidentally write test logic so that some tests never run.
You can't really have a progress bar without knowing ahead of time how many tests will be run. This is difficult to do through introspection alone.
The alternative
Test::Simple, Test::More, and Test::Most have a done_testing() method which should be called at the end of the test script. This is the approach I take currently.
This fixes the problem where code has an exit(0) in it. It doesn't fix the problem of logic which unintentionally skips tests though.
In short, it's safer to use a plan, but the chances of this actually saving the day are low unless your test suites are complicated (and they should not be complicated).
So using done_testing() is a middle ground. It's probably not a huge deal whatever your preference.
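The trade-off between a numeric plan and done_testing() can be sketched with a toy counter. This is Python written for illustration, not Test::More: `Tally` and `run_script` are hypothetical, the early `return` stands in for a stray exit(0), and `logic_bug` stands in for test logic that silently skips a check.

```python
class Tally:
    """Hypothetical minimal test counter, loosely modelled on a TAP plan."""

    def __init__(self, planned=None):
        self.planned = planned   # None means "no numeric plan"
        self.ran = 0
        self.finished = False

    def ok(self, condition):
        self.ran += 1
        return bool(condition)

    def done_testing(self):
        self.finished = True

    def complete(self):
        # With a plan, the counts must match; without one, we can only
        # check that the end of the script was reached.
        if self.planned is not None:
            return self.ran == self.planned
        return self.finished


def run_script(tally, early_exit=False, logic_bug=False):
    tally.ok(1 + 1 == 2)
    if early_exit:
        return tally                  # stands in for a stray exit(0)
    if not logic_bug:
        tally.ok("a".upper() == "A")  # a logic bug silently skips this
    tally.done_testing()
    return tally


plan_early = run_script(Tally(planned=2), early_exit=True).complete()
plan_bug = run_script(Tally(planned=2), logic_bug=True).complete()
done_early = run_script(Tally(), early_exit=True).complete()
done_bug = run_script(Tally(), logic_bug=True).complete()
print(plan_early, plan_bug, done_early, done_bug)
```

Only the last case reports a complete run despite a missing test, which is the gap described above: done_testing() catches the early exit but not logic that skips tests.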
Has this feature been useful to anyone in the real world?
A few people mention that this feature has been useful to them in the real world. This includes Larry Wall. Michael Schwern says the feature originates with Larry, more than 20 years ago.
Do other languages have this feature?
None of the xUnit type testing suites has the test plan feature. I haven't come across any examples of this feature being used in any other programming language.
I'm not sure what you are really asking because the documentation extract seems to answer it. I want to know if all my tests ran. However, I don't find that useful until the test suite stabilizes.
While developing, I use no_plan because I'm constantly adding to the test suite. As things stabilize, I verify the number of tests that should run and update the plan. Some people mention the "test harness" catching that already, but there is no such thing as "the test harness". There's the one that most modules use by default because that's what MakeMaker or Module::Build specify, but the TAP output is independent of any particular TAP consumer.
A couple of people have mentioned situations where the number of tests might vary. I figure out the tests however I need to compute the number then use that in the plan. It also helps to have small test files that target very specific functionality so the number of tests is low.
use vars qw( $tests );
BEGIN {
$tests = ...; # figure it out
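# a nested use would run at compile time, before the assignment above,
# so load the module and import explicitly instead
require Test::More;
Test::More->import( tests => $tests );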
}
You can also separate the count from the loading:
use Test::More;
plan tests => $tests;
The latest TAP lets you put the plan at the end too.
In one comment, you seem to think prematurely exiting will count as a failure, since the plan won't be output at the end, but this isn't the case: the plan will be output unless you terminate with POSIX::_exit or a fatal signal or the like. In particular, die() and exit() will result in the plan being output (though the test harness should detect anything other than an exit(0) as a prematurely terminated test).
You may want to look at Test::Most's deferred plan option, soon to be in Test::More (if it's not already).
There's also been discussion of this on the perl-qa list recently. One thread: http://www.nntp.perl.org/group/perl.qa/2009/03/msg12121.html
Doing any testing is better than doing no testing, but testing is about being deliberate. Stating the number of tests expected gives you the ability to see if there is a bug in the test script that is preventing a test from executing (or executing too many times). If you don't run tests under specific conditions you can use the skip function to declare this:
SKIP: {
skip $why, $how_many if $condition;
...normal testing code goes here...
}
I think it's ok to bend the rules and use no_plan when the human cost of figuring out the plan is too high, but this cost is a good indication that the test suite has not been well designed.
Another case where it's useful to have the test plan explicitly defined is when you are doing this kind of test:
$coderef = sub { my $arg = shift; isa_ok $arg, 'MyClass' };
do(@args, $coderef);
and
## hijack our interface to test it's called.
local *MyClass::do = $coderef;
If you don't specify a plan, it's easy to miss out that your test failed and that some assertions weren't run as you expected.
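The failure mode here, an assertion inside a callback that never gets called, can be sketched in Python. The names `process`, `check` and `checked` are hypothetical, and the empty input list is an assumed stand-in for whatever bug prevents the callback from running.

```python
def process(items, callback):
    """Hypothetical interface under test: calls back once per item."""
    for item in items:
        callback(item)


checked = []


def check(item):
    # The only assertion lives inside the callback.
    assert isinstance(item, int)
    checked.append(item)


# Assumed bug upstream: the list is empty, so check() never runs and no
# assertion can fail. Without a count, this looks like a clean pass.
process([], check)

expected_checks = 2
plan_satisfied = len(checked) == expected_checks
print(len(checked), plan_satisfied)
```

Comparing the number of assertions that ran against the number expected is what exposes the problem, which is the job a numeric plan does.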
Stating the number of tests explicitly in the plan is a good idea, unless that number is too expensive to work out. The question has been properly answered already, but I wanted to stress two points:
Better than no_plan is to use done_testing()
use Test::More;
... run your tests ...;
done_testing( $number_of_tests_run );
# or done_testing() if the number of tests is not known
This Matt Trout blog entry is interesting, and rants about adding a plan vs cvs conflicts and other issues that make the plan problematic: Why numeric test plans are bad, wrong, and don't actually help anyway
I find it annoying, too, and I usually ignore the number at the very beginning until the test suite stabilizes. Then I just keep it up to date manually. I do like the idea of knowing how many total tests there are as the seconds tick by, as a kind of a progress indicator.
To make counting easier I put the following before each test:
#----- load non-existent record -----
....
#----- add a new record -----
....
#----- load the new record (by name) -----
....
#----- verify the name -----
etc.
Then I can quickly scan the file and easily count the tests, just looking for the #----- lines. I suppose I could even write something up in Emacs to do it for me, but it's honestly not that much of a chore.
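Counting those marker lines is easy to automate. The helper below is a small Python sketch under the assumption that every test is preceded by exactly one line starting with `#-----`; `count_marked_tests` and the sample text are hypothetical.

```python
def count_marked_tests(text, marker="#-----"):
    """Count lines that start with the marker, ignoring leading whitespace."""
    return sum(
        1 for line in text.splitlines()
        if line.lstrip().startswith(marker)
    )


sample = """\
#----- load non-existent record -----
ok(...);
#----- add a new record -----
ok(...);
#----- load the new record (by name) -----
ok(...);
#----- verify the name -----
is(...);
"""

planned = count_marked_tests(sample)
print(planned)
```

The resulting number is what would go into the plan line, for example `use Test::More tests => 4;` for this sample.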
It is a pain when doing TDD, because you are writing new tests opportunistically. When I was teaching TDD and the shop used Perl, we decided to run our test suite the no_plan way. I guess we could have switched from no_plan to a fixed plan to lock down the number of tests. At the time I saw it as more hindrance than help.
Eric Johnson's answer is exactly correct. I just wanted to add that done_testing, a much better replacement to no_plan, was released in Test-Simple 0.87_1 recently. It's an experimental release, but you can download it directly from the previous link.
done_testing allows you to declare the number of tests you think you've run at the end of your testing script, rather than trying to guess it before your script starts. You can read the documentation here.