celery celery_session_worker pytest timeout

I'm trying to write integration tests for my Celery tasks and have a test class like the one below:
@pytest.mark.usefixtures('celery_session_app')
@pytest.mark.usefixtures('celery_session_worker')
@pytest.mark.usefixtures('mongodb')
class TestIntegration:
    def test_delete_collection_from_mongodb(self, x, y):
        results = delete_collection_from_mongodb(x, y).delay()
        assert results.get(timeout=20) == 20
And in my conftest.py I have the following fixtures:
@pytest.fixture(scope='session')
def celery_config():
    return {
        'broker_url': RABBITMQ_BROKER_URL,
        'shutdown_timeout': 30,
    }

@pytest.fixture(scope='session')
def celery_worker_parameters():
    return {
        'queues': (....),
    }

@pytest.fixture(scope='session')
def celery_enable_logging():
    return True
However, when I run the test it times out. Stacktrace:
task_id = '6009db28-637b-4447-a2c5-c0bdb3c03981', timeout = 10.0, interval = 0.5
no_ack = True, on_interval = <promise#0x7fcfaac01d30>
    def wait_for(self, task_id,
                 timeout=None, interval=0.5, no_ack=True, on_interval=None):
        """Wait for task and return its result.
        If the task raises an exception, this exception
        will be re-raised by :func:`wait_for`.
        Raises:
            celery.exceptions.TimeoutError:
                If `timeout` is not :const:`None`, and the operation
                takes longer than `timeout` seconds.
        """
        self._ensure_not_eager()
        time_elapsed = 0.0
        while 1:
            meta = self.get_task_meta(task_id)
            if meta['status'] in states.READY_STATES:
                return meta
            if on_interval:
                on_interval()
            # avoid hammering the CPU checking status.
            time.sleep(interval)
            time_elapsed += interval
            if timeout and time_elapsed >= timeout:
>               raise TimeoutError('The operation timed out.')
E               celery.exceptions.TimeoutError: The operation timed out.
/venv/lib/python3.9/site-packages/celery/backends/base.py:792: TimeoutError
I've also tried to set the result backend to RPC, Redis, or cache+memory, and it still times out. Any idea what I'm missing?

Depending on the Celery version you are using, it could be this issue, which has been fixed since 5.0.4. It was root-caused to the cache backend, though I see you've tried other backends with the same result.
Is it possible your backend and/or broker aren't up when you run the tests? I also saw this timeout error when testing in a Docker setup with the broker, backend, and test code all running in separate containers. Blowing away all the containers and rebuilding fixed it for me, so it was probably some stale state.
The GitHub issue offers a minimal fixture and test case that might be worth trying, to see if you can get that to work.
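A hedged sketch of that kind of minimal sanity check (not the exact snippet from the issue; the in-memory broker/backend and the ping task are illustrative):
import pytest

@pytest.fixture(scope='session')
def celery_config():
    return {
        'broker_url': 'memory://',            # in-memory broker, just for the sanity check
        'result_backend': 'cache+memory://',  # swap in your real backend once this passes
    }

def test_ping(celery_session_app, celery_session_worker):
    @celery_session_app.task
    def ping():
        return 'pong'

    # Depending on the Celery version, the embedded worker may need
    # celery_session_worker.reload() here to pick up a task registered after it started.
    assert ping.delay().get(timeout=10) == 'pong'
If even this times out, the problem is likely the broker/backend setup rather than your own task.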


http4s shutdown takes 30 seconds?

I'm learning http4s and trying out the basic example from the documentation, and I've noticed something weird. Simply starting and stopping the server works fine, but if any requests are sent, a graceful shutdown takes about 30 seconds (during which new incoming requests are still processed and responded to).
This is the code:
import cats.effect.{Async, IO, IOApp}
import com.comcast.ip4s._
import fs2.io.net.Network
import org.http4s.HttpRoutes
import org.http4s.dsl.io._
import org.http4s.ember.server.EmberServerBuilder
import org.http4s.implicits._

object Main extends IOApp.Simple {
  val helloWorldService = HttpRoutes.of[IO] {
    case GET -> Root / "hello" / name =>
      Ok(s"Hello, $name.")
  }.orNotFound

  def server[F[_] : Async : Network]: EmberServerBuilder[F] = {
    EmberServerBuilder
      .default[F]
      .withHost(ipv4"0.0.0.0")
      .withPort(port"8000")
  }

  def run: IO[Unit] = {
    server[IO]
      .withHttpApp(helloWorldService)
      .build
      .use(_ => IO.never)
  }
}
This happens on both the stable (0.23.16) and dev (1.0.0-M37) versions.
Turns out the cause was the browser/Postman keeping the connection alive. Simply closing Postman after the request closed the connection and the shutdown was immediate.
And EmberServerBuilder has a .withShutdownTimeout setting to control how long the shutdown waits for connections to be closed.
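A hedged sketch of that setting applied to the builder from the question (the 1-second value is arbitrary, purely illustrative):
import scala.concurrent.duration._

def server[F[_] : Async : Network]: EmberServerBuilder[F] =
  EmberServerBuilder
    .default[F]
    .withHost(ipv4"0.0.0.0")
    .withPort(port"8000")
    .withShutdownTimeout(1.second)  // don't wait the full graceful-shutdown window for lingering keep-alive connections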

Is it possible to prevent execution of further tasks in locust TaskSequence if some task has failed?

For example, I have the following class. How can I prevent execution of the get_entity task if the create_entity task failed?
class MyTaskSequence(TaskSequence):
    @seq_task(1)
    def create_entity(self):
        self.round += 1
        with self.client.post('/entities', json={}, catch_response=True) as resp:
            if resp.status_code != HTTPStatus.CREATED:
                resp.failure()
                # how to stop other tasks for that run?
            self.entity_id = resp.json()['data']['entity_id']

    @seq_task(2)
    def get_entity(self):
        # This is always executed,
        # but it should not run if the create_entity task failed
        resp = self.client.get(f'/entities/{self.entity_id}')
        ...
I found the TaskSet.interrupt method in the documentation, but it does not allow cancelling the root TaskSet. I tried making a parent TaskSet for my task sequence so that TaskSet.interrupt works:
class MyTaskSet(TaskSet):
    tasks = {MyTaskSequence: 10}
But now I see that all results in the UI are cleared after I call interrupt! I just need to skip the dependent tasks in this sequence; I still need the results.
The easiest way to solve this is just to use a single @task with multiple requests inside it. Then, if a request fails, just do a return after resp.failure().
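A hedged sketch of that approach, reusing the endpoints from your example (the failure message is illustrative):
from http import HTTPStatus
from locust import TaskSet, task

class MyTasks(TaskSet):
    @task
    def create_then_get_entity(self):
        with self.client.post('/entities', json={}, catch_response=True) as resp:
            if resp.status_code != HTTPStatus.CREATED:
                resp.failure("entity was not created")
                return  # skip the dependent request for this run
            entity_id = resp.json()['data']['entity_id']
        self.client.get(f'/entities/{entity_id}')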
Might self.interrupt() be what you are looking for?
See https://docs.locust.io/en/latest/writing-a-locustfile.html#interrupting-a-taskset for reference.
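A hedged sketch of how that could look inside your sequence once it is nested under the parent TaskSet from your MyTaskSet example (the failure message is illustrative):
class MyTaskSequence(TaskSequence):
    @seq_task(1)
    def create_entity(self):
        with self.client.post('/entities', json={}, catch_response=True) as resp:
            if resp.status_code != HTTPStatus.CREATED:
                resp.failure("entity was not created")
                self.interrupt(reschedule=False)  # hand control back to the parent TaskSet
            self.entity_id = resp.json()['data']['entity_id']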
Why not use on_start(self), which runs once whenever a locust is created? It can set a flag that the tasks check before executing:
class MyTaskSequence(TaskSequence):
    entity_created = False

    def on_start(self):
        self.round += 1
        with self.client.post('/entities', json={}, catch_response=True) as resp:
            if resp.status_code != HTTPStatus.CREATED:
                resp.failure()
            else:
                self.entity_created = True
                self.entity_id = resp.json()['data']['entity_id']

    @seq_task(2)
    def get_entity(self):
        if self.entity_created:
            resp = self.client.get(f'/entities/{self.entity_id}')
            ...

Remove failure message from multiple assertion failure in NUnit

I have configured my tests to retry up to x number of times in the event of a failure, to ensure the failure is legitimate and not a fluke during the run. I do not log the error message on the initial failure.
However, I am noticing that when I run a test, if the first attempt fails and the second attempt passes, checking for assertion failures via TestContext.CurrentContext.Result.Message still shows the first attempt's failure, and my test is reported as failed even though it passed on the second attempt. If both attempts fail, I receive a "Multiple failures or warnings in test." message.
I would like to retain only the final attempt's failure rather than the failures from all iterations. Is there a way to remove the initial failure from TestContext.CurrentContext.Result.Message?
Edit: I am using NUnit v3.10.1, and when I downgraded to v3.4.0 I got the behaviour I desired without any modification to my code.
Use NUnit's [Retry(5)] attribute on your test to retry it if it fails. Workarounds like the one in the link you posted depend on undocumented internal behavior of NUnit that may change between releases.
Update: based on your comment below, if you need to handle unexpected exceptions, wrap the flaky code that might throw in a try/catch block, then do your assertions outside of that block.
[Test]
[Retry(5)]
public void TestFlakyMethod()
{
    int result = 0;
    try
    {
        result = FlakyAdd(2, 2);
    }
    catch (Exception ex)
    {
        Assert.Fail($"Test failed with unexpected exception, {ex.Message}");
    }
    Assert.That(result, Is.EqualTo(4));
}

int FlakyAdd(int x, int y)
{
    var rand = new Random();
    if (rand.NextDouble() > 0.5)
        throw new ArgumentOutOfRangeException();
    return x + y;
}
Adding to the above, you can also use Assert.DoesNotThrow; it is a bit cleaner and easier to write.
[Test]
[Retry(5)]
public void TestFlakyMethod()
{
    int result = 0;
    Assert.DoesNotThrow(() => {
        result = FlakyAdd(2, 2);
    });
    Assert.That(result, Is.EqualTo(4));
}

Strange timeout with ScalaTest's Selenium DSL

I'm writing Selenium tests with ScalaTest's Selenium DSL and I'm running into timeouts I can't explain. To make matters more complicated, they only seem to happen some of the time.
The problem occurs whenever I access an element after a page load or some JavaScript rendering. It looks like this:
click on "editEmployee"
eventually {
textField(name("firstName")).value = "Steve"
}
My PatienceConfig is configured like this:
override implicit val patienceConfig: PatienceConfig =
PatienceConfig(timeout = Span(5, Seconds), interval = Span(50, Millis))
The test fails with the following error:
- should not display the old data after an employee was edited *** FAILED ***
The code passed to eventually never returned normally. Attempted 1 times over 10.023253653000001 seconds.
Last failure message: WebElement 'firstName' not found.. (EditOwnerTest.scala:24)
It makes sense that it doesn't succeed immediately, because the click causes some rendering, and the textfield may not be available right away. However, it shouldn't take 10 seconds to make an attempt to find it, right?
Also, I find it very interesting that the eventually block tried it only once, and that it took almost precisely 10 seconds. This smells like a timeout occurred somewhere, and it's not my PatienceConfig, because that was set to time out after 5 seconds.
With this workaround, it does work:
click on "editEmployee"
eventually {
find(name("firstName")).value // from ScalaTest's `OptionValues`
}
textField(name("firstName")).value = "Steve"
I did some digging in the ScalaTest source, and I've noticed that all calls that have this problem (it's not just textField) eventually call webElement at some point. The reason the workaround works is that it doesn't call webElement. webElement is defined like this:
def webElement(implicit driver: WebDriver, pos: source.Position = implicitly[source.Position]): WebElement = {
  try {
    driver.findElement(by)
  }
  catch {
    case e: org.openqa.selenium.NoSuchElementException =>
      // the following is avoid the suite instance to be bound/dragged into the messageFun, which can cause serialization problem.
      val queryStringValue = queryString
      throw new TestFailedException(
        (_: StackDepthException) => Some("WebElement '" + queryStringValue + "' not found."),
        Some(e),
        pos
      )
  }
}
I've copied that code into my project and played around with it, and it looks like constructing and/or throwing the exception is where most of the 10 seconds are spent.
(EDIT Clarification: I've actually seen the code spend its 10 seconds inside the catch block. The implicit wait is set to 0, and besides, if I remove the catch block everything simply works as expected.)
So my question is, what can I do to avoid this strange behaviour? I don't want to have to insert superfluous calls to find all the time, because it's easily forgotten, especially since, as I said, the error occurs only some of the time. (I haven't been able to determine when the behaviour occurs and when it doesn't.)
It is clear that textField(name("firstName")).value = "Steve" ends up calling webElement, as you have found out.
Since the issue in the OP happens wherever web elements are involved (which in turn implies that the WebDriver is involved), I think it is safe to assume that the issue is related to the implicit wait on the WebDriver.
implicitlyWait(Span(0, Seconds))
The above should ideally fix the issue. That said, setting the implicit wait to 0 is a bad practice. Any web page might have some loading delays. The page load itself is handled by Selenium outside its wait conditions, but slow element loading (for example, due to AJAX calls) could result in failures. I usually keep 10 seconds as my standard implicit wait; for scenarios which require more waiting, explicit waits can be used.
def implicitlyWait(timeout: Span)(implicit driver: WebDriver): Unit = {
  driver.manage.timeouts.implicitlyWait(timeout.totalNanos, TimeUnit.NANOSECONDS)
}
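For example, a hedged usage sketch, setting it once for the whole suite (BeforeAndAfterAll is just one convenient place to put it; the 10-second value follows the recommendation above):
override def beforeAll(): Unit = {
  super.beforeAll()
  implicitlyWait(Span(10, Seconds))  // applies to the suite's implicit WebDriver
}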
Execution Flow:
name("firstName") ends up having value as Query {Val by = By.className("firstName") }.
def name(elementName: String): NameQuery = new NameQuery(elementName)
case class NameQuery(queryString: String) extends Query { val by = By.name(queryString) }
The Query is fed to the textField method, which calls Query.webElement as shown below.
def textField(query: Query)(implicit driver: WebDriver, pos: source.Position): TextField = new TextField(query.webElement)(pos)
sealed trait Query extends Product with Serializable {
  val by: By
  val queryString: String
  def webElement(implicit driver: WebDriver, pos: source.Position = implicitly[source.Position]): WebElement = {
    try {
      driver.findElement(by)
    }
    catch {
      case e: org.openqa.selenium.NoSuchElementException =>
        // the following is avoid the suite instance to be bound/dragged into the messageFun, which can cause serialization problem.
        val queryStringValue = queryString
        throw new TestFailedException(
          (_: StackDepthException) => Some("WebElement '" + queryStringValue + "' not found."),
          Some(e),
          pos
        )
    }
  }
}
I don't know ScalaTest's specifics, but such strange timeouts usually occur when you're mixing up implicit and explicit waits together.
driver.findElement uses the implicit wait internally, and depending on the explicit wait timeout you specify, you may end up with the two being summed together.
Ideally, implicit waits should be set to 0 to avoid such issues.

Apache Spark: how to cancel job in code and kill running tasks?

I am running a Spark application (version 1.6.0) on a Hadoop cluster with Yarn (version 2.6.0) in client mode. I have a piece of code that runs a long computation, and I want to kill it if it takes too long (and then run some other function instead).
Here is an example:
import java.util.concurrent.TimeUnit
import scala.concurrent.Await
import scala.concurrent.duration.Duration
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("TIMEOUT_TEST")
val sc = new SparkContext(conf)
val lst = List(1, 2, 3)
// setting up an infinite action
val future = sc.parallelize(lst).map { x => while (true) {}; x }.collectAsync()
try {
  Await.result(future, Duration(30, TimeUnit.SECONDS))
  println("success!")
} catch {
  case _: Throwable =>
    future.cancel()
    println("timeout")
}
// sleep for 1 hour to allow inspecting the application in yarn
Thread.sleep(60 * 60 * 1000)
sc.stop()
The timeout is set for 30 seconds, but of course the computation is infinite, and so Awaiting on the result of the future will throw an Exception, which will be caught and then the future will be canceled and the backup function will execute.
This all works perfectly well, except that the canceled job doesn't terminate completely: when looking at the web UI for the application, the job is marked as failed, but I can see there are still running tasks inside.
The same thing happens when I use SparkContext.cancelAllJobs or SparkContext.cancelJobGroup. The problem is that even though I manage to get on with my program, the running tasks of the canceled job are still hogging valuable resources (which will eventually slow me down to a near stop).
To sum things up: How do I kill a Spark job in a way that will also terminate all running tasks of that job? (as opposed to what happens now, which is stopping the job from running new tasks, but letting the currently running tasks finish)
UPDATE:
After a long time ignoring this problem, we found a messy but efficient little workaround. Instead of trying to kill the appropriate Spark job/stage from within the Spark application, we simply logged the stage IDs of all active stages when the timeout occurred, and issued an HTTP GET request to the URL presented by the Spark Web UI for killing those stages.
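A hedged sketch of that workaround (the kill endpoint path and port are illustrative and depend on the Spark version; the UI must be running with spark.ui.killEnabled=true):
import scala.io.Source

// Issue the same GET that the Web UI's "kill" link would issue, for every active stage.
def killActiveStages(uiBase: String = "http://driver-host:4040"): Unit =
  sc.statusTracker.getActiveStageIds().foreach { stageId =>
    val url = s"$uiBase/stages/stage/kill/?id=$stageId&terminate=true"
    Source.fromURL(url).close()
  }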
I don't know if this answers your question.
My need was to kill jobs hanging for too long (my jobs extract data from Oracle tables, but for some unknown reason, occasionally the connection hangs forever).
After some study, I came to this solution:
import org.apache.spark.JobExecutionStatus
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

val MAX_JOB_SECONDS = 100
val statusTracker = sc.statusTracker
val sparkListener = new SparkListener() {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    val jobId = jobStart.jobId
    val f = Future {
      var c = MAX_JOB_SECONDS
      var mustCancel = false
      var running = true
      while (!mustCancel && running) {
        Thread.sleep(1000)
        c = c - 1
        mustCancel = c <= 0
        val jobInfo = statusTracker.getJobInfo(jobId)
        if (jobInfo.isDefined)
          running = jobInfo.get.status() == JobExecutionStatus.RUNNING
        else
          running = false
      }
      if (mustCancel) {
        sc.cancelJob(jobId)
      }
    }
  }
}
sc.addSparkListener(sparkListener)
try {
  val df = spark.sql("SELECT * FROM VERY_BIG_TABLE") // just an example of a long-running job
  println(df.count)
} catch {
  case exc: org.apache.spark.SparkException =>
    if (exc.getMessage.contains("cancelled"))
      throw new Exception("Job forcibly cancelled")
    else
      throw exc
  case ex: Throwable =>
    println(s"Another exception: $ex")
} finally {
  sc.removeSparkListener(sparkListener)
}
For the sake of future visitors: since 2.0.3, Spark has a built-in task reaper, which addresses this scenario (more or less).
Note that it can eventually kill an executor if the task is not responsive.
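A hedged sketch of enabling it (the configuration keys are from the Spark docs; the values are illustrative):
val conf = new SparkConf()
  .setAppName("TIMEOUT_TEST")
  .set("spark.task.reaper.enabled", "true")     // monitor tasks that were killed/interrupted
  .set("spark.task.reaper.killTimeout", "120s") // as a last resort, kill the executor JVM if a task never stops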
Moreover, some of Spark's built-in data sources have been refactored to be more responsive to cancellation.
For the 1.6.0 version, Zohar's solution is a "messy but efficient" one.
According to setJobGroup:
"If interruptOnCancel is set to true for the job group, then job cancellation will result in Thread.interrupt() being called on the job's executor threads."
So the anonymous function in your map must be interruptible, like this:
val future = sc.parallelize(lst).map { x => while (!Thread.interrupted()) {}; x }.collectAsync()
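A hedged sketch of the full pairing described by the quote ("timeout-group" is an arbitrary, illustrative group id):
sc.setJobGroup("timeout-group", "long computation", interruptOnCancel = true)
val future = sc.parallelize(lst).map { x => while (!Thread.interrupted()) {}; x }.collectAsync()
try {
  Await.result(future, Duration(30, TimeUnit.SECONDS))
  println("success!")
} catch {
  case _: Throwable =>
    // with interruptOnCancel = true, cancellation calls Thread.interrupt() on the executor threads,
    // so the loop above actually stops instead of running on
    sc.cancelJobGroup("timeout-group")
    println("timeout")
}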