I am using PySpark, EMR, Terraform, and Airflow for triggering. I am writing pytest cases for my PySpark code. I have the following questions:
I) If an assertion evaluates to False, I am thinking of sending a Slack and email notification and shutting down the EMR cluster. Is this the best practice that everyone follows?
II) Can someone explain how to handle pytest assertion failures, or general methods of handling pytest failures?
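One pattern, as a minimal sketch only (the webhook URL, cluster ID, and test path below are placeholders, not from the original setup): run pytest programmatically and, on a non-zero exit code, post to a Slack incoming webhook and terminate the cluster with boto3. Email could be added the same way via SES or SMTP.

import sys
import boto3
import pytest
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
EMR_CLUSTER_ID = "j-XXXXXXXXXXXXX"  # placeholder

def main():
    # pytest.main() returns 0 when all tests pass, non-zero otherwise
    exit_code = pytest.main(["tests/", "-q"])

    if exit_code != 0:
        # Notify the team on Slack
        requests.post(
            SLACK_WEBHOOK_URL,
            json={"text": f"pytest failed (exit code {exit_code}), "
                          f"terminating EMR cluster {EMR_CLUSTER_ID}"},
        )
        # Shut down the EMR cluster
        emr = boto3.client("emr")
        emr.terminate_job_flows(JobFlowIds=[EMR_CLUSTER_ID])

    sys.exit(exit_code)

if __name__ == "__main__":
    main()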
I have the following Celery setup in production:
Using RabbitMQ as the broker
Running multiple Celery instances on ECS Fargate
Logs are sent to CloudWatch using the default awslogs driver
Result backend - MongoDB
The issue I am facing is this: a lot of my tasks are not showing logs in CloudWatch.
I just see this log:
Task task_name[3d56f396-4530-4471-b37c-9ad2985621dd] received
But I do not see the actual logs for the execution of this task, nor do I see the logs for its completion - for example, something like this is nowhere to be found in the logs:
Task task_name[3d56f396-4530-4471-b37c-9ad2985621dd] succeeded
This does not happen all the time; it happens intermittently but consistently, and plenty of tasks do print their logs.
I can see that the result backend has the task results, so I know the task has executed, but the logs for the task are completely missing. It is not specific to any particular task_name.
I have not been able to reproduce the issue on my local setup.
I am not sure whether this is a Celery logging issue or an awslogs issue. What can I do to troubleshoot this?
** UPDATE **
Found the root cause - some code in the codebase was removing handlers from the root logger. Leaving this question on Stack Overflow in case someone else faces this issue.
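For anyone hitting the same symptom, here is a minimal, hypothetical sketch of the kind of code that causes it - once the root logger's handlers are gone, task log records no longer reach stdout/stderr, so the awslogs driver has nothing to ship to CloudWatch - together with a safer alternative:

import logging

# The problematic pattern: stripping handlers from the root logger.
# After this runs, anything that relied on the root handlers (including
# Celery task log output going to stdout) silently disappears.
root = logging.getLogger()
for handler in list(root.handlers):
    root.removeHandler(handler)

# A safer alternative: configure your own named logger instead of
# mutating the root logger that other libraries depend on.
app_logger = logging.getLogger("myapp")
app_logger.setLevel(logging.INFO)
app_logger.addHandler(logging.StreamHandler())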
Error message - the job failed with: "The output of the notebook is too large. Cause: rpc response (of 20972488 bytes) exceeds limit of 20971520 bytes"
Details:
We are using Databricks notebooks to run the job. The job runs on a job cluster. This is a streaming job.
The job started failing with the above-mentioned error.
We do not have any display(), show(), print(), or explain() calls in the job.
We are not using the awaitAnyTermination() method in the job either.
We also tried adding "spark.databricks.driver.disableScalaOutput true" to the job, but it still did not work; the job keeps failing with the same error.
We have followed all the steps mentioned in this document - https://learn.microsoft.com/en-us/azure/databricks/kb/jobs/job-cluster-limit-nb-output
Do we have any option to resolve this issue, or to find out exactly which command's output is causing it to exceed the 20 MB limit?
See the docs regarding Structured Streaming in production.
I would recommend migrating to workflows based on JAR jobs because:
Notebook workflows are not supported with long-running jobs. Therefore we don’t recommend using notebook workflows in your streaming jobs.
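As an illustration only (not the asker's code, and the paths below are hypothetical): when the streaming query runs as a plain non-notebook task - a JAR job or, equivalently, a spark-submit / Python wheel task - there is no notebook output channel to overflow in the first place. A minimal PySpark sketch:

from pyspark.sql import SparkSession

# Hypothetical paths - replace with your own source, checkpoint and target locations.
SOURCE_PATH = "/mnt/raw/events"
CHECKPOINT_PATH = "/mnt/checkpoints/events"
TARGET_PATH = "/mnt/silver/events"

spark = SparkSession.builder.appName("streaming-job").getOrCreate()

# Read the stream without display()/show(), so nothing is rendered
# back into a notebook's output.
stream = spark.readStream.format("delta").load(SOURCE_PATH)

query = (
    stream.writeStream
    .format("delta")
    .option("checkpointLocation", CHECKPOINT_PATH)
    .outputMode("append")
    .start(TARGET_PATH)
)

query.awaitTermination()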
Is there a way to run an AWS CloudFormation update in dry-run mode? Because I've realised that aws validate does not pick up all the errors, and when you run the update new errors are thrown.
The closest you can get to this is mock testing, which will require a fair amount of configuration time; it also cannot 100% guarantee that you won't encounter any errors.
Moto is an extensive library dedicated to mocking AWS infrastructure and is worth checking out, with the core CloudFormation endpoints included.
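A minimal sketch of what such a test could look like, assuming moto 5's mock_aws decorator (older moto versions expose mock_cloudformation instead); the stack name and template here are trivial placeholders:

import json
import boto3
from moto import mock_aws

# Placeholder template - substitute the template you actually deploy.
TEMPLATE = json.dumps({
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "Bucket": {"Type": "AWS::S3::Bucket"},
    },
})

@mock_aws
def test_stack_creates_without_errors():
    cf = boto3.client("cloudformation", region_name="us-east-1")
    cf.create_stack(StackName="test-stack", TemplateBody=TEMPLATE)

    # Inspect the (mocked) stack status instead of touching a real account.
    stack = cf.describe_stacks(StackName="test-stack")["Stacks"][0]
    assert stack["StackStatus"] == "CREATE_COMPLETE"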
We are deploying Spring Cloud Data Flow v2.2.1.RELEASE on Kubernetes. Almost everything seems to work, but scheduling does not. In fact, even when launching tasks manually from the UI (or the API), we see an error log. The same log is generated when trying to schedule, but in that case it makes the schedule creation fail.
Here is a stack trace extract:
java.lang.IllegalArgumentException: taskDefinitionName must not be null or empty
at org.springframework.util.Assert.hasText(Assert.java:284)
at org.springframework.cloud.dataflow.rest.resource.ScheduleInfoResource.<init>(ScheduleInfoResource.java:58)
at org.springframework.cloud.dataflow.server.controller.TaskSchedulerController$Assembler.instantiateResource(TaskSchedulerController.java:174)
at org.springframework.cloud.dataflow.server.controller.TaskSchedulerController$Assembler.instantiateResource(TaskSchedulerController.java:160)
at org.springframework.hateoas.mvc.ResourceAssemblerSupport.createResourceWithId(ResourceAssemblerSupport.java:89)
at org.springframework.hateoas.mvc.ResourceAssemblerSupport.createResourceWithId(ResourceAssemblerSupport.java:81)
at org.springframework.cloud.dataflow.server.controller.TaskSchedulerController$Assembler.toResource(TaskSchedulerController.java:168)
at org.springframework.cloud.dataflow.server.controller.TaskSchedulerController$Assembler.toResource(TaskSchedulerController.java:160)
at org.springframework.data.web.PagedResourcesAssembler.createResource(PagedResourcesAssembler.java:208)
at org.springframework.data.web.PagedResourcesAssembler.toResource(PagedResourcesAssembler.java:120)
at org.springframework.cloud.dataflow.server.controller.TaskSchedulerController.list(TaskSchedulerController.java:85)
at sun.reflect.GeneratedMethodAccessor180.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
...,
We've looked at the table content; the task does have a name.
Any idea?
I've finally found the source of the error by debugging Data Flow live. The problem arises when CronJobs that were not created by Data Flow are present in the namespace, which in my evaluation is a problem: the scheduler launches a process that loops over the Kubernetes CronJob resources and tries to process all of them.
Data Flow should certainly filter those resources using labels, like all Kubernetes-native tools do, to select only the elements that concern it. Any process could create CronJobs.
So, Pivotal / Data Flow people, it would probably be a good idea to enhance that part and thereby prevent this kind of "invisible" problem. I say invisible because the only error we get is the validation of the Schedule item complaining that the name is empty, and that is because the CronJob was not in any way linked to an SCDF task.
Hope that can help someone in the future.
Bug reported: https://github.com/spring-cloud/spring-cloud-deployer-kubernetes/issues/347
PR issued: https://github.com/spring-cloud/spring-cloud-deployer-kubernetes/pull/348
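To check whether you are affected, one option (a sketch, assuming a recent kubernetes Python client where CronJobs are served by the batch/v1 API) is to list the CronJobs in the namespace SCDF schedules into and inspect their labels; any CronJob that was not created by Data Flow is a candidate culprit:

from kubernetes import client, config

NAMESPACE = "default"  # replace with the namespace SCDF schedules into

config.load_kube_config()
batch = client.BatchV1Api()  # on older clusters CronJobs live in BatchV1beta1Api

for cron_job in batch.list_namespaced_cron_job(NAMESPACE).items:
    # Print each CronJob's name and labels so you can spot the ones
    # that were not created by Spring Cloud Data Flow.
    print(cron_job.metadata.name, cron_job.metadata.labels)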
I am continuing a Django project of someone who was using Celery along with Mandrill. There are daily reports that are sent to customers; for some reason, not a single mail was sent for three days, and the mails accumulated and were all sent together after three days. Since I am new to Celery, I want to know how to debug Celery delays and errors. What are the popular commands, and what execution path should I follow?
Short tips:
Set the log level to DEBUG in your Celery config; it will show you the registration and execution time of every task.
Install Flower, a popular tool for monitoring Celery tasks.
Use Sentry for handy error tracking and aggregation.
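For a quick look at what the workers are doing right now, Celery's inspect API is also handy. A minimal sketch, assuming your Celery app is importable from myproject.celery_app (a hypothetical module path):

# Inspect running workers to see in-flight and delayed work.
from myproject.celery_app import app  # hypothetical import path

insp = app.control.inspect()

print("Active tasks:   ", insp.active())     # currently executing
print("Reserved tasks: ", insp.reserved())   # prefetched, waiting on the worker
print("Scheduled tasks:", insp.scheduled())  # ETA/countdown tasks not yet due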
Happy debugging ;)