Does React Testing Library take up a lot of CPU/RAM, causing tests to time out & fail? - react-testing-library

I've had some issues with tests timing out randomly, usually on CircleCI but sometimes locally. Based on Kent C. Dodds' suggestion to write fewer, longer tests, I now have more tests that involve multiple clicks & multiple network requests (mocking fetch too). These tests seem to time out. CircleCI recently added a Resources tab to the pipeline with some interesting metrics. When the tests time out, the 4GB of RAM clearly sits at 100% for an extended time, and the test fails. On a passing test, the RAM mostly stays below 100%.
Failed test (4GB):
Passed test (4GB):
Updated resource_class to 8GB
I tried a single experiment updating my CircleCI config so that the resource_class gets bumped to large/8GB. The test passed, with better CPU usage % as well.
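For reference, the change amounts to something like the sketch below (job name, image, and test command are placeholders, not my actual config):
# .circleci/config.yml - sketch only
jobs:
  test:                       # hypothetical job name
    docker:
      - image: cimg/node:lts  # placeholder image
    resource_class: large     # 4 vCPUs / 8GB RAM, vs. the default medium (2 vCPUs / 4GB)
    steps:
      - checkout
      - run: yarn test        # placeholder test command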
So, does React Testing Library take up a lot of horsepower?
Is our default 4GB RAM docker image ok?

Related

GitLab coordinator waits 5 minutes after a job is finished by the runner - how to diagnose

We have a current (13.5.1-ee) Kubernetes-deployed GitLab instance which has recently developed an unusual and obstructive behaviour:
After a job finishes (either successfully or with a failure) and this is reported by the runner (as confirmed by the local log), the coordinator waits 5 minutes before reporting the status via the UI and starting the next job.
This behaviour does not depend on the:
type of executor. It occurs with both the docker and kubernetes executors.
size of artifacts (for the test cases there are none)
size of logs (for the test cases they are 5 lines long)
image being used (for the test cases it is busybox)
script (for the tests it is empty)
network quality (for the tests I have activated feature flags related to this)
It feels that the coordinator is attempting a call to another system and times out.
Has anyone seen this before? Does anyone have a means to diagnose?
This was a bug affecting Azure.

Pytest live logging with parallel execution - possible?

I have a test suite that I run with
python3 -mpytest --log-cli-level=DEBUG ...
on the build server. The live logs are useful to troubleshoot if the tests get stuck or are slow for some reason (the tests use external resources).
To speed things up, it is possible to run them with e.g.
python3 -mpytest -n 4 --log-cli-level=DEBUG ...
to have four parallel test runners. The speedup is almost linear with the number of processes, which is great, but unfortunately the parent process swallows all the live logs. I get the captured logs in case of a test failure, but I need the live logs as well to understand what is going on in real time. I understand that the output from all four parallel runs will be intermixed, and that is fine. The purpose is for the committer to just check the build server output and know roughly what is going on.
I am currently using pytest-xdist, but use none of the more advanced features from it (just the multiprocessing).
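One workaround (a sketch along the lines of the per-worker log file recipe in the pytest logging docs; the file name here is made up) is to give each xdist worker its own log file and tail those on the build server:
# conftest.py - sketch: one live log file per xdist worker
import logging
import os

def pytest_configure(config):
    # pytest-xdist sets PYTEST_XDIST_WORKER (gw0, gw1, ...) in worker processes
    worker_id = os.environ.get("PYTEST_XDIST_WORKER")
    if worker_id is not None:
        logging.basicConfig(
            filename=f"live_logs_{worker_id}.log",  # hypothetical file name
            level=logging.DEBUG,
        )
This doesn't give a single interleaved stream, but it does restore real-time visibility while the run is in progress.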

FabricDCA and MaxDiskQuotaInMB Configuration

There are two parts to this question. First, what falls under the purview of the Diagnostics/MaxDiskQuotaInMB configuration? Is it everything under SvcFab/Log? Just SvcFab/Log/AppInstanceData/? Having more info on this would be nice.
Second, what is the proper course of action if FabricDCA.exe is running but the SvcFab/Log and SvcFab/Log/AppInstanceData/ folders exceed the limits we've set on their size? My team set them to 10,000 MB, but SvcFab/Log regularly takes up 12-16 GB.
The cluster configuration on Azure recognizes the change to the MaxDiskQuotaInMB setting, but there seems to be no impact on the node itself. I've also tried restarting FabricDCA.exe, and so far (after several hours) that has not helped either.
One node in our cluster had so much space taken up by logs (over our limit) that remaining storage space was reduced to 1 MB.
Posting a more complete answer since it may be helpful to other people.
Most of the things under the SvcFab/Log folder should fall under the quota set by MaxDiskQuotaInMB. There are a few things that may not, but the majority of what usually takes up disk space is included. Keep in mind also that the task cleaning the disk usually runs every 5 minutes, so you may see usage go over the quota within that timeframe.
If FabricDCA.exe is not properly cleaning files from this folder, it is possible that you are hitting a bug in the .NET runtime where all System.Threading.Timer instances stop firing, and the disk does not get cleaned because FabricDCA relies on these timers to do so.
This is the bug on the .NET core side tracking the issue: (https://github.com/dotnet/coreclr/issues/26771). It seems to happen when the machine is running out of memory intermittently.
There is an auto-mitigation added in FabricDCA in Service Fabric 7.0.
The manual mitigation is usually to kill the FabricDCA.exe process.
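On a Windows node this is simply a matter of terminating the process, for example from an elevated prompt (command shown for illustration):
taskkill /F /IM FabricDCA.exe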
The process should start again and after a few minutes it will start cleaning again.
You mentioned that you already tried killing FabricDCA.exe, so maybe the mitigation above does not work for you. In that case, try taking a look at the Service Fabric cluster manifest directly: it may be that your new configuration appears to be accepted by the ARM template deployment but never reaches the cluster manifest, which is the source of truth in this case.
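Assuming the Service Fabric PowerShell module is available and you can connect to the cluster, one way to check is to dump the manifest and search it for the setting, e.g.:
Connect-ServiceFabricCluster
Get-ServiceFabricClusterManifest > ClusterManifest.xml   # then look for MaxDiskQuotaInMB in the XML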
Update:
There was a regression introduced as part of the auto-mitigation above which caused the AppInstanceData folder to fill up the disk. This is fixed in Service Fabric version 7.0.466.

Running tests in parallel on multiple machines using py.test

I know that UI tests can be run in parallel on multiple machines using selenium grid. How about API tests?
I looked at the pytest-xdist plugin, and it can run tests in parallel on the local machine using py.test -n NUM, which will send tests to multiple CPUs and run them in parallel. This may not be as effective or as fast if the number of tests we would like to run in parallel is much larger than the number of CPUs on the machine. For example: if the machine has 4 CPUs and we would like to run 50 tests in parallel.
And it seems that to run the tests on a remote machine we need to do something like
py.test -d --tx socket=192.168.1.102:8888 --rsyncdir mypkg mypkg
I am wondering if there is a way to distribute the tests to multiple remote machines and run them in parallel. For example: if I have 1000 tests and 50 remote machines, then I would like each remote machine to run one or more tests at the same time so that the tests complete faster. In other words, all 1000 tests would complete in roughly the time it takes to run 20 tests, or less.
Thanks.
It looks like you want the load distribution mode, followed by multiple invocations of the --tx argument:
py.test --dist=load --tx socket=192.168.1.110:8888 --tx socket=192.168.1.111:8888 --tx socket=192.168.1.112:8888 --rsyncdir mypkg mypkg
I'm sure you've looked at the CPU usage of the python processes when running the tests. If you are doing what I expect you are doing (running an integration test suite against a single instance of a network service with high response times), your test suite isn't CPU bound but is actually I/O bound. For this type of workload, CPU usage may appear high, but it actually includes the time the test runner spent waiting for a response from the system under test.
The biggest problem I've encountered when parallelizing that type of test suite is that the order in which tests complete sometimes matters, and when run in parallel, tests finish in a different order than when they run in series, simply due to variation in response times, causing intermittent and difficult-to-troubleshoot test failures.
If that doesn't happen with multiple cores on a single machine, that's a good sign your plan will work. That having been said, because there is operational overhead involved in keeping any pool of hosts around (patching, configuration, provisioning, and networking, not to mention other unexpected issues), I suggest you try something different.
I think you should consider refactoring your test code to use asynchronous IO instead of setting up the test grid. When you do this correctly, multiple tests will be able to run on one core at the same time. Your sysadmin (which may be you!) will thank you.
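As a rough illustration of what that refactor buys you (a sketch only: the URLs and the aiohttp dependency are made up for the example, not part of your suite), many I/O-bound checks can be awaited concurrently on a single core:
# sketch: 50 I/O-bound checks running concurrently in one process/core
# assumes the aiohttp package is installed; URLs are placeholders
import asyncio
import aiohttp

async def check_endpoint(session, url):
    async with session.get(url) as resp:
        # while this request waits on the network, the other checks make progress
        assert resp.status == 200

async def main():
    urls = [f"https://api.example.com/item/{i}" for i in range(50)]
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(check_endpoint(session, url) for url in urls))

asyncio.run(main())
Inside pytest specifically, a plugin such as pytest-asyncio lets individual tests be written as async def, so the same idea can be applied within the suite itself.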

Why is salt-cloud so slow compared to terraform?

I'm comparing salt-cloud and terraform as tools to manage our infrastructure at GCE. We use salt stack to manage VM configurations, so I would naturally prefer to use salt-cloud as an integral part of the stack and phase out terraform as a legacy thing.
However, my use case is very sensitive to VM deployment time, because we offer a PaaS solution with VMs deployed on customer request, so we need to deliver ready VMs at the click of a button within seconds.
And what puzzles me is why salt-cloud takes so long to deploy basic machines.
I have created a simple head-to-head test, deploying three VMs based on the default CentOS7 image using both terraform and salt-cloud (both in parallel mode). The time difference is stunning: where terraform needs around 30 seconds to deploy the requested machines (similar to the time needed to deploy through the GCE GUI), salt-cloud takes around 220 seconds to deploy exactly the same machines under the same account in the same zone. Especially strange is that for the first 130 seconds salt-cloud does not start deploying and does seemingly nothing at all; only after around 130 seconds does it show a message that it is deploying VMs, and those VMs appear in the GUI as being deployed.
Is there something obvious that I'm missing about salt-cloud that makes it so slow? Can it be sped up somehow?
I would prefer to use the full Salt stack, but with its current speed issues I cannot really afford that.
Note that this answer is a speculation based on what I understood about terraform and salt-cloud, I haven't verified with an experiment!
I think the reason is that Terraform keeps state of the previous run (either locally or remotely), while salt-cloud doesn't keep state and so queries the cloud before actually provisioning anything.
These two approaches (keeping state or querying before doing something) are needed, since both tools are idempotent (you can run them multiple times safely).
For example, I think that if you remove Terraform's state file and re-run it, it will assume there is nothing in the cloud and will actually instantiate a duplicate. This is not to imply that Terraform does it wrong; it is to show that state is important, and the Terraform docs say clearly that when operating in a team the state should be saved remotely, exactly to avoid this kind of problem.
Following my line of thought, this should also mean that if you either run salt-cloud in verbose debug mode or look at the network traffic it generates, in the first 130 seconds you mention (before it says "deploying VMs") you should see queries from salt-cloud to the cloud provider to dynamically construct that state.
Last point: the fact that salt-cloud doesn't store the state of a previous run doesn't mean that it is automatically safe to use in a team environment. It is safe to use as long as no two team members run it at the same time. On the other hand, terraform with remote state on Consul allows, for example, locking, so that concurrent usage by the team will always be safe.
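For completeness, enabling that remote state with locking is roughly a matter of a backend block like the sketch below (the Consul address and key path are placeholders):
terraform {
  backend "consul" {
    address = "consul.example.com:8500"  # placeholder Consul address
    path    = "terraform/state"          # placeholder KV path for the state
    lock    = true                       # state locking (on by default for this backend)
  }
}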