Running scalapbc command from a thread pool - scalapb

I am trying to run the scalapb command from a threadpool with each thread running the scalapbc command;
When I do that, I get an error of the form :
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGBUS (0x7) at pc=0x00007f32dd71fd70, pid=8346, tid=0x00007f32e0018700
As per my google search, this issue occurs when the /tmp folder is full or it is trying to be accessed by multiple code simultaneously.
My question is, is there a way to issue scalapbc commands using threading without getting above error? how to make sure that the temp folders being used by the individual threads don't interfere with each other?
This issue occurs most of the times when I run the code but sometimes the build passes as well.

Related

PowerShell Azure Function: How to fix "Failed to start a new language worker for runtime: powershell"?

In an Azure Function App containing one PowerShell function I get the following log message regularly:
"Failed to start a new language worker for runtime: powershell."
It has "Error" level and thus triggers our error alert notifications. I'm not entirely sure when this message appears. It might appear around restarting the function app, which might explain it somewhat. I think I remember it appearing during normal function operation - but I might be mistaken.
There is a rather involved thread over here about a similar message but for the dotnet runtime that suggests there are configuration options to configure: Azure Function - Failed to start a new language worker for runtime: dotnet-isolated
My function app runtime version is ~4, PS Core version is 7.0, platform is 64 bit and Windows.
What is the error message trying to telling me? Can I ignore it? Is there a configuration I can add to fix it?
After monitoring this for a while over the course of multiple deployments and restarts of PowerShell-based Function Apps my conclusion is this:
The error "Failed to start a new language worker for runtime: powershell." only appears when restarting the Function App. Thus my takeaway is that it can be ignored.

failed using cuda-gdb to launch program with CUPTI calls

I'm having this weird issue: I have a program that uses CUPTI callbackAPI to monitor the kernels in the program. It runs well when it's directly launched; but when I put it under cuda-gdb and run, it failed with the following error:
error: function cuptiSubscribe(&subscriber, CUpti_CallbackFunc)my_callback, NULL) failed with error CUPTI_ERROR_NOT_INITIALIZED
I've tried all examples in CUPTI/samples and concluded that programs that use callbackAPI and activityAPI will fail under cuda-gdb. (They are all well-behaved without cuda-gdb) But the fail reason differs:
If I have calls from activityAPI, then once run it under cuda-gdb, it'll hang for a minute then exit with error:
The CUDA driver has hit an internal error. Error code: 0x100ff00000001c Further execution or debugging is unreliable. Please ensure that your temporary directory is mounted with write and exec permissions.
If I have calls from callbackAPI like my own program, then it'll fail out much sooner with the same error:
CUPTI_ERROR_NOT_INITIALIZED
Any experience on this kinda issue? I really appreciate that!
According to NVIDIA forum posting here and also referred to here, the CUDA "tools" must be used uniquely. These tools include:
CUPTI
any profiler
cuda-memcheck
a debugger
Only one of these can be "in use" on a code at a time. It should be fairly easy for developers to use a profiler, or cuda-memcheck, or a debugger independently, but a possible takeaway for those using CUPTI, who also wish to be able to use another CUDA "tool" on the same code, would be to provide a coding method to be able to disable CUPTI use in their application, when they wish to use another tool.

Why do Selenium tests behave different on different machines?

I couldn't find much information on Google regarding this topic. Below, I have provided three results from the same Selenium tests. Why am I getting different results when running the tests from different places?
INFO:
So our architecture: Bitbucket, Bamboo Stage 1 (Build, Deploy to QA), Bamboo Stage 2 (start Amazon EC2 instance "Test", run tests from Test against recently deployed QA)
Using Chrome Webdriver.
For all three of the variations I am using the same QA URL that our application is deployed on.
I am running all tests Parallelizable per fixture
The EC2 instance is running Windows Server 2012 R2 with the Chrome browser installed
I have made sure that the test solution has been properly deployed to the EC2 "test" instance. It is indeed the exact same solution and builds correctly.
First, Local:
Second, from EC2 Via SSM Script that invokes the tests:
Note that the PowerShell script calls the nunit3-console.exe just like it would be utilized in my third example using the command line.
Lastly, RDP in on EC2 and run tests from the command line:
This has me perplexed... Any reasons why Selenium is running different on different machines?
This really should be a comment, but I can't comment yet so...
I don't know enough about the application you are testing to say for sure, but this seems like something I've seen testing the application I'm working on.
I have seen two issues. First, Selenium is checking for the element before it's created. Sometimes it works and sometimes it fails, it just depends on how quickly the page loads when the test runs. There's no rhyme or reason to it. Second, the app I'm testing is pretty dumb. When you touch a field, enter data and move on to the next, it, effectively, posts all editable fields back to the database and refreshes all the fields. So, Selenium enters the value, moves to the next field and pops either a stale element error or can't find element error depending on when in the post/refresh cycle it attempts to interact with the element.
The solution I have found is moderately ugly, I tried the wait until, but because it's the same element name, it's already visible and is grabbed immediately which returns a stale element. As a result, the only thing that I have found is that by using explicit waits between calls, I can get it to run correctly consistently. Below is an example of what I have to do with the app I'm testing. (I am aware that I can condense the code, I am working within the style manual for my company)
Thread.Sleep(2000);
By nBaseLocator = By.XPath("//*[#id='attr_seq_1240']");
IWebElement baseRate = driver.FindElement(nBaseLocator);
baseRate.SendKeys(Keys.Home + xBaseRate + Keys.Tab);
If this doesn't help, please tell us more about the app and how it's functioning so we can help you find a solution.
#Florent B. Thank you!
EDIT: This ended up not working...
The tests are still running different when called remotely with a powershell script. But, the tests are running locally on both the ec2 instance and my machine correctly.
So the headless command switch allowed me to replicate my failed tests locally.
Next I found out that a headless chrome browser is used during the tests when running via script on an EC2 instance... That is automatic, so the tests where indeed running and the errors where valid.
Finally, I figured out that the screen size is indeed the culprit as it was stuck to a size of 600/400 (600/400?)
So after many tries, the only usable screen size option for Windows, C# and ChromeDriver 2.32 is to set your webDriver options when you initiate you driver:
ChromeOptions chromeOpt = new ChromeOptions();
chromeOpt.AddArguments("--headless");
chromeOpt.AddArgument("--window-size=1920,1080");
chromeOpt.AddArguments("--disable-gpu");
webDriver = new ChromeDriver(chromeOpt);
FINISH EDIT:
Just to update
Screen size is large enough.
Still attempting to solve the issue. Anyone else ran into this?
AWS SSM Command -> Powershell -> Run Selenium Tests with Start-Process -> Any test that requires an element fails because ElementNotFound or ElementNotVisible exceptions.
Using POM for tests. FindsBy filter in c# is not finding elements.
Running tests locally on EC2 run fine from cmd, powershell and Powershell ISE.
The tests do not work correctly when executing with the AWS SSM Command. Cannot find any resources to fix problem.

Perl Multithreading: ''forks'' vs ''threads'' in cross platforms scripts

I'm trying to develop a software featuring multithreading.
On Linux the script works fine, the module i've used is ''forks''.
In other words, no problem of shared handlers between threads while querying db and stuff like that.
Once i've been trying to run the script on windows (Strawberry Perl), when i try to cpan install forks it says that ''forks module'' is not supported for the current version of my OS (64bit).
Moving over, i decided to use ''threads'' instead, but i gained the following error, almost certainly linked with shared handlers between threads.
''
Thread 1 terminated abnormally: DBD::SQLite::db prepare failed: handle 2 is owned by thread d97fe8 not current thread 3a01058 (handles can't be shared between threads and your driver may need a CLONE method added) at file.pl line 180, line 1.
''
How to fix the aforementioned problem and make the script runnable on windows Strawberry Perl?
In general each thread or process needs to have its own handle to the database. Create a new handle after each fork or within each threads::create.
First of all, thanks everybody for the prompt answer and the very good suggestions.
That's my solution.
Apparently, it seems there's no need to create a new db handler for each thread, just connect to the database each and every time you create a new thread.
So, if you're using ''threads'', i think it should be enough to include the following line within the ''threads->create'' method to make it runnable:
$dbh = DBI->connect( "dbi:SQLite:db_name.dbl" ) || die "Cannot connect: $DBI::errstr";
As you can see, the handler is static but it's not literally ''shared'' between
threads, as it is created each and every time you connect to the db, so each and every time you run a new thread!
Please, don't ask me why the original script works fine in Linux using ''forks'' and without this trick...it seems that with ''forks'' you don't need to care about shared handlers.
Thanks again

Attempt to access remote folder mounted with CIFS hangs when disconnected

This question is an extension for that question.
Yet again: I'm working under CentOS 6.0 and I have a remote win7 folder, mounted with:
mount -t cifs //PC128/mnt /media/net -o "username=WORKGROUP\user,password=pwd,rw,noexec,soft,uid=user,gid=user"
When remote folder is not available (e.g. network cable is pulled out) an attempt to access the remote folder locks an application I'm working on. At first I detected that QDir::exists() caused locking for 20-90 seconds (I still can't find out why such difference), further I detected that any call to stat() function leads to application lock.
I followed an advice provided in topic above, I moved QDir::exists() call (and later - call to the stat() function) to another thread and this didn't solve the problem. The application still hangs when connection is suddenly lost. Qt trace shows that lock is somewhere in the kernel:
0 __kernel_vsyscall
1 __xstat64#GLIBC_2.1 /lib/libc.so.6
2 QFSFileEnginePrivate::doStat stat.h
I did also tried to check if remote share is still mounted before trying to access folder itself, but it didn't help. Approaches such as:
mount | grep /media/net
show that shared folder is still mounted even is there is no active connection to the network.
Checking folder status differences such as:
stat -fc%t:%T /media/net/ != stat -fc%t:%T /media/net/..
also hangs for ~20 seconds.
So I have several questions:
Is there any way to change CIFS timeouts? I did try to find out but it seems that there is no appropriate parameters and no CIFS config.
How can I check if remote folder is still mounted and not get locked?
How can I check is folder exists and also not get locked?
Your problem: "An unreachable network filesystem" is a very well known example which trigger linux hung task which isn't the same of zombies process at all(killing the parent PID won't do anything)
An hung task, is task which triggered a system call that cause problem in the kernel, so that the system call never return.
The major particularity is that the task is declared in the "D" state by the scheduler which mean the program is in an uninterruptible state. This mean that you can do nothing to stop you program: You can trigger all signal to the task, it would not respond. Launching hundreds of SIGTERM/SIGKILL does nothing!
This the case whith my old kernel: when my nfs server crash, I need to reboot the client to kill the tasks using the filesystem. I compiled it a long time ago (I have still the build tree on my hdd) and during the configuration I saw this in lib/Kconfig.debug:
config DETECT_HUNG_TASK
bool "Detect Hung Tasks"
depends on DEBUG_KERNEL
default LOCKUP_DETECTOR
help
Say Y here to enable the kernel to detect "hung tasks",
which are bugs that cause the task to be stuck in
uninterruptible "D" state indefinitiley.
When a hung task is detected, the kernel will print the
current stack trace (which you should report), but the
task will stay in uninterruptible state. If lockdep is
enabled then all held locks will also be reported. This
feature has negligible overhead.
It was only proposing to detect such tash or panic on detection: I don't checked if recent kernel actually can solve the problem (It seems to be the case with your question), but I think it didn't worth enabling it.
There is second problem : normally, the detection occur after 120 seconds, but I saw also a Konfig option for this:
config DEFAULT_HUNG_TASK_TIMEOUT
int "Default timeout for hung task detection (in seconds)"
depends on DETECT_HUNG_TASK
default 120
help
This option controls the default timeout (in seconds) used
to determine when a task has become non-responsive and should
be considered hung.
It can be adjusted at runtime via the kernel.hung_task_timeout_secs
sysctl or by writing a value to
/proc/sys/kernel/hung_task_timeout_secs.
A timeout of 0 disables the check. The default is two minutes.
Keeping the default should be fine in most cases.
This also works with kernel threads: example: make a loop device to a file on a fuse filesystem. Then crash the userspace program controlling the fuse filesystem!
You should a get a Ktread which name is in the form loopX (X correspond normally to your loopback device number) HUNGing!
weblinks:
https://unix.stackexchange.com/questions/5642/what-if-kill-9-does-not-work (look at the answer written by ultrasawblade)
http://www.linuxquestions.org/questions/linux-general-1/kill-a-hung-task-when-kill-9-doesn't-help-697305/
http://forums-web2.gentoo.org/viewtopic-t-811557-start-0.html
http://comments.gmane.org/gmane.linux.kernel/1189978
http://comments.gmane.org/gmane.linux.kernel.cifs/7674 (This is a case similar to yours)
In your case of the three question: you have the answer: This probably due to what is probably a well known bug in the vfs linux kernel layer! (There is no CIFS timeouts)
After much trial & error I found a solution that persists.
# vim /etc/fstab
//192.168.1.122/myshare /mnt/share cifs username=user,password=password,_netdev 0 0
The _netdev option is important since we are mounting a network device. Clients may hang during the boot process if the system encounters any difficulties with the network.
https://www.redhat.com/sysadmin/samba-windows-linux