Parallel.For System.OutOfMemoryException - c#-3.0

We have a fairly simple program that's used for creating backups. I'm attempting to parallelize it but am getting an OutOfMemoryException within an AggregateException. Some of the source folders are quite large, and the program doesn't crash until about 40 minutes after it starts. I don't know where to start looking, so the code below is a near-exact dump of all the code, sans directory structure and exception-logging code. Any advice as to where to start looking?
using System;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;
namespace SelfBackup
{
    class Program
    {
        static readonly string[] saSrc = {
            "\\src\\dir1\\",
            //...
            "\\src\\dirN\\", //this folder is over 6 GB
        };
        static readonly string[] saDest = {
            "\\dest\\dir1\\",
            //...
            "\\dest\\dirN\\",
        };
        static void Main(string[] args)
        {
            Parallel.For(0, saDest.Length, i =>
            {
                //destination paired with saSrc[i] (declaration was missing from the pasted code)
                string sDest = saDest[i];
                try
                {
                    if (Directory.Exists(sDest))
                    {
                        //Delete directory first so old stuff gets cleaned up
                        Directory.Delete(sDest, true);
                    }
                    //recursive function
                    clsCopyDirectory.copyDirectory(saSrc[i], sDest);
                }
                catch (Exception e)
                {
                    //standard error logging
                    CL.EmailError();
                }
            });
        }
    }
}
///////////////////////////////////////
using System.IO;
using System.Threading.Tasks;
namespace SelfBackup
{
    static class clsCopyDirectory
    {
        static public void copyDirectory(string Src, string Dst)
        {
            Directory.CreateDirectory(Dst);
            /* Copy all the files in the folder
            If and when .NET 4.0 is installed, change
            Directory.GetFiles to Directory.EnumerateFiles for
            slightly better performance.*/
            Parallel.ForEach<string>(Directory.GetFiles(Src), file =>
            {
                /* An exception thrown here may be arbitrarily deep into
                this recursive function. There's also a good chance that
                if one copy fails here, so too will other files in the
                same directory, so we don't want to spam out hundreds of
                error e-mails but we don't want to abort altogether.
                Instead, the best solution is probably to throw back up
                to the original caller of copyDirectory and move on to
                the next Src/Dst pair by not catching any possible
                exception here.*/
                File.Copy(file, //src
                    Path.Combine(Dst, Path.GetFileName(file)), //dest
                    true);//bool overwrite
            });
            //Call this function again for every directory in the folder.
            Parallel.ForEach(Directory.GetDirectories(Src), dir =>
            {
                copyDirectory(dir, Path.Combine(Dst, Path.GetFileName(dir)));
            });
        }
    }
}
The Threads debug window shows 417 Worker threads at the time of the exception.
EDIT: The copying is from one server to another. I'm now trying to run the code with the last Parallel.ForEach changed to a regular foreach.

Making a few guesses here as I haven't yet had feedback from the comment to your question.
I am guessing that the large number of worker threads arises because the actions (an action being the unit of work carried out by the parallel foreach) are taking longer than some threshold, so the underlying ThreadPool keeps growing. The ThreadPool grows the pool so that new tasks are not blocked by existing long-running tasks, e.g. if all my current threads have been busy for half a second, I'll start adding more threads to the pool. However, you get into trouble when all tasks are long-running and each new task makes the existing ones run even longer. That is probably why you are seeing so many worker threads, possibly because of disk thrashing or slow network I/O (if networked drives are involved). Each worker thread also reserves its own stack (1 MB by default), so hundreds of them can use up a lot of address space, which may be what eventually triggers the OutOfMemoryException.
I am also guessing that files are being copied from one disk to another, or from one location to another on the same disk. In that case, adding threads to the problem is not going to help much. The source and destination disks only have one set of heads each, so trying to make them do multiple things at once is likely to slow things down:
The disk heads will be lurching all over the place.
Your disk/OS caches may be frequently invalidated.
This may not be a problem that lends itself well to parallelization.
Update
In answer to your comment, if you are getting a speed-up using multiple threads on smaller datasets, then you could experiment with lowering the maximum number of threads used in your parallel foreach, e.g.
ParallelOptions options = new ParallelOptions { MaxDegreeOfParallelism = 2 };
Parallel.ForEach(Directory.GetFiles(Src), options, file =>
{
    //Do stuff
});
But please do bear in mind that disk thrashing may negate any benefits from parallelization in the general case. Play about with it and measure your results.
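As one way to experiment with this, here is a minimal sketch (not the asker's code). It replaces the nested Parallel.ForEach calls with a single flat loop over all files, so that one MaxDegreeOfParallelism setting caps the total number of concurrent copies instead of each recursion level starting its own loop. The names BoundedCopy and CopyTree and the default of 2 are assumptions chosen for illustration, and Directory.EnumerateFiles needs .NET 4.0 or later.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

static class BoundedCopy
{
    // Copies every file under src to the same relative path under dst,
    // running at most maxParallel copies at a time.
    // Limitation of this sketch: directories containing no files are not recreated.
    public static void CopyTree(string src, string dst, int maxParallel = 2)
    {
        string root = Path.GetFullPath(src);
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };

        // EnumerateFiles is lazy, so the complete file list is never built up front.
        var files = Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories);

        Parallel.ForEach(files, options, file =>
        {
            // Path of the file relative to the source root.
            string relative = file.Substring(root.Length)
                .TrimStart(Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar);
            string target = Path.Combine(dst, relative);

            // CreateDirectory is a no-op when the directory already exists.
            Directory.CreateDirectory(Path.GetDirectoryName(target));
            File.Copy(file, target, true); // true = overwrite
        });
    }
}
```

It could be called once per source/destination pair from an ordinary for loop, e.g. BoundedCopy.CopyTree(saSrc[i], saDest[i], 2), and timed with a few different limits to see whether any parallelism helps at all on your hardware.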

Related

How to capture command line input from Vert.x

Env: Mac OS 12.1, JDK 17, Vert.x 4.2.4
Question: how do I capture command line input from a verticle? So far I have tried the following in the public void start(Promise<Void> startPromise) throws Exception method:
getVertx().createSharedWorkerExecutor("sys-in").executeBlocking(promise -> {
    try (final BufferedReader br = new BufferedReader(new InputStreamReader(System.in))) {
        String line;
        int count = 0;
        do {
            System.out.print("message to MC: ");
            line = br.readLine();
            count++;
            //doSth(line); // e.g. send line over multicast
        } while (count < 3);
    } catch (Throwable t) {
        // log.info("<start> ", t);
    } finally {
        // bye(); // send a final message and close vertx
        promise.complete();
    }
});
This will start, get 3 nulls from br, and exit. I also tried a separate ExecutorService, in vain. I couldn't find any help in the Vert.x docs either. Any hints are appreciated:
I am aware of Vert.x's warnings about doing blocking stuff
Vert.x might not be meant to be used this way, but it would be cool if reading from the command line could be done with the same toolkit
I understand what you are trying to accomplish, but the problem is that it goes against the fundamentals of the verticle concept. Waiting for user input is a potentially infinite blocking operation, i.e. there is no guarantee the user will ever enter anything. In that case, you are left with a verticle that hangs forever, consuming resources and stuck in one spot. Multiply this if you are using worker verticles and you might have serious problems with the app. This issue is also emphasized here: https://vertx.io/docs/vertx-core/java/#blocking_code (under Warning).
In the link provided you can also find a suggested solution that uses a separate thread. A non-Vert.x thread won't mind being blocked, and once the user input arrives it can inform the Vert.x part of the application via the event bus that the code depending on that input can now run.
This might not be the solution you had in mind since it's not pure Vert.x, but bear in mind that Vert.x is just another tool, and that tool is not a good fit for what you are trying to accomplish here. However, it pairs well with plain Java.

(Laravel 5) Monitor and optionally cancel an ALREADY RUNNING job on queue

I need to be able to monitor and cancel an ALREADY RUNNING job on a queue.
There are a lot of answers about deleting QUEUED jobs, but none about an already running one.
This is the situation: I have a "job" that consists of HUNDREDS OF THOUSANDS of rows in a database that need to be queried ONE BY ONE against a web service.
Every row needs to be picked up and queried against a web service, then the response stored and the row's status updated.
I already had that working as a Command (launched from / outputting to the console), but now I need to implement queues in order to allow piling up more jobs from more users.
So far I've looked at Horizon (which doesn't run on Windows due to missing process control libs). However, from the demos I've seen, it lacks (I believe) a couple of things I need:
Dynamically configurable timeout (the whole job may take more than 12 hours, depending on the number of rows to process on the selected job)
Ability to CANCEL an ALREADY RUNNING job.
I also considered generating EACH REQUEST as a new job instead of treating the whole collection of rows as one "job" (this would overcome the timeout problem), but that would give me a Horizon "pending jobs" list of hundreds of thousands of records per job, and that would kill the browser (I know Redis can handle this without flinching). Further, I guess it is not possible to cancel "all jobs belonging to X tag".
I've been thinking about hitting an API route, firing the job and decoupling it from the app, but it seems that this requires forking processes.
For the ability to cancel, I would implement a database table with a job_id, and when the user hits an API to cancel a job, I'd mark it as "halted". On every loop iteration the job would check its status and, if it finds "halted", kill itself.
If I've missed any aspect just holler and I'll add it or clarify about it.
So I'm asking for advice here since I'm new to Laravel: how could I achieve this?
So I finally came up with this (a bit clunky) solution:
In Controller:
public function cancelJob()
{
    $jobs = DB::table('jobs')->get();
    # I could use a specific ID and user owner filter, etc.
    foreach ($jobs as $job) {
        DB::table('jobs')->delete($job->id);
    }
    # This is a file that... well, it's self explaining
    touch(base_path(config('files.halt_process_signal')));
    return "Job cancelled - It will stop soon";
}
In job class (inside model::chunk() function)
# CHECK FOR HALT SIGNAL AND [OPTIONALLY] STOP THE PROCESS
if ($this->service->shouldHaltProcess()) {
    # build stats, do some cleanup, log, etc...
    $this->halted = true;
    $this->service->stopProcess();
    # Returning FALSE is what makes the chunk() method stop looping
    return false;
}
In service class:
/**
 * Checks the existence of the 'Halt Process Signal' file
 *
 * @return bool
 */
public function shouldHaltProcess(): bool
{
    return file_exists($this->config['files.halt_process_signal']);
}
/**
 * Stop the batch process
 *
 * @return void
 */
public function stopProcess(): void
{
    logger()->info("=== HALT PROCESS SIGNAL FOUND - STOPPING THE PROCESS ===");
    $this->deleteHaltProcessSignalFile();
    return;
}
It doesn't look very elegant, but it works.
I've searched all over the web, and most answers go for Horizon or other tools that don't fit my case.
If anyone has a better way to achieve this, they're welcome to share it.
Laravel queues have 3 important config options:
1. retry_after
2. timeout
3. tries
See more: https://laravel.com/docs/5.8/queues
"Dynamically configurable timeout (the whole job may take more than 12 hours, depending on the number of rows to process on the selected job)"
I think you can set timeout and retry_after to about 24h.
"Ability to CANCEL an ALREADY RUNNING job."
Delete the job from the jobs table
Kill the process by its process ID on your server
Hope it helps you :)

MS-MPI MPI_Barrier: sometimes hangs indefinitely, sometimes doesn't

I'm using the MPI.NET library, and I've recently moved my application to a bigger cluster (more COMPUTE-NODES). I've started seeing various collective functions hang indefinitely, but only sometimes. About half the time a job will complete, the rest of the time it'll hang. I've seen it happen with Scatter, Broadcast, and Barrier.
I've put a MPI.Communicator.world.Barrier() call (MPI.NET) at the start of the application, and created trace logs (using the MPIEXEC.exe /trace switch).
C# code snippet:
static void Main(string[] args)
{
    var hostName = System.Environment.MachineName;
    Logger.Trace($"Program.Main entered on {hostName}");
    string[] mpiArgs = null;
    MPI.Environment myEnvironment = null;
    try
    {
        Logger.Trace($"Trying to instantiated on MPI.Environment on {hostName}. Is currently initialized? {MPI.Environment.Initialized}");
        myEnvironment = new MPI.Environment(ref mpiArgs);
        Logger.Trace($"Is currently initialized?{MPI.Environment.Initialized}. {hostName} is waiting at Barrier... ");
        Communicator.world.Barrier(); // CODE HANGS HERE!
        Logger.Trace($"{hostName} is past Barrier");
    }
    catch (Exception envEx)
    {
        Logger.Error(envEx, "Could not instantiate MPI.Environment object");
    }
    // rest of implementation here...
}
I can see the msmpi.dll's MPI_Barrier function being called in the log, and I can see messages being sent and received thereafter for a passing and a failing example. For the passing example, messages are sent/received and then the MPI_Barrier function Leave is logged.
For the failing example it looks like one (or more) of the sent messages is lost: it is never received by the target. Am I correct in thinking that messages lost within the MPI_Barrier call will mean that the processes never synchronize, and therefore all get stuck at the Communicator.world.Barrier() call?
What could be causing this to happen intermittently? Could poor network performance between the COMPUTE-NODES be a cause?
I'm running MS HPC Pack 2008 R2, so the version of MS-MPI is pretty old, v2.0.
EDIT - Additional information
If I keep a task running within a single node, this issue does not happen. For example, if I run a task using 8 cores on one node it is fine, but if I use 9 cores across two nodes I'll see this issue ~50% of the time.
Also, we have two clusters in use and this only happens on one of them. They are both virtualized environments, but appear to be set up identically.

SQLConnection Pooling - Handling InvalidOperationExceptions

I am designing a highly concurrent CCR application in which it is imperative that I do NOT block a thread or send one to sleep.
I am hitting SqlConnection pool issues, specifically InvalidOperationExceptions when calling SqlConnection.Open.
I can potentially retry a handful of times, but this doesn't really solve the problem.
The ideal solution for me would be a way of periodically re-checking for connection availability that doesn't tie up a thread.
Any ideas?
[Update]
Here is a related problem/solution posted at another forum
The solution requires a manually managed connection pool. I'd rather have a solution which is more dynamic i.e. kicks in when needed
Harry, I've run into this as well, also whilst using the CCR. My experience was that having completely decoupled my dispatcher threads from blocking on any I/O, I could consume and process work items much faster than the SqlConnection pool could cope with. Once the maximum-pool-limit was hit, I ran into the sort of errors you are seeing.
The simplest solution is to pre-allocate a number of non-pooled asynchronous SqlConnection objects and post them to some central Port<SqlConnection> object. Then whenever you need to execute a command, do so within an iterator with something like this:
public IEnumerator<ITask> Execute(SqlCommand someCmd)
{
    // Assume that 'connPort' has been posted with some open
    // connection objects.
    try
    {
        // Wait for a connection to become available and assign
        // it to the command.
        yield return connPort.Receive(item => someCmd.Connection = item);
        // Wait for the async command to complete.
        var iarPort = new Port<IAsyncResult>();
        var iar = someCmd.BeginExecuteNonQuery(iarPort.Post, null);
        yield return iarPort.Receive();
        // Process the response.
        var rc = someCmd.EndExecuteNonQuery(iar);
        // ...
    }
    finally
    {
        // Put the connection back in the 'connPort' pool
        // when we're done.
        if (someCmd.Connection != null)
            connPort.Post(someCmd.Connection);
    }
}
The nice thing about using the CCR is that it is trivial to add the following features to this basic piece of code:
Timeout: just make the initial receive (for an available connection) a 'Choice' with a timeout port.
Adjusting the pool size dynamically: to increase the size of the pool, just post a new open SqlConnection to 'connPort'. To decrease the size of the pool, yield a receive on connPort, then close the received connection and throw it away.
Yes, connections are kept open and out of the connection pool. In the above example, the port is the pool.
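For readers without the CCR to hand, the "port is the pool" idea can be sketched with Task-based primitives instead. This is a swapped-in technique, not the CCR API and not the answerer's code: AsyncPool<T>, Post and ReceiveAsync are hypothetical names chosen to mirror the port vocabulary, and the sketch assumes .NET 4.5 or later (for Task.FromResult). A caller that asks for an item when none is available gets an incomplete Task rather than a blocked thread.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical non-blocking pool: items are "posted" in and "received" out.
public sealed class AsyncPool<T>
{
    private readonly Queue<T> _items = new Queue<T>();
    private readonly Queue<TaskCompletionSource<T>> _waiters =
        new Queue<TaskCompletionSource<T>>();
    private readonly object _gate = new object();

    // Adds an item (growing the pool) or returns one after use.
    // If someone is already waiting, the item goes straight to them.
    public void Post(T item)
    {
        TaskCompletionSource<T> waiter = null;
        lock (_gate)
        {
            if (_waiters.Count > 0)
                waiter = _waiters.Dequeue();
            else
                _items.Enqueue(item);
        }
        // Complete outside the lock so continuations never run while it is held.
        if (waiter != null)
            waiter.SetResult(item);
    }

    // Returns a task that completes when an item is available.
    // No thread is blocked while the task is pending.
    public Task<T> ReceiveAsync()
    {
        lock (_gate)
        {
            if (_items.Count > 0)
                return Task.FromResult(_items.Dequeue());
            var tcs = new TaskCompletionSource<T>();
            _waiters.Enqueue(tcs);
            return tcs.Task;
        }
    }
}
```

With connections as the item type, posting a newly opened connection grows the pool, and receiving one and closing it instead of posting it back shrinks it, which is the same sizing scheme described above for connPort.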

WMI and Win32_DeviceChangeEvent - Wrong event type returned?

I am trying to register for a "device added / device removed" event using WMI. When I say device, I mean something along the lines of a Disk-On-Key or any other device that has files on it which I can access...
I am registering for the event, and the event is raised, but the EventType property is different from the one I am expecting to see.
The documentation (MSDN) states: 1 - config change, 2 - device added, 3 - device removed, 4 - docking. For some reason I always get a value of 1.
Any ideas?
Here's sample code :
using System;
using System.Management;
using System.Windows.Forms;

public class WMIReceiveEvent
{
    public WMIReceiveEvent()
    {
        try
        {
            WqlEventQuery query = new WqlEventQuery(
                "SELECT * FROM Win32_DeviceChangeEvent");
            ManagementEventWatcher watcher = new ManagementEventWatcher(query);
            Console.WriteLine("Waiting for an event...");
            watcher.EventArrived +=
                new EventArrivedEventHandler(
                    HandleEvent);
            // Start listening for events
            watcher.Start();
            // Do something while waiting for events
            System.Threading.Thread.Sleep(10000);
            // Stop listening for events
            watcher.Stop();
            return;
        }
        catch (ManagementException err)
        {
            MessageBox.Show("An error occurred while trying to receive an event: " + err.Message);
        }
    }
    private void HandleEvent(object sender,
        EventArrivedEventArgs e)
    {
        // GetPropertyValue is a method, so it takes parentheses, not an indexer
        Console.WriteLine(e.NewEvent.GetPropertyValue("EventType"));
    }
    public static void Main()
    {
        WMIReceiveEvent receiveEvent = new WMIReceiveEvent();
        return;
    }
}
Well, I couldn't find the code. Tried on my old RAC account, nothing. Nothing in my old backups. Go figure. But I tried to work out how I did it, and I think this is the correct sequence (I based a lot of it on this article):
1. Get all drive letters and cache them.
2. Wait for the WM_DEVICECHANGE message, and start a timer with a timeout of 1 second (this is done to avoid a lot of spurious WM_DEVICECHANGE messages that start as soon as you insert the USB key/other device and only end when the drive is "settled").
3. Compare the drive letters with the old cache and detect the new ones.
4. Get device information for those.
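The caching and comparison steps of that sequence can be sketched as follows. This is a minimal illustration rather than the original (lost) code: DriveSnapshot, Capture and Added are hypothetical names, and the WM_DEVICECHANGE handling and the 1-second settle timer are deliberately left out.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class DriveSnapshot
{
    // Capture the names of the drives that exist right now (e.g. "E:\" on Windows).
    public static HashSet<string> Capture()
    {
        return new HashSet<string>(
            DriveInfo.GetDrives().Select(d => d.Name),
            StringComparer.OrdinalIgnoreCase);
    }

    // Return the names present in 'current' but absent from 'cached', sorted.
    public static List<string> Added(IEnumerable<string> cached, IEnumerable<string> current)
    {
        return current
            .Except(cached, StringComparer.OrdinalIgnoreCase)
            .OrderBy(name => name, StringComparer.OrdinalIgnoreCase)
            .ToList();
    }
}
```

When the settle timer fires, the caller would take a fresh Capture(), pass the old cache and the new snapshot to Added, look up each returned name with new DriveInfo(name), and then replace the cache with the new snapshot.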
I know there are other methods, but that proved to be the only one that would work consistently in different versions of windows, and we needed that as my client used the ActiveX control on a webpage that uploaded images from any kind of device you inserted (I think they produced some kind of printing kiosk).
Oh! Yup, I've been through that, but using the raw Windows API calls some time ago, while developing an ActiveX control that detected the insertion of any kind of media. I'll try to unearth the code from my backups and see if I can tell you how I solved it. I'll subscribe to the RSS just in case somebody gets there first.
Well, you can try the Win32_LogicalDisk class and bind it to the __InstanceCreationEvent.
You can easily get the required info that way.
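A minimal sketch of that suggestion, assuming a reference to System.Management and a Windows host. LogicalDiskWatcher, BuildQuery, Start and the 2-second polling interval are hypothetical choices for illustration. The WITHIN clause is needed because __InstanceCreationEvent is detected by polling.

```csharp
using System;
using System.Management; // requires a reference to System.Management

static class LogicalDiskWatcher
{
    // Builds the WQL text; WITHIN gives the polling interval in seconds.
    public static string BuildQuery(int pollSeconds)
    {
        return "SELECT * FROM __InstanceCreationEvent WITHIN " + pollSeconds +
               " WHERE TargetInstance ISA 'Win32_LogicalDisk'";
    }

    // Starts a watcher that reports the DeviceID (e.g. "E:") of each logical
    // disk that appears. The caller should Stop() and Dispose() it when done.
    public static ManagementEventWatcher Start(Action<string> onDiskAdded, int pollSeconds = 2)
    {
        var watcher = new ManagementEventWatcher(new WqlEventQuery(BuildQuery(pollSeconds)));
        watcher.EventArrived += (sender, e) =>
        {
            // TargetInstance holds the Win32_LogicalDisk that was just created.
            var disk = (ManagementBaseObject)e.NewEvent["TargetInstance"];
            onDiskAdded((string)disk["DeviceID"]);
        };
        watcher.Start();
        return watcher;
    }
}
```

For example, LogicalDiskWatcher.Start(id => Console.WriteLine("Added: " + id)) would print the drive letter of each newly inserted volume; swapping __InstanceCreationEvent for __InstanceDeletionEvent in the query should report removals in the same way.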
I tried this on my system and I eventually get the right code. It just takes a while. I get a dozen or so events, and one of them is the device connect code.