Reading Multiple Files in Multiple Threads using C#, Slow! - c#-3.0

I have an Intel Core 2 Duo CPU, and I was reading 3 files from my C: drive and showing some matching values from the files in an EditBox on screen. The whole process takes 2 minutes. Then I thought of processing each file in a separate thread, and now the whole process takes 2 minutes 30 seconds, i.e. 30 seconds more than single-threaded processing! I was expecting the other way around! I can see both graphs in the CPU usage history. Can someone please explain to me what is going on?
Here is my code snippet:
foreach (FileInfo file in FileList)
{
    Thread t = new Thread(new ParameterizedThreadStart(ProcessFileData));
    t.Start(file.FullName);
}
where ProcessFileData is the method that processes the files.
Thanks!

The root of the problem is that the files are on the same drive and, unlike your dual core processor, your hard drive can only do one thing at a time.
If you read two files simultaneously, the disk heads will jump from one file to the other and back again. Given that your hard drive can read each file in roughly 40 seconds, it now has the additional overhead of moving its disk head between the three separate files many times during the read.
The fastest way to read multiple files from a single hard drive is to do it all in one thread and read them one after another. This way, the head only moves once per file read (at the very beginning) and not multiple times per read.
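As a minimal sketch of that single-threaded version (using the question's FileList; ProcessLine is a hypothetical stand-in for the matching logic):

foreach (FileInfo file in FileList)
{
    using (StreamReader reader = new StreamReader(file.FullName))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            ProcessLine(line);   // hypothetical stand-in for the matching logic
        }
    }
}

This way the disk streams each file from start to finish before the head ever moves to the next one.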
To optimize this process, you'll need to either change your logic (do you really need to read the whole contents of all three files?), purchase a faster hard drive, put the three files on three different hard drives and use threading, or use a RAID array.

If you read from disk using multiple threads, then the disk heads will bounce around from one part of the disk to another as each thread reads from a different part of the drive. That can reduce throughput significantly, as you've seen.
For that reason, it's actually often a better idea to have all disk accesses go through a single thread, to help minimize disk seeks.
If your task is I/O bound and if it needs to run often, you might look at a tool like "contig" to make sure the layout of your files on disk is optimized / contiguous.

If your processing is mostly I/O-bound rather than CPU-bound, it makes sense that it takes the same time or even more.
How do you compare those files? You should think about what the bottleneck of your application is: I/O (input/output), CPU, memory...
Multithreading is only interesting for CPU-bound processing, i.e. complex calculations, comparison of data in memory, sorting, etc.

Since your process is I/O-bound, you should let the OS do your threading for you. Look at FileStream.BeginRead() for an example of how to queue up your reads. Your EndRead() callback can kick off the request for the next block of data, pointing back at itself so that it handles each subsequent completed block.
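Here is a minimal sketch of that pattern, assuming a hypothetical ProcessBlock in place of the real matching logic; each completed EndRead queues the next BeginRead pointing back at the same handler:

using System;
using System.IO;

class AsyncFileReader
{
    private readonly FileStream _stream;
    private readonly byte[] _buffer = new byte[64 * 1024];

    public AsyncFileReader(string path)
    {
        // the final 'true' opens the file for asynchronous (overlapped) I/O
        _stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                 FileShare.Read, _buffer.Length, true);
    }

    public void Start()
    {
        _stream.BeginRead(_buffer, 0, _buffer.Length, OnBlockRead, null);
    }

    private void OnBlockRead(IAsyncResult ar)
    {
        int bytesRead = _stream.EndRead(ar);
        if (bytesRead == 0)                     // end of file
        {
            _stream.Close();
            return;
        }
        ProcessBlock(_buffer, bytesRead);       // hypothetical matching logic
        // queue the next read, pointing back at this same handler
        _stream.BeginRead(_buffer, 0, _buffer.Length, OnBlockRead, null);
    }

    private void ProcessBlock(byte[] data, int count) { /* scan the block */ }
}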
Also, when you create additional threads, the OS has to manage more threads. And if a different CPU happens to get picked to handle the completed read, you've lost all of the CPU caching where your thread originated.
As you've found, you can't "speed up" an application just by adding threads.

Related

Scala concurrency performance issues

I have a data mining app.
There is 1 Mining Actor which receives and processes a JSON payload containing 1000 objects. I put these into a list, and for each element I log the data by sending it to 1 Logger Actor, which logs the data into many files.
Processing the list sequentially, my app uses 700 MB and takes ~15 seconds at 20% CPU to process (4-core CPU). When I parallelize the list, my app uses 2 GB and takes ~ the same amount of time and CPU to process.
My questions are:
Since I parallelized the list and thus the computation, shouldn't the compute-time decrease?
I think having only one Logger Actor is a bottleneck in this case. The computation may be faster but the bottleneck hides the speed increase. So if I add more Loggers to the pool, the app time should decrease?
Why does the memory usage jump to 2GB? Does the JVM have to store the entire collection in memory to parallelize it? And after the computation is done, the JVM garbage collector should deal with it?
Without more details, any answer is a guess. However, even a guess might point you in the right direction.
Parallelized execution should decrease the running time, but your problem might lie elsewhere. For some reason, your CPU is idling a lot even in the single-threaded mode. You do not specify whether you read the input from disk or the network, or where you write your output to. You explicitly say that you write logs to a lot of files. Disk and network reading/writing might in your case take much longer than the data processing. Most probably your process is idle due to this I/O waiting. You should not expect any speedups from parallelizing a job that spends 80% of its time waiting on I/O. I therefore also suspect that the loggers are not the bottleneck here.
The memory usage might jump if your threads allocate a lot of memory each. In that case, the more threads you have the more memory will be required. I don't know what kind of collection you are parallelizing on, but most are stored in memory, completely. Yes, the garbage collector will free any resources that do not require you to explicitly free them, such as files.
How many threads for reading and writing to the hard disk?
The memory increases because I send messages faster than the Logger can write, so the Mailbox balloons in size until the Logger has processed the messages and the GC kicks in.
I solved this by writing state to a protocol buffer file. Before doing any writes, I compare with the protobuf file, because reads are significantly cheaper than writes. My resource usage is now 10% CPU for 2 seconds, and less than 400 MB of RAM.
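To illustrate the ballooning mechanism, here is a hedged sketch in C# rather than the poster's Scala (the class name and the capacity of 1000 are made up for the example; the poster's actual fix was the protobuf comparison above): with an unbounded queue the producer outruns the logger and memory grows, while a bounded queue makes the producer block instead.

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class BoundedMailboxDemo
{
    static void Main()
    {
        // capacity 1000 is an arbitrary illustration value; an unbounded
        // collection here would reproduce the ballooning described above
        BlockingCollection<string> mailbox = new BlockingCollection<string>(1000);

        Task logger = Task.Run(() =>
        {
            foreach (string record in mailbox.GetConsumingEnumerable())
                Console.WriteLine(record);      // stands in for the slow file writes
        });

        for (int i = 0; i < 1000000; i++)
            mailbox.Add("record " + i);         // blocks while the queue is full

        mailbox.CompleteAdding();               // lets the consumer loop finish
        logger.Wait();
    }
}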

Run-time memory of a Perl script

I have a Perl script which is killed by an automated job whenever a high-priority process comes along, because my script runs ~300 parallel jobs for downloading data and consumes a lot of memory. I want to figure out how much memory it takes at run time so that I can ask for more memory before scheduling the script, or, if some tool can show me which portion of my code takes the most memory, I can optimize the code.
Regarding OP's comment on the question, if you want to minimize memory use, definitely collect and append the data one row/line at a time. If you collect all of it into a variable at once, that means you need to have all of it in memory at once.
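A minimal sketch of the row-at-a-time idea (a C# sketch rather than the questioner's Perl, purely to show the shape of the loop; both file paths are hypothetical):

using System.IO;

class RowAtATime
{
    static void Append(string inputPath, string outputPath)
    {
        using (StreamReader reader = new StreamReader(inputPath))
        using (StreamWriter writer = new StreamWriter(outputPath, true)) // true = append
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                writer.WriteLine(line);   // only one line is ever held in memory
        }
    }
}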
Regarding the question itself, you may want to look into whether it's possible to have the Perl code just run once (rather than running 300 separate instances) and then fork to create your individual worker processes. When you fork, the child processes will share memory with the parent much more efficiently than is possible for unrelated processes, so you will, e.g., only need to have one copy of the Perl binary in memory rather than 300 copies.

Perl threads faster than sequential processing?

Just wanted to ask whether it's true that parallel processing is faster than sequential processing.
I've always thought that parallel processing is faster, so I ran an experiment.
I benchmarked my scripts and found out that after running a bunch of
sub add {
    for ( my $x = 0; $x <= 200000; $x++ ) {
        $data[$x] = $x / ( $x + 2 );
    }
}
threading seems to be slower by about 0.5 CPU seconds on average. Is this normal, or is sequential processing really faster?
Whether parallel vs. sequential processing is better is highly task-dependent and you've already done the right thing: You benchmarked both and determined for your task (the one you benchmarked, not necessarily the one you actually want to do) which one is faster.
As a general rule, on a single processor, sequential processing tends to be better for tasks which are CPU-bound, because if you have two tasks each needing five seconds of CPU time to complete, then you'll need ten seconds of CPU time regardless of whether you do them sequentially or in parallel. Setting up multiple threads/processes will, therefore, provide no benefit, but it will create additional task-switching overhead while also preventing you from having any results until all results are available.
CPU-bound tasks on a multi-processor system tend to do better when run in parallel, provided that they can run independently of each other. If not, or if you're using a language/threading model/IPC model/etc. which forces all tasks to run on the same processor, then see "on a single processor" above.
Parallel processing is generally better for tasks which are I/O-bound, regardless of the number of processors available, because CPUs are fast and I/O is slow, so working in parallel allows one task to process its data while the other is waiting for I/O operations to complete. (This is why make -j2 tends to be significantly faster than a plain make, even on single-processor machines.)
But, again, these are all generalities and all have cases where they'll be incorrect. Only benchmarking will reveal the truth with certainty.
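As a quick demonstration of the multi-processor, CPU-bound case (a hedged C# sketch, not the questioner's Perl; the arithmetic mirrors the question's benchmark loop):

using System;
using System.Diagnostics;
using System.Threading.Tasks;

class CpuBoundDemo
{
    static double Work(int x) { return x / (double)(x + 2); }

    static void Main()
    {
        const int N = 20000000;
        double[] results = new double[N];

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < N; i++) results[i] = Work(i);
        Console.WriteLine("sequential: " + sw.ElapsedMilliseconds + " ms");

        sw.Restart();
        Parallel.For(0, N, i => { results[i] = Work(i); });
        Console.WriteLine("parallel:   " + sw.ElapsedMilliseconds + " ms");
    }
}

On a multi-core machine the parallel run should win; on a single core it will not, for exactly the reasons given above.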
Perl threads are extremely sucky. You are better off in every case forking several processes.
When you create a new thread in Perl, it does the following:
Make a copy - yes, a real copy - of every single Perl data structure in scope, including those belonging to modules you didn't write
Start up what is almost a new, independent instance of Perl in a new OS thread
If you then want to share anything (as it has now copied everything), you have to use the share function from the threads::shared module. This is incredibly sucky, as it replaces your variable with some tie() nonsense, which adds much-too-fine-grained locking around it to prevent concurrent access. Accessing a shared variable then causes a massive amount of implicit locking, and is incredibly slow.
So, in short, Perl threads:
Take a long time to start
Waste loads of memory
Cannot share data efficiently anyway
You are much better off with fork(), which does not copy every variable (the kernel does copy-on-write) unless you're on Windows.
There's no reason to assume that in a single CPU core system, parallel processing will be faster.
Consider this illustration (a PNG in the original answer, not reproduced here):
The red and blue lines at the top represent two tasks running sequentially on a single core.
The alternating red and blue segments at the bottom represent the same two tasks running in parallel on a single core.
Either way the core does the same total amount of work, so interleaving the tasks cannot finish sooner.

What is the fastest way to read 10 GB file from the disk?

We need to read and count different types of messages and run some statistics on a 10 GB text file, e.g. a FIX engine log. We use Linux, 32-bit, 4 CPUs, Intel, coding in Perl, but the language doesn't really matter.
I have found some interesting tips in Tim Bray's WideFinder project. However, we've found that using memory mapping is inherently limited by the 32-bit architecture.
We tried using multiple processes, which seems to work faster if we process the file in parallel using 4 processes on 4 CPUs. Adding multithreading slows it down, maybe because of the cost of context switching. We tried changing the size of the thread pool, but that is still slower than the simple multi-process version.
The memory mapping part is not very stable; sometimes it takes 80 seconds and sometimes 7 seconds on a 2 GB file, maybe from page faults or something related to virtual memory usage. Anyway, mmap cannot scale beyond 4 GB on a 32-bit architecture.
We tried Perl's IPC::Mmap and Sys::Mmap. Looked into MapReduce as well, but the problem is really I/O-bound; the processing itself is sufficiently fast. So we decided to try to optimize the basic I/O by tuning buffering size, type, etc.
Can anyone who is aware of an existing project where this problem was efficiently solved in any language/platform point to a useful link or suggest a direction?
Most of the time you will be I/O-bound, not CPU-bound, so just read this file through normal Perl I/O and process it in a single thread. Unless you can prove that you can do more I/O than your single CPU can keep up with, don't waste your time on anything more. Anyway, you should ask: why on Earth is this all in one huge file? Why on Earth don't they split it up in a reasonable way when they generate it? That would make the work an order of magnitude more worthwhile. Then you could put the pieces on separate I/O channels and use more CPUs (if you don't use some sort of RAID 0 or NAS or ...).
Measure, don't assume. Don't forget to flush the caches before each test. Remember that serialized I/O is an order of magnitude faster than random I/O.
This all depends on what kind of preprocessing you can do, and when.
On some of the systems we have, we gzip such large text files, reducing them to 1/5 to 1/7 of their original size. Part of what makes this possible is that we don't need to process these files until hours after they're created, and at creation time we don't really have any other load on the machines.
Processing them is done more or less in the fashion of zcat thosefiles | ourprocessing (well, it's done over Unix sockets, with a custom-made zcat). It trades CPU time for disk I/O time, and for our system that has been well worth it. There are, of course, a lot of variables that can make this a very poor design for a particular system.
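In managed code the same pipeline looks roughly like the following sketch (the .gz path and ProcessLine are hypothetical; GZipStream decompresses while reading, trading CPU time for disk I/O as described):

using System;
using System.IO;
using System.IO.Compression;

class ZcatStyleReader
{
    static void Main()
    {
        using (FileStream raw = File.OpenRead("big.log.gz"))       // hypothetical path
        using (GZipStream gz = new GZipStream(raw, CompressionMode.Decompress))
        using (StreamReader reader = new StreamReader(gz))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                ProcessLine(line);   // hypothetical per-line statistics
        }
    }

    static void ProcessLine(string line) { /* count/match here */ }
}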
Perhaps you've already read this forum thread, but if not:
http://www.perlmonks.org/?node_id=512221
It describes using Perl to do it line-by-line, and the users seem to think Perl is quite capable of it.
Oh, is it possible to process the file from a RAID array? If you have several mirrored disks, then the read speed can be improved. Competition for disk resources may be why your multi-threaded attempt doesn't work.
Best of luck.
I wish I knew more about the content of your file, but knowing nothing other than that it is text, this sounds like an excellent MapReduce kind of problem.
PS: the fastest read of any file is a linear read. cat file > /dev/null should run at the speed the file can be read.
Have you thought of streaming the file and filtering out any interesting results to a secondary file? (Repeat until you have a file of manageable size.)
Basically you need to "divide and conquer": if you have a network of computers, copy the 10 GB file to as many client PCs as possible and get each client PC to read a different offset of the file. For an added bonus, get each PC to implement multithreading in addition to the distributed reading.
Parse the file once, reading line by line. Put the results in a table in a decent database. Run as many queries as you wish. Feed the beast regularly with new incoming data.
Realize that manipulating a 10 GB file, transferring it across the (even if local) network, and exploring complicated solutions all take time.
I have a co-worker who sped up his FIX reading by going to 64-bit Linux. If it's something worthwhile, drop a little cash to get some fancier hardware.
Hmmm, but what's wrong with the read() command in C? It usually has a 2 GB limit, so just call it 5 times in sequence. That should be fairly fast.
If you are I/O bound and your file is on a single disk, then there isn't much to do. A straightforward single-threaded linear scan across the whole file is the fastest way to get the data off of the disk. Using large buffer sizes might help a bit.
If you can convince the writer of the file to stripe it across multiple disks / machines, then you could think about multithreading the reader (one thread per read head, each thread reading the data from a single stripe).
Since you said platform and language doesn't matter...
If you want stable performance that is as fast as the source medium allows, the only way I am aware of to do this on Windows is with overlapped, non-OS-buffered, aligned sequential reads. You can probably get to some GB/s with two or three buffers; beyond that, at some point you need a ring buffer (one writer, 1+ readers) to avoid any copying. The exact implementation depends on the driver/APIs.
If there's any memory copying going on in the thread dealing with the I/O (both in kernel and user mode), then obviously the larger the buffer to copy, the more time is wasted on that rather than on doing the I/O. So the optimal buffer size depends on the firmware and driver. On Windows, good values to try are multiples of 32 KB for disk I/O.
Windows file buffering, memory mapping, and all that stuff add overhead. They are only good if you are doing multiple reads of the same data, random access, or both. So for reading a large file sequentially a single time, you don't want the OS to buffer anything or do any memcpys. If you're using C#, there are also penalties for calling into the OS due to marshaling, so the interop code may need a bit of optimization unless you use C++/CLI.
Some people prefer throwing hardware at problems, but if you have more time than money, in some scenarios it's possible to optimize things to perform 100-1000x better on a single consumer-level computer than on 1000 enterprise-priced computers. The reason is that if the processing is also latency-sensitive, going beyond using two cores is probably adding latency. This is why drivers can push gigabytes per second while enterprise software ends up stuck at megabytes per second by the time it's all done. Whatever reporting, business logic, and such the enterprise software does can probably also be done at gigabytes per second on a two-core consumer CPU, if written like you were back in the 80's writing a game. The most famous example I've heard of approaching their entire business logic in this manner is the LMAX forex exchange, which published some of their ring-buffer-based code, said to be inspired by network card drivers.
Forgetting all the theory: if you are happy with < 1 GB/s, one possible starting point on Windows I've found is looking at the readfile source from winimage, unless you want to dig into the SDK/driver samples. It may need some source code fixes to calculate performance correctly at SSD speeds. Experiment with buffer sizes as well.
The switches /h (multi-threaded) and /o (overlapped, completion-port I/O) with an optimal buffer size (try 32, 64, 128 KB, etc.), using no Windows file buffering, in my experience give the best performance when reading from an SSD (cold data) while simultaneously processing (use /a for Adler processing, as otherwise it's too CPU-bound).
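For a managed starting point: FILE_FLAG_NO_BUFFERING is not exposed through the documented FileOptions values, so this hedged C# sketch only approximates the idea with the SequentialScan hint and a large buffer (try other multiples of 32 KB, as suggested above):

using System;
using System.IO;

class SequentialScanReader
{
    static void Main(string[] args)
    {
        const int BufferSize = 4 * 32 * 1024;   // 128 KB; tune per drive and firmware
        using (FileStream stream = new FileStream(
                   args[0], FileMode.Open, FileAccess.Read, FileShare.Read,
                   BufferSize, FileOptions.SequentialScan))
        {
            byte[] buffer = new byte[BufferSize];
            long total = 0;
            int n;
            while ((n = stream.Read(buffer, 0, buffer.Length)) > 0)
                total += n;                      // replace with real processing
            Console.WriteLine(total + " bytes read");
        }
    }
}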
I seem to recall a project in which we were reading big files. Our implementation used multithreading: basically, n worker threads started at incrementing offsets of the file (0, chunk_size, 2 × chunk_size, ... (n-1) × chunk_size) and read smaller chunks of information. I can't exactly recall our reasoning for this, as someone else designed the whole thing; the workers weren't the only thing to it, but that's roughly how we did it. A sketch of the idea follows.
Hope it helps
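A minimal C# sketch of that scheme (assuming chunk boundaries don't need to fall on line breaks; ReadChunk and the worker count are illustrative):

using System;
using System.IO;
using System.Threading;

class ChunkedReader
{
    static void Main(string[] args)
    {
        string path = args[0];
        const int Workers = 4;
        long fileLength = new FileInfo(path).Length;
        long chunkSize = (fileLength + Workers - 1) / Workers;

        Thread[] threads = new Thread[Workers];
        for (int w = 0; w < Workers; w++)
        {
            long offset = w * chunkSize;   // fresh locals so the lambda captures per-iteration values
            long length = Math.Min(chunkSize, fileLength - offset);
            threads[w] = new Thread(() => ReadChunk(path, offset, length));
            threads[w].Start();
        }
        foreach (Thread t in threads) t.Join();
    }

    static void ReadChunk(string path, long offset, long length)
    {
        // each worker opens its own handle so seeks don't interfere
        using (FileStream fs = new FileStream(path, FileMode.Open,
                   FileAccess.Read, FileShare.Read))
        {
            fs.Seek(offset, SeekOrigin.Begin);
            byte[] buffer = new byte[64 * 1024];
            long remaining = length;
            while (remaining > 0)
            {
                int n = fs.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
                if (n == 0) break;
                remaining -= n;   // hand buffer[0..n) to this worker's parser
            }
        }
    }
}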
It's not stated in the problem whether sequence really matters or not. So:
divide the file into equal parts, say 1 GB each; since you are using multiple CPUs, multiple threads won't be a problem, so read each part using a separate thread, and use RAM of capacity > 10 GB so that all the contents can be held in RAM and read by multiple threads.

How can I handle multiple sockets within a Perl daemon with large memory usage?

I have created a client-server program in Perl using IO::Socket::INET. I access the server through a CGI-based site. My server program runs as a daemon and accepts multiple simultaneous connections. My server process consumes about 100 MB of memory (9 large hashes, many arrays...). I want these hashes to reside in memory and be shared so that I don't have to create them for every connection. Hash creation takes 10-15 seconds.
Whenever a new connection is accepted through the sockets, I fork a new process to take care of the processing for that connection. Since the parent process is huge, every time I fork, the processor tries to allocate memory for the new child, but due to limited memory it takes a long time to spawn a new child, thereby increasing the response time. It often hangs even for a single connection.
The parent process creates 9 large hashes. For each child, I need to refer to one or more hashes in read-only mode; I will not update the hashes from the child. I want to use something like copy-on-write, by which I can share the whole 100 MB, or all the global variables created by the parent, with all the children, or some other mechanism like threads. I expect the server to get a minimum of 100 requests per second, and it should be able to process all of them in parallel. On average, a child will exit in 2 seconds.
I am using Cygwin on Windows XP with only 1 GB of RAM. I am not finding any way to overcome this issue. Can you suggest something? How can I share variables, and also create 100 child processes per second, manage them, and synchronize them?
Thanks.
Instead of forking, there are two other approaches for handling concurrent connections: threads, or a polling approach.
In the thread approach, a new thread is created for each connection and handles the I/O of that socket. A thread runs in the same virtual memory as the creating process and can access all of its data. Make sure to properly use locks to synchronize write access to your data.
An even more efficient approach is to use polling via select(). In this case a single process/thread handles all sockets. This works under the assumption that most of the work is I/O and that the time spent waiting for I/O requests to finish can be spent handling other sockets.
Go research those two options further and decide which one suits you best; there is a sketch of the polling approach after the link below.
See for example: http://www.perlfect.com/articles/select.shtml
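For concreteness, here is a hedged sketch of the polling approach using C#'s Socket.Select (the same select() idea the Perl article demonstrates; the port, timeout, and buffer size are arbitrary):

using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Sockets;

class PollingServer
{
    static void Main()
    {
        Socket listener = new Socket(AddressFamily.InterNetwork,
                                     SocketType.Stream, ProtocolType.Tcp);
        listener.Bind(new IPEndPoint(IPAddress.Any, 9000));   // port is an assumption
        listener.Listen(64);

        List<Socket> clients = new List<Socket>();
        byte[] buffer = new byte[4096];

        while (true)
        {
            // Select prunes this list down to the sockets that are readable
            List<Socket> readable = new List<Socket>(clients);
            readable.Add(listener);
            Socket.Select(readable, null, null, 1000000);     // 1-second timeout

            foreach (Socket s in readable)
            {
                if (s == listener)
                {
                    clients.Add(listener.Accept());           // new connection
                }
                else
                {
                    int n = s.Receive(buffer);
                    if (n == 0) { clients.Remove(s); s.Close(); }  // peer hung up
                    else HandleRequest(s, buffer, n);
                }
            }
        }
    }

    static void HandleRequest(Socket s, byte[] data, int count)
    {
        // stands in for a read-only lookup in the in-memory hashes
        s.Send(data, count, SocketFlags.None);
    }
}

One process owns the 100 MB of hashes, and every connection is served from it, so nothing is ever copied or re-created.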
If you have that much data, I wonder why you don't simply use a database?
This architecture is unsuitable for Cygwin. Forking on real Unix systems is cheap, but on fake Unix systems like Cygwin it's terribly expensive, because all the data has to be copied (real Unixes use copy-on-write). Using threads changes the memory usage pattern (higher base usage, but a smaller increase per thread), but odds are it will still be inefficient.
I would advise you to use a single-process approach with polling, and maybe non-blocking I/O too.