Fast Mass File Copying in Scala with Akka

I am wondering what the best way would be to copy a file from src to dest in Scala, wrapped in an Akka actor and possibly using a RemoteActor across several machines.
I have a tremendous number of image files I have to copy from one directory to an NFS-mounted directory.
I haven't done much file handling in Java or Scala, but I know there is the NIO lib and some others that have been worked on since Scala 2.7. I'm looking for whatever is safest and quickest.
I should probably give some idea of my infrastructure as well. The connection is 1000 Mbit/s and goes via a Cisco 3560 from an Isilon node to a Windows 2003 Server. The Isilon node is the NFS mount and the Windows 2003 Server is a heavily configured Samba (CIFS) mount.

You probably can't beat the underlying OS file copy speed, so if the files are large or you can batch them, you're probably best off writing a shell script with Scala and then calling it with bash or somesuch. Chances are that one thread can saturate the disk IO, so there really isn't anything fancy to do. If the images are large, you'll be waiting for the 50ish MB/s limit on your disk (or 10ish MB/s limit on your 100 Mbps ethernet); if they're small, you'll be waiting for the quite-some-dozens of ms overhead on file seeks and network ping times and so on.
That said, you can use Apache Commons IO, which has a file copy utility, and a previous question has a high-performance answer among the top-rated entries. You can have one actor handle all the copying tasks, and that should be as fast as having a bunch of actors all competing for the same limited IO bandwidth.
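A rough sketch of that single-copier setup, assuming classic Akka actors and Commons IO on the classpath (the directory paths below are placeholders, not from the question):

    import java.io.File
    import akka.actor.{Actor, ActorSystem, Props}
    import org.apache.commons.io.FileUtils

    // One message per file to copy.
    case class Copy(src: File, dest: File)

    // A single actor serialises the copies, so nothing competes for the disk/NFS link.
    class Copier extends Actor {
      def receive = {
        case Copy(src, dest) => FileUtils.copyFile(src, dest) // Commons IO buffered copy
      }
    }

    object CopyAll extends App {
      val system = ActorSystem("copy")
      val copier = system.actorOf(Props[Copier], "copier")
      // Placeholder paths: local image directory and the NFS/CIFS mount point.
      val srcDir  = new File("/data/images")
      val destDir = new File("/mnt/isilon/images")
      for (f <- srcDir.listFiles() if f.isFile)
        copier ! Copy(f, new File(destDir, f.getName))
    }

The point is simply that the actor's mailbox serialises the work, so the NFS/CIFS link sees one sequential stream of writes instead of many competing ones.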

Related

What would happen if you deploy PostgreSQL with HDFS as the filesystem in a high-load scenario?

It's a deliberately stupid question. But I'm just curious - what would happen if I mount HDFS using FUSE binding as a volume and launch PostgreSQL with a cluster stored on this HDFS volume and start writing massive amounts of data and/or do high-intensity reading?
First of all, I don't think it's a stupid question. With that said, let's establish some definitions and continue from there:
Fuse:
FUSE is a userspace filesystem framework. It consists of a kernel module, a userspace library, and a mount utility (fusermount).
HDFS (Hadoop Distributed File System):
A file system that is distributed amongst many networked computers or nodes. HDFS is fault-tolerant because it stores multiple replicas of files on the file system; the default replication level is 3.
So I think a short version of your question, @Gill Bates, is: does HDFS affect the performance of a Postgres DB (assuming, of course, that the Postgres cluster is stored on HDFS)?
The short answer is: it depends on your configuration, but likely yes. As mentioned above, you can think of HDFS as a file system, and of course Postgres stores its data in the file system, so it will be affected by the file system you are using. If you perform lots of read/write operations, one of the great advantages of a distributed file system like HDFS is that it supports multiple replicas of files, which considerably reduces the common bottleneck of many clients accessing a single file, so that may help you scale better.
So answering your question directly: what happens if I start writing massive amounts of data and/or do high-intensity reading?
Regardless of whether your file system is HDFS or not (HDFS may help you scale better and at the same time adds fault tolerance to your file system), the parameters that most directly determine how well your DB responds under stress tests are:
Indexing
Partitioning
Checkpoints
VACUUM, ANALYZE (with FILLFACTOR)
Query definitions
And of course it depends on your stack too (how well your server/host is provisioned). Based on my experience, these are the factors that affect your Postgres DB the most (attached below are some links that may help clarify further 👌).
DB performance and file system
Tuning a Postgres DB
Hope the above helps to clarify! 👍

How to download multiple files from one server and upload them to another concurrently in a microservice architecture (concurrent piping)

I'm writing a (scala/jvm) microservice that is part of a CI solution.
Its job is to download built artifacts from some external build tool in the cloud and upload them to repositories from which they can be consumed, such as a Docker registry or Maven-style repositories like Nexus.
There are many, many such files for each build, and many, many such builds all the time, so the problem to solve is one of scale.
My microservice is integrated with an event queue (kafka), so it's easy to asynchronously assign tasks to workers.
I'm looking for the best way to manage my resources (cluster nodes, threads, IO, memory, storage) to handle the download and upload of all the files, preferably without storing the entire content of each file locally in a file or in memory, but just piping from the source server to the target server.
I'm not sure what the best approach is for writing the pipe code itself, or how best to use the workers.
I was thinking of dispatching an event per file to pipe, and having each worker pipe that one file by performing a GET on the input server, a POST on the target server, and creating an in-memory pipe between the streams with some buffer.
In this scenario the input and target servers could have different transfer speeds, and I'm not sure whether that's a problem or not. I think this should be solved by TCP/IP at the OS level, with nothing for me to handle at the application level. I think if I use different thread pools for the download client and the upload client I can expect decent use of non-blocking IO to perform the pipe.
Alternatively, I could do something else entirely: some sort of producer/consumer setup where some workers download files while others upload them. That means more storage (shared storage at that) and a custom configuration for this microservice, which I'm not excited about.
Any other suggestions/insights are also welcome.
The eventual solution should (hopefully) be robust, scalable and as simple as possible.
Are you positive the source cloud service is not going to offer an "export file to Nexus" solution in the near/medium future? If one is coming, maybe your solution does not have to be fully efficient.
I would look at FS2 for this job https://github.com/functional-streams-for-scala/fs2/blob/series/1.0/README.md#example
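The FS2 example in that README is the idiomatic way to do it. Purely to illustrate the underlying idea of piping a GET response straight into a POST body through a fixed-size buffer, here is a blocking sketch with plain java.net (the URLs and buffer size are placeholders, and this is not FS2):

    import java.io.{InputStream, OutputStream}
    import java.net.{HttpURLConnection, URL}

    // Streams one artifact from sourceUrl to targetUrl without ever holding the whole
    // file in memory or on disk; only bufSize bytes are buffered at any moment.
    def pipeFile(sourceUrl: String, targetUrl: String, bufSize: Int = 64 * 1024): Unit = {
      val in: InputStream = new URL(sourceUrl).openStream()
      val conn = new URL(targetUrl).openConnection().asInstanceOf[HttpURLConnection]
      conn.setRequestMethod("POST")
      conn.setDoOutput(true)
      conn.setChunkedStreamingMode(bufSize) // avoid buffering the request body internally
      val out: OutputStream = conn.getOutputStream
      try {
        val buf = new Array[Byte](bufSize)
        Iterator.continually(in.read(buf)).takeWhile(_ != -1).foreach(n => out.write(buf, 0, n))
        out.flush()
        require(conn.getResponseCode / 100 == 2, s"upload failed with HTTP ${conn.getResponseCode}")
      } finally {
        out.close(); in.close(); conn.disconnect()
      }
    }

A slower target naturally backpressures the source here, because out.write blocks until the upload connection drains, which matches the TCP-level behaviour you describe. FS2 (or Akka Streams) gives you the same property with non-blocking IO and proper resource handling.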

mongodb flushing mmap takes around 20 secs with no updates being required

Hi, one of our customers is running MongoDB v2.2.3 on 64-bit Windows Server 2008 R2 Enterprise.
We're currently seeing mmap flush times of over 20 seconds every minute.
What is confusing me is that it isn't doing any writes to the disk. (Disk write bytes is next to 0)
Our program, which accesses the data, has been temporarily turned off,
so all that is connected is a mongo shell.
mongostat and mongotop aren't showing anything.
The database has 130 million records. There are 356 files for mmap.
Any suggestions on what could be causing this?
Thanks
If your working set is significantly larger than memory, and MongoDB is constantly going to disk for reads (and not just the normal spikes when syncing writes to disk), then you really should be sharding to spread the data across multiple machines/instances.
Given the behaviour you have described and that you have a large number of files for mmap, I suspect the underlying performance issue is SERVER-12401 in the MongoDB Jira issue tracker:
On Windows, Memory Mapped File flushes are synchronous operations. When the OS Virtual Memory Manager is asked to flush a memory mapped file, it makes a synchronous write request to the file cache manager in the OS. This causes large I/O stalls on Windows systems with high Disk IO latency, while on Linux the same writes are asynchronous.
There are a few possible ways to improve the flush performance on Windows, including code changes in both the MongoDB server and the Windows O/S. There is some ongoing work to address these issues, now that the synchronous flushing behaviour on Windows has been confirmed.
If you are using higher latency local storage (for example, spinning disks) you may be able to mitigate the issue by upgrading to SSD or better spec'd drives.
I would suggest upvoting/watching SERVER-12401 and the related Jira issues for updates.
It would also be worth upgrading from MongoDB 2.2 to a newer version as 2.2 is now past end-of-life for updates. There have been two major production release branches since then, including significant improvements in general performance/features as well as Windows support.

Load distribution to instances of a perl script running on each node of a cluster

I have a perl script (call it worker) installed on each node/machine (4 total) of a cluster (each running RHEL). The script itself is configured as a RedHat Cluster service (which means the RH cluster manager would ensure that one and exactly one instance of this script is running as long as at least one node in the cluster is up).
I have X amount of work to be done once a day, every day, which this script does. So far X has been small enough that one instance of the script could handle it. But now the load is going to increase, and along with high availability (already implemented using RHCS), I also need load distribution.
Question is how do I do that?
Of course I have a way to split the work in n parts of size X/n each. Options I had in mind:
Create a new load distributor, which splits the work into jobs of X/n each, AND one of the following:
Create a named pipe on the network file system (which is mounted and visible on all nodes) and post all jobs to the pipe. Make each worker script on each node read (atomically) from the pipe and do the work. OR
Make each worker script on each node listen on a TCP socket, and have the load distributor send jobs to these sockets in a round-robin (or some other) fashion.
Theoretical problem with #1 is that we've observed some nasty latency problems with NFS. And I'm not even sure if NFS would support IPC via named pipes across machines.
Theoretical problem with #2 is that I have to implement some monitors to ensure that each worker is running and listening, which, being a noob at Perl, I'm not sure is easy enough.
I personally prefer the load distributor creating a pool and workers pulling from it, rather than the load distributor tracking each worker and pushing work to each. Any other options?
I'm open to new ideas as well. :)
Thanks!
-- edit --
using Perl 5.8.8, to be precise: This is perl, v5.8.8 built for x86_64-linux-thread-multi
If you want to keep it simple, use a database to store the jobs and then have each worker lock the table, grab the jobs it needs, then unlock and let the next worker do its thing. This isn't the most scalable solution since you'll have lock contention, but with just 4 nodes it should be fine.
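Your workers are Perl (DBI would follow the same SQL flow), but just to sketch the claim-a-job pattern, here is an illustrative JVM-side version using a transaction with SELECT ... FOR UPDATE rather than an explicit table lock; the connection string and the jobs(id, payload, status) table are made up for illustration:

    import java.sql.DriverManager

    // Illustrative sketch only: assumes a jobs(id, payload, status) table and a JDBC
    // driver on the classpath; the real workers would issue the same SQL from Perl DBI.
    object ClaimJob extends App {
      val conn = DriverManager.getConnection("jdbc:postgresql://dbhost/jobs", "worker", "secret")
      conn.setAutoCommit(false)
      try {
        // FOR UPDATE locks the selected row so no other worker can claim the same job.
        val rs = conn.createStatement().executeQuery(
          "SELECT id FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1 FOR UPDATE")
        if (rs.next()) {
          val id = rs.getLong("id")
          val upd = conn.prepareStatement("UPDATE jobs SET status = 'running' WHERE id = ?")
          upd.setLong(1, id)
          upd.executeUpdate()
          conn.commit()   // releases the lock; this worker now owns the claimed job
        } else {
          conn.rollback() // nothing pending
        }
      } finally {
        conn.close()
      }
    }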
But if you start going down this road it might make sense to look at a dedicated job-queue system like Gearman.

How to grab a full memory dump of a process with large memory usage

I am hosting IIS-based web service applications on a 64-bit Windows 2008 system running on a quad-core, 8 GB machine. We ran into a couple of instances where w3wp was running at 7.6 GB of memory usage and nothing else on the system, including RDP, was responding. Right-clicking the process in Task Manager and creating a dump froze the system and all its threads for a long time (close to 30 minutes). When the freeze-up occurred during off hours, we let the dump run for a while (close to 1 hour), but the dump still didn't complete. In the interest of getting the system back up, we had to kill IIS.
We tried other tools like procexp, DebugDiag, etc. to create a full memory dump, and all had the same results.
So, what tool does the community use to grab dump files quickly? Or without freezing all the threads? I realize the latter might be a rhetorical question. But what are the options for generating such a large dump file without locking up the system for a long time?
IMO you shouldn't have to wait until the process memory grows to 8 GB. I am sure with something like 3 - 4 GB you should be able to detect the memory leak.
ProcDump has an option based on a memory threshold:
-m Memory commit threshold in MB at which to create a dump of the process.
I would use this option to dump the memory of the process.
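For example, something like "procdump -ma -m 4000 w3wp.exe" should write a full dump (-ma) once w3wp's commit charge crosses roughly 4 GB; the 4000 MB threshold is just an illustration, and it's worth checking ProcDump's built-in help for the exact syntax of your version.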
An SSD would also help write the dump faster.
WPA, a.k.a. xperf (http://msdn.microsoft.com/en-us/performance/cc825801.aspx), is a powerful tool for diagnosing applications. You will get the call stack of the culprit allocation. You don't have to collect a dump, it is non-invasive, and it does not put much load on production systems.
Complete step-by-step information is available here: http://msdn.microsoft.com/en-us/library/ff190906(v=VS.85).aspx