Which is more efficient: storing output in a variable or writing output to a file? - powershell

I am using Robocopy in PowerShell to sort through and output millions of filenames older than a user-specified age. My question is this: Is it better to make use of Robocopy's logging feature, then import the log via Get-Content -ReadCount, or would it be better to store Robocopy's output in a variable so that the script doesn't have to write to disk?
I would have to regex either way to get the actual file names. I'm using Robocopy because many of the files have paths longer than 248 chars.
Is one approach preferred over the other? I don't want to miss something that should be obvious.

You can skip all the theory and speculation about the multiple factors in play by measuring how long each method takes using Measure-Command, for example:
Measure-Command {$rc_output = robocopy <arguments>}
Measure-Command {robocopy <arguments> /log:rc.log; Get-Content rc.log [...]}
You'll get output telling you exactly how long each version took, down to the millisecond. Try it out on a small amount of sample data, see which one is quicker, then apply it to your millions of files.

I will add to @mjolinor's comment, and the other comments. To answer the question directly:
Saving information to a variable (and therefore to RAM) is faster than writing directly to disk, but only within limits:
Variables are designed to store small amounts of data (roughly under 10 MB); they are not designed to hold things like entire databases. If the data is large (millions of rows, tens of megabytes or more), disk is always the better choice. The problem is that if you shove a huge amount of information into a variable, you will fill up your RAM, and once your RAM is full things slow down, memory starts paging to disk, and basically everything grinds to a halt, including any commands you are currently running (i.e. Robocopy).
Overall, because you are dealing with millions of rows, my recommendation is to write the output to disk, because your results are likely to take up quite a bit of space, much more than a variable "should" hold.
Now, having said all that and delved into the details of how programs manipulate bits in memory: none of it really matters, because the time spent writing to disk is very small compared to the time it takes to process all the files.
If you are processing 1,000,000 files at a good speed, say 1,000 files per second, the processing takes 1,000 seconds, which is over 16 minutes.
If, let's say, writing to disk is so bad that it costs you 5 files per second, so you process 995 files per second instead, the run takes only about 5 seconds longer. Five seconds is an impact of 0.5%, which is nothing compared to the time the whole process takes.
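For concreteness, here is the same back-of-the-envelope arithmetic as a quick sketch in Python (the throughput figures are assumptions carried over from above, not measurements):

total_files = 1_000_000
fast_rate = 1000          # files per second with output kept in a variable (assumed)
slow_rate = 995           # files per second with output written to a log file (assumed)

fast_seconds = total_files / fast_rate    # 1000 s, a bit under 17 minutes
slow_seconds = total_files / slow_rate    # about 1005 s
extra = slow_seconds - fast_seconds       # about 5 s
print(f"extra time: {extra:.1f} s ({extra / fast_seconds:.1%} overhead)")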
It is much more likely that writing to a variable will cause you far more trouble than writing to disk.

It depends on how much output you're talking about and what your available system resources are. Writing the output to a file and then reading it back in will be faster if the disk I/O time is less than the memory-management overhead of holding it all in memory. You can try it both ways and time it, but I'd try reading it into memory first while monitoring the process with Task Manager. If it starts throwing lots of page faults, that's a clue that you may be better off using the disk as intermediate storage.

Related

What if my mmap virtual memory exceeds my computer’s RAM?

Background and Use Case
I have around 30 GB of data that never changes; specifically, every dictionary of every language.
A client requests the definition of a word, and I simply respond with it.
On every request I have to run a search algorithm of my choice so that I don't have to loop through the more than two hundred million words stored in my .txt file.
If I open the txt file and read it every time I search for a word, it would take forever due to the size of the file (and even breaking that file down into smaller files is neither feasible nor what I want to do).
I came across the concept of mmap, mentioned to me as a possible solution to my problem by a very kind gentleman on discord.
Problem
As I was learning about mmap, I came across the fact that mmap does not store the data in RAM but rather in virtual memory. Regardless of which it is, my server or Docker instances may have no more than 64 GB of RAM, and having that chunk of data take 30 of them is quite painful and makes me feel there must be a better alternative. Even in the worst case, if my server or Docker container does not have enough RAM for the data backing the mmap, then it is not feasible, unless I am wrong about how this works, which is why I am asking this question.
Questions
Is there a better solution for my use case than mmap?
Will accessing such a large amount of data through mmap (so that I don't have to open and read the file on every request) allocate RAM equal to the portion of the file I am accessing?
Lastly, if I was wrong about any specific statement I have made so far, please do correct me, as I am still learning a lot about mmap.
Requirements For My Specific Use Case
I may get a request from one client containing tens of words that I have to look up, so I need to be able to retrieve lots of data from the txt file efficiently.
The response to the client has to be as quick as possible, the quicker the better; ideally less than three seconds, or if that is impossible, then as quick as it can be.
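For illustration only, here is a minimal sketch of how Python's mmap module can map a large file without reading it all into RAM up front (the OS pages in only the parts that are touched). The file name, the one-entry-per-line "word<TAB>definition" layout, and the linear scan are all assumptions rather than details from the question; a real solution would keep a word-to-byte-offset index so each lookup is a seek instead of a scan:

import mmap

with open("dictionary.txt", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        # Naive lookup: find the word at the start of a line and print that entry.
        needle = b"\nserendipity\t"
        pos = mm.find(needle)
        if pos != -1:
            end = mm.find(b"\n", pos + 1)
            if end == -1:
                end = mm.size()
            print(mm[pos + 1:end].decode("utf-8"))
    finally:
        mm.close()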

NetLogo BehaviorSpace memory size constraint

In my model I'm using BehaviorSpace to carry out a number of runs, with variables changing for each run and the output being stored in a *.csv for later analysis. The model runs fine for the first few iterations, but quickly slows as the data grows. My question is: will file-flush, when used in BehaviorSpace, help with this? Or is there a way around it?
Cheers
Simon
Make sure you are using table format output and that spreadsheet output is disabled. At http://ccl.northwestern.edu/netlogo/docs/behaviorspace.html we read:
Note however that spreadsheet data is not written to the results file until the experiment finishes. Since spreadsheet data is stored in memory until the experiment is done, very large experiments could run out of memory. So you should disable spreadsheet output unless you really want it.
Note also:
doing runs in parallel will multiply the experiment's memory requirements accordingly. You may need to increase NetLogo's memory ceiling (see this FAQ entry).
where the linked FAQ entry is http://ccl.northwestern.edu/netlogo/docs/faq.html#howbig
Using file-flush will not help. It flushes any buffered data to disk, but only for a file you opened yourself with file-open, and anyway, the buffer associated with a file is fixed-size, not something that grows over time. file-flush is really only useful if you're reading from the same file from another process during a run.

Perl read file vs traverse array Performance

I need to test lines in a file against multiple values.
What is the difference, in terms of time, between opening the file and reading it line by line each time versus opening the file once, placing it in an array, and traversing the array each time?
To expand upon what @mpacpec said in his comment, file I/O is always slower than memory reads/writes. But there's more to the story. "Test lines in a file against multiple values" can be interpreted in a lot of ways, so without knowing more about what exactly you are trying to do, no one can tell you anything more specific. So the answer is, "it depends". It depends on the file size, what you're testing and how often, and how you're testing.
However, pragmatically speaking, based upon my understanding of what you've said, you'll have to read the whole file one way or another, and you'll have to test every line, one way or another. Do what's easiest to write/read/understand and see if that's fast enough. If it isn't, you have a much more useful baseline from which to ask the question. Personally, I'd start with a line-wise read-and-test loop and work from there, simply because I think that would be easier and faster to write correctly.
Make it work, then make it fast :)
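For example, the line-wise read-and-test loop is only a few lines. This sketch is in Python rather than Perl, purely for illustration, and the file name and test values are made up:

values = ("ERROR", "WARN", "TIMEOUT")   # hypothetical values to test against

with open("input.log") as fh:
    for line in fh:                     # only one line in memory at a time
        if any(v in line for v in values):
            print(line.rstrip())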
Provided that in the former case you can do all the tests you need on each line (rather than re-reading the file each time), the two approaches should be roughly the same in speed and in I/O and CPU efficiency (ignoring second-order effects, such as the disk I/O being interrupted by other processes more easily). However, the latter case, reading the whole file into memory, may hit memory limits for large files, which may cause it to lose performance dramatically or even fail.
The main cost of processing the file line by line is loss of flexibility; for instance, if you need to cross-reference the lines, that would not be easy (whereas if they are all in memory, the code to do it would be simpler and faster).

Data corruption for large data manipulation

I have been running into some very weird data corruption recently.
Basically what I do is:
transfer some large data (50 files, each around 8 GB) from one server to hpcc (high-performance computing) using "scp"
process each line of the input files, and then append/write those modified lines to output files. I do this on hpcc with "qsub -t 1-1000 xxx.sh", that is, submitting all 1000 jobs at the same time. These 1000 jobs use on average 4 GB of memory each.
The basic format of my script is:
f = open(file)
for line in f:
    # process lines
or
f = open(file).readlines()
# process lines
However, the weird part is: from time to time, I see data corruption in some parts of my data.
First, I found that some of my "input" data was corrupted (not ALL of it); so I suspected the problem was with "scp". I asked some computer guys and also posted here, but it seems there is very little chance that 'scp' could corrupt the data.
So I ran "scp" again to transfer my data to hpcc, and this time the input data was fine. Weird, right?
This leads me to wonder: is it possible that input data can be corrupted by being used by memory/CPU-intensive programs?
If the input data is corrupted, it is only natural that the output is also corrupted. OK, so I transferred the input data to hpcc again and checked that all of it was in good shape. I then ran the programs (I should point out: 1000 jobs at once), and the output files... most of them were good; however, very surprisingly, part of just one file was corrupted! So I ran the program again on just that specific file, and got good output without any corruption!
I'm so confused... After seeing so many weird things, my only conclusion is: maybe running many memory-intensive jobs at the same time can harm the data? (But I have run lots of such jobs before, and it seemed OK.)
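One way to take the guesswork out of the transfer step is to checksum each file on both machines and compare the digests. This is only a sketch (Python's hashlib with MD5, and a made-up file name), not something from the original post:

import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Hash a file in 1 MB chunks so an 8 GB file never has to fit in RAM."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Run on both the source server and the hpcc, then compare the two outputs.
print(file_md5("input_01.txt"))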
And by data corruption, I mean:
Something like this:
CTTGTTACCCAGTTCCAAAG9583gfg1131CCGGATGCTGAATGGCACGTTTACAATCCTTTAGCTAGACACAAAAGTTCTCCAAGTCCCCACCAGATTAGCTAGACACAGAGGGCTGGTTGGTGCATCT0/1
gfgggfgggggggggggggg9583gfg1131CCGGAfffffffaedeffdfffeffff`fffffffffcafffeedffbfbb[aUdb\``ce]aafeeee\_dcdcWe[eeffd\ebaM_cYKU]\a\Wcc0/1
CTTGTTACCCAGTTCCAAAG9667gfg1137CCGGATCTTAAAACCATGCTGAGGGTTACAAA1AGAAAGTTAACGGGATGCTGATGTGGACTGTGCAAATCGTTAACATACTGAAAACCTCT0/1
gfgggfgggggggggggggg9667gfg1137CCGGAeeeeeeeaeeb`ed`dadddeebeeedY_dSeeecee_eaeaeeeeeZeedceadeeXbd`RcJdcbc^c^e`cQ]a_]Z_Z^ZZT^0/1
However it should be like:
#HWI-ST150_0140:6:2204:16666:85719#0/1
TGGGCTAAAAGGATAAGGGAGGGTGAAGAGAGGATCTGGGTGAACACACAAGAGGCTTAAAGCATTTTATCAAATCCCAATTCTGTTTACTAGCTGTGTGA
+HWI-ST150_0140:6:2204:16666:85719#0/1
gggggggggggggggggfgggggZgeffffgggeeggegg^ggegeggggaeededecegffbYdeedffgggdedffc_ffcffeedeffccdffafdfe
#HWI-ST150_0140:6:2204:16743:85724#0/1
GCCCCCAGCACAAAGCCTGAGCTCAGGGGTCTAGGAGTAGGATGGGTGGTCTCAGATTCCCCATGACCCTGGAGCTCAGAACCAATTCTTTGCTTTTCTGT
+HWI-ST150_0140:6:2204:16743:85724#0/1
ffgggggggfgeggfefggeegfggggggeffefeegcgggeeeeebddZggeeeaeed[ffe^eTaedddc^Oacccccggge\edde_abcaMcccbaf
#HWI-ST150_0140:6:2204:16627:85726#0/1
CCCCCATAGTAGATGGGCTGGGAGCAGTAGGGCCACATGTAGGGACACTCAGTCAGATCTATGTAGCTGGGGCTCAAACTGAAATAAAGAATACAGTGGTA

How to efficiently process 300+ files concurrently in Scala

I'm going to work on comparing around 300 binary files using Scala, byte by byte, 4 MB each. However, judging from what I've already done, processing 15 files at the same time using java.BufferedInputStream took me around 90 seconds on my machine, so I don't think my solution would scale well to a large number of files.
Ideas and suggestions are highly appreciated.
EDIT: The actual task is not just finding the differences but processing those files in the same sequential order. Let's say I have to look at the i-th byte in every file at the same time, then move on to byte (i + 1).
Did you notice your hard drive slowly evaporating as you read the files? Reading that many files in parallel is not something mechanical hard drives are designed to do at full speed.
If the files will always be this small (4MB is plenty small enough), I would read the entire first file into memory, and then compare each file with it in series.
I can't comment on solid-state drives, as I have no first-hand experience with their performance.
You are quite screwed, indeed.
Let's see... 300 * 4 MB = 1.2 GB. Does that fit your memory budget? If it does, by all means read them all into memory. But, to speed things up, you might try the following:
Read 512 KB of every file, sequentially. You might try reading from 2 to 8 at the same time -- perhaps through Futures, and see how well it scales. Depending on your I/O system, you may gain some speed by reading a few files at the same time, but I do not expect it to scale much. EXPERIMENT! BENCHMARK!
Process those 512 KB using Futures.
Go back to step 1, unless you are finished with the files.
Get the result back from the processing Futures.
On step number 1, by limiting the parallel reads you avoid thrashing your I/O subsystem. Push it as far as you can, maybe a bit less than that, but definitely not more than that.
By not reading all files on step number 1, you use some of the time spent reading these files doing useful CPU work. You may experiment with lowering the bytes read on step 1 as well.
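As a rough illustration of that loop, here is a sketch in Python using concurrent.futures rather than Scala Futures; the chunk size, the number of parallel reads, and the equality check standing in for "processing" are all assumptions to benchmark, not a definitive implementation:

from concurrent.futures import ThreadPoolExecutor

CHUNK = 512 * 1024    # 512 KB per file per pass
PAR_READS = 4         # limit concurrent reads so the disk is not thrashed

def identical(paths):
    handles = [open(p, "rb") for p in paths]
    try:
        with ThreadPoolExecutor(max_workers=PAR_READS) as pool:
            while True:
                # Step 1: read the next chunk of every file, a few files at a time.
                chunks = list(pool.map(lambda fh: fh.read(CHUNK), handles))
                if not any(chunks):
                    return True                      # all files ended together
                # Step 2: "process" the chunks; here, just check they all match.
                if any(c != chunks[0] for c in chunks[1:]):
                    return False
    finally:
        for fh in handles:
            fh.close()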
Are the files exactly the same number of bytes? If they are not, the files can be compared simply via the File.length() method to determine a first-order guess of equality.
Of course you may be wanting to do a much deeper comparison than just "are these files the same?"
If you are just looking to see whether they are the same, I would suggest using a hashing algorithm like SHA1 to see if they match.
Here is some java source to make that happen
Many large systems that handle data use SHA1, including the NSA and git.
It's simply more efficient to use a hash instead of a byte-by-byte compare. The hashes can also be stored for later to see if the data has been altered.
Here is a talk by Linus Torvalds specifically about git, it also mentions why he uses SHA1.
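A sketch of that idea in Python (hashlib's sha1; grouping the 300 files by digest is my assumption about how you would use it, not part of the answer above):

import hashlib
from collections import defaultdict

def sha1_of(path, chunk_size=1 << 20):
    h = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def group_by_digest(paths):
    groups = defaultdict(list)
    for p in paths:
        groups[sha1_of(p)].append(p)   # files sharing a digest are (almost certainly) identical
    return groups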
I would suggest using NIO if possible. Introduction To Java NIO and NIO2 seems like a decent guide to using NIO if you are not familiar with it. I would not suggest reading a file and doing a comparison byte by byte, if that is what you are currently doing. You can create a ByteBuffer to read in chunks of data from a file and then do comparisons from that.