NetLogo BehaviorSpace memory size constraint

In my model I'm using BehaviorSpace to carry out a number of runs, with variables changing for each run and the output being stored in a *.csv file for later analysis. The model runs fine for the first few iterations, but quickly slows as the data grows. My question is: will file-flush help when used in BehaviorSpace? Or is there another way around it?
Cheers
Simon

Make sure you are using table format output and that spreadsheet format is disabled. At http://ccl.northwestern.edu/netlogo/docs/behaviorspace.html we read:
Note however that spreadsheet data is not written to the results file until the experiment finishes. Since spreadsheet data is stored in memory until the experiment is done, very large experiments could run out of memory. So you should disable spreadsheet output unless you really want it.
Note also:
doing runs in parallel will multiply the experiment's memory requirements accordingly. You may need to increase NetLogo's memory ceiling (see this FAQ entry).
where the linked FAQ entry is http://ccl.northwestern.edu/netlogo/docs/faq.html#howbig
Using file-flush will not help. It flushes any buffered data to disk, but only for a file you opened yourself with file-open, and anyway, the buffer associated with a file is fixed-size, not something that grows over time. file-flush is really only useful if you're reading from the same file from another process during a run.

Related

Will the depth of a file within a filesystem change the time taken to copy it?

I am trying to figure out whether the depth of a file in a filesystem changes the amount of time it takes to execute a "cp" bash command on that file.
By depth I mean how many parent directories it's contained in.
I tried running a few tests, but my results are pretty inconclusive, and when I try to answer logically, I can think of reasons why it could go either way.
What is the purpose of this?
Provided nothing is cached, the deeper the directory tree the more data has to be read from storage to get to the file - you have to find the name of the second dir, then the third within the second and so on. On the other hand if the file is big, the time needed to do this can be negligible in comparison.
Also, the mere startup of a command like cp is not without its cost.
If you are interested in how file systems work read this free book: http://www.nobius.org/~dbg/practical-file-system-design.pdf
Performance is a complicated subject, especially so when hard media is involved. Without proper understanding of how this works and proper understanding of statistics, you can't perform a correct test.
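A rough way to run such a test is sketched below in Python. It is an illustration under stated assumptions, not a rigorous benchmark: the function names, directory names, and file size are made up for the example, it copies in-process with shutil.copyfile (so the startup cost of cp is excluded), and the file has just been written, so caches are warm and the directory-lookup cost described above will mostly be hidden.

```python
import shutil
import tempfile
import time
from pathlib import Path


def time_copy_at_depth(root: Path, depth: int, size_bytes: int, repeats: int = 5) -> float:
    """Return the best-of-`repeats` wall time (seconds) to copy a file nested `depth` directories deep."""
    nested = root / f"depth_{depth}"
    for level in range(depth):
        nested = nested / f"d{level}"  # hypothetical directory names
    nested.mkdir(parents=True, exist_ok=True)

    src = nested / "payload.bin"
    src.write_bytes(b"\0" * size_bytes)
    dst = root / f"copy_{depth}.bin"

    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        shutil.copyfile(src, dst)
        best = min(best, time.perf_counter() - start)
    return best


def compare_depths(depths=(1, 10, 40), size_bytes=1_000_000):
    """Time the same-sized copy at several depths inside a throwaway directory."""
    with tempfile.TemporaryDirectory() as tmp:
        return {d: time_copy_at_depth(Path(tmp), d, size_bytes) for d in depths}


if __name__ == "__main__":
    for depth, seconds in compare_depths().items():
        print(f"depth {depth:>3}: {seconds * 1e3:.3f} ms")
```

Taking the best of several repeats reduces noise from other processes, but it does not make the cold-cache case visible; for that the caches would have to be dropped between runs, which is system-specific.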

Best Time Series Format for Querying and Converting to Matlab (HDF5)

I have a somewhat unique problem that looks similar to the one discussed here:
https://news.ycombinator.com/item?id=8368509
I have a high-speed traffic analysis box that is capturing at about 5 Gbps, and picking out specific packets from this to save into some format in a C++ program. Each day there will probably be 1-3 TB written to disk. Since it's network data, it's all time series down to the nanosecond level, but it would be fine to save it at second or millisecond level and have another application sort the embedded higher-resolution timestamps afterwards. My problem is deciding which format to use. My requirements are:
Be able to write to disk at about 50 MB/s continuously with several different timestamped parameters.
Be able to export chunks of this data into MATLAB (HDF5).
Query this data once or twice a day for analytics purposes
Another nice-to-have that's not a hard requirement:
There will be 4 of these boxes running independently, and it would be nice to query across all of them and combine data if possible. I should mention all 4 of these boxes are in physically different locations, so there is some overhead in sharing data.
The second one is something I cannot change because of legacy applications, but I think the first is more important. The types of queries I may want to export into MATLAB are something like "Pull metric X between time Y and Z", so this would eventually have to go into an HDF5 format. There is an external library called MatIO that I can use to write MATLAB files if needed, but it would be even better if there wasn't a translation step. I have read the entire thread mentioned above, and several options stand out: kdb+, Cassandra, PyTables, and OpenTSDB. All of these seem to do what I want, but I can't really figure out how easy it would be to get the data into the MATLAB HDF5 format, or whether any of these would make it harder than the others.
If anyone has experience doing something similar, it would be a big help. Thanks!
A KDB+ tickerplant is certainly capable of capturing data at that rate; however, there are a number of things you need to check, whatever solution you pick:
Do the machine(s) that are capturing the data have enough cores? It's best to taskset a tickerplant, for example, to a core that nothing else will contend with.
Similarly with disk (SSD): be sure there is no contention on the bus.
Separate the workload: you can write different types of data (maybe packets can be partitioned by source or stream?) to different CPUs/disks/tickerplant processes.
Basically, there are lots of ways you can cut this. I can say, though, that with the appropriate hardware KDB+ could do the job. However, given you want HDF5, it's probably even better to have a simple process capturing the data and writing/converting to disk on the fly.
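To make the "simple process" idea concrete, here is a minimal Python sketch of an append-only, timestamped log with a "pull metric X between time Y and Z" query. Everything in it is an assumption for illustration: the record layout (a uint64 nanosecond timestamp plus one float64 value), the function names, and the one-file-per-metric design. It is not HDF5 and not how kdb+ stores data; a real capture process would be in C++ and would convert the selected range to HDF5 as a separate step.

```python
import struct
from bisect import bisect_left, bisect_right

# Hypothetical fixed-width record: uint64 timestamp (ns) + float64 metric value.
RECORD = struct.Struct("<Qd")


def append_samples(path, samples):
    """Append (timestamp_ns, value) pairs; assumes timestamps arrive in non-decreasing order."""
    with open(path, "ab") as out:
        for timestamp_ns, value in samples:
            out.write(RECORD.pack(timestamp_ns, value))


def query_range(path, start_ns, end_ns):
    """Return all samples with start_ns <= timestamp <= end_ns.

    Reads the whole file for brevity; with terabytes per day the files would
    need to be split by time window and searched by seeking instead.
    """
    with open(path, "rb") as src:
        data = src.read()
    records = list(RECORD.iter_unpack(data))
    timestamps = [timestamp_ns for timestamp_ns, _ in records]
    lo = bisect_left(timestamps, start_ns)
    hi = bisect_right(timestamps, end_ns)
    return records[lo:hi]
```

Fixed-width records keep the writer trivial and make range queries a binary search, at the cost of needing a separate file (or a wider record) per metric.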

Which is more efficient, storing output in variable or output to file?

I am using Robocopy in PowerShell to sort through and output millions of filenames older than a user-specified age. My question is this: Is it better to make use of Robocopy's logging feature, then import the log via Get-Content -ReadCount, or would it be better to store Robocopy's output in a variable so that the script doesn't have to write to disk?
I would have to regex either way to get the actual file names. I'm using Robocopy because many of the files have paths longer than 248 chars.
Is one way more preferred than the other? Don't want to miss something that should be considered obvious.
You can skip all the theory and speculation about the multiple factors in play by measuring how long each method takes using Measure-Command, for example:
Measure-Command {$rc_output = robocopy <arguments>}
Measure-Command {robocopy <arguments> /log:rc.log; Get-Content rc.log [...]}
You'll get output telling you exactly how long each version took, down to the millisecond. Try it out on a small amount of sample data, see which one is quicker, then apply it to your millions of files.
I will add to @mjolinor's comment, and the other comments. To answer the question directly:
Saving information to a variable (and therefore to RAM) is faster than writing directly to disk, but only when the amount of data is small:
Variables are meant to hold small (<10 MB) amounts of data, not things like entire databases. If the data is large (i.e. millions of rows, tens of megabytes or more), then disk is the better choice. The problem is that if you shove a ton of information into a variable, you will fill up your RAM, and once your RAM is full, things slow down, memory starts being paged to disk, and basically everything stops working, including any commands that you are currently running (i.e. Robocopy).
Overall, because you are dealing with millions of rows, my recommendation is to write it to disk, because your results are likely to take up quite a bit of space, much more than a variable "should" hold.
Now, after saying all that and delving into the details of how programs manipulate bits in memory, it all doesn't really matter, because the time spent writing things to disk is very very small compared to the amount of time that it takes to process all the files.
If you are processing 1,000,000 files at a good speed, say 1,000 files a second, then it will take 1,000 seconds to process them. That means it takes over 16 minutes to run through all the files.
Let's say writing to disk is bad and slows you down by 5 files per second, so you process 995 files a second instead. The run will then take only about 5 seconds longer. 5 seconds is an impact of about 0.5%, which is nothing compared to the time it takes to run the whole process.
It is much more likely that writing to a variable will cause more trouble than writing to disk.
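The arithmetic above can be checked with a few lines of Python. The file count and rates are the ones used in the example; the function name is just for illustration.

```python
def run_time_seconds(total_files: int, files_per_second: float) -> float:
    """Total run time when files are processed at a constant rate."""
    return total_files / files_per_second


total_files = 1_000_000
baseline = run_time_seconds(total_files, 1_000)  # no slowdown from disk writes
slowed = run_time_seconds(total_files, 995)      # 5 files/second slower

extra = slowed - baseline
overhead_pct = 100 * extra / baseline

print(f"baseline: {baseline:.0f} s ({baseline / 60:.1f} min)")
print(f"slowed:   {slowed:.1f} s")
print(f"extra:    {extra:.2f} s ({overhead_pct:.2f}%)")
```

The extra time comes out at roughly 5 seconds on a run of about 1,000 seconds, i.e. around 0.5%.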
It depends on how much output you're talking about, and what your available system resources are. It will be faster to write the output to a file and then read it back in if the disk I/O time is less than the additional memory-management overhead of holding everything in memory. You can try it both ways and time it, but I'd try reading it into memory first while monitoring it with Task Manager. If it starts throwing lots of page faults, that's a clue that you may be better off using the disk as intermediate storage.

Perl read file vs traverse array Performance

I need to test lines in a file against multiple values.
What is the difference in time between opening a file and reading it line by line each time, versus opening the file once, placing it in an array, and traversing the array each time?
To expand upon what @mpacpec said in his comment, file I/O is always slower than memory reads/writes. But there's more to the story. "Test lines in a file against multiple values" can be interpreted in a lot of ways, so without knowing more about what exactly you are trying to do, no one can tell you anything more specific. So the answer is, "It depends". It depends on the file size, what you're testing and how often, and how you're testing.
However, pragmatically speaking, based upon my understanding of what you've said, you'll have to read the whole file one way or another, and you'll have to test every line, one way or another. Do what's easiest to write/read/understand, and see if that's fast enough. If it isn't, you have a much more useful baseline from which to ask the question. Personally, I'd start with a linewise read and test loop and work from there, simply because I think that'd be easier and faster to write correctly.
Make it work, then make it fast :)
Provided in the former case you can do all the tests you need on each line (rather than re-reading file each time), then the two approaches should be roughly the same speed and I/O, CPU efficiency (ignoring second-order effects such as whether the disk IO gets distracted by other processes more easily). However, the latter case - reading whole file - may hit memory limits for large files, which may cause it to lose performance dramatically or even fail.
The main cost of processing the file line by line is loss of flexibility - for instance if you need to cross-reference the lines, it would not be easy (whilst if they are all in memory, the code to do that would be simpler and faster).
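The two approaches can be sketched side by side. The question is about Perl, but the example is in Python purely for illustration; the function names are made up, and "testing a line against a value" is assumed here to mean a simple substring check.

```python
def matches_streaming(path, values):
    """Single pass over the file: read one line at a time and test it against every value."""
    hits = {value: [] for value in values}
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            for value in values:
                if value in line:  # assumed test: substring match
                    hits[value].append(number)
    return hits


def matches_in_memory(path, values):
    """Load all lines into a list once, then traverse the list once per value."""
    with open(path, encoding="utf-8") as handle:
        lines = handle.readlines()  # whole file held in memory
    return {
        value: [number for number, line in enumerate(lines, start=1) if value in line]
        for value in values
    }
```

Both functions return the matching line numbers per value and read the file exactly once; the streaming version keeps only one line in memory, while the in-memory version keeps every line available for cross-referencing.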

Data corruption for large data manipulation

I have recently had some very weird data corruption trouble.
Basically what I do is:
Transfer some large data (50 files, each around 8 GB) from one server to hpcc (high-performance computing) using "scp".
Process each line of the input files, then append/write the modified lines to output files. I do this on hpcc with "qsub -t 1-1000 xxx.sh", which submits all 1000 jobs at the same time. These 1000 jobs use 4 GB of memory each on average.
The basic format of my script is:
with open(file) as f:
    for line in f:
        pass  # process the line here
or
with open(file) as f:
    lines = f.readlines()
# process lines
However, the weird part is that, from time to time, I see data corruption in some parts of my data.
First, I found that some (not all) of my input data was corrupted, so I suspected "scp". I asked some computer people and also posted here, but it seems there is very little chance that "scp" can distort the data.
So I used "scp" to transfer my data to hpcc again, and this time the input data was fine. Weird, right?
This makes me wonder: is it possible that input data can be disrupted by running memory- and CPU-intensive programs that use it?
If the input data is corrupted, it's natural that the output is corrupted too. So I transferred the input data to hpcc again and checked that all of it was in good shape. I then ran the programs (I should point out: 1000 jobs together), and most of the output files were good; however, very surprisingly, some portion of just one file was corrupted! So I re-ran the program on its own for that specific file, and got good output without any corruption!
I'm so confused. After seeing so many weird things, my only conclusion is: maybe running many memory-intensive jobs at the same time harms the data? (But I have run lots of such jobs before, and it seemed OK.)
By data corruption, I mean something like this:
CTTGTTACCCAGTTCCAAAG9583gfg1131CCGGATGCTGAATGGCACGTTTACAATCCTTTAGCTAGACACAAAAGTTCTCCAAGTCCCCACCAGATTAGCTAGACACAGAGGGCTGGTTGGTGCATCT0/1
gfgggfgggggggggggggg9583gfg1131CCGGAfffffffaedeffdfffeffff`fffffffffcafffeedffbfbb[aUdb\``ce]aafeeee\_dcdcWe[eeffd\ebaM_cYKU]\a\Wcc0/1
CTTGTTACCCAGTTCCAAAG9667gfg1137CCGGATCTTAAAACCATGCTGAGGGTTACAAA1AGAAAGTTAACGGGATGCTGATGTGGACTGTGCAAATCGTTAACATACTGAAAACCTCT0/1
gfgggfgggggggggggggg9667gfg1137CCGGAeeeeeeeaeeb`ed`dadddeebeeedY_dSeeecee_eaeaeeeeeZeedceadeeXbd`RcJdcbc^c^e`cQ]a_]Z_Z^ZZT^0/1
However, it should look like this:
@HWI-ST150_0140:6:2204:16666:85719#0/1
TGGGCTAAAAGGATAAGGGAGGGTGAAGAGAGGATCTGGGTGAACACACAAGAGGCTTAAAGCATTTTATCAAATCCCAATTCTGTTTACTAGCTGTGTGA
+HWI-ST150_0140:6:2204:16666:85719#0/1
gggggggggggggggggfgggggZgeffffgggeeggegg^ggegeggggaeededecegffbYdeedffgggdedffc_ffcffeedeffccdffafdfe
@HWI-ST150_0140:6:2204:16743:85724#0/1
GCCCCCAGCACAAAGCCTGAGCTCAGGGGTCTAGGAGTAGGATGGGTGGTCTCAGATTCCCCATGACCCTGGAGCTCAGAACCAATTCTTTGCTTTTCTGT
+HWI-ST150_0140:6:2204:16743:85724#0/1
ffgggggggfgeggfefggeegfggggggeffefeegcgggeeeeebddZggeeeaeed[ffe^eTaedddc^Oacccccggge\edde_abcaMcccbaf
@HWI-ST150_0140:6:2204:16627:85726#0/1
CCCCCATAGTAGATGGGCTGGGAGCAGTAGGGCCACATGTAGGGACACTCAGTCAGATCTATGTAGCTGGGGCTCAAACTGAAATAAAGAATACAGTGGTA