Data corruption for large data manipulation - server

I have recently run into some very strange data corruption.
Basically, what I do is:
1. Transfer some large data (50 files, each around 8 GB) from one server to the HPCC (high-performance computing) using "scp".
2. Process each line of the input files, then append/write the modified lines to output files. I do this on the HPCC with "qsub -t 1-1000 xxx.sh", that is, submitting all 1000 jobs at the same time. These 1000 jobs use about 4 GB of memory each on average.
The basic format of my script is:
f = open(file)
for line in f:
    pass  # process the line
or
lines = open(file).readlines()
for line in lines:
    pass  # process the line
However, the weird part is that, from time to time, I see data corruption in some parts of my data.
First, I found that some (not all) of my input data was corrupted, so I suspected "scp". I asked some computer people and also posted here, but it seems very unlikely that "scp" can distort the data.
So I used "scp" to transfer my data to the HPCC again, and this time the input data was OK. Weird, right?
This makes me wonder: is it possible that input data can be damaged by being read by memory- or CPU-intensive programs?
If the input data is corrupted, it is natural that the output is also corrupted. So I transferred the input data to the HPCC again, checked that all of it was in good shape, and then ran my programs (I should point out that I ran all 1000 jobs together). Most of the output files were good; however, very surprisingly, some portion of just one file was corrupted! I then reran the program for that specific file on its own and got good output without any corruption!
I'm so confused... After seeing so many weird things, my only conclusion is: maybe running many memory-intensive jobs at the same time can harm the data? (But I have run lots of such jobs before, and it seemed OK.)
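One way to tell a transfer problem apart from later damage would be to record a checksum of each file on the source server, then compare it on the HPCC right after "scp" and again after the jobs finish. A minimal sketch, assuming Python is available on both machines; the function names and the chunk size are illustrative and not part of my actual scripts:

```python
import hashlib

def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read chunk by chunk so an 8 GB file never has to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def files_match(path_a, path_b):
    """Return True if both files have the same SHA-256 digest."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)
```

If the digest changes between "right after scp" and "after the jobs", the transfer itself is probably not the culprit.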
And by data corruption, I mean:
Something like this:
CTTGTTACCCAGTTCCAAAG9583gfg1131CCGGATGCTGAATGGCACGTTTACAATCCTTTAGCTAGACACAAAAGTTCTCCAAGTCCCCACCAGATTAGCTAGACACAGAGGGCTGGTTGGTGCATCT0/1
gfgggfgggggggggggggg9583gfg1131CCGGAfffffffaedeffdfffeffff`fffffffffcafffeedffbfbb[aUdb\``ce]aafeeee\_dcdcWe[eeffd\ebaM_cYKU]\a\Wcc0/1
CTTGTTACCCAGTTCCAAAG9667gfg1137CCGGATCTTAAAACCATGCTGAGGGTTACAAA1AGAAAGTTAACGGGATGCTGATGTGGACTGTGCAAATCGTTAACATACTGAAAACCTCT0/1
gfgggfgggggggggggggg9667gfg1137CCGGAeeeeeeeaeeb`ed`dadddeebeeedY_dSeeecee_eaeaeeeeeZeedceadeeXbd`RcJdcbc^c^e`cQ]a_]Z_Z^ZZT^0/1
However it should be like:
#HWI-ST150_0140:6:2204:16666:85719#0/1
TGGGCTAAAAGGATAAGGGAGGGTGAAGAGAGGATCTGGGTGAACACACAAGAGGCTTAAAGCATTTTATCAAATCCCAATTCTGTTTACTAGCTGTGTGA
+HWI-ST150_0140:6:2204:16666:85719#0/1
gggggggggggggggggfgggggZgeffffgggeeggegg^ggegeggggaeededecegffbYdeedffgggdedffc_ffcffeedeffccdffafdfe
#HWI-ST150_0140:6:2204:16743:85724#0/1
GCCCCCAGCACAAAGCCTGAGCTCAGGGGTCTAGGAGTAGGATGGGTGGTCTCAGATTCCCCATGACCCTGGAGCTCAGAACCAATTCTTTGCTTTTCTGT
+HWI-ST150_0140:6:2204:16743:85724#0/1
ffgggggggfgeggfefggeegfggggggeffefeegcgggeeeeebddZggeeeaeed[ffe^eTaedddc^Oacccccggge\edde_abcaMcccbaf
#HWI-ST150_0140:6:2204:16627:85726#0/1
CCCCCATAGTAGATGGGCTGGGAGCAGTAGGGCCACATGTAGGGACACTCAGTCAGATCTATGTAGCTGGGGCTCAAACTGAAATAAAGAATACAGTGGTA
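To find out where a file stops looking like the expected format, a small structural check over the 4-line records could be run before and after the jobs. This is only a sketch under assumptions: it takes each record to be header, sequence, "+" line, quality line, as in the good sample above; `header_prefix` defaults to "#" to match that sample (standard FASTQ files use "@"); and the function name and the allowed alphabet are hypothetical.

```python
from itertools import zip_longest

def find_bad_records(lines, header_prefix="#", alphabet="ACGTN"):
    """Return (line_number, reason) for every 4-line record that looks malformed."""
    problems = []
    stripped = (line.rstrip("\r\n") for line in lines)
    # Group the stream into 4-line records without loading the whole file.
    records = zip_longest(*[stripped] * 4)
    for index, (header, sequence, separator, quality) in enumerate(records):
        line_number = index * 4 + 1  # 1-based line number of the record's first line
        if quality is None:
            problems.append((line_number, "truncated record"))
        elif not header.startswith(header_prefix):
            problems.append((line_number, "bad header"))
        elif not separator.startswith("+"):
            problems.append((line_number, "bad separator"))
        elif set(sequence) - set(alphabet):
            problems.append((line_number, "unexpected characters in sequence"))
        elif len(sequence) != len(quality):
            problems.append((line_number, "sequence/quality length mismatch"))
    return problems
```

It can be called as `find_bad_records(open(path))`; since it reads line by line, it should cope with files of this size.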

Google Dataprep (Cloud Dataprep by Trifacta) tip: jobs will not be able to run if they are too large

During my cloud dataprep adventures I have come across yet another very annoying bug.
The problem occurs when creating complex flow structures that need to be connected through reference datasets. If the number of unions or joins performed with these sets crosses a certain limit, Dataflow is unable to start the job.
I have had a lot of contact with support and they are working on the issue:
"Our Systems Engineer Team was able to determine the root cause resulting into the failed job. They mentioned that the job is too large. That means that the recipe (combined from all datasets) is too big, and Dataflow rejects it. Our engineering team is still investigating approaches to address this.
A workaround is to split the job into two smaller jobs. The first run the flow for the data enrichment, and then use the output as input in the other flow. While it is not ideal, this would be a working solution for the time being."
I ran into the same problem and have a fairly educated guess as to the answer. Keep in mind that Dataprep simply takes all your GUI-based inputs and translates them into Apache Beam code. When you pass in a reference dataset, it probably writes some Apache Beam code that turns the reference dataset into a side input (https://beam.apache.org/documentation/programming-guide/). Dataflow will then perform a Parallel Do (ParDo) function, where it takes each element from a PCollection, stuffs it into a worker node, and then applies the side-input data for the transformation.
So I am pretty sure that if the reference sets get too big (which can happen with joins), the underlying code will take an element from dataset A and pass it to a function with side input B, but if side input B is very big, it won't fit into the worker's memory. Take a look at the Stackdriver logs for your job to investigate whether this is the case. If you see 'GC (Allocation Failure)' in your logs, this is a sign of not enough memory.
You can try doing this: suppose you have two CSV files to read in and process, file A is 4 GB and file B is also 4 GB. If you kick off a job to perform some type of Join, it will very quickly outgrow the worker memory and puke. If you CAN, see if you can pre-process in a way where one of the files is in the MB range and just grow the other file.
If your data structures don't lend themselves to that option, you could do what the Sys Engs suggested, split one file up into many small chunks and then feed it to the recipe iteratively against the other larger file.
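The idea behind both suggestions can be sketched in plain Python: keep the small dataset fully in memory as a lookup table and stream the large one through it in bounded chunks. This is not Dataprep or Apache Beam code, only an illustration of the shape of the workaround; the function names, the column names, and the chunk size are assumptions.

```python
import csv
import io

def load_lookup(small_csv, key):
    """Read the small file fully into a dict keyed on the join column."""
    return {row[key]: row for row in csv.DictReader(small_csv)}

def join_in_chunks(large_csv, lookup, key, chunk_size=1000):
    """Yield lists of joined rows, each list holding at most chunk_size rows."""
    chunk = []
    for row in csv.DictReader(large_csv):
        match = lookup.get(row[key])
        if match is not None:  # inner join: rows without a match are dropped
            chunk.append({**match, **row})
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:
        yield chunk

# Usage with in-memory stand-ins for the two files.
small = io.StringIO("id,name\n1,a\n2,b\n")
large = io.StringIO("id,value\n1,10\n3,30\n2,20\n1,11\n")
for joined in join_in_chunks(large, load_lookup(small, "id"), "id", chunk_size=2):
    print(joined)
```

Only the lookup table and one chunk are held in memory at a time, which is why it matters that one side of the join stays small.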
Another option to test is specifying the compute type for the workers. You can iteratively grow the compute type larger and larger to see if it finally pushes through.
The other option is to code it all up yourself in Apache Beam, test locally, then port to Google Cloud DataFlow.
Hopefully these guys fix the problem soon; they don't make it easy to ask them questions, that's for sure.

Best Time Series Format for Querying and Converting to Matlab (HDF5)

I have a somewhat unique problem that looks similar to the problem here:
https://news.ycombinator.com/item?id=8368509
I have a high-speed traffic analysis box that is capturing at about 5 Gbps, and picking out specific packets from this to save into some format in a C++ program. Each day there will probably be 1-3 TB written to disk. Since it's network data, it's all time series down to the nanosecond level, but it would be fine to save it at second or millisecond level and have another application sort the embedded higher-resolution timestamps afterwards. My problem is deciding which format to use. My requirements are:
1. Be able to write to disk at about 50 MB/s continuously with several different timestamped parameters.
2. Be able to export chunks of this data into MATLAB (HDF5).
3. Query this data once or twice a day for analytics purposes.
Another nice thing that's not a hard requirement is:
There will be 4 of these boxes running independently, and it would be nice to query across all of them and combine data if possible. I should mention all 4 of these boxes are in physically different locations, so there is some overhead in sharing data.
The second one is something I cannot change because of legacy applications, but I think the first is more important. The types of queries I may want to export into MATLAB are something like "Pull metric X between time Y and Z", so this would eventually have to go into an HDF5 format. There is an external library called MatIO that I can use to write MATLAB files if needed, but it would be even better if there wasn't a translation step. I have read the entire thread mentioned above, and there are many options that appear to stand out: kdb+, Cassandra, PyTables, and OpenTSDB. All of these seem to do what I want, but I can't really figure out how easy it would be to get the data into the MATLAB HDF5 format, and whether any of these would make it harder than the others.
If anyone has experience doing something similar, it would be a big help. Thanks!
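To make the query shape concrete: "Pull metric X between time Y and Z" amounts to a range lookup on sorted timestamps, whatever the storage format ends up being. A minimal sketch, assuming the timestamps for one metric are already sorted and held alongside their values; the function name and the inclusive bounds are illustrative choices, not a feature of any of the systems listed above.

```python
from bisect import bisect_left, bisect_right

def slice_by_time(timestamps, values, start, end):
    """Return (timestamps, values) with start <= t <= end; timestamps must be sorted."""
    lo = bisect_left(timestamps, start)   # first index with t >= start
    hi = bisect_right(timestamps, end)    # first index with t > end
    return timestamps[lo:hi], values[lo:hi]
```

Because both lookups are binary searches, the cost of finding the range grows with the logarithm of the number of samples; only the selected slice is copied out for export.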
A KDB+ tickerplant is certainly capable of capturing data at that rate; however, there are lots of things you need to check (whatever solution you pick):
Do the machine(s) that are capturing the data have enough cores? It is best to taskset a tickerplant, for example, to a core that nothing else will contend with.
Similarly with disk: use SSDs, and be sure there is no contention on the bus.
Separate the workload: you can write different types of data (maybe packets can be partitioned by source or stream?) to different CPUs/disks/tickerplant processes.
Basically, there are lots of ways you can cut this. I can say, though, that with the appropriate hardware KDB+ could do the job. However, given that you want HDF5, it's probably even better to have a simple process capturing the data and writing/converting to disk on the fly.
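As a rough illustration of what such a simple capture process might write, here is a sketch of an append-only file of fixed-size binary records. The record layout (a 64-bit nanosecond timestamp plus one 64-bit float), the file handling, and the function names are all assumptions made for the example; the real capture program is C++ and the real records would carry several parameters.

```python
import struct

# Hypothetical layout: little-endian int64 timestamp (ns) + float64 metric value.
RECORD = struct.Struct("<qd")

def append_records(path, records):
    """Append (timestamp_ns, value) pairs to a binary file."""
    with open(path, "ab") as handle:
        for timestamp_ns, value in records:
            handle.write(RECORD.pack(timestamp_ns, value))

def read_records(path):
    """Read all complete records back; a trailing partial record is ignored."""
    with open(path, "rb") as handle:
        data = handle.read()
    last_start = len(data) - RECORD.size
    return [RECORD.unpack_from(data, offset)
            for offset in range(0, last_start + 1, RECORD.size)]
```

Fixed-size records keep the writer cheap and make it easy for a separate converter to turn time-ordered chunks into HDF5 later.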

NetLogo BehaviorSpace memory size constraint

In my model I'm using BehaviorSpace to carry out a number of runs, with variables changing for each run and the output being stored in a *.csv file for later analysis. The model runs fine for the first few iterations, but quickly slows as the data grows. My question is: will file-flush help with this when used in BehaviorSpace? Or is there a way around it?
Cheers
Simon
Make sure you are using table format output and spreadsheet format is disabled. At http://ccl.northwestern.edu/netlogo/docs/behaviorspace.html we read:
Note however that spreadsheet data is not written to the results file until the experiment finishes. Since spreadsheet data is stored in memory until the experiment is done, very large experiments could run out of memory. So you should disable spreadsheet output unless you really want it.
Note also:
doing runs in parallel will multiply the experiment's memory requirements accordingly. You may need to increase NetLogo's memory ceiling (see this FAQ entry).
where the linked FAQ entry is http://ccl.northwestern.edu/netlogo/docs/faq.html#howbig
Using file-flush will not help. It flushes any buffered data to disk, but only for a file you opened yourself with file-open, and anyway, the buffer associated with a file is fixed-size, not something that grows over time. file-flush is really only useful if you're reading from the same file from another process during a run.

Which is more efficient, storing output in variable or output to file?

I am using Robocopy in PowerShell to sort through and output millions of filenames older than a user-specified age. My question is this: Is it better to make use of Robocopy's logging feature, then import the log via Get-Content -ReadCount, or would it be better to store Robocopy's output in a variable so that the script doesn't have to write to disk?
I would have to regex either way to get the actual file names. I'm using Robocopy because many of the files have paths longer than 248 chars.
Is one way more preferred than the other? Don't want to miss something that should be considered obvious.
You can skip all the theory and speculation about the multiple factors in play by measuring how long each method takes using Measure-Command, for example:
Measure-Command {$rc_output = robocopy <arguments>}
Measure-Command {robocopy <arguments> /log:rc.log; Get-Content rc.log [...]}
You'll get output telling you exactly how long each version took, down to the millisecond. Try it out on a small amount of sample data, see which one is quicker, then apply it to your millions of files.
I will add to @mjolinor's comment, and the other comments. To answer the question directly:
Saving information to a variable (and therefore to RAM) is generally faster than writing it directly to disk, but only under the following condition:
Variables are designed to store small (<10 MB) amounts of data. They are not designed to hold things like entire databases. If the data is large (i.e. millions of rows of data, i.e. tens of megabytes), then disk is the better choice. The problem is that if you shove a ton of information into a variable, you will fill up your RAM, and once your RAM is full, things slow down, memory starts being paged to disk, and basically everything stops working, including any commands that you are currently running (i.e. Robocopy).
Overall, because you are dealing with millions of rows, my recommendation is to write it to disk, because your results are likely to take up quite a bit of space, much more than a variable "should" hold.
Now, after saying all that and delving into the details of how programs manipulate bits in memory, none of it really matters, because the time spent writing things to disk is very small compared to the time it takes to process all the files.
If you are processing 1,000,000 files, and you process them at a good speed, say, 1,000 files a second, then it will take 1,000 seconds to process. That means that it takes over 16 minutes to run through all the files.
Now let's say writing to disk is bad and slows you down by 5 files per second, to 995 files per second instead; the whole run will then take only about 5 seconds longer. 5 seconds is an impact of 0.5%, which is nothing compared to the amount of time it takes to run the whole process.
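The arithmetic above can be checked with a few lines of Python. The throughput figures are the same illustrative numbers used in this answer, not measurements, and the function name is just for the example:

```python
def run_time_seconds(total_files, files_per_second):
    """Total processing time for a given throughput."""
    return total_files / files_per_second

baseline = run_time_seconds(1_000_000, 1_000)  # 1000.0 seconds
slower = run_time_seconds(1_000_000, 995)      # slightly more than 1005 seconds
extra = slower - baseline                      # about 5 seconds
overhead_percent = 100 * extra / baseline      # about 0.5 percent

print(f"baseline: {baseline / 60:.1f} min, extra: {extra:.1f} s, "
      f"overhead: {overhead_percent:.1f}%")
```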
It is much more likely that writing to a variable will cause more trouble than writing to disk.
It depends on how much output you're talking about, and what your available system resources are. It will be faster to write the results out to a file and then read them back in if the disk I/O time is less than the additional memory management overhead required to get them into memory. You can try it both ways and time it, but I'd try reading it into memory first while monitoring it with Task Manager. If it starts throwing lots of page faults, that's a clue that you may be better off using the disk as intermediate storage.

Recover standard out from a failed Hadoop job

I'm running a large Hadoop streaming job where I process a large list of files with each file being processed as a single unit. To do this, my input to my streaming job is a single file with a list of all the file names on separate lines.
In general, this works well. However, I ran into an issue where I was partway through a large job (~36%) when Hadoop hit some files with issues and, for some reason, this seemed to crash the entire job. If the job had completed successfully, standard out would have contained a line for each file as it was completed, along with some stats from the program that processes each file. However, with this failed job, when I try to look at the output that would have been sent to standard out, it is empty. I know that roughly 36% of the files were processed (because I'm saving the data to a database), but it's not easy for me to generate a list of which files were successfully processed and which ones remain. Is there any way to recover this logging to standard out?
One thing I can do is look at all of the log files for the completed/failed tasks, but this seems more difficult to me and I'm not sure how to go about retrieving the good/bad list of files this way.
Thanks for any suggestions.
Hadoop captures System.out data here:
/mnt/hadoop/logs/userlogs/task_id
However, I've found this unreliable, and Hadoop jobs don't usually use standard out for debugging; rather, the convention is to use counters.
For each of your documents, you can summarize document characteristics such as length, number of normal ASCII characters, and number of newlines.
Then, you can have two counters: one for "good" files and one for "bad" files.
It will probably be pretty easy to spot that the bad files have something in common (no data, too much data, or maybe some non-printable characters).
Finally, you will obviously have to look at the results after the job has finished running.
The problem with System.out statements, of course, is that jobs running on different machines can't integrate their data. Counters get around this problem: they are easily aggregated into a clear and accurate picture of the overall job.
Of course, the problem with counters is that their information content is entirely numeric, but with a little creativity you can easily find ways to describe the data quantitatively in a meaningful way.
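Since the question uses Hadoop Streaming, here is a sketch of how the good/bad counters could be emitted from a Python script. Hadoop Streaming lets a script update counters by writing lines of the form reporter:counter:<group>,<counter>,<amount> to standard error. The group name "FileCheck", the function names, and the "bad file" rule are assumptions for illustration; the real check depends on what your files look like.

```python
import sys

def counter_line(group, name, amount=1):
    """Format a Hadoop Streaming counter update (to be written to stderr)."""
    return "reporter:counter:{},{},{}\n".format(group, name, amount)

def classify(text):
    """Hypothetical rule: a file is 'bad' if it is empty or has non-printable characters."""
    if not text:
        return "bad"
    if any(not (ch.isprintable() or ch in "\n\t\r") for ch in text):
        return "bad"
    return "good"

def report(text, stream=sys.stderr):
    """Classify one file's contents and bump the matching counter."""
    label = classify(text)
    stream.write(counter_line("FileCheck", label))
    return label
```

The counter totals then show up in the job's counter summary, aggregated across all tasks.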
WORST CASE SCENARIO: YOU REALLY NEED TEXT DEBUGGING, and you don't want it in a temp file
In this case, you can use MultipleOutputs to write out ancillary files with other data in them. You can emit records to these files in the same way as you would for the part-r-0000* data.
In the end, I think you will find that, ironically, the restriction of having to use counters will increase the readability of your jobs. It is pretty intuitive, once you think about it, to debug using numerical counts rather than raw text. I find quite often that many of my debugging print statements, when cut down to their raw information content, are basically just counters.