Large-scale MPI merge - merge

In an MPI application, I generated a (huge) number of files to campaign storage (GPFS or Lustre). Each file consists of a sequence of tuples (timestamp, data), already sorted by timestamp.
I'm looking for the most efficient possible way to merge all those files to a single sorted log, ideally scalable and in parallel.
The naive approach, which is keeping one file descriptor per file and sequentially build the merged file does not scale well. First, the system file descriptor limit is reached quite fast – it is capped at 100,000 files (that is, ~100.000 cores), when the goal is to scale the application to 1.000.000+ cores (on Sequoia).
The intermediate approach I can think of would be to create a merge tree. That is, merge sub-groups of files to a bigger one, then iterate over those until we get a completely sorted file.
Do you know of any better-performing approach to that problem, or publications that would set the state of the art for that problem?
Thanks.

Related

Is there an optimal way for writing lots of tiny files with PySpark?

I have a job that requires having to write a single JSON file to s3 for each row in a Spark dataframe (which then gets picked up by another process).
df.repartition(col("id")).write.mode("overwrite").partitionBy(col("id")).json(
f"s3://bucket/path/to/file"
)
These datasets often consist of 100k rows (sometimes 1m+) and take a very long time to write. I understand that large numbers of small files is not great for read performance but is this also the case for writes? Or is there something that can be done with partitioning to speed things up?
Please don't do this, you will only suffer pain. S3 was designed to be cheap long-term storage optimized for large files. It was design so the 'prefix' (directory path) leads to a bucket that provides files. If you want to optimize reads and writes you want to develop several buckets to write to at the same time. This means you want to actually modify the directory path(prefix) to the bucket with the most amount of variation to increase the number of buckets that you write to.
Example of mulitple files being written to the same bucket:
S3:/mydrive/mystuff/2020-12-31
S3:/mydrive/mystuff/2020-12-30
S3:/mydrive/mystuff/2020-12-29
This is because they all share the same bucket prefix --> S3:/mydrive/mystuff/
What if instead you flipped the part that changes? Now you have different buckets being used as you are writing to different buckets.(prefix is different)
S3:2020-12-31/mydrive/mystuff/
S3:2020-12-30/mydrive/mystuff/
S3:2020-12-29/mydrive/mystuff/
This change will help with read/write speed as different buckets will be used. It will not solve the problem that S3 doesn't actually use directories to direct you to files. As I said a prefix is actually just a pointer to the bucket. It then searches against all files you have written, to find the file that exists in your bucket. This is why tons of small files makes things worse, the lookup time for files takes longer and longer the more files you write. Because this lookup is expensive it's much faster to write larger files and make the cost of lookup minimized.

google dataprep (clouddataprep by trifacta) tip: jobs will not be able to run if they are to large

During my cloud dataprep adventures I have come across yet another very annoying bug.
The problem occurs when creating complex flow structures which need to be connected through reference datasets. If a certain limit is crossed in performing a number of unions or a joins with these sets, dataflow is unable to start a job.
I have had a lot of contact with support and they are working on the issue:
"Our Systems Engineer Team was able to determine the root cause resulting into the failed job. They mentioned that the job is too large. That means that the recipe (combined from all datasets) is too big, and Dataflow rejects it. Our engineering team is still investigating approaches to address this.
A workaround is to split the job into two smaller jobs. The first run the flow for the data enrichment, and then use the output as input in the other flow. While it is not ideal, this would be a working solution for the time being."
I ran into the same problem and have a fairly educated guess as to the answer. Keep in mind that DataPrep simply takes all your GUI based inputs and translates it into Apache Beam code. When you pass in a reference data set, it probably writes some AB code that turns the reference data set into a side-input (https://beam.apache.org/documentation/programming-guide/). DataFlow will perform a Parellel Do (ParDo) function where it takes each element from a PCollection, stuffs it into a worker node, and then applies the side-input data for transformation.
So I am pretty sure if the reference sets get too big (which can happen with Joins), the underlying code will take an element from dataset A, pass it to a function with side-input B...but if side-input B is very big, it won't be able to fit into the worker memory. Take a look at the Stackdriver logs for your job to investigate if this is the case. If you see 'GC (Allocation Failure)' in your logs this is a sign of not enough memory.
You can try doing this: suppose you have two CSV files to read in and process, file A is 4 GB and file B is also 4 GB. If you kick off a job to perform some type of Join, it will very quickly outgrow the worker memory and puke. If you CAN, see if you can pre-process in a way where one of the files is in the MB range and just grow the other file.
If your data structures don't lend themselves to that option, you could do what the Sys Engs suggested, split one file up into many small chunks and then feed it to the recipe iteratively against the other larger file.
Another option to test is specifying the compute type for the workers. You can iteratively grow the compute type larger and larger to see if it finally pushes through.
The other option is to code it all up yourself in Apache Beam, test locally, then port to Google Cloud DataFlow.
Hopefully these guys fix the problem soon, they don't make it easy to ask them questions, that's for sure.

Idempotent streams or preventing duplicate rows using PipelineDB

My application produces rotating log files containing multiple application metrics. The log file is rotated once a minute, but each file is still relatively large (over 30MB, with 100ks of rows)
I'd like to feed the logs into PipelineDB (running on the same single machine) which Countiuous View can create for me exactly the aggregations I need over the metrics.
I can easily ship the logs to PipelineDB using copy from stdin, which works great.
However, a machine might occasionally power off unexpectedly (e.g. due to power shortage) during the copy of a log file. Which means that once back online there is uncertainty how much of the file has been inserted into PipelineDB.
How could I ensure that each row in my logs is inserted exactly once in such cases? (It's very important that I get complete and accurate aggregations)
Notice each row in the log file has a unique identifier (serial number created by my application), but I can't find in the docs the option to define a unique field in the stream. I assume that PipelineDB's design is not meant to handle unique fields in stream rows
Nonetheless, are there any alternative solutions to this issue?
Exactly once semantics in a streaming (infinite rows) context is a very complex problem. Most large PipelineDB deployments use some kind of message bus infrastructure (e.g. Kafka) in front of PipelineDB for delivery semantics and reliability, as that's not PipelineDB's core focus.
That being said, there are a couple of approaches you could use here that may be worth thinking about.
First, you could maintain a regular table in PipelineDB that keeps track of each logfile and the line number that it has successfully written to PipelineDB. When beginning to ship a new logfile, check it against this table to determine which line number to start at.
Secondly, you could separate your aggregations by logfile (by including a path or something in the grouping) and simply DELETE any existing rows for that logfile before sending it. Then use combine to aggregate over all logfiles at read time, possibly with a VIEW.

Is sorting necessary for merging BAM files using BamTools?

I have a pair of Illumina paired-end read files (say, A_1.fastq.gz and A_2.fastq.gz) produced from a single bacterial isolate for variant calling. First of all, I used FLASH to merge overlapping reads because of the read length (100 bp), insertion size (about 230 bp) and its standard deviation (about 50 bp). FLASH produced three read files, two for non-overlapping paired-end reads and one for merged reads (single-end). Then I aligned them against a common reference genome using bowtie, which generated two bam files (one for paired-end reads and the other for single-end reads).
To gain a higher coverage and read depth for variant calling, I would like to merge both BAM files into a single one. I plan to use BamTools for this task as it is dedicated to handle BAM files. However, I am not sure whether it is necessary to sort input BAM files prior to calling the "bamtools merge" command? It is not covered in the software tutorial or elsewhere. I would appreciate it if you could help.
Well, it is a merge so, by definition, the input has to be sorted. Otherwise it won't be a merge.
Merge is the action of joining two or more sorted lists keeping the ordering. The good thing about the merge is that you don't have to do an extra sorting when your inputs are already sorted.
If the inputs are not sorted, then you can simply concatenate them and sort the final result, or sort the inputs and merge the intermediate results.
BTW, it is quite probable that if you feed unsorted bams to the merge command, it will complain about it.

Merge of key-value stores

Is there some merge strategy or program which is aware of key-value stores, in the sense that the sequence of the lines does not matter*? For a real example, jEdit does not keep the order of options, so there are hundreds of lines which are shuffled around. It would be nice to diff/merge these without having to sort the file first, for example to see how values are changed and keys are added/removed by configuration modifications while the program is running.
* I know it matters for some file types, like shell scripts where you can have references to other keys. These of course should be merged normally.
if the stores are unsorted then comparing them will cost O(n*m) time, if you first sort them you can run it in O(n log n + m log m) for the sort plus O(n+m) for the check, so if the stores are reasonably large then sorting is way faster