Ordering of files during mergesort

I am interested in one topic. Suppose we have eight files, each containing 1 billion sorted integers, and we need to combine them into a single file of 8 billion sorted integers. Of course, the task is easy if we just do a straightforward multi-pass mergesort, but my question is: does the order in which we combine the files matter? For example, at the beginning, instead of combining the first and second files into a new file M and then combining M with the third file, perhaps combining the second and third files first and then merging the result with the first one would sometimes be more profitable? I think my question is clear: does the ordering of files during the mergesort procedure matter, and if so, how can we choose an optimal one?

It's probably optimal to do an 8-way merge sort without intermediate files. Open 8 file handles, find the smallest integer from all 8, write that to the output file and read the next integer from that file. You could probably manage an 8-element array of your 8 sources (holding a file handle and the last value read), using an insertion sort.
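To make that concrete, here's a minimal Perl sketch of such a k-way merge, assuming plain-text inputs with one sorted integer per line; the file names are made up, and a simple linear scan of the sources stands in for the insertion-sorted array described above:

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input files, each holding one sorted integer per line.
my @names = map { "input$_.txt" } 1 .. 8;

# Each source holds a file handle and the last value read from it.
my @sources;
for my $name (@names) {
    open my $fh, '<', $name or die "Cannot open $name: $!";
    my $line = <$fh>;
    if (!defined $line) { close $fh; next }   # skip empty files
    chomp $line;
    push @sources, { fh => $fh, value => $line };
}

open my $out, '>', 'merged.txt' or die "Cannot open merged.txt: $!";

while (@sources) {
    # Find the source holding the smallest current value (a linear scan is
    # fine for 8 sources; use a heap or keep the array sorted for many more).
    my $min = 0;
    for my $i (1 .. $#sources) {
        $min = $i if $sources[$i]{value} < $sources[$min]{value};
    }

    print {$out} "$sources[$min]{value}\n";

    # Advance that source, dropping it once its file is exhausted.
    my $next = readline $sources[$min]{fh};
    if (defined $next) {
        chomp $next;
        $sources[$min]{value} = $next;
    }
    else {
        close $sources[$min]{fh};
        splice @sources, $min, 1;
    }
}
close $out;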
As far as ordering goes, if you could only merge two files at a time, I would probably merge the smallest files first. Simplify your example and you can see why.
Assume you have 3 files, with 1, 2 and 100 records in them.
If you merge the 1 & 2 into a temp file with 3 records, and then merge that with the 100, you'll have read (and written) 106 records in total: 3 for the temp file plus 103 for the final output.
If you instead merge the 1 & 100 into a temp file with 101 records, and then merge that with the 2, you'll have read (and written) 204 records in total: 101 for the temp file plus 103 for the final output.
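One way to generalize this is the greedy rule of always merging the two smallest remaining files. Here's a small sketch that only tallies the total records read under that rule, with hypothetical record counts, rather than doing any real merging:

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical record counts of the files waiting to be merged.
my @sizes = (1, 2, 100);
my $records_read = 0;

while (@sizes > 1) {
    # Always pick the two smallest remaining files.
    @sizes = sort { $a <=> $b } @sizes;
    my $x = shift @sizes;
    my $y = shift @sizes;

    $records_read += $x + $y;    # both inputs are read in full
    push @sizes, $x + $y;        # the merged file joins the pool
    print "merge $x + $y -> ", $x + $y, " (total read so far: $records_read)\n";
}

print "total records read: $records_read\n";   # 106 for (1, 2, 100)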

Related

How to hash a filename down to a small number or digit for output processing

I am not a Perl programmer, but I've inherited existing code that goes to a directory, finds all files in that folder and its subfolders (usually JPG or Office files), and then converts them into a single file used to load the data into a SQL Server database. The customer has about 500,000 of these files.
It takes about 45 minutes to create the file and then another 45 minutes for SQL Server to load the data. Crudely, it's doing about 150 files per second, which is reasonable, but time is the issue for this job. There are many reasons I don't want to use other techniques, so please don't suggest other options unless they are closely aligned with this process.
What I was considering, to improve speed, is running something like 10 processes concurrently. Each process would be passed a different argument (0-9). Each process would go to the directory and find all files as the code currently does, but for each file found it would hash or kludge the filename down to a single digit (0-9), and if that matched the supplied argument, the process would handle that file and write it out to its own unique output stream.
I would then have 10 output files at the end. I doubt the SQL Server side could be improved much, as I would have to load into separate tables and then merge them in the database, and since these are BLOB objects, that will not be fast.
So I am looking for some basic code, or clues about which Perl functions to use, to take a variable (the file name $File) and generate a single value from 0 to 9 based on it. It could probably be done by taking the ASCII value of each character and adding them together to get a long number, then adding that number's digits together, and so on, until you end up with a single digit.
Any clues or suggested techniques?
Here's an easy one to implement, suggested in the documentation for the unpack function:
sub string_to_code {
    # convert an arbitrary string to a digit from 0-9
    my ($string) = @_;
    return unpack("%32W*", $string) % 10;
}
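For example, each worker process could then compare the code against its own argument and skip the files that don't match; the file list and the way the slot is passed in below are just illustrative:

my $my_slot = $ARGV[0];    # 0-9, passed on the command line
my @files_found = ('invoice_001.jpg', 'contract_17.docx');   # stand-ins for your directory scan
for my $File (@files_found) {
    next unless string_to_code($File) == $my_slot;
    # ... process $File and write it to this process's own output stream ...
}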

How to fix an out-of-memory error when comparing 5 million records in two files using Perl in a Windows environment

I'm comparing two files of 5 million records each (each line contains many columns, but I only need to compare 2 of them). Is there a better approach to comparing the two files and finding the differences without running out of memory?
I have tried parsing each file into a separate hash and comparing the two hashes, but that leads to an out-of-memory error.
The first question is, do you need to be using Perl to begin with?
Have you thought about using standard Linux utilities?
Depending on how your columns of data are constructed and delimited, there is a very good chance that Linux 'cut' could work for you to extract from each file only the column you need into a temp file.
Then use Linux 'sort' to sort each temp file.
Then use Linux 'diff' or 'comm' to compare the two temp files.
None of the above-suggested utilities should have any out-of-memory issues even on two files of 5 million records, assuming you have a reasonable amount of memory and disk space (e.g., for 'sort' to create its own temporary files).
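If you'd rather do the column extraction in Perl itself, the key is to stream the files rather than build hashes, so memory stays flat. A rough sketch, assuming tab-delimited lines and (hypothetical) key columns at indices 1 and 4, with made-up file names:

#!/usr/bin/perl
use strict;
use warnings;

# Stream one input file and write only the two key columns to a temp file.
# Memory use stays flat no matter how many records there are.
sub extract_columns {
    my ($in_name, $out_name) = @_;
    open my $in,  '<', $in_name  or die "Cannot open $in_name: $!";
    open my $out, '>', $out_name or die "Cannot open $out_name: $!";
    while (my $line = <$in>) {
        chomp $line;
        my @cols = split /\t/, $line;                 # assumed tab-delimited
        print {$out} join("\t", @cols[1, 4]), "\n";   # assumed key columns
    }
    close $in;
    close $out;
}

extract_columns('file_a.txt', 'keys_a.txt');
extract_columns('file_b.txt', 'keys_b.txt');
# Then sort keys_a.txt and keys_b.txt (e.g. with the system sort) and
# compare them with diff or comm, or line by line.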

Is sorting necessary for merging BAM files using BamTools?

I have a pair of Illumina paired-end read files (say, A_1.fastq.gz and A_2.fastq.gz) produced from a single bacterial isolate for variant calling. First, I used FLASH to merge overlapping reads because of the read length (100 bp), insert size (about 230 bp) and its standard deviation (about 50 bp). FLASH produced three read files: two for non-overlapping paired-end reads and one for merged reads (single-end). Then I aligned them against a common reference genome using Bowtie, which generated two BAM files (one for paired-end reads and the other for single-end reads).
To gain higher coverage and read depth for variant calling, I would like to merge both BAM files into a single one. I plan to use BamTools for this task, as it is dedicated to handling BAM files. However, I am not sure whether it is necessary to sort the input BAM files before calling the "bamtools merge" command. This is not covered in the software tutorial or elsewhere. I would appreciate it if you could help.
Well, it is a merge so, by definition, the input has to be sorted. Otherwise it won't be a merge.
Merging is the action of joining two or more sorted lists while keeping the ordering. The good thing about a merge is that you don't have to do any extra sorting when your inputs are already sorted.
If the inputs are not sorted, then you can simply concatenate them and sort the final result, or sort the inputs and merge the intermediate results.
BTW, it is quite probable that if you feed unsorted BAMs to the merge command, it will complain about it.

Large-scale MPI merge

In an MPI application, I generated a (huge) number of files to campaign storage (GPFS or Lustre). Each file consists of a sequence of tuples (timestamp, data), already sorted by timestamp.
I'm looking for the most efficient possible way to merge all those files to a single sorted log, ideally scalable and in parallel.
The naive approach of keeping one file descriptor per file and sequentially building the merged file does not scale well. First, the system file descriptor limit is reached quite fast: it is capped at 100,000 files (that is, ~100,000 cores), while the goal is to scale the application to 1,000,000+ cores (on Sequoia).
The intermediate approach I can think of would be to create a merge tree: merge sub-groups of files into bigger ones, then iterate over those until we get a single completely sorted file.
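For illustration, here is a purely serial (non-MPI) sketch of how such a merge tree could be scheduled: each round pairs up the surviving files, so the input count roughly halves per round and any one merge only needs three descriptors (two in, one out). The file names and the merge_two() helper are hypothetical:

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical list of already-sorted input files.
my @files = map { "part$_.log" } 0 .. 9;

my $round = 0;
while (@files > 1) {
    $round++;
    my @next;
    while (@files) {
        my $left  = shift @files;
        my $right = shift @files;
        if (defined $right) {
            my $merged = "round${round}_" . scalar(@next) . ".log";
            print "round $round: merge $left + $right -> $merged\n";
            # merge_two($left, $right, $merged);   # hypothetical two-way merge of sorted files
            push @next, $merged;
        }
        else {
            print "round $round: $left passes through unchanged\n";
            push @next, $left;   # the odd one out waits for the next round
        }
    }
    @files = @next;
}
print "final sorted file: $files[0]\n";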
Do you know of any better-performing approach to that problem, or publications that would set the state of the art for that problem?
Thanks.

Why are zero byte files written to GCS when running a pipeline?

Our job/pipeline is writing the results of a ParDo transformation back out to GCS i.e. using TextIO.Write.to("gs://...")
We've noticed that when the job/pipeline completes, it leaves numerous 0 byte files in the output bucket.
The input to the pipeline is from multiple files from GCS, so I'm assuming the results are sharded, which is fine.
But why do we get empty files?
It is likely that these empty shards are the result of an intermediate pipeline step that turned out to be somewhat sparse, so some pre-partitioned shards had no records in them.
E.g., if there was a GroupByKey right before the TextIO.Write and, say, the keyspace was sharded into ranges [00, 01), [01, 02), ..., [fe, ff) (255 shards total), but all actual keys emitted from the input of this GroupByKey were in the ranges [34, 81) and [a3, b5), then 255 output files would be produced, but most of them would turn out empty. (This is a hypothetical partitioning scheme, just to give you the idea.)
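To make that concrete, here's a tiny sketch (with made-up keys, mirroring the hypothetical scheme above) that counts keys per first-byte range and shows how many of the 255 shards end up empty:

#!/usr/bin/perl
use strict;
use warnings;

# 255 hypothetical shards: [00, 01), [01, 02), ..., [fe, ff).
my @counts = (0) x 255;

# Made-up keys, all falling in [34, 81) and [a3, b5) as in the example above.
my @keys = map { sprintf "%02x%04x", $_, int(rand 0xffff) }
           (0x34 .. 0x80, 0xa3 .. 0xb4);

for my $key (@keys) {
    my $shard = hex(substr($key, 0, 2));   # the first byte picks the range
    $counts[$shard]++;
}

my $empty = grep { $_ == 0 } @counts;
print "non-empty shards: ", 255 - $empty, ", empty shards: $empty\n";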
The rest of my answer will be in the form of Q&A.
Why produce empty files at all? If there's nothing to output, don't create the file!
It's true that it would be technically possible to avoid producing them, e.g. by opening each output file lazily, only when its first element is written. AFAIK we normally don't do this because empty output files are usually not an issue, and it is easier to understand an empty file than the absence of a file: it would be pretty confusing if, say, only the first of 50 shards turned out non-empty and you only had a single output file named 00001-of-000050: you'd wonder what happened to the 49 other ones.
But why not add a post-processing step to delete the empty files? In principle we could add a post-processing step of deleting the empty outputs and renaming the rest (to be consistent with the xxxxx-of-yyyyy filepattern) if empty outputs became a big issue.
Does existence of empty shards signal a problem in my pipeline?
A lot of empty shards might mean that the system-chosen sharding was suboptimal/uneven and we should have split the computation into fewer, more uniform shards. If this is a problem for you, could you give more details about your pipeline's output? E.g., your screenshot shows that the non-empty outputs are also pretty small: do they contain just a handful of records? (If so, it may be difficult to achieve uniform sharding without knowing the data in advance.)
But the shards of my original input are not empty, doesn't sharding of output mirror sharding of input? If your pipeline has GroupByKey (or derived) operations, there will be intermediate steps where the number of shards in input and output are different: e.g. an operation may consume 30 shards of input but produce 50 shards of output, or vice versa. Different number of shards in input and output is also possible in some other cases not involving GroupByKey.
TL;DR If your overall output is correct, it's not a bug, but tell us if it is a problem for you :)