Is sorting necessary for merging BAM files using BamTools?

I have a pair of Illumina paired-end read files (say, A_1.fastq.gz and A_2.fastq.gz) produced from a single bacterial isolate for variant calling. First, I used FLASH to merge overlapping reads because of the read length (100 bp), insert size (about 230 bp) and its standard deviation (about 50 bp). FLASH produced three read files: two for non-overlapping paired-end reads and one for merged reads (single-end). I then aligned them against a common reference genome using Bowtie, which generated two BAM files (one for the paired-end reads and the other for the single-end reads).
To gain higher coverage and read depth for variant calling, I would like to merge both BAM files into a single one. I plan to use BamTools for this task, as it is dedicated to handling BAM files. However, I am not sure whether it is necessary to sort the input BAM files before calling the "bamtools merge" command. This is not covered in the software tutorial or elsewhere, so I would appreciate any help.

Well, it is a merge, so by definition the input has to be sorted; otherwise it won't be a merge.
Merging is the action of joining two or more sorted lists while keeping the ordering. The good thing about a merge is that you don't have to do any extra sorting when your inputs are already sorted.
If the inputs are not sorted, then you can either simply concatenate them and sort the final result, or sort the inputs and merge the intermediate results.
BTW, it is quite probable that if you feed unsorted BAMs to the merge command, it will complain about it.
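In practice that means coordinate-sorting each input BAM first (e.g. with bamtools sort or samtools sort) and only then merging. To make the merge idea itself concrete, here is a minimal Scala sketch of merging two already-sorted iterators; it is generic illustrative code, not the BamTools API:

// Merge two already-sorted iterators into one sorted iterator.
// If either input is unsorted, the result is not sorted either; that is why
// merge tools expect (and often check for) sorted input.
def mergeSorted(a: Iterator[Int], b: Iterator[Int]): Iterator[Int] = new Iterator[Int] {
  private val x = a.buffered
  private val y = b.buffered
  def hasNext: Boolean = x.hasNext || y.hasNext
  def next(): Int =
    if (!y.hasNext) x.next()
    else if (!x.hasNext) y.next()
    else if (x.head <= y.head) x.next()
    else y.next()
}

// Example: two sorted inputs give one sorted output in a single pass.
val merged = mergeSorted(Iterator(1, 4, 9), Iterator(2, 3, 10)).toList
// merged == List(1, 2, 3, 4, 9, 10)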

Related

Large-scale MPI merge

In an MPI application, I generated a (huge) number of files on campaign storage (GPFS or Lustre). Each file consists of a sequence of tuples (timestamp, data), already sorted by timestamp.
I'm looking for the most efficient possible way to merge all those files into a single sorted log, ideally one that is scalable and parallel.
The naive approach, keeping one file descriptor per file and sequentially building the merged file, does not scale well: the system file descriptor limit is reached quite fast, as it is capped at 100,000 files (that is, ~100,000 cores), while the goal is to scale the application to 1,000,000+ cores (on Sequoia).
The intermediate approach I can think of would be to build a merge tree: merge sub-groups of files into bigger ones, then iterate on those until we get a single, completely sorted file.
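A rough sketch of that merge-tree idea, just to fix the control flow (plain Scala with hypothetical names; the real thing would of course distribute the work across MPI ranks and use parallel I/O, and the fan-in of 8 is an arbitrary choice):

// Repeatedly merge groups of `fanIn` sorted runs until one run remains.
// `mergeRuns` stands for a k-way merge of already-sorted files (not shown).
// fanIn must be >= 2 or this never terminates.
def mergeTree(runs: List[String], mergeRuns: List[String] => String, fanIn: Int = 8): String =
  runs match {
    case Nil           => sys.error("no runs to merge")
    case single :: Nil => single
    case _ =>
      // Merge each group of fanIn runs into one intermediate run, then
      // recurse on the (much smaller) list of intermediates.
      val nextLevel = runs.grouped(fanIn).map(mergeRuns).toList
      mergeTree(nextLevel, mergeRuns, fanIn)
  }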
Do you know of any better-performing approach to that problem, or publications that would set the state of the art for that problem?
Thanks.

Why are zero byte files written to GCS when running a pipeline?

Our job/pipeline is writing the results of a ParDo transformation back out to GCS, i.e. using TextIO.Write.to("gs://...").
We've noticed that when the job/pipeline completes, it leaves numerous 0 byte files in the output bucket.
The input to the pipeline is from multiple files from GCS, so I'm assuming the results are sharded, which is fine.
But why do we get empty files?
It is likely that these empty shards come from an intermediate pipeline step whose output turned out to be somewhat sparse, so some pre-partitioned shards had no records in them.
E.g. if there was a GroupByKey right before the TextIO.Write and, say, the keyspace was sharded into ranges [00, 01), [01, 02), ..., [fe, ff) (255 shards total), but all actual keys emitted from the input of this GroupByKey were in the ranges [34, 81) and [a3, b5), then 255 output files would be produced, but most of them would turn out empty. (This is a hypothetical partitioning scheme, just to give you the idea.)
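To make that hypothetical concrete, here is a toy Scala sketch (not Dataflow code; the keys and the two-character prefixes are made up):

// 255 fixed shards keyed by the first two hex characters of each key.
val shards  = (0 until 255).map(i => f"$i%02x")            // "00" .. "fe"
val keys    = Seq("34ab12", "5fc003", "a3d1ff", "b0e977")  // made-up keys
val present = keys.map(_.take(2)).toSet
// Every shard whose prefix never occurs among the keys gets zero records,
// i.e. an empty output file under a naive one-file-per-shard write.
val emptyShards = shards.count(p => !present(p))
println(s"$emptyShards of ${shards.size} shards are empty") // 251 of 255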
The rest of my answer will be in the form of Q&A.
Why produce empty files at all? If there's nothing to output, don't create the file!
It's true that it would be technically possible to avoid producing them, e.g. by opening each output file lazily, only when the first element is written to it. AFAIK we normally don't do this because empty output files are usually not an issue, and an empty file is easier to understand than the absence of a file: it would be pretty confusing if, say, only the first of 50 shards turned out non-empty and you only had a single output file named 00001-of-00050; you'd wonder what happened to the other 49.
But why not add a post-processing step to delete the empty files? In principle we could add a post-processing step of deleting the empty outputs and renaming the rest (to be consistent with the xxxxx-of-yyyyy filepattern) if empty outputs became a big issue.
Does existence of empty shards signal a problem in my pipeline?
A lot of empty shards might mean that the system-chosen sharding was suboptimal/uneven and we should have split the computation into fewer, more uniform shards. If this is a problem for you, could you give more details about your pipeline's output? For example, your screenshot shows that the non-empty outputs are also pretty small: do they contain just a handful of records? (If so, it may be difficult to achieve uniform sharding without knowing the data in advance.)
But the shards of my original input are not empty, doesn't sharding of output mirror sharding of input? If your pipeline has GroupByKey (or derived) operations, there will be intermediate steps where the number of shards in input and output are different: e.g. an operation may consume 30 shards of input but produce 50 shards of output, or vice versa. Different number of shards in input and output is also possible in some other cases not involving GroupByKey.
TL;DR If your overall output is correct, it's not a bug, but tell us if it is a problem for you :)

Indexing of large text files line by line for fast access

I have a very large text file, around 43 GB, which I process to generate other files in different forms, and I don't want to set up any databases or any indexing search engines.
The data is in the .ttl (Turtle) format:
<http://www.wikidata.org/entity/Q1000> <http://www.w3.org/2002/07/owl#sameAs> <http://nl.dbpedia.org/resource/Gabon> .
<http://www.wikidata.org/entity/Q1000> <http://www.w3.org/2002/07/owl#sameAs> <http://en.dbpedia.org/resource/Gabon> .
<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://lad.dbpedia.org/resource/Mohandas_Gandhi> .
<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://lb.dbpedia.org/resource/Mohandas_Karamchand_Gandhi> .
The target is to generate all combinations from all triples that share the same subject.
For example, for the subject Q1000:
<http://nl.dbpedia.org/resource/Gabon> <http://www.w3.org/2002/07/owl#sameAs> <http://en.dbpedia.org/resource/Gabon> .
<http://en.dbpedia.org/resource/Gabon> <http://www.w3.org/2002/07/owl#sameAs> <http://nl.dbpedia.org/resource/Gabon> .
The problem:
The dummy code I started with iterates with complexity O(n^2), where n is the number of lines of the 43 GB text file; needless to say, it would take years to finish.
What I thought of to optimize:
Loading a HashMap[String, Array[Int]] that indexes, for each key, the lines in which it appears, and using some library to access the file by line number, for example:
Q1000 | 1,2,433
Q1001 | 2334,323,2124
Drawbacks: the index could be relatively large as well, considering that we would also need another index for access by specific line number, plus the overhead; I haven't tried measuring the performance of this approach.
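For reference, a rough sketch of that first idea (indexing byte offsets rather than line numbers, so a RandomAccessFile can seek straight to each triple; all names are illustrative and I haven't benchmarked it):

import java.io.RandomAccessFile
import scala.collection.mutable

// One pass to build subject -> byte offsets, then random access to re-read
// just the lines for one subject. Assumes the subject is the first
// whitespace-delimited token of each line. Note that RandomAccessFile.readLine
// is unbuffered and slow, and the index itself can get large.
def buildOffsetIndex(path: String): mutable.Map[String, mutable.ArrayBuffer[Long]] = {
  val index = mutable.Map.empty[String, mutable.ArrayBuffer[Long]]
  val raf = new RandomAccessFile(path, "r")
  try {
    var offset = raf.getFilePointer
    var line = raf.readLine()
    while (line != null) {
      val subject = line.takeWhile(c => !c.isWhitespace)
      index.getOrElseUpdate(subject, mutable.ArrayBuffer.empty[Long]) += offset
      offset = raf.getFilePointer
      line = raf.readLine()
    }
  } finally raf.close()
  index
}

def linesFor(path: String, offsets: Seq[Long]): Seq[String] = {
  val raf = new RandomAccessFile(path, "r")
  try offsets.map { off => raf.seek(off); raf.readLine() }
  finally raf.close()
}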
Making a text file for each key, e.g. Q1000.txt containing all triples with subject Q1000, and iterating over them one by one to build the combinations.
Drawbacks: this seems the fastest and least memory-consuming option, but creating around 10 million files and accessing them will certainly be a problem. Is there an alternative to that?
I'm using Scala scripts for the task.
Take the 43 GB file in chunks that fit comfortably in memory and sort each chunk on the subject. Write the sorted chunks out separately.
Run a merge sort on the chunks (which are now sorted by subject). It's really easy: you have iterators over two files as input, and you write out whichever input is smaller, then read from that one again (if there's anything left).
Now you just need to make one pass through the sorted data to gather the groups of subjects.
Should take O(n) space and O(n log n) time, which for this sort of thing you should be able to afford.
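A compact Scala sketch of that final pass, assuming the sorted chunks have already been merged into one subject-sorted file (parsing is deliberately naive, splitting each triple on whitespace):

import scala.io.Source
import scala.collection.mutable.ArrayBuffer

// All triples with the same subject are adjacent in the sorted file, so one
// linear pass groups them and emits every ordered pair of objects.
def emitCombinations(sortedPath: String, write: String => Unit): Unit = {
  val sameAs = "<http://www.w3.org/2002/07/owl#sameAs>"
  val lines = Source.fromFile(sortedPath).getLines().buffered
  while (lines.hasNext) {
    val subject = lines.head.split("\\s+")(0)
    // Collect the objects of all consecutive triples sharing this subject.
    val objects = ArrayBuffer.empty[String]
    while (lines.hasNext && lines.head.split("\\s+")(0) == subject)
      objects += lines.next().split("\\s+")(2)
    // Emit all ordered pairs (both directions, as in the example output).
    for (a <- objects; b <- objects if a != b)
      write(s"$a $sameAs $b .")
  }
}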
A possible solution would be to use some existing map-reduce library. After all, your task is exactly what map-reduce is for. Even if you don't parallelize the computation across multiple machines, the main advantage is that it handles the management of splitting and merging for you.
There is an interesting library, Apache Crunch, with a Scala API. I haven't used it myself, but it looks like it could solve your problem well: your lines would be split according to their subjects and the combinations then generated within each group.

Ordering of files during the mergesort

I am interested in one topic: suppose we have eight files, each containing 1 billion integers, and we want to combine them into a single file of 8 billion integers; the integers within each file are already sorted. Of course, the task is easy if we do an 8-pass merge sort, but my question is about the ordering of the files, i.e. in which order we should combine them. For example, at the beginning we could combine the first and second files into a new file M and then combine M with the third file; but maybe sometimes combining the second and third files first, and only then the first one, would be more profitable? I think my question is clear: does the ordering of files during the mergesort procedure matter? If so, how can we choose the optimal one?
It's probably optimal to do an 8-way merge sort without intermediate files. Open 8 file handles, find the smallest integer from all 8, write that to the output file and read the next integer from that file. You could probably manage an 8-element array of your 8 sources (holding a file handle and the last value read), using an insertion sort.
As far as ordering goes, if you could only merge two files at a time, I would probably merge the smallest files first. Simplify your example and you can see why.
Assume you have 3 files, with 1, 2 and 100 records in them.
If you merge the 1- and 2-record files into a temp file with 3 records, and then merge that with the 100, you'll have read 106 records and written 106 (3 for the temp file plus 103 for the final output).
If you instead merge the 1- and 100-record files into a temp file with 101 records, and then merge that with the 2, you'll have read 204 records and written 204.
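For what it's worth, here is a sketch of the 8-way merge in Scala, using a priority queue of (current head value, source index) instead of the insertion-sorted array described above; for a small, fixed number of sources the two are equivalent:

import scala.collection.mutable

// k-way merge of already-sorted iterators: repeatedly take the source whose
// current head is smallest, emit it, and push that source's new head.
def kWayMerge(sources: Seq[Iterator[Long]]): Iterator[Long] = new Iterator[Long] {
  private val heads = sources.map(_.buffered).toIndexedSeq
  // PriorityQueue is a max-heap by default, so reverse the ordering on the value.
  private val queue = mutable.PriorityQueue.empty[(Long, Int)](
    Ordering.by[(Long, Int), Long](_._1).reverse)
  heads.indices.foreach(i => if (heads(i).hasNext) queue.enqueue((heads(i).head, i)))

  def hasNext: Boolean = queue.nonEmpty
  def next(): Long = {
    val (value, i) = queue.dequeue()
    heads(i).next()                                          // consume the emitted head
    if (heads(i).hasNext) queue.enqueue((heads(i).head, i))  // push its new head
    value
  }
}

// Example with three small sorted sources (stands in for the 8 files):
val merged = kWayMerge(Seq(Iterator(1L, 5L), Iterator(2L, 2L), Iterator(3L))).toList
// merged == List(1, 2, 2, 3, 5)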

Merge of key-value stores

Is there some merge strategy or program which is aware of key-value stores, in the sense that the sequence of the lines does not matter*? For a real example, jEdit does not keep the order of options, so there are hundreds of lines which are shuffled around. It would be nice to diff/merge these without having to sort the file first, for example to see how values are changed and keys are added/removed by configuration modifications while the program is running.
* I know it matters for some file types, like shell scripts where you can have references to other keys. These of course should be merged normally.
If the stores are unsorted, then comparing them will cost O(n*m) time. If you first sort them, you can do it in O(n log n + m log m) for the sorts plus O(n+m) for the check, so if the stores are reasonably large, sorting first is way faster.
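A sketch of the sort-then-compare idea in Scala, assuming simple key=value lines as in a properties-style config file (parsing is deliberately naive):

// Sort both stores by key, then walk them in lockstep:
// O(n log n + m log m) for the sorts, O(n + m) for the walk.
def parseSorted(lines: Seq[String]): Vector[(String, String)] =
  lines.collect { case l if l.contains('=') =>
    val parts = l.split("=", 2)
    (parts(0).trim, parts(1).trim)
  }.sortBy(_._1).toVector

def diffStores(a: Seq[String], b: Seq[String]): Unit = {
  val (xs, ys) = (parseSorted(a), parseSorted(b))
  var i = 0; var j = 0
  while (i < xs.length || j < ys.length) {
    if (j >= ys.length || (i < xs.length && xs(i)._1 < ys(j)._1)) {
      println(s"removed: ${xs(i)._1}"); i += 1
    } else if (i >= xs.length || ys(j)._1 < xs(i)._1) {
      println(s"added:   ${ys(j)._1}"); j += 1
    } else {
      if (xs(i)._2 != ys(j)._2)
        println(s"changed: ${xs(i)._1} = ${xs(i)._2} -> ${ys(j)._2}")
      i += 1; j += 1
    }
  }
}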