Merge of key-value stores - merge

Is there some merge strategy or program which is aware of key-value stores, in the sense that the sequence of the lines does not matter*? For a real example, jEdit does not keep the order of options, so there are hundreds of lines which are shuffled around. It would be nice to diff/merge these without having to sort the file first, for example to see how values are changed and keys are added/removed by configuration modifications while the program is running.
* I know it matters for some file types, like shell scripts where you can have references to other keys. These of course should be merged normally.

If the stores are unsorted, comparing them naively costs O(n*m) time. If you sort them first, you pay O(n log n + m log m) for the sort plus O(n+m) for the comparison, so for reasonably large stores sorting first is much faster.
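If both stores fit in memory, a hash-based comparison is another option that needs neither a sort nor a quadratic scan (roughly O(n+m) on average). Below is a minimal sketch in Python, assuming a plain `key=value` line format with `#` comments; the names `parse_store` and `diff_stores` are made up for this illustration and do not come from any existing tool.

```python
def parse_store(text):
    """Parse 'key=value' lines into a dict; blank lines and '#' comments are skipped."""
    store = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        store[key.strip()] = value.strip()
    return store


def diff_stores(old, new):
    """Return (added, removed, changed) between two dicts, ignoring line order."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return added, removed, changed


# The same keys in a different order are not reported as differences.
before = parse_store("a=1\nb=2\nc=3\n")
after = parse_store("c=3\nb=20\nd=4\n")
added, removed, changed = diff_stores(before, after)
```

Here `added` holds the new key `d`, `removed` holds `a`, and `changed` maps `b` to its old and new values. This only reports differences; a three-way merge would need a common ancestor as well.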


Is sorting necessary for merging BAM files using BamTools?

I have a pair of Illumina paired-end read files (say, A_1.fastq.gz and A_2.fastq.gz) produced from a single bacterial isolate for variant calling. First of all, I used FLASH to merge overlapping reads because of the read length (100 bp), insertion size (about 230 bp) and its standard deviation (about 50 bp). FLASH produced three read files, two for non-overlapping paired-end reads and one for merged reads (single-end). Then I aligned them against a common reference genome using bowtie, which generated two bam files (one for paired-end reads and the other for single-end reads).
To gain higher coverage and read depth for variant calling, I would like to merge both BAM files into a single one. I plan to use BamTools for this task as it is dedicated to handling BAM files. However, I am not sure whether it is necessary to sort the input BAM files prior to calling the "bamtools merge" command. This is not covered in the software tutorial or elsewhere. I would appreciate it if you could help.
Well, it is a merge so, by definition, the input has to be sorted. Otherwise it won't be a merge.
Merge is the action of joining two or more sorted lists keeping the ordering. The good thing about the merge is that you don't have to do an extra sorting when your inputs are already sorted.
If the inputs are not sorted, then you can simply concatenate them and sort the final result, or sort the inputs and merge the intermediate results.
BTW, it is quite probable that if you feed unsorted bams to the merge command, it will complain about it.
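To illustrate the two strategies in general terms (this is not a statement about what BamTools does internally), here is a small Python sketch using the standard `heapq.merge`, which assumes its inputs are already sorted. The position lists are made-up stand-ins for alignment start coordinates.

```python
import heapq

# Hypothetical alignment start positions from two already-sorted inputs.
paired = [5, 20, 35, 90]
single = [7, 20, 60]

# Strategy 1: merge the sorted inputs in a single linear pass.
merged = list(heapq.merge(paired, single))

# Strategy 2: concatenate and sort; this also works when the inputs are unsorted.
concat_sorted = sorted(paired + single)
```

Both strategies give the same ordered result here; the merge simply avoids the extra sort when the inputs are already in order.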

Large-scale MPI merge

In an MPI application, I generated a (huge) number of files to campaign storage (GPFS or Lustre). Each file consists of a sequence of tuples (timestamp, data), already sorted by timestamp.
I'm looking for the most efficient possible way to merge all those files to a single sorted log, ideally scalable and in parallel.
The naive approach, which is to keep one file descriptor per file and build the merged file sequentially, does not scale well. First, the system file descriptor limit is reached quite fast: it is capped at 100,000 files (that is, ~100,000 cores), whereas the goal is to scale the application to 1,000,000+ cores (on Sequoia).
The intermediate approach I can think of would be to create a merge tree. That is, merge sub-groups of files to a bigger one, then iterate over those until we get a completely sorted file.
Do you know of any better-performing approach to that problem, or publications that would set the state of the art for that problem?
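For reference, the merge-tree idea can be sketched as follows. This is only an in-memory illustration in Python, assuming lists of (timestamp, data) tuples stand in for the files; the function name `merge_tree` and the `fan_in` parameter are hypothetical. In a real MPI setting each group merge within a level is independent, so different ranks could handle different groups and write the intermediate runs back to storage.

```python
import heapq


def merge_tree(runs, fan_in=4):
    """Merge sorted runs of (timestamp, data) tuples, at most fan_in at a time."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    runs = [list(r) for r in runs]
    while len(runs) > 1:
        # One level of the tree: each group of fan_in runs becomes one larger run.
        runs = [
            list(heapq.merge(*runs[i:i + fan_in], key=lambda t: t[0]))
            for i in range(0, len(runs), fan_in)
        ]
    return runs[0] if runs else []


# Hypothetical per-rank logs, each already sorted by timestamp.
logs = [[(r + 10 * k, "rank%d" % r) for k in range(3)] for r in range(10)]
merged = merge_tree(logs, fan_in=4)
timestamps = [t for t, _ in merged]
```

With a fan-in of k, the number of simultaneously open inputs per merge is bounded by k, and the number of levels grows as the logarithm (base k) of the number of files.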


I am pretty new to NoSQL, but I always liked the idea of it. I took a look at Redis, and have a few questions about the best way of storing and retrieving multiple hashes.
Assuming the following scenario:
Store a list of objects (redis 'Hashes') and select them by their timestamp.
To achieve this in SQL, it would require one table and two simple queries (INSERT & SELECT).
Trying to do this in Redis, I ended up creating the following structure:
Key object:$id (hash) containing the object
Key index:timestamp:$id (sorted set)
score equals timestamp and value includes id
While I can live with the additional maintenance work of two keys instead of one table (SQL), I am curious about the process of selecting multiple objects:
ZRANGEBYSCORE index:timestamp:$id timestampStart timestampEnd
This returns an array of all IDs which got created between timestampStart and timestampEnd. To get the object itself I am requesting every single one by:
GET object:$id
Is this the right way of doing it?
In comparison with an SQL Database: Is it still appreciably faster or might it even become slower caused by the high number of GETs?
A ZRANGEBYSCORE costs O(log(N) + M) where N=|items in your set| and M=|items you're selecting|. So, doing the ZRANGEBYSCORE and then M GET operations is just O(log(N)+M+M) = O(log(N)+M) and would at most be twice as slow. The network back and forth could have been a major slowdown, but since each of your GETs is an independent operation, you can just pipeline them. You can also put the whole thing in a Lua script and have just one back and forth, which would be the most efficient option. I'd say with 99% certainty this would be faster than doing the same thing in SQL.
Also, if this is a very frequent operation for you, you can get even more speed up by just storing the entire object in your sorted set instead of just the id. You'd have key = object encoded as json, score = timestamp. This would save you O(M) on your operation in terms of not needing to do any GETs.
Whether or not this is a good way of doing things really depends on your use case. How much speed do you really need, and how important are other features of a traditional database to you? Remember, Redis is much more a set of data structures accessible by clients than a traditional database, and it must store everything in RAM. To know whether it's the right thing for you, we'd need more information.
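The O(log(N) + M) cost of the index-then-fetch pattern can be modelled without a Redis server. The following is a pure-Python sketch, not Redis code: a sorted list of timestamps with a parallel list of ids stands in for the sorted set, a dict stands in for the hashes, and the name `select_between` is made up for this example.

```python
import bisect

# Stand-in for the hashes: object id -> fields.
objects = {
    "a": {"name": "first"},
    "b": {"name": "second"},
    "c": {"name": "third"},
    "d": {"name": "fourth"},
}
# Stand-in for the sorted set: timestamps in ascending order, with parallel ids.
timestamps = [100, 150, 200, 300]
ids = ["a", "b", "c", "d"]


def select_between(start, end):
    """Return the objects whose timestamp lies in [start, end], bounds inclusive."""
    lo = bisect.bisect_left(timestamps, start)   # O(log N): first index >= start
    hi = bisect.bisect_right(timestamps, end)    # O(log N): first index > end
    # O(M): one lookup per selected id, analogous to the per-object fetches.
    return [objects[i] for i in ids[lo:hi]]


selected = select_between(150, 200)
```

The range search is logarithmic in the index size and the fetch loop is linear in the number of matches, which mirrors the cost argument above.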

Merging huge sets (HashSet) in Scala

I have two huge (as in millions of entries) sets (HashSet) that have some (<10%) overlap between them. I need to merge them into one set (I don't care about maintaining the original sets).
Currently, I am adding all items of one set to the other with:
setOne ++= setTwo
This takes several minutes to complete (after several attempts at tweaking hashCode() on the members).
Any ideas how to speed things up?
You can get slightly better performance with Parallel Collections API in Scala 2.9.0+:
setOne.par ++ setTwo
(setOne.par /: setTwo)(_ + _)
There are a few things you might wanna try:
Use the sizeHint method to keep your sets at the expected size.
Call useSizeMap(true) on it to get better hash table resizing.
It seems to me that the latter option gives better results, though both show improvements on tests here.
Can you tell me a little more about the data inside the sets? The reason I ask is that for this kind of thing, you usually want something a bit specialized. Here's a few things that can be done:
If the data is (or can be) sorted, you can walk pointers to do a merge, similar to what's done using merge sort. This operation is pretty trivially parallelizable since you can partition one data set and then partition the second data set using binary search to find the correct boundary.
If the data is within a certain numeric range, you can instead use a bitset and just set bits whenever you encounter that number.
If one of the data sets is smaller than the other, you could put it in a hash set and loop over the other dataset quickly, checking for containment.
I have used the first strategy to create a gigantic set of about 8 million integers from about 40k smaller sets in about a second (on beefy hardware, in Scala).
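The first strategy (walking pointers over sorted data) could look roughly like the sketch below. It is written in Python rather than Scala to keep it short, it assumes both inputs are already sorted and free of duplicates, and it leaves out the partitioning step that would make it parallel. The name `sorted_union` is made up for this illustration.

```python
def sorted_union(a, b):
    """Union of two sorted, duplicate-free lists, walking one pointer per list."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])
            i += 1
        elif b[j] < a[i]:
            out.append(b[j])
            j += 1
        else:
            # Element present in both inputs: emit it once, advance both pointers.
            out.append(a[i])
            i += 1
            j += 1
    # At most one of these tails is non-empty.
    out.extend(a[i:])
    out.extend(b[j:])
    return out


union = sorted_union([1, 3, 5, 7], [2, 3, 6, 7, 9])
```

Each element is visited once, so the cost is linear in the combined input size and no hashing is involved.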

Optimizing word count

(This is rather hypothetical in nature as of right now, so I don't have too many details to offer.)
I have a flat file of random (English) words, one on each line. I need to write an efficient program to count the number of occurrences of each word. The file is big (perhaps about 1GB), but I have plenty of RAM for everything. They're stored on permanent media, so read speeds are slow, so I need to just read through it once linearly.
My two off-the-top-of-my-head ideas were to use a hash with words => no. of occurrences, or a trie with the no. of occurrences at the end node. I have enough RAM for a hash array, but I'm thinking that a trie would have as fast or faster lookups.
What approach would be best?
I think a trie with the count as the leaves could be faster.
Any decent hash table implementation will require reading the word fully, processing it using a hash function, and finally, a look-up in the table.
A trie can be implemented such that the search occurs as you are reading the word. This way, rather than doing a full look-up of the word, you could often find yourself skipping characters once you've established the unique word prefix.
For example, if you've read the characters: "torto", a trie would know that the only possible word that starts this way is tortoise.
If you can perform this inline search on a word faster than the hashing algorithm can hash it, the trie should come out ahead.
However, this is total overkill. I rambled on since you said it was purely hypothetical, I figured you'd like a hypothetical-type of answer. Go with the most maintainable solution that performs the task in a reasonable amount of time. Micro-optimizations typically waste more time in man-hours than they save in CPU-hours.
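For the curious, a counting trie of the kind described above might be sketched like this in Python. It is a plain trie with a count on the node where each word ends; it does not implement the prefix-skipping optimization, and the names `TrieNode`, `count_words` and `lookup` are made up for this example.

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}   # character -> child TrieNode
        self.count = 0       # number of words ending exactly at this node


def count_words(words):
    """Build a trie from an iterable of words and return its root."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            child = node.children.get(ch)
            if child is None:
                child = node.children[ch] = TrieNode()
            node = child
        node.count += 1
    return root


def lookup(root, word):
    """Return how many times word was inserted (0 if never)."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.count


root = count_words(["tortoise", "tort", "tortoise", "cat"])
```

In an interpreted language the per-character node hops are likely to cost more than one call to a built-in hash table, so this mainly serves to show the structure.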
I'd use a Dictionary object where the key is word converted to lower case and the value is the count. If the dictionary doesn't contain the word, add it with a value of 1. If it does contain the word, increment the value.
Given slow reading, it's probably not going to make any noticeable difference. The overall time will be completely dominated by the time to read the data anyway, so that's what you should work at optimizing. For the algorithm (mostly data structure, really) in memory, just use whatever happens to be most convenient in the language you find most comfortable.
A hash table is (if done right, and you said you had lots of RAM) O(1) to count a particular word, while a trie is going to be O(n) where n is the length of the word.
With a sufficiently large hash space, you'll get much better performance from a hash table than from a trie.
I think that a trie is overkill for your use case. A hash of word => # of occurrences is exactly what I would use. Even using a slow interpreted language like Perl, you can munge a 1GB file this way in just a few minutes. (I've done this before.)
I have enough RAM for a hash array, but I'm thinking that a trie would have as fast or faster lookups.
How many times will this code be run? If you're just doing it once, I'd say optimize for your time rather than your CPU's time, and just do whatever's fastest to implement (within reason). If you have a standard library function that implements a key-value interface, just use that.
If you're doing it many times, then grab a subset (or several subsets) of the data file, and benchmark your options. Without knowing more about your data set, it'd be dubious to recommend one over another.
Use Python!
Add each word to a set as you go line by line, before asking whether it is in the hash table. If the word is already in the set, add it to the dictionary with a value of 2 (since you already added it to the set once before), and increment that value on later occurrences.
This takes some of the memory and computation away from querying the dictionary every single time, and handles words that occur only once better. At the end, emit every word that is in the set but not in the dictionary with a value of 1 (that is, take the difference between the set and the dictionary's keys).
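One possible reading of this set-plus-dictionary scheme is sketched below. The function name `count_with_seen_set` is made up, and whether this actually beats a single dictionary in time or memory is an open question that would need benchmarking on the real data.

```python
def count_with_seen_set(words):
    """Count words, keeping dictionary entries only for words seen at least twice."""
    seen = set()     # every word encountered at least once
    repeats = {}     # only words encountered two or more times
    for word in words:
        if word in seen:
            # First repeat yields 2; later repeats keep incrementing.
            repeats[word] = repeats.get(word, 1) + 1
        else:
            seen.add(word)
    # Words in the set but not in the dictionary occurred exactly once.
    counts = {word: 1 for word in seen if word not in repeats}
    counts.update(repeats)
    return counts


result = count_with_seen_set(["a", "b", "a", "c", "a"])
```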
To a large extent, it depends on what you want to do with the data once you've captured it. See Why Use a Hash Table over a Trie (Prefix Tree)?
A simple Python script:
import collections

counts = collections.defaultdict(int)
# Read the file once, line by line, counting each stripped word.
with open('words.txt') as f:
    for line in f:
        counts[line.strip()] += 1
print("\n".join("%s: %d" % (word, count) for word, count in counts.items()))