Find global subscript midpoint - intersystems-cache

In Caché ObjectScript (InterSystems' dialect of MUMPS), is there a way to efficiently skip to the approximate midpoint, or to any other proportional point, of a global subscript range? "Equal" here meaning equal by number of records.
I want to divide the subscript key range into approximately equal chunks and then process each chunk in parallel.
Since the keys in a global are arranged in a B-tree of some kind, this should be a simple operation for the underlying storage engine, but I'm not sure whether there is an interface to do it.
I could do it by scanning the global's whole keyspace, but that would defeat the purpose of running the operation in parallel; a sequential scan takes hours on this global. I need the keyspace divided up BEFORE I begin scanning.
I want each thread to work on an approximately equal-sized, contiguous chunk of the keyspace and scan it individually; the problem is calculating what key range to give each thread.

You can use the second parameter, "direction" (1 or -1), of the $ORDER or $QUERY function.

For my particular need, I found that the application I'm using has what I would call an index global: another global maintained by the app, with different keys, that links back to the main table. I can scan it in a fraction of the time and break up the keyset from there.
If someone comes up with a way to do what I want given only the main global, I'll change the accepted answer to that.
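To make that concrete, here is a rough sketch of the chunking step, written in Scala purely for illustration (the helper name and signature are mine): once any cheap source of keys exists, the split itself is just "sort, take every k-th key as a boundary, hand each worker a contiguous range".

// Hypothetical helper: given keys gathered cheaply (e.g. from the index
// global), pick every k-th key as a boundary so each of `workers` threads
// gets a contiguous, roughly equal slice of the keyspace to scan.
def chunkBoundaries[K: Ordering](indexKeys: Seq[K], workers: Int): Seq[(K, K)] = {
  require(indexKeys.nonEmpty && workers > 0)
  val sorted = indexKeys.sorted
  val step   = math.max(1, sorted.size / workers)   // assumes more keys than workers
  val cuts   = ((sorted.indices by step).map(sorted) :+ sorted.last).distinct
  cuts.zip(cuts.tail)   // half-open (from, to) ranges; treat the last one as inclusive
}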

Related

How does postgres calculate multi column hashes?

In PG I have a table which uses PARTITION BY HASH (text, bigint). Looking at a previous answer, I can see the functions used to hash the individual values, but I'm unsure which function is used to build the combined hash.
Is it treating the partition keys as a record and using hash_record?
Ultimately I want to know this to rebuild the hashing function in Java to optimise reads and writes to specific partitions.
hash_combine64 is used to calculate the final hash value; according to the comments in the code, it is based on Boost's hash_combine approach. You can find the whole partition hash calculation in the compute_partition_hash_value function.
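Since you want to rebuild this in Java, here is a hedged Scala sketch (JVM semantics, so it carries straight over to Java) of how those two functions fold the per-column hashes together. The constants and shift amounts are written from my reading of hashfn.h and partbounds.c; verify them, and the per-type extended hash functions (hashtextextended, hashint8extended) that feed them, against the source of the PostgreSQL version you actually run.

// Illustrative port of the combining step; NOT taken verbatim from PostgreSQL.
object PgPartitionHashSketch {
  // hash_combine64: Boost-style combine on 64-bit values (unsigned in C;
  // two's-complement Long arithmetic plus >>> reproduces it on the JVM).
  def hashCombine64(a: Long, b: Long): Long =
    a ^ (b + 0x49a0f4dd15e5a8e3L + (a << 54) + (a >>> 7))

  // Seed passed to each column's extended hash function
  // (HASH_PARTITION_SEED as I recall it; please double-check).
  val HashPartitionSeed = 0x7A5B22367996DCFDL

  // compute_partition_hash_value, simplified: hash each non-null partition
  // key column with its type's extended hash function and the seed, then
  // fold the results together starting from 0.
  def rowHash(columnHashes: Seq[Long]): Long =
    columnHashes.foldLeft(0L)(hashCombine64)

  // Partition selection is then an unsigned modulus over the combined hash.
  def partitionIndex(hash: Long, partitions: Int): Int =
    java.lang.Long.remainderUnsigned(hash, partitions.toLong).toInt
}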

what is wrong with my extendible hashing solution?

I have my solution to an extendible hashing exercise below. I was wondering whether hashing it this way is a correct representation. I know the directory can also be indexed by a binary representation, where you rehash and increase the global depth every time a clash happens. But is this also a correct representation?
There is an extendable hash table with leaf size M=4 entries, and with the directory initially indexed using two senior bits. Consider insertion, into an initially empty table, of the keys that hash into the following values: 0100010, 0100100, 1000000, 0110101, 0101111, 1000001, 0100000, 1001000, 1001001, 1000010.
A. Show the state of the table after the first 8 insertions.
B. Show the state of the table after the remaining two insertions.
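For reference, here is a tiny Scala sketch (helper name mine) of the "two senior bits" directory indexing the exercise describes, assuming the 7-bit hash values listed above:

// Directory slot taken from the senior (most significant) bits of a
// 7-bit hash; the slot width grows with the global depth as buckets split.
def dirIndex(hash: Int, globalDepth: Int, hashBits: Int = 7): Int =
  (hash >>> (hashBits - globalDepth)) & ((1 << globalDepth) - 1)

// With globalDepth = 2, hash 0100010 (binary, = 34) falls in directory slot 01 = 1.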

why are orientdb index sizes on disk so large

orientdb 2.0.5
I have a database in which we create a non-unique index on 2 properties of a class called indexstat.
The two properties which make up the index are a string identifier and a long timestamp.
Data is created in batches of a few hundred records every 5 minutes. After a few hours, old records are deleted.
The files related to that table are listed below.
Question:
Why is the .irs file, which according to the documentation is related to non-unique indexes, so monstrously huge after a few hours: 298056704 bytes larger than the actual data (.irs size - .pcl size - .cpm size)?
I would think the index would be smaller than the actual data.
Second question:
What is best practice here? Should I be using unique indexes instead of non-unique ones? Should I find a way to make the data in the index smaller (e.g. use longs instead of strings as identifiers)?
Below are the file names and the size of each:
indexstat.cpm      727778304
indexstatidx.irs  1799095296
indexstatidx.sbt      263168
indexstat.pcl      773260288
This is repeated for a few tables where the index size is larger than the database data.
The internals of *.irs files are organised in such a way that when you delete something from an index, an unused hole is left in the file. At some point, when about half of the file space is wasted, those unused holes become available again for reuse and allocation. This is done for performance reasons, to lower index data fragmentation. In your case it means that sooner or later the *.irs file will stop growing, and its maximum size should be around 2-3 times larger than the maximum observed size of the corresponding *.pcl file, assuming a single stat record is not much bigger than the id-timestamp pair.
Regarding the second question, in the long run it is almost always better to use the most specific/strict data types to model the data and the most specific/strict index types to index it.
This link shows a discussion about the index file; maybe it can help you.
For the second question, the index should be chosen according to your purpose and your data (not vice versa). The data type (long, string) should be the one that best represents your fields; for example, if an integer is sufficient for your scope, there is no point in using a long. The same goes for the index: if you need to allow duplicates, the choice will be non-unique; if you need range queries, choose an SB-tree index instead of a hash index; and so on.

collection.mutable.OpenHashMap vs collection.mutable.HashMap

For put and get operations, OpenHashMap outperforms HashMap by about 5 times: https://gist.github.com/1423303
Are there any cases when HashMap should be preferred over OpenHashMap?
Your code exactly matches one of the use cases for OpenHashMap. Your code:
println ("scala OpenHashMap: " + time (warmup) {
val m = new scala.collection.mutable.OpenHashMap[Int,Int];
var i = 0;
var start = System.currentTimeMillis();
while(i<100000) { m.put(i,i);i=i+1;};
})
The explanation for OpenHashMap (scaladoc):
A mutable hash map based on an open hashing scheme. The precise scheme
is undefined, but it should make a reasonable effort to ensure that an
insert with consecutive hash codes is not unnecessarily penalised. In
particular, mappings of consecutive integer keys should work without
significant performance loss.
My emphasis, which explains your findings. When should you use OpenHashMap rather than HashMap? See Wikipedia; from there:
Chained hash tables with linked lists are popular because they require
only basic data structures with simple algorithms, and can use simple
hash functions that are unsuitable for other methods.
The cost of a table operation is that of scanning the entries of the
selected bucket for the desired key. If the distribution of keys is
sufficiently uniform, the average cost of a lookup depends only on the
average number of keys per bucket—that is, on the load factor.
Chained hash tables remain effective even when the number of table
entries n is much higher than the number of slots. Their performance
degrades more gracefully (linearly) with the load factor. For example,
a chained hash table with 1000 slots and 10,000 stored keys (load
factor 10) is five to ten times slower than a 10,000-slot table (load
factor 1); but still 1000 times faster than a plain sequential list,
and possibly even faster than a balanced search tree.
For separate-chaining, the worst-case scenario is when all entries
were inserted into the same bucket, in which case the hash table is
ineffective and the cost is that of searching the bucket data
structure. If the latter is a linear list, the lookup procedure may
have to scan all its entries; so the worst-case cost is proportional
to the number n of entries in the table.
This is a generic explanation. As ever with these things, your performance will vary depending upon the use case; if you care about it, you need to measure it.
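For what it's worth, here is a minimal, deliberately unscientific measuring sketch along those lines, pasteable into the REPL (the timeIt helper is mine; for trustworthy numbers use a proper harness such as JMH):

import scala.collection.mutable

// Crude wall-clock timer; run each block several times so the JIT warms up.
def timeIt[A](label: String)(body: => A): A = {
  val start  = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
  result
}

val n = 100000
timeIt("HashMap put") {
  val m = new mutable.HashMap[Int, Int]
  var i = 0
  while (i < n) { m.put(i, i); i += 1 }
  m
}
timeIt("OpenHashMap put") {
  val m = new mutable.OpenHashMap[Int, Int]
  var i = 0
  while (i < n) { m.put(i, i); i += 1 }
  m
}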

Merging huge sets (HashSet) in Scala

I have two huge (as in millions of entries) sets (HashSet) that have some (<10%) overlap between them. I need to merge them into one set (I don't care about maintaining the original sets).
Currently, I am adding all items of one set to the other with:
setOne ++= setTwo
This takes several minutes to complete (after several attempts at tweaking hashCode() on the members).
Any ideas how to speed things up?
You can get slightly better performance with the Parallel Collections API in Scala 2.9.0+:
setOne.par ++ setTwo
or
(setOne.par /: setTwo)(_ + _)
There are a few things you might want to try:
Use the sizeHint method to keep your sets at the expected size.
Call useSizeMap(true) on it to get better hash table resizing.
It seems to me that the latter option gives better results, though both show improvements on tests here.
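Whether those exact methods are available depends on your Scala version; as a hedged sketch of the same pre-sizing idea that works on current versions, you can route the merge through a pre-sized builder:

// Pre-size a builder so the underlying hash table is not grown repeatedly
// while the two sets are copied in (illustrative; not the answer's exact API).
def mergePreSized[A](setOne: collection.Set[A], setTwo: collection.Set[A]): collection.mutable.HashSet[A] = {
  val builder = collection.mutable.HashSet.newBuilder[A]
  builder.sizeHint(setOne.size + setTwo.size)
  builder ++= setOne
  builder ++= setTwo
  builder.result()
}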
Can you tell me a little more about the data inside the sets? The reason I ask is that for this kind of thing, you usually want something a bit specialized. Here are a few things that can be done:
If the data is (or can be) sorted, you can walk pointers to do a merge, similar to what's done in merge sort; a sketch of the single-threaded merge appears below. This operation is pretty trivially parallelizable, since you can partition one data set and then partition the second data set using binary search to find the correct boundary.
If the data is within a certain numeric range, you can instead use a bitset and just set bits whenever you encounter that number.
If one of the data sets is smaller than the other, you could put it in a hash set and loop over the other dataset quickly, checking for containment.
I have used the first strategy to create a gigantic set of about 8 million integers from about 40k smaller sets in about a second (on beefy hardware, in Scala).
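Coming back to the first strategy, here is a minimal single-threaded sketch of the merge itself (Int keys assumed for illustration; the parallel partitioning via binary search is left out):

// Linear-time merge of two sorted, de-duplicated arrays; no hashing involved.
def mergeSorted(a: Array[Int], b: Array[Int]): Array[Int] = {
  val out = Array.newBuilder[Int]
  var i = 0
  var j = 0
  while (i < a.length && j < b.length) {
    if (a(i) < b(j))      { out += a(i); i += 1 }
    else if (a(i) > b(j)) { out += b(j); j += 1 }
    else                  { out += a(i); i += 1; j += 1 } // equal: keep one copy
  }
  while (i < a.length) { out += a(i); i += 1 }
  while (j < b.length) { out += b(j); j += 1 }
  out.result()
}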