I'm looking for a method of storing data within the order of a dictionary when it is being transmitted.
As the order of a dictionary doesn't matter, it provides an ideal place to store data that is likely to be overlooked.
For the purposes of this, the fact it's a dictionary doesn't matter, so I'll model it as a list.
I have a list of size 3 with values A, B, C and D.
The ideal amount of data I can store in this is log2(n!) where n=4 with is 4.58... so 4 bits.
There are a number of simple methods that approach n-1 bits that can be stored, for example a simple method for n-1 efficiency:
I have the same list as above, A..D.
I start with the first element
I place the next elements before or after it - each referring to a 1 or a 0.
For example:
000 -> DCBA
001 -> CBAD
010 -> DBAC
100 -> BACD
There are a few optimisations on this that would provide a few extra percent of bits stored, but I'd like to know (if possible) if there is a method that would approach the theoretical maximum, or at least provide a significant boost to the efficiency of this method.
For some more context, I am looking to store data in the order of HTTP request header fields.
I'm looking for an algorithm, not a piece of code, if possible.
I figured this out by using a quicksort style algorithm, however instead of comparing each element to the pivot, the next bit from the 'Datasource' is used.
As I'm answering my own question and there has been very little interest in this question, I'm not going into huge detail, but am happy to do so if asked.
Related
I'm trying to implement this paper
Browser Fingerprint Coding Methods Increasing the Effectiveness of User Identification in the Web Traffic
I got a couple of questions about the LHS algorithm in general and the proposed implementation:
The LSH algorithm it's used only when you have a lot of documents to compare with each other (because it is supposed to put the similar ones in the same bucket from what I got). If for example I have a new document and I want to calculate the similarity with the others, I have to relaunch the LHS algorithm from scratch, including the new document, correct?
In 'Mining of Massive Datasets, Ch3', it is said that for the LHS we should use one hash function per band. Each hash function creates n buckets.
So, for the first band, we are going to have n buckets. For the second band onward, Am I supposed to keep using the same hash function (so this way I keep using the same buckets as before) or another one (ending so with m>>n buckets)?
This question is related t the previous one. If I use the same hash function for all the bands, then I'll have n buckets. No problem here. But If I have to use more hash functions (one different function per row), I'm going to end up with a lot of different buckets. Am I supposed to measure the similarity for each pair in each bucket? (If I have to use only one hash function then here it's not a problem).
In the paper, I understood most of the algorithm except for its end.
Basically, two Signatures matrices are created (one for stable features and one for unstable features) via minhashing. Then, they use LSH on the first matrix to obtain a list of candidates pairs. So far so good.
What happens at the end? do they perform the LHS on the second matrix? How the result of the first LHS is used? I cannot see the relationship between the first and the second LHS.
The output of the final step is supposed to be a list of pairing candidates, right? and all that I have to do is performing Jaccard similarity on them and setting a threshold, right?
Thanks for your answers!
I got a partial answer to my question (still missing question 4)
No. You would keep the bucket structure and hash the new doc into it. Then compare with only those docs in one of the buckets it fell into.
No. You HAVE to use different hash functions and a different set of buckets for each hash function.
This is irrelevant because of the answer to (2).
While reading about Elias Gamma coding on wikipedia, I see it mentions that:
"Gamma coding is used in applications where the largest encoded value is not known ahead of time."
and that:
"It is used most commonly when coding integers whose upper-bound cannot be determined beforehand."
I don't really understand what is meant by these sentences, because whenever this algorithm is coded, the largest value of the test data or range of the test data would be known before hand. Any help is appreciated!
As far as I'm acquainted with Elias-gamma/delta encoding, the first sentence simply states that these compression methods are global, which means that it does not rely on the input data to generate the code. In other words, these methods do not need to process the input before performing the compression (as local methods do); it compresses the data with a function that does not depend on information from the database.
As for the second sentence, it may be taken as a guarantee that, although there may be some very large integers, the encoding will still perform well (and will represent such values with feasible amount of bytes, i.e., it is a universal method). Notice that, if you knew the biggest integer, some approaches (like minimal hashes) could perform better.
As a last consideration, the same page you referred to also states that:
Gamma coding is used in applications where the largest encoded value is not known ahead of time, or to compress data in which small values are much more frequent than large values.
This may be obtained by generating lists of differences from the original lists of integers, and passing such differences to be compressed instead. For example, in a list of increasing numbers, you could generate:
list: 1 5 29 32 35 36 37
diff: 1 4 24 3 3 1 1
Which will give you many more small numbers, and therefore a greater level of compression, than the first list.
I read this page about the time complexity of Scala collections. As it says, Vector's complexity is eC for all operations.
It made me wonder what Vector is. I read the document and it says:
Because vectors strike a good balance between fast random selections and fast random functional updates, they are currently the
default implementation of immutable indexed sequences. It is backed by
a little endian bit-mapped vector trie with a branching factor of 32.
Locality is very good, but not contiguous, which is good for very
large sequences.
As with everything else about Scala, it's pretty vague. How actually does Vector work?
The keyword here is Trie.
Vector is implemented as a Trie datastructure.
See http://en.wikipedia.org/wiki/Trie.
More precisely, it is a "bit-mapped vector trie". I've just found a consise enough description of the structure (along with an implementation - apparently in Rust) here:
https://bitbucket.org/astrieanna/bitmapped-vector-trie
The most relevant excerpt is:
A Bitmapped Vector Trie is basically a 32-tree. Level 1 is an array of size 32, of whatever data type. Level 2 is an array of 32 Level 1's. and so on, until: Level 7 is an array of 2 Level 6's.
UPDATE: In reply to Lai Yu-Hsuan's comment about complexity:
I will have to assume you meant "depth" here :-D. The legend for "eC" says "The operation takes effectively constant time, but this might depend on some assumptions such as maximum length of a vector or distribution of hash keys.".
If you are willing to consider the worst case, and given that there is an upper bound to the maximum size of the vector, then yes indeed we can say that the complexity is constant.
Say that we consider the maximum size to be 2^32, then this means that the worst case is 7 operations at most, in any case.
Then again, we can always consider the worst case for any type of collection, find an upper bound and say this is constant complexity, but for a list by example, this would mean a constant of 4 billions, which is not quite practical.
But Vector is the opposite, 7 operations being more than practical, and this is how we can afford to consider its complexity constant in practice.
Another way to look at this: we are not talking about log(2,N), but log(32,N). If you try to plot that you'll see it is practically an horizontal line. So pragmatically speaking you'll never be able to see much increase in processing time as the collection grows.
Yes, that's still not really constant (which is why it is marked as "eC" and not just "C"), and you'll be able to see a difference around short vectors (but again, a very small difference because the number of operations grows so much slowly).
The other answers re 'Trie' are good. But as a close approximation, just for quick understanding:
Vector internally uses a tree structure - not a binary tree, but a 32-ary tree
Each '32-way node' uses Array[32] and can store either 0-32 references to child nodes or 0-32 pieces of data
The tree is structured to be balanced in a certain way - it is "n" levels deep, but levels 1 to n-1 are "index-only levels" (100% child references; no data) and level n contains all the data (100% data; no child references). So if the number of elements of data is "d" then n = log-base-32(d) rounded upwards
Why this? Simple: for performance.
Instead of doing thousands/millions/gazillions of memory allocations for each individual data element, memory is allocated in 32 element chunks. Instead of walking miles deep to find your data, the structure is quite shallow - it's a very wide, short tree. E.g. 5 levels deep can contain 32^5 data elements (for 4 byte elements = 132GB i.e. pretty big) and each data access would lookup & walk through 5 nodes from the root (whereas a big array would use a single data access). The vector does not proactively allocat memory for all of Level n (data), - it allocates in 32 element chunks as needed. It gives read performance somewhat similar to a huge array, whilst having functional characteristics (power & flexibility & memory-efficiency) somewhat similar to a binary tree.
:)
These may be interesting for you:
Ideal Hash Trees by Phil Bagwell.
Implementing Persistent Vectors in Scala - Daniel Spiewak
More Persistent Vectors: Performance Analysis - Daniel Spiewak
Persistent data structures in Scala
We are looking for the computationally simplest function that will enable an indexed look-up of a function to be determined by a high frequency input stream of widely distributed integers and ranges of integers.
It is OK if the hash/map function selection itself varies based on the specific integer and range requirements, and the performance associated with the part of the code that selects this algorithm is not critical. The number of integers/ranges of interest in most cases will be small (zero to a few thousand). The performance critical portion is in processing the incoming stream and selecting the appropriate function.
As a simple example, please consider the following pseudo-code:
switch (highFrequencyIntegerStream)
case(2) : func1();
case(3) : func2();
case(8) : func3();
case(33-122) : func4();
...
case(10,000) : func40();
In a typical example, there would be only a few thousand of the "cases" shown above, which could include a full range of 32-bit integer values and ranges. (In the pseudo code above 33-122 represents all integers from 33 to 122.) There will be a large number of objects containing these "switch statements."
(Note that the actual implementation will not include switch statements. It will instead be a jump table (which is an array of function pointers) or maybe a combination of the Command and Observer patterns, etc. The implementation details are tangential to the request, but provided to help with visualization.)
Many of the objects will contain "switch statements" with only a few entries. The values of interest are subject to real time change, but performance associated with managing these changes is not critical. Hash/map algorithms can be re-generated slowly with each update based on the specific integers and ranges of interest (for a given object at a given time).
We have searched around the internet, looking at Bloom filters, various hash functions listed on Wikipedia's "hash function" page and elsewhere, quite a few Stack Overflow questions, abstract algebra (mostly Galois theory which is attractive for its computationally simple operands), various ciphers, etc., but have not found a solution that appears to be targeted to this problem. (We could not even find a hash or map function that considered these types of ranges as inputs, much less a highly efficient one. Perhaps we are not looking in the right places or using the correct vernacular.)
The current plan is to create a custom algorithm that preprocesses the list of interesting integers and ranges (for a given object at a given time) looking for shifts and masks that can be applied to input stream to help delineate the ranges. Note that most of the incoming integers will be uninteresting, and it is of critical importance to make a very quick decision for as large a percentage of that portion of the stream as possible (which is why Bloom filters looked interesting at first (before we starting thinking that their implementation required more computational complexity than other solutions)).
Because the first decision is so important, we are also considering having multiple tables, the first of which would be inverse masks (masks to select uninteresting numbers) for the easy to find large ranges of data not included in a given "switch statement", to be followed by subsequent tables that would expand the smaller ranges. We are thinking this will, for most cases of input streams, yield something quite a bit faster than a binary search on the bounds of the ranges.
Note that the input stream can be considered to be randomly distributed.
There is a pretty extensive theory of minimal perfect hash functions that I think will meet your requirement. The idea of a minimal perfect hash is that a set of distinct inputs is mapped to a dense set of integers in 1-1 fashion. In your case a set of N 32-bit integers and ranges would each be mapped to a unique integer in a range of size a small multiple of N. Gnu has a perfect hash function generator called gperf that is meant for strings but might possibly work on your data. I'd definitely give it a try. Just add a length byte so that integers are 5 byte strings and ranges are 9 bytes. There are some formal references on the Wikipedia page. A literature search in ACM and IEEE literature will certainly turn up more.
I just ran across this library I had not seen before.
Addition
I see now that you are trying to map all integers in the ranges to the same function value. As I said in the comment, this is not very compatible with hashing because hash functions deliberately try to "erase" the magnitude information in a bit's position so that values with similar magnitude are unlikely to map to the same hash value.
Consequently, I think that you will not do better than an optimal binary search tree, or equivalently a code generator that produces an optimal "tree" of "if else" statements.
If we wanted to construct a function of the type you are asking for, we could try using real numbers where individual domain values map to consecutive integers in the co-domain and ranges map to unit intervals in the co-domain. So a simple floor operation will give you the jump table indices you're looking for.
In the example you provided you'd have the following mapping:
2 -> 0.0
3 -> 1.0
8 -> 2.0
33 -> 3.0
122 -> 3.99999
...
10000 -> 42.0 (for example)
The trick is to find a monotonically increasing polynomial that interpolates these points. This is certainly possible, but with thousands of points I'm certain you'ed end up with something much slower to evaluate than the optimal search would be.
Perhaps our thoughts on hashing integers can help a little bit. You will also find there a hashing library (hashlib.zip) based on Bob Jenkins' work which deals with integer numbers in a smart way.
I would propose to deal with larger ranges after the single cases have been rejected by the hashing mechanism.
I am clustering a large set of points. Throughout the iterations, I want to avoid re-computing cluster properties if the assigned points are the same as the previous iteration. Each cluster keeps the IDs of its points. I don't want to compare them element wise, comparing the sum of the ID vector is risky (a small ID can be compensated with a large one), may be I should compare the sum of squares? Is there a hashing method in Matlab which I can use with confidence?
Example data:
a=[2,13,14,18,19,21,23,24,25,27]
b=[6,79,82,85,89,111,113,123,127,129]
c=[3,9,59,91,99,101,110,119,120,682]
d=[11,57,74,83,86,90,92,102,103,104]
So the problem is that if I just check the sum, it could be that cluster d for example, looses points 11,103 and gets 9,105. Then I would mistakenly think that there has been no change in the cluster.
This is one of those (very common) situations where the more we know about your data and application the better we are able to help. In the absence of better information than you provide, and in the spirit of exposing the weakness of answers such as this in that absence, here are a couple of suggestions you might reject.
One appropriate data structure for set operations is a bit-set, that is a set of length equal to the cardinality of the underlying universe of things in which each bit is set on or off according to the things membership of the (sub-set). You could implement this in Matlab in at least two ways:
a) (easy, but possibly consuming too much space): define a matrix with as many columns as there are points in your data, and one row for each cluster. Set the (cluster, point) value to true if point is a member of cluster. Set operations are then defined by vector operations. I don't have a clue about the relative (time) efficiency of setdiff versus rowA==rowB.
b) (more difficult): actually represent the clusters by bit sets. You'll have to use Matlab's bit-twiddling capabilities of course, but the pain might be worth the gain. Suppose that your universe comprises 1024 points, then you'll need an array of 16 uint64 values to represent the bit set for each cluster. The presence of, say, point 563 in a cluster requires that you set, for the bit set representing that cluster, bit 563 (which is probably bit 51 in the 9th element of the set) to 1.
And perhaps I should have started by writing that I don't think that this is a hashing sort of a problem, it's a set sort of a problem. Yeah, you could use a hash but then you'll have to program around the limitations of using a screwdriver on a nail (choose your preferred analogy).
If I understand correctly, to hash the ID's I would recommend using the matlab Java interface to use the Java hashing algorithms
http://docs.oracle.com/javase/1.4.2/docs/api/java/security/MessageDigest.html
You'll do something like:
hash = java.security.MessageDigest.getInstance('SHA');
Hope this helps.
I found the function
DataHash on FEX it is quiet fast for vectors and the strcmp on the keys is a lot faster than I expected.