How do feature bits work in Vowpal Wabbit (-b hash)?

I am relatively new to Vowpal Wabbit and would like to find out about the -b parameter (the number of bits in the feature table).
My training data look something like this. There are about 1 million unique words in total.
1 | a = "word" b ="word131232" c="word1233" d = "word123124" e = "word23145"
However, each row only has 5 features. How many bits should I use? When I run it, the number of features reported seems to increase with the number of examples, and I do not understand why.

If you use -b 18 (which is the default), the features will be hashed into a table with 2^18 items, so if the number of unique features in your dataset is close to 2^18 (or even higher), you should increase the parameter -b so there are not so many hash collisions. There is no easy way to detect the number of collisions, but the common practice is to tune the parameter -b for the best progressive validation loss (or holdout loss if you use more passes). Of course, it also depends on the available memory on your machine.
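For intuition, here is a rough sizing sketch in Python (the 0.25 load factor is my rule-of-thumb assumption, not a vw default; the final choice should still be tuned on validation loss):
import math

def suggested_bits(n_unique_features, load_factor=0.25):
    # Choose b so the 2^b hash table is at most `load_factor` full,
    # which keeps collisions rare for roughly uniform hashing.
    return math.ceil(math.log2(n_unique_features / load_factor))

print(suggested_bits(10**6))   # 22, i.e. try vw with -b 22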
1 | a = "word" b ="word131232" c="word1233" d = "word123124" e = "word23145"
Note that this example is wrong (not what you intended) because of the spaces around =. The equal sign has no special meaning (unlike the colon, which is used to separate the feature value). Feature names cannot contain spaces, and there is no need to enclose them in quotes. So the example should look like
1 | word word131232 word1233 word123124 word23145
If the prefixes a, b, c, d, e have some special meaning (i.e. a=word42 should be a different feature than b=word42), you can use:
1 | a=word b=word131232 c=word1233 d=word123124 e=word23145
If all your words are already mapped to integers (within the range 0-2^b), you can use them directly as feature names and no hashing will be done (unless you specify --hash=all):
1 | 0 131232 1233 123124 23145
See the wiki page about input format.
the number of features reported seems to increase
In the progress report (printed by default after every 2^x-th example), the last column shows current features, which is the number of features in the current example (including the constant feature and quadratic/cubic/... features if you use them), and it should not be increasing (unless you have such strange data).
In the final report, vw prints total feature number, which is the average number of features per example times the number of examples times the number of passes (so it is not the number of unique features in the dataset).
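For example, with the question's numbers (a sketch; the exact per-example count depends on your options):
features_per_example = 5 + 1          # 5 named features plus the constant feature
examples, passes = 10**6, 1
print(features_per_example * examples * passes)   # 6000000, grows with the number of examples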

Related

Uniqueness of UUID substring

We track an internal entity with java.util generated UUID. New requirement is to pass this object to a third party who requires a unique identifier with a max character limit of 11. In lieu of generating, tracking and mapping an entirely new unique ID we are wondering if it is viable to use a substring of the UUID as a calculated field. The number of records is at most 10 million.
java.util.UUID.randomUUID().toString() // code used to generate
Quotes from other resources (incl. SOF):
"....only after generating 1 billion UUIDs every second for approximately 100 years would the probability of creating a single duplicate reach 50%."
"Also be careful with generating longer UUIDs and substring-ing them, since some parts of the ID may contain fixed bytes (e.g. this is the case with MAC, DCE and MD5 UUIDs)."
We will check out existing IDs' substrings for duplicates. What are the chances the substring would generate a duplicate?
This is an instance of the Birthday Problem. One formulation of B.P.: Given a choice of n values sampled randomly with replacement, how many values can we sample before the same value will be seen at least twice with probability p?
For the classic instance of the problem,
p = 0.5, n = the 365 days of the year
and the answer is 23. In other words, the odds are 50% that two people share the same birthday when you are surveying 23 people.
You can plug in
n = the number of possible UUIDs
instead to get that kind of cosmically large sample size required for a 50% probability of a collision — something like the billion-per-second figure. It is
n = 16^32
for a 32-character string of 16 case-insensitive hex digits.
The B.P. is a relatively expensive problem to compute, as there is no known closed-form formula for it. In fact, I just tried it for your 11-character substring (n = 16^11) on Wolfram Alpha Pro, and it timed out.
However, I found an efficient implementation of a closed-form estimate here. And here's my adaptation of the Python:
import math

def find(p, n):
    # Smallest number of samples at which a collision occurs with
    # probability p when drawing uniformly from n values
    # (square-root birthday approximation).
    return math.ceil(math.sqrt(2 * n * math.log(1 / (1 - p))))
If I plug in the classic B.P. numbers, I get an answer of 23, which is right. For the full UUID numbers,
find(.5, math.pow(16, 32)) / 365 / 24 / 60 / 60 / 100
my result is actually close to 7 billion UUIDs per second for 100 years! Maybe this estimate is too coarse for large numbers, though I don't know what method your source used.
For the 11-character string? You only have to generate about 5 million IDs in total to reach a 50% chance of a collision. For a 1% chance, it's only about 600,000 in total. And that is probably overestimating safety compared to your source, an error we already committed by assuming the substring is random.
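Using the find function above to reproduce those numbers:
find(0.5, 16 ** 11)    # ~4.9 million IDs for a 50% collision chance
find(0.01, 16 ** 11)   # ~600,000 IDs for a 1% collision chance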
My engineering advice: Do you really need the guarantees that UUIDs provide aside from uniqueness, such as non-enumerability, and assurance against collisions in a distributed context? If not, then just use a sequential ID, and avoid these complications.

Calculate significance for DNA binding motifs found vs. expected in MATLAB

I have a set of, say, 100 genomic features for which I've created a fasta file with a 500 bp window around each. I've searched these windows for a DNA sequence and found an average of 1.5 sequences per individual 500 bp window in the feature set. By chance, I expect the sequence to be present once every 1024 bp, or on average ~0.49 of my sequence per 500 bp window.
My question is how can I determine whether the 1.5 binding sites per individual feature I've uncovered is significant or not, and obtain a p-value?
And as a follow up, if I use the same set of 100 windows and search for a different sequence with the same probability (1/1024) and determine that there are now an average of 0.9 sequences per individual window, how can I determine whether this is significantly different than the 1.5 for the sequence for which I searched above?
As a second follow up, if I search for the same two sequences above (both found on average 1/1024 base pairs) in a different set of 500 bp windows for a different feature type (say, n=50), how can I determine if the results of this search are significantly different than the results above (particularly if the difference between sequence A and sequence B in feature set 1 and feature set 2 is significant)?
Thank you in advance.
I ended up using simulations to answer all of the above questions. Generate windows of the desired size (500 bp in this case) of random genomic sequence. Search for motifs in X windows (where X = the number of individuals in a feature set) and compare with the results obtained by searching for motifs in the features of interest. Repeat with a sample size equal to that of the second feature set being analyzed. To compare features with one another, do the analogous simulation and compare the results.
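Here is a sketch of that simulation in Python (the question mentions MATLAB, but the logic carries over; the 5-bp motif GATCA is a made-up example, chosen because any fixed 5-mer has the stated 1/1024 per-position probability):
import random

MOTIF = "GATCA"   # any fixed 5-mer occurs with probability 1/4^5 = 1/1024 per position

def count_motif(window, motif=MOTIF):
    # Count (possibly overlapping) occurrences of the motif in one window.
    return sum(window.startswith(motif, i) for i in range(len(window) - len(motif) + 1))

def empirical_p(observed_mean, n_windows=100, window_len=500, n_sims=1000):
    # One-sided empirical p-value: the fraction of simulated feature sets
    # whose mean motif count reaches the observed mean.
    hits = 0
    for _ in range(n_sims):
        total = sum(count_motif("".join(random.choices("ACGT", k=window_len)))
                    for _ in range(n_windows))
        hits += (total / n_windows) >= observed_mean
    return hits / n_sims

print(empirical_p(1.5))   # p-value for 1.5 motifs/window over 100 windows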

In preprocessing data with high cardinality, do you hash first or one-hot-encode first?

Hashing reduces dimensionality while one-hot-encoding essentially blows up the feature space by transforming multi-categorical variables into many binary variables. So it seems like they have opposite effects. My questions are:
What is the benefit of doing both on the same dataset? I read something about capturing interactions but not in detail - can somebody elaborate on this?
Which one comes first and why?
Binary one-hot-encoding is needed for feeding categorical data to linear models and SVMs with the standard kernels.
For example, you might have a feature which is the day of the week. Then you create a one-hot encoding for it:
1000000 Sunday
0100000 Monday
0010000 Tuesday
...
0000001 Saturday
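As a minimal sketch in Python (names are illustrative):
DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]

def one_hot(day):
    # Return a 7-element binary vector with a single 1 at the day's index.
    return [1 if d == day else 0 for d in DAYS]

print(one_hot("Monday"))   # [0, 1, 0, 0, 0, 0, 0]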
Feature hashing is mostly used to allow significant storage compression for parameter vectors: one hashes the high-dimensional input vectors into a lower-dimensional feature space. The parameter vector of the resulting classifier can therefore live in the lower-dimensional space instead of in the original input space. This can be used as a method of dimensionality reduction, so you usually expect to trade a small drop in performance for a significant storage benefit.
The example in wikipedia is a good one. Suppose your have three documents:
John likes to watch movies.
Mary likes movies too.
John also likes football.
Using a bag-of-words model, you first create the document-to-words matrix below (each row is a document; each entry in the matrix indicates whether a word appears in the document):
       John likes to watch movies Mary too also football
doc1:   1    1    1    1     1     0   0    0     0
doc2:   0    1    0    0     1     1   1    0     0
doc3:   1    1    0    0     0     0   0    1     1
The problem with this process is that such dictionaries take up a large amount of storage space, and grow in size as the training set grows.
Instead of maintaining a dictionary, a feature vectorizer that uses the hashing trick can build a vector of a pre-defined length by applying a hash function h to the features (e.g., words) in the items under consideration, then using the hash values directly as feature indices and updating the resulting vector at those indices.
Suppose you generate the hashed features below with 3 buckets (apply a hash function h to each original feature and count how many times each bucket is hit).
bucket1 bucket2 bucket3
doc1: 3 2 0
doc2: 2 2 0
doc3: 1 0 2
Now you successfully transformed the features in 9-dimensions to 3-dimensions.
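A minimal sketch of that transformation (md5 stands in for whatever hash function h is actually used, so the exact counts need not match the table above):
import hashlib

def bucket(token, n_buckets=3):
    # Stable bucket assignment; Python's built-in hash() is salted per process.
    return int(hashlib.md5(token.encode()).hexdigest(), 16) % n_buckets

def hash_vector(tokens, n_buckets=3):
    # Count how many of the document's tokens fall into each bucket.
    vec = [0] * n_buckets
    for t in tokens:
        vec[bucket(t, n_buckets)] += 1
    return vec

docs = ["John likes to watch movies", "Mary likes movies too", "John also likes football"]
print([hash_vector(d.split()) for d in docs])   # three 3-dimensional vectors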
A more interesting application of feature hashing is to do personalization. The original paper of feature hashing contains a nice example.
Imagine you want to design a spam filter customized to each user. The naive way of doing this is to train a separate classifier for each user, which is infeasible for both training (training and updating each personalized model) and serving (holding all the classifiers in memory). A smarter way is illustrated below:
Each token is duplicated and one copy is individualized by concatenating each word with a unique user id. (See USER123_NEU and USER123_Votre).
The bag-of-words model now holds the common keywords and also the user-specific keywords.
All words are then hashed into a low-dimensional feature space where the document is trained and classified.
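A sketch of the token-duplication step (the user id and naming format are illustrative):
def personalize(tokens, user_id):
    # Keep the global copy of each token and add a user-specific copy;
    # both copies are then hashed into the same low-dimensional space.
    return tokens + [f"{user_id}_{t}" for t in tokens]

print(personalize(["NEU", "Votre"], "USER123"))
# ['NEU', 'Votre', 'USER123_NEU', 'USER123_Votre']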
Now to answer your questions:
Yes, one-hot-encoding should come first, since it transforms a categorical feature into binary features that linear models can consume.
You can certainly apply both to the same dataset, as long as there is a benefit to using the compressed feature space. Note that if you can tolerate the original feature dimension, feature hashing is not required. For example, in a common digit-recognition problem such as MNIST, the image is represented by 28x28 pixels, so the input dimension is only 784; feature hashing won't have any benefit in this case.

Elias Gamma Coding and upper bound

While reading about Elias Gamma coding on wikipedia, I see it mentions that:
"Gamma coding is used in applications where the largest encoded value is not known ahead of time."
and that:
"It is used most commonly when coding integers whose upper-bound cannot be determined beforehand."
I don't really understand what is meant by these sentences, because whenever this algorithm is coded, the largest value or the range of the test data would be known beforehand. Any help is appreciated!
As far as I'm acquainted with Elias gamma/delta encoding, the first sentence simply states that these compression methods are global, which means that they do not rely on the input data to generate the code. In other words, these methods do not need to process the input before performing the compression (as local methods do); they compress the data with a function that does not depend on information from the database.
As for the second sentence, it may be taken as a guarantee that, although there may be some very large integers, the encoding will still perform well (and will represent such values with a feasible number of bits, i.e., it is a universal method). Notice that, if you knew the biggest integer, some approaches (like minimal hashes) could perform better.
As a last consideration, the same page you referred to also states that:
Gamma coding is used in applications where the largest encoded value is not known ahead of time, or to compress data in which small values are much more frequent than large values.
This may be exploited by generating lists of differences from the original lists of integers and compressing those differences instead. For example, given a list of increasing numbers, you could generate:
list: 1 5 29 32 35 36 37
diff: 1 4 24 3 3 1 1
Which will give you many more small numbers, and therefore a greater level of compression, than the first list.
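For concreteness, a small Elias gamma encoder applied to that diff list (plain Python, with bit strings instead of packed bits):
def elias_gamma(n):
    # N-1 zeros followed by the N-bit binary representation of n (n >= 1).
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

print([elias_gamma(d) for d in [1, 4, 24, 3, 3, 1, 1]])
# ['1', '00100', '000011000', '011', '011', '1', '1'] -- small values stay short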

Efficient Function to Map (or Hash) Integers and Integer Ranges into Index

We are looking for the computationally simplest function that will enable an indexed look-up of a function, with the index determined by a high-frequency input stream of widely distributed integers and ranges of integers.
It is OK if the hash/map function selection itself varies based on the specific integer and range requirements, and the performance associated with the part of the code that selects this algorithm is not critical. The number of integers/ranges of interest in most cases will be small (zero to a few thousand). The performance critical portion is in processing the incoming stream and selecting the appropriate function.
As a simple example, please consider the following pseudo-code:
switch (highFrequencyIntegerStream)
    case(2)      : func1();
    case(3)      : func2();
    case(8)      : func3();
    case(33-122) : func4();
    ...
    case(10,000) : func40();
In a typical example, there would be only a few thousand of the "cases" shown above, which could include a full range of 32-bit integer values and ranges. (In the pseudo code above 33-122 represents all integers from 33 to 122.) There will be a large number of objects containing these "switch statements."
(Note that the actual implementation will not include switch statements. It will instead be a jump table (which is an array of function pointers) or maybe a combination of the Command and Observer patterns, etc. The implementation details are tangential to the request, but provided to help with visualization.)
Many of the objects will contain "switch statements" with only a few entries. The values of interest are subject to real time change, but performance associated with managing these changes is not critical. Hash/map algorithms can be re-generated slowly with each update based on the specific integers and ranges of interest (for a given object at a given time).
We have searched around the internet, looking at Bloom filters, various hash functions listed on Wikipedia's "hash function" page and elsewhere, quite a few Stack Overflow questions, abstract algebra (mostly Galois theory which is attractive for its computationally simple operands), various ciphers, etc., but have not found a solution that appears to be targeted to this problem. (We could not even find a hash or map function that considered these types of ranges as inputs, much less a highly efficient one. Perhaps we are not looking in the right places or using the correct vernacular.)
The current plan is to create a custom algorithm that preprocesses the list of interesting integers and ranges (for a given object at a given time), looking for shifts and masks that can be applied to the input stream to help delineate the ranges. Note that most of the incoming integers will be uninteresting, and it is of critical importance to make a very quick decision for as large a percentage of that portion of the stream as possible (which is why Bloom filters looked interesting at first, before we started thinking that their implementation required more computational complexity than other solutions).
Because the first decision is so important, we are also considering having multiple tables, the first of which would be inverse masks (masks to select uninteresting numbers) for the easy to find large ranges of data not included in a given "switch statement", to be followed by subsequent tables that would expand the smaller ranges. We are thinking this will, for most cases of input streams, yield something quite a bit faster than a binary search on the bounds of the ranges.
Note that the input stream can be considered to be randomly distributed.
There is a pretty extensive theory of minimal perfect hash functions that I think will meet your requirement. The idea of a minimal perfect hash is that a set of distinct inputs is mapped to a dense set of integers in a 1-1 fashion. In your case, a set of N 32-bit integers and ranges would each be mapped to a unique integer in a range whose size is a small multiple of N. GNU has a perfect hash function generator called gperf that is meant for strings but might possibly work on your data; I'd definitely give it a try. Just add a length byte so that integers are 5-byte strings and ranges are 9 bytes. There are some formal references on the Wikipedia page. A literature search in the ACM and IEEE literature will certainly turn up more.
I just ran across this library I had not seen before.
Addition
I see now that you are trying to map all integers in the ranges to the same function value. As I said in the comment, this is not very compatible with hashing, because hash functions deliberately try to "erase" the magnitude information in a bit's position, so that values with similar magnitudes are unlikely to map to the same hash value.
Consequently, I think that you will not do better than an optimal binary search tree, or equivalently a code generator that produces an optimal "tree" of "if else" statements.
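To illustrate, here is a binary search over the question's cases in Python (the funcN names are the question's placeholders; the flattening into sorted interval start points is my own bookkeeping):
import bisect

def func1(x): print("func1", x)
def func2(x): print("func2", x)
def func3(x): print("func3", x)
def func4(x): print("func4", x)
def func40(x): print("func40", x)

# Flatten the cases into half-open intervals sorted by start point;
# None marks the "uninteresting" gaps between cases.
bounds   = [2, 3, 4, 8, 9, 33, 123, 10000, 10001]
handlers = [func1, func2, None, func3, None, func4, None, func40, None]

def dispatch(x):
    i = bisect.bisect_right(bounds, x) - 1   # O(log n) per lookup
    if i >= 0 and handlers[i] is not None:
        handlers[i](x)

dispatch(100)   # falls in 33-122 -> func4
dispatch(500)   # uninteresting  -> no call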
If we wanted to construct a function of the type you are asking for, we could try using real numbers where individual domain values map to consecutive integers in the co-domain and ranges map to unit intervals in the co-domain. So a simple floor operation will give you the jump table indices you're looking for.
In the example you provided you'd have the following mapping:
2 -> 0.0
3 -> 1.0
8 -> 2.0
33 -> 3.0
122 -> 3.99999
...
10000 -> 42.0 (for example)
The trick is to find a monotonically increasing polynomial that interpolates these points. This is certainly possible, but with thousands of points I'm certain you'd end up with something much slower to evaluate than the optimal search would be.
Perhaps our thoughts on hashing integers can help a little bit. You will also find there a hashing library (hashlib.zip) based on Bob Jenkins' work which deals with integer numbers in a smart way.
I would propose dealing with the larger ranges only after the single cases have been rejected by the hashing mechanism.