How does Perl resolve possible hash collisions in hashes?

As we know, Perl implements its 'hash' type as a table with calculated indexes, where these indexes are truncated hashes.
As we also know, a hashing function can (and, by probability, eventually will) collide, giving the same hash to two or more different inputs.
Then: how does the Perl interpreter handle it when it finds that a key generated the same hash as another key? Does it handle it at all?
Note: This is not about the algorithm for hashing but about collision resolution in a hash table implementation.

A Perl hash is an array of linked lists.
+--------+       +--------+
|    ----------->|        |
+--------+       +--------+
|        |       |  key1  |
+--------+       +--------+
|    --------+   |  val1  |
+--------+   |   +--------+
|        |   |
+--------+   |   +--------+       +--------+
             +-->|    ----------->|        |
                 +--------+       +--------+
                 |  key2  |       |  key3  |
                 +--------+       +--------+
                 |  val2  |       |  val3  |
                 +--------+       +--------+
The hashing function produces a value which is used as the array index, then a linear search of the associated linked list is performed.
This means the worst-case lookup is O(N). So why do people say it's O(1)? You could claim O(1) if you kept the lists from exceeding some constant length, and that's what Perl does. It uses two mechanisms to achieve this:
Increasing the number of buckets.
Perturbing the hashing algorithm.
Doubling the number of buckets should cut the number of entries in a given bucket in half, on average. For example,
305419896 % 4 = 0 and 943086900 % 4 = 0
305419896 % 8 = 0 and 943086900 % 8 = 4
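The same arithmetic in a couple of lines of Scala (purely illustrative, not Perl internals): both keys share a bucket while there are 4 buckets, and end up in different buckets once the table doubles to 8.

List(305419896, 943086900).map(_ % 4)  // List(0, 0) -> same chain
List(305419896, 943086900).map(_ % 8)  // List(0, 4) -> chain is split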
However, a malicious actor could choose values where this doesn't happen. This is where the hash perturbation comes into play. Each hash has its own random number that perturbs (causes variances in) the output of the hashing algorithm. Since the attacker can't predict the random number, they can't choose values that will cause collisions. When needed, Perl can rebuild the hash using a new random number, causing keys to map to different buckets than before, and thus breaking down long chains.
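To make the bucket-and-chain structure concrete, here is a minimal sketch in Scala; it is not Perl's actual C implementation from hv.c and it leaves out the per-hash random perturbation, but it shows the same collision handling: colliding keys share a bucket's chain, lookups scan that chain linearly, and the bucket array doubles once there are more entries than buckets.

// Minimal chained hash table: an array of buckets, each holding a list of entries.
class ChainedHash[K, V](initialBuckets: Int = 8) {
  private var buckets: Array[List[(K, V)]] = Array.fill(initialBuckets)(Nil)
  private var entries = 0

  // Truncate the hash to a bucket index (drop the sign bit first).
  private def index(key: K, size: Int): Int = (key.hashCode & Int.MaxValue) % size

  // Linear search of the chain that the key hashes to.
  def get(key: K): Option[V] =
    buckets(index(key, buckets.length)).collectFirst { case (k, v) if k == key => v }

  def put(key: K, value: V): Unit = {
    val i = index(key, buckets.length)
    val rest = buckets(i).filterNot(_._1 == key)
    if (rest.length == buckets(i).length) entries += 1       // key was not present yet
    buckets(i) = (key, value) :: rest
    if (entries > buckets.length) resize(buckets.length * 2)  // keep chains short
  }

  // Rebuild every chain against the new bucket count.
  private def resize(newSize: Int): Unit = {
    val next = Array.fill[List[(K, V)]](newSize)(Nil)
    for (chain <- buckets; (k, v) <- chain)
      next(index(k, newSize)) = (k, v) :: next(index(k, newSize))
    buckets = next
  }
}

Two keys that hash to the same bucket simply end up in the same chain; lookups stay correct no matter how many keys collide, they just get slower until a resize spreads the entries out again.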

Sets of key-value pairs where the keys produce the same hash value are stored together in a linked list. The gory details are available in hv.c.

Related

Simple cumulative increase in Prometheus

I have an application that increments a Prometheus counter when it receives a particular HTTP request. The application runs in Kubernetes, has multiple instances and redeploys multiple times a day. Using the query http_requests_total{method="POST",path="/resource/aaa",statusClass="2XX"} produces a graph displaying cumulative request counts per instance, as expected.
I would like to create a Grafana graph that shows the cumulative frequency of requests received over the last 7 days.
My first thought was to use increase(...[7d]) in order to account for any metrics starting outside of the 7-day window (like in the image shown) and then sum those values.
I've come to the realisation that sum(increase(http_requests_total{method="POST",path="/resource/aaa",statusClass="2XX"}[7d])) does in fact give the correct answer for points in time. However, the resulting graph isn't quite what was asked for, because the component increase(...) values increase/decrease over the week.
How would I go about creating a graph that shows the cumulative sum of the increase in these metrics over the past 7 days? For example, given the following simplified data
| Day | # Requests |
|-----|------------|
| 1 | 10 |
| 2 | 5 |
| 3 | 15 |
| 4 | 10 |
| 5 | 20 |
| 6 | 5 |
| 7 | 5 |
| 8 | 10 |
If I were to view a graph from day 2 to day 8, I would like it to render a line as follows:
| Day | Cumulative Requests |
|-----|---------------------|
| d0 | 0 |
| d1 | 5 |
| d2 | 20 |
| d3 | 30 |
| d4 | 50 |
| d5 | 55 |
| d6 | 60 |
| d7 | 70 |
Where d0 represents the initial value in the graph
Thanks
Prometheus doesn't provide functionality for returning the cumulative increase over multiple time series on the selected time range.
If you still need this functionality, then try VictoriaMetrics, a Prometheus-like monitoring solution I work on. It allows calculating the cumulative increase over multiple counters. For example, the following MetricsQL query returns the cumulative increase over all the time series with the http_requests_total name on the selected time range in Grafana:
running_sum(sum(increase(http_requests_total)))
How does it work?
1. It calculates the increase for each time series with the http_requests_total name. Note that increase() in the query above doesn't contain a lookbehind window in square brackets. VictoriaMetrics automatically sets the lookbehind window to the step value, which is passed by Grafana to the /api/v1/query_range endpoint. The step value is the interval between points on the graph.
2. It sums the increases returned at step 1 with the sum() function, individually for each point on the graph.
3. It calculates the cumulative increase over the per-step increases returned at step 2 with the running_sum() function.
If I understood your question's idea correctly, I think I managed to create such a graph with a query like this:
sum(max_over_time(counterName{someLabel="desiredlabelValue"}[7d]))
A graph produced by it looks like the blue one.
The reasons why the future part of the graph decreases are that the future processing obviously hasn't happened yet, and that processing older than 7 days slides out of the moving 7-day inspection window.

How to get only positive results when applying hashCode()?

I am working on Scala code that converts a set of unique strings to unique IDs. I applied hashCode() but I got negative numbers, and I need to work only with positive numbers.
I know that I have to use math.abs to get rid of the negative values but I am not sure if this is the correct solution or not.
I read before that something like this could solve my problem:
math.abs(hashCode()) * constant % size
How can I determine the right constant? And does size mean the total number of strings?
Previous questions related to this topic solved it by using math.abs only, but if the total number of strings is large, an overflow could happen and there is a chance to get a negative number as well. Multiplying the result by a constant and taking the mod of the size could help. This is why I need to understand how to determine the constant and the size.
Also, is there another way to get unique numbers for unique strings?
We can phrase your problem another way: How to get an unsigned number from a signed number with the same range?
Suppose you are using an Integer. Its value goes from -2147483648 to 2147483647. Now you need to convert this value into the positive range 0 to 2147483647.
Step 1:
ADD a constant to move the range upwards to 0. You can do this by adding 2147483648 to the value. But now the highest possible value is much greater than the MAX.
Step 2:
So use MODULO to move the value back into the required range.
For example, consider the values -2000 and 2000000000.
| STEP              | MIN VALUE   | EXAMPLE 1  | EXAMPLE 2  | MAX VALUE  |
|-------------------|-------------|------------|------------|------------|
| original          | -2147483648 | -2000      | 2000000000 | 2147483647 |
| add 2147483648    | 0           | 2147481648 | 4147483648 | 4294967295 |
| modulo 2147483648 | 0           | 2147481648 | 2000000000 | 2147483647 |
So the final formula is:
(NUMBER + 2147483648) % 2147483648
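A hedged sketch of that formula in Scala (toNonNegative is just an illustrative name): the addition is done in Long so it cannot overflow an Int, and the result always lands in 0 to 2147483647.

// Shift into the non-negative range, then reduce modulo 2^31.
def toNonNegative(n: Int): Long = (n.toLong + 2147483648L) % 2147483648L

toNonNegative(-2000)           // 2147481648
toNonNegative(2000000000)      // 2000000000
toNonNegative("abc".hashCode)  // always in 0 to 2147483647

For a 32-bit Int this is equivalent to simply dropping the sign bit (n & Int.MaxValue), as in the stripSign2 approach shown in the next answer.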
Warning:
Hash codes are not designed to give unique values. There are chances of getting the same hash for two different strings. Also, any scaling operations on the hash (like division, modulo) can further reduce uniqueness.
To strip a sign from an Int, you can just use .abs. It does break on Int.MinValue, but you can just special case it:
def stripSign(n: Int) = math.abs(n) max 0
or simply drop the sign bit:
def stripSign2(n: Int) = n & Int.MaxValue
Or just use negative numbers (what's wrong with them anyway?).
To your other question: you cannot convert a bunch of unique strings to Ints and guarantee that there won't be duplicates (for the simple reason that there are more strings than distinct Ints, so if you wanted to assign a unique Int to each of them, you'd run out of Ints before you run out of strings), so you have to be able to handle collisions, however infrequent.
You can only lower the probability of a collision by making your hash longer. With a 32-bit hash code, you have about a 50% probability of at least one collision in a population of approximately 75,000 strings; with 31 bits (if you do not want negative numbers), it is about 55,000; but with a 64-bit hash, the "magic number" is about 5 billion, provided that your hash function is good enough and produces numbers that are very evenly distributed.
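For reference, those thresholds follow from the standard birthday-bound approximation (nothing Scala- or hash-specific; the function name below is just illustrative): a ~50% chance of at least one collision is reached around sqrt(2 * ln 2 * 2^bits) uniformly distributed hash values.

def fiftyPercentCollisionPopulation(bits: Int): Double =
  math.sqrt(2 * math.log(2) * math.pow(2.0, bits))

fiftyPercentCollisionPopulation(32)  // ~77 thousand strings
fiftyPercentCollisionPopulation(31)  // ~55 thousand strings
fiftyPercentCollisionPopulation(64)  // ~5.1 billion strings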

Normalize Count Measure in Tableau

I am trying to create a plot similar to those created by Google's ngram viewer. I have the ngrams that correspond to year, but some years have much more data than others; as a result, plotting from absolute counts doesn't get me the information I want. I'd like to normalize it so that I get the counts as a percentage of the total samples for that year.
I've found ways to normalize data to ranges in Tableau, but nothing about normalizing by count. I also see that there is a count distinct function, but that doesn't appear to do what I want.
How can I do this in Tableau?
Thanks in advance for your help!
Edit:
Here is some toy data and the desired output.
Toy Data:
+---------+------+
| Pattern | Year |
+---------+------+
| a | 1 |
| a | 1 |
| a | 1 |
| b | 1 |
| b | 1 |
| b | 1 |
| a | 2 |
| b | 2 |
| a | 3 |
| b | 4 |
+---------+------+
Desired Output:
Put [Year] on the Columns shelf, and if it is really a Date field instead of a number - choose any truncation level you'd like or choose exact date. Make sure to treat it as a discrete dimension field (the pill should be blue)
Put [Number of Records] on the Rows shelf. Should be a continuous measure, i.e. SUM([Number of Records])
Put Pattern on the Color shelf.
At this point, you should be looking at a graph of raw counts. To convert them to percentages, right click on the [Number of Records] field on the Rows shelf, and choose Quick Table Calc->Percent of Total. Finally, right click on [Number of Records] a second time, and choose Compute Using->Pattern.
You might want to sort the patterns. One easy way is to just drag them in the color legend.

How to find all lines in a set that coincide in a single point?

Suppose I am given a set of lines; how can I partition this set into a number of clusters, such that all the lines in each cluster coincide in a single point?
If the number of lines N is reasonable, then you can use an O(N^3) algorithm:
For every pair of lines (in the form A*x + B*y + C = 0), check whether they intersect; exclude pairs of parallel, non-coincident lines, which are the pairs with determinant
|A1 B1|
|A2 B2| = 0
For every intersecting pair, check whether another line passes through the same intersection point, which is the case when the determinant
|A1 B1 C1|
|A2 B2 C2| = 0
|A3 B3 C3|
If N is too large for the cubic algorithm, then calculate all intersection points (up to O(N^2) of them) and add these points to any map structure (a hash table, for example). Check for matching points. Don't forget about numerical error issues.
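A sketch of that map-based approach in Scala, assuming each line is given by the coefficients of A*x + B*y + C = 0 (Line and clustersByCommonPoint are illustrative names): it computes every pairwise intersection with Cramer's rule and groups line indices by the intersection point, rounding coordinates to a tolerance to absorb numerical error.

case class Line(a: Double, b: Double, c: Double)

def clustersByCommonPoint(lines: IndexedSeq[Line],
                          eps: Double = 1e-9): Map[(Double, Double), Set[Int]] = {
  def snap(x: Double): Double = math.rint(x / eps) * eps   // crude tolerance handling
  val groups = scala.collection.mutable.Map.empty[(Double, Double), Set[Int]]
  for {
    i <- lines.indices
    j <- (i + 1) until lines.length
    l1 = lines(i)
    l2 = lines(j)
    det = l1.a * l2.b - l2.a * l1.b             // 2x2 determinant; ~0 means parallel
    if math.abs(det) > eps
  } {
    val x = (l1.b * l2.c - l2.b * l1.c) / det   // Cramer's rule for the intersection
    val y = (l2.a * l1.c - l1.a * l2.c) / det
    val key = (snap(x), snap(y))
    groups(key) = groups.getOrElse(key, Set.empty[Int]) + i + j
  }
  groups.toMap
}

Each map entry then holds the indices of all lines passing through that point; a line that goes through several shared points appears in several entries, so a final pass may be needed depending on how you want to define the clusters.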

Training LIBSVM with multivariate data in MATLAB

My general question is: how does LIBSVM perform multivariate regression?
In detail, I have some data for a certain number of links (for example, 3 links). Each link has 3 independent variables which, when used in a model, give an output Y. I have data collected on these links at some interval.
LinkId | var1 | var2 | var3 | var4(OUTPUT)
1 | 10 | 12.1 | 2.2 | 3
2 | 11 | 11.2 | 2.3 | 3.1
3 | 12 | 12.4 | 4.1 | 1
1 | 13 | 11.8 | 2.2 | 4
2 | 14 | 12.7 | 2.3 | 2
3 | 15 | 10.7 | 4.1 | 6
1 | 16 | 8.6 | 2.2 | 6.6
2 | 17 | 14.2 | 2.3 | 4
3 | 18 | 9.8 | 4.1 | 5
I need to perform prediction to find the output of
(2,19,10.2,2.3).
How can I do that, using the above data for training in MATLAB with LIBSVM? Can I train on the whole data as input to svmtrain to create one model, or do I need to train each link separately and use the created model for prediction? Does it make any difference?
NOTE: Notice that each link with the same ID has the same value.
This is not really a MATLAB or LIBSVM question, but rather a generic SVM-related one.
My general question is: how does LIBSVM perform multivariate regression?
LIBSVM is just a library which, in particular, implements the Support Vector Regression (SVR) model for regression tasks. In short, in the linear case SVR tries to find a hyperplane such that your data points lie within some margin around it (which is quite a dual approach to the classical SVM, which tries to separate data with as big a margin as possible).
In the non-linear case the kernel trick is used (in the same fashion as in SVM), so it is still looking for a hyperplane, but in a feature space induced by the particular kernel, which results in non-linear regression in the input space.
A quite nice introduction to SVR can be found here:
http://alex.smola.org/papers/2003/SmoSch03b.pdf
How can I do that, using the above data for training in MATLAB with LIBSVM? Can I train on the whole data as input to svmtrain to create one model, or do I need to train each link separately and use the created model for prediction? Does it make any difference? NOTE: Notice that each link with the same ID has the same value.
You could train SVR (as it is a regression problem) with the whole data, but:
it seems that var3 and LinkId carry the same information (1 -> 2.2, 2 -> 2.3, 3 -> 4.1); if this is the case, you should remove the LinkId column,
are the values of var1 unique ascending integers? If so, this is also probably a useless feature (they do not seem to carry any information; they look like ID numbers),
you should preprocess your data before applying SVM so that, e.g., each column contains values from the [0,1] interval; otherwise some features may become more important than others just because of their scale (a minimal scaling sketch follows below).
Now, if you were to create a separate model for each link and follow the above clues, you would end up with 1 input variable (var2) and 1 output variable (var4), so I would not recommend such a step. In general it seems that you have a very limited feature set; it would be valuable to gather more informative features.
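The scaling advice above as a hedged sketch, in Scala rather than MATLAB and independent of LIBSVM itself (minMaxScale is an illustrative helper, not a library function): rescale every feature column into the [0,1] interval so that no feature dominates just because of its scale.

def minMaxScale(rows: Seq[Array[Double]]): Seq[Array[Double]] = {
  val cols = rows.head.indices
  val mins = cols.map(c => rows.map(_(c)).min)
  val maxs = cols.map(c => rows.map(_(c)).max)
  rows.map { row =>
    cols.map { c =>
      val range = maxs(c) - mins(c)
      if (range == 0) 0.0 else (row(c) - mins(c)) / range  // a constant column maps to 0
    }.toArray
  }
}

// For example, on the var1..var3 values of the question's first three rows:
val scaled = minMaxScale(Seq(
  Array(10.0, 12.1, 2.2),
  Array(11.0, 11.2, 2.3),
  Array(12.0, 12.4, 4.1)
))

The same transformation has to be stored and re-applied to any data you later predict on; otherwise the model sees features on a different scale than it was trained on.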