how to get only positive results when applying hashCode()? - scala

I am working on Scala code that converts a set of unique strings to unique IDs. I applied hashCode(), but I got negative numbers, and I need to work only with positive numbers.
I know that I could use math.abs to get rid of the negative values, but I am not sure whether that is the correct solution.
I read before that something like this could solve my problem:
math.abs(hashCode()) * constant % size
How can I determine the right constant? And does the size mean the total number of strings?
Previous questions on this topic solved it by using math.abs only, but if the total number of strings is large an overflow could happen and there is still a chance of getting a negative number. Multiplying the result by a constant and taking the mod of the size could help, which is why I need to understand how to determine the constant and the size.
Also, is there another way to get unique numbers for unique strings?

We can phrase your problem another way: how do you get a non-negative number from a signed number in the same range?
Suppose you are using an Integer. Its value goes from -2147483648 to 2147483647. Now you need to convert this value into the positive range 0 to 2147483647.
Step 1:
ADD a constant to move the range upwards to 0. You can do this by adding 2147483648 to the value. But now the highest possible value is much greater than the MAX.
Step 2:
So use MODULO to move the value back into the required range.
For example, consider the values -2000 and 2000000000.
| STEP | MIN VALUE | EXAMPLE 1 | EXAMPLE 2 | MAX VALUE |
|-------------------|------------|------------|------------|------------|
| original |-2147483648 | -2000 | 2000000000 | 2147483647 |
| add 2147483648 | 0 | 2147481648 | 4147483648 | 4294967295 |
| modulo 2147483648 | 0 | 2147481648 | 2000000000 | 2147483647 |
So the final formula is:
(NUMBER + 2147483648) % 2147483648
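As a rough illustration only (not from the original answer), here is what that shift-and-wrap idea could look like in Scala, using Long arithmetic so the intermediate sum cannot overflow; the function name is made up for the sketch:
def toNonNegative(n: Int): Int = {
  // Add 2^31 to shift the whole Int range up to 0 .. 2^32-1, then wrap with modulo 2^31.
  // Long arithmetic is needed because 2147483648 does not fit in an Int.
  ((n.toLong + 2147483648L) % 2147483648L).toInt
}
Note that this still maps two different Ints onto the same result (the range is halved), which is the same loss of uniqueness the warning below describes.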
Warning:
Hash codes are not designed to give unique values. There are chances of getting the same hash for two different strings. Also, any scaling operations on the hash (like division, modulo) can further reduce uniqueness.

To strip a sign from an Int, you can just use .abs. It does break on Int.MinValue, but you can just special case it:
def stripSign(n: Int) = math.abs(n) max 0
or simply drop the sign bit:
def stripSign2(n: Int) = n & Int.MaxValue
Or just use negative numbers (what's wrong with them anyway?).
To your other question: you cannot convert a bunch of unique strings to Ints and guarantee that there will be no duplicates, for the simple reason that there are more possible strings than distinct Ints. If you wanted to assign a unique Int to each of them, you would run out of Ints before you ran out of strings, so you have to be able to handle collisions, however infrequent.
You can only lower the probability of a collision by making your hash longer. With a 32-bit hash code you have about a 50% probability of at least one collision in a population of approximately 75,000 strings; with 31 bits (if you do not want negative numbers) it is about 55,000. With a 64-bit hash the "magic number" is about 5 billion, provided that your hash function is good enough and produces numbers that are very evenly distributed.
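If you do want a longer hash, one possible sketch (my own illustration, not part of the answer above) is to run the string through the JDK's MessageDigest and keep the first 8 bytes as a non-negative Long; the helper name longHash is invented here:
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Derive a 63-bit, non-negative Long hash from a string.
// Collisions are still possible, just far less likely than with a 31-bit hash.
def longHash(s: String): Long = {
  val digest = MessageDigest.getInstance("SHA-256").digest(s.getBytes(StandardCharsets.UTF_8))
  val first8 = ByteBuffer.wrap(digest, 0, 8).getLong
  first8 & Long.MaxValue // clear the sign bit, same trick as stripSign2 above
}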

Related

MATLAB numeric precision when generating a numeric sequence

I was testing an operation like this:
[input] 3.9/0.1 : 4.1/0.1
[output] 39 40
I don't know why 4.1/0.1 is approximated to 40. If I add a round(), it works as expected:
[input] 3.9/0.1 : round(4.1/0.1)
[output] 39 40 41
What's wrong with the first operation?
In this Q&A I go into detail on how the colon operator works in MATLAB to create a range. But the detail that causes the issue described in this question is not covered there.
That post includes the full code for a function that imitates exactly what the colon operator does. Let's follow that code. We start with start = 3.9/0.1, which is exactly 39, and stop = 4.1/0.1, which, due to rounding errors, is just slightly smaller than 41, and step = 1 (the default if it's not given).
It starts by computing a tolerance:
tol = 2.0*eps*max(abs(start),abs(stop));
This tolerance is intended to be used so that the stop value, if within tol of an exact number of steps, is still used, if the last step would step over it. Without a tolerance, it would be really difficult to build correct sequences using floating-point end points and step sizes.
However, then we get this test:
if start == floor(start) && step == 1
% Consecutive integers.
n = floor(stop) - start;
elseif ...
If the start value is an exact integer and the step size is 1, then it forces the sequence to be an integer sequence. Unfortunately, it does so by taking the number of steps as the distance between floor(stop) and start. That is, it does not use the tolerance computed earlier when determining the right stop! If stop is slightly above an integer, that integer will be in the range; if stop is slightly below an integer (as in the OP's case), that integer will not be part of the range.
It could be debated whether MATLAB should round the stop number up in this case or not. MATLAB chose not to. All of the sequences produced by the colon operator use the start and stop values exactly as given by the user. It leaves it up to the user to ensure the bounds of the sequence are as required.
However, if the colon operator hadn't special-cased the sequence of integers, the result would have been less surprising in this case. Let's add a very small number to the start value, so it's not an integer:
>> a = 3.9/0.1 : 4.1/0.1
a =
39 40
>> b = 3.9/0.1 + eps(39) : 4.1/0.1
b =
39.0000 40.0000 41.0000
Floating-point numbers suffer from loss of precision when represented with a fixed number of bits (64 bits in MATLAB by default). This is because there is an infinite number of real numbers (even within a small range, say 0.0 to 0.1), while an n-bit binary pattern can represent only 2^n distinct numbers. Hence, not all real numbers can be represented: the nearest approximation is used instead, resulting in a loss of accuracy.
The result of computing 4.1/0.1 in the computer, as a 64-bit double-precision floating-point number, is actually
4.1/0.1 ≈ 40.99999999999999289457264239899814...
So, in essence, 4.1/0.1 < 41.0 and that is what you get from the range. If you subtract, for example, 41 - 4.1/0.1, you get 7.105427357601002e-15. But when you round, you get the closest value of 41.0, as expected.
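MATLAB's doubles and Scala/Java doubles both follow IEEE-754, so (as a side illustration of my own, not part of the original answer) the same effect can be checked in Scala:
// 4.1 / 0.1 computed in double precision lands just below 41.0.
println(4.1 / 0.1 < 41.0)  // true
println(41.0 - 4.1 / 0.1)  // 7.105427357601002E-15, one spacing step (ulp) at this magnitude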
The representation scheme for 64-bit double precision according to the IEEE-754 standard:
The most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative numbers.
The following 11 bits represent the exponent (E).
The remaining 52 bits represent the fraction (F).
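Purely for illustration (again a Scala snippet of my own rather than MATLAB), the three fields can be pulled out of a double's raw bits with java.lang.Double.doubleToLongBits:
val bits = java.lang.Double.doubleToLongBits(4.1 / 0.1)
val sign     = bits >>> 63              // 1 sign bit
val exponent = (bits >>> 52) & 0x7FFL   // 11 exponent bits (stored with a bias of 1023)
val fraction = bits & 0xFFFFFFFFFFFFFL  // 52 fraction bits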

Designing a hash function that creates keys for a hash table from an alphanumeric number

I am trying to design a hash function using customer IDs that range from AA0001 to ZZ9999.
The keys will be stored in a one dimensional array.
Each element of the array will need to be accessed.
My thinking is that I can sum the ASCII values of the letters in the customer ID as well as the following numbers.
I am planning to have an array size of 100.
I am new to this subject, so I am not clear whether my thinking is correct.
The smallest ID is AA0001: the ASCII sum of AA is 130, and adding 1 makes the smallest value 131.
The largest ID, ZZ9999, gives 180 + 9999 = 10179.
I want to use the modulus function but I am not sure how to use it to get a range of numbers between 1 and 100.
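A minimal Scala sketch of the scheme described in the question, summing the character codes of the two letters plus the numeric part and reducing with a modulus into 1 to 100 (the function name and the exact range handling are assumptions, not a recommendation):
// Hypothetical sketch: map IDs like "AA0001" .. "ZZ9999" onto table slots 1..100.
// Summing characters squeezes roughly 6.7 million possible IDs into 100 slots,
// so collisions are unavoidable with an array this small.
def slotFor(id: String, tableSize: Int = 100): Int = {
  val letterSum  = id.take(2).map(_.toInt).sum // e.g. 'A' + 'A' = 130
  val numberPart = id.drop(2).toInt            // e.g. "0001" -> 1
  (letterSum + numberPart) % tableSize + 1     // result lies in 1 .. tableSize
}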

Why is the product of two positive integers a negative integer?

This semester I took a systems programming course.
Why is 50000*50000 negative?
I am trying to understand the logic behind this.
Here is a screenshot of the slide: [slide image]
32-bit signed integers are stored by using bits 0-30 as the number and bit 31 indicating the sign of the number.
This means that the maximum value that can be represented is 2,147,483,647 (all bits from 0-30 are set, bit 31 is 0 indicating a positive number).
The product of 50,000 and 50,000 is 2,500,000,000, which is greater than this number, so you have what is called an overflow. This means that data has "overflowed" from its expected bounds (the bottom 31 bits) into the sign bit.
You now have bit 31 set, indicating that this is a negative number. To figure out a negative number from its binary representation, you take the ones' complement (flip all the bits), add one, and then put a negative sign in front of it.
Be careful when you take the ones' complement that you limit yourself to the 32-bit range; you shouldn't include bits higher than bit 31.
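As a quick illustration of that flip-and-add-one recipe (a Scala snippet of my own, not from the slide):
val v = -1794967296          // the overflowed product from the question
val magnitude = (~v) + 1     // flip all the bits, add one: 1794967296
println("-" + magnitude)     // prints -1794967296, reconstructing the negative value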
Check out signed number representations for more information.
Sample Program Pseudo Code
Print --> ("Size of int: " + (Integer.SIZE/8) + " bytes.");
int a=50000;
int b=50000;
Print --> (" Product of a and b " + a*b);
Output :
Size of int: 4 bytes.
Product of a and b:-1794967296
Analysis :
4 bytes= 4*8= 32bits.
Since a signed int can hold negative values, one bit is used for the sign (- or +), so 31 bits are available for the numeric range.
Number range = -(2^31), 0 and (2^31 - 1)
[one positive number is sacrificed for 0]
i.e. -2147483648, 0 and 2147483647
Maximum possible positive int = 2147483647 (greater than 1600000000, so 40000*40000 is fine)
Actual Product 50000*50000=2500000000 (greater than 2147483647)
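The same wrap-around can be reproduced in Scala, whose Int is also a 32-bit two's-complement integer (a quick check of my own, not the course's code):
val a = 50000
val b = 50000
println(a * b)        // -1794967296: the true product 2500000000 wraps past Int.MaxValue
println(a.toLong * b) // 2500000000: widening one operand to Long avoids the overflow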
In practice many portable C programs assume that signed integer overflow wraps around reliably using two's complement arithmetic.
Yet the C standard says that program behavior is undefined on overflow, and in a few cases C programs do not work on some modern implementations because their overflows do not wrap around as their authors expected.
http://www.gnu.org/software/autoconf/manual/autoconf-2.62/html_node/Integer-Overflow.html
This is because in most programming languages, the integer data type has a fixed size.
That means that each integer type has a defined MIN and MAX value.
For example in C# MAX INT is 2147483647 and MIN is -2147483648
In PHP 32 bits it's 2147483647 and -2147483648
In PHP 64 bits it's 9223372036854775807 and -9223372036854775808
What happens when you try to go over that value? The computer performs what is called an integer overflow, and the value wraps around to the minimum value.
In other words, in C#, 2147483647 + 1 = -2147483648 (assuming you use the integer data type, not long or float). That is exactly what happens with 50000 * 50000: it goes past the max value and wraps around from the minimum.
The exact min and max values depend on the language used, the platform the code is built on, the platform the code is run on, and the static type of the value.
Hope that clears everything up for you!

Stata: Keep only observations with minimum, maximum and median value of a given variable

In Stata, I have a dataset with two variables: id and var, and say 1000 observations. The variable var is of type float and takes distinct values for all observations. I would like to keep only the three observations where var is either the minimum of var, the maximum of var, or the median of var.
The way I currently do this:
summarize var, detail
local varmax = r(max)
local varmin = r(min)
local varmedian= r(p50)
keep if inlist(float(var),float(`varmax') , float(`varmedian'), float(`varmin'))
The problem that I face is that sometimes the inlist condition will not match one of the values. E.g. I end up with two observations instead of three, for instance the ones with the min and the max but not the one with the median. I suspect this has to do with a precision problem. As you can see, I tried to convert all numbers to float, but this is apparently not sufficient.
Any fix to my solution, or alternative solution would be greatly appreciated (if possible without installing additional packages), thanks!
This is not in the first instance a precision problem.
It is an inevitable problem when (1) the number of values is even and (2) the median is the mean of two central values that are different. Then the median itself is not a value in the dataset and will not be found by keep.
Consider a data set 1, 2, 3, 4. The median 2.5 is not in the data. This is very common; indeed it is what is expected with all values distinct and the number of observations even.
Other problems can arise because two or even three of the minimum, median and maximum could be equal to each other. This is not your present problem, but it can bite with other variables (e.g. indicator variables).
Precision problems are possible.
Here is a general solution purported to avoid all these difficulties.
If you collapse to min, median, max and then reshape, you can avoid the problem. You will always get three results, even if they are numerically equal and/or not present in the data.
In the trivial example below, the identifier is needed only to appease reshape. In other problems, you might want to collapse using by() and then your identifier comes ready-made. However, you will be less likely to want to reshape in that case.
. clear
. set obs 4
number of observations (_N) was 0, now 4
. gen y = _n
. collapse (min)ymin=y (max)ymax=y (median)ymedian=y
. gen id = _n
. reshape long y, i(id) j(statistic) string
(note: j = max median min)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 1 -> 3
Number of variables 4 -> 3
j variable (3 values) -> statistic
xij variables:
ymax ymedian ymin -> y
-----------------------------------------------------------------------------
. list
+---------------------+
| id statis~c y |
|---------------------|
1. | 1 max 4 |
2. | 1 median 2.5 |
3. | 1 min 1 |
+---------------------+
All that said, having (lots of?) datasets with just three observations sounds poor data management strategy. Perhaps this is extracted from some larger question.
UPDATE
Here is another way to keep precisely 3 observations. Apart from the minimum and maximum, we use the rule that we keep the "low median", i.e. the lower of two values averaged for the median, when the number of observations is even, and a single value that is the median otherwise. (In Stephen Stigler's agreeable terminology, we can talk of "comedians" in the first case.)
. sysuse auto, clear
(1978 Automobile Data)
. sort mpg
. drop if missing(mpg)
(0 observations deleted)
. keep if inlist(_n, 1, cond(mod(_N, 2), ceil(_N/2), floor(_N/2)), _N)
(71 observations deleted)
. l mpg
+-----+
| mpg |
|-----|
1. | 12 |
2. | 20 |
3. | 41 |
+-----+
mod(_N, 2) is 1 if _N is odd and 0 if _N is even. The expression in cond() selects ceil(_N/2) if the number of observations is odd and floor(_N/2) if it is even.
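Purely as an illustration of that index rule (Scala rather than Stata, with an invented helper name): on a 0-based, sorted collection the same min / low-median / max selection looks like this:
// Keep the minimum, the "low median", and the maximum of a sorted sequence.
// For an even count this takes the lower of the two central values,
// mirroring cond(mod(_N, 2), ceil(_N/2), floor(_N/2)) above.
def minLowMedianMax(sorted: Vector[Double]): Vector[Double] = {
  require(sorted.nonEmpty)
  val n = sorted.length
  val medianIdx = if (n % 2 == 1) n / 2 else n / 2 - 1
  Vector(sorted.head, sorted(medianIdx), sorted.last)
}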

How to generate all possible combinations of n-bit strings?

Given a positive integer n, I want to generate all possible n-bit combinations in MATLAB.
For ex : If n=3, then answer should be
000
001
010
011
100
101
110
111
How do I do it ?
I want to actually store them in matrix. I tried
for n=1:2^4
r(n)=dec2bin(n,5);
end;
but that gave the error "In an assignment A(:) = B, the number of elements in A and B must be the same."
Just loop over all integers in [0,2^n), and print the number as binary. If you always want to have n digits (e.g. insert leading zeros), this would look like:
for ii=0:2^n-1,
fprintf('%0*s\n', n, dec2bin(ii));
end
Edit: there are a number of ways to put the results in a matrix. The easiest is to use
x = dec2bin(0:2^n-1);
which will produce a 2^n-by-n matrix of type char. Each row is one of the bit strings.
If you really want to store the strings in a cell array, you can do this:
x = cell(1, 2^n);
for ii=0:2^n-1,
x{ii+1} = dec2bin(ii, n);  % +1 because MATLAB indices start at 1; the second argument pads with leading zeros
end
However, if you're looking for efficient processing, you should remember that integers are already stored in memory in binary! So the vector:
x = 0 : 2^n-1;
contains the binary patterns in the most memory-efficient and CPU-efficient way possible. The only trade-off is that you will not be able to represent patterns with more than 32 or 64 bits using this compact representation.
This is a one-line answer to the question which gives you a double array of all 2^n bit combinations:
bitCombs = dec2bin(0:2^n-1) - '0'
There are many ways to do this enumeration. If you want to implement it with an array counter: set up an array of counters, one per bit position (2^0, 2^1, 2^2), each going from 0 to 1. Let the starting number be 000 (stored in the array). Increment the counter at the first position (2^0), giving 001. When a counter rolls over, reset it and increment the counter at the next position (2^1), and keep looping until all the counters have rolled over.
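A hedged Scala sketch of that counter idea (helper name invented; a simple loop over 0 to 2^n - 1 with padding would work just as well):
// Ripple-counter version: an array of n binary digits incremented like an odometer.
def allBitPatterns(n: Int): Seq[String] = {
  val digits = Array.fill(n)(0)                        // start at 000...0
  val out = scala.collection.mutable.Buffer(digits.mkString)
  var done = false
  while (!done) {
    var pos = n - 1                                    // least significant position (2^0)
    while (pos >= 0 && digits(pos) == 1) {             // reset trailing 1s ...
      digits(pos) = 0
      pos -= 1
    }
    if (pos < 0) done = true                           // every counter rolled over: finished
    else {
      digits(pos) = 1                                  // ... and carry into the next position
      out += digits.mkString
    }
  }
  out.toSeq
}
// allBitPatterns(3) yields 000, 001, 010, 011, 100, 101, 110, 111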