Good hash function for a string of numbers and letters of the form "9X9XX99X9XX999999"

What would be a good hash code for a vehicle identification number, that is, a string of numbers and letters of the form "9X9XX99X9XX999999", where a "9" represents a digit and an "X" represents a letter?

One reasonable approach is to hash the entire string with a hash function suitable for strings; for example, GCC's C++ Standard Library uses a MurmurHash variant for std::hash<std::string>.
If you want to be more hands-on, you could concatenate all the digits into one 11-digit number. Each of the 6 letters takes one of 26 values, which is fewer than 2^5 = 32, so you can cheaply pack the letters into a number as well: map each letter to a value from 0 to 25 (call the results A, B, C, D, E, F) and evaluate A + B * 2^5 + C * 2^10 + D * 2^15 + E * 2^20 + F * 2^25.
Then hash the 11-digit number and the number built from the letters separately with a decent hash function, and XOR or add the results; that should give quite a good hash value for your VIN. I haven't personally evaluated it, but Thomas Mueller recommends and explains something ostensibly suitable here:
#include <cstdint>

// 64-bit integer mixer: alternating xor-shift and multiply steps.
uint64_t hash(uint64_t x) {
    x = (x ^ (x >> 30)) * UINT64_C(0xbf58476d1ce4e5b9);
    x = (x ^ (x >> 27)) * UINT64_C(0x94d049bb133111eb);
    x = x ^ (x >> 31);
    return x;
}
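As a rough illustration of the scheme described above, here is a minimal Python sketch. The names mix64 and vin_hash are made up for this example, the input is assumed to be a well-formed 17-character VIN in the stated pattern, and the mixer is a direct port of the function above with explicit 64-bit masking, because Python integers do not wrap around.

```python
MASK64 = (1 << 64) - 1  # Python ints are unbounded, so wrap to 64 bits by hand


def mix64(x: int) -> int:
    """Port of the 64-bit mixer shown above."""
    x &= MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


def vin_hash(vin: str) -> int:
    """Hypothetical helper: assumes vin matches 9X9XX99X9XX999999."""
    # Concatenate the 11 digits into one integer.
    digits = int("".join(ch for ch in vin if ch.isdigit()))
    # Pack the 6 letters, 5 bits each: A + B*2^5 + C*2^10 + ...
    letters = 0
    for i, ch in enumerate(ch for ch in vin if ch.isalpha()):
        letters |= (ord(ch.upper()) - ord("A")) << (5 * i)
    # Mix both parts separately and combine them with XOR.
    return mix64(digits) ^ mix64(letters)


print(hex(vin_hash("1A2BC34D5EF678901")))
```

XOR is used here only because the text suggests it; adding the two mixed values modulo 2^64 would fit the description equally well.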

Related

NFA to accept the following language

I need to build an NFA (or DFA) to recognize the following language:
L = {w | w mod 3 = 1}.
My approach was to build an NFA that recognizes numbers divisible by 3 and then just add 1 to them, but this is a lot harder than it seems (if not impossible?).
I only managed to build an NFA that recognizes numbers divisible by 3.
I will assume that w is to be interpreted as the decimal representation (without leading zeroes) of a nonnegative integer.
Given this, we can use Myhill-Nerode to iteratively determine the states we need:
The empty string can be followed by any string in L to get a string in L. We'll call the equivalence class for this [e]. Note that this equivalence class corresponds to the initial state of a minimal DFA for L (if one exists). Note also that the initial state is not accepting, since the empty string is not a valid decimal representation of a nonnegative integer.
The string 0 cannot be followed by anything to get a string in L; it leads to a dead state corresponding to equivalence class [0].
Strings 1, 4 and 7 are in L, so they must correspond to a new state. We'll call the equivalence class for these [1].
Strings 2, 5 and 8 are not in L; however, not every string in L takes them to a string in L. These must correspond to a new equivalence class we'll call [2].
Strings 3, 6 and 9 are not in L, and they can be followed by any string in L to get a string in L, just like the empty string. They are still distinguishable from the empty string, though: 3 followed by 01 gives 301, which is in L, whereas 01 on its own has a leading zero and is not. These strings therefore need a new equivalence class, which we'll call [3].
It can be verified that every two-digit decimal string is indistinguishable from some one-digit decimal string above, so no further equivalence classes or states are needed.
To determine the transitions, simply append the transition symbol to the equivalence class's representative element and see what equivalence class the resulting string belongs to: that will be where the transition terminates. For instance, there is a transition from [e] to [0] on 0, from [e] to [1] on 1, etc.
Because 10 = 1 (mod 3), adding a new digit to the end of a decimal string will cause the new value modulo 3 to be the sum of the original number's value modulo 3 with the value of the new digit modulo 3:
x = a (mod 3)
y = b (mod 3)
x * 10 = x * 1 (mod 3) since 10 = 1 (mod 3)
x . y = x * 10 + y = x * 1 + y = x + y (mod 3)
Filling in the transitions is left as an exercise.
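For readers who want to check their transition table, here is a small Python simulation under the same assumption (decimal strings without leading zeroes). The state names are labels invented for this sketch. It keeps a separate state "r0" for nonempty strings with remainder 0, because such a string followed by 01 is in L while 01 on its own is not, so it cannot share the start state.

```python
def step(state: str, digit: int) -> str:
    """Transition function; state names are labels chosen for this sketch."""
    if state == "dead":
        return "dead"
    if state == "start":
        # A leading 0 can never be extended to a valid representation.
        return "dead" if digit == 0 else f"r{digit % 3}"
    # state is "r0", "r1" or "r2": add the digit's value modulo 3.
    return f"r{(int(state[1]) + digit) % 3}"


def accepts(w: str) -> bool:
    """Run the DFA on a string of decimal digits; only "r1" is accepting."""
    state = "start"
    for ch in w:
        state = step(state, int(ch))
    return state == "r1"


for w in ["1", "10", "301", "3", "01", ""]:
    print(repr(w), accepts(w))
```

Comparing accepts(str(n)) with n % 3 == 1 over a range of integers is a quick sanity check of the table.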

Convert number string into float with specific precision (without getting rounding errors)

I have a vector of cells (say, of size 50x1, called tokens), each of which is a struct with properties x, f1, f2, which are strings representing numbers. For example, tokens{15} gives:
x: "-1.4343429"
f1: "15.7947111"
f2: "-5.8196158"
and I am trying to put those numbers into 3 vectors (each is also 50x1) whose type is float. So I create 3 vectors:
x = zeros(50,1,'single');
f1 = zeros(50,1,'single');
f2 = zeros(50,1,'single');
and that works fine (why wouldn't it?). But then when I try to populate those vectors: (L is a for loop index)
x(L)=tokens{L}.x;
.. also for the other 2
I get :
The following error occurred converting from string to single:
Conversion to single from string is not possible.
Which I can understand; implicit conversion doesn't work for single. It does work if x, f1 and f2 are of type 50x1 double.
The reason I am doing it with floats is that the data comes from a C program which writes some floats into a file to be read by MATLAB. If I try to convert the values into doubles in the C program, I get rounding errors...
So, (after what I hope is a good question,) how might I be able to get the numbers in those strings, at the right precision? (all the strings have the same number of decimal places: 7).
The MCVE:
filedata = fopen('fname1.txt','rt');
%fname1.txt is created by a C program. I am quite sure that the problem isn't there.
scanned = textscan(filedata,'%s','Delimiter','\n');
raw = scanned{1};
stringValues = strings(50,1);
for K=1:length(raw)
stringValues(K)=raw{K};
end
clear K %purely for convenience
regex = 'x=(?<x>[\-\.0-9]*),f1=(?<f1>[\-\.0-9]*),f2=(?<f2>[\-\.0-9]*)';
tokens = regexp(stringValues,regex,'names');
x = zeros(50,1,'single');
f1 = zeros(50,1,'single');
f2 = zeros(50,1,'single');
for L=1:length(tokens)
x(L)=tokens{L}.x;
f1(L)=tokens{L}.f1;
f2(L)=tokens{L}.f2;
end
Use the function str2double before assigning into your arrays (and then cast the result to single if you want), e.g. x(L) = single(str2double(tokens{L}.x)). Strings (char arrays) must be explicitly converted to numbers before being used as numbers.

Where in the sequence of a Probabilistic Suffix Tree does "e" occur?

In my data there are only missing data (*) on the right side of the sequences. That means that no sequence starts with * and no sequence has any other markers after *. Despite this the PST (Probabilistic Suffix Tree) seems to predict a 90% chance of starting with a *. Here's my code:
# Load libraries
library(RCurl)
library(TraMineR)
library(PST)
# Get data
x <- getURL("https://gist.githubusercontent.com/aronlindberg/08228977353bf6dc2edb3ec121f54a29/raw/c2539d06771317c5f4c8d3a2052a73fc485a09c6/challenge_level.csv")
data <- read.csv(text = x)
# Load and transform data
data <- read.table("thread_level.csv", sep = ",", header = F, stringsAsFactors = F)
# Create sequence object
data.seq <- seqdef(data[2:nrow(data),2:ncol(data)], missing = NA, right= NA, nr = "*")
# Make a tree
S1 <- pstree(data.seq, ymin = 0.05, L = 6, lik = TRUE, with.missing = TRUE)
# Look at first state
cmine(S1, pmin = 0, state = "N3", l = 1)
This generates:
[>] context: e
            EX         FA         I1         I2          I3          N1          N2          N3        NR
S1 0.006821066 0.01107234 0.01218274 0.01208756 0.006821066 0.002569797 0.003299492 0.001554569 0.0161802
           QU          TR         *
S1 0.01126269 0.006440355 0.9097081
How can the probability for * be 0.9097081 at the very beginning of the sequence, meaning after context e?
Does it mean that the context can appear anywhere inside a sequence, and that e denotes an arbitrary starting point somewhere inside a sequence?
A PST is a representation of a variable length Markov model (VLMC). Like a classical Markov model, a VLMC is assumed to be homogeneous (or stationary), meaning that the conditional probabilities of the outcome given the context are the same at each position in the sequence. In other words, the context can appear anywhere in the sequence. In fact, contexts are searched for by exploring the tree, which is assumed to apply anywhere in the sequences.
In your example, with l=1 (l is 1 + the length of the context), you look only at 0-length contexts, i.e., the only possible context is the empty sequence e. Your condition pmin=0, state=N3 (a probability greater than 0 for N3) is equivalent to no condition at all, so you get the overall probability of observing each state. Because your sequences (with the missing states) are all of the same length, you would get the same results using TraMineR with
seqmeant(data.seq, with.missing=TRUE)/max(seqlength(data.seq))
To get the distribution at the first position, you can use TraMineR and look at the first column of the table of cross-sectional distributions at the successive positions returned by
seqstatd(data.seq, with.missing=TRUE)
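The difference between the two quantities can be seen on a toy example. The Python sketch below uses three made-up sequences, right-padded with * to a common length (the data and variable names are purely illustrative and unrelated to the file in the question). It compares the overall state frequencies, which is what the empty context reports, with the distribution at the first position.

```python
from collections import Counter

# Hypothetical toy sequences, right-padded with "*" to a common length.
seqs = ["AB**", "A***", "BA**"]

# Overall frequency of each state across all positions (empty context).
overall = Counter(state for seq in seqs for state in seq)
total = sum(overall.values())
overall_dist = {state: n / total for state, n in overall.items()}

# Distribution of states at the first position only.
first = Counter(seq[0] for seq in seqs)
first_dist = {state: n / len(seqs) for state, n in first.items()}

print("overall:", overall_dist)
print("first position:", first_dist)
```

Here * dominates the overall distribution even though no sequence starts with it, which mirrors the situation in the question.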
Hope this helps.

Scientific notation in MATLAB

Say I have an array that contains the following elements:
1.0e+14 *
1.3325 1.6485 2.0402 1.0485 1.2027 2.0615 1.7432 1.9709 1.4807 0.9012
Now, is there a way to grab 1.0e+14 * (base and exponent) individually?
If I do arr(10), then this will return 9.0120e+13 instead of 0.9012e+14.
Assuming the goal is to grab any elements in the array with a coefficient less than one: is there a way to obtain 1.0e+14, so that I could just do arr(i) < 1.0e+14?
I assume you want string output.
Let a denote the input numeric array. You can do it this way, if you don't mind using evalc (a variant of eval, which is considered bad practice):
s = evalc('disp(a)');
s = regexp(s, '[\de+-\.]+', 'match');
This produces a cell array with the desired strings.
Example:
>> a = [1.2e-5 3.4e-6]
a =
1.0e-04 *
0.1200 0.0340
>> s = evalc('disp(a)');
>> s = regexp(s, '[\de+-\.]+', 'match')
s =
'1.0e-04' '0.1200' '0.0340'
Here is the original answer from Alain.
Basic math can tell you that:
floor(log10(N))
The log base 10 of a number tells you approximately how many digits before the decimal are in that number.
For instance, 99987123459823754 is 9.998E+016
log10(99987123459823754) is 16.9999441, the floor of which is 16 - which can basically tell you "the exponent in scientific notation is 16, very close to being 17".
Now you have the exponent of the scientific notation. This should allow you to get to whatever your goal is ;-).
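The same floor(log10(N)) idea can be sketched in Python. The helper name sci_parts is invented for this example and the input is assumed to be nonzero. Taking the shared scale factor from the largest exponent in the array is an assumption made for illustration, not a statement about how MATLAB chooses the factor it displays.

```python
import math


def sci_parts(n: float) -> tuple[float, int]:
    """Split a nonzero number into (coefficient, exponent), n = coefficient * 10**exponent."""
    exponent = math.floor(math.log10(abs(n)))
    return n / 10 ** exponent, exponent


arr = [1.3325e14, 1.6485e14, 2.0402e14, 1.0485e14, 1.2027e14,
       2.0615e14, 1.7432e14, 1.9709e14, 1.4807e14, 0.9012e14]

# Assumed shared scale: the largest exponent found in the array.
scale_exp = max(sci_parts(x)[1] for x in arr)

# Elements whose coefficient relative to that scale is below one.
below_one = [x for x in arr if abs(x) < 10 ** scale_exp]

print(scale_exp, below_one)
```

Floating-point rounding in log10 can put values extremely close to a power of ten on the wrong side of the floor, so treat the result as approximate near those boundaries.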
And depending on what you want to do with your exponent and the number, you could also define your own method. An example is described in this thread.

Is this the simplified version of this boolean expression? Or is this reviewer wrong

I've tried doing the truth table, but unfortunately one expression has 3 literals and the other has 4, so I got confused.
F = (A+B+C)(A+B+D')+B'C;
and this is the simplified version
F = A + B + C
http://www.belley.org/etc141/Boolean%20Sinplification%20Exercises/Boolean%20Simplification%20Exercise%20Questions.pdf
I think there's something wrong with this reviewer... or is it accurate?
By the way, is simplification different from minimizing from Sum of Minterms to Sum of Products?
Yes, it is the same.
Draw the truth table for both expressions, assuming that there are four input variables in both. The value of D will not play into the second truth table: values in cells with D=1 will match values in cells with D=0. In other words, you can think of the second expression as
F = A + B + C + (0)(D)
You will see that both tables match: the (A+B+C)(A+B+D') subexpression has zeros at ABCD = {0000, 0001, 0011}; (A+B+C) has zeros only at {0000, 0001}. Adding B'C patches the zero at 0011 in the first subexpression, so the results are equivalent.
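The truth-table comparison can also be done mechanically. The following Python sketch enumerates all 16 assignments of A, B, C, D and collects any row where the two expressions disagree; the function names are made up for this example.

```python
from itertools import product


def f_original(a: bool, b: bool, c: bool, d: bool) -> bool:
    # F = (A+B+C)(A+B+D') + B'C
    return ((a or b or c) and (a or b or not d)) or ((not b) and c)


def f_simplified(a: bool, b: bool, c: bool, d: bool) -> bool:
    # F = A + B + C (D is accepted but ignored)
    return a or b or c


# Collect every input row on which the two expressions differ.
mismatches = [
    bits
    for bits in product([False, True], repeat=4)
    if f_original(*bits) != f_simplified(*bits)
]

print("equivalent" if not mismatches else mismatches)
```

An empty mismatch list means the two expressions have identical truth tables over four variables.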