Is there a Levenshtein distance in PMML standard? How do you calculate it? - pmml

I am trying to get Levenshtein distance between two fields using PMML. Is there an easy implimentation of this?
I have looked at this :
http://dmg.org/pmml/v4-2-1/Transformations.html
but it's not clear to me if I can use this to calculate Levenshtein distance and where this might fit into my pmml document.

There is no "standalone" function for computing the Levenshtein distance between two string values.
If you want to perform string comparison, then you should check out the TextIndex transformation. The TextIndex element has maxLevenshteinDistance attribute, which can be used to implement queries like "is string value A equal to string value B if the maximum allowed Levenshtein distance is X".
<TextIndex textField="myStringField" localTermWeights="binary" maxLevenshteinDistance="3">
<Constant dataType="string">Hello World!</Constant>
</TextIndex>
The above should return 1 if the input string value myStringField is less than three edits away from the reference string value Hello String (and 0 otherwise).

Related

How can I find a max value of a selected region in a fit?

I am trying to find a max value of a curve fitted plot for a certain region in this plot. I have a 4th order fit, and when i use max(x), the ans for this is an extrapolated value, while I am actually looking fot the max value of the 'bump' in my data.
So question, how do I select the max for only a certain region in the data while using a cfit? Or how do I exclude a part of the fit?
LF = pol4Fit(L,F);
Coefs= coeffvalues(LF);
This code does only give the optimum (the max value) of the real points:
L_opt = feval(LF,L);
[F_opt,Num_Length]= max (L_opt);
Opt_Length= L(Num_Length);
So now I was trying something like: y=max(LF(F)), but this is not specific to select a region.
Try to only evaluate the region you are interested in.
For instance, let's say the specific region is a vector named S.
You can simply rewrite your code like below:
L_opt = feval(LF,S);
Use the specific domain region S instead of the whole domain L and it only evaluates the region you are concerned with. Then using max function should work properly for you.

What does EuclideanHash in Location Sensitive Hash mean?

I find that Location Sensitive Hash support EuclideanHash CosineHash and some other hash according to the repository in github: lsh families. Anyway, CosineHash is easy to understand:
double result = vector.dot(randomProjection);
return result > 0 ? 1 : 0;
But then EuclideanHash is hard to understand:
double hashValue = (vector.dot(randomProjection)+offset)/Double.valueOf(w); // offset = rand.nextInt(w)
return (int) Math.round(hashValue);
Generally Euclidean hash in lsh mean that hash function that map a data (vector) in nearby position in Euclidean space to an integer.
One way to do this is by generating random line, and dividing the line into segments where a segment represent a hash number. Then, hash can be obtained by projecting the data vector to this line and observing which segment it falls to.
The function you asked seems to be using similar approach but using dot product instead of projection

Matlab double function outputs infinity when taking a big number of type "sym"

This is literally the number I obtain (from symsum function), which is of type sym:
a=328791078344903739363762093060350430076929707044786898291940722052812676355129485878814911641516759087483581972443760841410582114920781832660013389681326267351368505696628653562484228680842650173635989588528021721039959787053654401351638478786763875479187208098871238084448485336138651690856082810553570419028927840285091142054111375001
I would like to make mathematical operations (in particular, take a natural log) on this number and so want to transform it to double, however the output from double(a) is simply "Inf". How to go about this problem and convert it from "sum" to a numeric type?
Your number is ~3.3x10335 but the largest number that can be represented by MATLAB's double precision floating point numbers is ~1.8x10308 (see the output of realmax). Converting your number to double precision causes overflow because the number is larger than can be represented so MATLAB just returns Inf.
For an exhaustive overview of floating point representations and arithmetic, you can check out this PDF.
Can you count the digits and insert a decimal point before converting to double?
If so, take advantage of the fact that the natural log of a number that overflows may not itself overflow.
Using "^" for power, you can represent your number as 3.28791078344903739363762093060350430076929707044786898291940722052812676355129485878814911641516759087483581972443760841410582114920781832660013389681326267351368505696628653562484228680842650173635989588528021721039959787053654401351638478786763875479187208098871238084448485336138651690856082810553570419028927840285091142054111375001 * (10 ^ 335).
The decimal log of (10^335) is 335. Its natural log is 335*log(10).
The natural log of the original number is:
log(3.287910783449037393637620930603504300769297070447868982919407220528)
+ 335*log(10)
All inputs, intermediate results, and the final result of this calculation are in the double range.

Finding the maximum value of a function under uncertainty

I have three values X,Y and Z. These values have a range of values between 0 and 1 (0 and 1 included).
When I call a function f(X,Y,Z) it returns a value V (value between 0 and 1). My Goal is to choose X,Y,Z so that the returned value V is as close as possible to 1.
The selection Process should be automated and the right values for X,Y,Z are unknown.
Due to my Use Case it is possible to set Y and Z to 1 (the value 1 hasn't any influence on the output) and search for the best value of X.
After that I can replace X by that value and do the same for Y. Same procedure for Z.
How can I find the "maximum of the function"? Is there somekind of "gradient descend" or hill climbing algorithm or something like that?
The whole modul is written in perl so maybe there is an package for perl that can solve that problem?
You can use Simulated Annealing. Its a multi-variable optimization technique. It is also used to get a partial solution for the Travelling Salesperson problem. Its one of the search algorithms mentioned in Peter Norvig's Intro to AI book as well.
Its a hill climbing algorithm which depends on random variables. Also it won't necessarily give you the 'optimal' answer. You can also vary the iterations required by it as per your computational/time needs.
http://en.wikipedia.org/wiki/Simulated_annealing
http://www1bpt.bridgeport.edu/sed/projects/449/Fall_2000/fangmin/chapter2.htm
I suggest you take a look at Math::Amoeba which implements the Nelder–Mead method for finding stationary points on functions.

How do I deterimine if a double is not infinity, -infinity or Nan in SSRS?

If got a dataset returned from SSAS where some records may be infinity or -infinity (calculated in SSAS not in the report).
I want to calculate the average of this column but ignore those records that are positive or negative infinity.
My thought is to create a calculated field that would logically do this:
= IIF(IsInfinity(Fields!ASP.Value) or IsNegativeInfinity(Fields!ASP.Value), 0 Fields!ASP.Value)
What I can't figure out is how to do the IsInfinity or IsNegativeInfinity.
Or conversely is there a way to calculate Average for a column ignoring those records?
Just stumbled across this problem and found a simple solution for determining whether a numeric field is infinity.
=iif((Fields!Amount.Value+1).Equals(Fields!Amount.Value), false,true)
I'm assuming you are using the business intelligence studio rather than the report builder tool.
Maybe you are trying a formula because you can't change the SSAS MDX query, but if you could then the infinity would likely be caused by a divide by zero. The NaN is likely caused by trying to do Maths with NULL values.
Ideally change the cube itself so that the measure is safe from divide by zero (e.g. IIF [measure] = 0,don't do the division just return "", otherwise do it) . Second option would be create a calculated measure in the MDX query that does something similar.
As for a formula, there are no IsInfinity functions so you would have to look at the value of the field and see if its 1.#IND or 1.#INF or NaN.