Modifying Levenshtein Distance for positional bias

I am using the Levenshtein distance algorithm to compare a company name provided as user input against a database of known company names to find the closest match. By itself, the algorithm works okay, but I want to build in a bias so that the edit distance is considered lower if the initial parts of the strings match.
For example, if the search criterion is "ABCD", then both "ABCD Co." and "XYX ABCD" have an identical edit distance. However, I want to add weight to the fact that the initial part of the first string matches the search criterion more closely than that of the second string.
One way of doing this might be to modify the insert/delete/replace costs to be higher at the beginning of the strings and lower towards the end. Does anyone have an example of a successful implementation of this? Is using Levenshtein distance still the best way to do what I am trying to achieve? Is my assumption about the approach accurate?
UPDATE: For my immediate purposes I have decided to forgo the above and instead use the Jaro-Winkler distance, which seems to solve the problem. However, I will leave this open for further input.
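For reference, the position-dependent cost idea from the question could be sketched roughly as follows, in Python. This is only an illustration under assumptions that are not in the question: the function name positional_levenshtein, the linear decay from start_weight to end_weight, and the choice to price a substitution by the earlier of the two positions are all made up for the example.

```python
def positional_levenshtein(query, candidate, start_weight=2.0, end_weight=1.0):
    """Edit distance in which edits near the start of the strings cost more.

    Assumption for illustration: the cost of an edit decays linearly from
    start_weight (position 0) towards end_weight over the longer string.
    """
    n, m = len(query), len(candidate)
    longest = max(n, m, 1)

    def weight(pos):
        # Linear decay of the edit cost with the character position.
        return start_weight + (end_weight - start_weight) * pos / longest

    # Row 0: turning the empty prefix of query into candidate[:j] by insertions.
    prev = [0.0]
    for j in range(m):
        prev.append(prev[-1] + weight(j))

    for i in range(1, n + 1):
        # Column 0: deleting the first i characters of query.
        cur = [prev[0] + weight(i - 1)]
        for j in range(1, m + 1):
            same = query[i - 1] == candidate[j - 1]
            substitute = prev[j - 1] + (0.0 if same else weight(min(i - 1, j - 1)))
            delete = prev[j] + weight(i - 1)
            insert = cur[j - 1] + weight(j - 1)
            cur.append(min(substitute, delete, insert))
        prev = cur
    return prev[m]


for name in ("ABCD Co.", "XYX ABCD"):
    print(name, positional_levenshtein("ABCD", name))
```

With the default weights, "ABCD Co." scores lower than "XYX ABCD" for the query "ABCD", while setting both weights to 1.0 gives back the plain Levenshtein distance, where the two candidates tie.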

What you're looking for looks like a Smith-Waterman local alignment: http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm

Related

Sampling search domain

In MiniZinc, is it possible to sample the domain? Let's say my domain has many solutions; running --all-solutions will initially return very similar solutions.
1) Is there a way to sample the domain, perhaps with BFS? The purpose is follow-up analysis of the solutions.
2) Are there any methods to estimate the size of the search domain in CP?
My domain is a staff rostering problem.
Regards,
H
It is not possible to choose BFS in MiniZinc, but there are search annotations. With search annotations you can choose the order in which the variables are branched on, and also which value is branched on. Unfortunately, MiniZinc does not support random variable search.
In your case I would branch using dom_w_deg with a random value, but any other variable selection can work; try them.
solve::seq_search([int_search(some_array, dom_w_deg, indomain_random,complete)]) satisfy;
Do note that not all solvers support the usage of search annotations.
Another alternative is to add constraints that rule out similar results.
You can always calculate an upper bound on the size of the search space: the product of the domain sizes of all variables (for n variables that each have d possible values, that is d^n). This does not take any constraints into account, so the real search space can be much smaller.
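As a quick illustration of that unconstrained upper bound, here is a small Python sketch that multiplies the domain sizes of all decision variables together. The roster dimensions (10 staff, 7 days, 4 shift values) are hypothetical numbers, not taken from the question.

```python
from math import prod


def search_space_upper_bound(domain_sizes):
    """Product of the domain sizes: the number of unconstrained assignments."""
    return prod(domain_sizes)


# Hypothetical roster: 10 staff x 7 days, each cell takes one of 4 shift values.
bound = search_space_upper_bound([4] * (10 * 7))
print(f"roughly 10^{len(str(bound)) - 1} assignments before any constraints")
```

The number says nothing about how many of those assignments are actual solutions; it only shows how quickly the raw space grows.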
You can also visualize the search using Gist or similar tools.
(source: marco at www.imada.sdu.dk)
You can expand and retract parts of the search tree and see which variables have been branched on.

How to write elisp to auto-correct misspelt variables

Suppose I wrote
var longlongname = 1;
and I later misspelled it as linglongname. How can I find a package or write a function to correct it?
(Sometimes I'm lazy and prefer to press a key to fix the previous misspelt word rather than move the cursor around and correct it manually.)
This problem that you want to solve is undecidable in the general case.
In particular cases, you can use flymake combined with flyspell.
Other packages that would be useful for you in combination with flymake are company and auto-complete.
Sounds like a fun Elisp exercise. Recipe for a function that does what you want:
Build a dictionary of all words occurring in the buffer.
Calculate the Levenshtein distance of all these words to the word in question.
Replace the word in question with the most similar word according to the Levenshtein distance.
The Levenshtein distance is easy to compute. It basically counts the number of changes you have to make to one word in order to transform it into another. I'm sure someone has already implemented the Levenshtein distance in Elisp. In a more advanced solution you could perhaps use the syntax table to narrow down the dictionary to tokens that actually are variables. If there is more than one token with the minimal Levenshtein distance, you'd have to prompt the user before substituting. If the Levenshtein distance of the closest match is above a certain threshold (e.g., Levenshtein distance / total number of characters in the two words > 1/5), you might not want to replace at all because the closest match is not close enough to be a plausible candidate.
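The recipe above can be sketched as follows. This is in Python rather than Elisp, purely to show the logic; the function names, the identifier regex, and the default threshold of 1/5 (taken from the example figure above) are assumptions for illustration.

```python
import re


def levenshtein(a, b):
    """Classic dynamic-programming edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def suggest_correction(word, buffer_text, max_ratio=0.2):
    """Return the closest other word in the buffer, or None if unsure."""
    # Step 1: dictionary of all identifier-like words in the buffer.
    candidates = set(re.findall(r"[A-Za-z_]\w*", buffer_text)) - {word}
    if not candidates:
        return None
    # Step 2: distance from the word in question to every candidate.
    scored = sorted((levenshtein(word, c), c) for c in candidates)
    best_distance, best = scored[0]
    # A tie for the minimum means the user would have to be prompted.
    if len(scored) > 1 and scored[1][0] == best_distance:
        return None
    # Step 3: only accept the match if it is close enough.
    if best_distance / (len(word) + len(best)) > max_ratio:
        return None
    return best


buffer_text = "var longlongname = 1;\nconsole.log(linglongname);"
print(suggest_correction("linglongname", buffer_text))
```

Returning None for ties and for distant matches keeps the decision with the user, which is what the prompting and threshold suggestions above amount to.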

Find points near LineString in mongodb sorted by distance

I have an array of points representing a street (black line) and points representing places on a map (red points). I want to find all the points near the specified street, sorted by distance. I also need to have the ability to specify a max distance (blue and green areas). Here is a simple example:
I thought of using the $near operator but it only accepts Point as an input, not LineString.
How can MongoDB handle this type of query?
As you mentioned, Mongo currently doesn't support anything other than Point. Have you come across the concept of a route boxer [1]? It was very popular a few years back on Google Maps. Given the line that you've drawn, find stops that are within dist(x). It was done by creating a series of bounding boxes around each point in the line, and searching for points that fall within the bucket.
I stumbled upon your question after I just realised that Mongo only works with points, which is reasonable I assume.
I already have a few options of how to do it (they expand on what #mnemosyn says in the comment). With the dataset that I'm working on, it's all on the client-side, so I could use the routeboxer, but I would like to implement it server-side for performance reasons. Here are my suggestions:
1) Break the LineString down into its individual coordinate sets, query for $near using each of those, combine the results and extract a unique set. There are algorithms out there for simplifying a complex line by reducing the number of points, but a simple one is easy to write.
2) Do the same as above, but as a stored procedure/function. I haven't played around with Mongo's stored functions, and I don't know how well they work with drivers, but this could be faster than the first option above as you won't have to do roundtrips, and depending on the machine that your instance(s) of Mongo is (are) hosted on, calculations could be faster by microseconds.
3) Implement the routeboxer approach server-side (has been done in PHP), and then use either of the above 2 to find stops that are $within the resulting bounding boxes. Heck, since the routeboxer method returns rectangles, it would be possible to merge all these rectangles into one polygon covering your route, and just do a $within on that. (What #mnemosyn suggested.)
EDIT: I thought of this but forgot about it, but it might be possible to achieve some of the above using the aggregation framework.
It's something that I'm going to be working on soon (hopefully), I'll open-source my result(s) based on which I end up going with.
EDIT: I must mention though that 1 and 2 have the flaw that if you have 2 points in a line that are say 2km apart, and you want points that are within 1.8km of your line, you'll obviously miss all the points between that part of your line. The solution is to inject points onto your line when simplifying it (I know, beats the objective of reducing points when adding new ones back in).
The flaw with 3 then is that it won't always be accurate as some points within your polygon are likely to have a distance greater than your limit, though the difference wouldn't be a significant percentage of your limit.
[1] google maps utils routeboxer
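The point-injection fix mentioned for options 1 and 2 could look something like the sketch below, in plain Python with no MongoDB calls. It assumes points are (lon, lat) pairs, uses a flat-earth approximation that is only reasonable for short segments, and the names approx_distance_m and densify are made up for the example.

```python
import math

EARTH_RADIUS_M = 6371000.0


def approx_distance_m(p, q):
    """Approximate ground distance in metres between two (lon, lat) points.

    Equirectangular approximation: an assumption that only holds for
    short segments.
    """
    mean_lat = math.radians((p[1] + q[1]) / 2)
    dx = math.radians(q[0] - p[0]) * math.cos(mean_lat) * EARTH_RADIUS_M
    dy = math.radians(q[1] - p[1]) * EARTH_RADIUS_M
    return math.hypot(dx, dy)


def densify(line, max_gap_m):
    """Inject points so consecutive points are at most about max_gap_m apart."""
    if not line:
        return []
    dense = [line[0]]
    for start, end in zip(line, line[1:]):
        pieces = max(1, math.ceil(approx_distance_m(start, end) / max_gap_m))
        # Linearly interpolate the intermediate points of this segment.
        for k in range(1, pieces):
            t = k / pieces
            dense.append((start[0] + t * (end[0] - start[0]),
                          start[1] + t * (end[1] - start[1])))
        dense.append(end)
    return dense


street = [(0.0, 0.0), (0.0, 0.02)]  # two points roughly 2.2 km apart
print(densify(street, 500.0))
```

Each point of the densified line can then serve as the centre of one $near query, with the results merged into a unique set as described in option 1. How small the gap needs to be relative to the search radius is a trade-off between accuracy and the number of queries.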
As you said, Mongo's $near only accepts a point, not a line, as the centre. However, if you flip your premise from "find points near the line" to "find the line near the point", you can use your points as the centre and the line as the target.
this is the difference between
foreach line find points near it
and
foreach point find line near it
if you have a large number of points to check you can combine this with nevi_me's answer to reduce the list of points that need checking to a much smaller subset

Problem with block matching in matlab

I have written MATLAB code for two different block matching algorithms, extensive search and three-step search, but I am not sure how I can check whether I am getting the correct results. Is there any standard way to check these, or any standard code which I can run and compare my results with? I read somewhere that the JM software can be used, but I didn't find any way to use it.
You can always use the results produced by your algorithms to create the next frame of video and then analyze its quality by either visually inspecting it (which is rather subjective, and we like to deal in numbers) or calculating the mean square error between the produced image and the one you're trying to estimate. Mean square error of the exhaustive (extensive) search should be lower than the one three-step gives you.
Well, did you try to plot it? I mean, after the block matching you have a new image, right?
One way to check whether your result is correct is to look at the sum of the differences between two frames.
A - pre_frame
B - post_frame
C - Compensated frame
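If the sum of absolute differences between the compensated frame C and the frame it is meant to predict is lower than sum(abs(A-B)), the difference between the two original frames, that means it could be correct.
Next time, try to specify your algorithm. Also, post your code here so we can help you more.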
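Both checks suggested in these answers, the mean square error and the sum of absolute differences, are easy to compute. Here is a minimal sketch in Python rather than MATLAB, assuming each frame is a list of rows of pixel intensities; the frame names and the tiny 2x2 values are invented for the example.

```python
def sad(frame_a, frame_b):
    """Sum of absolute differences between two equally sized frames."""
    return sum(abs(a - b)
               for row_a, row_b in zip(frame_a, frame_b)
               for a, b in zip(row_a, row_b))


def mse(frame_a, frame_b):
    """Mean square error between two equally sized frames."""
    pixel_count = sum(len(row) for row in frame_a)
    squared = sum((a - b) ** 2
                  for row_a, row_b in zip(frame_a, frame_b)
                  for a, b in zip(row_a, row_b))
    return squared / pixel_count


# Hypothetical 2x2 frames: the frame being predicted, the frame used as
# reference, and the motion-compensated prediction built from the reference.
target = [[10, 20], [30, 40]]
reference = [[12, 25], [30, 44]]
compensated = [[10, 21], [30, 40]]

# A plausible result has the compensated frame closer to the target
# than the uncompensated reference is.
print(sad(target, compensated) < sad(target, reference))
print(mse(target, compensated), mse(target, reference))
```

Running the same comparison on the output of both algorithms gives a number to compare, instead of relying on visual inspection alone.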

Calculation route length

I have a map with about 80 annotations. I would like to do 3 things.
1) From my current location, I would like to know the actual route distance to that position. Not the linear distance.
2) I want to be able to show a list of all the annotations, but for every annotation (having lon/lat) I would like to know the actual route distance from my position to that position.
3) I would like to know the closest annotation to my position using route distance. Not linear distance.
I think the answer to all these three points will be the same. But please keep in mind that I don't want to create a route, I just want to know the distance to the annotation.
I hope someone can help me.
Best regards,
Paul Peelen
From what I understand of your post, I believe you seek the Haversine formula. Luckily for you, there are a number of Objective-C implementations, though writing your own is trivial once the formula's in front of you.
I originally deleted this because I didn't notice that you didn't want linear distance at first, but I'm bringing it back in case you decide that an approximation is good enough at that particular point of the user interaction.
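In case the straight-line approximation is good enough, here is a minimal sketch of the Haversine formula in Python, used to rank annotations by great-circle distance. The mean Earth radius of 6371 km, the function names, and the sample coordinates are assumptions for illustration, and the result is not a route distance.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


def rank_by_distance(origin, annotations):
    """Sort (name, lat, lon) annotations by distance from origin (lat, lon)."""
    return sorted(annotations,
                  key=lambda ann: haversine_km(origin[0], origin[1], ann[1], ann[2]))


# Hypothetical annotations as (name, lat, lon).
annotations = [("far", 0.0, 2.0), ("near", 0.0, 1.0), ("mid", 1.0, 1.0)]
for name, lat, lon in rank_by_distance((0.0, 0.0), annotations):
    print(name, round(haversine_km(0.0, 0.0, lat, lon), 1), "km")
```

A ranking like this could also be used to pick a handful of nearby candidates before asking a routing service for actual route distances.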
I think, as pointed out before, your query would be extremely heavy for the Google Maps API if you perform exactly what you are describing. Do you need all that information at once? Maybe at first it would be good enough to query just some of the distances, based on some heuristic or on the user's needs.
To obtain the distances, you could use a Google Maps GDirections object, as pointed out here (at the bottom of the page there's a "Routes and Steps" section with an advanced example).
"The GDirections object also supports multi-point directions, which can be constructed using the GDirections.loadFromWaypoints() method. This method takes an array of textual input addresses or textual lat/lon points. Each separate waypoint is computed as a separate route and returned in a separate GRoute object, each of which contains a series of GStep objects."
Using the Google Maps API in the iPhone shouldn't be too difficult, and I think your question doesn't cover that, but if you need some basic example, you could look at this question, and scroll to the answer.
Good Luck!
Calculating route distance to about 80 locations is certain to be computationally intensive on Google's part and I can't imagine that you would be able to make those requests to the Google Maps API, were it possible to do so on a mobile device, without being severely limited by either the phone connection or rate limits on the server.
Unfortunately, calculating route distance rather than geometric distance is a very expensive computation involving a lot of data about the area - data you almost certainly don't have. This means, unfortunately, that this isn't something that Core Location or MapKit can help you with.
What problem are you trying to solve, exactly? There may be heuristics other than route distance that you can use to approximate some sort of distance ranking.