Suppose I wrote
var longlongname = 1;
and I misspelled it as linglongname. How can I find a package or write a function to correct it?
(Sometimes I'm lazy and prefer to press a key that fixes the previous misspelt word rather than move the cursor around and correct it manually.)
The problem you want to solve is undecidable in the general case.
In particular cases, you can use flymake combined with flyspell.
Other packages that would be useful for you in combination with flymake are company and auto-complete.
Sounds like a fun Elisp exercise. Recipe for a function that does what you want:
Build a dictionary of all words occurring in the buffer.
Calculate the Levenshtein distance of all these words to the word in question.
Replace the word in question with the most similar word according to the Levenshtein distance.
The Levenshtein distance is easy to compute: it counts the minimum number of single-character insertions, deletions, and substitutions needed to transform one word into another. I'm sure someone has already implemented it in Elisp. In a more advanced solution you could perhaps use the syntax table to narrow the dictionary down to tokens that actually are variables. If more than one token has the minimal Levenshtein distance, you'd have to prompt the user before substituting. And if the Levenshtein distance of the closest match is above a certain threshold (e.g., Levenshtein distance / total number of characters in the two words > 1/5), you might not want to replace at all, because the closest match is not close enough to be a plausible candidate.
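A minimal sketch of the recipe, in Python rather than Elisp (the Elisp version would use the same dynamic-programming recurrence); closest_word and the token regex are illustrative choices, not a finished command:

import re

def levenshtein(a, b):
    # Classic dynamic-programming edit distance, computed row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def closest_word(typo, text, max_ratio=0.2):
    # Dictionary of all identifier-like tokens occurring in the text/buffer.
    words = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text)) - {typo}
    best = min(words, key=lambda w: levenshtein(typo, w))
    d = levenshtein(typo, best)
    # Only suggest a replacement if the closest match is plausible.
    return best if d / (len(typo) + len(best)) <= max_ratio else None

print(closest_word("linglongname", "var longlongname = 1;"))   # -> longlongname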
Related
In my class, for a final project, we are working on improving an algorithm that matches a prefix to a power of 2
(e.g. input="25", output="2^8=256"; input="99", output="2^93=9903520314283042199192993792...").
Anyway, we are relying on logarithms to identify matching prefixes. Logarithm precision actually does matter, and we are looking for better log functions. The standard log function and the calc function both have the same precision. Are there any options if I wanted even better precision?
Based on several Google searches, I've come to the conclusion that one currently does not exist. If someone were to inform me otherwise, however, I would happily change my accepted answer.
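For what it's worth, the logarithm-based prefix test the question describes can be sketched outside Emacs with arbitrary-precision arithmetic; here is an illustrative Python version using the decimal module (the function name and the chosen precision are just assumptions for the sketch):

from decimal import Decimal, getcontext

getcontext().prec = 50                      # work with 50 significant digits

LOG10_2 = Decimal(2).log10()                # log10(2) at high precision

def power_of_two_with_prefix(prefix, max_exp=10000):
    # Smallest n such that the decimal expansion of 2^n starts with `prefix`:
    # frac(n*log10(2)) must land in [frac(log10(p)), frac(log10(p)) + log10((p+1)/p)).
    p = Decimal(prefix)
    lo = p.log10() % 1
    width = (p + 1).log10() - p.log10()
    for n in range(1, max_exp):
        if (n * LOG10_2 - lo) % 1 < width:
            return n
    return None

print(power_of_two_with_prefix("25"))       # 8,  since 2^8  = 256
print(power_of_two_with_prefix("99"))       # 93, since 2^93 = 9903520314283042199192993792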
In MiniZinc, is it possible to sample the domain? Let's say my problem has many solutions; running --all-solutions will initially return very similar solutions.
1) Is there a way to sample the domain, perhaps with BFS? The purpose is follow-up analysis of the solutions.
2) Are there any methods to estimate the search space size in CP?
My problem is a staff rostering problem.
Regards,
H
It is not possible to choose BFS in MiniZinc, but there are search annotations. With search annotations you can choose the order in which the variables are branched on, and also which value is tried. Unfortunately, MiniZinc does not support random variable selection.
In your case I would branch with dom_w_deg and a random value choice, but any other variable selection can work; try them.
solve :: seq_search([int_search(some_array, dom_w_deg, indomain_random, complete)]) satisfy;
Do note that not all solvers support the usage of search annotations.
Another alternative is to add constraints that exclude solutions too similar to the ones already found.
You can always calculate the number of candidate assignments: the product of the domain sizes of all the variables. This does not take any constraints into account, so the real search space can be much smaller.
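For instance (numbers purely hypothetical), a roster with 30 shifts, each assignable to any of 12 staff members, already gives an astronomically large raw space:

from math import prod

domain_sizes = [12] * 30                 # 30 decision variables, domain size 12 each
print(prod(domain_sizes))                # 12**30, roughly 2.4e32 raw assignments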
Another way to get a feel for the search is to visualize it with Gist or similar tools.
[screenshot of a Gist search-tree visualization; source: marco at www.imada.sdu.dk]
You can expand and collapse parts of the search tree and see which variables have been branched on.
I know there are ways to find synonyms, either with NLTK/pywordnet or the Pattern package in Python, but they aren't solving my problem.
If there are words like
bad, worst, poor
bag, baggage
lost, lose, misplace
I am not able to capture them. Can anyone suggest a possible way?
There has been a great deal of research in this area in the past 20 years. Yes, computers don't understand language, but we can train them to find the similarity or difference between two words with the help of some manual effort.
Approaches may be:
Based on manually curated datasets that contain how words in a language are related to each other.
Based on statistical or probabilistic measures of words appearing in a corpus.
Method 1:
Try WordNet. It is a human-curated network of words which preserves the relationships between words according to human understanding. In short, it is a graph whose nodes are 'synsets' and whose edges are relations between them, so two words that are close in the graph are close in meaning; words within the same synset may mean exactly the same thing. 'Bag' and 'baggage' are close, which you can find by exploring node-to-node in a breadth-first style, e.g. starting with 'bag' and expanding its neighbours until you find 'baggage'. You'll have to limit this search to a small number of iterations for any practical application. Another style is starting a random walk from one node and trying to reach the other node within a bounded number of moves. If you reach 'baggage' from 'bag', say, 500 times out of 1000 within 10 moves, you can be pretty sure that they are very similar to each other. Random walks are more helpful in much larger and more complex graphs.
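One possible way to query WordNet through NLTK (path_similarity is just one of several relatedness measures it offers, and the comparison words here are only an illustration):

from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet') once

def wordnet_similarity(w1, w2):
    # Best path similarity over all synset pairs of the two words.
    scores = [s1.path_similarity(s2)
              for s1 in wn.synsets(w1)
              for s2 in wn.synsets(w2)]
    scores = [s for s in scores if s is not None]
    return max(scores) if scores else 0.0

print(wordnet_similarity('bag', 'baggage'))   # nouns in the same region of the hierarchy
print(wordnet_similarity('bag', 'apple'))     # typically comes out lower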
There are many other similar resources online.
Method 2:
Word2Vec. It is hard to explain fully here, but it works by creating, for each word, a vector with a user-chosen number of dimensions based on the contexts the word appears in. There has been an idea for two decades that words appearing in similar contexts mean similar things; e.g. "I'm gonna check out my bags" and "I'm gonna check out my baggage" might both appear in text. You can read the paper for the explanation (link at the end).
So you can train a Word2Vec model over a large corpus. In the end, you will get a 'vector' for each word. You do not need to understand the significance of the individual dimensions of this vector. You can use this vector representation to find the similarity or difference between words, or to generate synonyms of any word. The idea is that words which are similar to each other have vectors that are close to each other.
Word2vec came out two years ago and immediately became the thing to use in most NLP applications. The quality of this approach depends on the amount and quality of your data. A Wikipedia dump is generally considered good training data, as it contains articles about almost everything. You can easily find ready-to-use models trained on Wikipedia online.
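A minimal training sketch with gensim (the corpus here is a meaningless toy, and parameter names differ slightly between gensim versions, e.g. size vs vector_size):

from gensim.models import Word2Vec

# Toy corpus; in practice use something large such as a Wikipedia dump.
sentences = [
    ["i", "will", "check", "in", "my", "bags"],
    ["i", "will", "check", "in", "my", "baggage"],
    ["the", "airline", "lost", "my", "baggage"],
]

# vector_size is the number of dimensions per word (gensim 4.x naming).
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv.similarity("bags", "baggage"))   # meaningless on a toy corpus,
print(model.wv.most_similar("bags", topn=3))    # but shows the calls involved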
A tiny example from Radim's website:
>>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]
>>> model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
>>> model.similarity('woman', 'man')
0.73723527
The first example gives the word closest (topn=1) to woman and king while also being farthest from man; the answer is queen. The second example picks the odd one out. The third tells you how similar the two words are in your corpus.
An easy-to-use tool for Word2vec:
https://radimrehurek.com/gensim/models/word2vec.html
http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf (Warning : Lots of Maths Ahead)
I am using the Levenshtein distance algorithm to compare a company name provided as user input against a database of known company names to find the closest match. By itself, the algorithm works okay, but I want to build in a bias so that the edit distance is considered lower if the initial parts of the strings match.
For example, if the search criterion is "ABCD", then "ABCD Co." and "XYX ABCD" have identical edit distances. However, I want to give weight to the fact that the initial part of the first string matches the search criterion more closely than the second string does.
One way of doing this might be to modify the insert/delete/replace costs to be higher at the beginning of the strings and lower towards the end. Does anyone have an example of a successful implementation of this? Is Levenshtein distance still the best way to do what I am trying to achieve? Is my assumed approach sound?
UPDATE: For my immediate purposes I have decided to forgo the above and instead use the Jaro-Winkler distance, which seems to solve the problem. However, I will leave this open for further input.
What you're looking for looks like a Smith-Waterman local alignment: http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm
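To illustrate the cost-weighting idea from the question (as opposed to Smith-Waterman or Jaro-Winkler), here is a rough Python sketch in which edits near the start of the strings cost more than edits near the end; the weighting function is an arbitrary choice:

def weighted_levenshtein(a, b, early_penalty=2.0):
    # Edit distance where edits near the start of the strings cost more.
    m, n = len(a), len(b)
    def w(i):                                    # cost of an edit at 0-based position i
        return 1.0 + early_penalty * (1.0 - i / max(m, n, 1))
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + w(i - 1)       # deletions
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + w(j - 1)       # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else w(min(i, j) - 1)
            dp[i][j] = min(dp[i - 1][j] + w(i - 1),     # delete a[i-1]
                           dp[i][j - 1] + w(j - 1),     # insert b[j-1]
                           dp[i - 1][j - 1] + sub)      # match / substitute
    return dp[m][n]

# With position weighting, "ABCD Co." now scores better than "XYX ABCD" for "ABCD".
print(weighted_levenshtein("ABCD", "ABCD Co."))
print(weighted_levenshtein("ABCD", "XYX ABCD"))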
I'm doing a map-coloring problem in Scheme, and I used the minimum remaining values heuristic (select the vertex with the fewest legal colors) and the degree heuristic (select the vertex with the largest number of neighbors). If a solution exists for a given configuration, will these heuristics ensure that the search never needs to backtrack?
Let's do a simple theoretical analysis.
1) Graph coloring is NP-complete for general graphs (already for the question of whether 3 colors suffice). This means there is no known polynomial-time algorithm.
2) Your heuristics are computable in polynomial time.
3) Assume no backtracking is needed. Then you make n steps, each of which requires polynomial time (n is the number of vertices). Thus you can solve the problem in polynomial time.
4) Either you have proven P = NP, or your assumption is wrong.
I leave it up to you to decide which option in point (4) is more plausible.
In general: no, MRV and your other heuristic will not guarantee a straight walk to the goal. (I imagine they might if your problem has some very specific structure, but don't count on it until you've seen the theorem.)
Heuristics prune the search space, or change the order of the search to make early termination more likely. This is not the same thing as eliminating backtracking, but it is a related concept.
We prune some parts of the space because we are confident that the solution does not lie in those branches of the search tree, or change the order because we have some reason to believe it will be quicker to look in some subtrees before others.
We also cut ourselves off from backtracking because we are confident that the solution is in the branch of the space we are in now (so that if we don't find it in this subtree, we can declare failure and don't bother).
Both kinds of strategies are ultimately about searching less of the space somehow and getting to the answer (positive or negative) without searching everything.
MRV and the degree heuristic are about reordering the sub-searches, not about avoiding backtracking. Heuristics can guess right and make the search short, but that's not the same thing as eliminating backtracking (e.g. the "cut" operator in Prolog). When you find what you're looking for, you can declare success, and of course that eliminates further backtracking. But real backtracking elimination means deciding not to backtrack no matter what, before the search completes.
E.g. if you're doing a depth-first search, and you find what you're looking for by dumb luck without backtracking, we cannot say that dumb luck is a fence operation that eliminates backtracking. :)
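To make the distinction concrete, here is a minimal sketch (in Python, not Scheme) of a backtracking colorer that uses MRV with the degree heuristic as a tie-breaker and counts how often it has to undo an assignment; on easy instances the count is often zero, but nothing about the heuristics guarantees that:

def color_graph(neighbors, colors):
    # Backtracking search using MRV, with the degree heuristic as tie-breaker.
    domains = {v: set(colors) for v in neighbors}
    assignment = {}
    backtracks = 0

    def select_var():
        unassigned = [v for v in neighbors if v not in assignment]
        # MRV: fewest remaining colors; tie-break: most unassigned neighbors.
        return min(unassigned,
                   key=lambda v: (len(domains[v]),
                                  -sum(u not in assignment for u in neighbors[v])))

    def search():
        nonlocal backtracks
        if len(assignment) == len(neighbors):
            return True
        v = select_var()
        for c in list(domains[v]):
            if all(assignment.get(u) != c for u in neighbors[v]):
                assignment[v] = c
                saved = {u: domains[u].copy() for u in neighbors[v]}
                for u in neighbors[v]:
                    domains[u].discard(c)            # forward checking
                if search():
                    return True
                del assignment[v]                    # undo: this is a backtrack
                domains.update(saved)
                backtracks += 1
        return False

    found = search()
    return (assignment if found else None), backtracks

# The classic Australia map example, 3 colors.
australia = {
    'WA': {'NT', 'SA'}, 'NT': {'WA', 'SA', 'Q'}, 'Q': {'NT', 'SA', 'NSW'},
    'SA': {'WA', 'NT', 'Q', 'NSW', 'V'}, 'NSW': {'Q', 'SA', 'V'},
    'V': {'SA', 'NSW'}, 'T': set(),
}
print(color_graph(australia, ['red', 'green', 'blue']))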