Checking whether a particular word is a noun or verb - Word

I tried to check with Microsoft Word (using VBA code)
whether pump (as in oil pump) is a noun or a verb.
According to Microsoft Word it is a verb (actually, pump is a noun here).
I need to check this for a list of words (mostly technical).
Is it possible to compare against some database?
Something else?

I think the only answer here is "It can't be done".
Even with context, you'd need human interpretation to determine the word type in some cases.
Time flies like an arrow.
It can mean that time passes very quickly. In that case,
Time (noun) flies (verb) like an arrow (prepositional phrase).
Or it can mean that a group of insects has a preference for pointy things:
Time flies (compound noun) like (verb) an arrow (noun as an object).
Or it can be a suggestion to measure the speed of insects in the same way an arrow would:
Time (verb) flies (noun) like an arrow (prepositional phrase).
The Merriam Webster Learner's Dictionary has seven possible word types for "like": verb, noun, preposition, adjective, noun (yes, again, but with another meaning), adverb, conjunction. Each of these has several sub-categories for slightly different use cases. And they don't even mention the teenager use ("And I'm like REALLY, and he's like YES...")
The reason that dictionary entries (not the MS Word spelling dictionary, but references that explain the use and meaning) are so complex is that language is complex.
It is impossible to write some VBA, throw in some RegEx and determine the word class without fault.

I'd create a context-sensitive "local dictionary" for your project.
If you make "oil pump" a single entry, and search for that first, you can eliminate false readings.
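A minimal sketch of that lookup order, in Python (the entries and tags here are invented for illustration):

    # Tiny "local dictionary": longer (multi-word) entries are tried first,
    # so "oil pump" wins before "pump" alone can be misread as a verb.
    LOCAL_DICT = {
        "oil pump": "noun",   # multi-word technical term, matched first
        "pump": "noun",
        "check": "verb",
    }

    def classify(text):
        words = text.lower().split()
        tags, i = [], 0
        while i < len(words):
            # Try the longest phrase starting at position i first.
            for length in range(len(words) - i, 0, -1):
                phrase = " ".join(words[i:i + length])
                if phrase in LOCAL_DICT:
                    tags.append((phrase, LOCAL_DICT[phrase]))
                    i += length
                    break
            else:
                tags.append((words[i], "unknown"))
                i += 1
        return tags

    print(classify("check the oil pump"))
    # [('check', 'verb'), ('the', 'unknown'), ('oil pump', 'noun')]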

Functional programming: Curry & Fold - what are the etymologies?

Curry & Fold - what are the etymologies in the programmatic sense?
I do not see how any of the English meanings of these homonyms is related to the functionality of these terms.
If you had to rename them to something more obvious - how would you do it?
Curry is the last name of Haskell Curry, a prominent 20th-century logician; both the currying technique and the Haskell language (after which this question is presumably asked) are named after him.
And "folding" simply because the fold operator figuratively represents folding, like a hand of cards can be folded to look like a single card. Think of foldr (+) 0 [1,2,3] == 6 as a hand of cards 1, 2 and 3 folded into a single card 6.
The word "reducing", which also means folding, can be illustrated using a similar analogy.
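For what it's worth, the same fold goes by its "reduce" name in Python's standard library; a minimal example (functools.reduce is a left fold rather than foldr, but for addition the result is the same):

    from functools import reduce

    # "Reducing" the hand of cards 1, 2 and 3 into the single card 6,
    # just like foldr (+) 0 [1,2,3] == 6 above.
    assert reduce(lambda acc, x: acc + x, [1, 2, 3], 0) == 6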
Of course, Haskell is more magic than even the bluffiest and luckiest game of poker, so folds in functional programming can actually produce a deck of cards that holds more cards than the hand it was folded from, or cards can be folded into cats, etc.: foldr (\i acc -> [show i, show i, show i] ++ acc) [] [1,2,3] == ["1","1","1","2","2","2","3","3","3"]. Therefore what started out as folding eventually evolved into a remarkably general operator that can express map as well as filter etc., so don't get too carried away with the poker comparison and etymology.
As to what to rename them to: renaming a dead person might not be the most ethical thing to do. The poor guy was so successful that BOTH of his names are used for big things, and now you want to posthumously deprive him of that joy and rename him to something else? Unless perhaps that something else is Newton Watt Scoville or Kelvin Celsius Ångström, I'd seriously not attempt a rename.
However if you meant renaming the programming concept: it could instead be referred to by the name "ricing" in my hungry opinion. But Mr. Curry might still feel intimidated.
Folding could actually be renamed to bluffing, if you're not fulfilled by the multitude of presently available names for it — thanks to som-snytt for the constructive idea.
I believe that the "fold" term comes mainly from the use of the word "fold" in phrases like "to fold into...", which is a term most commonly used by chefs, I believe (I watch a lot of cooking shows...). We use it in the context of functional programming because we say that, for example, for lists, the head of the list is "folded into" the result of folding the tail. For example, the function foldr is a "recipe" for how to "cook" a list, and part of that recipe is "fold this into that", if you like.
The oldest reference to "folding" that I could find on the internet, in the context of functional programming, is in this report, published in 1985 by the University of Cambridge, which has this to say:
The function gather applies a function of two arguments “between” each element of a list and a terminal value. [...] This function is also known as reduce or fold in other languages.
So clearly the term "fold" was at least somewhat common even 30 years ago!

Use exact search with OR operator inside Sphinx query

There are different search options possible with Sphinx's extended query syntax.
Exact search: "I love to eat" //will match exact phrase
OR search: (eat|sleep|dream)
Is it possible to mix them and build queries like this:
"I love to (eat|sleep|dream)"
I know it is possible to simplify this and split the OR condition into separate exact phrases, like:
"I love to eat" | "I love to sleep" | "I love to dream"
But I plan to use a lot of OR groups with a lot of options inside, and expanding the query this way would make it huge.
So is it possible to use OR syntax inside exact match syntax in Sphinx?
No, it's not possible to use 'OR' within the phrase operator (the double quotes around words that enforce adjacency), which is the proper name for what you call 'exact match'.
Alas, there isn't a combined 'strict order' and 'near' operator either (i.e. there isn't a 'just before' operator), so you're forced to use both, something like:
("I love to" << (eat|sleep|dream)) NEAR/3 ("I love to" NEAR/1 (eat|sleep|dream))
which is no simpler, and I'd argue is more complicated and convoluted! The NEAR/3 in the middle is needed to make sure you're matching within the same sentence of the document (otherwise there are edge cases with false positives).
An off-the-wall idea, if you have a lot of these 'OR' lists, is to use wordforms instead of implementing them in the query. The drawback is that you need to know the lists in advance (i.e. they are compiled into the index), and 'opting out' is more complicated.
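For illustration, such a wordforms file might map every alternative onto one shared token (a sketch only; the token name actx is invented, and remember wordforms apply at both index and query time):

    # wordforms.txt -- every alternative becomes the same indexed token
    eat > actx
    sleep > actx
    dream > actx

With that in place, the single phrase query "I love to actx" would match all three variants, because documents containing eat, sleep or dream were indexed with the shared token.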

Emacs Word-Reordering Completion Style

I'm missing one piece in Emacs' already superbly unique completion system (completion-styles and completion-styles-alist), namely word- and sub-word-reordering a la Google search.
As an example, file-write should complete to write-file if no other style finds a completion. Word-separating characters could, for example, be matched using the regular expression "\\s_".
Even cooler and more general would be to apply the Damerau-Levenshtein edit distance (D) to words instead of letters. The completion candidates could then be sorted by increasing distance D, closest match first.
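To make that concrete, here is a minimal, language-neutral sketch of such a word-level distance in Python (illustrative only; the separator set and candidate names are made up):

    import re

    def word_distance(a, b):
        """Damerau-Levenshtein distance (OSA variant) over words, not characters."""
        aw, bw = re.split(r"[-_./ ]+", a), re.split(r"[-_./ ]+", b)
        d = [[0] * (len(bw) + 1) for _ in range(len(aw) + 1)]
        for i in range(len(aw) + 1):
            d[i][0] = i
        for j in range(len(bw) + 1):
            d[0][j] = j
        for i in range(1, len(aw) + 1):
            for j in range(1, len(bw) + 1):
                cost = 0 if aw[i - 1] == bw[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # delete a word
                              d[i][j - 1] + 1,         # insert a word
                              d[i - 1][j - 1] + cost)  # substitute a word
                # Transposing two adjacent words counts as a single edit.
                if i > 1 and j > 1 and aw[i - 1] == bw[j - 2] and aw[i - 2] == bw[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
        return d[-1][-1]

    candidates = ["write-file", "write-region", "find-file"]
    print(sorted(candidates, key=lambda c: word_distance("file-write", c)))
    # ['write-file', 'write-region', 'find-file'] -- the reordered match sorts first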
I have a fairly clear plan for how to implement this, and an implementation of D already exists. I'm asking anyway so I don't reinvent the wheel yet another time:
Has anybody implemented such a completion style already?
Per --
You cannot do what you want with vanilla Emacs (well, you can use Lisp to code whatever you need -- but you cannot do what you want out of the box).
Icicles gives you exactly what you want. It's called "progressive completion", and the idea is similar to using a pipeline of grep commands.
Nutshell view of progressive completion (and chipping away the non-elephant)
Progressive completion
You can also use Levenshtein matching for completion with Icicles, and combine that with progressive completion to match the words in any order.

Doing a different kind of completion

I am writing a function that uses the minibuffer and requires a somewhat different style of completion that might require deleting some characters. For example:
ar<tab> -> artist:
artist:ba<tab> -> artist:'Johann Sebastian Bach'
artist:'Johann Sebastian Bach'<tab> -> artist:'Bela Bartók'
artist:'Bela Bartók' and album:<tab>
etc...
I've already written the completion function that generates a list of possible strings for the current input, yet I cannot use it with completing-read and completion-table-dynamic, because only the alternatives that do not require deletion are displayed; in this case, only the first step, from ar to artist.
To do the job, I'm considering using the lower-level (read-from-minibuffer) with a custom keymap to do the completion and display the alternatives. Is there a simpler solution? If not, which functions are there to handle displaying and cycling through the Completions buffer?
Thanks!
EDIT: In the end, I rolled my own. Here is the code, if anybody's interested.
Yes, Icicles should give you what you want, IIUC.
I'm not sure how you're handling things like 'artist' and 'album', but there are a few ways Icicles could help.
1. You can match parts of a completion candidate in any order. So if 'foo' and 'bar' are parts of the same candidate, then you can match any candidates that have both, in either order. This is "progressive completion": you can keep adding patterns to narrow the choices. Patterns are ANDed. You can also subtract candidates that match a pattern.
2. You can have candidates that are "multi-completions". These are composed of parts, separated by a configurable string. You can match against any or all parts. So, for example, a candidate might be an album name plus artist names.
3. If you also use Bookmark+ then you can tag files, a la delicious.com tagging. You need not visit files to tag them. Tags are generally strings (including newlines if you want), but they can also have associated Lisp values. Here, you might have a file for each album and tag albums with descriptions and various artists. Then you can complete to find a given album file by matching parts of its name and/or artists.
For all Icicles completion, you can use progressive completion and complementing (#1). For each part of a progression you can use substring or regexp matching (or even fuzzy matching of various sorts).
Some links that might help:
http://www.emacswiki.org/emacs/Icicles_-_Progressive_Completion
http://www.emacswiki.org/emacs/Icicles_-_Multi-Completions
http://www.emacswiki.org/emacs/Icicles_-_Bookmark_Enhancements -- see, for example, command `icicle-find-file-tagged'
You may benefit from the Icicles library. It contains many features for enhancing minibuffer completion.

Is there a diff algorithm that preserves line ownership

My goal is coming up with a script to track the point at which a line was added, even if the line is subsequently modified or moved around (both of which confuse traditional VCS 'blame' scripts). I've done some minor background research (see bottom) but didn't find anything useful. I have a concept for how to proceed, but the runtime would be atrocious (there's a factorial involved).
The two missing features are tracking edited-in-place lines separately from a deletion-and-addition of that line, and tracking entire functions moved around so they're in different hunks. For those experienced with diff but unfamiliar with the terminology, a subsequence is a contiguous group of + or - lines, with a type of either delete (all -), add (all +), or replace (a combination). I need more information on moves and edit-in-place lines, something vaguely alluded to in an entry on c2: DiffAlgorithm (the paragraph starting with "My favorite mode"). Does anyone know what that is? (It seems to be based on Tichy; see bottom.)
Here's more info on the two missing features:
no concept of a change on a line (a fourth type, something like edit-in-place). In this hunk, the parent of 'bc' is 'b', but 'd' is new and isn't a descendant of 'b':
a
-b
+bc
+d
The workaround for this isn't too complicated, if the position of edits is the same (just an expanded version of markup_instraline_changes, but comparing edit distance on all equal-sized subsets of old and new lines).
no concept of "moving" code that preserves the ownership of the lines; e.g. this diff shouldn't alter the ownership of "line", although its position changes:
a
-line
c
+line
This could be dealt with in the same way, but with much worse runtime (instead of only checking single blocks marked 'replace', you'd need to check the Levenshtein distance between all added and all removed lines) and with likely false positives (some lines, like whitespace-only ones, aren't relevant to my problem).
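A minimal sketch of that pairing step, using Python's difflib (mentioned in the research notes below); the 0.6 threshold is an arbitrary stand-in for a real similarity criterion:

    import difflib

    def pair_moved_or_edited(removed, added, threshold=0.6):
        """Greedily pair each removed line with its most similar added line.

        Pairs at or above the threshold are treated as the "same" line
        (edited and/or moved), so they keep their original ownership.
        """
        pairs, remaining = [], list(added)
        for old in removed:
            best = max(remaining, default=None,
                       key=lambda new: difflib.SequenceMatcher(None, old, new).ratio())
            if best is not None:
                score = difflib.SequenceMatcher(None, old, best).ratio()
                if score >= threshold:
                    pairs.append((old, best, score))
                    remaining.remove(best)
        return pairs

    print(pair_moved_or_edited(["b", "line"], ["bc", "line", "d"]))
    # [('b', 'bc', 0.666...), ('line', 'line', 1.0)] -- 'd' stays unmatched (new)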
Research I've done: reading about gestalt pattern matching (Ratcliff and Obershelp, used in Python's difflib) and An O(ND) Difference Algorithm and its Variations (EW Myers).
After posting the question, I found references to Tichy84, which appears to be The string-to-string correction problem with block moves (which I haven't read yet), according to Walter Tichy's paper a year later on RCS.
You appear to be interested in origin tracking, the problem of tracing where a line came from.
Ideally, you'd instrument the editor to remember how things were edited, and store the edits with the text in your repository, thus solving the problem trivially, but none of us software engineers seem to be smart enough to implement this simple idea.
As a weak substitute, one can look at a sequence of source code revisions from the repository and reconstruct a "plausible" history of changes. This is what you seem to be doing by proposing the use of "diff". As you've noted, diff doesn't understand the idea of "moving" or "copying".
SD Smart Differencer tools compare source text by parsing the text according to the language it is in, discovering the code structures, and computing minimal-Levenshtein differences in terms of programming language constructs (identifiers, expressions, statements, blocks, classes, ...) and abstract editing operators "insert", "delete", "copy", "move" and "rename identifier within a scope". They produce diff-like output, a little richer because they tell you line/column -> line/column with different editing operations.
Obviously the "move" and "copy" edits are the ones most interesting to you in terms of tracking specific lines (well, specific language constructs). Our experience is that code goes through lots of copying and editing, too, which I suspect won't surprise you.
These tools are in Beta, and are presently available for COBOL, Java and C#. Lots of other languages are in the pipe, because the SmartDifferencer is built on top of a language-parameterized infrastructure, DMS Software Reengineering Toolkit, which has quite a number of already existing, robust language grammars.
I think the question of how much editing a line can undergo while remaining a descendant of some previously written line is very subjective and context-dependent, both of which are things a computer cannot judge. You'd have to specify some sort of configurable minimum similarity between lines in your program, I think... The other problem is that it is entirely possible for two identical lines to be written completely independently (for example, incrementing the value of some variable), and this will be quite a common thing, so your desired algorithm will quite often fail to give truthful or useful information about a line.
I would like to suggest an algorithm for this though (which makes tons of hopefully obvious assumptions by the way) so here goes:
Convert both texts to lists of lines
Copy the lists and strip all whitespace from inside of each line
Delete blank lines from both lists
Repeat
    Do a Levenshtein distance from the old to new lists,
        keeping all intermediate data
    Find all lines in the new text that were matched with old lines
    Mark the line in both new/old original lists as having been matched
    Delete the line from the new text (the copy)
    Optional: If some matched lines are in a contiguous sequence
        in either original text, assign them to a grouping as well!
Until there is nothing left but unmatchable lines in the new text

Group together sequences of unmatched lines in both old and new texts
    which are contiguous in the original text
Attribute each with the line match before and after
Run through all groups in old text
    If any match before and after attributes with new-text groups, for each:
        // i.e. if they are inside the same area, basically
        Concatenate all the lines in both groups (separately and in order)
        Include a character to represent where the line breaks are
        Repeat
            Do a Levenshtein distance on these concatenations
            If there are any significantly similar subsequences found
                // I can't really define this, but basically a high proportion
                // of matches throughout all lines involved on both sides
                For each matched subsequence
                    Find suitable newline spots to delimit the subsequence
                    Mark these lines matched in the original text
                    // Warning: splitting+merging of lines possible,
                    // no 1-to-1 correspondence of lines here!
                    Delete the subsequence from the new text group concat
                    Delete also from the new text working list of lines
        Until there are no significantly similar subsequences found

Optional: Regroup based on remaining unmatched lines and repeat last step
// Not sure if there's any point in trying that at the moment

Concatenate the ENTIRE list of whitespace-stripped lines in the old text
Concatenate the lines in new text also (should only be unmatched ones left)
// Newline character added in both cases
Repeat
    Do a Levenshtein distance on these concatenations
    Match similar subsequences in the same way as earlier on
    // No need to worry about deleting from the list of new lines any more, though
    // Similarity criteria should be a fair bit stricter here to avoid
    // spurious matchings. Already matched lines in the old text might need
    // even higher strictness, since all of copy/edit/move would be rare
While you still have matchings

// Anything left unmatched in the old text is deleted stuff
// Anything left unmatched in the new text is newly written by the author
Print out some output to show all the comparison results!
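As a very rough sketch of just the first loop (simplified to exact matching of whitespace-stripped lines rather than a full Levenshtein pass over the lists), in Python:

    def match_identical_lines(old_lines, new_lines):
        """Phase 1: pair up lines that are identical once whitespace is stripped."""
        def norm(line):
            return "".join(line.split())  # remove ALL whitespace inside the line

        # Normalized content -> indices of not-yet-matched old lines.
        unmatched_old = {}
        for i, line in enumerate(old_lines):
            if norm(line):  # skip blank lines
                unmatched_old.setdefault(norm(line), []).append(i)

        matches = {}  # new line index -> old line index (ownership preserved)
        for j, line in enumerate(new_lines):
            key = norm(line)
            if key and unmatched_old.get(key):
                matches[j] = unmatched_old[key].pop(0)
        return matches

    old = ["a", "b", "line", "c"]
    new = ["a", "bc", "d", "c", "line"]
    print(match_identical_lines(old, new))
    # {0: 0, 3: 3, 4: 2} -- 'line' keeps its owner even though it moved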
Well, hopefully you can see the basics of what I mean with that completely untested algorithm: find obvious matches first, then verbatim moves of chunks of decreasing size, then compare stuff that's likely to be similar, then look for anything else which is similar but both modified and moved (and probably just coincidentally similar).
Well, if you try implementing this, tell me how it works out, what details you changed, and what kind of assignments you made to the various variables involved... I expect there will be some test cases where it works brilliantly and others where it just abysmally fails due to some massive oversight. The idea is that most stuff will be matched before you get to the inefficient final loop, and indeed the previous one.