How arbitrary is one?

My teacher says our homework program must handle "an arbitrary number of input lines". It seems pretty arbitrary to only accept one line, but is it arbitrary enough? My roommate said seven is a more arbitrary number than one, and maybe he's right. But I just have no idea how to measure the arbitrariness of a number, and Google doesn't seem to help.
UPDATE:
It sounds like maybe the best thing to do is accept any given number of input lines, and hope the prof can see that that makes a lot more sense than insisting that the user give you one specific arbitrary number of input lines. Especially since we weren't instructed to notify the user about what the arbitrary number is. You can't just make the user guess; that's crazy.

"Arbitrary" doesn't mean you get to pick a random number to accept. It means that it should handle an input with any number of lines.
So if someone decides to give your program an input with 0 lines, 1 line, 2 lines... n lines, then it should still do the right thing (and not crash).
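For instance, here is a minimal sketch in Python (Python is just an assumption here; the idea is the same in any language) that loops over however many lines it receives instead of assuming a fixed count:

    import sys

    # Read lines until EOF -- works the same for 0, 1, 7, or a million lines.
    count = 0
    for line in sys.stdin:
        count += 1
        # ... do whatever the assignment requires with `line` ...
        print(line.rstrip("\n"))

    if count == 0:
        print("warning: no input received", file=sys.stderr)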

Arbitrary means it could be ANY number. 0, 1, 7, 100124453225.
I would probably test for 0 and display some sort of error in that case, since it's supposed to have SOME text. Other than that, as long as there are more lines, your program should keep doing whatever it's designed to do.

Typically, when teachers indicate that a program should accept arbitrary amounts of input, they are indicating that you should consider corner cases which you may not have thought about. One of the most common is no input at all, which often causes errors if the programmer hasn't considered that case.
The point of the word is to emphasize that your program should be able to handle different inputs instead of simply crashing unless input comes in a certain quantity or is formatted in a specific way.

Related

What is "dont" and "isnt" in the pretrained GloVe vector files (e.g. glove.6B.50d.txt)?

I found these two words, "dont" and "isnt", in the vector file glove.6B.50d.txt downloaded from https://nlp.stanford.edu/projects/glove/. I wonder if they were originally "don't" and "isn't". This likely depends on the sentence_to_word parsing algorithm they used. If someone is familiar with it, please confirm whether this is the case.
A secondary question is whether this is a common way to deal with the apostrophe in words like "don't", "isn't", "hasn't" and so on, i.e. just replacing the apostrophe with an empty string so that "don" and "t" become one word.
Finally, I am also not sure whether GloVe comes with an API for sentence_to_word parsing, so that you can stay consistent with what the researchers originally did.
I think "dont" and "isnt" really were originally "don't" and "isn't". I have seen a few other such examples. I suspect this is just the specific way the GloVe researchers handle apostrophes.
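If you need your own preprocessing to produce the same tokens, here is a rough sketch of that kind of normalization (this is only a guess at the pipeline, not the GloVe researchers' documented tokenizer):

    import re

    def normalize(sentence):
        # Deleting the apostrophe turns "don't" into "dont", matching the
        # tokens observed in glove.6B.50d.txt (an assumption on my part).
        sentence = sentence.replace("'", "")
        # Lowercase and split on anything that isn't a letter or digit.
        return [t for t in re.split(r"[^a-z0-9]+", sentence.lower()) if t]

    print(normalize("Don't worry, it isn't hard."))
    # ['dont', 'worry', 'it', 'isnt', 'hard']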

Variable output hash function

I know that there are hash functions that produce a fixed-length output from a variable-length input. To take the simplest example: using modulo ten, no matter how big the input number is, I will always get an output between 0 and 9.
I need to produce a variable-length output from an unknown password. My first thought was to use the modulo operation, increasing the prime number to match however many digits I need in the output.
My problems are:
I must handle short passwords as well as long ones;
I don't know how long the output should be before writing the program, and even though I would know once the user has set the password, I may need to change it if they modify the file.
My idea was to start from a simple function and modify it based on my needs.
If I have to hash 123 but need 5 characters as output, this is what I would do:
I add two zeros on the right, changing the input to 12300;
I take the lowest 5-digit prime number (10007);
and then I have my hash by computing 12300 % 10007 = 02293.
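In code, that scheme looks like this (a direct transcription of the example above, not a recommendation):

    # Pad 123 with two zeros on the right, reduce modulo the smallest
    # 5-digit prime, and zero-pad the result to 5 characters.
    value = 123 * 100                    # "add 2 zeros on the right" -> 12300
    prime = 10007                        # lowest 5-digit prime
    print(str(value % prime).zfill(5))   # prints "02293"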
But since I would probably need output on the order of hundreds if not thousands of digits, I'm pretty sure modulo is not the solution to my problem.
I could also try to create my own hash function, but I have no idea how to verify whether it works or whether it's trash.
Are there common solutions in the literature for this kind of problem?
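For what it's worth, this sounds like an extendable-output function (XOF): SHAKE-256, for example, produces a digest of whatever length the caller asks for, and for passwords specifically a key-derivation function such as PBKDF2 also takes the desired output length as a parameter. A Python sketch (the salt and iteration count are illustrative values, not a recommendation):

    import hashlib
    import os

    password = b"correct horse battery staple"

    # SHAKE-256 is an extendable-output function: ask for any length.
    print(hashlib.shake_256(password).hexdigest(5))    # 10 hex characters
    print(hashlib.shake_256(password).hexdigest(500))  # 1000 hex characters

    # For passwords, prefer a deliberately slow KDF; dklen sets the
    # output length in bytes.
    salt = os.urandom(16)
    key = hashlib.pbkdf2_hmac("sha256", password, salt, 100_000, dklen=500)
    print(key.hex())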

At which lines in my MATLAB code is a variable accessed?

I define a variable at the beginning of my MATLAB source code. Now I would like to know at which lines this variable affects something; in other words, I would like to see all lines in which that variable is read. This should include not only all accesses in the current function, but also possible accesses in sub-functions that take this variable as an input argument. That way I can quickly see where a change to this variable has any influence.
Is there any way to do this in MATLAB? A graphical marking of the corresponding lines would be nice, but a command-line output might be even more practical.
You can always use "Find Files" to search for a certain keyword or expression. In my R2012a/Windows version it is in Edit > Find Files..., with the keyboard shortcut [CTRL] + [SHIFT] + [F].
The result will be a list of lines where the searched string is found, across all files in the specified folder. Check out the options in the search dialog for more details and flexibility.
Later edit: thanks to @zinjaai, I noticed that @tc88 also wants the tool to track the effect of the variable inside functions/sub-functions. I think this is:
1. Very difficult to achieve. The problem of running through all the possible values and branching on every possible conditional expression is... well, hard. I think it is halting-problem-hard.
2. In 90% of the cases, the assumption that the output of a function is influenced by its input holds. And since the input and the output are part of the same statement (assigning the result of a function), looking for where the variable is used as an argument should suffice to identify which output variables are affected.
3. There are perverse cases where functions alter arguments that are handle-type (because the argument is not copied, but referenced). This side effect breaks assumption 2, and is one of the main reasons for point 1. Outlining the cases in which these side effects take place is, again, hard, and it is better to assume that all such arguments are modified.
4. Some other cases are inherently undecidable, because they don't depend on the state of the computer, but on the state of the "outside world". Example: suppose one calls uigetfile. The function returns a char type when the user selects a file, and a double type when the user chooses not to select a file. Obviously the two cases will be treated differently. How could you know which variables are created/modified before the user decides?
In conclusion: I think that human intuition, plus the MATLAB Debugger (for run time), Find Files (for quickly searching where a variable is used), and depfun (for quickly identifying function dependencies), is way cheaper. But I would like to be wrong. :-)

When should forms be cleaned?

By "cleaned" I mean formatting inputs such as "a1b2c3" into "A1B 2C3" or "5551234567" into "(555) 123-4567". I figure we have a few options:
(1) As the user is typing. For instance, when a user is typing a postal code, all letters are instantly capitalized, or after the user types 3 digits of a phone number, it puts brackets around them.
(2) When the field loses focus.
(3) Never. Formatting happens on the server side only, just before it is inserted into the DB. The user never gets to see how it was formatted unless it is displayed on the site somewhere.
(3b) If there were form errors, or on the confirmation page. If there are form errors and the form needs to be re-displayed, the formatting on the valid inputs will appear; or if you have a confirmation page ("are these inputs correct?"), they will show there.
(4) Never ever. Data should be dumped into the database as-is and only formatted in the template/view just before it is displayed back to the user.
What do you think? I think I like (2). Reminds me of how code-formatting works in Visual Studio (happens when you close a brace or type a semi-colon).
I like to either filter the field just after it loses focus (when it is critical that the field be formatted correctly before they move on to the next field, which is rarely the case), or filter the field content as soon as the user hits the "SUBMIT" button (or whatever you want to call it) to send the data to the server.
This has a few advantages for me:
The user's input is not interrupted with annoying "auto-corrections" - being auto-corrected can sometimes feel like demonic possession if it is not done well.
The user really neither cares, nor needs to know that you do not want the (,), or -, in your phone number field... so take it out quietly for them. No notes, or instructions needed.
Also, I ALWAYS filter the field values anyway to protect against any kind of code-injection attacks (which are alarmingly easy to pull off if you know what you are doing). I have read about entire databases being compromised because the author did not remove potential SQL markup from submitted data.... it makes me shudder.
It also allows me to check for ALL input errors (if any), or non-filled-out required fields and report a single set of issues to the user at a single time... I have been to sites that give you so many messages while filling out a form it feels a bit like having a nagging relative over your shoulder.
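As an illustration, the "take it out quietly" step on the server side can be tiny; here is a Python sketch for the phone-number case (assuming the ten-digit format from the question):

    import re

    def clean_phone(raw):
        # Quietly drop everything that isn't a digit, then reformat.
        digits = re.sub(r"\D", "", raw)
        if len(digits) != 10:
            raise ValueError("expected a 10-digit phone number")
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

    print(clean_phone("555.123.4567"))  # (555) 123-4567
    print(clean_phone("5551234567"))    # (555) 123-4567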
I'd go with either (1) or (2), depending on the kind of input. (1) is probably most user-friendly if done right, but it will be more complex to implement neatly (e.g., what happens if I delete a digit from a hyphenated phone number - or a hyphen?). Go with (1) if you can afford it, otherwise (2).
I follow the same method I use for validation: once on the client side, once on the server side. Whether it happens when the field loses focus or as they type doesn't really matter.
As the user is typing. For instance, when a user is typing a postal code, all letters are instantly capitalized, or after the user types 3 digits of a phone number, it puts brackets around them.
This type of input is excellent for things such as entering serial codes or CD keys for software or games. I notice a lot of people get confused about whether the code is case-sensitive, or whether they should be typing the dashes as well.
If you have an iPhone, you'll notice that a phone number is also auto-formatted with brackets and spaces as you enter it. But this often turns out to be confusing, as a partially typed number is not always 'grouped' correctly.
Answer: It all depends on context.

Is there a diff algorithm that preserves line ownership?

My goal is coming up with a script to track the point at which a line was added, even if the line is subsequently modified or moved around (both of which confuse traditional VCS 'blame' scripts). I've done some minor background research (see bottom) but didn't find anything useful. I have a concept for how to proceed, but the runtime would be atrocious (there's a factorial involved).
The two missing features are tracking edited-in-place lines separately from a deletion-and-addition of that line, and tracking entire functions moved around so that they end up in different hunks. For those experienced with diff but unfamiliar with the terminology: a subsequence is a contiguous group of + or - lines, with a type of either delete (all -), add (all +), or replace (a combination). I need more information on moves and edit-in-place lines, vaguely alluded to in an entry on c2: DiffAlgorithm (the paragraph starting with "My favorite mode"). Does anyone know what that is? (It seems to be based on Tichy; see bottom.)
Here's more info on the two missing features:
no concept of a change on a line (a fourth type, something like edit-in-place). In this hunk, the parent of 'bc' is 'b', but 'd' is new and isn't a descendant of 'b':
a
-b
+bc
+d
The workaround for this isn't too complicated, if the position of the edits is the same (just an expanded version of markup_instraline_changes, but comparing edit distance on all equal-sized subsets of old and new lines).
no concept of "moving" code that preserves the ownership of the lines; e.g., this diff shouldn't alter the ownership of "line", although its position changes:
a
-line
c
+line
This could be dealt with in the same way, but with much worse runtime (instead of only checking single blocks marked 'replace', you'd need to check the Levenshtein distance between all added lines and all removed lines) and with likely false positives (some lines, like whitespace-only ones, aren't relevant to my problem).
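To make that quadratic step concrete, here is a sketch of the pairing pass it would require, with difflib's ratio standing in for Levenshtein distance and an arbitrary placeholder threshold:

    from difflib import SequenceMatcher

    def pair_moved_lines(removed, added, threshold=0.85):
        # Compare every removed line against every added line: the
        # O(n*m) comparison that makes the runtime so much worse.
        pairs = []
        for r in removed:
            if not r.strip():
                continue  # skip whitespace-only lines (false positives)
            scored = [(SequenceMatcher(None, r, a).ratio(), a) for a in added]
            if scored:
                score, best = max(scored)
                if score >= threshold:
                    pairs.append((r, best))
        return pairs

    print(pair_moved_lines(["line", "b"], ["c", "line"]))
    # [('line', 'line')]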
Research I've done: reading about gestalt pattern matching (Ratcliff and Obershelp, used in Python's difflib) and An O(ND) Difference Algorithm and its Variations (EW Myers).
After posting the question, I found references to Tichy84, which appears to be "The String-to-String Correction Problem with Block Moves" (which I haven't read yet), according to Walter Tichy's paper a year later on RCS.
You appear to be interested in origin tracking, the problem of tracing where a line came from.
Ideally, you'd instrument the editor to remember how things were edited, and store the edits with the text in your repository, thus solving the problem trivially, but none of us software engineers seem to be smart enough to implement this simple idea.
As a weak substitute, one can look at a sequence of source code revisions from the repository and reconstruct a "plausible" history of changes. This is what you seem to be doing by proposing the use of "diff". As you've noted, diff doesn't understand the idea of "moving" or "copying".
The SD Smart Differencer tools compare source text by parsing the text according to the language it is in, discovering the code structures, and computing least-Levenshtein differences in terms of programming-language constructs (identifiers, expressions, statements, blocks, classes, ...) and the abstract editing operators "insert", "delete", "copy", "move" and "rename identifier within a scope". They produce diff-like output, a little richer because they tell you line/column -> line/column with the different editing operations.
Obviously the "move" and "copy" edits are the ones most interesting to you in terms of tracking specific lines (well, specific language constructs). Our experience is that code goes through lots of copy and edits, too, which I suspect won't surprise you.
These tools are in beta, and are presently available for COBOL, Java and C#. Lots of other languages are in the pipeline, because the SmartDifferencer is built on top of a language-parameterized infrastructure, the DMS Software Reengineering Toolkit, which has quite a number of existing, robust language grammars.
I think the question of how much editing a line can undergo while still remaining a descendant of some previously written line is very subjective and context-dependent, both of which are things a computer cannot work with. You'd have to specify some sort of configurable minimum similarity on lines in your program, I think... The other problem is that it is entirely possible for two identical lines to be written completely independently (for example, incrementing the value of some variable), and this will be quite a common occurrence, so your desired algorithm will quite often fail to give truthful or useful information about a line.
I would like to suggest an algorithm for this, though (one that makes tons of hopefully obvious assumptions, by the way), so here goes:
Convert both texts to lists of lines
Copy the lists and strip all whitespace from inside of each line
Delete blank lines from both lists
Repeat
    Do a Levenshtein distance from the old to new lists, keeping all intermediate data
    Find all lines in the new text that were matched with old lines
    Mark the line in both new/old original lists as having been matched
    Delete the line from the new text (the copy)
    Optional: If some matched lines are in a contiguous sequence in either
        original text, assign them to a grouping as well!
Until there is nothing left but unmatchable lines in the new text
Group together sequences of unmatched lines in both old and new texts
    which are contiguous in the original text
Attribute each group with the line match before and after it
Run through all groups in the old text
    If any group's before/after attributes match those of a new text group
        // i.e. if they are inside the same area, basically
        Concatenate all the lines in both groups (separately and in order)
        Include a character to represent where the line breaks are
        Repeat
            Do a Levenshtein distance on these concatenations
            If there are any significantly similar subsequences found
                // I can't really define this, but basically a high proportion
                // of matches throughout all lines involved on both sides
                For each matched subsequence
                    Find suitable newline spots to delimit the subsequence
                    Mark these lines matched in the original text
                    // Warning: splitting+merging of lines is possible;
                    // there is no 1-to-1 correspondence of lines here!
                    Delete the subsequence from the new text group concat
                    Delete it also from the new text working list of lines
        Until there are no significantly similar subsequences found
        Optional: Regroup based on remaining unmatched lines and repeat the last step
            // Not sure if there's any point in trying that at the moment
Concatenate the ENTIRE list of whitespace-stripped lines in the old text
Concatenate the lines in the new text also (should only be unmatched ones left)
    // Newline character added in both cases
Repeat
    Do a Levenshtein distance on these concatenations
    Match similar subsequences in the same way as earlier on
    // No need to worry about deleting from the list of new lines any more
    // Similarity criteria should be a fair bit stricter here to avoid
    // spurious matchings. Already matched lines in the old text might need
    // even higher strictness, since all of copy/edit/move would be rare
While you still have matchings
// Anything left unmatched in the old text is deleted stuff
// Anything left unmatched in the new text is newly written by the author
Print out some output to show all the comparing results!
Well, hopefully you can see the basics of what I mean with that completely untested algorithm: find obvious matches first, and verbatim moves of chunks of decreasing size; then compare stuff that's likely to be similar; then look for anything else that is similar but both modified and moved (which is probably just coincidentally similar).
Well, if you try implementing this, tell me how it works out, what details you changed, and what kinds of assignments you made to the various variables involved... I expect there will be some test cases where it works brilliantly and others where it fails abysmally due to some massive oversight. The idea is that most stuff will be matched before you get to the inefficient final loop, and indeed the previous one.
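If it helps, here is an untested Python sketch of a single round of the first matching phase, with difflib's line-level alignment standing in for the Levenshtein pass (the Repeat/Until loop, the grouping, and all the later passes are omitted):

    from difflib import SequenceMatcher

    def match_round(old_lines, new_lines):
        # One round of phase 1: align the two line lists with whitespace
        # stripped, and report which (old, new) line pairs matched. The
        # algorithm above repeats this, deleting matched lines from the
        # new copy, until nothing else can be matched.
        def key(line):
            return "".join(line.split())

        sm = SequenceMatcher(None,
                             [key(l) for l in old_lines],
                             [key(l) for l in new_lines],
                             autojunk=False)
        pairs = []
        for a, b, size in sm.get_matching_blocks():
            pairs.extend((a + i, b + i) for i in range(size))
        return pairs

    old = ["a", "line", "c"]
    new = ["a", "c", "line"]
    print(match_round(old, new))
    # [(0, 0), (1, 2)] -- "a" stayed put and "line" was found after its
    # move; "c" would be picked up on the next round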