StarSpace: What is the interpretation of the labelDoc fileFormat?

The StarSpace documentation is unclear on the parameter 'fileFormat', which takes the value 'labelDoc' or 'fastText'.
I would like to understand intuitively what material difference setting this parameter makes.
Currently, my best guess is that if you set fileFormat to 'fastText', then all tokens in the training file that do not have the prefix '__label__' will be broken down into character-level n-grams, as in fastText.
Alternatively, if you set fileFormat to 'labelDoc', then StarSpace will assume that all tokens are actually labels, and you do not need to prepend '__label__' to the tokens, because they will be recognized as labels anyway.
Is my thinking correct?

How StarSpace uses the labels depends heavily on the trainMode you are using. The labelDoc format is useful when you go for a trainMode that relies only on labels (trainMode 1 through 4). There, using the fastText format with the __label__ prefix may amount to the same thing, but some trainModes (e.g. trainMode 1 or 3) benefit from the labelDoc format because it lets a whole sentence act as a single label element.
So, to clarify: if you are performing a text classification task (as explained in this example), with labelDoc no input would be recognized. On the other hand, as you stated, the fastText format will break down all non-labeled text as input and learn to predict the __label__ tags.
An example for the labelDoc format would be developing a content-based recommender system (as explained in this example): every tab-separated sentence is used on the LHS or RHS during training. But if you take a collaborative approach (the content of the articles, or wherever your sentences come from, is not taken into account), it can be trained with either the fastText format (specifying the __label__ prefix) or the labelDoc format, as labels are picked randomly during training for the LHS or RHS. (This second example is explained here.)
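To make the difference concrete, here is a minimal Python sketch of what one training line might look like in each format. The helper names (to_fasttext_line, to_labeldoc_line) and the sample tokens are made up for illustration and are not part of StarSpace. The layout only assumes the conventions described above: __label__-prefixed tokens for fastText, and tab-separated sentences for labelDoc.

```python
def to_fasttext_line(tokens, labels):
    """Input words followed by labels, each label carrying the __label__ prefix."""
    return " ".join(list(tokens) + ["__label__" + label for label in labels])


def to_labeldoc_line(sentences):
    """Tab-separated sentences; each sentence can act as one label element."""
    return "\t".join(" ".join(words) for words in sentences)


# Hypothetical sample data, only for illustration.
fasttext_line = to_fasttext_line(["great", "phone"], ["positive"])
labeldoc_line = to_labeldoc_line([["first", "sentence"], ["second", "sentence"]])

print(fasttext_line)
print(repr(labeldoc_line))  # repr() makes the tab separator visible
```

In the first line the unprefixed words are the input and the prefixed token is the target. In the second line there is no input/label split in the file itself, so the trainMode decides which sentences end up on the LHS and which on the RHS.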


Uniform list pretty-printer

It is known that the default printer can be confusing with respect to lists: there is no output for empty lists, three different notations are mixed (, vs (x;y;z) vs 1 2 3), and the indentation/columnization is not obvious (it is apparently optimized for table data). I am currently using -3! but it is still not ideal.
Is there a ready-made pretty-printer that has a consistent, uniform output format (basically what I am used to in any other language where a list is not special)?
More recently I've started using .j.j for string output in error messages in preference to -3!. Mainly I think it is easier to parse in a text log, but it also doesn't truncate in the same way.
It still transforms atoms and lists differently, so it might not exactly meet your needs. If you really want that, you could compose it with the old "ensure this is a list" trick:
myPrinter:('[.j.j;(),])
You might need to supply some examples to better explain your issues and your use-case for pretty-printing.
In general, -3! gives the clearest visual representation of the data. It is the stringified equivalent of another popular display method, 0N!.
The parse function is useful for understanding how the interpreter reads/executes commands, but I don't think it will be useful in your case.

ImageMagick Command-Line Option Order (and Categories of Command-Line Parameters)

My supervisor has asked me to convert the parts of our Perl scripts that use PerlMagick to instead pipe and use the command line version of ImageMagick (for various unrelated reasons).
Using our existing interface (crop, scale, save, etc.), I'm building up a list of operations the user wants to perform on an image, constructing the statement to pipe, and then executing it.
What I would like to know is:
Are convert operations performed from left to right, i.e. in the order I pass them?
What happens if I pass the same option twice? Are they performed separately?
Obviously the order in which operations are performed on an image is vital, so I'm trying to work out whether I can perform all of the operations in one go (possibly gaining efficiency?) or whether I'm going to have to perform each operation separately.
Thanks
Unfortunately, the accepted answer to this question is not yet complete... :-)
Three (major) classes of parameters
Assuming your ImageMagick version is a recent one, here is an important amendment to it:
you should differentiate between 3 major classes of command line parameters:
Image Settings
Image Operators
Image Sequence Operators
These three classes do behave differently:
Image Settings
An image setting persists as it appears on the command line.
It may affect all subsequent processing (but not previous processing):
processing such as reading an image or more images later in the command line;
processing done by a following image operator;
processing conducted by writing an image as output.
An image setting stays in effect...
...either until it is reset or replaced by a different setting of the same type,
...or until the command line terminates.
Image Operators
An image operator is applied to a (single) image and forgotten.
It differs from an image setting because it affects the image immediately as it appears on the command line.
(Remember: an image setting persists until the command line terminates, or until it is reset.)
If you need to apply the same image operator to a second image in the same command line, you have to repeat that exact operator on the command line.
Strictly speaking, in compliance with the new architecture of ImageMagick command lines, every image operator should be written after the loading of the image it is meant for.
However, the IM developers compromised:
in the interest of backward compatibility, image operators can still appear before loading an image -- they will then be applied to the first image that is available to them.
Image Sequence Operators
An image sequence operator is applied to all currently loaded images (and then forgotten).
It differs from a simple image operator in that it does not only affect a single image.
(Some operators only make sense if their operation has multiple images for consumption: think of -append, -combine, -composite, -morph...)
From the above principles you can already conclude: the order of the command line parameters is significant in most cases. (If you know what they do, you also know in which order you need to apply them.)
(For completeness' sake I should add: there is another class of miscellaneous, or other, parameters, which do not fall into any of the categories listed above. Think -debug, -verbose or -version.)
Unfortunately, the clear distinction between the 3 classes of IM command line parameters is not yet common knowledge among (otherwise quite savvy) IM users, so it merits much more exposure.
This clear differentiation was introduced with ImageMagick major version 6. Before, it was more confusing: some settings' semantics changed with context and also with the order they were given. Results from complex commands were not always predictable and sometimes surprising and illogical. (Now they may be surprising too sometimes, but when you closely look at them, understanding the above, they are always quite logical.)
Which is which ?!?
Whenever you are not sure which class a particular parameter belongs to, run
convert -help | less
Search for your parameter. Once found, scroll back: you should then find the "heading" under which it appears. Now you can be sure which type it is: an Image Setting, an Image Operator, or an Image Sequence Operator -- and take into account what I've said about them above.
Some more advice
If your job is to port your ImageMagick interface from PerlMagick to CLI, you should be aware of one other trick: You can insert
+write output-destination
anywhere on the command line (even multiple times). This will then write out the currently loaded image (or the currently loaded image sequence) in its currently processed state to the given output destination. (Think of it as something similar to the tee command for shell/terminal usage, which redirects a copy of <stdout> into a file.) The output destination can be a file, or show:, or whatever else is valid for IM outputs. After writing to the output, processing of the complete command will continue.
Of course, it only makes sense to insert +write after the first (or any other) image operator -- otherwise the current image list will not have changed.
Should there be multiple output images (because the current image list consists of more than one image), ImageMagick will automatically add index numbers to the respective filenames.
This is a great help with debugging (or optimizing, streamlining, simplifying...) complex command setups.
Are convert operations performed from left to right? ie the order I pass them
Yes. Take the following two examples, which are identical except for the order of the operations: I can expect different results because they are processed from left to right.
convert rose: -sample 300% -wave 5x10 rose_post_wave.png
convert rose: -wave 5x10 -sample 300% rose_pre_wave.png
You can see that the wave operation affects the image after, or before, the sampling of the image.
What happens if I pass the same option twice? Are they performed separately?
They will be executed twice. No special locking or automatic operation counting exists.
convert rose: -blur 0.5x0.5 -scale 300% rose_blur1.png
convert rose: -blur 0.5x0.5 -blur 0.5x0.5 -scale 300% rose_blur2.png
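Since the operations run strictly in the order given, a list of user-selected operations can be turned into a single convert call simply by preserving that order. The following is a minimal Python sketch of the idea (the question is about Perl, but the logic carries over). The function name build_convert_command and the (option, value) tuple layout are assumptions made for this illustration, not part of ImageMagick or PerlMagick.

```python
import shlex


def build_convert_command(source, operations, destination):
    """Return the argument list for one convert call, keeping operations in order.

    `operations` is a list of (option, value) pairs; use None as the value
    for options that take no argument.
    """
    argv = ["convert", source]
    for option, value in operations:
        argv.append(option)
        if value is not None:
            argv.append(value)
    argv.append(destination)
    return argv


# The same option twice: it is simply emitted, and therefore executed, twice.
ops = [("-blur", "0.5x0.5"), ("-blur", "0.5x0.5"), ("-scale", "300%")]
cmd = build_convert_command("rose:", ops, "rose_blur2.png")

# Shell-quoted form, only for logging or inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the command as a list and handing it to something like subprocess.run(cmd) avoids shell quoting problems with values such as 300%. Nothing is deduplicated or reordered, which matches the behaviour described above.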

At which lines in my MATLAB code a variable is accessed?

I am defining a variable at the beginning of my source code in MATLAB. Now I would like to know at which lines this variable affects something. In other words, I would like to see all lines in which that variable is read. This includes not only all accesses in the current function, but also possible accesses in sub-functions that take this variable as an input argument. That way I can quickly see where a change to this variable has any influence.
Is there any possibility to do so in MATLAB? A graphical marking of the corresponding lines would be nice but a command line output might be even more practical.
You can always use "Find Files" to search for a certain keyword or expression. In my R2012a/Windows version it is under Edit > Find Files..., with the keyboard shortcut [CTRL] + [SHIFT] + [F].
The result will be a list of lines where the searched string is found, in all the files found in the specified folder. Please check out the options in the search dialog for more details and flexibility.
Later edit: thanks to #zinjaai, I noticed that #tc88 required that this tool should track the effect of the name of the variable inside the functions/subfunctions. I think this is:
1. very difficult to achieve. The problem of running through all the possible values and branching on every possible conditional expression is... well, it is hard. I think it is halting-problem-hard.
2. in 90% of cases, the assumption that the output of a function is influenced by its input is true. But the input and the output are part of the same statement (assigning the result of a function), so looking for where the variable is used as an argument should suffice to identify which output variables are affected.
3. There are perverse cases where functions will alter arguments that are handle-type (because the argument is not copied, but referenced). This side effect will break assumption 2, and is one of the main reasons for 1. Outlining the cases when these side effects take place is, again, hard, and it is better to assume that all of them are modified.
4. Some other cases are inherently undecidable, because they don't depend on the computer's state, but on the state of the "outside world". Example: suppose one calls uigetfile. The function returns a char type when the user selects a file, and a double type when the user chooses not to select a file. Obviously the two cases will be treated differently. How could you know which variables are created/modified before the user has decided?
In conclusion: I think that human intuition, plus the MATLAB Debugger (for run time), and the Find Files (for quick search where a variable is used) and depfun (for quick identification of function dependence) is way cheaper. But I would like to be wrong. :-)
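For the command-line output the question asks about, a plain textual search (the same idea as Find Files) can also be scripted outside MATLAB. Below is a minimal Python sketch that lists the lines of MATLAB source text where a name occurs as a whole identifier. The function name find_variable_lines and the sample source are invented for this illustration. It is a text search only, so it does not follow the value once it has been renamed inside a sub-function, which is exactly the hard part described above.

```python
import re


def find_variable_lines(source, name):
    """Return (line_number, text) pairs where `name` occurs as a whole identifier."""
    pattern = re.compile(
        r"(?<![A-Za-z0-9_])" + re.escape(name) + r"(?![A-Za-z0-9_])"
    )
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Crude comment stripping: cut at the first '%'.
        # (This also cuts at a '%' inside a string literal.)
        code = line.split("%", 1)[0]
        if pattern.search(code):
            hits.append((number, line.strip()))
    return hits


# Hypothetical MATLAB snippet used as test input.
sample = """gain = 2;
y = gain * x;   % scale the input
again = 5;      % different identifier, must not match
z = helper(gain);
"""

for number, text in find_variable_lines(sample, "gain"):
    print(number, text)
```

Reading each .m file in a folder and passing its contents to the function would give a per-file listing similar to the Find Files result.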

Easy way to print full solution (all decision variables) in minizinc

The zinc spec says this:
If no output item is present, the implementation should print all the global variables and their values in a readable format.
However this does not appear to work with minizinc version 1.6.0:
G12 MiniZinc evaluation driver, version 1.6.0
I've tried the default command (minizinc) and mzn-gecode.
I'd really like to avoid repeating all the variable names in the output expression. What I really want is to have all decision variables output in some structured format (e.g. YAML), but I'd settle for some way to avoid this repetition.
To clarify: my model doesn't match the typical examples of CSP, e.g. there's no big array or matrix. It's just a fairly big (in relative terms) set of individual decision variables.
EDIT: bug created.
EDIT2: bug is now fixed in the minizinc 2.0 git repository so it conforms to the spec.
As far as I know, all FlatZinc solvers just show a "----------" for every solution when there is no output item defined in the model. So it seems that the spec is wrong/obsolete on this point.
There have been some (more or less radical) changes regarding the output item over the years. In some early MiniZinc versions it worked the way the spec describes, and it was quite handy when modelling a problem (though it was very hard to get nice output). It was a real nuisance when the behaviour was changed so that an output item was required for showing the result.
Interestingly, Zinc (the "big brother" of MiniZinc, http://www.minizinc.org/g12_www/zinc/ ) works as described, i.e. it shows all global variables when there is no output item. Perhaps the spec writers just forgot to mention that MiniZinc differs.

Is there a diff algorithm that preserves line ownership

My goal is to come up with a script that tracks the point at which a line was added, even if the line is subsequently modified or moved around (both of which confuse traditional vcs 'blame' scripts). I've done some minor background research (see bottom) but didn't find anything useful. I have a concept for how to proceed, but the runtime would be atrocious (there's a factorial involved).
The two missing features are tracking edited-in-place lines separate from a deletion-and-addition of that line, and tracking entire functions moved around so they're in different hunks. For those experienced with diff but unfamiliar with the terminology, a subsequence is a contiguous group of + or - lines, with a type of either delete (all -), add (all +), or replace (a combination). I need more information, on moves and edit-in-place lines, vaguely alluded to in an entry on c2: DiffAlgorithm (paragraph starts with "My favorite mode"). Does anyone know what that is? (seems to be based on Tichy, see bottom.)
Here's more info on the two missing features:
no concept of a change on a line, (a fourth type, something like edit-in-place). In this hunk, the parent of 'bc' is 'b' but 'd' is new and isn't a descendant of 'b':
a
-b
+bc
+d
The workaround for this isn't too complicated if the position of the edits is the same (just an expanded version of markup_instraline_changes, but comparing edit distance on all equal-sized subsets of old and new lines).
no concept of "moving" code that preserves the ownership of the lines, e.g. this diff shouldn't alter the ownership of "line", although its position changes.
a
-line
c
+line
This could be dealt with in the same way but with much worse runtime (instead of only checking single blocks marked 'replace', you'd need to check Levenshtein distance between all added against all removed lines) and with likely false positives (some, like whitespace-only lines, aren't relevant to my problem).
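The all-against-all comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the similarity ratio from difflib.SequenceMatcher (the gestalt matching mentioned below) in place of a true Levenshtein distance, the function name match_changed_lines is made up, and the threshold of 0.6 is an arbitrary starting value rather than a tuned one.

```python
from difflib import SequenceMatcher


def match_changed_lines(removed, added, threshold=0.6):
    """Pair each added line with its most similar removed line, if similar enough.

    Returns a list of (added_line, removed_line) pairs. Cost is
    len(added) * len(removed) similarity computations.
    """
    pairs = []
    for new_line in added:
        if not new_line.strip():
            continue  # skip whitespace-only lines: likely false positives
        best_score, best_old = 0.0, None
        for old_line in removed:
            score = SequenceMatcher(None, old_line, new_line).ratio()
            if score > best_score:
                best_score, best_old = score, old_line
        if best_old is not None and best_score >= threshold:
            pairs.append((new_line, best_old))
    return pairs


# First hunk above: 'bc' descends from 'b', while 'd' is new.
print(match_changed_lines(["b"], ["bc", "d"]))
# Second hunk above: 'line' was moved, so it keeps its ownership.
print(match_changed_lines(["line"], ["line"]))
```

A removed line may be chosen as the parent of several added lines here; whether that should count as a copy or be forbidden is a design decision the sketch leaves open.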
Research I've done: reading about gestalt pattern matching (Ratcliff and Obershelp, used in Python's difflib) and An O(ND) Difference Algorithm and its Variations (EW Myers).
After posting the question, I found references to Tichy84 which appears to be The string-to-string correction problem with block moves (which I haven't read yet) according to Walter Tichy's paper a year later on RCS
You appear to be interested in origin tracking, the problem of tracing where a line came from.
Ideally, you'd instrument the editor to remember how things were edited, and store the edits with the text in your repository, thus solving the problem trivially, but none of us software engineers seem to be smart enough to implement this simple idea.
As a weak substitute, one can look at a sequence of source code revisions from the repository and reconstruct a "plausible" history of changes. This is what you seem to be doing by proposing the use of "diff". As you've noted, diff doesn't understand the idea of "moving" or "copying".
SD Smart Differencer tools compare source text by parsing the text according to the language it is in, discovering the code structures, and computing least-Levenshtein differences in terms of programming language constructs (identifiers, expressions, statements, blocks, classes, ...) and abstract editing operators "insert", "delete", "copy", "move" and "rename identifier within a scope". They produce diff-like output, a little richer because they tell you line/column -> line/column with different editing operations.
Obviously the "move" and "copy" edits are the ones most interesting to you in terms of tracking specific lines (well, specific language constructs). Our experience is that code goes through lots of copy and edits, too, which I suspect won't surprise you.
These tools are in Beta, and are presently available for COBOL, Java and C#. Lots of other languages are in the pipe, because the SmartDifferencer is built on top of a language-parameterized infrastructure, DMS Software Reengineering Toolkit, which has quite a number of already existing, robust language grammars.
I think the question of how much editing a line can undergo while remaining a descendant of some previously written line is very subjective and based on context, and neither is something a computer can work with. You'd have to specify some sort of configurable minimum similarity on lines in your program, I think... The other problem is that it is entirely possible for two identical lines to be written completely independently (for example, incrementing the value of some variable), and this will be quite a common thing, so your desired algorithm quite often won't give truthful or useful information about a line.
I would like to suggest an algorithm for this though (which makes tons of hopefully obvious assumptions by the way) so here goes:
Convert both texts to lists of lines
Copy the lists and strip all whitespace from inside of each line
Delete blank lines from both lists
Repeat
    Do a Levenshtein distance from the old to new lists ...
     ... keeping all intermediate data
    Find all lines in the new text that were matched with old lines
        Mark the line in both new/old original lists as having been matched
        Delete the line from the new text (the copy)
        Optional: If some matched lines are in a contiguous sequence ...
         ... in either original text assign them to a grouping as well!
Until there is nothing left but unmatchable lines in the new text
Group together sequences of unmatched lines in both old and new texts ...
 ... which are contiguous in the original text
    Attribute each with the line match before and after
Run through all groups in old text
    If any match before and after attributes with new text groups for each
        //If they are inside the same area basically
        Concatenate all the lines in both groups (separately and in order)
            Include a character to represent where the line breaks are
        Repeat
            Do a Levenshtein distance on these concatenations
            If there are any significantly similar subsequences found
                //I can't really define this but basically a high proportion
                //of matches throughout all lines involved on both sides
                For each matched subsequence
                    Find suitable newline spots to delimit the subsequence
                    Mark these lines matched in the original text
                    //Warning splitting+merging of lines possible
                    //No 1-to-1 correspondence of lines here!
                    Delete the subsequence from the new text group concat
                    Delete also from the new text working list of lines
        Until there are no significantly similar subsequences found
Optional: Regroup based on remaining unmatched lines and repeat last step
    //Not sure if there's any point in trying that at the moment
Concatenate the ENTIRE list of whitespace-removed lines in the old text
Concatenate the lines in new text also (should only be unmatched ones left)
    //Newline character added in both cases
Repeat
    Do Levenshtein distance on these concatenations
    Match similar subsequences in the same way as earlier on
        //Don't need to worry deleting from list of new lines any more though
        //Similarity criteria should be a fair bit stricter here to avoid
        // spurious matchings. Already matched lines in old text might have
        // even higher strictness, since all of copy/edit/move would be rare
While you still have matchings
//Anything left unmatched in the old text is deleted stuff
//Anything left unmatched in the new text is newly written by the author
Print out some output to show all the comparing results!
Well, hopefully you can see the basics of what I mean with that completely untested algorithm. Find obvious matches first, and verbatim moves of chunks of decreasing size, then compare stuff that's likely to be similar, then look for anything else which is similar, but both modified and moved: probably just coincidentally similar.
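As a starting point, the first Repeat/Until loop (finding verbatim matches, including moved lines) might look roughly like the Python sketch below. It is an illustration under assumptions, not a tested implementation of the whole algorithm: it substitutes the matching blocks of difflib.SequenceMatcher for the Levenshtein alignment, it covers only exact matches after whitespace removal, and the names normalize and match_verbatim are invented here.

```python
from difflib import SequenceMatcher


def normalize(line):
    """Strip all whitespace from inside the line."""
    return "".join(line.split())


def match_verbatim(old_lines, new_lines):
    """Repeatedly align still-unmatched new lines against the old lines.

    Returns {new_index: old_index} for lines that are identical once
    whitespace is removed. Blank lines are ignored.
    """
    old = [(i, normalize(l)) for i, l in enumerate(old_lines) if normalize(l)]
    new = [(j, normalize(l)) for j, l in enumerate(new_lines) if normalize(l)]
    matched = {}
    while new:
        matcher = SequenceMatcher(
            None, [text for _, text in old], [text for _, text in new],
            autojunk=False,
        )
        blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
        if not blocks:
            break  # only unmatchable lines are left in the new text
        used_new = set()
        for a, b, size in blocks:
            for k in range(size):
                matched[new[b + k][0]] = old[a + k][0]
                used_new.add(b + k)
        # Delete matched lines from the working copy of the new text only,
        # so an old line can still be matched again (a copy).
        new = [item for pos, item in enumerate(new) if pos not in used_new]
    return matched


old_text = ["a", "line", "c"]
new_text = ["a", "c", "line"]
print(match_verbatim(old_text, new_text))
```

In this example the first pass matches a and line in order, and the second pass picks up c, which an ordinary single-pass diff would report as one deletion plus one addition. The later, fuzzier stages would then work only on whatever is left unmatched.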
Well, if you try implementing this, tell me how it works out, what details you changed, and what kind of assignments you made to the various variables involved... I expect there will be some test cases where it works brilliantly and others where it fails abysmally due to some massive oversight. The idea is that most stuff will be matched before you get to the inefficient final loop, and indeed the previous one