Tags in vowpal wabbit

I am doing binary classification using vowpal-wabbit. A particular record (set of features) has 10 zeroes and 5 ones, so I am creating two lines in vowpal format:
-1 10 `50 |f f1
1 5 `50 |f f1
Since the prediction (probability) for both these records would be the same, I want to keep the same tag, so that I can dedupe the predictions ({tag, prediction}) later and join with my original raw data.
Is it possible to keep the same tag for more than one record in vowpal-wabbit?

First, the syntax above isn't correct.
To be identified as such, tags should either:
touch the | separator (no space between them), OR
start with a simple quote ', not a backquote `, by convention
(or both).
Otherwise you get:
warning: `50 is not a good float, replacing with 0
warning: `50 is not a good float, replacing with 0
This hints that vw interprets these "tags" as a prediction base.
For details, see Input format in the official documentation.
Once the example is fixed to the correct syntax:
-1 10 '50|f f1
1 5 '50|f f1
This runs fine, and now we can answer the question:
Is it possible to keep the same tag for more than one record in vowpal-wabbit?
Yes, you can. The tag is merely a simple way to connect input and output (when predictions are involved), there's no check for uniqueness anywhere. If you duplicate tags on input, you'll simply get the same duplicate tags on prediction output as well.
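For example, the dedupe could look like this in Python (a minimal sketch; it assumes vw was run with -p predictions.txt and that each output line has the form "prediction tag"):

preds = {}
with open("predictions.txt") as f:
    for line in f:
        parts = line.split()
        if len(parts) == 2:
            prediction, tag = parts
            preds[tag] = float(prediction)  # duplicate tags collapse into one entry
# preds now maps each tag to a single prediction,
# ready to join back against the original raw data by tag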
More notes:
Even if two examples are identical, you may get different predictions if the model has changed between them. Remember that vw is an online learner, so the model can continuously change with each example unless you add the -t (test-only, don't learn) option.
Features whose value is zero are ignored, so you can drop them. The standard way in vw to mark something as positive or negative is to use the values {+1, -1}. This is true for both labels and input features.


Starspace: What is the interpretation of the labelDoc fileFormat?

The starspace documentation is unclear on the parameter 'fileFormat', which takes the value 'labelDoc' or 'fastText'.
I would like to understand intuitively what material difference setting this parameter would have.
Currently, my best guess is that if you set fileFormat to 'fastText' then all tokens in the training file that do not have the prefix '__label__' will be broken down into character-level n-grams as in fastText.
Alternatively, if you set fileFormat to 'labelDoc' then starspace will assume that all tokens are actually labels, and you do not need to prepend '__label__' to the tokens, because they will be recognized as labels anyway.
Is my thinking correct?
The way StarSpace uses the labels depends highly on the trainMode you are using. The labelDoc format is useful when you go for a trainMode that relies only on labels (trainMode 1 through 4). For some of those it may be equivalent to use the fastText format with the __label__ prefix, but certain trainModes (e.g. trainMode 1 or 3) benefit from the labelDoc format, where a whole sentence can be used as a label element.
So, to clarify: if you are performing a text classification task (as explained in this example), labelDoc wouldn't have any input recognized; on the other hand, as you stated, using the fastText format will break down all non-labeled text as input and learn to predict the __label__ tags.
An example for the labelDoc format would be building a content-based recommender system (as explained in this example), where every tab-separated sentence is used as LHS or RHS during training time. But if you go for a collaborative approach (where the content of the articles, or wherever your sentences come from, is not taken into account), it can be trained with either the fastText format (specifying the __label__ prefix) or the labelDoc file format, as labels are picked randomly for LHS or RHS during training time. (This second example is explained here.)
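To make the two formats concrete, here are hypothetical one-line examples (the tokens are made up; the semantics depend on trainMode as discussed above):

fastText format (free text plus explicit labels):
word1 word2 word3 __label__tag1

labelDoc format (tab-separated sentences, each usable as a label element):
first sentence here<TAB>second sentence here<TAB>third sentence here

(<TAB> stands for a literal tab character.)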

How to use OR condition in LibreOffice?

I am trying to use the formula below to set conditions in LibreOffice but I keep getting an error. What am I doing wrong with the statement below:
=IF(G2<=2,'negative',IF(OR(G2>2 & G2<=3,'neutral',IF(OR(G2>=4,'positive))))))
Thanks
It seems that your formula is missing the last ':
'positive))))))
should be 'positive'))))))
Also, the & is the string-concatenation operator in LibreOffice, so what you need here is the logical counterpart of OR(), and that is AND().
But you can simplify your formula to
=IF(G2<=2,"negative",IF(AND(G2>2,G2<=3),"neutral","positive"))
The first test is whether the number is 2 or lower (negative),
the second test is whether the number is between 2 and 3 (neutral),
and then no further test is needed, as positive is the only remaining possibility.
A slightly shorter, and I'd say simpler, version that also avoids the need for OR/AND:
=IF(G2<=2,"negative",IF(G2<=3,"neutral","positive"))
Once the first <=2 test is handled (either by outputting negative or by proceeding to the 'result if FALSE'), there is no longer any possibility of 2 or less, so the AND is not necessary.
Note that the above also fills a gap left by the OP between 3 and 4.
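For example, with the shorter formula: G2 = 1.5 gives "negative", G2 = 2.5 gives "neutral", and G2 = 3.5 gives "positive" (that last case falls in the 3-to-4 gap the OP's formula left unhandled).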

Variable output hash function

I know that there are hash functions that can give a fixed-length output from a variable-length input. To take the simplest one, using modulo ten, no matter how big the input number is I will always get an output between 0 and 9.
I need to produce a variable-length output from an unknown password. My first thought was to use the modulo, increasing the prime number to match however many digits I need in the output.
My problems are:
I must handle short passwords as well as long ones;
I don't know how long the output should be before writing the program, and even though I would know once the user has set the password, I may need to change it if they modify the file.
My idea was to take a simple function and modify it based on my needs.
If I have to hash 123 but I need to have 5 characters as output, that's what I would do:
I add 2 zeros on the right, changing the input to 12300;
I take the lowest 5-digit prime number (10007);
And then I have my hash: 12300 % 10007 = 02293.
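In Python, the worked example above looks something like this (a sketch of the idea, with hypothetical names):

def mod_hash(n: int, digits: int) -> str:
    # pad the input on the right with zeros up to the desired width: 123 -> 12300
    padded = int(str(n).ljust(digits, "0"))
    prime = 10007  # lowest 5-digit prime; a general version would need
                   # the lowest prime with `digits` digits
    return str(padded % prime).zfill(digits)

print(mod_hash(123, 5))  # 02293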
But since I would probably need outputs on the order of hundreds if not thousands of digits, I'm pretty sure modulo is not the solution to my problem.
I could also try to create my own hash function, but I have no idea how to verify if it works or if it's trash.
Are there some common solutions in literature for this kind of problem?
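One common answer in the literature is an extendable-output function (XOF) such as SHAKE from the SHA-3 family: a cryptographic hash whose output length the caller chooses. A minimal Python sketch:

import hashlib

def variable_hash(password: str, out_bytes: int) -> str:
    # SHAKE-256 is an extendable-output function: you pick the digest length
    return hashlib.shake_256(password.encode("utf-8")).hexdigest(out_bytes)

print(variable_hash("123", 5))    # 10 hex characters (5 bytes)
print(variable_hash("123", 100))  # 200 hex characters (100 bytes)

Since the input is a password, a key-derivation function with an adjustable output length, such as hashlib.pbkdf2_hmac with its dklen parameter (or scrypt), would be the more standard choice in practice.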

Write records from one PF to another without READ operation or DOW loop or move operation.

I know how to copy records from one PF to another by reading one file in a DOW loop and writing into another file, like below. The files are PF1 and PF2, having record formats rec1 and rec2 respectively, where each file has only one field, named fld1 and #fld1 respectively:
READ PF1
DOW not %eof(PF1) and not %error
eval fld1 = #fld1
write Rec2
READ PF1
ENDDO
As the comments in Buck's answer mention, your teammate is alluding to using the RPG cycle to process the file. The cycle is basically an implicit read loop over files declared as 'P'rimary.
http://www-01.ibm.com/support/knowledgecenter/ssw_ibm_i_71/rzasc/sc09250726.htm%23wq121
Originally, even RPG IV programs included code used as part of the cycle, such as automatically opening files, even if you didn't actually declare any input primary files. Now, however, you can create "Linear Main" programs using the MAIN() h-spec, and your program will be cycle-free.
Using the cycle is frowned upon in modern RPG, primarily because the implicit nature of what's going on makes non-trivial code tricky to understand. Additionally, cycle code doesn't perform any better than non-cycle code; it's just less to write. The I/Os being done remain exactly the same.
Finally, again as mentioned in the comments: if you want to optimize performance, use SQL. The set-based nature of SQL beats RPG's one-row-at-a-time processing. I haven't benchmarked it recently, but way back on v5r2 or so, copying 100 or more rows was faster with SQL than with RPG.
For reference only, FWIW; i.e., these are not recommendations, just examples of what can be done, especially in the cases alluded to but for which no specifics were given:
My team mate told me that he can write code for this problem only in 4 lines including declaration of both files in F-spec. He will also not use read, move or dow loop. I don't know how can he do this. That's why I am eager to know this.
The following source is an example Cycle-program. My FLD1 of REC1 was a 10-byte field but I described my output as 20 bytes, so to avoid a failed compile with sev-20 RNF7501 "Length of data structure in Result-Field does not equal the record length of Factor 2.", I specified GENLVL(20) on the CRTBNDRPG:
FPF1 IP E DISK rename(rec1:rcd1)
FPF2 O F 20 DISK
DINOUT E DS EXTNAME(PF1)
C WRITE PF2 INOUT
I don't want to use CL program. I just want to do it with a single program either in RPG3 or RPG4
A similar RPG Cycle-program could perform effectively the same thing, likewise copying the data from PF1 to PF2 despite the different column name and [thus inherently also] the different record format, by using the CL command without a CL program and in almost as few lines. The following example depends on the must-always-be-one-row table called QSQPTABL in QSYS2, which would typically be in the system Library List. The second argument could reflect the actual length of the command string, but it just as easily codes the max prototyped length per the Const definition, assuring blank-padding up to that length without actually having to count the [~53] bytes of the concatenated string expression:
FQSQPTABL IP E DISK rename(qsqptabl:qsqptable)
DQcmdExc PR ExtPgm('QSYS/QCMDEXC')
D 200A const
D 15P05 const
c callp QcmdExc('cpyf pf1 pf2 mbropt(*add)'
c +' fmtopt(*nochk) crtfile(*no)':200)
Whereas both of the above sources are probably an enigma to anyone unfamiliar with the Cycle, the overall effects of the latter are quite likely to be inferred correctly [perhaps more appropriately described as guessed correctly] by just about anyone with an understanding of the CL command string, despite their lack of understanding of the Cycle.
And of course, as was also noted, with SQL the program is arguably even easier/simpler, and possibly even more readable to the uninitiated [although the WITH NONE clause, shown as WITH NC, added just in case COMMIT(*NONE) was overlooked on the compile request, is probably not easily intuited]:
C/Exec SQL
C+ insert into pf2 select * from pf1 WITH NC
C/End-Exec
C SETON LR
P.S. The source code from the OP was originally [at least it was, prior to my comment added here] incorrectly coded with eval fld1 = #fld1, when surely eval #fld1 = fld1 was intended, according to the setup/givens.
If you need to use RPG, use embedded SQL. Look up INSERT INTO.
If you aren't limited to RPG, consider CPYF... MBROPT(*ADD).
What business problem are you trying to solve by doing it another way?

Is there a diff algorithm that preserves line ownership

My goal is coming up with a script to track the point at which a line was added, even if the line is subsequently modified or moved around (both of which confuse traditional vcs 'blame' scripts). I've done some minor background research (see bottom) but didn't find anything useful. I have a concept for how to proceed, but the runtime would be atrocious (there's a factorial involved).
The two missing features are tracking edited-in-place lines separately from a deletion-and-addition of that line, and tracking entire functions moved around so they're in different hunks. For those experienced with diff but unfamiliar with the terminology, a subsequence is a contiguous group of + or - lines, with a type of either delete (all -), add (all +), or replace (a combination). I need more information on moves and edit-in-place lines, vaguely alluded to in an entry on c2: DiffAlgorithm (the paragraph starts with "My favorite mode"). Does anyone know what that is? (It seems to be based on Tichy; see bottom.)
Here's more info on the two missing features:
no concept of a change on a line (a fourth type, something like edit-in-place). In this hunk, the parent of 'bc' is 'b', but 'd' is new and isn't a descendant of 'b':
a
-b
+bc
+d
The workaround for this isn't too complicated if the position of edits is the same (just an expanded version of markup_intraline_changes, but comparing edit distance on all equal-sized subsets of old and new lines).
no concept of "moving" code that preserves the ownership of the lines, e.g. this diff shouldn't alter the ownership of "line", although its position changes.
a
-line
c
+line
This could be dealt with in the same way but with much worse runtime (instead of only checking single blocks marked 'replace', you'd need to check the Levenshtein distance between all added and all removed lines) and with likely false positives (some, like whitespace-only lines, aren't relevant to my problem).
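As a rough sketch of that all-pairs check (using Python's difflib, which the research listed just below mentions; the function name and threshold are mine):

import difflib

def candidate_moves(removed_lines, added_lines, threshold=0.7):
    # Compare every removed line against every added line; pairs whose
    # similarity clears the threshold are candidate moves that should
    # preserve ownership. Cost is O(len(removed) * len(added)) comparisons.
    moves = []
    for i, old in enumerate(removed_lines):
        for j, new in enumerate(added_lines):
            ratio = difflib.SequenceMatcher(None, old, new).ratio()
            if ratio >= threshold:
                moves.append((i, j, ratio))
    return moves

print(candidate_moves(["line"], ["c", "line"]))  # [(0, 1, 1.0)]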
Research I've done: reading about gestalt pattern matching (Ratcliff and Obershelp, used in Python's difflib) and An O(ND) Difference Algorithm and its Variations (EW Myers).
After posting the question, I found references to Tichy84, which appears to be The string-to-string correction problem with block moves (which I haven't read yet), according to Walter Tichy's paper a year later on RCS.
You appear to be interested in origin tracking, the problem of tracing where a line came from.
Ideally, you'd instrument the editor to remember how things were edited, and store the edits with the text in your repository, thus solving the problem trivially, but none of us software engineers seem to be smart enough to implement this simple idea.
As a weak substitute, one can look at a sequence of source code revisions from the repository and reconstruct a "plausible" history of changes. This is what you seem to be doing by proposing the use of "diff". As you've noted, diff doesn't understand the idea of "moving" or "copying".
The SD Smart Differencer tools compare source text by parsing it according to the language it is in, discovering the code structures, and computing least-Levenshtein differences in terms of programming language constructs (identifiers, expressions, statements, blocks, classes, ...) and abstract editing operators "insert", "delete", "copy", "move" and "rename identifier within a scope". They produce diff-like output, a little richer because they tell you line/column -> line/column with different editing operations.
Obviously the "move" and "copy" edits are the ones most interesting to you in terms of tracking specific lines (well, specific language constructs). Our experience is that code goes through lots of copy and edits, too, which I suspect won't surprise you.
These tools are in beta, and are presently available for COBOL, Java and C#. Lots of other languages are in the pipe, because the SmartDifferencer is built on top of a language-parameterized infrastructure, the DMS Software Reengineering Toolkit, which has quite a number of existing, robust language grammars.
I think the question of how much editing a line can undergo while it remains a descendant of some previously written line is very subjective and context-dependent, both things that a computer cannot work with. You'd have to specify some sort of configurable minimum similarity on lines in your program, I think... The other problem is that it is entirely possible for two identical lines to be written completely independently (for example, incrementing the value of some variable), and this will be quite a common thing, so your desired algorithm quite often won't give truthful or useful information about a line.
I would like to suggest an algorithm for this though (which makes tons of hopefully obvious assumptions by the way) so here goes:
Convert both texts to lists of lines
Copy the lists and strip all whitespace from inside of each line
Delete blank lines from both lists
Repeat
    Do a Levenshtein distance from the old to new lists,
        keeping all intermediate data
    Find all lines in the new text that were matched with old lines
    Mark the line in both new/old original lists as having been matched
    Delete the line from the new text (the copy)
    Optional: If some matched lines are in a contiguous sequence
        in either original text, assign them to a grouping as well!
Until there is nothing left but unmatchable lines in the new text
Group together sequences of unmatched lines in both old and new texts
    which are contiguous in the original text
Attribute each with the line match before and after
Run through all groups in old text
    If any match before-and-after attributes with new text groups, for each:
        // If they are inside the same area, basically
        Concatenate all the lines in both groups (separately and in order)
        Include a character to represent where the line breaks are
        Repeat
            Do a Levenshtein distance on these concatenations
            If there are any significantly similar subsequences found
                // I can't really define this, but basically a high proportion
                // of matches throughout all lines involved on both sides
                For each matched subsequence
                    Find suitable newline spots to delimit the subsequence
                    Mark these lines matched in the original text
                    // Warning: splitting+merging of lines possible;
                    // no 1-to-1 correspondence of lines here!
                    Delete the subsequence from the new text group concat
                    Delete also from the new text working list of lines
        Until there are no significantly similar subsequences found
Optional: Regroup based on remaining unmatched lines and repeat the last step
// Not sure if there's any point in trying that at the moment
Concatenate the ENTIRE list of whitespace-stripped lines in the old text
Concatenate the lines in new text also (should only be unmatched ones left)
// Newline character added in both cases
Repeat
    Do Levenshtein distance on these concatenations
    Match similar subsequences in the same way as earlier on
    // Don't need to worry about deleting from the list of new lines any more though
    // Similarity criteria should be a fair bit stricter here to avoid
    // spurious matchings. Already matched lines in old text might have
    // even higher strictness, since all of copy/edit/move would be rare
While you still have matchings
// Anything left unmatched in the old text is deleted stuff
// Anything left unmatched in the new text is newly written by the author
Print out some output to show all the comparing results!
Well, hopefully you can see the basics of what I mean with that completely untested algorithm. Find obvious matches first, and verbatim moves of chunks of decreasing size, then compare stuff that's likely to be similar, then look for anything else which is similar, but both modified and moved: probably just coincidentally similar.
Well, if you try implementing this, tell me how it works out, what details you changed, and what kind of assignments you made to the various variables involved... I expect there will be some test cases where it works brilliantly and others where it just abysmally fails due to some massive oversight. The idea is that most stuff will be matched before you get to the inefficient final loop, and indeed the previous one.
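For what it's worth, here is a tiny sketch of just the first phase (the exact-match pass over whitespace-stripped, non-blank lines) in Python; it simplifies the Levenshtein alignment above to exact matching, and everything after this phase is where the hard parts live:

def exact_match_pass(old_lines, new_lines):
    # Phase 1 of the algorithm above: strip internal whitespace, drop blank
    # lines, then greedily pair identical lines between old and new text.
    def normalize(lines):
        return [(i, "".join(l.split())) for i, l in enumerate(lines) if l.strip()]

    matched = []   # (old_index, new_index) pairs
    used_old = set()
    old_norm = normalize(old_lines)
    for nj, ntext in normalize(new_lines):
        for oi, otext in old_norm:
            if oi not in used_old and otext == ntext:
                matched.append((oi, nj))
                used_old.add(oi)
                break
    return matched

print(exact_match_pass(["a", "line", "c"], ["a", "c", "line"]))
# -> [(0, 0), (2, 1), (1, 2)]: "line" keeps its owner even though it moved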