Vowpal Wabbit Interaction Redundancy - feature-selection

I am curious about the way VW appears to create interaction terms, through the -q parameter.
For the purpose of this illustration I am using this toy data, which is called cats.vm:
1 |a black |b small green |c numvar1:1.62 numvar2:342 |d cat |e numvar3:554
1 |a white |b large yellow |c numvar1:1.212 numvar2:562 |d cat |e numvar3:632
-1 |a black |b small green |c numvar1:12.03 numvar2:321 |d hamster |e numvar3:754
1 |a white |b large green |c numvar1:5.8 numvar2:782 |d dog |e numvar3:234
-1 |a black |b small yellow |c numvar1:2.322 numvar2:488 |d dog |e numvar3:265
1 |a black |b large yellow |c numvar1:3.99 numvar2:882 |d hamster |e numvar3:543
There seems to be some inconsistency in the way VW creates interaction terms. Here are a couple examples, where the command is always the following, with only -q being changed:
vw -d cats.vm --loss_function logistic --invert_hash readable.cat.mod -q X
1. -q aa
Here we have an interaction within a namespace that holds only one feature per example, and we only get the quadratic terms for black and white (black^2 and white^2), as expected.
Constant:116060:0.082801
a^black:53863:-0.039097
a^black^a^black:247346:-0.039097
a^white:55134:0.223999
a^white^a^white:227140:0.223999
b^green:114666:0.027346
b^large:192199:0.330261
b^small:80587:-0.096200
b^yellow:255950:0.075754
c^numvar1:132428:0.004266
c^numvar2:30074:0.000211
d^cat:11261:0.188487
d^dog:173570:0.006734
d^hamster:247835:-0.085219
e^numvar3:12042:0.000115
2. -q ab
With an interaction between two namespaces (one of which has more than one feature), things are as expected, except that there are no quadratic terms for items within either a or b (e.g. black*black).
Question 1: Is there a way to force these 'across namespace' interactions to include polynomial terms such as black*black?
Constant:116060:0.079621
a^black:53863:-0.035646
a^black^b^green:46005:-0.017797
a^black^b^large:123538:0.137239
a^black^b^small:11926:-0.088733
a^black^b^yellow:187289:-0.053135
a^white:55134:0.206693
a^white^b^green:24528:0.127449
a^white^b^large:102061:0.206693
a^white^b^yellow:165812:0.114003
b^green:114666:0.025218
b^large:192199:0.302959
b^small:80587:-0.088733
b^yellow:255950:0.072339
c^numvar1:132428:0.004038
c^numvar2:30074:0.000199
d^cat:11261:0.176863
d^dog:173570:0.007334
d^hamster:247835:-0.080986
e^numvar3:12042:0.000109
3. -q bb
Here we have an interaction within a namespace containing two features per example. There are duplicates (e.g. b^large^b^green:81557:0.112864 and b^green^b^large:110857:0.112864).
Question 2: Are these duplicated terms in the model or is this some issue in the --invert_hash? The weights are the same for all duplicates. Should we multiply green*large weight by 2, for example, in order to get the full effect of green and large interaction?
Constant:116060:0.062784
a^black:53863:-0.043486
a^white:55134:0.182450
b^green:114666:0.023035
b^green^b^green:33324:0.023035
b^green^b^large:110857:0.112864
b^green^b^small:261389:-0.016840
b^large:192199:0.252576
b^large^b^green:81557:0.112864
b^large^b^large:159090:0.252576
b^large^b^yellow:222841:0.187498
b^small:80587:-0.079945
b^small^b^green:249481:-0.016840
b^small^b^small:215402:-0.079945
b^small^b^yellow:128621:-0.123284
b^yellow:255950:0.051017
b^yellow^b^large:68957:0.187498
b^yellow^b^small:219489:-0.123284
b^yellow^b^yellow:132708:0.051017
c^numvar1:132428:0.003217
c^numvar2:30074:0.000164
d^cat:11261:0.158140
d^dog:173570:0.008735
d^hamster:247835:-0.085383
e^numvar3:12042:0.000086

First, the basics: when you cross features, vowpal wabbit uses:
For the crossed feature name/identity: the murmur32 hash (modulo weight-vector size) of the concatenated original feature names (strings)
For the crossed feature value: the crossing (multiplication) of the original feature values (weights)
So, looking at your question #3 above: the concatenated names are
b^green^b^large or b^large^b^green. They have the same value, 0.112864, since the product of the two feature values is the same either way. However, because of the two possible concatenation orders, we get two different hash values and a 'split' feature. This redundant (transposed-order) feature-pair phenomenon seems to appear only in self-crosses. I'm not sure why, and it may be a bug.
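You can see both copies directly in the readable model written by your own command above, for example:
grep -F -e 'b^green^b^large' -e 'b^large^b^green' readable.cat.mod
This just searches the --invert_hash output literally; both lines should show the same weight, 0.112864.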
To answer the other questions (1 and 2):
To force black^black (actually a^black^a^black) you need to pass -q aa, because black appears only in name-space a.
Note that you can pass multiple -q arguments to vw to achieve any crossing you want:
-q aa -q ab -q ...
You can use the wildcard : name-space to cross every name space with every other:
-q ::
For more power:
There's also a --cubic option, allowing you to fit cubic-polynomials. --cubic takes 3 name-space leading chars as an argument, e.g. --cubic abc.
Finally, you may also use --keep and --ignore to keep or ignore name-spaces starting with a certain character.
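Putting these together, for example, to get both the within-namespace squares and the across-namespace crosses on the toy data above in one run (same command as in the question, only the interaction flags differ):
vw -d cats.vm --loss_function logistic --invert_hash readable.cat.mod -q aa -q bb -q ab
and to add a three-way interaction while dropping name-space e:
vw -d cats.vm --loss_function logistic --invert_hash readable.cat.mod -q ab --cubic abd --ignore e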

Related

Talend: Equivalent of logstash "key value" filter

I'm discovering Talend Open Source Data Integrator and I would like to transform my data file into a csv file.
My data are some sets of key value data like this example:
A=0 B=3 C=4
A=2 C=4
A=2 B=4
A= B=3 C=1
I want to transform it into a CSV like this one:
A,B,C
0,3,4
2,,4
2,4,
With Logstash, I was using the "key value" filter which is able to do this job with a few lines of code. But with Talend, I don't find a similar transformation. I tried a "delimiter file" job and some other jobs without success.
This is quite tricky and interesting, because Talend is schema-based, so if you don't have the input/output schema predefined, it could be quite hard to achieve what you want.
Here is something you can try; it uses a bunch of components, and I didn't manage to get to a solution with fewer. My solution uses somewhat unusual components like tNormalize and tPivotToColumnsDelimited. There is one flaw: you'll get an extra column at the end.
1 - tFileInputRaw: if you don't know your input schema, just read the whole file with this one.
2 - tConvertType: here you can convert the Object to a String type.
3 - tNormalize: you'll have to split your lines manually (use \n as the separator).
4 - tMap: add a sequence "I"+Numeric.sequence("s1",1,1); this will be used later to identify and regroup lines.
5 - tNormalize: here I normalize on the 'TAB' separator, to get one line for each key=value pair.
6 - tMap: you'll have to split on the "=" sign.
At this step, you'll have an output like :
|seq|key|value|
|=--+---+----=|
|I1 |A |1 |
|I1 |B |2 |
|I1 |C |3 |
|I2 |A |2 |
|I2 |C |4 |
|I3 |A |2 |
|I3 |B |4 |
'---+---+-----'
where seq is the line number.
7 - Finally, with tPivotToColumnsDelimited, you'll have the result. Unfortunately, you'll get an extra "ID" column, as the output schema provided by the component is not editable (the component actually creates the schema itself, which is very unusual amongst Talend components).
Use the ID column as the regroup column.
Hope this helps; again, Talend is not a very easy tool if you have dynamic input/output schemas.
Corentin's answer is excellent, but here's an enhanced version of it, which cuts down on some components:
Instead of using tFileInputRaw and tConvertType, I used tFileInputFullRow, which reads the file line by line into a string.
Instead of splitting the string manually (where you need to check for nulls), I used tExtractDelimitedFields with "=" as a separator in order to extract a key and a value from the "key=value" column.
The end result is the same, with an extra column at the beginning.
If you want to delete the column, a dirty hack would be to read the output file using a tFileInputFullRow, and use a regex like ^[^;]+; in a tReplace to replace anything up to (and including) the first ";" in the line with an empty string, and write the result to another file.
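Outside of Talend, the equivalent cleanup of the output file could also be done with a one-liner, for example (the file names and the ';' delimiter are assumptions matching the regex above):
sed 's/^[^;]*;//' out.csv > final.csv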

Named column formulas in an org mode table

I have the following emacs org mode table.
|---+-----+-----+-----+-----|
| ! | foo | bar | baz | duu |
| # |  -5 |   2 |     |     |
To calculate the values of baz and duu I have the formulas
#+TBLFM: $4=vsum($foo..$bar)::$duu=vsum($foo..$bar)
When I reevaluate all formulas with C-u C-c C-c or C-u C-c *, the value of baz is computed fine, but the value of duu remains empty. Since the only difference between the formulas for baz and duu is that the former uses a numeric column reference and the latter a named reference, I assume that one cannot use a named column reference on the left side of the assignment operator. However, I find this rather inconvenient, as I don't want to hardcode column numbers because I might need to add or remove columns in the future.
Is there a way to create a column formula that uses names for all columns involved?

How does mercurial's bisect work when the range includes branching?

If the bisect range includes multiple branches, how does hg bisect's search work? Does it effectively bisect each sub-branch (I would think that would be inefficient)?
For instance, borrowing, with gratitude, a diagram from an answer to this related question: what if the bisect got to changeset 7 on the "good" right-side branch first?
# 12:8ae1fff407c8:bad6
|
o 11:27edd4ba0a78:bad5
|
o 10:312ba3d6eb29:bad4
|\
| o 9:68ae20ea0c02:good33
| |
| o 8:916e977fa594:good32
| |
| o 7:b9d00094223f:good31
| |
o | 6:a7cab1800465:bad3
| |
o | 5:a84e45045a29:bad2
| |
o | 4:d0a381a67072:bad1
| |
o | 3:54349a6276cc:good4
|/
o 2:4588e394e325:good3
|
o 1:de79725cb39a:good2
|
o 0:2641cc78ce7a:good1
Will it then look only between 7 and 12 (thus using "dumb" numerical order), missing the real first-bad that we care about? Or is it smart enough to use the full topology and know that the first bad could be below 7 on the right-side branch, or could still be anywhere on the left-side branch?
The purpose of my question is both (a) just to understand the algorithm better, and (b) to understand whether I can liberally extend my initial bisect range without thinking hard about what branch I go to. I've been in high-branching bisect situations where it kept asking me after every test to extend beyond the next merge, so that the whole procedure was essentially O(n). I'm wondering if I can just throw the first "good" marker way back past some nest of merges without thinking about it much, and whether that would save time and give correct results.
To quote from Mercurial: The Definitive Guide:
The hg bisect command is aware of the “branchy” nature of a Mercurial
project's revision history, so it has no problems dealing with
branches, merges, or multiple heads in a repository. It can prune
entire branches of history with a single probe, which is how it
operates so efficiently.
The code that does the work is in hbisect.py and actually looks at the descendant and ancestor trees from each node where the state has been determined.
It looks to me like the changeset to test is chosen by weighting how central it is in the graph of those yet to be tested (i.e. bisecting by ancestors vs. non-ancestors, rather than by chronology):
108 x = len(a) # number of ancestors
109 y = tot - x # number of non-ancestors
110 value = min(x, y) # how good is this test?
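In practice, on the history above, a session would look something like this (revision numbers taken from the diagram):
hg bisect --reset
hg bisect --bad 12
hg bisect --good 0
Then run your test at whatever revision hg checks out and mark it with hg bisect --good or hg bisect --bad; repeat until the first bad changeset is reported.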

Alternate output format for psql showing one column per line with column name

I am using PostgreSQL 8.4 on Ubuntu. I have a table with columns c1 through cN. The columns are wide enough that selecting all columns causes a row of query results to wrap multiple times. Consequently, the output is hard to read.
When the query results constitute just a few rows, it would be convenient if I could view the query results such that each column of each row is on a separate line, e.g.
c1: <value of row 1's c1>
c2: <value of row 1's c2>
...
cN: <value of row 1's cN>
---- some kind of delimiter ----
c1: <value of row 2's c1>
etc.
I am running these queries on a server where I would prefer not to install any additional software. Is there a psql setting that will let me do something like that?
I just needed to spend more time staring at the documentation. This command:
\x on
will do exactly what I wanted. Here is some sample output:
select * from dda where u_id=24 and dda_is_deleted='f';
-[ RECORD 1 ]------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
dda_id | 1121
u_id | 24
ab_id | 10304
dda_type | CHECKING
dda_status | PENDING_VERIFICATION
dda_is_deleted | f
dda_verify_op_id | 44938
version | 2
created | 2012-03-06 21:37:50.585845
modified | 2012-03-06 21:37:50.593425
c_id |
dda_nickname |
dda_account_name |
cu_id | 1
abd_id |
See also:
man psql => \x
man psql => --expanded
man psql => \pset => expanded
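Equivalently, from inside psql you can use \pset expanded on (or, on 9.2 and later, \pset expanded auto) instead of the \x toggle.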
(New) Expanded Auto Mode: \x auto
New for PostgreSQL 9.2: psql automatically fits records to the width of the screen. Previously you only had expanded mode on or off and had to switch between the modes as necessary.
If the record can fit into the width of the screen, psql uses normal formatting.
If the record cannot fit into the width of the screen, psql uses expanded mode.
To get this use: \x auto
See the PostgreSQL 9.5 documentation on the psql command.
Wide screen, normal formatting:
id | time | humanize_time | value
----+-------+---------------------------------+-------
1 | 09:30 | Early Morning - (9.30 am) | 570
2 | 11:30 | Late Morning - (11.30 am) | 690
3 | 13:30 | Early Afternoon - (1.30pm) | 810
4 | 15:30 | Late Afternoon - (3.30 pm) | 930
(4 rows)
Narrow screen, expanded formatting:
-[ RECORD 1 ]-+---------------------------
id | 1
time | 09:30
humanize_time | Early Morning - (9.30 am)
value | 570
-[ RECORD 2 ]-+---------------------------
id | 2
time | 11:30
humanize_time | Late Morning - (11.30 am)
value | 690
-[ RECORD 3 ]-+---------------------------
id | 3
time | 13:30
humanize_time | Early Afternoon - (1.30pm)
value | 810
-[ RECORD 4 ]-+---------------------------
id | 4
time | 15:30
humanize_time | Late Afternoon - (3.30 pm)
value | 930
How to start psql with \x auto?
Configure the \x auto command to run on startup by adding it to the .psqlrc file in your home folder and restarting psql. Look under the 'Files' section in the psql documentation for more info.
~/.psqlrc
\x auto
You have so many choices, how could you be confused :-)? The main controls are:
# \pset format
# \H
# \x
# \pset pager off
Each has options and interactions with the others. The most automatic options are:
# \x off;\pset format wrapped
# \x auto
The newer "\x auto" option switches to line-by-line display only "if needed".
-[ RECORD 1 ]---------------
id | 6
description | This is a gallery of oilve oil brands.
authority | I love olive oil, and wanted to create a place for
reviews and comments on various types.
-[ RECORD 2 ]---------------
id | 19
description | XXX Test A
authority | Testing
The older "\pset format wrapped" is similar in that it tries to fit the data neatly on screen, but falls back to unaligned if the headers won't fit. Here's an example of wrapped:
id | description | authority
----+--------------------------------+---------------------------------
6 | This is a gallery of oilve | I love olive oil, and wanted to
; oil brands. ; create a place for reviews and
; ; comments on various types.
19 | Test Test A | Testing
One interesting thing is that we can view the tables horizontally, without folding, by using the PAGER environment variable; psql makes use of it. You can set
export PAGER='/usr/bin/less -S'
or just less -S if it's already available on your command line, otherwise use the proper path. -S makes less chop (rather than wrap) long lines. You can pass in any custom viewer or other options with it.
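For a single session you can also set it just for that invocation, for example (the database name is only a placeholder):
PAGER='less -S' psql -d mydb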
I've written more in Psql Horizontal Display
pspg is a simple tool that offers advanced table formatting, horizontal scrolling, search and many more features.
git clone https://github.com/okbob/pspg.git
cd pspg
./configure
make
make install
Then make sure to update the PAGER variable, e.g. in your ~/.bashrc:
export PAGER="pspg -s 6"
where -s selects the color scheme (1-14). If you're using the pgdg repositories, simply install the package (on a Debian-like distribution):
sudo apt install pspg
If, like me, you are looking for a way to do this from the psql command line, here is the syntax:
--pset expanded=auto
psql command-line options:
-P expanded=auto
--pset expanded=auto
-x
--expanded
...
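For example, a one-off query with expanded-auto output (the database and table names are just placeholders):
psql --pset expanded=auto -d mydb -c 'select * from some_table limit 10'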
Another way is the -q option (see the psql reference).
Also be sure to check out \H, which toggles HTML output on/off. Not necessarily easy to read at the console, but interesting for dumping into a file (see \o) or pasting into an editor/browser window for viewing, especially with multiple rows of relatively complex data.
You can use zenity to display the query output as an HTML table.
First, create a bash script with the following code:
#!/bin/bash
# read the query output from stdin into a temporary file, then show it with zenity
cat > '/tmp/sql.op'
zenity --text-info --html --filename='/tmp/sql.op'
Save it as mypager.sh and make it executable (chmod +x mypager.sh).
Then export the PAGER environment variable, setting the full path of the script as its value,
for example: export PAGER='/path/mypager.sh'
Then log in with psql and execute the command \H.
Finally, execute any query; the output will be displayed by zenity as an HTML table.

Easy to remember fingerprints for data?

I need to create fingerprints for RSA keys that users can memorize or at least easily recognize. The following ideas have come to mind:
Break the SHA1 hash into portions of, say 4 bits and use them as coordinates for Bezier splines. Draw the splines and use that picture as a fingerprint.
Use the SHA1 hash as input for some fractal algorithm. The result would need to be unique for a given input, i.e. the output can't be a solid square half the time.
Map the SHA1 hash to entries in a word list (as used in spell checkers or password lists). This would create a passphrase consisting of real words.
Instead of a word list, use some other large data set like Google maps (map the SHA1 hash to map coordinates and use the map region(s) as a fingerprint)
Any other ideas? I'm sure this has been implemented in one form or another.
OpenSSH contains something like that, under the name "visual host key". Try this:
ssh -o VisualHostKey=yes somesshhost
where somesshhost is some machine with an SSH server running. It will print out a "fingerprint" of the server key, both in hexadecimal and as an ASCII-art image which may look like this:
+--[ RSA 2048]----+
| .+ |
| + o |
| o o + |
| + o + |
| . o E S |
| + * . |
| X o . |
| . * o |
| .o . |
+-----------------+
Or like this:
+--[ RSA 1024]----+
| .*BB+ |
| . .++o |
| = oo. |
| . =o+.. |
| So+.. |
| ..E. |
| |
| |
| |
+-----------------+
Apparently, this is inspired by techniques described in this article. OpenSSH is open source, with a BSD-like license, so chances are you could simply reuse their code (it seems to be in the key.c file, in the function key_fingerprint_randomart()).
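If you want the picture on every connection rather than via the command-line flag, the same option can be set in ~/.ssh/config:
Host *
    VisualHostKey yes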
For item 3 (entries in a word list), see RFC-1751 - A Convention for Human-Readable 128-bit Keys, which notes that
The authors of S/Key devised a system to make the 64-bit one-time
password easy for people to enter.
Their idea was to transform the password into a string of small
English words. English words are significantly easier for people to
both remember and type. The authors of S/Key started with a
dictionary of 2048 English words, ranging in length from one to four
characters. The space covered by a 64-bit key (2^64) could be covered
by six words from this dictionary (2^66) with room remaining for
parity. For example, an S/Key one-time password of hex value:
EB33 F77E E73D 4053
would become the following six English words:
TIDE ITCH SLOW REIN RULE MOT
You could also use a compound fingerprint to improve memorability, like English words followed (or preceded) by one or more key-dependent images.
For generating the image, you could use things like Identicon, Wavatar, MonsterID, or RoboHash.
Example:
TIDE ITCH SLOW
REIN RULE MOT
I found something called random art which generates an image from a hash. There is a Python implementation available for download: http://www.random-art.org/about/
There is also a paper about using random art for authentication: http://sparrow.ece.cmu.edu/~adrian/projects/validation/validation.pdf
It's from 1999; I don't know if further research has been done on this.
Your first suggestion (draw the path of splines for every four bytes, then fill using the nonzero fill rule) is exactly what I use for visualization in hashblot.