Linear Regression with Apache Beam

How might one go about fitting a large number of linear regressions in a beam pipeline? I have a large csv, and I want to normalize every column (about 500) according to two columns A and B. That is, I would like to get standard residuals for X ~ A + B for each column in the csv X.

That's an interesting use case. You can do something like so:
import apache_beam as beam

INDEX_A = # Something
INDEX_B = # Something else

parsed_rows = (pipeline
               | beam.io.ReadFromText(my_csv)
               | beam.Map(parse_each_line))

def column_paired_rows(row):
    for idx, val in enumerate(row):
        if idx in (INDEX_A, INDEX_B): continue
        # Yield the values keyed with the independent + dependent variable indices
        yield ((INDEX_A, idx), {'independent_var_value': row[INDEX_A],
                                'independent_var_idx': INDEX_A,
                                'dependent_var_value': val,
                                'dependent_var_idx': idx})
        yield ((INDEX_B, idx), {'independent_var_value': row[INDEX_B],
                                'independent_var_idx': INDEX_B,
                                'dependent_var_value': val,
                                'dependent_var_idx': idx})

column_pairs = (parsed_rows
                | beam.FlatMap(column_paired_rows)
                | beam.GroupByKey())
The column_pairs PCollection groups all your elements by (independent variable, dependent variable) index pairs, and you can then run the analysis on each group:
def perform_linear_regression(elm):
    key = elm[0]     # The key is a tuple of (independent variable index, dependent variable index)
    values = elm[1]  # This is an iterable with the data points that you need
    pairs = [(v['independent_var_value'], v['dependent_var_value']) for v in values]
    model = linear_regression(pairs)
    return (key, model)
models = column_pairs | beam.Map(perform_linear_regression)
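For reference, here is a minimal sketch of the two helpers the snippet leaves undefined. These are my own stand-ins, not part of the original answer: parse_each_line assumes a headerless CSV of plain numbers, and linear_regression uses numpy.polyfit, but any fitting routine (scipy, statsmodels, ...) would do:
import numpy as np

def parse_each_line(line):
    # Hypothetical parser: split one CSV line into a list of floats.
    # A real pipeline would use the csv module and handle headers and missing values.
    return [float(v) for v in line.split(',')]

def linear_regression(pairs):
    # Ordinary least squares fit of dependent = slope * independent + intercept.
    xs, ys = zip(*pairs)
    slope, intercept = np.polyfit(xs, ys, 1)
    return {'slope': slope, 'intercept': intercept}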
LMK if you'd like me to add further detail

Related

updating subset of parameters in dynet

Is there a way to update a subset of parameters in dynet? For instance in the following toy example, first update h1, then h2:
model = ParameterCollection()
h1 = model.add_parameters((hidden_units, dims))
h2 = model.add_parameters((hidden_units, dims))
...
for x in trainset:
    ...
    loss.scalar_value()
    loss.backward()
    trainer.update(h1)
    renew_cg()
for x in trainset:
    ...
    loss.scalar_value()
    loss.backward()
    trainer.update(h2)
    renew_cg()
I know that the update_subset interface exists for this and works based on the given parameter indexes. But it is not documented anywhere how to get the parameter indexes in the DyNet Python API.
A solution is to use the flag update = False when creating expressions for parameters (including lookup parameters):
import dynet as dy
import numpy as np

model = dy.Model()
pW = model.add_parameters((2, 4))
pb = model.add_parameters(2)
trainer = dy.SimpleSGDTrainer(model)

def step(update_b):
    dy.renew_cg()
    x = dy.inputTensor(np.ones(4))
    W = pW.expr()
    # update b?
    b = pb.expr(update=update_b)
    loss = dy.pickneglogsoftmax(W * x + b, 0)
    loss.backward()
    trainer.update()
    # dy.renew_cg()

print(pb.as_array())
print(pW.as_array())
step(True)
print(pb.as_array())  # b updated
print(pW.as_array())
step(False)
print(pb.as_array())  # b not updated
print(pW.as_array())
For update_subset, I would guess that the indices are the integers suffixed to the parameter names (.name()). According to the doc, we are supposed to use a get_index function.
Another option is dy.nobackprop(), which prevents the gradient from propagating beyond a certain node in the graph.
Yet another option is to zero the gradient of the parameters that do not need to be updated (.scale_gradient(0)).
These methods are equivalent to zeroing the gradient before the update. So the parameter can still be updated if the optimizer uses momentum accumulated in previous training steps (MomentumSGDTrainer, AdamTrainer, ...).
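As a small illustration of the nobackprop variant, here is a minimal sketch. It reuses pW, pb and trainer from the snippet above; only dy.nobackprop is used here, since the availability and exact signature of a scale_gradient operation depends on your DyNet version:
def step_freeze_b():
    dy.renew_cg()
    x = dy.inputTensor(np.ones(4))
    W = pW.expr()
    # Wrapping b in nobackprop blocks the gradient at this node,
    # so pb receives a zero gradient for this step.
    b = dy.nobackprop(pb.expr())
    loss = dy.pickneglogsoftmax(W * x + b, 0)
    loss.scalar_value()
    loss.backward()
    trainer.update()  # pW moves, pb stays put (with plain SGD)

step_freeze_b()
print(pb.as_array())  # unchanged
print(pW.as_array())  # updated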

Where in the sequence of a Probabilistic Suffix Tree does "e" occur?

In my data, missing values (*) occur only on the right side of the sequences. That means no sequence starts with * and no sequence has any other markers after *. Despite this, the PST (Probabilistic Suffix Tree) seems to predict a 90% chance of starting with a *. Here's my code:
# Load libraries
library(RCurl)
library(TraMineR)
library(PST)
# Get data
x <- getURL("https://gist.githubusercontent.com/aronlindberg/08228977353bf6dc2edb3ec121f54a29/raw/c2539d06771317c5f4c8d3a2052a73fc485a09c6/challenge_level.csv")
data <- read.csv(text = x)
# Load and transform data
data <- read.table("thread_level.csv", sep = ",", header = F, stringsAsFactors = F)
# Create sequence object
data.seq <- seqdef(data[2:nrow(data),2:ncol(data)], missing = NA, right= NA, nr = "*")
# Make a tree
S1 <- pstree(data.seq, ymin = 0.05, L = 6, lik = TRUE, with.missing = TRUE)
# Look at first state
cmine(S1, pmin = 0, state = "N3", l = 1)
This generates:
[>] context: e
EX FA I1 I2 I3 N1 N2 N3 NR
S1 0.006821066 0.01107234 0.01218274 0.01208756 0.006821066 0.002569797 0.003299492 0.001554569 0.0161802
QU TR *
S1 0.01126269 0.006440355 0.9097081
How can the probability for * be 0.9097081 at the very beginning of the sequence, meaning after context e?
Does it mean that the context can appear anywhere inside a sequence, and that e denotes an arbitrary starting point somewhere inside a sequence?
A PST is a representation of a variable length Markov chain (VLMC). Like a classical Markov model, a VLMC is assumed to be homogeneous (or stationary), meaning that the conditional probabilities of the outcome given the context are the same at each position in the sequence. In other words, the context can appear anywhere in the sequence. In fact, the search for contexts is done by exploring the tree, which is assumed to apply anywhere in the sequences.
In your example, for l=1 (l is 1 + the length of the context), you look only for contexts of length 0, i.e., the only possible context is the empty sequence e. Your condition pmin=0, state=N3 (have a probability greater than 0 for N3) is equivalent to no condition at all, so you get the overall probability of observing each state. Because your sequences (with the missing states) are all of the same length, you would get the same results using TraMineR with
seqmeant(data.seq, with.missing=TRUE)/max(seqlength(data.seq))
To get the distribution at the first position, you can use TraMineR and look at the first column of the table of cross-sectional distributions at the successive positions returned by
seqstatd(data.seq, with.missing=TRUE)
Hope this helps.

Minizinc constraints from another array

I'm attempting my first constraint programming with MiniZinc. I'm trying to create a schedule, with n slots and n people, with a different person allocated to each slot. I'm using an array of var int to model the schedule, using alldifferent() to ensure a different person in each slot.
A separate array of size n, names contains their names, as below:
% Pseudo enum
set of int: NameIndex = 1..2;
int: Forename = 1;
int: Surname = 2;
int: n = 4; % Number of slots and people
set of int: slots = 1..n;
array[slots, NameIndex] of string: names = [| "John", "Doe"
| "Ann", "Jones"
| "Fred", "Doe"
| "Barry", "Doe" |];
% The schedule
array[slots] of var slots: schedule;
% Every person is scheduled:
include "alldifferent.mzn";
constraint alldifferent(schedule);
% How to constrain by a value from names, say alphabetic order by forename.
% Want to compare each value in schedule to the next one.
%constraint forall (i in 1..n-1) (names[schedule[i],Forename] < names[schedule[i+1],Forename]);
solve satisfy;
output [show(i) ++ ": " ++ show(names[schedule[i], Forename]) ++ " " ++ show(names[schedule[i], Surname]) ++ "\n"
| i in slots]
% should be:
% | i in schedule]
How can I constrain schedule by values from names? In my (broken) example above, when the forall constraint is uncommented, I get (using the MiniZinc IDE):
in call 'forall'
in array comprehension expression
with i = 1
in binary '<' operator expression
in array access
cannot find matching declaration
I follow the error up to the point of not understanding which declaration cannot be found. The output block show()s values from names quite happily when I index into the array with the schedule value.
What am I missing? Is there a better way to model the names? I'm hoping to extend names to additional 'attributes' of people and create additional constraints. I'm sure both the model and my forall constraint are quite naive!
The problem is that MiniZinc doesn't have much support for strings; specific to your example, there is no support for comparing strings ("<").
That said, there are some plans to support var strings (i.e. using strings as decision variables), but I don't know the exact status or when it will be released.
Here's a very simple fix but it requires some preprocessing:
1) Create a new array which includes the (ordering) index of each name:
array[slots] of int: names_ix = [4, % John Doe
1, % Ann Jones
3, % Fred Doe
2, % Barry Doe
];
2) Change the ordering constraint to use this new array
constraint
forall (i in 1..n-1) (
names_ix[schedule[i]] <= names_ix[schedule[i+1]]
);
[There's a more complex variant that requires converting each character of the names to integers (in a matrix of var int) and then requiring that the words, as collections of integers, are ordered. Note that this tends to be quite messy, for example when handling strings of different lengths, etc.]

Iterating over an array in a function

For my Google Docs spreadsheet module, I'd like a function to be able to accept an array of values and iterate over them, adding them to a hash. The spreadsheet submission form wants values in a format like this:
{"entry.0.single": value0,
"entry.1.single": value1,
"entry.2.single": value2}
If the function accepts an array like the following,
[value0, value1, value2]
is it possible to loop over them, keep a running counter, and create a hash? This would be a simple task in other languages. Python suffices for illustration:
hash = dict()
i = 0
for val in values:
    hash["entry.%s.single" % i] = val
    i += 1
Can that be done in KRL?
Recursive functions are your friend:
a = ['value0', 'value1', 'value2'];
r = function(a, h, n){
  top = a.head();
  newhash = h.put({'entry.#{n}.single': top});
  a.length() > 1 => r(a.tail(), newhash, n+1) | newhash;
};
out = r(a, {}, 0);
out has the value of {'entry.1.single': 'value1', 'entry.0.single': 'value0', 'entry.2.single': 'value2'}.
A recursive function is needed here because you are doing a structure conversion. If you wanted an array back, you could have used the map() method.
Also, watch your base case when writing recursive functions. KNS does enforce a recursion limit.

MATLAB: is fieldnames' order defined?

For the same input structure, will fieldnames always return the same cell array, even on different computers, different OS's, and different MATLAB versions? Or could it order the field names differently? E.g.:
myStructure = load('myStructure');
x = fieldnames(myStructure);
% days later, diff computer, diff OS, and diff version of MATLAB...
y = fieldnames(myStructure);
x == y %?
The documentation for fieldnames does not seem to promise that the same order is returned every time. But on the other hand, the existence of orderfields seems to imply that fieldnames predictably returns an underlying, normally unchanging order.
I believe the structure fields are kept in the order in which they were created. If you save the structure into a MAT-file and open it later with another MATLAB, the order will be kept. You can always reorder the fields with the ORDERFIELDS function. You can order them in many different ways (sort alphabetically, or use a cell array, another structure, or a permutation vector); see the documentation for more details.
By the way, field order does not affect structure comparison.
s1 = struct('a',0,'b',1)
s1 =
a: 0
b: 1
s2 = struct('b',1,'a',0)
s2 =
b: 1
a: 0
isequal(s1,s2)
ans =
1
s1=orderfields(s1,s2)
s1 =
b: 1
a: 0
UPDATE:
Here is the quote from the MATLAB documentation for structure data type under "Listing the Fields of a Structure" subtitle:
The fields appear in the order in which they were created.
Hope this answers your question.